---
description: Orchestrates full reference curation pipeline with configurable depth. Runs discovery > crawl > store > distill > review > export with QA loop handling.
argument-hint: <topic|urls|manifest> [--depth light|standard|deep|full] [--output ~/Documents/reference-library/] [--max-sources 100] [--max-pages 50] [--auto-approve] [--threshold 0.85] [--max-iterations 3] [--export-format project_files]
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob, Task
---

# Reference Curator Pipeline

Full-stack orchestration of the 6-skill reference curation workflow.

## Input Modes

| Mode | Input Example | Pipeline Start |
|------|---------------|----------------|
| **Topic** | `"Claude system prompts"` | reference-discovery |
| **URLs** | `https://docs.anthropic.com/...` | web-crawler (skip discovery) |
| **Manifest** | `./manifest.json` | web-crawler (resume from discovery) |

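As a minimal sketch of how the three input modes could be told apart, the helper below classifies a raw `<input>` value and returns the stage the pipeline would start from. The function name and the detection heuristics (an `http(s)://` prefix for URLs, a `.json` suffix for manifests) are assumptions for illustration; the command itself does not document how it decides.

```python
import re


def detect_input_mode(raw: str) -> tuple[str, str]:
    """Return (mode, first_stage) for a raw <input> argument.

    Hypothetical heuristics: URLs start with http:// or https://,
    manifests end in .json, anything else is treated as a topic string.
    """
    value = raw.strip()
    if re.match(r"^https?://", value, re.IGNORECASE):
        return ("urls", "web-crawler")
    if value.lower().endswith(".json"):
        return ("manifest", "web-crawler")
    return ("topic", "reference-discovery")
```

Only topic mode starts at `reference-discovery`; the other two modes enter the pipeline at `web-crawler`, as in the table above.
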
## Arguments

- `<input>`: Required. Topic string, URL(s), or manifest file path
- `--depth`: Crawl depth level (default: `standard`). See Depth Levels below
- `--output`: Output directory path (default: `~/Documents/reference-library/`)
- `--max-sources`: Maximum sources to discover (topic mode, default: 100)
- `--max-pages`: Maximum pages per source to crawl (default varies by depth)
- `--auto-approve`: Auto-approve documents that score above the threshold
- `--threshold`: Approval threshold (default: 0.85)
- `--max-iterations`: Max QA loop iterations per document (default: 3)
- `--export-format`: Output format: `project_files`, `fine_tuning`, `jsonl` (default: `project_files`)
- `--include-subdomains`: Include subdomains in site mapping (default: false; the `deep` and `full` depth levels set `includeSubdomains` to `true`)
- `--follow-external`: Follow external links found in content (default: false; the `full` depth level follows external references)

## Output Directory

**IMPORTANT: Never store output in any claude-skills, `.claude/`, or skill-related directory.**

The `--output` argument sets the base path for all pipeline output. If omitted, the default is `~/Documents/reference-library/`.

### Directory Structure

```
{output}/
├── {topic-slug}/                    # One folder per pipeline run
│   ├── README.md                    # Index with table of contents
│   ├── 00-page-name.md              # Individual page files
│   ├── 01-page-name.md
│   ├── ...
│   ├── {topic-slug}-complete.md     # Combined bundle (all pages)
│   └── manifest.json                # Crawl metadata
├── pipeline_state/                  # Resume state (auto-managed)
│   └── run_XXX/state.json
└── exports/                         # Fine-tuning / JSONL exports
```

### Resolution Rules

1. If `--output` is provided, use that path exactly
2. If not provided, use `~/Documents/reference-library/`
3. **Before writing any files**, check if the output directory exists
4. If the directory does NOT exist, **ask the user for permission** before creating it:
   - Show the full resolved path that will be created
   - Wait for explicit user approval
   - Only then run `mkdir -p <path>`
5. The topic slug is derived from the input (URL domain+path or topic string)

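The rules above can be sketched roughly as follows. This is an illustration only: `slugify`, `topic_slug`, and `resolve_output` are hypothetical helpers, and the exact slug rule is not specified by the command (the examples in this document show friendlier slugs than a plain domain+path transformation would give). The function reports whether the base directory is missing, so the caller can ask for approval before running `mkdir -p`.

```python
import re
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

DEFAULT_OUTPUT = "~/Documents/reference-library/"


def slugify(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def topic_slug(raw_input: str) -> str:
    """Assumed rule: domain + path for URLs, the topic string otherwise."""
    parsed = urlparse(raw_input)
    if parsed.scheme in ("http", "https"):
        return slugify(parsed.netloc + parsed.path)
    return slugify(raw_input)


def resolve_output(output: Optional[str], raw_input: str) -> tuple[Path, bool]:
    """Return (run_directory, needs_user_approval).

    needs_user_approval is True when the base output directory does not
    exist yet. Nothing is created here; the caller must show the resolved
    path and wait for explicit approval before creating it.
    """
    base = Path(output or DEFAULT_OUTPUT).expanduser()
    return base / topic_slug(raw_input), not base.exists()
```

Keeping directory creation out of the resolver makes it harder to skip the approval step by accident.
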
### Examples

```bash
# Uses default: ~/Documents/reference-library/glossary-for-wordpress/
/reference-curator https://docs.codeat.co/glossary/

# Custom path: /tmp/research/mcp-docs/
/reference-curator "MCP servers" --output /tmp/research

# Explicit home subfolder: ~/Projects/client-docs/api-reference/
/reference-curator https://api.example.com/docs --output ~/Projects/client-docs
```

## Depth Levels

### `--depth light`
Fast scan for quick reference. Main content only, minimal crawling.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown"]` |
| `maxDiscoveryDepth` | 1 |
| `max-pages` default | 20 |
| Map limit | 50 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | No |
| JS rendering wait | None |

**Best for:** Quick lookups, single-page references, API docs you already know.

### `--depth standard` (default)
Balanced crawl. Main content with links for cross-referencing.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown", "links"]` |
| `maxDiscoveryDepth` | 2 |
| `max-pages` default | 50 |
| Map limit | 100 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain only |
| JS rendering wait | None |

**Best for:** Documentation sites, plugin guides, knowledge bases.

### `--depth deep`
Thorough crawl. Full page content including sidebars, nav, and related pages.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "links", "html"]` |
| `maxDiscoveryDepth` | 3 |
| `max-pages` default | 150 |
| Map limit | 300 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain + linked resources |
| JS rendering wait | `waitFor: 3000` |
| `includeSubdomains` | `true` |

**Best for:** Complete product documentation, research material, sites with sidebars/code samples hidden behind tabs.

### `--depth full`
Exhaustive crawl. Everything captured including raw HTML, screenshots, and external references.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "html", "rawHtml", "links"]` |
| `maxDiscoveryDepth` | 5 |
| `max-pages` default | 500 |
| Map limit | 1000 |
| `deduplicateSimilarURLs` | `false` |
| Follow sub-links | All (same-domain + external references) |
| JS rendering wait | `waitFor: 5000` |
| `includeSubdomains` | `true` |
| Screenshots | Capture for JS-heavy pages |

**Best for:** Archival, migration references, preserving sites, training data collection.

## Depth Comparison

```
light     ████░░░░░░░░░░░░  Speed: fastest   Pages: ~20   Content: main only
standard  ████████░░░░░░░░  Speed: fast      Pages: ~50   Content: main + links
deep      ████████████░░░░  Speed: moderate  Pages: ~150  Content: full page + HTML
full      ████████████████  Speed: slow      Pages: ~500  Content: everything + raw
```

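One way to hold the four depth levels in code is a preset table keyed by depth name, with `--max-pages` overriding the per-depth default. The sketch below transcribes the values from the Depth Levels tables; the snake_case keys (`max_pages`, `map_limit`), the `resolve_depth` helper, and the `False` values of `includeSubdomains` for `light` and `standard` (taken from the `--include-subdomains` default) are assumptions for illustration.

```python
# Depth presets transcribed from the Depth Levels tables.
# Key names max_pages / map_limit are hypothetical, not Firecrawl parameters.
DEPTH_PRESETS = {
    "light": {
        "onlyMainContent": True, "formats": ["markdown"],
        "maxDiscoveryDepth": 1, "max_pages": 20, "map_limit": 50,
        "deduplicateSimilarURLs": True, "waitFor": None,
        "includeSubdomains": False,  # assumed from the CLI default
    },
    "standard": {
        "onlyMainContent": True, "formats": ["markdown", "links"],
        "maxDiscoveryDepth": 2, "max_pages": 50, "map_limit": 100,
        "deduplicateSimilarURLs": True, "waitFor": None,
        "includeSubdomains": False,  # assumed from the CLI default
    },
    "deep": {
        "onlyMainContent": False, "formats": ["markdown", "links", "html"],
        "maxDiscoveryDepth": 3, "max_pages": 150, "map_limit": 300,
        "deduplicateSimilarURLs": True, "waitFor": 3000,
        "includeSubdomains": True,
    },
    "full": {
        "onlyMainContent": False,
        "formats": ["markdown", "html", "rawHtml", "links"],
        "maxDiscoveryDepth": 5, "max_pages": 500, "map_limit": 1000,
        "deduplicateSimilarURLs": False, "waitFor": 5000,
        "includeSubdomains": True,
    },
}


def resolve_depth(depth="standard", max_pages=None):
    """Return a shallow copy of the preset; an explicit max_pages wins."""
    if depth not in DEPTH_PRESETS:
        raise ValueError(f"unknown depth: {depth!r}")
    settings = dict(DEPTH_PRESETS[depth])
    if max_pages is not None:
        settings["max_pages"] = max_pages
    return settings
```

For example, `resolve_depth("full", max_pages=300)` keeps every `full` setting but caps the crawl at 300 pages.
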
## Pipeline Stages

```
1. reference-discovery (topic mode only)
2. web-crawler              <- depth controls this stage
3. content-repository
4. content-distiller    <--------+
5. quality-reviewer              |
   +-- APPROVE -> export         |
   +-- REFACTOR -----------------+
   +-- DEEP_RESEARCH -> crawler -+
   +-- REJECT -> archive
6. markdown-exporter
```

## Crawl Execution by Depth

When executing the crawl (Stage 2), apply the depth settings to the Firecrawl tools:

### Site Mapping (`firecrawl_map`)
```
firecrawl_map:
  url: <target>
  limit: {depth.map_limit}
  includeSubdomains: {depth.includeSubdomains}
```

### Page Scraping (`firecrawl_scrape`)
```
firecrawl_scrape:
  url: <page>
  formats: {depth.formats}
  onlyMainContent: {depth.onlyMainContent}
  waitFor: {depth.waitFor}          # deep/full only
  excludeTags: ["nav", "footer"]    # light/standard only
```

### Batch Crawling (`firecrawl_crawl`) - for deep/full only
```
firecrawl_crawl:
  url: <target>
  maxDiscoveryDepth: {depth.maxDiscoveryDepth}
  limit: {depth.max_pages}
  deduplicateSimilarURLs: {depth.dedup}
  scrapeOptions:
    formats: {depth.formats}
    onlyMainContent: {depth.onlyMainContent}
    waitFor: {depth.waitFor}
```

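The templates above could be filled in from a depth settings dictionary along these lines. This is a sketch that only builds plain dictionaries; it does not call Firecrawl. The function names and the `settings` keys `max_pages` and `waitFor` are assumptions chosen to mirror the placeholders in the templates.

```python
def build_scrape_options(depth_name: str, settings: dict) -> dict:
    """Build firecrawl_scrape parameters (without the url) for one depth."""
    options = {
        "formats": list(settings["formats"]),
        "onlyMainContent": settings["onlyMainContent"],
    }
    if depth_name in ("deep", "full"):
        # JS rendering wait applies to deep/full only
        options["waitFor"] = settings["waitFor"]
    else:
        # nav/footer are stripped for light/standard only
        options["excludeTags"] = ["nav", "footer"]
    return options


def build_crawl_request(url: str, depth_name: str, settings: dict) -> dict:
    """Build firecrawl_crawl parameters; batch crawling is deep/full only."""
    if depth_name not in ("deep", "full"):
        raise ValueError("batch crawling is only used for deep/full depth")
    return {
        "url": url,
        "maxDiscoveryDepth": settings["maxDiscoveryDepth"],
        "limit": settings["max_pages"],
        "deduplicateSimilarURLs": settings["deduplicateSimilarURLs"],
        "scrapeOptions": {
            "formats": list(settings["formats"]),
            "onlyMainContent": settings["onlyMainContent"],
            "waitFor": settings["waitFor"],
        },
    }
```
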
## QA Loop Handling

| Decision | Action | Max Iterations |
|----------|--------|----------------|
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources, re-distill | 2 |
| Combined | Total loops per document | 5 |

After the maximum number of iterations is reached, the document is marked as `needs_manual_review`.

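The loop limits in the table can be expressed as a small routing function. This is a sketch under assumptions: the function name, the per-document counters, and the returned stage labels are hypothetical, and the limits are hard-coded from the table rather than read from `--max-iterations`.

```python
MAX_REFACTOR = 3        # REFACTOR loops per document
MAX_DEEP_RESEARCH = 2   # DEEP_RESEARCH loops per document
MAX_TOTAL = 5           # combined loops per document


def next_action(decision: str, refactors: int, deep_researches: int) -> str:
    """Route a reviewer decision, given the loops already used by a document."""
    if decision == "APPROVE":
        return "export"
    if decision == "REJECT":
        return "archive"
    if decision not in ("REFACTOR", "DEEP_RESEARCH"):
        raise ValueError(f"unknown decision: {decision!r}")
    if refactors + deep_researches >= MAX_TOTAL:
        return "needs_manual_review"
    if decision == "REFACTOR":
        if refactors < MAX_REFACTOR:
            return "content-distiller"  # re-distill with feedback
        return "needs_manual_review"
    if deep_researches < MAX_DEEP_RESEARCH:
        return "web-crawler"  # crawl more sources, then re-distill
    return "needs_manual_review"
```

Because 3 + 2 = 5, the combined cap is only reached when both individual budgets are exhausted; it is kept as a separate check in case the limits are tuned independently.
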
## Example Usage

```
# Quick scan of a single doc page
/reference-curator https://docs.example.com/api --depth light

# Standard documentation crawl (default)
/reference-curator "Glossary for WordPress" --max-sources 5

# Deep crawl capturing full page content and HTML
/reference-curator https://docs.codeat.co/glossary/ --depth deep

# Full archival crawl with all formats
/reference-curator https://docs.anthropic.com --depth full --max-pages 300

# Deep crawl with auto-approval and fine-tuning export
/reference-curator "MCP servers" --depth deep --auto-approve --export-format fine_tuning

# Resume from existing manifest
/reference-curator ./manifest.json --auto-approve
```

## State Management

Pipeline state is saved after each stage so that an interrupted run can be resumed:

**With MySQL:**
```sql
SELECT * FROM pipeline_runs WHERE run_id = 123;
```

**File-based fallback:**
```
~/Documents/reference-library/pipeline_state/run_XXX/state.json
```

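A minimal sketch of the file-based fallback might look like the following. The helper names and the state schema (`run_id`, `last_completed_stage`, `data`) are hypothetical; only the `pipeline_state/run_XXX/state.json` layout comes from this document. The state is written to a temporary file and then renamed, so a crash mid-write is less likely to leave a truncated `state.json`.

```python
import json
from pathlib import Path


def state_path(output_dir: str, run_id: int) -> Path:
    """Location of the state file: {output}/pipeline_state/run_<id>/state.json."""
    return Path(output_dir).expanduser() / "pipeline_state" / f"run_{run_id}" / "state.json"


def save_state(output_dir: str, run_id: int, stage: str, data: dict) -> Path:
    """Record the last completed stage (hypothetical schema)."""
    path = state_path(output_dir, run_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    state = {"run_id": run_id, "last_completed_stage": stage, "data": data}
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename over the old state file
    return path


def load_state(output_dir: str, run_id: int):
    """Return the saved state dict, or None if the run has no state yet."""
    path = state_path(output_dir, run_id)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```
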
## Output

The pipeline returns a summary on completion:

```json
{
  "run_id": 123,
  "status": "completed",
  "depth": "standard",
  "stats": {
    "sources_discovered": 5,
    "pages_crawled": 45,
    "documents_stored": 45,
    "approved": 40,
    "refactored": 8,
    "deep_researched": 2,
    "rejected": 3,
    "needs_manual_review": 2
  },
  "exports": {
    "format": "project_files",
    "path": "~/Documents/reference-library/exports/"
  }
}
```

## See Also

- `/reference-discovery` - Run discovery stage only
- `/web-crawler` - Run crawler stage only
- `/content-repository` - Manage stored content
- `/quality-reviewer` - Run QA review only
- `/markdown-exporter` - Run export only