---
description: Orchestrates full reference curation pipeline with configurable depth. Runs discovery > crawl > store > distill > review > export with QA loop handling.
argument-hint: [--depth light|standard|deep|full] [--output ~/Documents/reference-library/] [--max-sources 100] [--max-pages 50] [--auto-approve] [--threshold 0.85] [--max-iterations 3] [--export-format project_files]
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob, Task
---

# Reference Curator Pipeline

Full-stack orchestration of the 6-skill reference curation workflow.

## Input Modes

| Mode | Input Example | Pipeline Start |
|------|---------------|----------------|
| **Topic** | `"Claude system prompts"` | reference-discovery |
| **URLs** | `https://docs.anthropic.com/...` | web-crawler (skip discovery) |
| **Manifest** | `./manifest.json` | web-crawler (resume from discovery) |

## Arguments

- `<input>`: Required. Topic string, URL(s), or manifest file path
- `--depth`: Crawl depth level (default: `standard`). See Depth Levels below
- `--output`: Output directory path (default: `~/Documents/reference-library/`)
- `--max-sources`: Maximum sources to discover (topic mode, default: 100)
- `--max-pages`: Maximum pages per source to crawl (default varies by depth)
- `--auto-approve`: Auto-approve documents scoring above the threshold
- `--threshold`: Approval threshold (default: 0.85)
- `--max-iterations`: Max QA loop iterations per document (default: 3)
- `--export-format`: Output format: `project_files`, `fine_tuning`, `jsonl` (default: `project_files`)
- `--include-subdomains`: Include subdomains in site mapping (default: false)
- `--follow-external`: Follow external links found in content (default: false)

## Output Directory

**IMPORTANT: Never store output in any claude-skills, `.claude/`, or skill-related directory.**

The `--output` argument sets the base path for all pipeline output. If omitted, the default is `~/Documents/reference-library/`.
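As a minimal sketch of the base-path rule above, resolution might look like the following. The function name and the exact guard patterns are illustrative assumptions, not the pipeline's actual implementation:

```shell
# Sketch only: resolve the output base path per the rules above.
# resolve_output and the guard patterns are illustrative assumptions.
resolve_output() {
  local out="${1:-$HOME/Documents/reference-library}"
  # Refuse skill-related directories, per the IMPORTANT note above.
  case "$out" in
    *claude-skills*|*/.claude|*/.claude/*)
      echo "refusing skill-related output directory: $out" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$out"
}
```

When no argument is given, the default library path is returned; a path inside `.claude/` or a claude-skills checkout is rejected before any files are written.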
### Directory Structure

```
{output}/
├── {topic-slug}/                 # One folder per pipeline run
│   ├── README.md                 # Index with table of contents
│   ├── 00-page-name.md           # Individual page files
│   ├── 01-page-name.md
│   ├── ...
│   ├── {topic-slug}-complete.md  # Combined bundle (all pages)
│   └── manifest.json             # Crawl metadata
├── pipeline_state/               # Resume state (auto-managed)
│   └── run_XXX/state.json
└── exports/                      # Fine-tuning / JSONL exports
```

### Resolution Rules

1. If `--output` is provided, use that path exactly
2. If not provided, use `~/Documents/reference-library/`
3. **Before writing any files**, check whether the output directory exists
4. If the directory does NOT exist, **ask the user for permission** before creating it:
   - Show the full resolved path that will be created
   - Wait for explicit user approval
   - Only then run `mkdir -p <path>`
5. The topic slug is derived from the input (URL domain + path, or the topic string)

### Examples

```bash
# Uses default: ~/Documents/reference-library/glossary-for-wordpress/
/reference-curator https://docs.codeat.co/glossary/

# Custom path: /tmp/research/mcp-docs/
/reference-curator "MCP servers" --output /tmp/research

# Explicit home subfolder: ~/Projects/client-docs/api-reference/
/reference-curator https://api.example.com/docs --output ~/Projects/client-docs
```

## Depth Levels

### `--depth light`

Fast scan for quick reference. Main content only, minimal crawling.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown"]` |
| `maxDiscoveryDepth` | 1 |
| `max-pages` default | 20 |
| Map limit | 50 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | No |
| JS rendering wait | None |

**Best for:** Quick lookups, single-page references, API docs you already know.

### `--depth standard` (default)

Balanced crawl. Main content with links for cross-referencing.
| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown", "links"]` |
| `maxDiscoveryDepth` | 2 |
| `max-pages` default | 50 |
| Map limit | 100 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain only |
| JS rendering wait | None |

**Best for:** Documentation sites, plugin guides, knowledge bases.

### `--depth deep`

Thorough crawl. Full page content including sidebars, nav, and related pages.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "links", "html"]` |
| `maxDiscoveryDepth` | 3 |
| `max-pages` default | 150 |
| Map limit | 300 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain + linked resources |
| JS rendering wait | `waitFor: 3000` |
| `includeSubdomains` | `true` |

**Best for:** Complete product documentation, research material, sites with sidebars or code samples hidden behind tabs.

### `--depth full`

Exhaustive crawl. Everything captured, including raw HTML, screenshots, and external references.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "html", "rawHtml", "links"]` |
| `maxDiscoveryDepth` | 5 |
| `max-pages` default | 500 |
| Map limit | 1000 |
| `deduplicateSimilarURLs` | `false` |
| Follow sub-links | All (same-domain + external references) |
| JS rendering wait | `waitFor: 5000` |
| `includeSubdomains` | `true` |
| Screenshots | Capture for JS-heavy pages |

**Best for:** Archival, migration references, site preservation, training-data collection.

## Depth Comparison

```
light    ████░░░░░░░░░░░░  Speed: fastest   Pages: ~20   Content: main only
standard ████████░░░░░░░░  Speed: fast      Pages: ~50   Content: main + links
deep     ████████████░░░░  Speed: moderate  Pages: ~150  Content: full page + HTML
full     ████████████████  Speed: slow      Pages: ~500  Content: everything + raw
```

## Pipeline Stages

```
1. reference-discovery (topic mode only)
2.
   web-crawler                <- depth controls this stage
3. content-repository
4. content-distiller  <--------+
5. quality-reviewer            |
   +-- APPROVE -> export       |
   +-- REFACTOR ---------------+
   +-- DEEP_RESEARCH -> crawler+
   +-- REJECT -> archive
6. markdown-exporter
```

## Crawl Execution by Depth

When executing the crawl (Stage 2), apply the depth settings to the Firecrawl tools:

### Site Mapping (`firecrawl_map`)

```
firecrawl_map:
  url: <url>
  limit: {depth.map_limit}
  includeSubdomains: {depth.includeSubdomains}
```

### Page Scraping (`firecrawl_scrape`)

```
firecrawl_scrape:
  url: <url>
  formats: {depth.formats}
  onlyMainContent: {depth.onlyMainContent}
  waitFor: {depth.waitFor}          # deep/full only
  excludeTags: ["nav", "footer"]    # light/standard only
```

### Batch Crawling (`firecrawl_crawl`) - deep/full only

```
firecrawl_crawl:
  url: <url>
  maxDiscoveryDepth: {depth.maxDiscoveryDepth}
  limit: {depth.max_pages}
  deduplicateSimilarURLs: {depth.dedup}
  scrapeOptions:
    formats: {depth.formats}
    onlyMainContent: {depth.onlyMainContent}
    waitFor: {depth.waitFor}
```

## QA Loop Handling

| Decision | Action | Max Iterations |
|----------|--------|----------------|
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources, re-distill | 2 |
| Combined | Total loops per document | 5 |

After max iterations, the document is marked as `needs_manual_review`.
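The iteration caps above can be sketched as a small decision helper. This is an illustrative sketch only; the function name, argument shape, and returned action strings are assumptions, not the pipeline's real interface:

```shell
# Sketch of the QA loop limits: REFACTOR <= 3, DEEP_RESEARCH <= 2,
# combined <= 5 per document. Names and return values are illustrative.
qa_next_action() {
  local decision="$1" refactors="$2" researches="$3"
  local total=$((refactors + researches))
  if [ "$total" -ge 5 ]; then
    echo needs_manual_review; return
  fi
  case "$decision" in
    APPROVE) echo export ;;
    REJECT)  echo archive ;;
    REFACTOR)
      if [ "$refactors" -ge 3 ]; then echo needs_manual_review
      else echo re_distill; fi ;;
    DEEP_RESEARCH)
      if [ "$researches" -ge 2 ]; then echo needs_manual_review
      else echo crawl_more; fi ;;
  esac
}
```

For example, a second REFACTOR verdict still loops back to distillation, while a third (or any combination reaching five total loops) falls through to manual review.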
## Example Usage

```bash
# Quick scan of a single doc page
/reference-curator https://docs.example.com/api --depth light

# Standard documentation crawl (default)
/reference-curator "Glossary for WordPress" --max-sources 5

# Deep crawl capturing full page content and HTML
/reference-curator https://docs.codeat.co/glossary/ --depth deep

# Full archival crawl with all formats
/reference-curator https://docs.anthropic.com --depth full --max-pages 300

# Deep crawl with auto-approval and fine-tuning export
/reference-curator "MCP servers" --depth deep --auto-approve --export-format fine_tuning

# Resume from an existing manifest
/reference-curator ./manifest.json --auto-approve
```

## State Management

Pipeline state is saved after each stage so a run can be resumed.

**With MySQL:**

```sql
SELECT * FROM pipeline_runs WHERE run_id = 123;
```

**File-based fallback:**

```
~/Documents/reference-library/pipeline_state/run_XXX/state.json
```

## Output

The pipeline returns a summary on completion:

```json
{
  "run_id": 123,
  "status": "completed",
  "depth": "standard",
  "stats": {
    "sources_discovered": 5,
    "pages_crawled": 45,
    "documents_stored": 45,
    "approved": 40,
    "refactored": 8,
    "deep_researched": 2,
    "rejected": 3,
    "needs_manual_review": 2
  },
  "exports": {
    "format": "project_files",
    "path": "~/Documents/reference-library/exports/"
  }
}
```

## See Also

- `/reference-discovery` - Run discovery stage only
- `/web-crawler` - Run crawler stage only
- `/content-repository` - Manage stored content
- `/quality-reviewer` - Run QA review only
- `/markdown-exporter` - Run export only
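The file-based fallback under State Management can be probed before starting a run. A minimal sketch, assuming the `pipeline_state/run_XXX/state.json` layout shown above (the helper name is illustrative and the `state.json` schema is managed by the pipeline itself):

```shell
# Sketch only: find the latest resumable run in the file-based state store.
# find_resumable_run is an illustrative helper, not part of the pipeline.
find_resumable_run() {
  local base="${1:-$HOME/Documents/reference-library/pipeline_state}"
  # Latest run directory that has a saved state.json, if any (lexical order,
  # so zero-padded run IDs sort chronologically).
  ls -d "$base"/run_*/state.json 2>/dev/null | tail -n 1
}
```

If the function prints a path, that run can be resumed; empty output means there is no saved state and the pipeline starts fresh.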