our-claude-skills/custom-skills/90-reference-curator/commands/reference-curator-pipeline.md

---
description: Orchestrates full reference curation pipeline with configurable depth. Runs discovery > crawl > store > distill > review > export with QA loop handling.
argument-hint: <topic|urls|manifest> [--depth light|standard|deep|full] [--output ~/Documents/reference-library/] [--max-sources 100] [--max-pages 50] [--auto-approve] [--threshold 0.85] [--max-iterations 3] [--export-format project_files]
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob, Task
---

# Reference Curator Pipeline

Full-stack orchestration of the 6-skill reference curation workflow.

## Input Modes

| Mode | Input Example | Pipeline Start |
|------|---------------|----------------|
| **Topic** | `"Claude system prompts"` | reference-discovery |
| **URLs** | `https://docs.anthropic.com/...` | web-crawler (skip discovery) |
| **Manifest** | `./manifest.json` | web-crawler (resume from discovery) |

## Arguments

- `<input>`: Required. Topic string, URL(s), or manifest file path
- `--depth`: Crawl depth level (default: `standard`). See Depth Levels below
- `--output`: Output directory path (default: `~/Documents/reference-library/`)
- `--max-sources`: Maximum sources to discover (topic mode, default: 100)
- `--max-pages`: Maximum pages per source to crawl (default varies by depth)
- `--auto-approve`: Auto-approve scores above threshold
- `--threshold`: Approval threshold (default: 0.85)
- `--max-iterations`: Max QA loop iterations per document (default: 3)
- `--export-format`: Output format: `project_files`, `fine_tuning`, `jsonl` (default: project_files)
- `--include-subdomains`: Include subdomains in site mapping (default: false)
- `--follow-external`: Follow external links found in content (default: false)

## Output Directory

**IMPORTANT: Never store output in any claude-skills, `.claude/`, or skill-related directory.**

The `--output` argument sets the base path for all pipeline output. If omitted, the default is `~/Documents/reference-library/`.

### Directory Structure

```
{output}/
├── {topic-slug}/                   # One folder per pipeline run
│   ├── README.md                   # Index with table of contents
│   ├── 00-page-name.md             # Individual page files
│   ├── 01-page-name.md
│   ├── ...
│   ├── {topic-slug}-complete.md    # Combined bundle (all pages)
│   └── manifest.json               # Crawl metadata
├── pipeline_state/                 # Resume state (auto-managed)
│   └── run_XXX/state.json
└── exports/                        # Fine-tuning / JSONL exports
```

### Resolution Rules

1. If `--output` is provided, use that path exactly
2. If not provided, use `~/Documents/reference-library/`
3. **Before writing any files**, check if the output directory exists
4. If the directory does NOT exist, **ask the user for permission** before creating it:
   - Show the full resolved path that will be created
   - Wait for explicit user approval
   - Only then run `mkdir -p <path>`
5. The topic slug is derived from the input (URL domain+path or topic string)

### Examples

```bash
# Uses default: ~/Documents/reference-library/glossary-for-wordpress/
/reference-curator https://docs.codeat.co/glossary/

# Custom path: /tmp/research/mcp-docs/
/reference-curator "MCP servers" --output /tmp/research

# Explicit home subfolder: ~/Projects/client-docs/api-reference/
/reference-curator https://api.example.com/docs --output ~/Projects/client-docs
```

## Depth Levels

### `--depth light`
Fast scan for quick reference. Main content only, minimal crawling.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown"]` |
| `maxDiscoveryDepth` | 1 |
| `max-pages` default | 20 |
| Map limit | 50 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | No |
| JS rendering wait | None |

**Best for:** Quick lookups, single-page references, API docs you already know.

### `--depth standard` (default)
Balanced crawl. Main content with links for cross-referencing.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown", "links"]` |
| `maxDiscoveryDepth` | 2 |
| `max-pages` default | 50 |
| Map limit | 100 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain only |
| JS rendering wait | None |

**Best for:** Documentation sites, plugin guides, knowledge bases.

### `--depth deep`
Thorough crawl. Full page content including sidebars, nav, and related pages.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "links", "html"]` |
| `maxDiscoveryDepth` | 3 |
| `max-pages` default | 150 |
| Map limit | 300 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain + linked resources |
| JS rendering wait | `waitFor: 3000` |
| `includeSubdomains` | `true` |

**Best for:** Complete product documentation, research material, sites with sidebars/code samples hidden behind tabs.

### `--depth full`
Exhaustive crawl. Everything captured including raw HTML, screenshots, and external references.

| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "html", "rawHtml", "links"]` |
| `maxDiscoveryDepth` | 5 |
| `max-pages` default | 500 |
| Map limit | 1000 |
| `deduplicateSimilarURLs` | `false` |
| Follow sub-links | All (same-domain + external references) |
| JS rendering wait | `waitFor: 5000` |
| `includeSubdomains` | `true` |
| Screenshots | Capture for JS-heavy pages |

**Best for:** Archival, migration references, preserving sites, training data collection.

## Depth Comparison

```
light       ████░░░░░░░░░░░░  Speed: fastest   Pages: ~20    Content: main only
standard    ████████░░░░░░░░  Speed: fast       Pages: ~50    Content: main + links
deep        ████████████░░░░  Speed: moderate   Pages: ~150   Content: full page + HTML
full        ████████████████  Speed: slow       Pages: ~500   Content: everything + raw
```

## Pipeline Stages

```
1. reference-discovery  (topic mode only)
2. web-crawler              <- depth controls this stage
3. content-repository
4. content-distiller    <--------+
5. quality-reviewer              |
   +-- APPROVE -> export         |
   +-- REFACTOR -----------------+
   +-- DEEP_RESEARCH -> crawler -+
   +-- REJECT -> archive
6. markdown-exporter
```

## Crawl Execution by Depth

When executing the crawl (Stage 2), apply the depth settings to the Firecrawl tools:

### Site Mapping (`firecrawl_map`)
```
firecrawl_map:
  url: <target>
  limit: {depth.map_limit}
  includeSubdomains: {depth.includeSubdomains}
```

### Page Scraping (`firecrawl_scrape`)
```
firecrawl_scrape:
  url: <page>
  formats: {depth.formats}
  onlyMainContent: {depth.onlyMainContent}
  waitFor: {depth.waitFor}           # deep/full only
  excludeTags: ["nav", "footer"]     # light/standard only
```

### Batch Crawling (`firecrawl_crawl`) - for deep/full only
```
firecrawl_crawl:
  url: <target>
  maxDiscoveryDepth: {depth.maxDiscoveryDepth}
  limit: {depth.max_pages}
  deduplicateSimilarURLs: {depth.dedup}
  scrapeOptions:
    formats: {depth.formats}
    onlyMainContent: {depth.onlyMainContent}
    waitFor: {depth.waitFor}
```

## QA Loop Handling

| Decision | Action | Max Iterations |
|----------|--------|----------------|
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources, re-distill | 2 |
| Combined | Total loops per document | 5 |

After max iterations, document marked as `needs_manual_review`.

## Example Usage

```
# Quick scan of a single doc page
/reference-curator https://docs.example.com/api --depth light

# Standard documentation crawl (default)
/reference-curator "Glossary for WordPress" --max-sources 5

# Deep crawl capturing full page content and HTML
/reference-curator https://docs.codeat.co/glossary/ --depth deep

# Full archival crawl with all formats
/reference-curator https://docs.anthropic.com --depth full --max-pages 300

# Deep crawl with auto-approval and fine-tuning export
/reference-curator "MCP servers" --depth deep --auto-approve --export-format fine_tuning

# Resume from existing manifest
/reference-curator ./manifest.json --auto-approve
```

## State Management

Pipeline state is saved after each stage to allow resume:

**With MySQL:**
```sql
SELECT * FROM pipeline_runs WHERE run_id = 123;
```

**File-based fallback:**
```
~/Documents/reference-library/pipeline_state/run_XXX/state.json
```

## Output

Pipeline returns summary on completion:

```json
{
  "run_id": 123,
  "status": "completed",
  "depth": "standard",
  "stats": {
    "sources_discovered": 5,
    "pages_crawled": 45,
    "documents_stored": 45,
    "approved": 40,
    "refactored": 8,
    "deep_researched": 2,
    "rejected": 3,
    "needs_manual_review": 2
  },
  "exports": {
    "format": "project_files",
    "path": "~/Documents/reference-library/exports/"
  }
}
```

## See Also

- `/reference-discovery` - Run discovery stage only
- `/web-crawler` - Run crawler stage only
- `/content-repository` - Manage stored content
- `/quality-reviewer` - Run QA review only
- `/markdown-exporter` - Run export only