Andrew Yim 397fa2aa5d Fix SEO skills 19-34 bugs, add slash commands, enhance reference-curator (#3)
* Fix SEO skill 34 bugs, Korean labels, and transition Ahrefs refs to our-seo-agent

P0: Fix report_aggregator.py — wrong SKILL_REGISTRY[33] mapping, missing
CATEGORY_WEIGHTS for 7 categories, and break bug in health score parsing
that exited loop even on parse failure.

P1: Remove VIEW tab references from skill 20, expand skill 32 docs,
replace Ahrefs MCP references across all 16 skills (19-28, 31-34)
with our-seo-agent CLI data source references.

P2: Fix Korean labels in executive_report.py and dashboard_generator.py,
add tenacity to base requirements, sync skill 34 base_client.py with
canonical version from skill 12.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Claude Code slash commands for SEO skills 19-34 and fix stale paths

Create 14 new slash command files for skills 19-28, 31-34 so they
appear as /seo-* commands in Claude Code. Also fix stale directory
paths in 8 existing commands (skills 12-18, 29-30) that referenced
pre-renumbering skill directories.

Update .gitignore to track .claude/commands/ while keeping other
.claude/ files ignored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add 8 slash commands, enhance reference-curator with depth/output options

- Add slash commands: ourdigital-brand-guide, notion-writer, notebooklm-agent,
  notebooklm-automation, notebooklm-studio, notebooklm-research,
  reference-curator, multi-agent-guide
- Add --depth (light/standard/deep/full) with Firecrawl parameter mapping
- Add --output with ~/Documents/reference-library/ default and user confirmation
- Increase --max-sources default from 10 to 100
- Rename /reference-curator-pipeline to /reference-curator
- Simplify web-crawler-orchestrator label to web-crawler in docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 14:12:57 +09:00


---
description: Orchestrates full reference curation pipeline with configurable depth. Runs discovery > crawl > store > distill > review > export with QA loop handling.
argument-hint: "<topic|urls|manifest> [--depth light|standard|deep|full] [--output ~/Documents/reference-library/] [--max-sources 100] [--max-pages 50] [--auto-approve] [--threshold 0.85] [--max-iterations 3] [--export-format project_files]"
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob, Task
---

Reference Curator Pipeline

Full-stack orchestration of the 6-skill reference curation workflow.

Input Modes

| Mode | Input Example | Pipeline Start |
|------|---------------|----------------|
| Topic | "Claude system prompts" | reference-discovery |
| URLs | https://docs.anthropic.com/... | web-crawler (skip discovery) |
| Manifest | ./manifest.json | web-crawler (resume from discovery) |
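
The dispatch between these three modes can be sketched as follows. This is a hypothetical helper, not the command's actual implementation; the function name and classification rules are illustrative assumptions based on the table above.

```python
import re

def detect_input_mode(raw: str) -> str:
    """Classify the <input> argument into one of the three pipeline modes.

    Hypothetical sketch: the real command's dispatch logic may differ.
    """
    token = raw.strip()
    if re.match(r"https?://", token):
        return "urls"      # start at web-crawler, skipping discovery
    if token.endswith(".json"):
        return "manifest"  # resume from an existing crawl manifest
    return "topic"         # free-text topic -> reference-discovery
```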

Arguments

  • <input>: Required. Topic string, URL(s), or manifest file path
  • --depth: Crawl depth level (default: standard). See Depth Levels below
  • --output: Output directory path (default: ~/Documents/reference-library/)
  • --max-sources: Maximum sources to discover (topic mode, default: 100)
  • --max-pages: Maximum pages per source to crawl (default varies by depth)
  • --auto-approve: Auto-approve scores above threshold
  • --threshold: Approval threshold (default: 0.85)
  • --max-iterations: Max QA loop iterations per document (default: 3)
  • --export-format: Output format: project_files, fine_tuning, jsonl (default: project_files)
  • --include-subdomains: Include subdomains in site mapping (default: false)
  • --follow-external: Follow external links found in content (default: false)

Output Directory

IMPORTANT: Never store output in any claude-skills, .claude/, or skill-related directory.

The --output argument sets the base path for all pipeline output. If omitted, the default is ~/Documents/reference-library/.

Directory Structure

{output}/
├── {topic-slug}/                   # One folder per pipeline run
│   ├── README.md                   # Index with table of contents
│   ├── 00-page-name.md             # Individual page files
│   ├── 01-page-name.md
│   ├── ...
│   ├── {topic-slug}-complete.md    # Combined bundle (all pages)
│   └── manifest.json               # Crawl metadata
├── pipeline_state/                 # Resume state (auto-managed)
│   └── run_XXX/state.json
└── exports/                        # Fine-tuning / JSONL exports

Resolution Rules

  1. If --output is provided, use that path exactly
  2. If not provided, use ~/Documents/reference-library/
  3. Before writing any files, check if the output directory exists
  4. If the directory does NOT exist, ask the user for permission before creating it:
    • Show the full resolved path that will be created
    • Wait for explicit user approval
    • Only then run mkdir -p <path>
  5. The topic slug is derived from the input (URL domain+path or topic string)
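
Rules 1, 2, and 5 above can be sketched in code. The function names and the exact slugging rule are assumptions for illustration (the interactive confirmation in rules 3-4 is deliberately left out, since it is handled in conversation, not in code):

```python
import os
import re

DEFAULT_OUTPUT = os.path.expanduser("~/Documents/reference-library/")

def topic_slug(source: str) -> str:
    """Derive a folder slug from a URL (domain+path) or a topic string.

    Hypothetical rule: lowercase, strip the scheme, collapse every
    non-alphanumeric run into a single hyphen.
    """
    s = re.sub(r"^https?://", "", source.lower())
    s = re.sub(r"[^a-z0-9]+", "-", s)
    return s.strip("-")

def resolve_output(output_arg=None):
    """Rules 1-2: an explicit --output is used exactly; otherwise the default."""
    return os.path.expanduser(output_arg) if output_arg else DEFAULT_OUTPUT
```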

Examples

# Uses default: ~/Documents/reference-library/glossary-for-wordpress/
/reference-curator https://docs.codeat.co/glossary/

# Custom path: /tmp/research/mcp-docs/
/reference-curator "MCP servers" --output /tmp/research

# Explicit home subfolder: ~/Projects/client-docs/api-reference/
/reference-curator https://api.example.com/docs --output ~/Projects/client-docs

Depth Levels

--depth light

Fast scan for quick reference. Main content only, minimal crawling.

| Parameter | Value |
|-----------|-------|
| onlyMainContent | true |
| formats | ["markdown"] |
| maxDiscoveryDepth | 1 |
| max-pages default | 20 |
| Map limit | 50 |
| deduplicateSimilarURLs | true |
| Follow sub-links | No |
| JS rendering wait | None |

Best for: Quick lookups, single-page references, API docs you already know.

--depth standard (default)

Balanced crawl. Main content with links for cross-referencing.

| Parameter | Value |
|-----------|-------|
| onlyMainContent | true |
| formats | ["markdown", "links"] |
| maxDiscoveryDepth | 2 |
| max-pages default | 50 |
| Map limit | 100 |
| deduplicateSimilarURLs | true |
| Follow sub-links | Same-domain only |
| JS rendering wait | None |

Best for: Documentation sites, plugin guides, knowledge bases.

--depth deep

Thorough crawl. Full page content including sidebars, nav, and related pages.

| Parameter | Value |
|-----------|-------|
| onlyMainContent | false |
| formats | ["markdown", "links", "html"] |
| maxDiscoveryDepth | 3 |
| max-pages default | 150 |
| Map limit | 300 |
| deduplicateSimilarURLs | true |
| Follow sub-links | Same-domain + linked resources |
| JS rendering wait | waitFor: 3000 |
| includeSubdomains | true |

Best for: Complete product documentation, research material, sites with sidebars/code samples hidden behind tabs.

--depth full

Exhaustive crawl. Captures everything, including raw HTML, screenshots, and external references.

| Parameter | Value |
|-----------|-------|
| onlyMainContent | false |
| formats | ["markdown", "html", "rawHtml", "links"] |
| maxDiscoveryDepth | 5 |
| max-pages default | 500 |
| Map limit | 1000 |
| deduplicateSimilarURLs | false |
| Follow sub-links | All (same-domain + external references) |
| JS rendering wait | waitFor: 5000 |
| includeSubdomains | true |
| Screenshots | Capture for JS-heavy pages |

Best for: Archival, migration references, preserving sites, training data collection.

Depth Comparison

light       ████░░░░░░░░░░░░  Speed: fastest    Pages: ~20    Content: main only
standard    ████████░░░░░░░░  Speed: fast       Pages: ~50    Content: main + links
deep        ████████████░░░░  Speed: moderate   Pages: ~150   Content: full page + HTML
full        ████████████████  Speed: slow       Pages: ~500   Content: everything + raw
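
The four depth levels described above amount to a static mapping from the --depth flag to Firecrawl parameters. A sketch of that mapping as data, mirroring the tables (the dictionary name and snake_case keys like max_pages and map_limit are illustrative assumptions, not an actual API):

```python
# Hypothetical preset table: each --depth level resolves to the Firecrawl
# parameters listed in the tables above.
DEPTH_PRESETS = {
    "light": {
        "onlyMainContent": True, "formats": ["markdown"],
        "maxDiscoveryDepth": 1, "max_pages": 20, "map_limit": 50,
        "deduplicateSimilarURLs": True, "waitFor": None,
    },
    "standard": {
        "onlyMainContent": True, "formats": ["markdown", "links"],
        "maxDiscoveryDepth": 2, "max_pages": 50, "map_limit": 100,
        "deduplicateSimilarURLs": True, "waitFor": None,
    },
    "deep": {
        "onlyMainContent": False, "formats": ["markdown", "links", "html"],
        "maxDiscoveryDepth": 3, "max_pages": 150, "map_limit": 300,
        "deduplicateSimilarURLs": True, "waitFor": 3000,
        "includeSubdomains": True,
    },
    "full": {
        "onlyMainContent": False,
        "formats": ["markdown", "html", "rawHtml", "links"],
        "maxDiscoveryDepth": 5, "max_pages": 500, "map_limit": 1000,
        "deduplicateSimilarURLs": False, "waitFor": 5000,
        "includeSubdomains": True,
    },
}
```

A --max-pages argument, when given, would simply override the preset's max_pages value.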

Pipeline Stages

1. reference-discovery  (topic mode only)
2. web-crawler              <- depth controls this stage
3. content-repository
4. content-distiller    <--------+
5. quality-reviewer              |
   +-- APPROVE -> export         |
   +-- REFACTOR -----------------+
   +-- DEEP_RESEARCH -> crawler -+
   +-- REJECT -> archive
6. markdown-exporter

Crawl Execution by Depth

When executing the crawl (Stage 2), apply the depth settings to the Firecrawl tools:

Site Mapping (firecrawl_map)

firecrawl_map:
  url: <target>
  limit: {depth.map_limit}
  includeSubdomains: {depth.includeSubdomains}

Page Scraping (firecrawl_scrape)

firecrawl_scrape:
  url: <page>
  formats: {depth.formats}
  onlyMainContent: {depth.onlyMainContent}
  waitFor: {depth.waitFor}           # deep/full only
  excludeTags: ["nav", "footer"]     # light/standard only

Batch Crawling (firecrawl_crawl) - for deep/full only

firecrawl_crawl:
  url: <target>
  maxDiscoveryDepth: {depth.maxDiscoveryDepth}
  limit: {depth.max_pages}
  deduplicateSimilarURLs: {depth.dedup}
  scrapeOptions:
    formats: {depth.formats}
    onlyMainContent: {depth.onlyMainContent}
    waitFor: {depth.waitFor}

QA Loop Handling

| Decision | Action | Max Iterations |
|----------|--------|----------------|
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources, re-distill | 2 |
| Combined | Total loops per document | 5 |

After max iterations, the document is marked as needs_manual_review.
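
The iteration caps above can be sketched as a loop. This is a hypothetical outline: distill and review stand in for the content-distiller and quality-reviewer stages, and the crawl step of DEEP_RESEARCH is elided.

```python
def qa_loop(document, distill, review,
            max_refactor=3, max_research=2, max_total=5):
    """Run the QA loop per the limits table (hypothetical sketch).

    distill(document, feedback) -> document
    review(document) -> (decision, feedback)
    """
    refactors = researches = 0
    for _ in range(max_total):
        decision, feedback = review(document)
        if decision == "APPROVE":
            return document, "approved"
        if decision == "REJECT":
            return document, "rejected"
        if decision == "REFACTOR":
            refactors += 1
            if refactors > max_refactor:
                break
            document = distill(document, feedback)
        elif decision == "DEEP_RESEARCH":
            researches += 1
            if researches > max_research:
                break
            # crawl more sources first, then re-distill (crawl elided here)
            document = distill(document, feedback)
    return document, "needs_manual_review"
```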

Example Usage

# Quick scan of a single doc page
/reference-curator https://docs.example.com/api --depth light

# Standard documentation crawl (default)
/reference-curator "Glossary for WordPress" --max-sources 5

# Deep crawl capturing full page content and HTML
/reference-curator https://docs.codeat.co/glossary/ --depth deep

# Full archival crawl with all formats
/reference-curator https://docs.anthropic.com --depth full --max-pages 300

# Deep crawl with auto-approval and fine-tuning export
/reference-curator "MCP servers" --depth deep --auto-approve --export-format fine_tuning

# Resume from existing manifest
/reference-curator ./manifest.json --auto-approve

State Management

Pipeline state is saved after each stage to allow resume:

With MySQL:

SELECT * FROM pipeline_runs WHERE run_id = 123;

File-based fallback:

~/Documents/reference-library/pipeline_state/run_XXX/state.json

Output

The pipeline returns a summary on completion:

{
  "run_id": 123,
  "status": "completed",
  "depth": "standard",
  "stats": {
    "sources_discovered": 5,
    "pages_crawled": 45,
    "documents_stored": 45,
    "approved": 40,
    "refactored": 8,
    "deep_researched": 2,
    "rejected": 3,
    "needs_manual_review": 2
  },
  "exports": {
    "format": "project_files",
    "path": "~/Documents/reference-library/exports/"
  }
}

See Also

  • /reference-discovery - Run discovery stage only
  • /web-crawler - Run crawler stage only
  • /content-repository - Manage stored content
  • /quality-reviewer - Run QA review only
  • /markdown-exporter - Run export only