Fix SEO skills 19-34 bugs, add slash commands, enhance reference-curator (#3)

* Fix SEO skill 34 bugs, Korean labels, and transition Ahrefs refs to our-seo-agent

P0: Fix report_aggregator.py, which had a wrong SKILL_REGISTRY[33]
mapping, missing CATEGORY_WEIGHTS for 7 categories, and a `break` bug
in health score parsing that exited the loop even on parse failure.
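
The loop bug class fixed in P0 can be sketched as follows (hypothetical structure; the real report_aggregator.py differs):

```python
# Hypothetical sketch of the P0 loop bug: a bare `break` on parse
# failure aborts the whole loop and silently drops every remaining
# entry, whereas `continue` skips only the bad one.
def parse_health_scores(lines):
    scores = []
    for line in lines:
        try:
            scores.append(float(line.split(":")[1]))
        except (IndexError, ValueError):
            continue  # was `break`: one bad line discarded all later ones
    return scores
```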

P1: Remove VIEW tab references from skill 20, expand skill 32 docs,
replace Ahrefs MCP references across all 14 skills (19-28, 31-34)
with our-seo-agent CLI data source references.

P2: Fix Korean labels in executive_report.py and dashboard_generator.py,
add tenacity to base requirements, sync skill 34 base_client.py with
canonical version from skill 12.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Claude Code slash commands for SEO skills 19-34 and fix stale paths

Create 14 new slash command files for skills 19-28, 31-34 so they
appear as /seo-* commands in Claude Code. Also fix stale directory
paths in 8 existing commands (skills 12-18, 29-30) that referenced
pre-renumbering skill directories.

Update .gitignore to track .claude/commands/ while keeping other
.claude/ files ignored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add 8 slash commands, enhance reference-curator with depth/output options

- Add slash commands: ourdigital-brand-guide, notion-writer, notebooklm-agent,
  notebooklm-automation, notebooklm-studio, notebooklm-research,
  reference-curator, multi-agent-guide
- Add --depth (light/standard/deep/full) with Firecrawl parameter mapping
- Add --output with ~/Documents/reference-library/ default and user confirmation
- Increase --max-sources default from 10 to 100
- Rename /reference-curator-pipeline to /reference-curator
- Simplify web-crawler-orchestrator label to web-crawler in docs
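
The reworked flag surface could be modeled as below (illustrative only; the slash command, not Python, does the real parsing):

```python
import argparse

# Illustrative model of the reference-curator flags described above;
# names and defaults follow the commit, the parser itself is hypothetical.
parser = argparse.ArgumentParser(prog="reference-curator")
parser.add_argument("input", help="topic, URL(s), or manifest path")
parser.add_argument("--depth", choices=["light", "standard", "deep", "full"],
                    default="standard")
parser.add_argument("--output", default="~/Documents/reference-library/")
parser.add_argument("--max-sources", type=int, default=100)  # was 10
```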

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Commit 397fa2aa5d (parent 59e5c519f5)
Author: Andrew Yim
Date: 2026-02-24 14:12:57 +09:00
Committed by: GitHub
33 changed files with 1699 additions and 48 deletions


@@ -1,6 +1,6 @@
---
description: Orchestrates full reference curation pipeline with configurable depth. Runs discovery > crawl > store > distill > review > export with QA loop handling.
argument-hint: <topic|urls|manifest> [--depth light|standard|deep|full] [--output ~/Documents/reference-library/] [--max-sources 100] [--max-pages 50] [--auto-approve] [--threshold 0.85] [--max-iterations 3] [--export-format project_files]
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob, Task
---
@@ -19,28 +19,191 @@ Full-stack orchestration of the 6-skill reference curation workflow.
## Arguments
- `<input>`: Required. Topic string, URL(s), or manifest file path
- `--depth`: Crawl depth level (default: `standard`). See Depth Levels below
- `--output`: Output directory path (default: `~/Documents/reference-library/`)
- `--max-sources`: Maximum sources to discover (topic mode, default: 100)
- `--max-pages`: Maximum pages per source to crawl (default varies by depth)
- `--auto-approve`: Auto-approve scores above threshold
- `--threshold`: Approval threshold (default: 0.85)
- `--max-iterations`: Max QA loop iterations per document (default: 3)
- `--export-format`: Output format: `project_files`, `fine_tuning`, `jsonl` (default: project_files)
- `--include-subdomains`: Include subdomains in site mapping (default: false)
- `--follow-external`: Follow external links found in content (default: false)
## Output Directory
**IMPORTANT: Never store output in any claude-skills, `.claude/`, or skill-related directory.**
The `--output` argument sets the base path for all pipeline output. If omitted, the default is `~/Documents/reference-library/`.
### Directory Structure
```
{output}/
├── {topic-slug}/ # One folder per pipeline run
│ ├── README.md # Index with table of contents
│ ├── 00-page-name.md # Individual page files
│ ├── 01-page-name.md
│ ├── ...
│ ├── {topic-slug}-complete.md # Combined bundle (all pages)
│ └── manifest.json # Crawl metadata
├── pipeline_state/ # Resume state (auto-managed)
│ └── run_XXX/state.json
└── exports/ # Fine-tuning / JSONL exports
```
### Resolution Rules
1. If `--output` is provided, use that path exactly
2. If not provided, use `~/Documents/reference-library/`
3. **Before writing any files**, check if the output directory exists
4. If the directory does NOT exist, **ask the user for permission** before creating it:
- Show the full resolved path that will be created
- Wait for explicit user approval
- Only then run `mkdir -p <path>`
5. The topic slug is derived from the input (URL domain+path or topic string)
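Rules 1, 2, and 5 above can be sketched as a resolver (the exact slug format is an assumption; the spec only says it is derived from the input):

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Sketch of output resolution: explicit --output wins, otherwise the
# ~/Documents/reference-library/ default; slug derivation is assumed.
def resolve_output(output_arg, pipeline_input):
    base = (Path(output_arg).expanduser() if output_arg
            else Path.home() / "Documents" / "reference-library")
    if pipeline_input.startswith(("http://", "https://")):
        parsed = urlparse(pipeline_input)
        raw = f"{parsed.netloc}{parsed.path}"
    else:
        raw = pipeline_input
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return base / slug  # directory creation still needs user approval (rule 4)
```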
### Examples
```bash
# Uses default: ~/Documents/reference-library/glossary-for-wordpress/
/reference-curator https://docs.codeat.co/glossary/
# Custom path: /tmp/research/mcp-docs/
/reference-curator "MCP servers" --output /tmp/research
# Explicit home subfolder: ~/Projects/client-docs/api-reference/
/reference-curator https://api.example.com/docs --output ~/Projects/client-docs
```
## Depth Levels
### `--depth light`
Fast scan for quick reference. Main content only, minimal crawling.
| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown"]` |
| `maxDiscoveryDepth` | 1 |
| `max-pages` default | 20 |
| Map limit | 50 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | No |
| JS rendering wait | None |
**Best for:** Quick lookups, single-page references, API docs you already know.
### `--depth standard` (default)
Balanced crawl. Main content with links for cross-referencing.
| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `true` |
| `formats` | `["markdown", "links"]` |
| `maxDiscoveryDepth` | 2 |
| `max-pages` default | 50 |
| Map limit | 100 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain only |
| JS rendering wait | None |
**Best for:** Documentation sites, plugin guides, knowledge bases.
### `--depth deep`
Thorough crawl. Full page content including sidebars, nav, and related pages.
| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "links", "html"]` |
| `maxDiscoveryDepth` | 3 |
| `max-pages` default | 150 |
| Map limit | 300 |
| `deduplicateSimilarURLs` | `true` |
| Follow sub-links | Same-domain + linked resources |
| JS rendering wait | `waitFor: 3000` |
| `includeSubdomains` | `true` |
**Best for:** Complete product documentation, research material, sites with sidebars/code samples hidden behind tabs.
### `--depth full`
Exhaustive crawl. Everything captured including raw HTML, screenshots, and external references.
| Parameter | Value |
|-----------|-------|
| `onlyMainContent` | `false` |
| `formats` | `["markdown", "html", "rawHtml", "links"]` |
| `maxDiscoveryDepth` | 5 |
| `max-pages` default | 500 |
| Map limit | 1000 |
| `deduplicateSimilarURLs` | `false` |
| Follow sub-links | All (same-domain + external references) |
| JS rendering wait | `waitFor: 5000` |
| `includeSubdomains` | `true` |
| Screenshots | Capture for JS-heavy pages |
**Best for:** Archival, migration references, preserving sites, training data collection.
## Depth Comparison
```
light ████░░░░░░░░░░░░ Speed: fastest Pages: ~20 Content: main only
standard ████████░░░░░░░░ Speed: fast Pages: ~50 Content: main + links
deep ████████████░░░░ Speed: moderate Pages: ~150 Content: full page + HTML
full ████████████████ Speed: slow Pages: ~500 Content: everything + raw
```
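The four depth tables above collapse into a single lookup (values copied from the tables; the dictionary key names are illustrative, not a real API):

```python
# Depth presets mirroring the tables above, one entry per --depth level.
DEPTH_PRESETS = {
    "light": {"onlyMainContent": True, "formats": ["markdown"],
              "maxDiscoveryDepth": 1, "max_pages": 20, "map_limit": 50,
              "deduplicateSimilarURLs": True, "waitFor": 0,
              "includeSubdomains": False},
    "standard": {"onlyMainContent": True, "formats": ["markdown", "links"],
                 "maxDiscoveryDepth": 2, "max_pages": 50, "map_limit": 100,
                 "deduplicateSimilarURLs": True, "waitFor": 0,
                 "includeSubdomains": False},
    "deep": {"onlyMainContent": False,
             "formats": ["markdown", "links", "html"],
             "maxDiscoveryDepth": 3, "max_pages": 150, "map_limit": 300,
             "deduplicateSimilarURLs": True, "waitFor": 3000,
             "includeSubdomains": True},
    "full": {"onlyMainContent": False,
             "formats": ["markdown", "html", "rawHtml", "links"],
             "maxDiscoveryDepth": 5, "max_pages": 500, "map_limit": 1000,
             "deduplicateSimilarURLs": False, "waitFor": 5000,
             "includeSubdomains": True},
}
```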
## Pipeline Stages
```
1. reference-discovery (topic mode only)
2. web-crawler <- depth controls this stage
3. content-repository
4. content-distiller <--------+
5. quality-reviewer |
+-- APPROVE -> export |
+-- REFACTOR -----------------+
+-- DEEP_RESEARCH -> crawler -+
+-- REJECT -> archive
6. markdown-exporter
```
## Crawl Execution by Depth
When executing the crawl (Stage 2), apply the depth settings to the Firecrawl tools:
### Site Mapping (`firecrawl_map`)
```
firecrawl_map:
url: <target>
limit: {depth.map_limit}
includeSubdomains: {depth.includeSubdomains}
```
### Page Scraping (`firecrawl_scrape`)
```
firecrawl_scrape:
url: <page>
formats: {depth.formats}
onlyMainContent: {depth.onlyMainContent}
waitFor: {depth.waitFor} # deep/full only
excludeTags: ["nav", "footer"] # light/standard only
```
### Batch Crawling (`firecrawl_crawl`) - for deep/full only
```
firecrawl_crawl:
url: <target>
maxDiscoveryDepth: {depth.maxDiscoveryDepth}
limit: {depth.max_pages}
deduplicateSimilarURLs: {depth.dedup}
scrapeOptions:
formats: {depth.formats}
onlyMainContent: {depth.onlyMainContent}
waitFor: {depth.waitFor}
```
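Wiring a depth preset into the batch-crawl call above might look like this (preset keys follow the depth tables and are assumptions, not a real Firecrawl client API):

```python
# Assemble firecrawl_crawl arguments from a depth preset shaped like
# the depth tables; an explicit --max-pages overrides the preset default.
def build_crawl_args(url, preset, max_pages=None):
    args = {
        "url": url,
        "maxDiscoveryDepth": preset["maxDiscoveryDepth"],
        "limit": max_pages or preset["max_pages"],
        "deduplicateSimilarURLs": preset["deduplicateSimilarURLs"],
        "scrapeOptions": {
            "formats": preset["formats"],
            "onlyMainContent": preset["onlyMainContent"],
        },
    }
    if preset.get("waitFor"):  # deep/full only
        args["scrapeOptions"]["waitFor"] = preset["waitFor"]
    return args
```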
## QA Loop Handling
| Decision | Action | Max Iterations |
@@ -54,17 +217,23 @@ After max iterations, document marked as `needs_manual_review`.
## Example Usage
```
# Quick scan of a single doc page
/reference-curator https://docs.example.com/api --depth light
# Standard documentation crawl (default)
/reference-curator "Glossary for WordPress" --max-sources 5
# Deep crawl capturing full page content and HTML
/reference-curator https://docs.codeat.co/glossary/ --depth deep
# Full archival crawl with all formats
/reference-curator https://docs.anthropic.com --depth full --max-pages 300
# Deep crawl with auto-approval and fine-tuning export
/reference-curator "MCP servers" --depth deep --auto-approve --export-format fine_tuning
# Resume from existing manifest
/reference-curator ./manifest.json --auto-approve
```
## State Management
@@ -78,7 +247,7 @@ SELECT * FROM pipeline_runs WHERE run_id = 123;
**File-based fallback:**
```
~/Documents/reference-library/pipeline_state/run_XXX/state.json
```
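A resume check against the file-based fallback could look like this (the state.json schema is not specified here, so the `status` key is an assumption):

```python
import json
from pathlib import Path

# Find the most recent run whose state.json is not marked completed;
# returns (None, None) when nothing is resumable.
def latest_resumable_run(base="~/Documents/reference-library"):
    state_dir = Path(base).expanduser() / "pipeline_state"
    for state_file in sorted(state_dir.glob("run_*/state.json"), reverse=True):
        state = json.loads(state_file.read_text())
        if state.get("status") != "completed":
            return state_file, state
    return None, None
```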
## Output
@@ -89,6 +258,7 @@ Pipeline returns summary on completion:
{
"run_id": 123,
"status": "completed",
"depth": "standard",
"stats": {
"sources_discovered": 5,
"pages_crawled": 45,
@@ -101,7 +271,7 @@ Pipeline returns summary on completion:
},
"exports": {
"format": "project_files",
"path": "~/reference-library/exports/"
"path": "~/Documents/reference-library/exports/"
}
}
```
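A consumer of this summary can render a one-line status from it (field names taken from the sample payload above; the formatting itself is illustrative):

```python
# Condense the pipeline completion summary into a single status line,
# falling back to the default depth when the field is absent.
def summarize(run_summary):
    stats = run_summary["stats"]
    return (f"run {run_summary['run_id']} [{run_summary['status']}] "
            f"depth={run_summary.get('depth', 'standard')}: "
            f"{stats['pages_crawled']} pages crawled")
```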