Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00


Web Crawler Orchestrator

Orchestrates web crawling with intelligent backend selection, automatically choosing the best crawler for the target site's characteristics.

Trigger Keywords

"crawl URLs", "fetch documents", "scrape pages", "download references"

Intelligent Crawler Selection

Claude automatically selects the optimal crawler based on the request:

Crawler             Best For            Auto-Selected When
Node.js (default)   Small docs sites    ≤50 pages, static content
Python aiohttp      Technical docs      ≤200 pages, needs SEO data
Scrapy              Enterprise crawls   >200 pages, multi-domain
Firecrawl MCP       Dynamic sites       SPAs, JS-rendered content

Decision Flow

[Crawl Request]
      │
      ├─ Is it SPA/React/Vue/Angular? → Firecrawl MCP
      │
      ├─ >200 pages or multi-domain? → Scrapy
      │
      ├─ Needs SEO extraction? → Python aiohttp
      │
      └─ Default (small site) → Node.js
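
The decision flow above can be sketched as a small selection function. This is a hypothetical sketch; the actual logic lives in scripts/select_crawler.py and may differ:

```python
def select_crawler(is_spa: bool, page_count: int,
                   multi_domain: bool, needs_seo: bool) -> str:
    """Pick a crawler backend following the decision flow above."""
    if is_spa:
        return "firecrawl"   # JS-rendered content needs headless fetching
    if page_count > 200 or multi_domain:
        return "scrapy"      # enterprise-scale crawls
    if needs_seo:
        return "aiohttp"     # async crawl with SEO extraction
    return "nodejs"          # default for small static sites
```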

Crawler Backends

Node.js (Default)

Fast, lightweight crawler for small documentation sites.

cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50

Python aiohttp

Async crawler with full SEO extraction.

cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100

Scrapy

Enterprise-grade crawler with pipelines.

cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500

Firecrawl MCP

Use MCP tools for JavaScript-heavy sites:

firecrawl_scrape(url, formats=["markdown"], only_main_content=true)
firecrawl_crawl(url, max_depth=2, limit=50)
firecrawl_map(url, limit=100)  # Discover URLs first

Workflow

Step 1: Analyze Target Site

Determine site characteristics:

  • Is it a SPA? (React, Vue, Angular, Next.js)
  • How many pages expected?
  • Does it need JavaScript rendering?
  • Is SEO data extraction needed?

Step 2: Select Crawler

Based on analysis, select the appropriate backend.

Step 3: Load URL Manifest

# From reference-discovery output
jq -r '.urls[].url' manifest.json
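
The same step in Python, assuming the manifest has a top-level "urls" array of objects with a "url" field (the function name is illustrative):

```python
import json
from pathlib import Path

def load_url_manifest(path: str) -> list[str]:
    """Return the crawl URLs from a reference-discovery manifest."""
    manifest = json.loads(Path(path).read_text())
    return [entry["url"] for entry in manifest["urls"]]
```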

Step 4: Execute Crawl

For Node.js:

cd ~/Project/our-seo-agent/util/js-crawler
while read -r url; do
  node src/crawler.js "$url" --max-pages 50
  sleep 2
done < urls.txt

For Firecrawl MCP (Claude Desktop/Code): Use the firecrawl MCP tools directly in conversation.

Step 5: Save Raw Content

~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4.md
    └── b2c3d4e5.md
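
A hypothetical sketch of the save step, assuming the 8-character filenames above are content IDs derived from the URL (the real skill may derive them differently):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def save_raw(markdown: str, url: str,
             library: str = "~/reference-library/raw") -> Path:
    """Store crawled markdown under raw/YYYY/MM/<8-char id>.md."""
    now = datetime.now(timezone.utc)
    doc_id = hashlib.sha256(url.encode()).hexdigest()[:8]
    out_dir = Path(library).expanduser() / f"{now:%Y}" / f"{now:%m}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{doc_id}.md"
    out_path.write_text(markdown)
    return out_path
```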

Step 6: Generate Crawl Manifest

{
  "crawl_date": "2025-01-28T12:00:00",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "total_failed": 5,
  "documents": [...]
}
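
Assembling that manifest can be sketched as follows; field names follow the example above, while the per-document `status` key is an assumption for illustration:

```python
import json
from datetime import datetime

def build_crawl_manifest(crawler: str, documents: list[dict]) -> dict:
    """Summarize a crawl run in the manifest shape shown above."""
    ok = [d for d in documents if d.get("status") == "ok"]
    return {
        "crawl_date": datetime.now().isoformat(timespec="seconds"),
        "crawler_used": crawler,
        "total_crawled": len(ok),
        "total_failed": len(documents) - len(ok),
        "documents": documents,
    }
```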

Rate Limiting

All crawlers respect these limits:

  • 20 requests/minute
  • 3 concurrent requests
  • Exponential backoff on 429/5xx
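
The first two limits can be sketched with asyncio primitives (20 request starts per minute, at most 3 in flight); the `Throttle` name is illustrative, not part of the skill:

```python
import asyncio
import time

class Throttle:
    """Cap request starts at a fixed rate with bounded concurrency."""

    def __init__(self, per_minute: int = 20, concurrent: int = 3):
        self._interval = 60 / per_minute
        self._sem = asyncio.Semaphore(concurrent)
        self._lock = asyncio.Lock()
        self._next = 0.0  # earliest allowed start time (monotonic)

    async def run(self, coro_fn, *args):
        async with self._sem:            # at most `concurrent` in flight
            async with self._lock:       # space starts `interval` apart
                now = time.monotonic()
                wait = self._next - now
                self._next = max(now, self._next) + self._interval
            if wait > 0:
                await asyncio.sleep(wait)
            return await coro_fn(*args)
```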

Error Handling

Error                  Action
Timeout                Retry once with 2x timeout
Rate limit (429)       Exponential backoff, max 3 retries
Not found (404)        Log and skip
Access denied (403)    Log, mark as failed
JS rendering needed    Switch to Firecrawl
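
The table maps onto a small retry policy. This is a sketch: `fetch_with_policy`, the `(status, body)` return shape, and the `base_delay` knob are assumptions, and the timeout row is handled separately by each backend:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_policy(fetch, url, max_retries: int = 3,
                      base_delay: float = 1.0):
    """Apply the error table: skip 404, fail 403, back off on 429/5xx."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 404:
            return None                      # log and skip
        if status == 403:
            raise PermissionError(url)       # mark as failed
        if status in RETRYABLE and attempt < max_retries:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
            continue
        raise RuntimeError(f"{url}: HTTP {status}")
```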

Site Type Detection

Indicators for automatic routing:

SPA (→ Firecrawl):

  • URL contains #/ or uses hash routing
  • Page source shows React/Vue/Angular markers
  • Content loads dynamically after initial load

Static docs (→ Node.js/aiohttp):

  • Built with Hugo, Jekyll, MkDocs, Docusaurus, GitBook
  • Clean HTML structure
  • Server-side rendered
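
These indicators can be checked heuristically. The marker list below is an illustrative assumption, not the skill's actual detection logic:

```python
import re

# Common SPA mount points and framework fingerprints (illustrative set)
SPA_MARKERS = re.compile(
    r'id="(?:root|app|__next)"'   # React / Vue / Next.js mount nodes
    r'|ng-version='               # Angular
    r'|data-reactroot',
    re.IGNORECASE)

def looks_like_spa(url: str, html: str) -> bool:
    """Heuristic routing check based on the indicator lists above."""
    if "#/" in url:               # hash routing
        return True
    return bool(SPA_MARKERS.search(html))
```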

Scripts

  • scripts/select_crawler.py - Intelligent crawler selection
  • scripts/crawl_with_nodejs.sh - Node.js wrapper
  • scripts/crawl_with_aiohttp.sh - Python wrapper
  • scripts/crawl_with_firecrawl.py - Firecrawl MCP wrapper

Integration

Skill                              Data
reference-discovery                URL manifest input
content-repository                 Crawl manifest + raw files
quality-reviewer (deep_research)   Additional crawl requests