Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00


Web Crawler Orchestrator

Orchestrates web crawling with intelligent backend selection, automatically choosing the best crawler for the target site's characteristics.

Trigger Keywords

"crawl URLs", "fetch documents", "scrape pages", "download references"

Intelligent Crawler Selection

Claude automatically selects the optimal crawler based on the request:

Crawler             Best For            Auto-Selected When
Node.js (default)   Small docs sites    ≤50 pages, static content
Python aiohttp      Technical docs      ≤200 pages, needs SEO data
Scrapy              Enterprise crawls   >200 pages, multi-domain
Firecrawl MCP       Dynamic sites       SPAs, JS-rendered content

Decision Flow

[Crawl Request]
      │
      ├─ Is it SPA/React/Vue/Angular? → Firecrawl MCP
      │
      ├─ >200 pages or multi-domain? → Scrapy
      │
      ├─ Needs SEO extraction? → Python aiohttp
      │
      └─ Default (small site) → Node.js
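
The decision flow above can be sketched as a small selection function. This is a hypothetical sketch; the actual logic lives in scripts/select_crawler.py and may differ:

```python
def select_crawler(is_spa: bool, page_count: int,
                   multi_domain: bool, needs_seo: bool) -> str:
    """Pick a crawler backend following the decision flow above."""
    if is_spa:
        return "firecrawl"   # JS-rendered content needs headless fetching
    if page_count > 200 or multi_domain:
        return "scrapy"      # enterprise-scale crawls
    if needs_seo:
        return "aiohttp"     # async crawl with SEO extraction
    return "nodejs"          # default for small static sites
```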

Crawler Backends

Node.js (Default)

Fast, lightweight crawler for small documentation sites.

cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50

Python aiohttp

Async crawler with full SEO extraction.

cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100

Scrapy

Enterprise-grade crawler with pipelines.

cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500

Firecrawl MCP

Use MCP tools for JavaScript-heavy sites:

firecrawl_scrape(url, formats=["markdown"], only_main_content=true)
firecrawl_crawl(url, max_depth=2, limit=50)
firecrawl_map(url, limit=100)  # Discover URLs first

Workflow

Step 1: Analyze Target Site

Determine site characteristics:

  • Is it a SPA? (React, Vue, Angular, Next.js)
  • How many pages expected?
  • Does it need JavaScript rendering?
  • Is SEO data extraction needed?

Step 2: Select Crawler

Based on analysis, select the appropriate backend.

Step 3: Load URL Manifest

# From reference-discovery output
jq -r '.urls[].url' manifest.json
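
The same step in Python, assuming the manifest has a top-level "urls" array of objects with a "url" field (the function name is illustrative):

```python
import json
from pathlib import Path

def load_url_manifest(path: str) -> list[str]:
    """Return the crawl URLs from a reference-discovery manifest."""
    manifest = json.loads(Path(path).read_text())
    return [entry["url"] for entry in manifest["urls"]]
```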

Step 4: Execute Crawl

For Node.js:

cd ~/Project/our-seo-agent/util/js-crawler
while read -r url; do
  node src/crawler.js "$url" --max-pages 50
  sleep 2
done < urls.txt

For Firecrawl MCP (Claude Desktop/Code): Use the firecrawl MCP tools directly in conversation.

Step 5: Save Raw Content

~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4.md
    └── b2c3d4e5.md
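
A hypothetical sketch of the save step, assuming the 8-character filenames above are content IDs derived from the URL (the real skill may derive them differently):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def save_raw(markdown: str, url: str,
             library: str = "~/reference-library/raw") -> Path:
    """Store crawled markdown under raw/YYYY/MM/<8-char id>.md."""
    now = datetime.now(timezone.utc)
    doc_id = hashlib.sha256(url.encode()).hexdigest()[:8]
    out_dir = Path(library).expanduser() / f"{now:%Y}" / f"{now:%m}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{doc_id}.md"
    out_path.write_text(markdown)
    return out_path
```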

Step 6: Generate Crawl Manifest

{
  "crawl_date": "2025-01-28T12:00:00",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "total_failed": 5,
  "documents": [...]
}
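
Assembling that manifest can be sketched as follows; field names follow the example above, while the per-document `status` key is an assumption for illustration:

```python
import json
from datetime import datetime

def build_crawl_manifest(crawler: str, documents: list[dict]) -> dict:
    """Summarize a crawl run in the manifest shape shown above."""
    ok = [d for d in documents if d.get("status") == "ok"]
    return {
        "crawl_date": datetime.now().isoformat(timespec="seconds"),
        "crawler_used": crawler,
        "total_crawled": len(ok),
        "total_failed": len(documents) - len(ok),
        "documents": documents,
    }
```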

Rate Limiting

All crawlers respect these limits:

  • 20 requests/minute
  • 3 concurrent requests
  • Exponential backoff on 429/5xx
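
The first two limits can be sketched with asyncio primitives (20 request starts per minute, at most 3 in flight); the `Throttle` name is illustrative, not part of the skill:

```python
import asyncio
import time

class Throttle:
    """Cap request starts at a fixed rate with bounded concurrency."""

    def __init__(self, per_minute: int = 20, concurrent: int = 3):
        self._interval = 60 / per_minute
        self._sem = asyncio.Semaphore(concurrent)
        self._lock = asyncio.Lock()
        self._next = 0.0  # earliest allowed start time (monotonic)

    async def run(self, coro_fn, *args):
        async with self._sem:            # at most `concurrent` in flight
            async with self._lock:       # space starts `interval` apart
                now = time.monotonic()
                wait = self._next - now
                self._next = max(now, self._next) + self._interval
            if wait > 0:
                await asyncio.sleep(wait)
            return await coro_fn(*args)
```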

Error Handling

Error                  Action
Timeout                Retry once with 2x timeout
Rate limit (429)       Exponential backoff, max 3 retries
Not found (404)        Log and skip
Access denied (403)    Log, mark as failed
JS rendering needed    Switch to Firecrawl
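
The table maps onto a small retry policy. This is a sketch: `fetch_with_policy`, the `(status, body)` return shape, and the `base_delay` knob are assumptions, and the timeout row is handled separately by each backend:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_policy(fetch, url, max_retries: int = 3,
                      base_delay: float = 1.0):
    """Apply the error table: skip 404, fail 403, back off on 429/5xx."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 404:
            return None                      # log and skip
        if status == 403:
            raise PermissionError(url)       # mark as failed
        if status in RETRYABLE and attempt < max_retries:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
            continue
        raise RuntimeError(f"{url}: HTTP {status}")
```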

Site Type Detection

Indicators for automatic routing:

SPA (→ Firecrawl):

  • URL contains #/ or uses hash routing
  • Page source shows React/Vue/Angular markers
  • Content loads dynamically after initial load

Static docs (→ Node.js/aiohttp):

  • Built with Hugo, Jekyll, MkDocs, Docusaurus, GitBook
  • Clean HTML structure
  • Server-side rendered
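
These indicators can be checked heuristically. The marker list below is an illustrative assumption, not the skill's actual detection logic:

```python
import re

# Common SPA mount points and framework fingerprints (illustrative set)
SPA_MARKERS = re.compile(
    r'id="(?:root|app|__next)"'   # React / Vue / Next.js mount nodes
    r'|ng-version='               # Angular
    r'|data-reactroot',
    re.IGNORECASE)

def looks_like_spa(url: str, html: str) -> bool:
    """Heuristic routing check based on the indicator lists above."""
    if "#/" in url:               # hash routing
        return True
    return bool(SPA_MARKERS.search(html))
```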

Scripts

  • scripts/select_crawler.py - Intelligent crawler selection
  • scripts/crawl_with_nodejs.sh - Node.js wrapper
  • scripts/crawl_with_aiohttp.sh - Python wrapper
  • scripts/crawl_with_firecrawl.py - Firecrawl MCP wrapper

Integration

Skill                              Data
reference-discovery                URL manifest input
content-repository                 Crawl manifest + raw files
quality-reviewer (deep_research)   Additional crawl requests