# Web Crawler Orchestrator
Orchestrates web crawling with intelligent backend selection. Automatically chooses the best crawler based on site characteristics.
## Trigger Keywords
"crawl URLs", "fetch documents", "scrape pages", "download references"
## Intelligent Crawler Selection
Claude automatically selects the optimal crawler based on the request:
| Crawler | Best For | Auto-Selected When |
|---|---|---|
| Node.js (default) | Small docs sites | ≤50 pages, static content |
| Python aiohttp | Technical docs | ≤200 pages, needs SEO data |
| Scrapy | Enterprise crawls | >200 pages, multi-domain |
| Firecrawl MCP | Dynamic sites | SPAs, JS-rendered content |
## Decision Flow

```
[Crawl Request]
 │
 ├─ Is it a SPA (React/Vue/Angular)? → Firecrawl MCP
 │
 ├─ >200 pages or multi-domain? → Scrapy
 │
 ├─ Needs SEO extraction? → Python aiohttp
 │
 └─ Default (small site) → Node.js
```
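The routing above can be sketched as a first-match-wins function (an illustrative Python sketch, not the actual `scripts/select_crawler.py`; the `SiteInfo` fields are assumptions):

```python
from dataclasses import dataclass

@dataclass
class SiteInfo:
    is_spa: bool        # JS-rendered (React/Vue/Angular)?
    page_count: int     # estimated number of pages
    multi_domain: bool  # does the crawl span more than one domain?
    needs_seo: bool     # is SEO field extraction required?

def select_crawler(site: SiteInfo) -> str:
    """Mirror the decision flow: first matching rule wins."""
    if site.is_spa:
        return "firecrawl"
    if site.page_count > 200 or site.multi_domain:
        return "scrapy"
    if site.needs_seo:
        return "aiohttp"
    return "nodejs"
```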
## Crawler Backends
### Node.js (Default)

Fast, lightweight crawler for small documentation sites.

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```
### Python aiohttp

Async crawler with full SEO extraction.

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```
### Scrapy

Enterprise-grade crawler with pipelines.

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```
### Firecrawl MCP

Use MCP tools for JavaScript-heavy sites:

```
firecrawl_scrape(url, formats=["markdown"], only_main_content=true)
firecrawl_crawl(url, max_depth=2, limit=50)
firecrawl_map(url, limit=100)  # Discover URLs first
```
## Workflow

### Step 1: Analyze Target Site
Determine site characteristics:
- Is it a SPA? (React, Vue, Angular, Next.js)
- How many pages expected?
- Does it need JavaScript rendering?
- Is SEO data extraction needed?
### Step 2: Select Crawler
Based on analysis, select the appropriate backend.
### Step 3: Load URL Manifest

```bash
# From reference-discovery output (-r strips JSON quotes for plain URLs)
cat manifest.json | jq -r '.urls[].url'
```
### Step 4: Execute Crawl

For Node.js:

```bash
cd ~/Project/our-seo-agent/util/js-crawler
while IFS= read -r url; do
  node src/crawler.js "$url" --max-pages 50
  sleep 2  # stay under the rate limit
done < urls.txt
```

For Firecrawl MCP (Claude Desktop/Code): use the firecrawl MCP tools directly in conversation.
### Step 5: Save Raw Content

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4.md
    └── b2c3d4e5.md
```
### Step 6: Generate Crawl Manifest

```json
{
  "crawl_date": "2025-01-28T12:00:00",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "total_failed": 5,
  "documents": [...]
}
```
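A manifest in this shape can be assembled from per-document results; the per-document `status` field used below is an assumed convention, not something defined above:

```python
import json
from datetime import datetime

def build_crawl_manifest(crawler: str, documents: list[dict]) -> dict:
    """Summarize a crawl run in the manifest shape shown above."""
    failed = [d for d in documents if d.get("status") != "ok"]
    return {
        "crawl_date": datetime.now().isoformat(timespec="seconds"),
        "crawler_used": crawler,
        "total_crawled": len(documents) - len(failed),
        "total_failed": len(failed),
        "documents": documents,
    }
```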
## Rate Limiting
All crawlers respect these limits:
- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
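A minimal sketch of how the request-rate and concurrency limits could be enforced in one place (illustrative only; each backend also has its own throttling options):

```python
import time
import threading

class RateLimiter:
    """Pace requests to at most `per_minute`, with `concurrency` in flight."""

    def __init__(self, per_minute: int = 20, concurrency: int = 3):
        self.interval = 60.0 / per_minute       # minimum gap between starts
        self.slots = threading.Semaphore(concurrency)
        self.lock = threading.Lock()
        self.next_at = 0.0                      # earliest allowed start time

    def __enter__(self):
        self.slots.acquire()                    # cap concurrent requests
        with self.lock:
            wait = self.next_at - time.monotonic()
            self.next_at = max(time.monotonic(), self.next_at) + self.interval
        if wait > 0:
            time.sleep(wait)                    # space out request starts
        return self

    def __exit__(self, *exc):
        self.slots.release()
        return False
```

Usage: wrap each request in `with limiter:` so both limits apply regardless of which backend issues it.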
## Error Handling
| Error | Action |
|---|---|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
| JS rendering needed | Switch to Firecrawl |
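The table's retry policy can be expressed as a small dispatcher (a sketch; timeout retries with a doubled timeout, and the switch to Firecrawl, would be handled separately by the caller):

```python
def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff delay before retry `attempt` (0-based): 1s, 2s, 4s, ... capped."""
    return min(base * (2 ** attempt), cap)

def handle_status(status: int, attempt: int, max_retries: int = 3) -> str:
    """Map an HTTP status to the action in the table above."""
    if status == 429 or status >= 500:
        return "retry" if attempt < max_retries else "failed"
    if status == 404:
        return "skip"    # log and skip
    if status == 403:
        return "failed"  # log, mark as failed
    return "ok"
```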
## Site Type Detection
Indicators for automatic routing:
SPA (→ Firecrawl):
- URL contains `#/` or uses hash routing
- Page source shows React/Vue/Angular markers
- Content loads dynamically after initial load
Static docs (→ Node.js/aiohttp):
- Built with Hugo, Jekyll, MkDocs, Docusaurus, GitBook
- Clean HTML structure
- Server-side rendered
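These indicators can be approximated with a heuristic check (the marker patterns below are illustrative assumptions, not an exhaustive list):

```python
import re

# Common SPA mount points and framework markers (assumed, not exhaustive)
SPA_MARKERS = re.compile(
    r'id="(?:root|app|__next)"'      # React / Vue / Next.js mount points
    r'|data-reactroot|ng-version'    # React legacy / Angular markers
)

def looks_like_spa(url: str, html: str) -> bool:
    """Heuristic routing check based on the indicators above."""
    if "#/" in url:                  # hash routing
        return True
    if SPA_MARKERS.search(html):
        return True
    # Sparse body text plus heavy <script> usage suggests client-side rendering
    text = re.sub(r"<[^>]+>", " ", html)
    return html.count("<script") > 5 and len(text.split()) < 50
```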
## Scripts

- `scripts/select_crawler.py` - Intelligent crawler selection
- `scripts/crawl_with_nodejs.sh` - Node.js wrapper
- `scripts/crawl_with_aiohttp.sh` - Python wrapper
- `scripts/crawl_with_firecrawl.py` - Firecrawl MCP wrapper
## Integration
| From | To |
|---|---|
| reference-discovery | URL manifest input |
| → content-repository | Crawl manifest + raw files |
| quality-reviewer (deep_research) | Additional crawl requests |