---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: "<url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]"
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---
# Web Crawler Orchestrator
Crawl web content with intelligent backend selection.
## Arguments

- `<url|manifest>`: Single URL or path to manifest.json from reference-discovery
- `--crawler`: Force specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)
## Intelligent Crawler Selection
Auto-select based on site characteristics:
| Crawler | Best For | Auto-Selected When |
|---|---|---|
| Node.js (default) | Small docs sites | ≤50 pages, static content |
| Python aiohttp | Technical docs | ≤200 pages, needs SEO data |
| Scrapy | Enterprise crawls | >200 pages, multi-domain |
| Firecrawl MCP | Dynamic sites | SPAs, JS-rendered content |
### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
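The detection flow above can be sketched as a simple ordered heuristic. The predicate inputs (`is_spa`, `estimated_pages`, `needs_seo_data`) are illustrative placeholders, not functions from the crawler codebase:

```python
def select_crawler(is_spa: bool, estimated_pages: int, needs_seo_data: bool) -> str:
    """Pick a crawl backend following the detection flow above."""
    if is_spa:                    # SPA / JS-rendered content
        return "firecrawl"
    if estimated_pages > 200:     # large or multi-domain crawls
        return "scrapy"
    if needs_seo_data:            # technical docs that need SEO metadata
        return "aiohttp"
    return "nodejs"               # default: small static docs sites
```

Order matters: the SPA check runs first because a JS-rendered site needs Firecrawl regardless of size.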
## Crawler Commands

**Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:**

Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`
## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:
- One markdown file per page
- Filename: `{url_hash}.md`

Generate a crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
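A minimal sketch of the output step, assuming page content is already in hand. The hashing scheme (first 12 hex characters of a SHA-256 of the URL) is an assumption, since the source only specifies `{url_hash}.md`:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def save_page(base_dir: str, url: str, markdown: str) -> Path:
    """Write one page to <base_dir>/YYYY/MM/{url_hash}.md."""
    now = datetime.now(timezone.utc)
    out_dir = Path(base_dir) / f"{now:%Y}" / f"{now:%m}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed scheme: short SHA-256 prefix of the URL as the filename.
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:12]
    path = out_dir / f"{url_hash}.md"
    path.write_text(markdown, encoding="utf-8")
    return path

def write_manifest(base_dir: str, crawler: str, documents: list) -> Path:
    """Write the crawl manifest alongside the raw pages."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler,
        "total_crawled": len(documents),
        "documents": documents,
    }
    path = Path(base_dir) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path
```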
## Rate Limiting
All crawlers respect:
- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
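The limits above map naturally onto an asyncio-style client. A minimal sketch, assuming a shared semaphore caps concurrency and a sleep-based pacing loop enforces the per-minute budget; `session` stands in for an aiohttp `ClientSession`, and the helper name is illustrative:

```python
import asyncio
import random

MAX_CONCURRENT = 3        # 3 concurrent requests
MIN_INTERVAL = 60 / 20    # 20 requests/minute -> one every 3 s
RETRY_STATUSES = {429, 500, 502, 503, 504}

_semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch_with_backoff(session, url: str, max_retries: int = 5) -> str:
    """GET a URL, pacing requests and backing off exponentially on 429/5xx."""
    for attempt in range(max_retries):
        async with _semaphore:
            async with session.get(url) as resp:
                if resp.status in RETRY_STATUSES:
                    # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                    await asyncio.sleep(2 ** attempt + random.random())
                    continue
                resp.raise_for_status()
                body = await resp.text()
        await asyncio.sleep(MIN_INTERVAL)  # global pacing between requests
        return body
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```

The semaphore is held only for the request itself, so backoff sleeps for a failing URL do not block the other two slots.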