---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---

# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- `<url|manifest>`: Single URL or path to a manifest.json produced by reference-discovery
- `--crawler`: Force a specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
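
The flow above is first-match-wins and can be sketched as a simple check (the function name and its boolean inputs are illustrative, not part of the command itself):

```python
def select_crawler(is_spa: bool, page_count: int, needs_seo: bool) -> str:
    """Pick a backend per the detection flow: the first matching rule wins."""
    if is_spa:                # SPA/React/Vue → needs JS rendering
        return "firecrawl"
    if page_count > 200:      # large or multi-domain crawl
        return "scrapy"
    if needs_seo:             # SEO data extraction required
        return "aiohttp"
    return "nodejs"           # default: small static docs site


print(select_crawler(is_spa=False, page_count=45, needs_seo=False))  # nodejs
```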

## Crawler Commands

**Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:**

Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`

## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:

- One markdown file per page
- Filename: `{url_hash}.md`
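
A sketch of building that output path; the exact hash function is an assumption here (a truncated SHA-256 of the URL is one reasonable, deterministic choice):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def output_path(url: str, library: str = "~/reference-library") -> Path:
    """Return ~/reference-library/raw/YYYY/MM/{url_hash}.md for a page URL."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    now = datetime.now(timezone.utc)
    return (Path(library).expanduser() / "raw"
            / f"{now:%Y}" / f"{now:%m}" / f"{url_hash}.md")
```

Hashing the URL keeps filenames filesystem-safe and makes re-crawls of the same page overwrite the old copy instead of duplicating it.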

Generate crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
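
Assembling and writing that manifest could look like the following (field names mirror the example above; the `crawl-manifest.json` filename is an assumption):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(documents: list, crawler_used: str, out_dir: Path) -> Path:
    """Write a crawl manifest summarizing the finished crawl."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler_used,
        "total_crawled": len(documents),
        "documents": documents,
    }
    path = out_dir / "crawl-manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```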

## Rate Limiting

All crawlers respect:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
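
A minimal asyncio sketch of those limits, assuming the numbers above (`fetch` stands for any coroutine returning `(status, body)`; this is not the crawlers' actual implementation):

```python
import asyncio
import random

MAX_CONCURRENT = 3       # at most 3 requests in flight
MIN_INTERVAL = 60 / 20   # 20 requests/minute → ≥3 s between a worker's requests
RETRY_STATUSES = {429, 500, 502, 503, 504}


async def fetch_with_backoff(fetch, url: str, retries: int = 4):
    """Fetch one URL, backing off exponentially on 429/5xx responses."""
    for attempt in range(retries):
        status, body = await fetch(url)
        await asyncio.sleep(MIN_INTERVAL)  # stay under 20 req/min
        if status not in RETRY_STATUSES:
            return body
        await asyncio.sleep(2 ** attempt + random.random())  # ~1s, 2s, 4s, 8s
    raise RuntimeError(f"giving up on {url} after {retries} attempts")


async def crawl_all(fetch, urls):
    """Crawl many URLs with at most MAX_CONCURRENT in flight."""
    slots = asyncio.Semaphore(MAX_CONCURRENT)

    async def one(url):
        async with slots:
            return await fetch_with_backoff(fetch, url)

    return await asyncio.gather(*(one(u) for u in urls))
```

The semaphore caps concurrency while the per-request sleep enforces the request rate; jitter on the backoff avoids synchronized retries against an already-struggling server.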