---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---
# Web Crawler Orchestrator
Crawl web content with intelligent backend selection.
## Arguments
- `<url|manifest>`: Single URL or path to manifest.json from reference-discovery
- `--crawler`: Force specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)
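For example, a hypothetical invocation that forces the Scrapy backend (URL and values are illustrative):
```
/web-crawler https://docs.example.com --crawler scrapy --max-pages 500
```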
## Intelligent Crawler Selection
Auto-select a crawler based on site characteristics:
| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |
### Detection Flow
```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
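A minimal Python sketch of this flow. The SPA markers and the upstream page-count estimate are assumptions for illustration, not part of the shipped crawlers:
```python
import urllib.request

# Hypothetical framework markers used as a crude SPA signal.
SPA_MARKERS = ("data-reactroot", 'id="__next"', "ng-app", "data-v-app")

def looks_like_spa(url: str) -> bool:
    """Crude SPA check: scan the first 64 KiB of raw HTML for framework markers."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read(65536).decode("utf-8", "ignore")
    return any(marker in html for marker in SPA_MARKERS)

def select_crawler(url: str, estimated_pages: int, needs_seo: bool = False) -> str:
    """Pick a backend following the detection flow above."""
    if looks_like_spa(url):
        return "firecrawl"   # SPA / JS-rendered content
    if estimated_pages > 200:
        return "scrapy"      # enterprise-scale or multi-domain crawl
    if needs_seo:
        return "aiohttp"     # technical docs needing SEO data
    return "nodejs"          # default: small static docs site
```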
## Crawler Commands
**Node.js:**
```bash
# Small static docs sites (≤50 pages)
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```
**Python aiohttp:**
```bash
# Technical docs needing SEO data (≤200 pages)
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```
**Scrapy:**
```bash
# Enterprise-scale or multi-domain crawls (>200 pages)
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```
**Firecrawl MCP:**
Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`
## Output
Save crawled content to `~/reference-library/raw/YYYY/MM/`:
- One markdown file per page
- Filename: `{url_hash}.md`
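A sketch of how the output path might be derived. The hash algorithm and truncation length are assumptions; the spec above only says `{url_hash}`:
```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def output_path(url: str, library: str = "~/reference-library") -> Path:
    """Build raw/YYYY/MM/{url_hash}.md for a crawled page."""
    now = datetime.now(timezone.utc)
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]  # assumed scheme
    path = Path(library).expanduser() / "raw" / f"{now:%Y}" / f"{now:%m}" / f"{url_hash}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```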
Generate crawl manifest:
```json
{
"crawl_date": "ISO timestamp",
"crawler_used": "nodejs",
"total_crawled": 45,
"documents": [...]
}
```
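A corresponding sketch for writing the manifest. Field names follow the example above; the manifest filename and its location alongside the raw files are assumptions:
```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_manifest(documents: list[dict], crawler_used: str, out_dir: Path) -> Path:
    """Write the crawl manifest next to the raw output (location assumed)."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler_used,
        "total_crawled": len(documents),
        "documents": documents,
    }
    path = out_dir / "crawl-manifest.json"  # hypothetical filename
    path.write_text(json.dumps(manifest, indent=2))
    return path
```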
## Rate Limiting
All crawlers respect:
- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
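A sketch of how a shared limiter might enforce these rules in the aiohttp backend; the primitives here are illustrative, not the shipped implementation:
```python
import asyncio
import aiohttp

CONCURRENCY = asyncio.Semaphore(3)   # at most 3 requests in flight
PACE = asyncio.Lock()                # serializes request starts
MIN_INTERVAL = 60 / 20               # 20 requests/minute → one start every 3 s

async def fetch(session: aiohttp.ClientSession, url: str, retries: int = 5) -> str:
    """GET a URL under the shared limits, backing off on 429/5xx."""
    async with CONCURRENCY:
        for attempt in range(retries):
            async with PACE:                       # space out request starts
                await asyncio.sleep(MIN_INTERVAL)
            async with session.get(url) as resp:
                if resp.status == 429 or resp.status >= 500:
                    await asyncio.sleep(2 ** attempt)  # exponential backoff
                    continue
                resp.raise_for_status()
                return await resp.text()
    raise RuntimeError(f"gave up on {url} after {retries} attempts")
```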