---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---

# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- `<url|manifest>`: Single URL or path to a manifest.json produced by reference-discovery
- `--crawler`: Force a specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
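
The flow above is first-match-wins and can be sketched as a simple check (the function name and its boolean inputs are illustrative, not part of the command itself):

```python
def select_crawler(is_spa: bool, page_count: int, needs_seo: bool) -> str:
    """Pick a backend per the detection flow: the first matching rule wins."""
    if is_spa:                # SPA/React/Vue → needs JS rendering
        return "firecrawl"
    if page_count > 200:      # large or multi-domain crawl
        return "scrapy"
    if needs_seo:             # SEO data extraction required
        return "aiohttp"
    return "nodejs"           # default: small static docs site


print(select_crawler(is_spa=False, page_count=45, needs_seo=False))  # nodejs
```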

## Crawler Commands

**Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:**

Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`

## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:

- One markdown file per page
- Filename: `{url_hash}.md`
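
A sketch of building that output path; the exact hash function is an assumption here (a truncated SHA-256 of the URL is one reasonable, deterministic choice):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def output_path(url: str, library: str = "~/reference-library") -> Path:
    """Return ~/reference-library/raw/YYYY/MM/{url_hash}.md for a page URL."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    now = datetime.now(timezone.utc)
    return (Path(library).expanduser() / "raw"
            / f"{now:%Y}" / f"{now:%m}" / f"{url_hash}.md")
```

Hashing the URL keeps filenames filesystem-safe and makes re-crawls of the same page overwrite the old copy instead of duplicating it.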

Generate crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
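
Assembling and writing that manifest could look like the following (field names mirror the example above; the `crawl-manifest.json` filename is an assumption):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(documents: list, crawler_used: str, out_dir: Path) -> Path:
    """Write a crawl manifest summarizing the finished crawl."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler_used,
        "total_crawled": len(documents),
        "documents": documents,
    }
    path = out_dir / "crawl-manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```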

## Rate Limiting

All crawlers respect:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
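
A minimal asyncio sketch of those limits, assuming the numbers above (`fetch` stands for any coroutine returning `(status, body)`; this is not the crawlers' actual implementation):

```python
import asyncio
import random

MAX_CONCURRENT = 3       # at most 3 requests in flight
MIN_INTERVAL = 60 / 20   # 20 requests/minute → ≥3 s between a worker's requests
RETRY_STATUSES = {429, 500, 502, 503, 504}


async def fetch_with_backoff(fetch, url: str, retries: int = 4):
    """Fetch one URL, backing off exponentially on 429/5xx responses."""
    for attempt in range(retries):
        status, body = await fetch(url)
        await asyncio.sleep(MIN_INTERVAL)  # stay under 20 req/min
        if status not in RETRY_STATUSES:
            return body
        await asyncio.sleep(2 ** attempt + random.random())  # ~1s, 2s, 4s, 8s
    raise RuntimeError(f"giving up on {url} after {retries} attempts")


async def crawl_all(fetch, urls):
    """Crawl many URLs with at most MAX_CONCURRENT in flight."""
    slots = asyncio.Semaphore(MAX_CONCURRENT)

    async def one(url):
        async with slots:
            return await fetch_with_backoff(fetch, url)

    return await asyncio.gather(*(one(u) for u in urls))
```

The semaphore caps concurrency while the per-request sleep enforces the request rate; jitter on the backoff avoids synchronized retries against an already-struggling server.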