# Web Crawler Orchestrator

Orchestrates web crawling with intelligent backend selection. Automatically chooses the best crawler based on site characteristics.

## Trigger Keywords

"crawl URLs", "fetch documents", "scrape pages", "download references"

## Intelligent Crawler Selection

Claude automatically selects the optimal crawler based on the request:

| Crawler | Best For | Auto-Selected When |
|---------|----------|--------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Decision Flow

```
[Crawl Request]
 │
 ├─ Is it SPA/React/Vue/Angular? → Firecrawl MCP
 │
 ├─ >200 pages or multi-domain? → Scrapy
 │
 ├─ Needs SEO extraction? → Python aiohttp
 │
 └─ Default (small site) → Node.js
```

## Crawler Backends

### Node.js (Default)

Fast, lightweight crawler for small documentation sites.

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js --max-pages 50
```

### Python aiohttp

Async crawler with full SEO extraction.

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url --max-pages 100
```

### Scrapy

Enterprise-grade crawler with pipelines.

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url= -a max_pages=500
```

### Firecrawl MCP

Use MCP tools for JavaScript-heavy sites:

```
firecrawl_scrape(url, formats=["markdown"], only_main_content=true)
firecrawl_crawl(url, max_depth=2, limit=50)
firecrawl_map(url, limit=100)  # Discover URLs first
```

## Workflow

### Step 1: Analyze Target Site

Determine site characteristics:

- Is it a SPA? (React, Vue, Angular, Next.js)
- How many pages are expected?
- Does it need JavaScript rendering?
- Is SEO data extraction needed?

### Step 2: Select Crawler

Based on the analysis, select the appropriate backend.
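The decision flow above can be sketched as a small routing function. This is a minimal sketch, assuming boolean site-analysis inputs; `select_crawler` and its parameters are hypothetical names, not the actual interface of `scripts/select_crawler.py`.

```python
def select_crawler(is_spa: bool, page_count: int,
                   multi_domain: bool, needs_seo: bool) -> str:
    """Hypothetical sketch of the decision flow above (the real logic
    lives in scripts/select_crawler.py)."""
    if is_spa:                              # SPA / JS-rendered → Firecrawl MCP
        return "firecrawl"
    if page_count > 200 or multi_domain:    # large or multi-domain → Scrapy
        return "scrapy"
    if needs_seo:                           # SEO extraction → Python aiohttp
        return "aiohttp"
    return "nodejs"                         # default: small static site
```

The checks are ordered by priority: JavaScript rendering trumps scale, and scale trumps SEO needs, matching the decision flow from top to bottom.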
### Step 3: Load URL Manifest

```bash
# From reference-discovery output; -r emits raw strings for Step 4
jq -r '.urls[].url' manifest.json > urls.txt
```

### Step 4: Execute Crawl

**For Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
while read -r url; do
  node src/crawler.js "$url" --max-pages 50
  sleep 2
done < urls.txt
```

**For Firecrawl MCP (Claude Desktop/Code):**

Use the firecrawl MCP tools directly in conversation.

### Step 5: Save Raw Content

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4.md
    └── b2c3d4e5.md
```

### Step 6: Generate Crawl Manifest

```json
{
  "crawl_date": "2025-01-28T12:00:00",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "total_failed": 5,
  "documents": [...]
}
```

## Rate Limiting

All crawlers respect these limits:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| JS rendering needed | Switch to Firecrawl |

## Site Type Detection

Indicators for automatic routing:

**SPA (→ Firecrawl):**
- URL contains `#/` or uses hash routing
- Page source shows React/Vue/Angular markers
- Content loads dynamically after the initial load

**Static docs (→ Node.js/aiohttp):**
- Built with Hugo, Jekyll, MkDocs, Docusaurus, or GitBook
- Clean HTML structure
- Server-side rendered

## Scripts

- `scripts/select_crawler.py` - Intelligent crawler selection
- `scripts/crawl_with_nodejs.sh` - Node.js wrapper
- `scripts/crawl_with_aiohttp.sh` - Python wrapper
- `scripts/crawl_with_firecrawl.py` - Firecrawl MCP wrapper

## Integration

| Direction | Component | Data |
|-----------|-----------|------|
| Input | reference-discovery | URL manifest |
| Output | content-repository | Crawl manifest + raw files |
| Input | quality-reviewer (deep_research) | Additional crawl requests |
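The retry rules from the rate-limiting and error-handling sections can be sketched as a wrapper around any fetch call. This is a minimal sketch, not the crawlers' actual implementation; `fetch_with_backoff` and its `fetch` callback (returning a `(status, body)` pair) are hypothetical names.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0):
    """Hypothetical sketch of the retry policy: exponential backoff with
    jitter on 429/5xx (max 3 retries), skip on 404."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 404:
            return None  # not found: log and skip
        if status == 429 or status >= 500:
            if attempt == max_retries:
                raise RuntimeError(f"giving up on {url} after {attempt} retries")
            # backoff delays grow as 1x, 2x, 4x base_delay, plus jitter
            time.sleep(base_delay * (2 ** attempt + random.random()))
            continue
        return body
```

The jitter term spreads retries out so that concurrent workers hitting the same rate limit do not all retry in lockstep.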