# Web Crawler Orchestrator

Orchestrates web crawling with intelligent backend selection. Automatically chooses the best crawler based on site characteristics.

## Trigger Keywords

"crawl URLs", "fetch documents", "scrape pages", "download references"

## Intelligent Crawler Selection

Claude automatically selects the optimal crawler based on the request:

| Crawler | Best For | Auto-Selected When |
|---------|----------|--------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Decision Flow

```
[Crawl Request]
 │
 ├─ Is it SPA/React/Vue/Angular? → Firecrawl MCP
 │
 ├─ >200 pages or multi-domain? → Scrapy
 │
 ├─ Needs SEO extraction? → Python aiohttp
 │
 └─ Default (small site) → Node.js
```

## Crawler Backends

### Node.js (Default)

Fast, lightweight crawler for small documentation sites.

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js --max-pages 50
```

### Python aiohttp

Async crawler with full SEO extraction.

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url --max-pages 100
```

### Scrapy

Enterprise-grade crawler with pipelines.

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url= -a max_pages=500
```

### Firecrawl MCP

Use MCP tools for JavaScript-heavy sites:

```
firecrawl_scrape(url, formats=["markdown"], only_main_content=true)
firecrawl_crawl(url, max_depth=2, limit=50)
firecrawl_map(url, limit=100)  # Discover URLs first
```

## Workflow

### Step 1: Analyze Target Site

Determine site characteristics:

- Is it a SPA? (React, Vue, Angular, Next.js)
- How many pages are expected?
- Does it need JavaScript rendering?
- Is SEO data extraction needed?

### Step 2: Select Crawler

Based on the analysis, select the appropriate backend.
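The decision flow above can be sketched as a small routing function. This is a minimal sketch, assuming boolean site-analysis inputs; `select_crawler` and its parameters are hypothetical names, not the actual interface of `scripts/select_crawler.py`.

```python
def select_crawler(is_spa: bool, page_count: int,
                   multi_domain: bool, needs_seo: bool) -> str:
    """Hypothetical sketch of the decision flow above (the real logic
    lives in scripts/select_crawler.py)."""
    if is_spa:                              # SPA / JS-rendered → Firecrawl MCP
        return "firecrawl"
    if page_count > 200 or multi_domain:    # large or multi-domain → Scrapy
        return "scrapy"
    if needs_seo:                           # SEO extraction → Python aiohttp
        return "aiohttp"
    return "nodejs"                         # default: small static site
```

The checks are ordered by priority: JavaScript rendering trumps scale, and scale trumps SEO needs, matching the decision flow from top to bottom.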
### Step 3: Load URL Manifest

```bash
# From reference-discovery output; -r emits raw strings for Step 4
jq -r '.urls[].url' manifest.json > urls.txt
```

### Step 4: Execute Crawl

**For Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
while read -r url; do
  node src/crawler.js "$url" --max-pages 50
  sleep 2
done < urls.txt
```

**For Firecrawl MCP (Claude Desktop/Code):**

Use the firecrawl MCP tools directly in conversation.

### Step 5: Save Raw Content

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4.md
    └── b2c3d4e5.md
```

### Step 6: Generate Crawl Manifest

```json
{
  "crawl_date": "2025-01-28T12:00:00",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "total_failed": 5,
  "documents": [...]
}
```

## Rate Limiting

All crawlers respect these limits:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| JS rendering needed | Switch to Firecrawl |

## Site Type Detection

Indicators for automatic routing:

**SPA (→ Firecrawl):**
- URL contains `#/` or uses hash routing
- Page source shows React/Vue/Angular markers
- Content loads dynamically after the initial load

**Static docs (→ Node.js/aiohttp):**
- Built with Hugo, Jekyll, MkDocs, Docusaurus, or GitBook
- Clean HTML structure
- Server-side rendered

## Scripts

- `scripts/select_crawler.py` - Intelligent crawler selection
- `scripts/crawl_with_nodejs.sh` - Node.js wrapper
- `scripts/crawl_with_aiohttp.sh` - Python wrapper
- `scripts/crawl_with_firecrawl.py` - Firecrawl MCP wrapper

## Integration

| Direction | Component | Data |
|-----------|-----------|------|
| Input | reference-discovery | URL manifest |
| Output | content-repository | Crawl manifest + raw files |
| Input | quality-reviewer (deep_research) | Additional crawl requests |
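The retry rules from the rate-limiting and error-handling sections can be sketched as a wrapper around any fetch call. This is a minimal sketch, not the crawlers' actual implementation; `fetch_with_backoff` and its `fetch` callback (returning a `(status, body)` pair) are hypothetical names.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0):
    """Hypothetical sketch of the retry policy: exponential backoff with
    jitter on 429/5xx (max 3 retries), skip on 404."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 404:
            return None  # not found: log and skip
        if status == 429 or status >= 500:
            if attempt == max_retries:
                raise RuntimeError(f"giving up on {url} after {attempt} retries")
            # backoff delays grow as 1x, 2x, 4x base_delay, plus jitter
            time.sleep(base_delay * (2 ** attempt + random.random()))
            continue
        return body
```

The jitter term spreads retries out so that concurrent workers hitting the same rate limit do not all retry in lockstep.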