---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---

# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- ``: Single URL or path to manifest.json from reference-discovery
- `--crawler`: Force a specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---------|----------|--------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```

## Crawler Commands

**Node.js:**
```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js --max-pages 50
```

**Python aiohttp:**
```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url --max-pages 100
```

**Scrapy:**
```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url= -a max_pages=500
```

**Firecrawl MCP:**
Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`

## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:

- One markdown file per page
- Filename: `{url_hash}.md`

Generate a crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```

## Rate Limiting

All crawlers respect:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
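## Appendix: Implementation Sketches

The detection flow above can be sketched as a cascade of checks, tried in the same order as the flow chart. This is a hypothetical illustration; the function name and input flags are assumptions, not part of the actual tooling.

```python
def select_crawler(is_spa: bool, estimated_pages: int, needs_seo: bool) -> str:
    """Pick a crawler backend; checks mirror the Detection Flow order."""
    if is_spa:
        return "firecrawl"  # JS-rendered SPAs need a rendering backend
    if estimated_pages > 200:
        return "scrapy"     # large or multi-domain crawls
    if needs_seo:
        return "aiohttp"    # crawl plus SEO metadata extraction
    return "nodejs"         # default: small static docs sites
```

Checking the SPA condition first matters: a large React site should still go to Firecrawl, since page count alone says nothing about whether content is JS-rendered.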
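The rate-limiting policy (20 requests/minute, exponential backoff on 429/5xx) can be sketched as below. The base and cap values are assumptions; the real crawlers may tune them differently, and jitter is added to avoid synchronized retries across concurrent workers.

```python
import random

REQUESTS_PER_MINUTE = 20
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE  # 3 seconds between request slots

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` after a 429/5xx response.

    Doubles each attempt (base * 2^attempt), capped, with 50-100% jitter.
    """
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
```

On a 429, the crawler would sleep `backoff_delay(attempt)` and retry; on success the attempt counter resets and the normal `MIN_INTERVAL` pacing resumes.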