---
name: web-crawler-orchestrator
description: |
  Multi-backend web crawler with Firecrawl MCP, rate limiting, and format handling.
  Triggers: crawl URLs, fetch pages, scrape content, web crawler.
---

# Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP, with rate limiting and per-format content handling.

## Prerequisites

- Firecrawl MCP server connected
- Config file at `~/.config/reference-curator/crawl_config.yaml`
- Storage directory exists: `~/reference-library/raw/`

## Crawl Configuration

```yaml
# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
    include_html: false

processing:
  max_content_size_mb: 50
  raw_content_dir: ~/reference-library/raw/
```

## Crawl Workflow

### Step 1: Load URL Manifest

Receive the URL manifest produced by `reference-discovery`:

```python
import json

def load_manifest(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest["urls"]
```

### Step 2: Determine Crawl Strategy

```python
def select_strategy(url):
    """Select optimal crawl strategy based on URL characteristics."""
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'   # Get raw file content
    elif 'github.com' in url:
        return 'scrape'        # Repository pages
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'        # Documentation sites
    else:
        return 'scrape'        # Default
```

### Step 3: Execute Firecrawl

Use Firecrawl MCP for crawling:

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],   # markdown | html | screenshot
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl (documentation sites)
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"],
    only_main_content=True
)
```

### Step 4: Rate Limiting

A sliding-window limiter keeps requests under the configured `requests_per_minute`:

```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Drop requests older than 1 minute from the window
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())
```

### Step 5: Save Raw Content

Files are keyed by a SHA-256 hash of the URL and stored under a dated subdirectory:

```python
import hashlib
from datetime import datetime
from pathlib import Path

def save_content(url, content, content_type='markdown'):
    """Save crawled content to raw storage."""
    # Generate filename from URL hash
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]

    # Determine extension
    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
    ext = ext_map.get(content_type, '.txt')

    # Create dated subdirectory
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)

    # Save file
    filepath = output_dir / f"{url_hash}{ext}"
    if content_type == 'pdf':
        filepath.write_bytes(content)
    else:
        filepath.write_text(content, encoding='utf-8')

    return str(filepath)
```

### Step 6: Generate Crawl Manifest

```python
from datetime import datetime

def create_crawl_manifest(results):
    manifest = {
        "crawl_date": datetime.now().isoformat(),
        "total_crawled": len([r for r in results if r["status"] == "success"]),
        "total_failed": len([r for r in results if r["status"] == "failed"]),
        "documents": []
    }
    for result in results:
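        # Record every result, success or failure, so failed URLs
        # can be re-queued on a later run.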
manifest["documents"].append({ "url": result["url"], "status": result["status"], "raw_content_path": result.get("filepath"), "content_size": result.get("size"), "crawl_method": "firecrawl", "error": result.get("error") }) return manifest ``` ## Error Handling | Error | Action | |-------|--------| | Timeout | Retry once with 2x timeout | | Rate limit (429) | Exponential backoff, max 3 retries | | Not found (404) | Log and skip | | Access denied (403) | Log, mark as `failed` | | Connection error | Retry with backoff | ```python def crawl_with_retry(url, max_retries=3): for attempt in range(max_retries): try: result = firecrawl_scrape(url) return {"status": "success", "content": result} except RateLimitError: wait = 2 ** attempt * 10 # 10, 20, 40 seconds time.sleep(wait) except TimeoutError: if attempt == 0: # Retry with doubled timeout result = firecrawl_scrape(url, timeout=60000) return {"status": "success", "content": result} except NotFoundError: return {"status": "failed", "error": "404 Not Found"} except Exception as e: if attempt == max_retries - 1: return {"status": "failed", "error": str(e)} return {"status": "failed", "error": "Max retries exceeded"} ``` ## Firecrawl MCP Reference **scrape** - Single page: ``` firecrawl_scrape(url, formats, only_main_content, timeout) ``` **crawl** - Multi-page: ``` firecrawl_crawl(url, max_depth, limit, formats, only_main_content) ``` **map** - Discover URLs: ``` firecrawl_map(url, limit) # Returns list of URLs on site ``` ## Integration | From | Input | To | |------|-------|-----| | reference-discovery | URL manifest | web-crawler-orchestrator | | web-crawler-orchestrator | Crawl manifest + raw files | content-repository | | quality-reviewer (deep_research) | Additional queries | reference-discovery → here | ## Output Structure ``` ~/reference-library/raw/ └── 2025/01/ ├── a1b2c3d4e5f6g7h8.md # Markdown content ├── b2c3d4e5f6g7h8i9.md └── c3d4e5f6g7h8i9j0.pdf # PDF documents ```