name: web-crawler-orchestrator
description: Orchestrates web crawling using Firecrawl MCP. Handles rate limiting, selects crawl strategies, manages formats (HTML/PDF/markdown), and produces raw content with manifests. Triggers on "crawl URLs", "fetch documents", "scrape pages", "download references", "Firecrawl crawl".

Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

Prerequisites

  • Firecrawl MCP server connected
  • Config file at ~/.config/reference-curator/crawl_config.yaml
  • Storage directory exists: ~/reference-library/raw/
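
A quick preflight check can confirm the local prerequisites before a crawl starts (a minimal sketch; Firecrawl MCP connectivity itself is verified by the MCP client, not here):

from pathlib import Path

def check_prerequisites():
    """Return a list of missing local prerequisites (an empty list means ready to crawl)."""
    config = Path.home() / '.config/reference-curator/crawl_config.yaml'
    raw_dir = Path.home() / 'reference-library/raw'

    missing = []
    if not config.exists():
        missing.append(f"config file not found: {config}")
    if not raw_dir.is_dir():
        missing.append(f"storage directory not found: {raw_dir}")
    return missing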

Crawl Configuration

# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
    include_html: false

processing:
  max_content_size_mb: 50
  raw_content_dir: ~/reference-library/raw/
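
A sketch for loading this config (assumes PyYAML; expanding raw_content_dir here is illustrative, not a fixed contract):

import yaml
from pathlib import Path

def load_crawl_config(path=None):
    """Load crawl settings from the YAML config and expand ~ in stored paths."""
    path = Path(path or Path.home() / '.config/reference-curator/crawl_config.yaml')
    with open(path) as f:
        config = yaml.safe_load(f)
    # Expand the raw content directory so later steps can use it directly
    config['processing']['raw_content_dir'] = str(
        Path(config['processing']['raw_content_dir']).expanduser()
    )
    return config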

Crawl Workflow

Step 1: Load URL Manifest

Receive manifest from reference-discovery:

import json

def load_manifest(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest["urls"]

Step 2: Determine Crawl Strategy

def select_strategy(url):
    """Select optimal crawl strategy based on URL characteristics."""
    
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'  # Get raw file content
    elif 'github.com' in url:
        return 'scrape'  # Repository pages
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'  # Documentation sites
    else:
        return 'scrape'  # Default
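
For the raw_content strategy, a GitHub blob URL can be rewritten to its raw equivalent before fetching (a sketch relying on GitHub's standard raw.githubusercontent.com layout):

def to_raw_github_url(url):
    """Convert https://github.com/<owner>/<repo>/blob/<ref>/<path> to the raw file URL."""
    return url.replace('github.com', 'raw.githubusercontent.com', 1).replace('/blob/', '/', 1)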

Step 3: Execute Firecrawl

Use Firecrawl MCP for crawling:

# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],  # markdown | html | screenshot
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl (documentation sites)
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"],
    only_main_content=True
)

Step 4: Rate Limiting

import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()
    
    def wait_if_needed(self):
        now = time.time()
        # Remove requests older than 1 minute
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        
        self.request_times.append(time.time())
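
Usage sketch: wrap every Firecrawl call with the limiter so the requests_per_minute value from the config is honored (firecrawl_scrape stands in for the MCP tool call):

limiter = RateLimiter(requests_per_minute=20)

for url in urls:
    limiter.wait_if_needed()
    result = firecrawl_scrape(url, formats=["markdown"], only_main_content=True)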

Step 5: Save Raw Content

import hashlib
from datetime import datetime
from pathlib import Path

def save_content(url, content, content_type='markdown'):
    """Save crawled content to raw storage."""
    
    # Generate filename from URL hash
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    
    # Determine extension
    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
    ext = ext_map.get(content_type, '.txt')
    
    # Create dated subdirectory
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Save file
    filepath = output_dir / f"{url_hash}{ext}"
    if content_type == 'pdf':
        filepath.write_bytes(content)
    else:
        filepath.write_text(content, encoding='utf-8')
    
    return str(filepath)

Step 6: Generate Crawl Manifest

from datetime import datetime

def create_crawl_manifest(results):
    manifest = {
        "crawl_date": datetime.now().isoformat(),
        "total_crawled": len([r for r in results if r["status"] == "success"]),
        "total_failed": len([r for r in results if r["status"] == "failed"]),
        "documents": []
    }
    
    for result in results:
        manifest["documents"].append({
            "url": result["url"],
            "status": result["status"],
            "raw_content_path": result.get("filepath"),
            "content_size": result.get("size"),
            "crawl_method": "firecrawl",
            "error": result.get("error")
        })
    
    return manifest
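
The manifest can then be written next to the dated raw content for content-repository to pick up (a sketch; the crawl_manifest.json filename is an assumption, not a fixed contract):

import json
from datetime import datetime
from pathlib import Path

def save_crawl_manifest(manifest):
    """Persist the crawl manifest alongside the dated raw content directory."""
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = output_dir / 'crawl_manifest.json'  # assumed filename
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding='utf-8')
    return str(manifest_path)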

Error Handling

| Error | Action |
|---|---|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
| Connection error | Retry with backoff |

import time

def crawl_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = firecrawl_scrape(url)
            return {"status": "success", "content": result}
        except RateLimitError:
            wait = 2 ** attempt * 10  # exponential backoff: 10, 20, 40 seconds
            time.sleep(wait)
        except TimeoutError:
            if attempt == 0:
                # Retry once with doubled timeout; a second failure marks the URL as failed
                try:
                    result = firecrawl_scrape(url, timeout=60000)
                    return {"status": "success", "content": result}
                except Exception as e:
                    return {"status": "failed", "error": f"Timeout after retry: {e}"}
        except NotFoundError:
            return {"status": "failed", "error": "404 Not Found"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"status": "failed", "error": str(e)}
    
    return {"status": "failed", "error": "Max retries exceeded"}

Firecrawl MCP Reference

scrape - Single page:

firecrawl_scrape(url, formats, only_main_content, timeout)

crawl - Multi-page:

firecrawl_crawl(url, max_depth, limit, formats, only_main_content)

map - Discover URLs:

firecrawl_map(url, limit)  # Returns list of URLs on site
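
map is useful for seeding a crawl: discover a site's URLs first, then scrape each one under the rate limiter (a sketch reusing RateLimiter from Step 4; the URL and limit are illustrative):

site_urls = firecrawl_map("https://docs.anthropic.com/en/docs/", limit=100)

limiter = RateLimiter(requests_per_minute=20)
for url in site_urls:
    limiter.wait_if_needed()
    page = firecrawl_scrape(url, formats=["markdown"], only_main_content=True)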

Integration

| From | Input | To |
|---|---|---|
| reference-discovery | URL manifest | web-crawler-orchestrator |
| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |
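
Putting the steps together, a minimal orchestration pass might look like this (a sketch combining the functions from Steps 1 to 6; PDF handling and content-size accounting are simplified):

def run_crawl(manifest_path):
    """Load URLs, crawl with retries under the rate limiter, save raw content, and build a crawl manifest."""
    urls = load_manifest(manifest_path)
    limiter = RateLimiter(requests_per_minute=20)
    results = []

    for url in urls:
        limiter.wait_if_needed()
        strategy = select_strategy(url)  # pdf_extract / raw_content handling omitted in this sketch
        outcome = crawl_with_retry(url)
        if outcome["status"] == "success":
            # Assumes the scrape result is markdown text; adapt if the MCP tool returns a structured object
            content = outcome["content"]
            filepath = save_content(url, content, 'markdown')
            results.append({"url": url, "status": "success", "filepath": filepath, "size": len(content)})
        else:
            results.append({"url": url, "status": "failed", "error": outcome.get("error")})

    return create_crawl_manifest(results)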

Output Structure

~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4e5f60718.md   # Markdown content
    ├── b2c3d4e5f6071829.md
    └── c3d4e5f60718293a.pdf  # PDF documents