# Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

## Prerequisites

- Firecrawl MCP server connected
- Config file at `~/.config/reference-curator/crawl_config.yaml`
- Storage directory exists: `~/reference-library/raw/`
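These prerequisites can be verified up front. A minimal preflight sketch (the MCP connectivity check is assumed to be handled by the client and is out of scope here; the function name is illustrative):

```python
from pathlib import Path

def preflight(config_path="~/.config/reference-curator/crawl_config.yaml",
              raw_dir="~/reference-library/raw"):
    """Return a list of problems; an empty list means prerequisites are met."""
    problems = []
    if not Path(config_path).expanduser().is_file():
        problems.append(f"missing config: {config_path}")
    raw = Path(raw_dir).expanduser()
    if not raw.is_dir():
        raw.mkdir(parents=True, exist_ok=True)  # storage dir can simply be created
    return problems
```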

## Crawl Configuration

```yaml
# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
    include_html: false

processing:
  max_content_size_mb: 50
  raw_content_dir: ~/reference-library/raw/
```
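If the config file is absent, falling back to built-in defaults keeps the orchestrator usable. A minimal loader sketch (`DEFAULTS` and the deep-merge helper are illustrative; PyYAML is assumed to be installed only when a config file actually exists):

```python
from pathlib import Path

DEFAULTS = {
    "firecrawl": {
        "rate_limit": {"requests_per_minute": 20, "concurrent_requests": 3},
        "default_options": {"timeout": 30000, "only_main_content": True,
                            "include_html": False},
    },
    "processing": {"max_content_size_mb": 50,
                   "raw_content_dir": "~/reference-library/raw/"},
}

def deep_merge(base, override):
    """Recursively overlay override onto base without mutating either dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_config(path="~/.config/reference-curator/crawl_config.yaml"):
    """Load the YAML config, falling back to DEFAULTS when absent."""
    config_path = Path(path).expanduser()
    if not config_path.exists():
        return DEFAULTS
    import yaml  # assumes PyYAML is available
    with open(config_path) as f:
        return deep_merge(DEFAULTS, yaml.safe_load(f) or {})
```

User settings override defaults key by key, so a partial config file (e.g. only `requests_per_minute`) leaves the other defaults intact.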

## Crawl Workflow

### Step 1: Load URL Manifest

Receive manifest from reference-discovery:

```python
import json

def load_manifest(manifest_path):
    """Load the URL manifest produced by reference-discovery."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest["urls"]
```

### Step 2: Determine Crawl Strategy

```python
def select_strategy(url):
    """Select optimal crawl strategy based on URL characteristics."""
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'  # Get raw file content
    elif 'github.com' in url:
        return 'scrape'  # Repository pages
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'  # Documentation sites
    else:
        return 'scrape'  # Default
```

### Step 3: Execute Firecrawl

Use Firecrawl MCP for crawling:

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],  # markdown | html | screenshot
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl (documentation sites)
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"],
    only_main_content=True
)
```

### Step 4: Rate Limiting

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `requests_per_minute` calls per 60 s."""

    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Remove requests older than 1 minute
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()

        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)

        self.request_times.append(time.time())
```
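A variant with injectable clock and sleep functions makes the sliding-window behavior verifiable without real waiting (this refactor is illustrative, not part of the skill):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter; time_fn/sleep_fn are injectable for testing."""

    def __init__(self, requests_per_minute=20, time_fn=time.time, sleep_fn=time.sleep):
        self.rpm = requests_per_minute
        self.request_times = deque()
        self.time_fn = time_fn
        self.sleep_fn = sleep_fn

    def wait_if_needed(self):
        now = self.time_fn()
        # Drop timestamps that have aged out of the 60-second window
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait = 60 - (now - self.request_times[0])
            if wait > 0:
                self.sleep_fn(wait)
        self.request_times.append(self.time_fn())
```

With a fake clock, the third call under a 2-per-minute limit should request a sleep of `60 - elapsed` seconds.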

### Step 5: Save Raw Content

```python
import hashlib
from datetime import datetime
from pathlib import Path

def save_content(url, content, content_type='markdown'):
    """Save crawled content to raw storage."""

    # Generate filename from URL hash
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]

    # Determine extension
    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
    ext = ext_map.get(content_type, '.txt')

    # Create dated subdirectory
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)

    # Save file
    filepath = output_dir / f"{url_hash}{ext}"
    if content_type == 'pdf':
        filepath.write_bytes(content)
    else:
        filepath.write_text(content, encoding='utf-8')

    return str(filepath)
```
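The `max_content_size_mb` limit from the config can be enforced before writing. A small guard, assuming content arrives as `str` or `bytes` (the function name is illustrative):

```python
def within_size_limit(content, max_mb=50):
    """True if the content fits under the configured size cap."""
    data = content if isinstance(content, bytes) else content.encode("utf-8")
    return len(data) <= max_mb * 1024 * 1024
```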

### Step 6: Generate Crawl Manifest

```python
from datetime import datetime

def create_crawl_manifest(results):
    manifest = {
        "crawl_date": datetime.now().isoformat(),
        "total_crawled": len([r for r in results if r["status"] == "success"]),
        "total_failed": len([r for r in results if r["status"] == "failed"]),
        "documents": []
    }

    for result in results:
        manifest["documents"].append({
            "url": result["url"],
            "status": result["status"],
            "raw_content_path": result.get("filepath"),
            "content_size": result.get("size"),
            "crawl_method": "firecrawl",
            "error": result.get("error")
        })

    return manifest
```
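The manifest can then be persisted alongside the raw content so downstream skills can pick it up. A small sketch (the filename convention is an assumption, not specified by the skill):

```python
import json
from pathlib import Path

def write_manifest(manifest, out_dir="~/reference-library/raw"):
    """Persist the crawl manifest next to the raw content as JSON."""
    date = manifest["crawl_date"][:10]  # YYYY-MM-DD prefix of the ISO timestamp
    path = Path(out_dir).expanduser() / f"crawl-manifest-{date}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2))
    return path
```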

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
| Connection error | Retry with backoff |

```python
def crawl_with_retry(url, max_retries=3):
    timeout = 30000
    for attempt in range(max_retries):
        try:
            result = firecrawl_scrape(url, timeout=timeout)
            return {"status": "success", "content": result}
        except RateLimitError:
            time.sleep(2 ** attempt * 10)  # Exponential backoff: 10, 20, 40 seconds
        except TimeoutError:
            timeout *= 2  # Retry with doubled timeout
        except NotFoundError:
            return {"status": "failed", "error": "404 Not Found"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"status": "failed", "error": str(e)}

    return {"status": "failed", "error": "Max retries exceeded"}
```

## Firecrawl MCP Reference

`scrape` - Single page:

```python
firecrawl_scrape(url, formats, only_main_content, timeout)
```

`crawl` - Multi-page:

```python
firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
```

`map` - Discover URLs:

```python
firecrawl_map(url, limit)  # Returns list of URLs on site
```

## Integration

| From | Input | To |
|------|-------|----|
| reference-discovery | URL manifest | web-crawler-orchestrator |
| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |
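The hand-off above can be sketched as a single crawl loop. Here `scrape_fn` and `save_fn` are injected stand-ins for `crawl_with_retry` and `save_content` (the orchestration function itself is illustrative), and the returned list feeds directly into `create_crawl_manifest`:

```python
def run_crawl(urls, scrape_fn, save_fn):
    """Crawl each URL and collect per-URL results for the crawl manifest."""
    results = []
    for url in urls:
        try:
            content = scrape_fn(url)
            filepath = save_fn(url, content)
            results.append({"url": url, "status": "success",
                            "filepath": filepath, "size": len(content)})
        except Exception as exc:
            # A failed URL still gets a manifest entry, with the error recorded
            results.append({"url": url, "status": "failed", "error": str(exc)})
    return results
```

Dependency injection keeps the loop testable: the Firecrawl call and filesystem writes can be replaced with fakes.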

## Output Structure

Filenames are the first 16 hex digits of the URL's SHA-256 hash:

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4e5f60718.md   # Markdown content
    ├── b2c3d4e5f60718a9.md
    └── c3d4e5f60718a9b0.pdf  # PDF documents
```