# Web Crawler Orchestrator
Manages crawling operations using Firecrawl MCP with rate limiting and format handling.
## Prerequisites

- Firecrawl MCP server connected
- Config file at `~/.config/reference-curator/crawl_config.yaml`
- Storage directory exists: `~/reference-library/raw/`
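A quick preflight check for the paths above can be sketched as follows; `check_prerequisites` is a hypothetical helper (the MCP connection itself can only be verified from the client session):

```python
# Sketch: verify the file-system prerequisites listed above.
# check_prerequisites is a hypothetical helper, not part of this skill.
from pathlib import Path

def check_prerequisites():
    """Return a list of missing prerequisites (empty list = ready)."""
    missing = []
    config = Path.home() / ".config/reference-curator/crawl_config.yaml"
    if not config.exists():
        missing.append(f"config file: {config}")
    raw_dir = Path.home() / "reference-library/raw"
    if not raw_dir.is_dir():
        missing.append(f"storage directory: {raw_dir}")
    return missing
```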
## Crawl Configuration

```yaml
# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
    include_html: false

processing:
  max_content_size_mb: 50
  raw_content_dir: ~/reference-library/raw/
```
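Loading this config can be sketched as below, assuming PyYAML (any YAML parser works); note that `~` in YAML string values must be expanded explicitly:

```python
# Sketch: load crawl_config.yaml and expand ~ in paths before use.
# Assumes PyYAML is installed; the path matches the template above.
import os
import yaml

def load_crawl_config(path="~/.config/reference-curator/crawl_config.yaml"):
    with open(os.path.expanduser(path)) as f:
        config = yaml.safe_load(f)
    # YAML stores paths as plain strings; expand ~ explicitly
    config["processing"]["raw_content_dir"] = os.path.expanduser(
        config["processing"]["raw_content_dir"]
    )
    return config
```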
## Crawl Workflow

### Step 1: Load URL Manifest

Receive the URL manifest from reference-discovery:

```python
import json

def load_manifest(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest["urls"]
```
### Step 2: Determine Crawl Strategy

```python
def select_strategy(url):
    """Select optimal crawl strategy based on URL characteristics."""
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'  # Get raw file content
    elif 'github.com' in url:
        return 'scrape'  # Repository pages
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'  # Documentation sites
    else:
        return 'scrape'  # Default
```
### Step 3: Execute Firecrawl

Use the Firecrawl MCP tools for crawling:

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],  # markdown | html | screenshot
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl (documentation sites)
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"],
    only_main_content=True
)
```
### Step 4: Rate Limiting

```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Drop requests older than one minute from the sliding window
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())
```
### Step 5: Save Raw Content

```python
import hashlib
from datetime import datetime
from pathlib import Path

def save_content(url, content, content_type='markdown'):
    """Save crawled content to raw storage."""
    # Generate filename from URL hash
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    # Determine extension
    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
    ext = ext_map.get(content_type, '.txt')
    # Create dated subdirectory (e.g. 2025/01)
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)
    # Save file (PDFs are bytes, everything else text)
    filepath = output_dir / f"{url_hash}{ext}"
    if content_type == 'pdf':
        filepath.write_bytes(content)
    else:
        filepath.write_text(content, encoding='utf-8')
    return str(filepath)
```
### Step 6: Generate Crawl Manifest

```python
from datetime import datetime

def create_crawl_manifest(results):
    manifest = {
        "crawl_date": datetime.now().isoformat(),
        "total_crawled": len([r for r in results if r["status"] == "success"]),
        "total_failed": len([r for r in results if r["status"] == "failed"]),
        "documents": []
    }
    for result in results:
        manifest["documents"].append({
            "url": result["url"],
            "status": result["status"],
            "raw_content_path": result.get("filepath"),
            "content_size": result.get("size"),
            "crawl_method": "firecrawl",
            "error": result.get("error")
        })
    return manifest
```
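Persisting the manifest can be sketched as below; the JSON file name and location are assumptions, since this skill only specifies the manifest contents:

```python
# Sketch: write the crawl manifest to disk as JSON.
# write_manifest and its naming scheme are assumptions, not part of this skill.
import json
from datetime import datetime
from pathlib import Path

def write_manifest(manifest, base_dir="~/reference-library/raw"):
    out_dir = Path(base_dir).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = out_dir / f"crawl-manifest-{stamp}.json"
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path
```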
## Error Handling
| Error | Action |
|---|---|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
| Connection error | Retry with backoff |
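The "exponential backoff, max 3 retries" row translates to waits of 10, 20, and 40 seconds; `backoff_schedule` below is a hypothetical helper illustrating that schedule:

```python
# Sketch: wait times for the rate-limit (429) row of the table above.
# backoff_schedule is a hypothetical helper, not a Firecrawl API.
def backoff_schedule(max_retries=3, base_seconds=10):
    """Seconds to wait before each retry attempt: 10, 20, 40 by default."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]
```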
```python
import time

# RateLimitError, TimeoutError, and NotFoundError are assumed to be the
# exceptions raised by the Firecrawl client for 429/timeout/404 responses.
def crawl_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = firecrawl_scrape(url)
            return {"status": "success", "content": result}
        except RateLimitError:
            time.sleep(2 ** attempt * 10)  # 10, 20, 40 seconds
        except TimeoutError:
            if attempt == 0:
                # Retry once with doubled timeout
                result = firecrawl_scrape(url, timeout=60000)
                return {"status": "success", "content": result}
        except NotFoundError:
            return {"status": "failed", "error": "404 Not Found"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"status": "failed", "error": str(e)}
    return {"status": "failed", "error": "Max retries exceeded"}
```
## Firecrawl MCP Reference

`scrape` - single page:

```python
firecrawl_scrape(url, formats, only_main_content, timeout)
```

`crawl` - multi-page:

```python
firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
```

`map` - discover URLs:

```python
firecrawl_map(url, limit)  # Returns list of URLs on site
```
## Integration
| From | Input | To |
|---|---|---|
| reference-discovery | URL manifest | web-crawler-orchestrator |
| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |
## Output Structure

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4e5f60718.md   # Markdown content
    ├── b2c3d4e5f6071829.md
    └── c3d4e5f607182930.pdf  # PDF documents
```