feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Web Crawler Orchestrator

Orchestrates web crawling with intelligent backend selection. Automatically chooses the best crawler based on site characteristics.

## Trigger Keywords

"crawl URLs", "fetch documents", "scrape pages", "download references"

## Intelligent Crawler Selection

Claude automatically selects the optimal crawler based on the request:

| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Decision Flow

```
[Crawl Request]
│
├─ Is it SPA/React/Vue/Angular? → Firecrawl MCP
│
├─ >200 pages or multi-domain? → Scrapy
│
├─ Needs SEO extraction? → Python aiohttp
│
└─ Default (small site) → Node.js
```
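
The decision flow can be sketched as a small routing function (`select_backend` is a hypothetical helper; the names and thresholds follow the table above):

```python
def select_backend(page_count, is_spa=False, multi_domain=False, needs_seo=False):
    """Route a crawl request to a backend, mirroring the decision flow above."""
    if is_spa:
        return "firecrawl"  # JS-rendered content needs headless rendering
    if page_count > 200 or multi_domain:
        return "scrapy"  # enterprise-scale crawls
    if needs_seo:
        return "aiohttp"  # SEO extraction lives in the Python crawler
    return "nodejs"  # default for small static sites
```

Note the SPA check comes first, matching the flow: a 500-page SPA still routes to Firecrawl.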

## Crawler Backends

### Node.js (Default)

Fast, lightweight crawler for small documentation sites.

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

### Python aiohttp

Async crawler with full SEO extraction.

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

### Scrapy

Enterprise-grade crawler with pipelines.

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

### Firecrawl MCP

Use MCP tools for JavaScript-heavy sites:

```
firecrawl_scrape(url, formats=["markdown"], only_main_content=true)
firecrawl_crawl(url, max_depth=2, limit=50)
firecrawl_map(url, limit=100)  # Discover URLs first
```

## Workflow

### Step 1: Analyze Target Site

Determine site characteristics:

- Is it a SPA? (React, Vue, Angular, Next.js)
- How many pages expected?
- Does it need JavaScript rendering?
- Is SEO data extraction needed?

### Step 2: Select Crawler

Based on the analysis, select the appropriate backend.

### Step 3: Load URL Manifest

```bash
# From reference-discovery output
jq -r '.urls[].url' manifest.json
```

### Step 4: Execute Crawl

**For Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
while read -r url; do
  node src/crawler.js "$url" --max-pages 50
  sleep 2
done < urls.txt
```

**For Firecrawl MCP (Claude Desktop/Code):**

Use the firecrawl MCP tools directly in conversation.

### Step 5: Save Raw Content

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4.md
    └── b2c3d4e5.md
```

### Step 6: Generate Crawl Manifest

```json
{
  "crawl_date": "2025-01-28T12:00:00",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "total_failed": 5,
  "documents": [...]
}
```

## Rate Limiting

All crawlers respect these limits:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
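
The backoff policy can be expressed as a one-liner (the base delay and cap here are illustrative defaults, not part of the skill's contract):

```python
def backoff_delay(attempt, base=10, cap=300):
    """Exponential backoff for 429/5xx responses: base, 2*base, 4*base, ... capped."""
    return min(base * (2 ** attempt), cap)
```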

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| JS rendering needed | Switch to Firecrawl |

## Site Type Detection

Indicators for automatic routing:

**SPA (→ Firecrawl):**

- URL contains `#/` or uses hash routing
- Page source shows React/Vue/Angular markers
- Content loads dynamically after initial load

**Static docs (→ Node.js/aiohttp):**

- Built with Hugo, Jekyll, MkDocs, Docusaurus, GitBook
- Clean HTML structure
- Server-side rendered
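
As a sketch, the SPA indicators above could be checked against fetched HTML like this (`looks_like_spa` is a hypothetical helper, and the marker list is illustrative, not exhaustive):

```python
import re

# Common framework markers: React/Next.js mount points, React hydration
# attribute, Angular version attribute, Nuxt state object.
SPA_MARKERS = re.compile(
    r'id="(root|app|__next)"|data-reactroot|ng-version|__NUXT__'
)

def looks_like_spa(html, url=""):
    """Heuristic SPA check based on the indicators listed above."""
    if "#/" in url:  # hash routing
        return True
    return bool(SPA_MARKERS.search(html))
```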

## Scripts

- `scripts/select_crawler.py` - Intelligent crawler selection
- `scripts/crawl_with_nodejs.sh` - Node.js wrapper
- `scripts/crawl_with_aiohttp.sh` - Python wrapper
- `scripts/crawl_with_firecrawl.py` - Firecrawl MCP wrapper

## Integration

| From | To |
|------|-----|
| reference-discovery | URL manifest input |
| web-crawler-orchestrator | content-repository (crawl manifest + raw files) |
| quality-reviewer (deep_research) | Additional crawl requests |
---
name: web-crawler-orchestrator
description: Orchestrates web crawling using Firecrawl MCP. Handles rate limiting, selects crawl strategies, manages formats (HTML/PDF/markdown), and produces raw content with manifests. Triggers on "crawl URLs", "fetch documents", "scrape pages", "download references", "Firecrawl crawl".
---

# Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

## Prerequisites

- Firecrawl MCP server connected
- Config file at `~/.config/reference-curator/crawl_config.yaml`
- Storage directory exists: `~/reference-library/raw/`

## Crawl Configuration

```yaml
# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
    include_html: false

processing:
  max_content_size_mb: 50
  raw_content_dir: ~/reference-library/raw/
```
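
A minimal loader for this config might look like the following (`load_crawl_config` is a hypothetical helper; it assumes PyYAML for parsing, and the fallback values mirror the config above):

```python
from pathlib import Path

# Defaults mirror crawl_config.yaml above.
DEFAULT_RATE_LIMIT = {"requests_per_minute": 20, "concurrent_requests": 3}

def load_crawl_config(path="~/.config/reference-curator/crawl_config.yaml"):
    """Load the crawl config, falling back to defaults if the file is missing."""
    config_path = Path(path).expanduser()
    if not config_path.exists():
        return {"firecrawl": {"rate_limit": DEFAULT_RATE_LIMIT}}
    import yaml  # deferred so the defaults path works without PyYAML installed
    with open(config_path) as f:
        return yaml.safe_load(f)
```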

## Crawl Workflow

### Step 1: Load URL Manifest

Receive manifest from `reference-discovery`:

```python
import json

def load_manifest(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest["urls"]
```

### Step 2: Determine Crawl Strategy

```python
def select_strategy(url):
    """Select optimal crawl strategy based on URL characteristics."""
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'  # Get raw file content
    elif 'github.com' in url:
        return 'scrape'  # Repository pages
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'  # Documentation sites
    else:
        return 'scrape'  # Default
```

### Step 3: Execute Firecrawl

Use Firecrawl MCP for crawling:

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],  # markdown | html | screenshot
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl (documentation sites)
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"],
    only_main_content=True
)
```

### Step 4: Rate Limiting

```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Remove requests older than 1 minute
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()

        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)

        self.request_times.append(time.time())
```
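
A sketch of how the limiter could drive a crawl loop; `crawl_all` is a hypothetical driver, and `fetch` stands in for whichever Firecrawl call is selected:

```python
def crawl_all(urls, fetch, limiter):
    """Fetch each URL through the rate limiter, collecting per-URL results."""
    results = []
    for url in urls:
        limiter.wait_if_needed()  # blocks until a request slot is free
        try:
            results.append({"url": url, "status": "success", "content": fetch(url)})
        except Exception as e:
            results.append({"url": url, "status": "failed", "error": str(e)})
    return results
```

Failures are recorded rather than raised, so one bad URL does not abort the batch.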

### Step 5: Save Raw Content

```python
import hashlib
from datetime import datetime
from pathlib import Path

def save_content(url, content, content_type='markdown'):
    """Save crawled content to raw storage."""

    # Generate filename from URL hash
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]

    # Determine extension
    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
    ext = ext_map.get(content_type, '.txt')

    # Create dated subdirectory
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)

    # Save file
    filepath = output_dir / f"{url_hash}{ext}"
    if content_type == 'pdf':
        filepath.write_bytes(content)
    else:
        filepath.write_text(content, encoding='utf-8')

    return str(filepath)
```

### Step 6: Generate Crawl Manifest

```python
from datetime import datetime

def create_crawl_manifest(results):
    manifest = {
        "crawl_date": datetime.now().isoformat(),
        "total_crawled": len([r for r in results if r["status"] == "success"]),
        "total_failed": len([r for r in results if r["status"] == "failed"]),
        "documents": []
    }

    for result in results:
        manifest["documents"].append({
            "url": result["url"],
            "status": result["status"],
            "raw_content_path": result.get("filepath"),
            "content_size": result.get("size"),
            "crawl_method": "firecrawl",
            "error": result.get("error")
        })

    return manifest
```

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| Connection error | Retry with backoff |

```python
import time

def crawl_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = firecrawl_scrape(url)
            return {"status": "success", "content": result}
        except RateLimitError:
            wait = 2 ** attempt * 10  # 10, 20, 40 seconds
            time.sleep(wait)
        except TimeoutError:
            if attempt == 0:
                # Retry once with doubled timeout; fail rather than propagate
                try:
                    result = firecrawl_scrape(url, timeout=60000)
                    return {"status": "success", "content": result}
                except Exception as e:
                    return {"status": "failed", "error": str(e)}
        except NotFoundError:
            return {"status": "failed", "error": "404 Not Found"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"status": "failed", "error": str(e)}

    return {"status": "failed", "error": "Max retries exceeded"}
```

## Firecrawl MCP Reference

**scrape** - Single page:
```
firecrawl_scrape(url, formats, only_main_content, timeout)
```

**crawl** - Multi-page:
```
firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
```

**map** - Discover URLs:
```
firecrawl_map(url, limit)  # Returns list of URLs on site
```

## Integration

| From | Input | To |
|------|-------|-----|
| reference-discovery | URL manifest | web-crawler-orchestrator |
| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |

## Output Structure

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4e5f6g7h8.md   # Markdown content
    ├── b2c3d4e5f6g7h8i9.md
    └── c3d4e5f6g7h8i9j0.pdf  # PDF documents
```