feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Web Crawler Orchestrator

Orchestrates web crawling with intelligent backend selection. Automatically chooses the best crawler based on site characteristics.

## Trigger Keywords

"crawl URLs", "fetch documents", "scrape pages", "download references"

## Intelligent Crawler Selection

Claude automatically selects the optimal crawler based on the request:

| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Decision Flow

```
[Crawl Request]
│
├─ Is it SPA/React/Vue/Angular? → Firecrawl MCP
│
├─ >200 pages or multi-domain? → Scrapy
│
├─ Needs SEO extraction? → Python aiohttp
│
└─ Default (small site) → Node.js
```
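
The decision flow can be sketched as a small routing function (`select_backend` is a hypothetical helper; the names and thresholds follow the table above):

```python
def select_backend(page_count, is_spa=False, multi_domain=False, needs_seo=False):
    """Route a crawl request to a backend, mirroring the decision flow above."""
    if is_spa:
        return "firecrawl"  # JS-rendered content needs headless rendering
    if page_count > 200 or multi_domain:
        return "scrapy"  # enterprise-scale crawls
    if needs_seo:
        return "aiohttp"  # SEO extraction lives in the Python crawler
    return "nodejs"  # default for small static sites
```

Note the SPA check comes first, matching the flow: a 500-page SPA still routes to Firecrawl.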

## Crawler Backends

### Node.js (Default)

Fast, lightweight crawler for small documentation sites.

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

### Python aiohttp

Async crawler with full SEO extraction.

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

### Scrapy

Enterprise-grade crawler with pipelines.

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

### Firecrawl MCP

Use MCP tools for JavaScript-heavy sites:

```
firecrawl_scrape(url, formats=["markdown"], only_main_content=true)
firecrawl_crawl(url, max_depth=2, limit=50)
firecrawl_map(url, limit=100)  # Discover URLs first
```

## Workflow

### Step 1: Analyze Target Site

Determine site characteristics:

- Is it a SPA? (React, Vue, Angular, Next.js)
- How many pages expected?
- Does it need JavaScript rendering?
- Is SEO data extraction needed?

### Step 2: Select Crawler

Based on the analysis, select the appropriate backend.

### Step 3: Load URL Manifest

```bash
# From reference-discovery output
jq -r '.urls[].url' manifest.json
```

### Step 4: Execute Crawl

**For Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
while read -r url; do
  node src/crawler.js "$url" --max-pages 50
  sleep 2
done < urls.txt
```

**For Firecrawl MCP (Claude Desktop/Code):**

Use the firecrawl MCP tools directly in conversation.

### Step 5: Save Raw Content

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4.md
    └── b2c3d4e5.md
```

### Step 6: Generate Crawl Manifest

```json
{
  "crawl_date": "2025-01-28T12:00:00",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "total_failed": 5,
  "documents": [...]
}
```

## Rate Limiting

All crawlers respect these limits:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
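
The backoff policy can be expressed as a one-liner (the base delay and cap here are illustrative defaults, not part of the skill's contract):

```python
def backoff_delay(attempt, base=10, cap=300):
    """Exponential backoff for 429/5xx responses: base, 2*base, 4*base, ... capped."""
    return min(base * (2 ** attempt), cap)
```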

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| JS rendering needed | Switch to Firecrawl |

## Site Type Detection

Indicators for automatic routing:

**SPA (→ Firecrawl):**

- URL contains `#/` or uses hash routing
- Page source shows React/Vue/Angular markers
- Content loads dynamically after initial load

**Static docs (→ Node.js/aiohttp):**

- Built with Hugo, Jekyll, MkDocs, Docusaurus, GitBook
- Clean HTML structure
- Server-side rendered
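
As a sketch, the SPA indicators above could be checked against fetched HTML like this (`looks_like_spa` is a hypothetical helper, and the marker list is illustrative, not exhaustive):

```python
import re

# Common framework markers: React/Next.js mount points, React hydration
# attribute, Angular version attribute, Nuxt state object.
SPA_MARKERS = re.compile(
    r'id="(root|app|__next)"|data-reactroot|ng-version|__NUXT__'
)

def looks_like_spa(html, url=""):
    """Heuristic SPA check based on the indicators listed above."""
    if "#/" in url:  # hash routing
        return True
    return bool(SPA_MARKERS.search(html))
```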

## Scripts

- `scripts/select_crawler.py` - Intelligent crawler selection
- `scripts/crawl_with_nodejs.sh` - Node.js wrapper
- `scripts/crawl_with_aiohttp.sh` - Python wrapper
- `scripts/crawl_with_firecrawl.py` - Firecrawl MCP wrapper

## Integration

| From | To |
|------|-----|
| reference-discovery | URL manifest input |
| web-crawler-orchestrator | content-repository (crawl manifest + raw files) |
| quality-reviewer (deep_research) | Additional crawl requests |
---
name: web-crawler-orchestrator
description: Orchestrates web crawling using Firecrawl MCP. Handles rate limiting, selects crawl strategies, manages formats (HTML/PDF/markdown), and produces raw content with manifests. Triggers on "crawl URLs", "fetch documents", "scrape pages", "download references", "Firecrawl crawl".
---

# Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

## Prerequisites

- Firecrawl MCP server connected
- Config file at `~/.config/reference-curator/crawl_config.yaml`
- Storage directory exists: `~/reference-library/raw/`

## Crawl Configuration

```yaml
# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
    include_html: false

processing:
  max_content_size_mb: 50
  raw_content_dir: ~/reference-library/raw/
```
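
A minimal loader for this config might look like the following (`load_crawl_config` is a hypothetical helper; it assumes PyYAML for parsing, and the fallback values mirror the config above):

```python
from pathlib import Path

# Defaults mirror crawl_config.yaml above.
DEFAULT_RATE_LIMIT = {"requests_per_minute": 20, "concurrent_requests": 3}

def load_crawl_config(path="~/.config/reference-curator/crawl_config.yaml"):
    """Load the crawl config, falling back to defaults if the file is missing."""
    config_path = Path(path).expanduser()
    if not config_path.exists():
        return {"firecrawl": {"rate_limit": DEFAULT_RATE_LIMIT}}
    import yaml  # deferred so the defaults path works without PyYAML installed
    with open(config_path) as f:
        return yaml.safe_load(f)
```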

## Crawl Workflow

### Step 1: Load URL Manifest

Receive manifest from `reference-discovery`:

```python
import json

def load_manifest(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest["urls"]
```

### Step 2: Determine Crawl Strategy

```python
def select_strategy(url):
    """Select optimal crawl strategy based on URL characteristics."""
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'  # Get raw file content
    elif 'github.com' in url:
        return 'scrape'  # Repository pages
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'  # Documentation sites
    else:
        return 'scrape'  # Default
```

### Step 3: Execute Firecrawl

Use Firecrawl MCP for crawling:

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],  # markdown | html | screenshot
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl (documentation sites)
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"],
    only_main_content=True
)
```

### Step 4: Rate Limiting

```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Remove requests older than 1 minute
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()

        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)

        self.request_times.append(time.time())
```
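
A sketch of how the limiter could drive a crawl loop; `crawl_all` is a hypothetical driver, and `fetch` stands in for whichever Firecrawl call is selected:

```python
def crawl_all(urls, fetch, limiter):
    """Fetch each URL through the rate limiter, collecting per-URL results."""
    results = []
    for url in urls:
        limiter.wait_if_needed()  # blocks until a request slot is free
        try:
            results.append({"url": url, "status": "success", "content": fetch(url)})
        except Exception as e:
            results.append({"url": url, "status": "failed", "error": str(e)})
    return results
```

Failures are recorded rather than raised, so one bad URL does not abort the batch.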

### Step 5: Save Raw Content

```python
import hashlib
from datetime import datetime
from pathlib import Path

def save_content(url, content, content_type='markdown'):
    """Save crawled content to raw storage."""

    # Generate filename from URL hash
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]

    # Determine extension
    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
    ext = ext_map.get(content_type, '.txt')

    # Create dated subdirectory
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)

    # Save file
    filepath = output_dir / f"{url_hash}{ext}"
    if content_type == 'pdf':
        filepath.write_bytes(content)
    else:
        filepath.write_text(content, encoding='utf-8')

    return str(filepath)
```

### Step 6: Generate Crawl Manifest

```python
from datetime import datetime

def create_crawl_manifest(results):
    manifest = {
        "crawl_date": datetime.now().isoformat(),
        "total_crawled": len([r for r in results if r["status"] == "success"]),
        "total_failed": len([r for r in results if r["status"] == "failed"]),
        "documents": []
    }

    for result in results:
        manifest["documents"].append({
            "url": result["url"],
            "status": result["status"],
            "raw_content_path": result.get("filepath"),
            "content_size": result.get("size"),
            "crawl_method": "firecrawl",
            "error": result.get("error")
        })

    return manifest
```

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| Connection error | Retry with backoff |

```python
import time

def crawl_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = firecrawl_scrape(url)
            return {"status": "success", "content": result}
        except RateLimitError:
            wait = 2 ** attempt * 10  # 10, 20, 40 seconds
            time.sleep(wait)
        except TimeoutError:
            if attempt == 0:
                # Retry once with doubled timeout; fail rather than propagate
                try:
                    result = firecrawl_scrape(url, timeout=60000)
                    return {"status": "success", "content": result}
                except Exception as e:
                    return {"status": "failed", "error": str(e)}
        except NotFoundError:
            return {"status": "failed", "error": "404 Not Found"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"status": "failed", "error": str(e)}

    return {"status": "failed", "error": "Max retries exceeded"}
```

## Firecrawl MCP Reference

**scrape** - Single page:
```
firecrawl_scrape(url, formats, only_main_content, timeout)
```

**crawl** - Multi-page:
```
firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
```

**map** - Discover URLs:
```
firecrawl_map(url, limit)  # Returns list of URLs on site
```

## Integration

| From | Input | To |
|------|-------|-----|
| reference-discovery | URL manifest | web-crawler-orchestrator |
| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |

## Output Structure

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4e5f6g7h8.md   # Markdown content
    ├── b2c3d4e5f6g7h8i9.md
    └── c3d4e5f6g7h8i9j0.pdf  # PDF documents
```