feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
custom-skills/90-reference-curator/commands/web-crawler.md (new file, 79 lines)
@@ -0,0 +1,79 @@
---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---
# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- `<url|manifest>`: Single URL, or path to a manifest.json produced by reference-discovery
- `--crawler`: Force a specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---------|----------|--------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
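The detection flow above can be sketched as a small decision function. This is an illustrative sketch, not the skill's actual implementation: the function name `select_crawler` and its boolean inputs are assumptions; in practice the orchestrator would derive these signals from a probe of the target site.

```python
def select_crawler(is_spa: bool, estimated_pages: int, needs_seo: bool) -> str:
    """Pick a crawler backend following the detection flow.

    Hypothetical sketch: signal names are illustrative assumptions.
    Checks run in priority order, mirroring the flow diagram.
    """
    if is_spa:                  # SPA/React/Vue -> needs JS rendering
        return "firecrawl"
    if estimated_pages > 200:   # large or multi-domain crawl
        return "scrapy"
    if needs_seo:               # SEO metadata extraction wanted
        return "aiohttp"
    return "nodejs"             # default: small static docs site
```

Note the ordering matters: a 500-page SPA still routes to Firecrawl, since JS rendering is a hard requirement while scale is only a preference.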
|

## Crawler Commands

**Node.js:**
```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**
```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**
```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:**

Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`
|

## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:

- One markdown file per page
- Filename: `{url_hash}.md`

Generate a crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
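A minimal sketch of this output convention follows. The hashing scheme (truncated SHA-256) and the manifest filename `crawl-manifest.json` are assumptions for illustration; the spec above fixes only the `{url_hash}.md` pattern and the manifest fields.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_page(url: str, markdown: str, out_dir: Path) -> str:
    """Write one crawled page as {url_hash}.md and return the filename.

    Assumption: url_hash is the first 16 hex chars of SHA-256(url).
    """
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    filename = f"{url_hash}.md"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / filename).write_text(markdown, encoding="utf-8")
    return filename


def write_manifest(out_dir: Path, crawler: str, documents: list) -> None:
    """Emit the crawl manifest with the fields shown above."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler,
        "total_crawled": len(documents),
        "documents": documents,
    }
    # Manifest filename is an assumption; the spec only defines its fields.
    (out_dir / "crawl-manifest.json").write_text(json.dumps(manifest, indent=2))
```

Hashing the URL keeps filenames filesystem-safe and deterministic, so re-crawling the same page overwrites its previous copy instead of duplicating it.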
|

## Rate Limiting

All crawlers respect:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
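This policy can be sketched in asyncio terms (matching the aiohttp backend); the function name, the `fetch` callable, and the retry count are illustrative assumptions, not the crawlers' actual code.

```python
import asyncio
import random

MAX_CONCURRENT = 3
MIN_INTERVAL = 60 / 20  # 20 requests/minute -> at most one request per 3 s


async def fetch_with_backoff(fetch, url, semaphore,
                             max_retries=5, min_interval=MIN_INTERVAL):
    """Fetch a URL under the shared semaphore, backing off on 429/5xx.

    `fetch` is a hypothetical async callable returning (status, body).
    """
    async with semaphore:  # caps concurrency at MAX_CONCURRENT
        for attempt in range(max_retries):
            status, body = await fetch(url)
            if status == 429 or status >= 500:
                # Exponential backoff with jitter: ~1 s, 2 s, 4 s, ...
                await asyncio.sleep(2 ** attempt + random.random())
                continue
            await asyncio.sleep(min_interval)  # stay under 20 req/min
            return body
        raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```

A shared `asyncio.Semaphore(MAX_CONCURRENT)` enforces the concurrency cap across all in-flight requests, while the per-request sleep spaces successful fetches to respect the rate limit.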