our-claude-skills/custom-skills/90-reference-curator/commands/web-crawler.md
Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00


---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: "<url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]"
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---

# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- `<url|manifest>`: Single URL or path to `manifest.json` from reference-discovery
- `--crawler`: Force a specific crawler (`nodejs`, `aiohttp`, `scrapy`, `firecrawl`)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---|---|---|
| Node.js (default) | Small docs sites | ≤50 pages, static content |
| Python aiohttp | Technical docs | ≤200 pages, needs SEO data |
| Scrapy | Enterprise crawls | >200 pages, multi-domain |
| Firecrawl MCP | Dynamic sites | SPAs, JS-rendered content |

### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
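The flow above can be sketched as a small selection function. This is illustrative only — `select_crawler` and its inputs are hypothetical names, and the thresholds simply mirror the table in this document:

```python
def select_crawler(page_estimate: int, is_spa: bool = False,
                   needs_seo: bool = False) -> str:
    """Pick a crawl backend following the detection flow above (sketch)."""
    if is_spa:               # JS-rendered content needs a rendering fetcher
        return "firecrawl"
    if page_estimate > 200:  # large or multi-domain crawls
        return "scrapy"
    if needs_seo:            # SEO metadata extraction path
        return "aiohttp"
    return "nodejs"          # default: small static docs sites
```

Note the order matters: an SPA is routed to Firecrawl even when the page count would otherwise favor Scrapy.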

## Crawler Commands

**Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:** Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`

## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:

- One markdown file per page
- Filename: `{url_hash}.md`

Generate a crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
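Under these conventions, saving a page and emitting the manifest might look like the following sketch. The SHA-256 hash and its 16-character truncation, the `save_page`/`write_manifest` helpers, and the `crawl-manifest.json` filename are assumptions — the document only specifies the `{url_hash}.md` pattern and the manifest fields:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def save_page(url: str, markdown: str, root: Path) -> Path:
    """Write one crawled page under root/YYYY/MM/{url_hash}.md."""
    now = datetime.now(timezone.utc)
    out_dir = root / f"{now:%Y}" / f"{now:%m}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hash choice is an assumption; any stable URL hash works here.
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = out_dir / f"{url_hash}.md"
    path.write_text(markdown, encoding="utf-8")
    return path

def write_manifest(root: Path, crawler: str, documents: list) -> Path:
    """Emit the crawl manifest with the fields shown above."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler,
        "total_crawled": len(documents),
        "documents": documents,
    }
    path = root / "crawl-manifest.json"   # hypothetical filename
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path
```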

## Rate Limiting

All crawlers respect:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
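A minimal sketch of these limits, assuming a sliding-window limiter and jittered exponential backoff (both illustrative — each backend enforces its own limits; the 3-request concurrency cap would typically be a separate semaphore):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Jittered exponential backoff delay for a 429/5xx response."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)

class RateLimiter:
    """Sliding-window limiter: at most `rate` requests per `per` seconds."""

    def __init__(self, rate: int = 20, per: float = 60.0):
        self.rate, self.per = rate, per
        self.stamps: list[float] = []

    def wait(self) -> float:
        """Seconds to sleep before the next request; 0.0 if allowed now."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.per]
        if len(self.stamps) < self.rate:
            self.stamps.append(now)
            return 0.0
        return self.per - (now - self.stamps[0])
```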