our-claude-skills/custom-skills/90-reference-curator/commands/web-crawler.md
Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00


---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: "<url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]"
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---

# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- `<url|manifest>`: Single URL or path to `manifest.json` from reference-discovery
- `--crawler`: Force a specific crawler (`nodejs`, `aiohttp`, `scrapy`, `firecrawl`)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---|---|---|
| Node.js (default) | Small docs sites | ≤50 pages, static content |
| Python aiohttp | Technical docs | ≤200 pages, needs SEO data |
| Scrapy | Enterprise crawls | >200 pages, multi-domain |
| Firecrawl MCP | Dynamic sites | SPAs, JS-rendered content |

### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
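The flow above can be sketched as a small selection function. This is illustrative only — `select_crawler` and its inputs are hypothetical names, and the thresholds simply mirror the table in this document:

```python
def select_crawler(page_estimate: int, is_spa: bool = False,
                   needs_seo: bool = False) -> str:
    """Pick a crawl backend following the detection flow above (sketch)."""
    if is_spa:               # JS-rendered content needs a rendering fetcher
        return "firecrawl"
    if page_estimate > 200:  # large or multi-domain crawls
        return "scrapy"
    if needs_seo:            # SEO metadata extraction path
        return "aiohttp"
    return "nodejs"          # default: small static docs sites
```

Note the order matters: an SPA is routed to Firecrawl even when the page count would otherwise favor Scrapy.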

## Crawler Commands

**Node.js:**

```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**

```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**

```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:** Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`

## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:

- One markdown file per page
- Filename: `{url_hash}.md`

Generate a crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
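Under these conventions, saving a page and emitting the manifest might look like the following sketch. The SHA-256 hash and its 16-character truncation, the `save_page`/`write_manifest` helpers, and the `crawl-manifest.json` filename are assumptions — the document only specifies the `{url_hash}.md` pattern and the manifest fields:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def save_page(url: str, markdown: str, root: Path) -> Path:
    """Write one crawled page under root/YYYY/MM/{url_hash}.md."""
    now = datetime.now(timezone.utc)
    out_dir = root / f"{now:%Y}" / f"{now:%m}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hash choice is an assumption; any stable URL hash works here.
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = out_dir / f"{url_hash}.md"
    path.write_text(markdown, encoding="utf-8")
    return path

def write_manifest(root: Path, crawler: str, documents: list) -> Path:
    """Emit the crawl manifest with the fields shown above."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler,
        "total_crawled": len(documents),
        "documents": documents,
    }
    path = root / "crawl-manifest.json"   # hypothetical filename
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path
```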

## Rate Limiting

All crawlers respect:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
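A minimal sketch of these limits, assuming a sliding-window limiter and jittered exponential backoff (both illustrative — each backend enforces its own limits; the 3-request concurrency cap would typically be a separate semaphore):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Jittered exponential backoff delay for a 429/5xx response."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)

class RateLimiter:
    """Sliding-window limiter: at most `rate` requests per `per` seconds."""

    def __init__(self, rate: int = 20, per: float = 60.0):
        self.rate, self.per = rate, per
        self.stamps: list[float] = []

    def wait(self) -> float:
        """Seconds to sleep before the next request; 0.0 if allowed now."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.per]
        if len(self.stamps) < self.rate:
            self.stamps.append(now)
            return 0.0
        return self.per - (now - self.stamps[0])
```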