feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
custom-skills/90-reference-curator/commands/web-crawler.md (new file, 79 lines)
@@ -0,0 +1,79 @@
---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---
# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- `<url|manifest>`: Single URL, or path to a manifest.json produced by reference-discovery
- `--crawler`: Force a specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---------|----------|--------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |

### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
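The detection flow above can be sketched as a small decision function. This is an illustrative sketch, not the skill's actual implementation: the function name `select_crawler` and its boolean inputs are assumptions; in practice the orchestrator would derive these signals from a probe of the target site.

```python
def select_crawler(is_spa: bool, estimated_pages: int, needs_seo: bool) -> str:
    """Pick a crawler backend following the detection flow.

    Hypothetical sketch: signal names are illustrative assumptions.
    Checks run in priority order, mirroring the flow diagram.
    """
    if is_spa:                  # SPA/React/Vue -> needs JS rendering
        return "firecrawl"
    if estimated_pages > 200:   # large or multi-domain crawl
        return "scrapy"
    if needs_seo:               # SEO metadata extraction wanted
        return "aiohttp"
    return "nodejs"             # default: small static docs site
```

Note the ordering matters: a 500-page SPA still routes to Firecrawl, since JS rendering is a hard requirement while scale is only a preference.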
|

## Crawler Commands

**Node.js:**
```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**
```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**
```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:**

Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`
|

## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:

- One markdown file per page
- Filename: `{url_hash}.md`

Generate a crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
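A minimal sketch of this output convention follows. The hashing scheme (truncated SHA-256) and the manifest filename `crawl-manifest.json` are assumptions for illustration; the spec above fixes only the `{url_hash}.md` pattern and the manifest fields.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_page(url: str, markdown: str, out_dir: Path) -> str:
    """Write one crawled page as {url_hash}.md and return the filename.

    Assumption: url_hash is the first 16 hex chars of SHA-256(url).
    """
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    filename = f"{url_hash}.md"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / filename).write_text(markdown, encoding="utf-8")
    return filename


def write_manifest(out_dir: Path, crawler: str, documents: list) -> None:
    """Emit the crawl manifest with the fields shown above."""
    manifest = {
        "crawl_date": datetime.now(timezone.utc).isoformat(),
        "crawler_used": crawler,
        "total_crawled": len(documents),
        "documents": documents,
    }
    # Manifest filename is an assumption; the spec only defines its fields.
    (out_dir / "crawl-manifest.json").write_text(json.dumps(manifest, indent=2))
```

Hashing the URL keeps filenames filesystem-safe and deterministic, so re-crawling the same page overwrites its previous copy instead of duplicating it.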
|

## Rate Limiting

All crawlers respect:

- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
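This policy can be sketched in asyncio terms (matching the aiohttp backend); the function name, the `fetch` callable, and the retry count are illustrative assumptions, not the crawlers' actual code.

```python
import asyncio
import random

MAX_CONCURRENT = 3
MIN_INTERVAL = 60 / 20  # 20 requests/minute -> at most one request per 3 s


async def fetch_with_backoff(fetch, url, semaphore,
                             max_retries=5, min_interval=MIN_INTERVAL):
    """Fetch a URL under the shared semaphore, backing off on 429/5xx.

    `fetch` is a hypothetical async callable returning (status, body).
    """
    async with semaphore:  # caps concurrency at MAX_CONCURRENT
        for attempt in range(max_retries):
            status, body = await fetch(url)
            if status == 429 or status >= 500:
                # Exponential backoff with jitter: ~1 s, 2 s, 4 s, ...
                await asyncio.sleep(2 ** attempt + random.random())
                continue
            await asyncio.sleep(min_interval)  # stay under 20 req/min
            return body
        raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```

A shared `asyncio.Semaphore(MAX_CONCURRENT)` enforces the concurrency cap across all in-flight requests, while the per-request sleep spaces successful fetches to respect the rate limit.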