6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (`~/.reference-curator.env`)
- Commands tracked in repo, symlinked during install
- `install.sh` with `--minimal`, `--check`, `--uninstall` modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Reference Discovery
Search and identify authoritative sources for reference materials. Validates source credibility, prioritizes by relevance, and outputs curated URL lists with metadata.
## Trigger Keywords
"find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources"
## Source Priority Hierarchy
| Tier | Source Type | Examples |
|---|---|---|
| Tier 1 | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| Tier 1 | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| Tier 1 | Official GitHub repos | github.com/anthropics/, github.com/openai/ |
| Tier 2 | Research papers | arxiv.org, papers with citations |
| Tier 2 | Verified community guides | Cookbook examples, official tutorials |
| Tier 3 | Community content | Blog posts, tutorials, Stack Overflow |
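The tier hierarchy above can be sketched as a simple domain lookup. This is a minimal illustration: the `tier1_official` label matches the example manifest below, but the other tier names and the `DOMAIN_TIERS` mapping itself are assumptions, not the skill's actual implementation.

```python
from urllib.parse import urlparse

# Hypothetical domain-to-tier mapping derived from the table above.
DOMAIN_TIERS = {
    "docs.anthropic.com": "tier1_official",
    "docs.claude.com": "tier1_official",
    "platform.openai.com": "tier1_official",
    "anthropic.com": "tier1_official",
    "openai.com": "tier1_official",
    "github.com": "tier1_official",  # only official orgs qualify; org check not shown here
    "arxiv.org": "tier2_research",
}

def classify_tier(url: str) -> str:
    """Return the credibility tier for a URL, defaulting to tier 3 (community)."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    return DOMAIN_TIERS.get(host, "tier3_community")
```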
## Workflow

### Step 1: Define Search Scope
Gather topic, target vendors, and freshness requirements from user input.
### Step 2: Execute Web Search

Use the WebSearch tool with targeted, site-scoped queries:

```
site:docs.anthropic.com {topic}
site:github.com/anthropics {topic}
site:arxiv.org {topic}
```
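Expanding a topic into the site-scoped queries above can be sketched as a one-liner (`build_queries` and its default site list are illustrative names, not part of the skill):

```python
def build_queries(topic: str,
                  sites=("docs.anthropic.com", "github.com/anthropics", "arxiv.org")):
    """Expand a topic into one site-scoped search query per source."""
    return [f"site:{site} {topic}" for site in sites]
```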
### Step 3: Score and Validate Sources
Apply credibility scoring:
- Domain credibility (0.10 - 0.40)
- Freshness signals (0.10 - 0.20)
- Relevance signals (0.15)
### Step 4: Output URL Manifest

Generate a JSON manifest for the crawler skill:
```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
## Scripts

### discover_sources.py

The main discovery script. Usage:

```bash
python scripts/discover_sources.py --topic "prompt engineering" --vendors anthropic,openai --output manifest.json
```
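The CLI surface shown above can be sketched with `argparse`. The flag names come from the usage example; the help text and vendor-splitting behavior are assumptions about the script, not its confirmed implementation.

```python
import argparse

def parse_args(argv=None):
    """Parse the discover_sources.py command line shown in the usage example."""
    p = argparse.ArgumentParser(description="Discover reference sources for a topic")
    p.add_argument("--topic", required=True, help="topic to research")
    p.add_argument("--vendors", default="", help="comma-separated vendor list")
    p.add_argument("--output", default="manifest.json", help="manifest output path")
    args = p.parse_args(argv)
    # Split the comma-separated vendor list, dropping empty entries.
    args.vendors = [v for v in args.vendors.split(",") if v]
    return args
```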
## Output

- `manifest.json` → handoff to `02-web-crawler-orchestrator`
- Register new sources in the `sources` table via `03-content-repository`
## Deduplication

Before outputting the manifest:

- Normalize URLs (remove trailing slashes and query params)
- Check against the existing `documents` table
- Merge duplicates, keeping the highest credibility score
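The normalize-and-merge steps above can be sketched as follows (the `documents`-table check is omitted; function names are illustrative, and fragment stripping is an assumption alongside the documented query-param and trailing-slash rules):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Normalize a URL: lowercase host, strip trailing slash, drop query and fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

def dedupe(entries):
    """Keep one entry per normalized URL, preferring the highest credibility score."""
    best = {}
    for entry in entries:
        key = normalize_url(entry["url"])
        if key not in best or entry["credibility_score"] > best[key]["credibility_score"]:
            best[key] = entry
    return list(best.values())
```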