6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
| name | description |
|---|---|
| reference-discovery | Search and identify authoritative sources for reference materials. Validates source credibility, prioritizes by relevance, and outputs curated URL lists with metadata. Triggers on "find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources". |
Reference Discovery
Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.
Source Priority Hierarchy
| Tier | Source Type | Examples |
|---|---|---|
| Tier 1 | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| Tier 1 | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| Tier 1 | Official GitHub repos | github.com/anthropics/, github.com/openai/ |
| Tier 2 | Research papers | arxiv.org, papers with citations |
| Tier 2 | Verified community guides | Cookbook examples, official tutorials |
| Tier 3 | Community content | Blog posts, tutorials, Stack Overflow |
Discovery Workflow
Step 1: Define Search Scope
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",  # past_week, past_month, past_year, any
    "max_results_per_query": 20
}
Step 2: Generate Search Queries
For a given topic, generate targeted queries:
def generate_queries(topic, vendors):
    queries = []
    # Official documentation queries
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
    # Engineering blog queries
    for vendor in vendors:
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")
    # GitHub queries
    for vendor in vendors:
        queries.append(f"site:github.com/{vendor} {topic}")
    # Research queries
    queries.append(f"site:arxiv.org {topic}")
    return queries
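To make the query pattern concrete, the sketch below repeats generate_queries and prints what it produces for a single vendor. Each vendor contributes five queries and the topic adds one arXiv query. The patterns assume every vendor follows the docs.{vendor}.com and {vendor}.com/docs layout, which does not hold for every host in the tables here (platform.openai.com/docs, for example), so extra hand-written queries may be needed.

```python
def generate_queries(topic, vendors):
    """Build site-restricted search queries, in the same order as the step above."""
    queries = []
    for vendor in vendors:  # official documentation
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
    for vendor in vendors:  # engineering blogs and news
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")
    for vendor in vendors:  # GitHub organizations
        queries.append(f"site:github.com/{vendor} {topic}")
    queries.append(f"site:arxiv.org {topic}")  # research
    return queries


queries = generate_queries("prompt engineering", ["anthropic"])
for q in queries:
    print(q)
```

With the three vendors from search_config the same call yields 16 queries (3 × 5 + 1).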
Step 3: Execute Search
Use the web search tool for each query:
def execute_discovery(queries):
    results = []
    for query in queries:
        search_results = web_search(query)
        for result in search_results:
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query
            })
    return deduplicate_by_url(results)
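execute_discovery relies on two names that are not defined in this document, web_search and deduplicate_by_url. The sketch below fills them in so the step can be run end to end. Both are assumptions: web_search here is a canned stand-in for whatever search tool is actually available, SearchResult is a hypothetical result type, and deduplicate_by_url keeps the first result seen for each exact URL.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical shape of one search hit (url, title, snippet)."""
    url: str
    title: str
    snippet: str


def web_search(query):
    """Stand-in for the real search tool: returns the same canned hits for any query."""
    return [
        SearchResult("https://example.com/a", "Guide A", "first snippet"),
        SearchResult("https://example.com/b", "Guide B", "second snippet"),
    ]


def deduplicate_by_url(results):
    """Keep the first result for each exact URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        if result["url"] not in seen:
            seen.add(result["url"])
            unique.append(result)
    return unique


def execute_discovery(queries):
    results = []
    for query in queries:
        for result in web_search(query):
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query,
            })
    return deduplicate_by_url(results)


found = execute_discovery(["query one", "query two"])
print(len(found), [r["url"] for r in found])
```

Because the first hit wins, query_used records only the first query that surfaced a URL.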
Step 4: Validate and Score Sources
def score_source(url, title):
    score = 0.0
    # Domain credibility
    # Documentation hosts match the Known Authoritative Sources table
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com',
                              'platform.openai.com/docs', 'ai.google.dev/docs']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    # Freshness signals (from title/snippet)
    if any(year in title for year in ['2025', '2024']):
        score += 0.20
    elif any(year in title for year in ['2023']):
        score += 0.10
    # Relevance signals
    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
        score += 0.15
    return min(score, 1.0)
def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
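It can help to tabulate which scores the weights above can actually produce. The sketch below enumerates every combination of domain, freshness, and relevance weight and runs the result through assign_credibility_tier. The dictionary labels are illustrative names, not identifiers from the skill, and the sum is rounded to two decimals (an assumption of this sketch; score_source does not round) so float noise cannot move a value across a threshold.

```python
from itertools import product

# Weights copied from score_source above; the labels are illustrative only.
DOMAIN = {"official docs": 0.40, "official blog/repo": 0.30,
          "research": 0.20, "community": 0.10}
FRESHNESS = {"2024/2025 in title": 0.20, "2023 in title": 0.10, "no year": 0.00}
RELEVANCE = {"keyword in title": 0.15, "no keyword": 0.00}


def assign_credibility_tier(score):
    if score >= 0.60:
        return "tier1_official"
    elif score >= 0.40:
        return "tier2_verified"
    return "tier3_community"


def combined(domain, freshness, relevance):
    """Sum the three weights, rounded to two decimals to avoid float noise."""
    return round(DOMAIN[domain] + FRESHNESS[freshness] + RELEVANCE[relevance], 2)


for d, f, r in product(DOMAIN, FRESHNESS, RELEVANCE):
    s = combined(d, f, r)
    print(f"{d:20} {f:20} {r:18} {s:.2f} {assign_credibility_tier(s)}")

best = max(combined(d, f, r) for d, f, r in product(DOMAIN, FRESHNESS, RELEVANCE))
print("highest reachable score:", best)
```

Two consequences follow from the weights as written. The highest reachable score is 0.75, so the min(score, 1.0) cap never takes effect. An official documentation page whose title has a keyword but no year scores 0.55 and lands in tier2_verified, while a community page with a recent year and a keyword reaches 0.45 and lands in the same tier. If tiers are meant to track the source hierarchy strictly, the thresholds or weights may need adjusting.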
Step 5: Output URL Manifest
from datetime import datetime

def create_manifest(scored_results, topic):
    # Each result is expected to carry "url", "title", "score", and "tier" keys
    manifest = {
        "discovery_date": datetime.now().isoformat(),
        "topic": topic,
        "total_urls": len(scored_results),
        "urls": []
    }
    # Highest-scoring sources first
    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
        manifest["urls"].append({
            "url": result["url"],
            "title": result["title"],
            "credibility_tier": result["tier"],
            "credibility_score": result["score"],
            "source_type": infer_source_type(result["url"]),
            "vendor": infer_vendor(result["url"])
        })
    return manifest
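create_manifest calls infer_source_type and infer_vendor, which this document does not define. One possible shape for them is sketched below. Everything in it is an assumption: the host lists are taken from the tables in this document, the "research_paper", "community", and "unknown" labels are placeholders (only "official_docs", "engineering_blog", and "github_repo" appear in search_config), and the path checks are simple heuristics.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables built from the hosts named in this document.
VENDOR_HOSTS = {
    "anthropic": ("anthropic.com", "claude.com"),
    "openai": ("openai.com",),
    "google": ("google.dev", "blog.google"),
}
GITHUB_ORGS = {"anthropics": "anthropic", "openai": "openai", "google": "google"}


def _host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def infer_vendor(url):
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "github.com":
        org = parsed.path.strip("/").split("/")[0].lower()
        return GITHUB_ORGS.get(org, "unknown")
    for vendor, domains in VENDOR_HOSTS.items():
        if any(_host_matches(host, d) for d in domains):
            return vendor
    return "unknown"


def infer_source_type(url):
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    if host == "github.com":
        return "github_repo"
    if _host_matches(host, "arxiv.org"):
        return "research_paper"
    if host.startswith("docs.") or path.startswith("/docs"):
        return "official_docs"
    if host == "blog.google" or path.startswith(("/blog", "/news")):
        return "engineering_blog"
    return "community"


for u in ("https://docs.anthropic.com/en/docs/prompt-engineering",
          "https://github.com/openai/example-repo",
          "https://example.com/some-post"):
    print(u, infer_vendor(u), infer_source_type(u))
```

Matching on the parsed hostname rather than on a substring of the whole URL avoids false positives such as a vendor name appearing in a path or query string. URLs without a scheme have no parsed hostname, so they fall through to "unknown" and "community".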
Output Format
Discovery produces a JSON manifest for the crawler:
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
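Before a manifest is handed to the crawler, a quick structural check can catch missing fields. The sketch below is one way to do that. validate_manifest is a hypothetical helper, not part of the skill, and it assumes a complete manifest in which total_urls equals the number of entries, as create_manifest produces (the abbreviated example above shows only one of its 15 entries).

```python
import json

REQUIRED_TOP = {"discovery_date", "topic", "total_urls", "urls"}
REQUIRED_ENTRY = {"url", "title", "credibility_tier", "credibility_score",
                  "source_type", "vendor"}


def validate_manifest(manifest):
    """Return a list of problems; an empty list means the structure looks usable."""
    problems = []
    missing = REQUIRED_TOP - set(manifest)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
        return problems
    if manifest["total_urls"] != len(manifest["urls"]):
        problems.append("total_urls does not match the number of entries")
    for i, entry in enumerate(manifest["urls"]):
        missing = REQUIRED_ENTRY - set(entry)
        if missing:
            problems.append(f"entry {i} missing keys: {sorted(missing)}")
    return problems


raw = """
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 1,
  "urls": [{"url": "https://example.com/guide", "title": "Guide"}]
}
"""
problems = validate_manifest(json.loads(raw))
print(problems)
```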
Known Authoritative Sources
Pre-validated sources for common topics:
| Vendor | Documentation | Blog/News | GitHub |
|---|---|---|---|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |
Integration
Output: URL manifest JSON → web-crawler-orchestrator
Database: Register new sources in sources table via content-repository
Deduplication
Before outputting, deduplicate URLs:
- Normalize URLs (remove trailing slashes, query params)
- Check against the existing documents table via content-repository
- Merge duplicate entries, keeping the highest credibility score
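The three deduplication steps above can be sketched as follows. This is a minimal illustration under stated assumptions: normalize_url and merge_duplicates are hypothetical names, known_urls stands in for a lookup against the documents table (whose schema is not shown here), and dropping the fragment and lowercasing the host go slightly beyond the two normalizations listed.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Strip query string, fragment, and trailing slashes; lowercase scheme and host."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def merge_duplicates(results, known_urls=()):
    """Keep one entry per normalized URL, preferring the highest score.

    known_urls is a placeholder for URLs already stored in the repository.
    """
    known = {normalize_url(u) for u in known_urls}
    best = {}
    for result in results:
        key = normalize_url(result["url"])
        if key in known:
            continue  # already stored, skip it
        if key not in best or result["score"] > best[key]["score"]:
            best[key] = {**result, "url": key}
    return list(best.values())


results = [
    {"url": "https://example.com/guide/", "score": 0.3},
    {"url": "https://example.com/guide?ref=x", "score": 0.5},
    {"url": "https://example.com/old", "score": 0.9},
]
merged = merge_duplicates(results, known_urls=["https://example.com/old/"])
print(merged)
```

Removing every query parameter is aggressive: some sites use the query string to select the page itself, so a real implementation might strip only known tracking parameters.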