6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (`~/.reference-curator.env`)
- Commands tracked in repo, symlinked during install
- `install.sh` with `--minimal`, `--check`, `--uninstall` modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Reference Discovery
Search and identify authoritative sources for reference materials. Validates source credibility, prioritizes by relevance, and outputs curated URL lists with metadata.
## Trigger Keywords
"find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources"
## Source Priority Hierarchy
| Tier | Source Type | Examples |
|---|---|---|
| Tier 1 | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| Tier 1 | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| Tier 1 | Official GitHub repos | github.com/anthropics/, github.com/openai/ |
| Tier 2 | Research papers | arxiv.org, papers with citations |
| Tier 2 | Verified community guides | Cookbook examples, official tutorials |
| Tier 3 | Community content | Blog posts, tutorials, Stack Overflow |
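The tier hierarchy above can be sketched as a simple domain lookup. This is a minimal illustration: the `tier1_official` label matches the example manifest below, but the other tier names and the `DOMAIN_TIERS` mapping itself are assumptions, not the skill's actual implementation.

```python
from urllib.parse import urlparse

# Hypothetical domain-to-tier mapping derived from the table above.
DOMAIN_TIERS = {
    "docs.anthropic.com": "tier1_official",
    "docs.claude.com": "tier1_official",
    "platform.openai.com": "tier1_official",
    "anthropic.com": "tier1_official",
    "openai.com": "tier1_official",
    "github.com": "tier1_official",  # only official orgs qualify; org check not shown here
    "arxiv.org": "tier2_research",
}

def classify_tier(url: str) -> str:
    """Return the credibility tier for a URL, defaulting to tier 3 (community)."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    return DOMAIN_TIERS.get(host, "tier3_community")
```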
## Workflow

### Step 1: Define Search Scope
Gather topic, target vendors, and freshness requirements from user input.
### Step 2: Execute Web Search

Use the WebSearch tool with targeted, site-scoped queries:

```
site:docs.anthropic.com {topic}
site:github.com/anthropics {topic}
site:arxiv.org {topic}
```
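Expanding a topic into the site-scoped queries above can be sketched as a one-liner (`build_queries` and its default site list are illustrative names, not part of the skill):

```python
def build_queries(topic: str,
                  sites=("docs.anthropic.com", "github.com/anthropics", "arxiv.org")):
    """Expand a topic into one site-scoped search query per source."""
    return [f"site:{site} {topic}" for site in sites]
```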
### Step 3: Score and Validate Sources
Apply credibility scoring:
- Domain credibility (0.10 - 0.40)
- Freshness signals (0.10 - 0.20)
- Relevance signals (0.15)
### Step 4: Output URL Manifest

Generate a JSON manifest for the crawler skill:
```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
## Scripts

### discover_sources.py

The main discovery script. Usage:

```bash
python scripts/discover_sources.py --topic "prompt engineering" --vendors anthropic,openai --output manifest.json
```
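The CLI surface shown above can be sketched with `argparse`. The flag names come from the usage example; the help text and vendor-splitting behavior are assumptions about the script, not its confirmed implementation.

```python
import argparse

def parse_args(argv=None):
    """Parse the discover_sources.py command line shown in the usage example."""
    p = argparse.ArgumentParser(description="Discover reference sources for a topic")
    p.add_argument("--topic", required=True, help="topic to research")
    p.add_argument("--vendors", default="", help="comma-separated vendor list")
    p.add_argument("--output", default="manifest.json", help="manifest output path")
    args = p.parse_args(argv)
    # Split the comma-separated vendor list, dropping empty entries.
    args.vendors = [v for v in args.vendors.split(",") if v]
    return args
```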
## Output

- `manifest.json` → handoff to `02-web-crawler-orchestrator`
- Register new sources in the `sources` table via `03-content-repository`
## Deduplication

Before outputting the manifest:

- Normalize URLs (remove trailing slashes and query params)
- Check against the existing `documents` table
- Merge duplicates, keeping the highest credibility score
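The normalize-and-merge steps above can be sketched as follows (the `documents`-table check is omitted; function names are illustrative, and fragment stripping is an assumption alongside the documented query-param and trailing-slash rules):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Normalize a URL: lowercase host, strip trailing slash, drop query and fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

def dedupe(entries):
    """Keep one entry per normalized URL, preferring the highest credibility score."""
    best = {}
    for entry in entries:
        key = normalize_url(entry["url"])
        if key not in best or entry["credibility_score"] > best[key]["credibility_score"]:
            best[key] = entry
    return list(best.values())
```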