feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Reference Discovery

Search and identify authoritative sources for reference materials. Validates source credibility, prioritizes by relevance, and outputs curated URL lists with metadata.

## Trigger Keywords

"find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources"

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Workflow

### Step 1: Define Search Scope

Gather the topic, target vendors, and freshness requirements from user input.

### Step 2: Execute Web Search

Use the WebSearch tool with targeted queries:

```
site:docs.anthropic.com {topic}
site:github.com/anthropics {topic}
site:arxiv.org {topic}
```

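The site-scoped queries above can be generated per vendor. This is a minimal sketch; the `SITES` mapping and the `build_queries` helper are hypothetical names, not part of the shipped script:

```python
def build_queries(topic: str, vendors: list[str]) -> list[str]:
    """Build site-scoped search queries for the given topic and vendors.

    SITES is an assumed vendor-to-domain mapping; extend it as needed.
    """
    SITES = {
        "anthropic": ["docs.anthropic.com", "github.com/anthropics"],
        "openai": ["platform.openai.com/docs", "github.com/openai"],
    }
    queries = [
        f"site:{site} {topic}"
        for vendor in vendors
        for site in SITES.get(vendor, [])
    ]
    # arXiv is always searched for Tier 2 research papers.
    queries.append(f"site:arxiv.org {topic}")
    return queries


print(build_queries("prompt engineering", ["anthropic"]))
```

Each returned string can be passed directly to the WebSearch tool.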
### Step 3: Score and Validate Sources

Apply credibility scoring:

- Domain credibility (0.10 - 0.40)
- Freshness signals (0.10 - 0.20)
- Relevance signals (0.15)

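Summing the three signals might look like the sketch below. The exact domain-to-weight mapping and freshness threshold are assumptions for illustration; only the signal ranges come from the list above:

```python
def score_source(domain: str, days_old: int, topic_in_title: bool) -> float:
    """Sum domain, freshness, and relevance signals into a credibility score.

    Weight choices here are illustrative; tune them against the tier table.
    """
    TIER1 = ("docs.anthropic.com", "docs.claude.com", "platform.openai.com")
    TIER2 = ("arxiv.org",)
    if domain in TIER1:
        domain_score = 0.40
    elif domain in TIER2:
        domain_score = 0.25
    else:
        domain_score = 0.10  # Tier 3 community content
    freshness = 0.20 if days_old <= 180 else 0.10  # assumed 6-month cutoff
    relevance = 0.15 if topic_in_title else 0.0
    return round(domain_score + freshness + relevance, 2)


print(score_source("docs.anthropic.com", 30, True))  # 0.75
```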
### Step 4: Output URL Manifest

Generate a JSON manifest for the crawler skill:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
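Before handing the manifest to the crawler, it can be sanity-checked against the schema above. A minimal sketch; `load_manifest` and `REQUIRED` are hypothetical helpers, not part of the shipped scripts:

```python
import json

# Fields each URL entry in the manifest is expected to carry.
REQUIRED = {
    "url", "title", "credibility_tier",
    "credibility_score", "source_type", "vendor",
}


def load_manifest(path: str) -> dict:
    """Load a manifest and verify every URL entry has the expected fields."""
    with open(path) as f:
        manifest = json.load(f)
    for entry in manifest["urls"]:
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(f"{entry.get('url', '?')} missing fields: {missing}")
    return manifest
```

Failing fast here keeps malformed entries out of the crawler queue.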
## Scripts

### `discover_sources.py`

Main discovery script. Usage:

```bash
python scripts/discover_sources.py --topic "prompt engineering" --vendors anthropic,openai --output manifest.json
```

## Output

- `manifest.json` → Handoff to `02-web-crawler-orchestrator`
- Register new sources in the `sources` table via `03-content-repository`

## Deduplication

Before outputting:

- Normalize URLs (strip trailing slashes and query parameters)
- Check against the existing `documents` table
- Merge duplicates, keeping the entry with the highest credibility score
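The normalization and merge steps above can be sketched as follows; `normalize_url` and `dedupe` are illustrative helper names:

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Strip the query string, fragment, and trailing slash for dedup keys."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return f"{parts.scheme}://{parts.netloc}{path}"


def dedupe(entries: list[dict]) -> list[dict]:
    """Merge entries sharing a normalized URL, keeping the highest score."""
    best: dict[str, dict] = {}
    for entry in entries:
        key = normalize_url(entry["url"])
        if key not in best or entry["credibility_score"] > best[key]["credibility_score"]:
            best[key] = entry
    return list(best.values())
```

Note that stripping all query parameters assumes none are significant for these documentation URLs; sources that key content on query strings would need a gentler normalization.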