feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,75 @@
# Reference Discovery

Search and identify authoritative sources for reference materials. Validates source credibility, prioritizes by relevance, and outputs curated URL lists with metadata.

## Trigger Keywords

"find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources"

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Workflow

### Step 1: Define Search Scope

Gather the topic, target vendors, and freshness requirements from user input.

### Step 2: Execute Web Search

Use the WebSearch tool with targeted queries:

```
site:docs.anthropic.com {topic}
site:github.com/anthropics {topic}
site:arxiv.org {topic}
```

### Step 3: Score and Validate Sources

Apply credibility scoring:

- Domain credibility (0.10 - 0.40)
- Freshness signals (0.10 - 0.20)
- Relevance signals (0.15)

### Step 4: Output URL Manifest

Generate a JSON manifest for the crawler skill:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```

## Scripts

### `discover_sources.py`

Main discovery script. Usage:

```bash
python scripts/discover_sources.py --topic "prompt engineering" --vendors anthropic,openai --output manifest.json
```
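The script body itself is not included in this diff; a minimal CLI skeleton consistent with the usage line above might look like the following (the flags match the example invocation, but the manifest assembly here is a placeholder, not the actual implementation):

```python
import argparse
import json


def main():
    parser = argparse.ArgumentParser(description="Discover authoritative reference sources")
    parser.add_argument("--topic", required=True, help="topic to research")
    parser.add_argument("--vendors", default="anthropic,openai",
                        help="comma-separated vendor list")
    parser.add_argument("--output", default="manifest.json",
                        help="path for the JSON manifest")
    args = parser.parse_args()

    vendors = [v.strip() for v in args.vendors.split(",")]

    # Placeholder manifest; the real script fills "urls" via search + scoring.
    manifest = {"topic": args.topic, "vendors": vendors, "total_urls": 0, "urls": []}
    with open(args.output, "w") as f:
        json.dump(manifest, f, indent=2)


if __name__ == "__main__":
    main()
```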

## Output

- `manifest.json` → Handoff to `02-web-crawler-orchestrator`
- Register new sources in the `sources` table via `03-content-repository`

## Deduplication

Before outputting:

- Normalize URLs (remove trailing slashes, query params)
- Check against the existing `documents` table
- Merge duplicates, keeping the highest credibility score
@@ -0,0 +1,188 @@
---
name: reference-discovery
description: Search and identify authoritative sources for reference materials. Validates source credibility, prioritizes by relevance, and outputs curated URL lists with metadata. Triggers on "find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources".
---

# Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",  # past_week, past_month, past_year, any
    "max_results_per_query": 20
}
```

### Step 2: Generate Search Queries

For a given topic, generate targeted queries:

```python
def generate_queries(topic, vendors):
    queries = []

    # Official documentation queries
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")

    # Engineering blog queries
    for vendor in vendors:
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")

    # GitHub queries
    for vendor in vendors:
        queries.append(f"site:github.com/{vendor} {topic}")

    # Research queries
    queries.append(f"site:arxiv.org {topic}")

    return queries
```

### Step 3: Execute Search

Use the web search tool for each query:

```python
def execute_discovery(queries):
    results = []
    for query in queries:
        search_results = web_search(query)
        for result in search_results:
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query
            })
    return deduplicate_by_url(results)
```
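`deduplicate_by_url` is referenced above but not defined in this file; a minimal sketch, assuming the result dicts are shaped as in `execute_discovery`:

```python
def deduplicate_by_url(results):
    """Keep the first occurrence of each URL, ignoring trailing slashes."""
    seen = set()
    unique = []
    for result in results:
        key = result["url"].rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```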

### Step 4: Validate and Score Sources

```python
def score_source(url, title):
    score = 0.0

    # Domain credibility
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community

    # Freshness signals (from title/snippet)
    if any(year in title for year in ['2025', '2024']):
        score += 0.20
    elif '2023' in title:
        score += 0.10

    # Relevance signals
    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
        score += 0.15

    return min(score, 1.0)


def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```

### Step 5: Output URL Manifest

```python
from datetime import datetime

def create_manifest(scored_results, topic):
    manifest = {
        "discovery_date": datetime.now().isoformat(),
        "topic": topic,
        "total_urls": len(scored_results),
        "urls": []
    }

    # Highest-credibility sources first
    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
        manifest["urls"].append({
            "url": result["url"],
            "title": result["title"],
            "credibility_tier": result["tier"],
            "credibility_score": result["score"],
            "source_type": infer_source_type(result["url"]),
            "vendor": infer_vendor(result["url"])
        })

    return manifest
```
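`infer_source_type` and `infer_vendor` are referenced but not defined here; a minimal sketch, mirroring the domain checks used in `score_source` (the exact mappings are assumptions):

```python
def infer_source_type(url):
    """Map a URL to a coarse source type based on its domain and path."""
    if 'docs.' in url or '/docs' in url:
        return 'official_docs'
    if 'github.com' in url:
        return 'github_repo'
    if 'arxiv.org' in url:
        return 'research_paper'
    if '/blog' in url or '/news' in url:
        return 'engineering_blog'
    return 'community'


def infer_vendor(url):
    """Map a URL to a known vendor, or None for community sources."""
    vendor_domains = {
        'anthropic': ['anthropic.com', 'claude.com', 'github.com/anthropics'],
        'openai': ['openai.com', 'github.com/openai'],
        'google': ['google.dev', 'blog.google', 'github.com/google'],
    }
    for vendor, domains in vendor_domains.items():
        if any(d in url for d in domains):
            return vendor
    return None
```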

## Output Format

Discovery produces a JSON manifest for the crawler:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```

## Known Authoritative Sources

Pre-validated sources for common topics:

| Vendor | Documentation | Blog/News | GitHub |
|--------|--------------|-----------|--------|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |
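
The table above can also be kept as structured data that query generation draws from; a sketch (the constant name and layout are assumptions):

```python
# Pre-validated sources per vendor, mirroring the table above.
KNOWN_SOURCES = {
    "anthropic": {
        "documentation": ["docs.anthropic.com", "docs.claude.com"],
        "blog": ["anthropic.com/news"],
        "github": ["github.com/anthropics"],
    },
    "openai": {
        "documentation": ["platform.openai.com/docs"],
        "blog": ["openai.com/blog"],
        "github": ["github.com/openai"],
    },
    "google": {
        "documentation": ["ai.google.dev/docs"],
        "blog": ["blog.google/technology/ai"],
        "github": ["github.com/google"],
    },
}
```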

## Integration

**Output:** URL manifest JSON → `web-crawler-orchestrator`

**Database:** Register new sources in the `sources` table via `content-repository`

## Deduplication

Before outputting, deduplicate URLs:

- Normalize URLs (remove trailing slashes, query params)
- Check against the existing `documents` table via `content-repository`
- Merge duplicate entries, keeping the highest credibility score