our-claude-skills/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md
Andrew Yim d1cd1298a8 feat(reference-curator): Add pipeline orchestrator and refactor skill format
Pipeline Orchestrator:
- Add 07-pipeline-orchestrator skill with code/CLAUDE.md and desktop/SKILL.md
- Add /reference-curator-pipeline slash command for full workflow automation
- Add pipeline_runs and pipeline_iteration_tracker tables to schema.sql
- Add v_pipeline_status and v_pipeline_iterations views
- Add pipeline_config.yaml configuration template
- Update AGENTS.md with Reference Curator Skills section
- Update claude-project files with pipeline documentation

Skill Format Refactoring:
- Extract YAML frontmatter from SKILL.md files to separate skill.yaml
- Add tools/ directories with MCP tool documentation
- Update SKILL-FORMAT-REQUIREMENTS.md with new structure
- Add migrate-skill-structure.py script for format conversion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 01:01:02 +07:00


# Reference Curator - Complete Skill Set
This document contains all 7 skills for curating, processing, and exporting reference documentation.
---
# Pipeline Orchestrator (Recommended Entry Point)
Coordinates the full 6-skill workflow with automated QA loop handling.
## Quick Start
```
# Full pipeline from topic
curate references on "Claude Code best practices"

# From URLs (skip discovery)
curate these URLs: https://docs.anthropic.com/en/docs/prompt-caching

# With auto-approve
curate references on "MCP servers" with auto-approve and fine-tuning output
```
## Configuration Options
| Option | Default | Description |
|--------|---------|-------------|
| max_sources | 10 | Maximum sources to discover |
| max_pages | 50 | Maximum pages per source |
| auto_approve | false | Auto-approve above threshold |
| threshold | 0.85 | Approval threshold |
| max_iterations | 3 | Max QA loop iterations |
| export_format | project_files | Output format |
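These options can be overridden via `pipeline_config.yaml`; a minimal sketch of the merge, assuming the config parses to a flat dict (`PIPELINE_DEFAULTS` and `load_pipeline_config` are illustrative names, not part of the skill's API):

```python
# Defaults mirror the options table above; overrides (e.g. parsed from
# pipeline_config.yaml) win key-by-key.
PIPELINE_DEFAULTS = {
    "max_sources": 10,
    "max_pages": 50,
    "auto_approve": False,
    "threshold": 0.85,
    "max_iterations": 3,
    "export_format": "project_files",
}

def load_pipeline_config(overrides=None):
    config = dict(PIPELINE_DEFAULTS)
    config.update(overrides or {})
    return config
```

Calling `load_pipeline_config({"auto_approve": True})` changes only that one key and leaves the other defaults intact.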
## Pipeline Flow
```
[Input: Topic | URLs | Manifest]
1. reference-discovery (skip if URLs/manifest)
2. web-crawler
3. content-repository
4. content-distiller ◄─────────────┐
        │                          │
        ▼                          │
5. quality-reviewer                │
        │                          │
        ├── APPROVE → export       │
        ├── REFACTOR (max 3) ──────┘
        ├── DEEP_RESEARCH (max 2) → crawler
        └── REJECT → archive
6. markdown-exporter
```
## QA Loop Handling
| Decision | Action | Max Iterations |
|----------|--------|----------------|
| APPROVE | Proceed to export | - |
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources | 2 |
| REJECT | Archive with reason | - |
Documents exceeding iteration limits are marked `needs_manual_review`.
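The iteration limits above can be enforced with a small per-document counter; a sketch (the `tracker` dict layout is an illustrative assumption, not the schema's `pipeline_iteration_tracker` table):

```python
# Limits follow the QA loop table above.
MAX_ITERATIONS = {"refactor": 3, "deep_research": 2}

def track_iteration(tracker, doc_id, decision):
    # approve / reject are terminal and pass through unchanged
    if decision not in MAX_ITERATIONS:
        return decision
    counts = tracker.setdefault(doc_id, {"refactor": 0, "deep_research": 0})
    counts[decision] += 1
    # Past the limit, route the document to manual review instead
    if counts[decision] > MAX_ITERATIONS[decision]:
        return "needs_manual_review"
    return decision
```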
## Output Summary
```
Pipeline Complete:
- Sources discovered: 5
- Pages crawled: 45
- Approved: 40
- Needs manual review: 2
- Exports: ~/reference-library/exports/
```
---
# 1. Reference Discovery
Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.
## Source Priority Hierarchy
| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |
## Discovery Workflow
### Step 1: Define Search Scope
```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",
    "max_results_per_query": 20
}
```
### Step 2: Generate Search Queries
```python
def generate_queries(topic, vendors):
    queries = []
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:github.com/{vendor} {topic}")
    queries.append(f"site:arxiv.org {topic}")
    return queries
```
### Step 3: Validate and Score Sources
```python
def score_source(url, title):
    score = 0.0
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```
## Output Format
```json
{
"discovery_date": "2025-01-28T10:30:00",
"topic": "prompt engineering",
"total_urls": 15,
"urls": [
{
"url": "https://docs.anthropic.com/en/docs/prompt-engineering",
"title": "Prompt Engineering Guide",
"credibility_tier": "tier1_official",
"credibility_score": 0.85,
"source_type": "official_docs",
"vendor": "anthropic"
}
]
}
```
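A sketch of assembling scored URLs into this manifest shape (field names follow the JSON above; `build_manifest` is an illustrative name):

```python
from datetime import datetime

def build_manifest(topic, scored_urls):
    # scored_urls: list of dicts with url, title, credibility_score, etc.
    # Highest-credibility sources go first so the crawler sees them first.
    return {
        "discovery_date": datetime.now().isoformat(timespec="seconds"),
        "topic": topic,
        "total_urls": len(scored_urls),
        "urls": sorted(scored_urls, key=lambda u: u["credibility_score"], reverse=True),
    }
```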
---
# 2. Web Crawler Orchestrator
Manages crawling operations using Firecrawl MCP with rate limiting and format handling.
## Crawl Configuration
```yaml
firecrawl:
rate_limit:
requests_per_minute: 20
concurrent_requests: 3
default_options:
timeout: 30000
only_main_content: true
```
## Crawl Workflow
### Determine Crawl Strategy
```python
def select_strategy(url):
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'
    else:
        return 'scrape'
```
### Execute Firecrawl
```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"]
)
```
### Rate Limiting
```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps that have aged out of the 60-second window
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())
```
## Error Handling
| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
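The table above can be sketched as a retry wrapper; `fetch` stands in for the actual Firecrawl call and the `(status, body)` return shape is an illustrative assumption:

```python
import time

def fetch_with_retries(fetch, url, timeout=30000, max_retries=3, sleep=time.sleep):
    # fetch(url, timeout) -> (status, body); status is an HTTP code or "timeout".
    status, body = fetch(url, timeout)
    if status == "timeout":
        # Single retry with doubled timeout
        status, body = fetch(url, timeout * 2)
    elif status == 429:
        # Exponential backoff: 1s, 2s, 4s
        for attempt in range(max_retries):
            sleep(2 ** attempt)
            status, body = fetch(url, timeout)
            if status != 429:
                break
    if status in (404, 403, 429) or status == "timeout":
        return None  # log and skip / mark failed
    return body
```

The `sleep` parameter is injected only so the backoff can be tested without real waiting.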
---
# 3. Content Repository
Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.
## Core Operations
**Store New Document:**
```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
        INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
        VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
        ON DUPLICATE KEY UPDATE
            version = version + 1,
            crawl_date = NOW(),
            raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid
```
**Check Duplicate:**
```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```
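The dedup check hashes in MySQL; the same hex digest can be computed in Python for offline checks, assuming `url_hash` stores the output of `SHA2(url, 256)`:

```python
import hashlib

def url_hash(url):
    # Hex SHA-256 of the URL, matching MySQL's SHA2(url, 256)
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def is_duplicate_offline(seen_hashes, url):
    # Same dedup check without a round trip to the database
    return url_hash(url) in seen_hashes
```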
## Table Quick Reference
| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
## Status Values
- **crawl_status:** `pending` | `completed` | `failed` | `stale`
- **review_status:** `pending` | `in_review` | `approved` | `needs_refactor` | `rejected`
- **decision:** `approve` | `refactor` | `deep_research` | `reject`
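If you want to guard `review_status` changes in code, one possible sketch (the allowed-transition map is an assumption for illustration; the schema does not define one):

```python
# Assumed legal moves between review_status values
VALID_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
    "needs_refactor": {"in_review"},
}

def advance_review_status(current, new):
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal review_status transition {current} -> {new}")
    return new
```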
---
# 4. Content Distiller
Transforms raw crawled content into structured, high-quality reference materials.
## Distillation Goals
1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization
## Extract Key Components
**Extract Code Snippets:**
```python
import re

def extract_code_snippets(content):
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
```
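`extract_code_snippets` calls a `get_surrounding_text` helper that is not shown; a plausible minimal version:

```python
def get_surrounding_text(content, pos, radius):
    # Return up to `radius` characters on either side of `pos`,
    # stripped of leading/trailing whitespace.
    start = max(0, pos - radius)
    end = min(len(content), pos + radius)
    return content[start:end].strip()
```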
**Extract Key Concepts:**
```python
def extract_key_concepts(content, title):
    prompt = f"""
    Analyze this document and extract key concepts:
    Title: {title}
    Content: {content[:8000]}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    return claude_extract(prompt)
```
## Summary Template
```markdown
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
## Executive Summary
{2-3 sentence overview}
## Key Concepts
{bulleted list of core concepts}
## Techniques & Patterns
{extracted techniques with use cases}
## Code Examples
{relevant code snippets}
## Best Practices
{actionable recommendations}
```
## Quality Metrics
| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
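The compression target can be checked mechanically; this sketch uses character counts as a rough stand-in for tokens:

```python
def compression_ratio(raw_text, distilled_text):
    # Ratio of distilled size to raw size; the target band is 0.25-0.35
    if not raw_text:
        return 0.0
    return len(distilled_text) / len(raw_text)

def meets_compression_target(raw_text, distilled_text, low=0.25, high=0.35):
    return low <= compression_ratio(raw_text, distilled_text) <= high
```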
---
# 5. Quality Reviewer
Evaluates distilled content, routes decisions, and triggers refactoring or additional research.
## Review Workflow
```
[Distilled Content]
        │
        ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
        │
        ├── ≥ 0.85    → APPROVE       → markdown-exporter
        ├── 0.60-0.84 → REFACTOR      → content-distiller (with instructions)
        ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
        └── < 0.40    → REJECT        → archive with reason
```
## Scoring Criteria
| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |
## Calculate Final Score
```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )
```
## Route Decision
```python
def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum", None
```
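Putting scoring and routing together, a worked example (the per-criterion scores are made up for illustration; the weights repeat the table above):

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10,
}

# Illustrative assessment for one distilled document
assessment = {
    "accuracy": {"score": 0.90},
    "completeness": {"score": 0.70},
    "clarity": {"score": 0.80},
    "prompt_engineering_quality": {"score": 0.60},
    "usability": {"score": 0.90},
}

score = sum(assessment[c]["score"] * w for c, w in WEIGHTS.items())
# 0.90*0.25 + 0.70*0.20 + 0.80*0.20 + 0.60*0.25 + 0.90*0.10 = 0.765
```

A score of 0.765 falls in the 0.60-0.84 band, so this document would be routed to REFACTOR.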
## Prompt Engineering Quality Checklist
- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
- [ ] Shows before/after examples
- [ ] Explains *why* techniques work, not just *what*
- [ ] Provides actionable patterns
- [ ] Includes edge cases and failure modes
- [ ] References authoritative sources
---
# 6. Markdown Exporter
Exports approved content as structured markdown files for Claude Projects or fine-tuning.
## Export Structure
**Nested by Topic (recommended):**
```
exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   └── 02-few-shot-prompting.md
├── claude-models/
│   ├── _index.md
│   └── 01-model-comparison.md
└── agent-building/
    └── 01-tool-use.md
```
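One way to derive the nested paths above (the slug rules are assumptions; `slugify` and `export_path` are illustrative names):

```python
import re

def slugify(title):
    # "Chain of Thought" -> "chain-of-thought"
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

def export_path(topic, seq, title):
    # exports/<topic>/<NN>-<slug>.md, matching the tree above
    return f"exports/{topic}/{seq:02d}-{slugify(title)}.md"
```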
## Document File Template
```python
def generate_document_file(doc, include_metadata=True):
    content = []
    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append("---")
        content.append("")
    content.append(doc['structured_content'])
    return "\n".join(content)
```
## Fine-tuning Export (JSONL)
```python
import json

def export_fine_tuning_dataset(content_list, config):
    with open('fine_tuning.jsonl', 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
                    {"role": "user", "content": f"Explain {doc['title']}"},
                    {"role": "assistant", "content": doc['structured_content']}
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')
```
## Cross-Reference Generation
```python
def add_cross_references(doc, all_docs):
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])
    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })
    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
```
---
# Integration Flow
| From | Output | To |
|------|--------|-----|
| **pipeline-orchestrator** | Coordinates all stages | All skills below |
| **reference-discovery** | URL manifest | web-crawler |
| **web-crawler** | Raw content + manifest | content-repository |
| **content-repository** | Document records | content-distiller |
| **content-distiller** | Distilled content | quality-reviewer |
| **quality-reviewer** (approve) | Approved IDs | markdown-exporter |
| **quality-reviewer** (refactor) | Instructions | content-distiller |
| **quality-reviewer** (deep_research) | Queries | web-crawler |
## State Management
The pipeline orchestrator tracks state for resume capability:
**With Database:**
- `pipeline_runs` table tracks run status, current stage, statistics
- `pipeline_iteration_tracker` tracks QA loop iterations per document
**File-Based Fallback:**
```
~/reference-library/pipeline_state/run_XXX/
├── state.json # Current stage and stats
├── manifest.json # Discovered sources
└── review_log.json # QA decisions
```
## Resume Pipeline
To resume a paused or failed pipeline:
1. Provide the run_id or state file path
2. Pipeline continues from last successful checkpoint
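A minimal sketch of the resume logic, assuming state.json records the last successfully completed stage under a `stage` key (the file layout is an assumption):

```python
import json
from pathlib import Path

# Stage order follows the pipeline flow above
STAGES = [
    "reference-discovery", "web-crawler", "content-repository",
    "content-distiller", "quality-reviewer", "markdown-exporter",
]

def load_state(state_path):
    # Parse state.json as written by the orchestrator
    return json.loads(Path(state_path).read_text())

def resume_point(state):
    # Return the next stage to run, or None if the pipeline finished
    done = STAGES.index(state["stage"])
    return STAGES[done + 1] if done + 1 < len(STAGES) else None
```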