# Reference Curator - Complete Skill Set

This document contains all 7 skills for curating, processing, and exporting reference documentation.

---

# Pipeline Orchestrator (Recommended Entry Point)

Coordinates the full 6-skill workflow with automated QA loop handling.

## Quick Start
```
# Full pipeline from topic
curate references on "Claude Code best practices"

# From URLs (skip discovery)
curate these URLs: https://docs.anthropic.com/en/docs/prompt-caching

# With auto-approve
curate references on "MCP servers" with auto-approve and fine-tuning output
```

## Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| max_sources | 10 | Maximum number of sources to discover |
| max_pages | 50 | Maximum pages crawled per source |
| auto_approve | false | Auto-approve documents scoring above the threshold |
| threshold | 0.85 | Minimum quality score for approval |
| max_iterations | 3 | Maximum QA loop iterations per document |
| export_format | project_files | Export output format |
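
A minimal sketch of how these defaults might be merged with per-run overrides (the function name and the strict unknown-key check are illustrative, not the orchestrator's actual API):

```python
# Defaults mirror the Configuration Options table above.
DEFAULTS = {
    "max_sources": 10,
    "max_pages": 50,
    "auto_approve": False,
    "threshold": 0.85,
    "max_iterations": 3,
    "export_format": "project_files",
}

def build_config(overrides=None):
    """Merge per-run overrides into the pipeline defaults.

    Unknown keys are rejected so typos in pipeline_config.yaml fail fast.
    """
    config = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f"Unknown pipeline option: {key}")
        config[key] = value
    return config
```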

## Pipeline Flow

```
[Input: Topic | URLs | Manifest]
        │
        ▼
1. reference-discovery (skip if URLs/manifest)
        │
        ▼
2. web-crawler
        │
        ▼
3. content-repository
        │
        ▼
4. content-distiller ◄─────────────┐
        │                          │
        ▼                          │
5. quality-reviewer                │
        │                          │
        ├── APPROVE → export       │
        ├── REFACTOR (max 3) ──────┤
        ├── DEEP_RESEARCH (max 2) → crawler
        └── REJECT → archive
        │
        ▼
6. markdown-exporter
```
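
The stage sequence above can be sketched as a small planner; this is a sketch only, the orchestrator's real dispatch logic lives in the skill itself:

```python
# Ordered pipeline stages; discovery is skipped when URLs or a manifest
# are supplied directly, per the flow diagram above.
STAGES = [
    "reference-discovery",
    "web-crawler",
    "content-repository",
    "content-distiller",
    "quality-reviewer",
    "markdown-exporter",
]

def plan_stages(has_urls_or_manifest):
    """Return the stage sequence for a run."""
    if has_urls_or_manifest:
        return STAGES[1:]
    return list(STAGES)
```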

## QA Loop Handling

| Decision | Action | Max Iterations |
|----------|--------|----------------|
| APPROVE | Proceed to export | - |
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources | 2 |
| REJECT | Archive with reason | - |

Documents exceeding iteration limits are marked `needs_manual_review`.
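
The iteration limits above can be enforced with a per-document counter; a minimal in-memory sketch (the `pipeline_iteration_tracker` table plays this role when a database is available):

```python
# Limits mirror the QA Loop Handling table above.
LIMITS = {"refactor": 3, "deep_research": 2}

def record_iteration(tracker, doc_id, decision):
    """Bump the per-document counter for a QA loop decision.

    Returns 'continue' while under the limit, 'needs_manual_review'
    once the limit is exceeded.
    """
    counts = tracker.setdefault(doc_id, {"refactor": 0, "deep_research": 0})
    counts[decision] += 1
    if counts[decision] > LIMITS[decision]:
        return "needs_manual_review"
    return "continue"
```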

## Output Summary

```
Pipeline Complete:
- Sources discovered: 5
- Pages crawled: 45
- Approved: 40
- Needs manual review: 2
- Exports: ~/reference-library/exports/
```

---

# 1. Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",
    "max_results_per_query": 20,
}
```

### Step 2: Generate Search Queries

```python
def generate_queries(topic, vendors):
    queries = []
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:github.com/{vendor} {topic}")
    # Vendor-independent: add the arxiv query once, not once per vendor.
    queries.append(f"site:arxiv.org {topic}")
    return queries
```

### Step 3: Validate and Score Sources

```python
def score_source(url, title):
    score = 0.0
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```

## Output Format

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
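
Before the manifest is handed to the web-crawler, a light validation pass can drop malformed entries; a sketch using the field names from the example above:

```python
# Fields every manifest entry must carry, per the Output Format above.
REQUIRED_FIELDS = {"url", "title", "credibility_tier", "credibility_score"}

def validate_manifest(manifest):
    """Return the list of entries safe to hand to the web-crawler.

    Drops entries missing required fields or with out-of-range scores.
    """
    valid = []
    for entry in manifest.get("urls", []):
        if not REQUIRED_FIELDS.issubset(entry):
            continue
        if not 0.0 <= entry["credibility_score"] <= 1.0:
            continue
        valid.append(entry)
    return valid
```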

---

# 2. Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

## Crawl Configuration

```yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
```

## Crawl Workflow

### Determine Crawl Strategy

```python
def select_strategy(url):
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'
    else:
        # Documentation sites and everything else use the default scrape.
        return 'scrape'
```

### Execute Firecrawl

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"]
)
```

### Rate Limiting

```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Evict timestamps that have aged out of the 60-second window.
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())
```

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
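
The 429 policy above can be sketched as a retry wrapper; `fetch` and `RateLimitError` are placeholders standing in for the actual Firecrawl call and its rate-limit response:

```python
import time

class RateLimitError(Exception):
    """Placeholder for a 429 response from the crawler."""

def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0):
    """Retry a rate-limited request with exponential backoff (1s, 2s, 4s)."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except RateLimitError:
            if attempt == max_retries:
                raise  # budget exhausted; surface the error
            time.sleep(base_delay * (2 ** attempt))
```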

---

# 3. Content Repository

Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.

## Core Operations

**Store New Document:**

```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
        INSERT INTO documents
            (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
        VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
        ON DUPLICATE KEY UPDATE
            version = version + 1,
            crawl_date = NOW(),
            raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid
```

**Check Duplicate:**

```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```
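
Since `url_hash` is compared against MySQL's `SHA2(url, 256)`, the same digest can be computed client-side to batch-dedupe before querying (this assumes the column stores the hex digest, as the SQL above implies):

```python
import hashlib

def url_hash(url):
    """Hex SHA-256 of the URL, matching MySQL's SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```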

## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision |
| `topics` | Taxonomy | topic_slug, parent_topic_id |

## Status Values

- **crawl_status:** `pending` → `completed` | `failed` | `stale`
- **review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`
- **decision:** `approve` | `refactor` | `deep_research` | `reject`
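
A guard like the following can validate a `review_status` change before it is written to MySQL; the transition map mirrors the list above, and the function name is illustrative:

```python
# Allowed review_status transitions, per the Status Values list above.
REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
}

def advance_review_status(current, target):
    """Validate a review_status change; raise on an illegal transition."""
    if target not in REVIEW_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target
```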

---

# 4. Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

## Distillation Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization

## Extract Key Components

**Extract Code Snippets:**

```python
import re

def extract_code_snippets(content):
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
```

**Extract Key Concepts:**

```python
def extract_key_concepts(content, title):
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {content[:8000]}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    return claude_extract(prompt)
```

## Summary Template

```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}
```

## Quality Metrics

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
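
The compression target can be checked with a simple ratio helper; this sketch uses character counts as an approximation (real token counts would come from a tokenizer):

```python
def compression_ratio(original, distilled):
    """Distilled length as a fraction of the original (0.30 = 30%)."""
    if not original:
        raise ValueError("original content is empty")
    return len(distilled) / len(original)

def meets_compression_target(original, distilled, low=0.25, high=0.35):
    """Check the 25-35% target from the Quality Metrics table."""
    return low <= compression_ratio(original, distilled) <= high
```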

---

# 5. Quality Reviewer

Evaluates distilled content, routes decisions, and triggers refactoring or additional research.

## Review Workflow

```
[Distilled Content]
        │
        ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
        │
        ├── ≥ 0.85    → APPROVE → markdown-exporter
        ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
        ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
        └── < 0.40    → REJECT → archive with reason
```

## Scoring Criteria

| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |

## Calculate Final Score

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10,
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )
```

## Route Decision

```python
def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum", None
```

## Prompt Engineering Quality Checklist

- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
- [ ] Shows before/after examples
- [ ] Explains *why* techniques work, not just *what*
- [ ] Provides actionable patterns
- [ ] Includes edge cases and failure modes
- [ ] References authoritative sources

---

# 6. Markdown Exporter

Exports approved content as structured markdown files for Claude Projects or fine-tuning.

## Export Structure

**Nested by Topic (recommended):**

```
exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   └── 02-few-shot-prompting.md
├── claude-models/
│   ├── _index.md
│   └── 01-model-comparison.md
└── agent-building/
    └── 01-tool-use.md
```
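
Numbered filenames like `01-chain-of-thought.md` might be derived from document titles as sketched below; the slug rules here are assumptions, not the exporter's actual implementation:

```python
import re

def slugify(title):
    """Lowercase, replace punctuation runs with hyphens: 'Chain of Thought' -> 'chain-of-thought'."""
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower())
    return slug.strip('-')

def export_filename(position, title):
    """Build a zero-padded export filename like '01-chain-of-thought.md'."""
    return f"{position:02d}-{slugify(title)}.md"
```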

## Document File Template

```python
def generate_document_file(doc, include_metadata=True):
    content = []
    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append("---")
        content.append("")
    content.append(doc['structured_content'])
    return "\n".join(content)
```

## Fine-tuning Export (JSONL)

```python
import json

def export_fine_tuning_dataset(content_list, config):
    with open('fine_tuning.jsonl', 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
                    {"role": "user", "content": f"Explain {doc['title']}"},
                    {"role": "assistant", "content": doc['structured_content']}
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')
```

## Cross-Reference Generation

```python
def add_cross_references(doc, all_docs):
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])

    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })

    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
```

---

# Integration Flow

| From | Output | To |
|------|--------|-----|
| **pipeline-orchestrator** | Coordinates all stages | All skills below |
| **reference-discovery** | URL manifest | web-crawler |
| **web-crawler** | Raw content + manifest | content-repository |
| **content-repository** | Document records | content-distiller |
| **content-distiller** | Distilled content | quality-reviewer |
| **quality-reviewer** (approve) | Approved IDs | markdown-exporter |
| **quality-reviewer** (refactor) | Instructions | content-distiller |
| **quality-reviewer** (deep_research) | Queries | web-crawler |

## State Management

The pipeline orchestrator tracks state for resume capability:

**With Database:**

- `pipeline_runs` table tracks run status, current stage, and statistics
- `pipeline_iteration_tracker` tracks QA loop iterations per document

**File-Based Fallback:**

```
~/reference-library/pipeline_state/run_XXX/
├── state.json       # Current stage and stats
├── manifest.json    # Discovered sources
└── review_log.json  # QA decisions
```
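
For the file-based fallback, checkpointing can be as simple as an atomic JSON write; a sketch with assumed field names for `state.json`:

```python
import json
import os

def save_checkpoint(state_dir, stage, stats):
    """Atomically write state.json so a crash never leaves a half-written file."""
    tmp_path = os.path.join(state_dir, "state.json.tmp")
    final_path = os.path.join(state_dir, "state.json")
    with open(tmp_path, "w") as f:
        json.dump({"current_stage": stage, "stats": stats}, f)
    os.replace(tmp_path, final_path)  # atomic on POSIX and Windows

def load_checkpoint(state_dir):
    """Return the saved stage and stats, or None if no checkpoint exists."""
    path = os.path.join(state_dir, "state.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```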

## Resume Pipeline

To resume a paused or failed pipeline:

1. Provide the run_id or the state file path
2. The pipeline continues from the last successful checkpoint