our-claude-skills/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md
Andrew Yim d1cd1298a8 feat(reference-curator): Add pipeline orchestrator and refactor skill format
Pipeline Orchestrator:
- Add 07-pipeline-orchestrator skill with code/CLAUDE.md and desktop/SKILL.md
- Add /reference-curator-pipeline slash command for full workflow automation
- Add pipeline_runs and pipeline_iteration_tracker tables to schema.sql
- Add v_pipeline_status and v_pipeline_iterations views
- Add pipeline_config.yaml configuration template
- Update AGENTS.md with Reference Curator Skills section
- Update claude-project files with pipeline documentation

Skill Format Refactoring:
- Extract YAML frontmatter from SKILL.md files to separate skill.yaml
- Add tools/ directories with MCP tool documentation
- Update SKILL-FORMAT-REQUIREMENTS.md with new structure
- Add migrate-skill-structure.py script for format conversion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 01:01:02 +07:00


Reference Curator - Complete Skill Set

This document contains all 7 skills for curating, processing, and exporting reference documentation.


Pipeline Orchestrator (Recommended Entry Point)

Coordinates the full 6-skill workflow with automated QA loop handling.

Quick Start

# Full pipeline from topic
curate references on "Claude Code best practices"

# From URLs (skip discovery)
curate these URLs: https://docs.anthropic.com/en/docs/prompt-caching

# With auto-approve
curate references on "MCP servers" with auto-approve and fine-tuning output

Configuration Options

| Option | Default | Description |
| --- | --- | --- |
| max_sources | 10 | Maximum sources to discover |
| max_pages | 50 | Maximum pages per source |
| auto_approve | false | Auto-approve above threshold |
| threshold | 0.85 | Approval threshold |
| max_iterations | 3 | Max QA loop iterations |
| export_format | project_files | Output format |
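The options above can be held in a small config object. This is a minimal sketch; the field names mirror the table, but the actual structure of `pipeline_config.yaml` may differ.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Pipeline options with the defaults from the table above.

    Field names are assumptions based on the option names; the shipped
    pipeline_config.yaml template may use a different layout.
    """
    max_sources: int = 10          # maximum sources to discover
    max_pages: int = 50            # maximum pages per source
    auto_approve: bool = False     # auto-approve above threshold
    threshold: float = 0.85        # approval threshold
    max_iterations: int = 3        # max QA loop iterations
    export_format: str = "project_files"
```

A run could then override only what it needs, e.g. `PipelineConfig(auto_approve=True, export_format="fine_tuning")`.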

Pipeline Flow

[Input: Topic | URLs | Manifest]
            │
            ▼
   1. reference-discovery  (skip if URLs/manifest)
            │
            ▼
   2. web-crawler
            │
            ▼
   3. content-repository
            │
            ▼
   4. content-distiller ◄─────────────┐
            │                         │
            ▼                         │
   5. quality-reviewer                │
            │                         │
            ├── APPROVE → export      │
            ├── REFACTOR (max 3) ─────┤
            ├── DEEP_RESEARCH (max 2) → crawler
            └── REJECT → archive
            │
            ▼
   6. markdown-exporter

QA Loop Handling

| Decision | Action | Max Iterations |
| --- | --- | --- |
| APPROVE | Proceed to export | - |
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources | 2 |
| REJECT | Archive with reason | - |

Documents exceeding iteration limits are marked needs_manual_review.
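The iteration guard can be sketched as a small routing function. This is an illustrative sketch, not the orchestrator's actual implementation; the `iteration_counts` dict stands in for the `pipeline_iteration_tracker` table.

```python
# Per-decision iteration caps from the table above.
MAX_ITERATIONS = {"refactor": 3, "deep_research": 2}

def next_action(decision, iteration_counts):
    """Return the action to take for a document, enforcing loop limits.

    iteration_counts maps a decision name to how many times that loop has
    already run for this document (a stand-in for the tracker table).
    """
    if decision in ("approve", "reject"):
        return decision  # terminal decisions pass through unchanged
    count = iteration_counts.get(decision, 0)
    if count >= MAX_ITERATIONS[decision]:
        return "needs_manual_review"  # exceeded the loop budget
    iteration_counts[decision] = count + 1
    return decision
```
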

Output Summary

Pipeline Complete:
- Sources discovered: 5
- Pages crawled: 45
- Approved: 40
- Needs manual review: 2
- Exports: ~/reference-library/exports/

1. Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

Source Priority Hierarchy

| Tier | Source Type | Examples |
| --- | --- | --- |
| Tier 1 | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| Tier 1 | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| Tier 1 | Official GitHub repos | github.com/anthropics/, github.com/openai/ |
| Tier 2 | Research papers | arxiv.org, papers with citations |
| Tier 2 | Verified community guides | Cookbook examples, official tutorials |
| Tier 3 | Community content | Blog posts, tutorials, Stack Overflow |

Discovery Workflow

Step 1: Define Search Scope

search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",
    "max_results_per_query": 20
}

Step 2: Generate Search Queries

def generate_queries(topic, vendors):
    queries = []
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:github.com/{vendor} {topic}")
    queries.append(f"site:arxiv.org {topic}")
    return queries

Step 3: Validate and Score Sources

def score_source(url, title):
    score = 0.0
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    # Signal bonuses: without these the domain weights alone can never
    # reach the 0.60 tier1_official threshold below
    if url.startswith('https://'):
        score += 0.10
    if any(k in title.lower() for k in ['guide', 'documentation', 'reference', 'best practices']):
        score += 0.15
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'

Output Format

{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
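Before handing the manifest to the crawler, it is worth validating each entry. A minimal sketch (the required fields and tier names come from the output format above; everything else is an assumption):

```python
REQUIRED_FIELDS = {"url", "title", "credibility_tier", "credibility_score",
                   "source_type", "vendor"}
VALID_TIERS = {"tier1_official", "tier2_verified", "tier3_community"}

def validate_manifest(manifest):
    """Return a list of problems; an empty list means ready for the crawler."""
    problems = []
    for i, entry in enumerate(manifest.get("urls", [])):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing fields {sorted(missing)}")
        if entry.get("credibility_tier") not in VALID_TIERS:
            problems.append(f"entry {i}: unknown tier {entry.get('credibility_tier')!r}")
        if not str(entry.get("url", "")).startswith("https://"):
            problems.append(f"entry {i}: non-https url")
    return problems
```
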

2. Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

Crawl Configuration

firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true

Crawl Workflow

Determine Crawl Strategy

def select_strategy(url):
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'crawl'  # documentation sites: multi-page crawl
    else:
        return 'scrape'  # default: single-page scrape

Execute Firecrawl

# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"]
)

Rate Limiting

import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `requests_per_minute` calls in any 60s window."""

    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps that have aged out of the 60-second window
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())

Error Handling

| Error | Action |
| --- | --- |
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
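The retry policy above can be sketched as a wrapper. This is an illustrative sketch, not the skill's actual code: `fetch` is any callable returning `(status, body)`, standing in for the Firecrawl MCP call, and only the 429/404/403 rows are shown.

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0):
    """Fetch with exponential backoff on 429; skip 404; fail 403.

    `fetch(url)` must return a (status_code, body) tuple.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status in (404, 403):
            return None  # log and skip / mark as failed
        if status == 429 and attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
            continue
        return None
    return None
```
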

3. Content Repository

Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.

Core Operations

Store New Document:

def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
    INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
    VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
    ON DUPLICATE KEY UPDATE
        version = version + 1,
        crawl_date = NOW(),
        raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid

Check Duplicate:

def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
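When deduplicating many URLs at once, the same hash can be computed client-side instead of round-tripping each URL through MySQL. `hashlib.sha256` over the UTF-8 bytes produces the same lowercase hex digest as `SHA2(url, 256)`:

```python
import hashlib

def url_hash(url):
    """Client-side equivalent of MySQL's SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

A batch check can then compare a set of locally computed hashes against `SELECT url_hash FROM documents` in one query.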

Table Quick Reference

| Table | Purpose | Key Fields |
| --- | --- | --- |
| sources | Authorized content sources | source_type, credibility_tier, vendor |
| documents | Crawled document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision |
| topics | Taxonomy | topic_slug, parent_topic_id |

Status Values

  • crawl_status: pending | completed | failed | stale
  • review_status: pending | in_review | approved | needs_refactor | rejected
  • decision: approve | refactor | deep_research | reject
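The schema does not enforce an ordering on these statuses, so callers may want a guard. The transition map below is an assumption about which moves are legal, inferred from the review workflow, not something the schema defines:

```python
# Assumed legal review_status transitions (not enforced by the schema).
REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
    "needs_refactor": {"in_review"},   # re-distilled, then reviewed again
    "approved": set(),                  # terminal
    "rejected": set(),                  # terminal
}

def can_transition(current, new):
    return new in REVIEW_TRANSITIONS.get(current, set())
```
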

4. Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

Distillation Goals

  1. Compress - Reduce token count while preserving essential information
  2. Structure - Organize content for easy retrieval and reference
  3. Extract - Pull out code snippets, key concepts, and actionable patterns
  4. Annotate - Add metadata for searchability and categorization

Extract Key Components

Extract Code Snippets:

import re

def extract_code_snippets(content):
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
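The `get_surrounding_text` helper is assumed above but not defined; a minimal sketch (taking the prose immediately preceding the snippet as its context) could look like:

```python
def get_surrounding_text(content, position, radius):
    """Return up to `radius` characters of text preceding `position`.

    Hypothetical implementation of the helper used by extract_code_snippets;
    the real one may also capture text after the snippet.
    """
    start = max(0, position - radius)
    return content[start:position].strip()
```
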

Extract Key Concepts:

def extract_key_concepts(content, title):
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {content[:8000]}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    return claude_extract(prompt)

Summary Template

# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}

Quality Metrics

| Metric | Target |
| --- | --- |
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
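The compression-ratio target can be checked mechanically. A rough sketch, using whitespace splitting as a stand-in for a real tokenizer:

```python
def compression_ratio(original, distilled):
    """Approximate ratio of distilled to original length, in whitespace tokens."""
    return len(distilled.split()) / max(len(original.split()), 1)

def within_target(ratio, low=0.25, high=0.35):
    """True if the ratio falls in the 25-35% target band from the table above."""
    return low <= ratio <= high
```

A distilled document landing outside the band is a signal to re-distill, not an automatic failure, since coverage metrics matter more than raw length.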

5. Quality Reviewer

Evaluates distilled content, routes decisions, and triggers refactoring or additional research.

Review Workflow

[Distilled Content]
       │
       ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
       │
       ├── ≥ 0.85 → APPROVE → markdown-exporter
       ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
       ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
       └── < 0.40 → REJECT → archive with reason

Scoring Criteria

| Criterion | Weight | Checks |
| --- | --- | --- |
| Accuracy | 0.25 | Factual correctness, up-to-date info, proper attribution |
| Completeness | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| Clarity | 0.20 | Clear structure, concise language, logical flow |
| PE Quality | 0.25 | Demonstrates techniques, before/after examples, explains why |
| Usability | 0.10 | Easy to reference, searchable keywords, appropriate length |

Calculate Final Score

WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )

Route Decision

def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum", None

Prompt Engineering Quality Checklist

  • Demonstrates specific techniques (CoT, few-shot, etc.)
  • Shows before/after examples
  • Explains why techniques work, not just what
  • Provides actionable patterns
  • Includes edge cases and failure modes
  • References authoritative sources

6. Markdown Exporter

Exports approved content as structured markdown files for Claude Projects or fine-tuning.

Export Structure

Nested by Topic (recommended):

exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   └── 02-few-shot-prompting.md
├── claude-models/
│   ├── _index.md
│   └── 01-model-comparison.md
└── agent-building/
    └── 01-tool-use.md

Document File Template

def generate_document_file(doc, include_metadata=True):
    content = []
    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append("---")
        content.append("")
    content.append(doc['structured_content'])
    return "\n".join(content)

Fine-tuning Export (JSONL)

import json

def export_fine_tuning_dataset(content_list, config):
    # config is assumed to be a dict of export options
    output_path = config.get('output_path', 'fine_tuning.jsonl')
    with open(output_path, 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
                    {"role": "user", "content": f"Explain {doc['title']}"},
                    {"role": "assistant", "content": doc['structured_content']}
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')

Cross-Reference Generation

def add_cross_references(doc, all_docs):
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])

    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })

    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]

Integration Flow

| From | Output | To |
| --- | --- | --- |
| pipeline-orchestrator | Coordinates all stages | All skills below |
| reference-discovery | URL manifest | web-crawler |
| web-crawler | Raw content + manifest | content-repository |
| content-repository | Document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer (approve) | Approved IDs | markdown-exporter |
| quality-reviewer (refactor) | Instructions | content-distiller |
| quality-reviewer (deep_research) | Queries | web-crawler |

State Management

The pipeline orchestrator tracks state for resume capability:

With Database:

  • pipeline_runs table tracks run status, current stage, statistics
  • pipeline_iteration_tracker tracks QA loop iterations per document

File-Based Fallback:

~/reference-library/pipeline_state/run_XXX/
├── state.json       # Current stage and stats
├── manifest.json    # Discovered sources
└── review_log.json  # QA decisions

Resume Pipeline

To resume a paused or failed pipeline:

  1. Provide the run_id or state file path
  2. Pipeline continues from last successful checkpoint
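The file-based resume path can be sketched as follows. This assumes `state.json` records a `last_completed_stage` key, which is a guess at the state file's layout, not its documented schema:

```python
import json
from pathlib import Path

# Stage order from the Pipeline Flow diagram
STAGES = ["reference-discovery", "web-crawler", "content-repository",
          "content-distiller", "quality-reviewer", "markdown-exporter"]

def load_resume_point(run_dir):
    """Read state.json from a run directory and return the next stage to run.

    Returns None when the pipeline already finished. The
    'last_completed_stage' key is an assumption about the state layout.
    """
    state = json.loads((Path(run_dir) / "state.json").read_text())
    last = state.get("last_completed_stage")
    if last is None:
        return STAGES[0]  # nothing done yet: start from discovery
    idx = STAGES.index(last)
    return STAGES[idx + 1] if idx + 1 < len(STAGES) else None
```
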