our-claude-skills/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md
Andrew Yim d1cd1298a8 feat(reference-curator): Add pipeline orchestrator and refactor skill format
Pipeline Orchestrator:
- Add 07-pipeline-orchestrator skill with code/CLAUDE.md and desktop/SKILL.md
- Add /reference-curator-pipeline slash command for full workflow automation
- Add pipeline_runs and pipeline_iteration_tracker tables to schema.sql
- Add v_pipeline_status and v_pipeline_iterations views
- Add pipeline_config.yaml configuration template
- Update AGENTS.md with Reference Curator Skills section
- Update claude-project files with pipeline documentation

Skill Format Refactoring:
- Extract YAML frontmatter from SKILL.md files to separate skill.yaml
- Add tools/ directories with MCP tool documentation
- Update SKILL-FORMAT-REQUIREMENTS.md with new structure
- Add migrate-skill-structure.py script for format conversion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 01:01:02 +07:00


Reference Curator - Complete Skill Set

This document contains all 7 skills for curating, processing, and exporting reference documentation.


Pipeline Orchestrator (Recommended Entry Point)

Coordinates the full 6-skill workflow with automated QA loop handling.

Quick Start

# Full pipeline from topic
curate references on "Claude Code best practices"

# From URLs (skip discovery)
curate these URLs: https://docs.anthropic.com/en/docs/prompt-caching

# With auto-approve
curate references on "MCP servers" with auto-approve and fine-tuning output

Configuration Options

| Option | Default | Description |
| --- | --- | --- |
| max_sources | 10 | Maximum sources to discover |
| max_pages | 50 | Maximum pages per source |
| auto_approve | false | Auto-approve above threshold |
| threshold | 0.85 | Approval threshold |
| max_iterations | 3 | Max QA loop iterations |
| export_format | project_files | Output format |
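The options above can be held in a small config object. This is a minimal sketch; the field names mirror the table, but the actual structure of `pipeline_config.yaml` may differ.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Pipeline options with the defaults from the table above.

    Field names are assumptions based on the option names; the shipped
    pipeline_config.yaml template may use a different layout.
    """
    max_sources: int = 10          # maximum sources to discover
    max_pages: int = 50            # maximum pages per source
    auto_approve: bool = False     # auto-approve above threshold
    threshold: float = 0.85        # approval threshold
    max_iterations: int = 3        # max QA loop iterations
    export_format: str = "project_files"
```

A run could then override only what it needs, e.g. `PipelineConfig(auto_approve=True, export_format="fine_tuning")`.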

Pipeline Flow

[Input: Topic | URLs | Manifest]
            │
            ▼
   1. reference-discovery  (skip if URLs/manifest)
            │
            ▼
   2. web-crawler
            │
            ▼
   3. content-repository
            │
            ▼
   4. content-distiller ◄─────────────┐
            │                         │
            ▼                         │
   5. quality-reviewer                │
            │                         │
            ├── APPROVE → export      │
            ├── REFACTOR (max 3) ─────┤
            ├── DEEP_RESEARCH (max 2) → crawler
            └── REJECT → archive
            │
            ▼
   6. markdown-exporter

QA Loop Handling

| Decision | Action | Max Iterations |
| --- | --- | --- |
| APPROVE | Proceed to export | - |
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources | 2 |
| REJECT | Archive with reason | - |

Documents exceeding iteration limits are marked needs_manual_review.
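The iteration guard can be sketched as a small routing function. This is an illustrative sketch, not the orchestrator's actual implementation; the `iteration_counts` dict stands in for the `pipeline_iteration_tracker` table.

```python
# Per-decision iteration caps from the table above.
MAX_ITERATIONS = {"refactor": 3, "deep_research": 2}

def next_action(decision, iteration_counts):
    """Return the action to take for a document, enforcing loop limits.

    iteration_counts maps a decision name to how many times that loop has
    already run for this document (a stand-in for the tracker table).
    """
    if decision in ("approve", "reject"):
        return decision  # terminal decisions pass through unchanged
    count = iteration_counts.get(decision, 0)
    if count >= MAX_ITERATIONS[decision]:
        return "needs_manual_review"  # exceeded the loop budget
    iteration_counts[decision] = count + 1
    return decision
```
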

Output Summary

Pipeline Complete:
- Sources discovered: 5
- Pages crawled: 45
- Approved: 40
- Needs manual review: 2
- Exports: ~/reference-library/exports/

1. Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

Source Priority Hierarchy

| Tier | Source Type | Examples |
| --- | --- | --- |
| Tier 1 | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| Tier 1 | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| Tier 1 | Official GitHub repos | github.com/anthropics/, github.com/openai/ |
| Tier 2 | Research papers | arxiv.org, papers with citations |
| Tier 2 | Verified community guides | Cookbook examples, official tutorials |
| Tier 3 | Community content | Blog posts, tutorials, Stack Overflow |

Discovery Workflow

Step 1: Define Search Scope

search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",
    "max_results_per_query": 20
}

Step 2: Generate Search Queries

def generate_queries(topic, vendors):
    queries = []
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:github.com/{vendor} {topic}")
    queries.append(f"site:arxiv.org {topic}")
    return queries

Step 3: Validate and Score Sources

def score_source(url, title):
    score = 0.0
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    # Signal bonuses: without these the domain weights alone can never
    # reach the 0.60 tier1_official threshold below
    if url.startswith('https://'):
        score += 0.10
    if any(k in title.lower() for k in ['guide', 'documentation', 'reference', 'best practices']):
        score += 0.15
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'

Output Format

{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
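Before handing the manifest to the crawler, it is worth validating each entry. A minimal sketch (the required fields and tier names come from the output format above; everything else is an assumption):

```python
REQUIRED_FIELDS = {"url", "title", "credibility_tier", "credibility_score",
                   "source_type", "vendor"}
VALID_TIERS = {"tier1_official", "tier2_verified", "tier3_community"}

def validate_manifest(manifest):
    """Return a list of problems; an empty list means ready for the crawler."""
    problems = []
    for i, entry in enumerate(manifest.get("urls", [])):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing fields {sorted(missing)}")
        if entry.get("credibility_tier") not in VALID_TIERS:
            problems.append(f"entry {i}: unknown tier {entry.get('credibility_tier')!r}")
        if not str(entry.get("url", "")).startswith("https://"):
            problems.append(f"entry {i}: non-https url")
    return problems
```
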

2. Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

Crawl Configuration

firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true

Crawl Workflow

Determine Crawl Strategy

def select_strategy(url):
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'crawl'  # documentation sites: multi-page crawl
    else:
        return 'scrape'  # default: single-page scrape

Execute Firecrawl

# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"]
)

Rate Limiting

import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `requests_per_minute` calls in any 60s window."""

    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps that have aged out of the 60-second window
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())

Error Handling

| Error | Action |
| --- | --- |
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
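The retry policy above can be sketched as a wrapper. This is an illustrative sketch, not the skill's actual code: `fetch` is any callable returning `(status, body)`, standing in for the Firecrawl MCP call, and only the 429/404/403 rows are shown.

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0):
    """Fetch with exponential backoff on 429; skip 404; fail 403.

    `fetch(url)` must return a (status_code, body) tuple.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status in (404, 403):
            return None  # log and skip / mark as failed
        if status == 429 and attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
            continue
        return None
    return None
```
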

3. Content Repository

Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.

Core Operations

Store New Document:

def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
    INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
    VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
    ON DUPLICATE KEY UPDATE
        version = version + 1,
        crawl_date = NOW(),
        raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid

Check Duplicate:

def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
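When deduplicating many URLs at once, the same hash can be computed client-side instead of round-tripping each URL through MySQL. `hashlib.sha256` over the UTF-8 bytes produces the same lowercase hex digest as `SHA2(url, 256)`:

```python
import hashlib

def url_hash(url):
    """Client-side equivalent of MySQL's SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

A batch check can then compare a set of locally computed hashes against `SELECT url_hash FROM documents` in one query.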

Table Quick Reference

| Table | Purpose | Key Fields |
| --- | --- | --- |
| sources | Authorized content sources | source_type, credibility_tier, vendor |
| documents | Crawled document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision |
| topics | Taxonomy | topic_slug, parent_topic_id |

Status Values

  • crawl_status: pending | completed | failed | stale
  • review_status: pending | in_review | approved | needs_refactor | rejected
  • decision: approve | refactor | deep_research | reject
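The schema does not enforce an ordering on these statuses, so callers may want a guard. The transition map below is an assumption about which moves are legal, inferred from the review workflow, not something the schema defines:

```python
# Assumed legal review_status transitions (not enforced by the schema).
REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
    "needs_refactor": {"in_review"},   # re-distilled, then reviewed again
    "approved": set(),                  # terminal
    "rejected": set(),                  # terminal
}

def can_transition(current, new):
    return new in REVIEW_TRANSITIONS.get(current, set())
```
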

4. Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

Distillation Goals

  1. Compress - Reduce token count while preserving essential information
  2. Structure - Organize content for easy retrieval and reference
  3. Extract - Pull out code snippets, key concepts, and actionable patterns
  4. Annotate - Add metadata for searchability and categorization

Extract Key Components

Extract Code Snippets:

import re

def extract_code_snippets(content):
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
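The `get_surrounding_text` helper is assumed above but not defined; a minimal sketch (taking the prose immediately preceding the snippet as its context) could look like:

```python
def get_surrounding_text(content, position, radius):
    """Return up to `radius` characters of text preceding `position`.

    Hypothetical implementation of the helper used by extract_code_snippets;
    the real one may also capture text after the snippet.
    """
    start = max(0, position - radius)
    return content[start:position].strip()
```
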

Extract Key Concepts:

def extract_key_concepts(content, title):
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {content[:8000]}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    return claude_extract(prompt)

Summary Template

# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}

Quality Metrics

| Metric | Target |
| --- | --- |
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
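The compression-ratio target can be checked mechanically. A rough sketch, using whitespace splitting as a stand-in for a real tokenizer:

```python
def compression_ratio(original, distilled):
    """Approximate ratio of distilled to original length, in whitespace tokens."""
    return len(distilled.split()) / max(len(original.split()), 1)

def within_target(ratio, low=0.25, high=0.35):
    """True if the ratio falls in the 25-35% target band from the table above."""
    return low <= ratio <= high
```

A distilled document landing outside the band is a signal to re-distill, not an automatic failure, since coverage metrics matter more than raw length.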

5. Quality Reviewer

Evaluates distilled content, routes decisions, and triggers refactoring or additional research.

Review Workflow

[Distilled Content]
       │
       ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
       │
       ├── ≥ 0.85 → APPROVE → markdown-exporter
       ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
       ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
       └── < 0.40 → REJECT → archive with reason

Scoring Criteria

| Criterion | Weight | Checks |
| --- | --- | --- |
| Accuracy | 0.25 | Factual correctness, up-to-date info, proper attribution |
| Completeness | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| Clarity | 0.20 | Clear structure, concise language, logical flow |
| PE Quality | 0.25 | Demonstrates techniques, before/after examples, explains why |
| Usability | 0.10 | Easy to reference, searchable keywords, appropriate length |

Calculate Final Score

WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )

Route Decision

def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum", None

Prompt Engineering Quality Checklist

  • Demonstrates specific techniques (CoT, few-shot, etc.)
  • Shows before/after examples
  • Explains why techniques work, not just what
  • Provides actionable patterns
  • Includes edge cases and failure modes
  • References authoritative sources

6. Markdown Exporter

Exports approved content as structured markdown files for Claude Projects or fine-tuning.

Export Structure

Nested by Topic (recommended):

exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   └── 02-few-shot-prompting.md
├── claude-models/
│   ├── _index.md
│   └── 01-model-comparison.md
└── agent-building/
    └── 01-tool-use.md

Document File Template

def generate_document_file(doc, include_metadata=True):
    content = []
    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append("---")
        content.append("")
    content.append(doc['structured_content'])
    return "\n".join(content)

Fine-tuning Export (JSONL)

import json

def export_fine_tuning_dataset(content_list, config):
    # config is assumed to be a dict of export options
    output_path = config.get('output_path', 'fine_tuning.jsonl')
    with open(output_path, 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
                    {"role": "user", "content": f"Explain {doc['title']}"},
                    {"role": "assistant", "content": doc['structured_content']}
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')

Cross-Reference Generation

def add_cross_references(doc, all_docs):
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])

    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })

    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]

Integration Flow

| From | Output | To |
| --- | --- | --- |
| pipeline-orchestrator | Coordinates all stages | All skills below |
| reference-discovery | URL manifest | web-crawler |
| web-crawler | Raw content + manifest | content-repository |
| content-repository | Document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer (approve) | Approved IDs | markdown-exporter |
| quality-reviewer (refactor) | Instructions | content-distiller |
| quality-reviewer (deep_research) | Queries | web-crawler |

State Management

The pipeline orchestrator tracks state for resume capability:

With Database:

  • pipeline_runs table tracks run status, current stage, statistics
  • pipeline_iteration_tracker tracks QA loop iterations per document

File-Based Fallback:

~/reference-library/pipeline_state/run_XXX/
├── state.json       # Current stage and stats
├── manifest.json    # Discovered sources
└── review_log.json  # QA decisions

Resume Pipeline

To resume a paused or failed pipeline:

  1. Provide the run_id or state file path
  2. Pipeline continues from last successful checkpoint
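The file-based resume path can be sketched as follows. This assumes `state.json` records a `last_completed_stage` key, which is a guess at the state file's layout, not its documented schema:

```python
import json
from pathlib import Path

# Stage order from the Pipeline Flow diagram
STAGES = ["reference-discovery", "web-crawler", "content-repository",
          "content-distiller", "quality-reviewer", "markdown-exporter"]

def load_resume_point(run_dir):
    """Read state.json from a run directory and return the next stage to run.

    Returns None when the pipeline already finished. The
    'last_completed_stage' key is an assumption about the state layout.
    """
    state = json.loads((Path(run_dir) / "state.json").read_text())
    last = state.get("last_completed_stage")
    if last is None:
        return STAGES[0]  # nothing done yet: start from discovery
    idx = STAGES.index(last)
    return STAGES[idx + 1] if idx + 1 < len(STAGES) else None
```
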