our-claude-skills/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md
Andrew Yim 243b9d851c feat(reference-curator): Add Claude.ai Projects export format
Add claude-project/ folder with skill files formatted for upload to
Claude.ai Projects (web interface):

- reference-curator-complete.md: All 6 skills consolidated
- INDEX.md: Overview and workflow documentation
- Individual skill files (01-06) without YAML frontmatter

Add --claude-ai option to install.sh:
- Lists available files for upload
- Optionally copies to custom destination directory
- Provides upload instructions for Claude.ai

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:33:06 +07:00


Reference Curator - Complete Skill Set

This document contains all 6 skills for curating, processing, and exporting reference documentation.


1. Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

Source Priority Hierarchy

| Tier | Source Type | Examples |
|---|---|---|
| Tier 1 | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| Tier 1 | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| Tier 1 | Official GitHub repos | github.com/anthropics/, github.com/openai/ |
| Tier 2 | Research papers | arxiv.org, papers with citations |
| Tier 2 | Verified community guides | Cookbook examples, official tutorials |
| Tier 3 | Community content | Blog posts, tutorials, Stack Overflow |

Discovery Workflow

Step 1: Define Search Scope

search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",
    "max_results_per_query": 20
}

Step 2: Generate Search Queries

def generate_queries(topic, vendors):
    queries = []
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:github.com/{vendor} {topic}")
    queries.append(f"site:arxiv.org {topic}")
    return queries

Step 3: Validate and Score Sources

def score_source(url, title):
    # Base score from domain authority. `title` is reserved for keyword
    # bonuses, which is why totals are capped at 1.0 below.
    score = 0.0
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'

Output Format

{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
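A consumer of this manifest can be sketched as follows; `load_crawl_targets` and the 0.40 cutoff are illustrative, not part of the skill's API.

```python
import json

def load_crawl_targets(manifest_path, min_score=0.40):
    """Read a discovery manifest and return the URL entries worth
    crawling, highest credibility first."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    targets = [u for u in manifest["urls"] if u["credibility_score"] >= min_score]
    return sorted(targets, key=lambda u: u["credibility_score"], reverse=True)
```

The sorted list feeds directly into the web crawler in the next section.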

2. Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

Crawl Configuration

firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true

Crawl Workflow

Determine Crawl Strategy

def select_strategy(url):
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'
    else:
        return 'scrape'  # default, including docs sites

Execute Firecrawl

# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"]
)

Rate Limiting

import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps older than the 60-second window.
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())

Error Handling

| Error | Action |
|---|---|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
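This policy can be sketched as a retry wrapper. `fetch_with_retries` is illustrative, and the built-in exception types stand in for Firecrawl's real error signals.

```python
import time

def fetch_with_retries(fetch, url, timeout=30, max_retries=3):
    """Apply the error policy above to any `fetch(url, timeout)` callable.
    TimeoutError, PermissionError, and ConnectionError stand in for
    timeout, 403/404, and 429 responses respectively."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url, timeout=timeout)
        except TimeoutError:
            if attempt >= 1:
                raise            # retry once only
            timeout *= 2         # retry with doubled timeout
        except PermissionError:  # 403/404: log and give up
            return None
        except ConnectionError:  # 429: exponential backoff
            if attempt >= max_retries:
                raise
            time.sleep(2 ** attempt)
```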

3. Content Repository

Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.

Core Operations

Store New Document:

def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    # url_hash backs the unique key used for deduplication (see Table
    # Quick Reference); computed here assuming it is a plain column,
    # not a generated one.
    sql = """
    INSERT INTO documents
        (source_id, title, url, url_hash, doc_type, crawl_date, crawl_status, raw_content_path)
    VALUES (%s, %s, %s, SHA2(%s, 256), %s, NOW(), 'completed', %s)
    ON DUPLICATE KEY UPDATE
        version = version + 1,
        crawl_date = NOW(),
        raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, url, doc_type, raw_content_path))
    return cursor.lastrowid

Check Duplicate:

def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None

Table Quick Reference

| Table | Purpose | Key Fields |
|---|---|---|
| sources | Authorized content sources | source_type, credibility_tier, vendor |
| documents | Crawled document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision |
| topics | Taxonomy | topic_slug, parent_topic_id |

Status Values

  • crawl_status: pending | completed | failed | stale
  • review_status: pending | in_review | approved | needs_refactor | rejected
  • decision: approve | refactor | deep_research | reject
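A minimal guard for writes to these enum-like fields, restating the value sets above; `validate_status` is a hypothetical helper, not part of the schema.

```python
STATUS_VALUES = {
    "crawl_status": {"pending", "completed", "failed", "stale"},
    "review_status": {"pending", "in_review", "approved", "needs_refactor", "rejected"},
    "decision": {"approve", "refactor", "deep_research", "reject"},
}

def validate_status(field, value):
    """Raise ValueError unless `value` is an allowed member of `field`."""
    if value not in STATUS_VALUES.get(field, set()):
        raise ValueError(f"invalid {field}: {value!r}")
    return value
```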

4. Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

Distillation Goals

  1. Compress - Reduce token count while preserving essential information
  2. Structure - Organize content for easy retrieval and reference
  3. Extract - Pull out code snippets, key concepts, and actionable patterns
  4. Annotate - Add metadata for searchability and categorization

Extract Key Components

Extract Code Snippets:

import re

def extract_code_snippets(content):
    # Fenced blocks: optional language tag, then everything up to the
    # closing fence.
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            # get_surrounding_text (not shown) returns ~200 chars of context
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets

Extract Key Concepts:

def extract_key_concepts(content, title):
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {content[:8000]}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    return claude_extract(prompt)

Summary Template

# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}

Quality Metrics

| Metric | Target |
|---|---|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
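The compression target can be checked with a rough sketch like the one below; whitespace tokens stand in for real model tokens, so treat the ratio as approximate.

```python
def compression_ratio(original, distilled):
    """Ratio of distilled to original size, using whitespace tokens
    as a rough proxy for model tokens."""
    return len(distilled.split()) / max(len(original.split()), 1)

def meets_target(original, distilled, low=0.25, high=0.35):
    """True when the distilled text lands inside the 25-35% band."""
    return low <= compression_ratio(original, distilled) <= high
```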

5. Quality Reviewer

Evaluates distilled content, routes decisions, and triggers refactoring or additional research.

Review Workflow

[Distilled Content]
       │
       ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
       │
       ├── ≥ 0.85 → APPROVE → markdown-exporter
       ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
       ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
       └── < 0.40 → REJECT → archive with reason

Scoring Criteria

| Criterion | Weight | Checks |
|---|---|---|
| Accuracy | 0.25 | Factual correctness, up-to-date info, proper attribution |
| Completeness | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| Clarity | 0.20 | Clear structure, concise language, logical flow |
| PE Quality | 0.25 | Demonstrates techniques, before/after examples, explains why |
| Usability | 0.10 | Easy to reference, searchable keywords, appropriate length |

Calculate Final Score

WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )

Route Decision

def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum", None
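Worked end to end on one illustrative assessment; the weights and score function restate those above so the snippet runs standalone.

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10,
}

def calculate_quality_score(assessment):
    return sum(assessment[c]["score"] * w for c, w in WEIGHTS.items())

# Illustrative per-criterion scores for one distilled document.
assessment = {
    "accuracy": {"score": 0.90},
    "completeness": {"score": 0.80},
    "clarity": {"score": 0.80},
    "prompt_engineering_quality": {"score": 0.90},
    "usability": {"score": 0.70},
}

score = calculate_quality_score(assessment)
band = ("approve" if score >= 0.85 else
        "refactor" if score >= 0.60 else
        "deep_research" if score >= 0.40 else
        "reject")
# score evaluates to 0.84 (within float error), landing in the refactor band
```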

Prompt Engineering Quality Checklist

  • Demonstrates specific techniques (CoT, few-shot, etc.)
  • Shows before/after examples
  • Explains why techniques work, not just what
  • Provides actionable patterns
  • Includes edge cases and failure modes
  • References authoritative sources

6. Markdown Exporter

Exports approved content as structured markdown files for Claude Projects or fine-tuning.

Export Structure

Nested by Topic (recommended):

exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   └── 02-few-shot-prompting.md
├── claude-models/
│   ├── _index.md
│   └── 01-model-comparison.md
└── agent-building/
    └── 01-tool-use.md
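Generating a topic's _index.md can be sketched as below; `write_topic_index` and its input keys are assumptions for illustration, not the exporter's actual interface.

```python
from pathlib import Path

def write_topic_index(export_root, topic_slug, docs):
    """Write `<topic>/_index.md` listing that topic's documents in export
    order. Each item in `docs` needs `filename` and `title` keys."""
    topic_dir = Path(export_root) / topic_slug
    topic_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"# {topic_slug.replace('-', ' ').title()}", ""]
    lines += [f"- [{d['title']}]({d['filename']})" for d in docs]
    index_path = topic_dir / "_index.md"
    index_path.write_text("\n".join(lines) + "\n")
    return index_path
```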

Document File Template

def generate_document_file(doc, include_metadata=True):
    content = []
    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append("---")
        content.append("")
    content.append(doc['structured_content'])
    return "\n".join(content)

Fine-tuning Export (JSONL)

import json

def export_fine_tuning_dataset(content_list, config):
    # `config` is reserved for output-path and system-prompt overrides.
    with open('fine_tuning.jsonl', 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
                    {"role": "user", "content": f"Explain {doc['title']}"},
                    {"role": "assistant", "content": doc['structured_content']}
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')

Cross-Reference Generation

def add_cross_references(doc, all_docs):
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])

    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })

    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]

Integration Flow

| From | Output | To |
|---|---|---|
| reference-discovery | URL manifest | web-crawler |
| web-crawler | Raw content + manifest | content-repository |
| content-repository | Document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer (approve) | Approved IDs | markdown-exporter |
| quality-reviewer (refactor) | Instructions | content-distiller |
| quality-reviewer (deep_research) | Queries | web-crawler |
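This handoff table can be read as a single driver loop. `run_pipeline` and its callable arguments are placeholders for the six skills, not real APIs, and the review signature is simplified to a (decision, payload) pair.

```python
def run_pipeline(topic, discover, crawl, store, distill, review, export):
    """Wire the six skills together following the handoff table above.
    Each argument is a callable standing in for one skill."""
    manifest = discover(topic)                 # reference-discovery
    for entry in manifest["urls"]:
        raw = crawl(entry["url"])              # web-crawler
        doc_id = store(entry, raw)             # content-repository
        distilled = distill(doc_id)            # content-distiller
        decision, payload = review(distilled)  # quality-reviewer
        passes = 0
        while decision == "refactor" and passes < 2:  # bounded refactor loop
            distilled = distill(doc_id, instructions=payload)
            decision, payload = review(distilled)
            passes += 1
        if decision == "approve":
            export(doc_id)                     # markdown-exporter
        # deep_research and reject are handled out of band in this sketch
```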