our-claude-skills/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md
Andrew Yim 243b9d851c feat(reference-curator): Add Claude.ai Projects export format
Add claude-project/ folder with skill files formatted for upload to
Claude.ai Projects (web interface):

- reference-curator-complete.md: All 6 skills consolidated
- INDEX.md: Overview and workflow documentation
- Individual skill files (01-06) without YAML frontmatter

Add --claude-ai option to install.sh:
- Lists available files for upload
- Optionally copies to custom destination directory
- Provides upload instructions for Claude.ai

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:33:06 +07:00


Reference Curator - Complete Skill Set

This document contains all 6 skills for curating, processing, and exporting reference documentation.


1. Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

Source Priority Hierarchy

| Tier | Source Type | Examples |
|---|---|---|
| Tier 1 | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| Tier 1 | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| Tier 1 | Official GitHub repos | github.com/anthropics/, github.com/openai/ |
| Tier 2 | Research papers | arxiv.org, papers with citations |
| Tier 2 | Verified community guides | Cookbook examples, official tutorials |
| Tier 3 | Community content | Blog posts, tutorials, Stack Overflow |

Discovery Workflow

Step 1: Define Search Scope

search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",
    "max_results_per_query": 20
}

Step 2: Generate Search Queries

def generate_queries(topic, vendors):
    queries = []
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:github.com/{vendor} {topic}")
    queries.append(f"site:arxiv.org {topic}")
    return queries

Step 3: Validate and Score Sources

def score_source(url, title):
    # Base score from domain authority. `title` is reserved for keyword
    # bonuses, which is why totals are capped at 1.0 below.
    score = 0.0
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'

Output Format

{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
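A consumer of this manifest can be sketched as follows; `load_crawl_targets` and the 0.40 cutoff are illustrative, not part of the skill's API.

```python
import json

def load_crawl_targets(manifest_path, min_score=0.40):
    """Read a discovery manifest and return the URL entries worth
    crawling, highest credibility first."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    targets = [u for u in manifest["urls"] if u["credibility_score"] >= min_score]
    return sorted(targets, key=lambda u: u["credibility_score"], reverse=True)
```

The sorted list feeds directly into the web crawler in the next section.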

2. Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

Crawl Configuration

firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true

Crawl Workflow

Determine Crawl Strategy

def select_strategy(url):
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'
    else:
        return 'scrape'  # default, including docs sites

Execute Firecrawl

# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"]
)

Rate Limiting

import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps older than the 60-second window.
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())

Error Handling

| Error | Action |
|---|---|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as failed |
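This policy can be sketched as a retry wrapper. `fetch_with_retries` is illustrative, and the built-in exception types stand in for Firecrawl's real error signals.

```python
import time

def fetch_with_retries(fetch, url, timeout=30, max_retries=3):
    """Apply the error policy above to any `fetch(url, timeout)` callable.
    TimeoutError, PermissionError, and ConnectionError stand in for
    timeout, 403/404, and 429 responses respectively."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url, timeout=timeout)
        except TimeoutError:
            if attempt >= 1:
                raise            # retry once only
            timeout *= 2         # retry with doubled timeout
        except PermissionError:  # 403/404: log and give up
            return None
        except ConnectionError:  # 429: exponential backoff
            if attempt >= max_retries:
                raise
            time.sleep(2 ** attempt)
```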

3. Content Repository

Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.

Core Operations

Store New Document:

def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    # url_hash backs the unique key used for deduplication (see Table
    # Quick Reference); computed here assuming it is a plain column,
    # not a generated one.
    sql = """
    INSERT INTO documents
        (source_id, title, url, url_hash, doc_type, crawl_date, crawl_status, raw_content_path)
    VALUES (%s, %s, %s, SHA2(%s, 256), %s, NOW(), 'completed', %s)
    ON DUPLICATE KEY UPDATE
        version = version + 1,
        crawl_date = NOW(),
        raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, url, doc_type, raw_content_path))
    return cursor.lastrowid

Check Duplicate:

def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None

Table Quick Reference

| Table | Purpose | Key Fields |
|---|---|---|
| sources | Authorized content sources | source_type, credibility_tier, vendor |
| documents | Crawled document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision |
| topics | Taxonomy | topic_slug, parent_topic_id |

Status Values

  • crawl_status: pending | completed | failed | stale
  • review_status: pending | in_review | approved | needs_refactor | rejected
  • decision: approve | refactor | deep_research | reject
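A minimal guard for writes to these enum-like fields, restating the value sets above; `validate_status` is a hypothetical helper, not part of the schema.

```python
STATUS_VALUES = {
    "crawl_status": {"pending", "completed", "failed", "stale"},
    "review_status": {"pending", "in_review", "approved", "needs_refactor", "rejected"},
    "decision": {"approve", "refactor", "deep_research", "reject"},
}

def validate_status(field, value):
    """Raise ValueError unless `value` is an allowed member of `field`."""
    if value not in STATUS_VALUES.get(field, set()):
        raise ValueError(f"invalid {field}: {value!r}")
    return value
```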

4. Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

Distillation Goals

  1. Compress - Reduce token count while preserving essential information
  2. Structure - Organize content for easy retrieval and reference
  3. Extract - Pull out code snippets, key concepts, and actionable patterns
  4. Annotate - Add metadata for searchability and categorization

Extract Key Components

Extract Code Snippets:

import re

def extract_code_snippets(content):
    # Fenced blocks: optional language tag, then everything up to the
    # closing fence.
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            # get_surrounding_text (not shown) returns ~200 chars of context
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets

Extract Key Concepts:

def extract_key_concepts(content, title):
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {content[:8000]}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    return claude_extract(prompt)

Summary Template

# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}

Quality Metrics

| Metric | Target |
|---|---|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
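The compression target can be checked with a rough sketch like the one below; whitespace tokens stand in for real model tokens, so treat the ratio as approximate.

```python
def compression_ratio(original, distilled):
    """Ratio of distilled to original size, using whitespace tokens
    as a rough proxy for model tokens."""
    return len(distilled.split()) / max(len(original.split()), 1)

def meets_target(original, distilled, low=0.25, high=0.35):
    """True when the distilled text lands inside the 25-35% band."""
    return low <= compression_ratio(original, distilled) <= high
```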

5. Quality Reviewer

Evaluates distilled content, routes decisions, and triggers refactoring or additional research.

Review Workflow

[Distilled Content]
       │
       ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
       │
       ├── ≥ 0.85 → APPROVE → markdown-exporter
       ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
       ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
       └── < 0.40 → REJECT → archive with reason

Scoring Criteria

| Criterion | Weight | Checks |
|---|---|---|
| Accuracy | 0.25 | Factual correctness, up-to-date info, proper attribution |
| Completeness | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| Clarity | 0.20 | Clear structure, concise language, logical flow |
| PE Quality | 0.25 | Demonstrates techniques, before/after examples, explains why |
| Usability | 0.10 | Easy to reference, searchable keywords, appropriate length |

Calculate Final Score

WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )

Route Decision

def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum", None
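Worked end to end on one illustrative assessment; the weights and score function restate those above so the snippet runs standalone.

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10,
}

def calculate_quality_score(assessment):
    return sum(assessment[c]["score"] * w for c, w in WEIGHTS.items())

# Illustrative per-criterion scores for one distilled document.
assessment = {
    "accuracy": {"score": 0.90},
    "completeness": {"score": 0.80},
    "clarity": {"score": 0.80},
    "prompt_engineering_quality": {"score": 0.90},
    "usability": {"score": 0.70},
}

score = calculate_quality_score(assessment)
band = ("approve" if score >= 0.85 else
        "refactor" if score >= 0.60 else
        "deep_research" if score >= 0.40 else
        "reject")
# score evaluates to 0.84 (within float error), landing in the refactor band
```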

Prompt Engineering Quality Checklist

  • Demonstrates specific techniques (CoT, few-shot, etc.)
  • Shows before/after examples
  • Explains why techniques work, not just what
  • Provides actionable patterns
  • Includes edge cases and failure modes
  • References authoritative sources

6. Markdown Exporter

Exports approved content as structured markdown files for Claude Projects or fine-tuning.

Export Structure

Nested by Topic (recommended):

exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   └── 02-few-shot-prompting.md
├── claude-models/
│   ├── _index.md
│   └── 01-model-comparison.md
└── agent-building/
    └── 01-tool-use.md
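Generating a topic's _index.md can be sketched as below; `write_topic_index` and its input keys are assumptions for illustration, not the exporter's actual interface.

```python
from pathlib import Path

def write_topic_index(export_root, topic_slug, docs):
    """Write `<topic>/_index.md` listing that topic's documents in export
    order. Each item in `docs` needs `filename` and `title` keys."""
    topic_dir = Path(export_root) / topic_slug
    topic_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"# {topic_slug.replace('-', ' ').title()}", ""]
    lines += [f"- [{d['title']}]({d['filename']})" for d in docs]
    index_path = topic_dir / "_index.md"
    index_path.write_text("\n".join(lines) + "\n")
    return index_path
```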

Document File Template

def generate_document_file(doc, include_metadata=True):
    content = []
    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append("---")
        content.append("")
    content.append(doc['structured_content'])
    return "\n".join(content)

Fine-tuning Export (JSONL)

import json

def export_fine_tuning_dataset(content_list, config):
    # `config` is reserved for output-path and system-prompt overrides.
    with open('fine_tuning.jsonl', 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
                    {"role": "user", "content": f"Explain {doc['title']}"},
                    {"role": "assistant", "content": doc['structured_content']}
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')

Cross-Reference Generation

def add_cross_references(doc, all_docs):
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])

    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })

    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]

Integration Flow

| From | Output | To |
|---|---|---|
| reference-discovery | URL manifest | web-crawler |
| web-crawler | Raw content + manifest | content-repository |
| content-repository | Document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer (approve) | Approved IDs | markdown-exporter |
| quality-reviewer (refactor) | Instructions | content-distiller |
| quality-reviewer (deep_research) | Queries | web-crawler |
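This handoff table can be read as a single driver loop. `run_pipeline` and its callable arguments are placeholders for the six skills, not real APIs, and the review signature is simplified to a (decision, payload) pair.

```python
def run_pipeline(topic, discover, crawl, store, distill, review, export):
    """Wire the six skills together following the handoff table above.
    Each argument is a callable standing in for one skill."""
    manifest = discover(topic)                 # reference-discovery
    for entry in manifest["urls"]:
        raw = crawl(entry["url"])              # web-crawler
        doc_id = store(entry, raw)             # content-repository
        distilled = distill(doc_id)            # content-distiller
        decision, payload = review(distilled)  # quality-reviewer
        passes = 0
        while decision == "refactor" and passes < 2:  # bounded refactor loop
            distilled = distill(doc_id, instructions=payload)
            decision, payload = review(distilled)
            passes += 1
        if decision == "approve":
            export(doc_id)                     # markdown-exporter
        # deep_research and reject are handled out of band in this sketch
```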