Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-29 00:20:27 +07:00

6.8 KiB

name	description
content-distiller	Analyzes and distills raw crawled content into concise reference materials. Extracts key concepts, code snippets, and creates structured summaries optimized for prompt engineering use cases. Triggers on "distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary".

Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

Distillation Goals

Compress - Reduce token count while preserving essential information
Structure - Organize content for easy retrieval and reference
Extract - Pull out code snippets, key concepts, and actionable patterns
Annotate - Add metadata for searchability and categorization

Distillation Workflow

Step 1: Load Raw Content

def load_for_distillation(cursor):
    """Get documents ready for distillation."""
    sql = """
    SELECT d.doc_id, d.title, d.url, d.raw_content_path, 
           d.doc_type, s.source_type, s.credibility_tier
    FROM documents d
    JOIN sources s ON d.source_id = s.source_id
    LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
    WHERE d.crawl_status = 'completed'
    AND dc.distill_id IS NULL
    ORDER BY s.credibility_tier ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()

Step 2: Analyze Content Structure

Identify the content type and structural features that determine the appropriate distillation strategy:

import re

def analyze_structure(content, doc_type):
    """Analyze document structure for distillation."""
    # Match markdown headers once; reused for the flag and the section count
    headers = re.findall(r'^#+\s', content, re.MULTILINE)
    analysis = {
        "doc_type": doc_type,  # Carried through so a strategy can be chosen later
        "has_code_blocks": bool(re.search(r'```[\s\S]*?```', content)),
        "has_headers": bool(headers),
        "has_lists": bool(re.search(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.search(r'\|.*\|', content)),
        "estimated_tokens": int(len(content.split()) * 1.3),  # Rough estimate
        "section_count": len(headers)
    }
    return analysis

Step 3: Extract Key Components

Extract Code Snippets:

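import re

def extract_code_snippets(content):
    """Extract all fenced code blocks with their language tags."""
    # The language tag may contain '+', '#', or '-' (e.g. c++, c#, objective-c)
    pattern = r'```([\w+#-]*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",  # Untagged fences default to "text"
            "code": match.group(2).strip(),
            # get_surrounding_text is a helper that is not shown in this file
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets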
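
The snippet extractor calls get_surrounding_text, which is not defined in this file. A minimal sketch of what such a helper might look like follows, assuming "context" means the text immediately before the code fence (the text after the match start is the code itself). The signature and behavior are assumptions inferred from the call site, not the project's actual implementation.

```python
def get_surrounding_text(content: str, position: int, window: int = 200) -> str:
    """Return up to `window` characters of text that precede `position`.

    Hypothetical helper: the real project may also include trailing text
    or split on paragraph boundaries instead of a fixed character window.
    """
    start = max(0, position - window)  # Clamp so we never slice before index 0
    return content[start:position].strip()
```

A fixed character window is the simplest choice; it can cut a sentence in half, so a paragraph-aware variant may produce more readable context.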

Extract Key Concepts:

def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    # Limit the content to its first 8000 characters to fit the context budget.
    # (A comment placed inside the f-string would be sent as part of the prompt.)
    excerpt = content[:8000]
    prompt = f"""
    Analyze this document and extract key concepts:
    
    Title: {title}
    Content: {excerpt}
    
    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # Use Claude API to process (claude_extract is not shown in this file)
    return claude_extract(prompt)
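
The prompt above asks for JSON with three list fields, but a model reply may arrive wrapped in a code fence or with a field missing. The sketch below shows one way the reply could be normalized before storage. The function name parse_concepts_response and its fallback behavior (empty lists for anything missing or malformed) are assumptions for illustration, not part of the skill.

```python
import json
import re

EXPECTED_KEYS = ("concepts", "techniques", "best_practices")
FENCE = "`" * 3  # Built at runtime so this example contains no literal fence


def parse_concepts_response(raw: str) -> dict:
    """Parse a model reply into a dict holding the three expected lists."""
    # Strip an optional fenced wrapper (with or without a "json" tag).
    fenced = re.search(FENCE + r"(?:json)?\s*([\s\S]*?)" + FENCE, raw)
    payload = fenced.group(1) if fenced else raw
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Guarantee every expected key exists and holds a list.
    return {
        key: data[key] if isinstance(data.get(key), list) else []
        for key in EXPECTED_KEYS
    }
```

Returning empty lists instead of raising keeps the pipeline moving; a stricter variant could raise so the document is retried.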

Step 4: Create Structured Summary

Summary Template:

# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts with brief definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets with context}

## Best Practices
{actionable recommendations}

## Related Topics
{links to related content in library}
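
As a rough illustration of how the template might be filled, the sketch below renders the header and three of the sections from plain Python values. The function names, the shape of the doc dict (keys taken from the Step 1 query columns), and the "- (none)" placeholder are assumptions; the remaining sections would follow the same pattern.

```python
from datetime import date


def bullets(items):
    """Render a list of strings as a markdown bullet list."""
    return "\n".join(f"- {item}" for item in items) or "- (none)"


def render_summary(doc, overview, concepts, best_practices, distilled_on=None):
    """Fill part of the summary template (hypothetical helper)."""
    distilled_on = distilled_on or date.today().isoformat()
    # Each concept is a dict with "term" and "definition", as in Step 3.
    concept_lines = [f"**{c['term']}**: {c['definition']}" for c in concepts]
    return "\n".join([
        f"# {doc['title']}",
        "",
        f"**Source:** {doc['url']}",
        f"**Type:** {doc['source_type']} | **Tier:** {doc['credibility_tier']}",
        f"**Distilled:** {distilled_on}",
        "",
        "## Executive Summary",
        overview,
        "",
        "## Key Concepts",
        bullets(concept_lines),
        "",
        "## Best Practices",
        bullets(best_practices),
    ])
```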

Step 5: Optimize for Tokens

def optimize_content(structured_content, target_ratio=0.30):
    """
    Compress content to target ratio while preserving quality.
    Target: 30% of original token count.
    """
    original_tokens = count_tokens(structured_content)
    target_tokens = int(original_tokens * target_ratio)
    
    # Prioritized compression strategies
    strategies = [
        remove_redundant_explanations,
        condense_examples,
        merge_similar_sections,
        trim_verbose_descriptions
    ]
    
    optimized = structured_content
    for strategy in strategies:
        if count_tokens(optimized) > target_tokens:
            optimized = strategy(optimized)
    
    return optimized
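
optimize_content relies on count_tokens and four strategy functions that are not defined in this file. The sketch below shows what two of those pieces might look like: a count_tokens that reuses the words-times-1.3 estimate from Step 2, and drop_duplicate_lines, a hypothetical stand-in for a strategy such as remove_redundant_explanations. Both are assumptions; a real tokenizer would give more accurate counts.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words times 1.3."""
    return int(len(text.split()) * 1.3)


def drop_duplicate_lines(text: str) -> str:
    """Hypothetical strategy: keep only the first occurrence of each
    non-blank line (compared case-insensitively). Blank lines are kept."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key in seen:
            continue  # Repeated non-blank line: skip it
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)
```

Each strategy takes a string and returns a string, which is the only contract the loop in optimize_content depends on.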

Step 6: Store Distilled Content

import json

def store_distilled(cursor, doc_id, summary, key_concepts,
                    code_snippets, structured_content,
                    original_tokens, distilled_tokens):
    """Insert a distilled record with review_status 'pending'; return the new row ID."""
    sql = """
    INSERT INTO distilled_content
    (doc_id, summary, key_concepts, code_snippets, structured_content,
     token_count_original, token_count_distilled, distill_model, review_status)
    VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
    """
    # Concepts and snippets are serialized to JSON strings before insertion
    cursor.execute(sql, (
        doc_id, summary,
        json.dumps(key_concepts),
        json.dumps(code_snippets),
        structured_content,
        original_tokens,
        distilled_tokens
    ))
    return cursor.lastrowid

Distillation Prompts

For Prompt Engineering Content:

Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied

For API Documentation:

Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices

For Research Papers:

Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
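
One way to wire these focus lists into a prompt is a lookup keyed by document type, sketched below. The keys ("prompt_engineering", "api_documentation", "research_paper") and the function name build_focus_block are hypothetical labels; the actual doc_type values stored in the documents table may differ.

```python
# Focus points copied from the three lists above; the dict keys are
# hypothetical labels, not confirmed doc_type values.
FOCUS_AREAS = {
    "prompt_engineering": [
        "Specific techniques with before/after examples",
        "Why techniques work (not just what)",
        "Common pitfalls and how to avoid them",
        "Actionable patterns that can be directly applied",
    ],
    "api_documentation": [
        "Endpoint specifications and parameters",
        "Request/response examples",
        "Error codes and handling",
        "Rate limits and best practices",
    ],
    "research_paper": [
        "Key findings and conclusions",
        "Novel techniques introduced",
        "Practical applications",
        "Limitations and caveats",
    ],
}


def build_focus_block(doc_type: str) -> str:
    """Return the numbered 'Focus on:' block for a document type."""
    try:
        points = FOCUS_AREAS[doc_type]
    except KeyError:
        raise ValueError(f"no focus list for doc_type {doc_type!r}") from None
    numbered = [f"{i}. {point}" for i, point in enumerate(points, start=1)]
    return "\n".join(["Focus on:", *numbered])
```

The returned block can be appended to the extraction prompt from Step 3 so the model's attention matches the kind of document being distilled.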

Quality Metrics

Track compression efficiency and content quality:

Metric	Target
Compression Ratio	25-35% of original
Key Concept Coverage	≥90% of important terms
Code Snippet Retention	100% of relevant examples
Readability	Clear, scannable structure
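
The two numeric targets in the table can be checked mechanically. The sketch below is one possible approach, assuming that "coverage" means the fraction of important terms that appear as case-insensitive substrings of the distilled text; that definition and all function names are assumptions, since the file does not specify how coverage is measured.

```python
def compression_ratio(original_tokens: int, distilled_tokens: int) -> float:
    """Distilled size as a fraction of the original size."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return distilled_tokens / original_tokens


def concept_coverage(important_terms, distilled_text: str) -> float:
    """Fraction of terms found in the text (case-insensitive substring match)."""
    if not important_terms:
        return 1.0  # Nothing to cover counts as full coverage
    text = distilled_text.lower()
    found = sum(1 for term in important_terms if term.lower() in text)
    return found / len(important_terms)


def meets_targets(original_tokens, distilled_tokens, important_terms, distilled_text):
    """Compare a distillation against the 25-35% and >=90% targets."""
    ratio = compression_ratio(original_tokens, distilled_tokens)
    coverage = concept_coverage(important_terms, distilled_text)
    return {
        "compression_ok": 0.25 <= ratio <= 0.35,
        "coverage_ok": coverage >= 0.90,
    }
```

Substring matching is crude (it misses synonyms and inflected forms), so it is better treated as a quick screen than as a substitute for the quality-reviewer step.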

Handling Refactor Requests

When quality-reviewer returns a refactor decision:

def handle_refactor(distill_id, instructions):
    """Re-distill based on reviewer feedback."""
    # Load original content and existing distillation
    original = load_raw_content(distill_id)
    existing = load_distilled_content(distill_id)
    
    # Apply specific improvements based on instructions; the original is
    # passed along so omitted detail can be restored from the source
    improved = apply_improvements(existing, instructions, original)
    
    # Update distilled_content
    update_distilled(distill_id, improved)
    
    # Reset review status so quality-reviewer picks the record up again
    set_review_status(distill_id, 'pending')

Integration

From	Input	To
content-repository	Raw document records	content-distiller
content-distiller	Distilled content	quality-reviewer
quality-reviewer	Refactor instructions	content-distiller (loop)
