Andrew Yim 243b9d851c feat(reference-curator): Add Claude.ai Projects export format

Add claude-project/ folder with skill files formatted for upload to
Claude.ai Projects (web interface):

- reference-curator-complete.md: All 6 skills consolidated
- INDEX.md: Overview and workflow documentation
- Individual skill files (01-06) without YAML frontmatter

Add --claude-ai option to install.sh:
- Lists available files for upload
- Optionally copies to custom destination directory
- Provides upload instructions for Claude.ai

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-29 00:33:06 +07:00

Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

Distillation Goals

Compress - Reduce token count while preserving essential information
Structure - Organize content for easy retrieval and reference
Extract - Pull out code snippets, key concepts, and actionable patterns
Annotate - Add metadata for searchability and categorization

Distillation Workflow

Step 1: Load Raw Content

def load_for_distillation(cursor):
    """Get documents ready for distillation."""
    sql = """
    SELECT d.doc_id, d.title, d.url, d.raw_content_path, 
           d.doc_type, s.source_type, s.credibility_tier
    FROM documents d
    JOIN sources s ON d.source_id = s.source_id
    LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
    WHERE d.crawl_status = 'completed'
    AND dc.distill_id IS NULL
    ORDER BY s.credibility_tier ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()
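
The selection logic above is a LEFT JOIN anti-join: a document qualifies only if its crawl completed and no distilled_content row exists for it yet. The sketch below runs a trimmed version of the same query against an in-memory SQLite database. The schema, the sample rows, and the integer credibility_tier values are assumptions made for illustration; the real tables are likely to carry more columns.

```python
import sqlite3

# Hypothetical minimal schema: only the columns the query touches.
SCHEMA = """
CREATE TABLE sources (source_id INTEGER PRIMARY KEY, source_type TEXT,
                      credibility_tier INTEGER);
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, source_id INTEGER,
                        title TEXT, crawl_status TEXT);
CREATE TABLE distilled_content (distill_id INTEGER PRIMARY KEY, doc_id INTEGER);
"""

PENDING_SQL = """
SELECT d.doc_id, d.title, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
WHERE d.crawl_status = 'completed'
AND dc.distill_id IS NULL
ORDER BY s.credibility_tier ASC
"""

def pending_documents():
    """Return crawled-but-undistilled documents, most credible source first."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sources VALUES (?, ?, ?)",
        [(1, "official_docs", 1), (2, "blog", 3)],
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?, ?)",
        [
            (10, 2, "Blog post", "completed"),
            (11, 1, "API guide", "completed"),
            (12, 1, "Already distilled", "completed"),
            (13, 1, "Still crawling", "pending"),
        ],
    )
    # Document 12 already has a distillation, so the anti-join excludes it.
    conn.execute("INSERT INTO distilled_content VALUES (1, 12)")
    rows = conn.execute(PENDING_SQL).fetchall()
    conn.close()
    return rows
```

Only documents 11 and 10 come back, in that order: 12 is already distilled, 13 has not finished crawling, and the tier-1 source sorts ahead of the tier-3 one.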

Step 2: Analyze Content Structure

Identify the content type and select an appropriate distillation strategy:

import re

def analyze_structure(content, doc_type):
    """Analyze document structure for distillation.

    doc_type is accepted for the caller's use and is not inspected here.
    """
    # Find Markdown headers once; used for both the flag and the count
    headers = re.findall(r'^#+\s', content, re.MULTILINE)
    analysis = {
        "has_code_blocks": bool(re.search(r'```[\s\S]*?```', content)),
        "has_headers": bool(headers),
        "has_lists": bool(re.search(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.search(r'\|.*\|', content)),
        "estimated_tokens": int(len(content.split()) * 1.3),  # Rough estimate
        "section_count": len(headers)
    }
    return analysis
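
The analysis dictionary only describes the document; the strategy selection mentioned at the start of this step is not shown. One possible way to map the analysis to a strategy is sketched below. The function name, the strategy labels, and the 6000-token threshold are all hypothetical and are not part of the skill itself.

```python
def choose_strategy(analysis, max_single_pass_tokens=6000):
    """Pick a (hypothetical) distillation strategy label from an analysis dict.

    Expects the keys produced by analyze_structure.
    """
    # Long documents with headers can be distilled one section at a time.
    if analysis["estimated_tokens"] > max_single_pass_tokens and analysis["has_headers"]:
        return "section_by_section"
    # Code-heavy documents: extract snippets first, then summarize the prose.
    if analysis["has_code_blocks"]:
        return "code_first"
    # Tables and lists already carry structure worth keeping as-is.
    if analysis["has_tables"] or analysis["has_lists"]:
        return "preserve_structure"
    return "prose_summary"
```

The checks are ordered from most to least specific, so a long, code-heavy document with headers is still split by section first.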

Step 3: Extract Key Components

Extract Code Snippets:

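import re

def extract_code_snippets(content):
    """Extract all fenced code blocks with their language tags."""
    # [\w+-]* also accepts tags such as "c++" or "objective-c"
    pattern = r'```([\w+-]*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",  # Untagged blocks default to "text"
            "code": match.group(2).strip(),
            # get_surrounding_text: helper returning text near the match
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets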
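
get_surrounding_text is called above but not defined in this file. A minimal sketch follows, assuming the third argument is a character window and that the useful context is the text leading up to the code block; the actual helper may behave differently.

```python
def get_surrounding_text(content, position, window=200):
    """Return up to `window` characters that precede `position`, stripped.

    Assumption: the prose introducing a code block sits right before it.
    """
    start = max(0, position - window)
    return content[start:position].strip()
```

Clamping the start index to zero keeps the slice valid when a code block appears near the top of the document.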

Extract Key Concepts:

def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    # Limit for context. Slicing here keeps the comment out of the prompt text.
    excerpt = content[:8000]
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {excerpt}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # Use Claude API to process
    return claude_extract(prompt)
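
The prompt asks for JSON, but a model reply can arrive wrapped in prose or omit a key. The sketch below shows one way to parse such a reply defensively. parse_extraction is a hypothetical helper, not part of the skill, and it makes no API call: it only handles the reply text.

```python
import json
import re

EXPECTED_KEYS = ("concepts", "techniques", "best_practices")

def parse_extraction(response_text):
    """Parse a model reply into the three expected lists.

    Missing keys, non-list values, and unparseable replies all fall back
    to empty lists so downstream code can rely on the shape.
    """
    empty = {key: [] for key in EXPECTED_KEYS}
    # Take the outermost {...} span so surrounding prose is ignored.
    match = re.search(r"\{[\s\S]*\}", response_text)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    return {
        key: data[key] if isinstance(data.get(key), list) else []
        for key in EXPECTED_KEYS
    }
```

Returning a fixed shape means the storage step can serialize the result without checking which keys the model happened to include.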

Step 4: Create Structured Summary

Summary Template:

# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts with brief definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets with context}

## Best Practices
{actionable recommendations}

## Related Topics
{links to related content in library}
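
Filling the template can be done with plain string formatting. The following is a sketch only: the placeholder names, the shape of the `doc` and `sections` dictionaries, and the "(none)" fallback are assumptions introduced for this example.

```python
from datetime import date

# Copy of the template above with illustrative placeholder names.
SUMMARY_TEMPLATE = """# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {distilled_on}

## Executive Summary
{executive_summary}

## Key Concepts
{key_concepts}

## Techniques & Patterns
{techniques}

## Code Examples
{code_examples}

## Best Practices
{best_practices}

## Related Topics
{related_topics}
"""

def bullets(items):
    """Render a list of strings as a Markdown bullet list."""
    return "\n".join(f"- {item}" for item in items) if items else "- (none)"

def render_summary(doc, sections, distilled_on=None):
    """Fill the template from a document record and a dict of sections."""
    return SUMMARY_TEMPLATE.format(
        title=doc["title"],
        url=doc["url"],
        source_type=doc["source_type"],
        credibility_tier=doc["credibility_tier"],
        distilled_on=distilled_on or date.today().isoformat(),
        executive_summary=sections.get("executive_summary", ""),
        key_concepts=bullets(sections.get("key_concepts", [])),
        techniques=bullets(sections.get("techniques", [])),
        code_examples=sections.get("code_examples", "(none)"),
        best_practices=bullets(sections.get("best_practices", [])),
        related_topics=bullets(sections.get("related_topics", [])),
    )
```

Empty sections render as "- (none)" rather than disappearing, so every distilled document keeps the same set of headings.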

Step 5: Optimize for Tokens

def optimize_content(structured_content, target_ratio=0.30):
    """
    Compress content to target ratio while preserving quality.
    Target: 30% of original token count.
    """
    original_tokens = count_tokens(structured_content)
    target_tokens = int(original_tokens * target_ratio)
    
    # Prioritized compression strategies
    strategies = [
        remove_redundant_explanations,
        condense_examples,
        merge_similar_sections,
        trim_verbose_descriptions
    ]
    
    optimized = structured_content
    for strategy in strategies:
        if count_tokens(optimized) > target_tokens:
            optimized = strategy(optimized)
    
    return optimized
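
count_tokens and the four strategy functions are not defined in this file. To make the loop concrete, here is a self-contained sketch with a rough word-based token estimate (the same 1.3 multiplier used in Step 2) and two toy strategies. The strategy names and the filler-phrase list are hypothetical stand-ins, not the skill's actual implementations.

```python
import re

def count_tokens(text):
    """Rough token estimate: whitespace-separated words times 1.3."""
    return int(len(text.split()) * 1.3)

def drop_duplicate_lines(text):
    """Remove repeated non-blank lines, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        key = line.strip().lower()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)

# Illustrative filler phrases only; a real list would be tuned to the corpus.
FILLER_PATTERNS = [
    r"\bit is important to note that\s+",
    r"\bbasically,?\s+",
    r"\bas mentioned (?:above|earlier),?\s+",
]

def strip_filler_phrases(text):
    """Delete common filler phrases that add tokens but no information."""
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text

def optimize(text, target_ratio=0.30,
             strategies=(drop_duplicate_lines, strip_filler_phrases)):
    """Apply strategies in order until the target is met or none remain."""
    target_tokens = int(count_tokens(text) * target_ratio)
    applied = []
    for strategy in strategies:
        if count_tokens(text) <= target_tokens:
            break
        text = strategy(text)
        applied.append(strategy.__name__)
    return text, applied
```

As in the loop above, nothing guarantees the target is reached: once every strategy has run, the result is returned even if it is still over budget, so callers may want to compare the final count against the target.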

Step 6: Store Distilled Content

import json

def store_distilled(cursor, doc_id, summary, key_concepts,
                    code_snippets, structured_content,
                    original_tokens, distilled_tokens):
    """Insert a distilled record with review_status 'pending'; return its row ID."""
    sql = """
    INSERT INTO distilled_content
    (doc_id, summary, key_concepts, code_snippets, structured_content,
     token_count_original, token_count_distilled, distill_model, review_status)
    VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
    """
    cursor.execute(sql, (
        doc_id, summary,
        json.dumps(key_concepts),   # Stored as JSON text
        json.dumps(code_snippets),  # Stored as JSON text
        structured_content,
        original_tokens,
        distilled_tokens
    ))
    return cursor.lastrowid

Distillation Prompts

For Prompt Engineering Content:

Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied

For API Documentation:

Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices

For Research Papers:

Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
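
These focus lists can be kept as data and spliced into the distillation prompt by content type. The sketch below assumes three hypothetical type keys and reuses the 8000-character content limit from Step 3; the prompt wording is illustrative only.

```python
# Focus points copied from the lists above; the dictionary keys are
# hypothetical labels, not values defined by the skill.
FOCUS_BY_TYPE = {
    "prompt_engineering": [
        "Specific techniques with before/after examples",
        "Why techniques work (not just what)",
        "Common pitfalls and how to avoid them",
        "Actionable patterns that can be directly applied",
    ],
    "api_documentation": [
        "Endpoint specifications and parameters",
        "Request/response examples",
        "Error codes and handling",
        "Rate limits and best practices",
    ],
    "research_paper": [
        "Key findings and conclusions",
        "Novel techniques introduced",
        "Practical applications",
        "Limitations and caveats",
    ],
}

def build_distillation_prompt(content_type, title, content, max_chars=8000):
    """Build a distillation prompt with the focus list for `content_type`."""
    focus = FOCUS_BY_TYPE.get(content_type)
    if focus is None:
        raise ValueError(f"Unknown content type: {content_type!r}")
    numbered = "\n".join(f"{i}. {point}" for i, point in enumerate(focus, 1))
    return (
        "Distill the following document.\n\n"
        f"Title: {title}\n\n"
        f"Focus on:\n{numbered}\n\n"
        f"Content:\n{content[:max_chars]}"
    )
```

Raising on an unknown type, rather than falling back to a generic prompt, makes a mislabeled document visible before any tokens are spent on it.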

Quality Metrics

Track compression efficiency:

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
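
The first two metrics can be checked mechanically. The sketch below is one possible reading of them: the compression ratio is distilled tokens divided by original tokens, and coverage is the share of important terms that appear verbatim (case-insensitive) in the distilled text. The function names and the substring-matching rule are assumptions, and substring matching will miss paraphrased concepts.

```python
def compression_ratio(original_tokens, distilled_tokens):
    """Distilled size as a fraction of the original size."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return distilled_tokens / original_tokens

def concept_coverage(important_terms, distilled_text):
    """Fraction of important terms found verbatim in the distilled text."""
    if not important_terms:
        return 1.0
    text = distilled_text.lower()
    found = sum(1 for term in important_terms if term.lower() in text)
    return found / len(important_terms)

def check_quality(original_tokens, distilled_tokens, important_terms, distilled_text):
    """Compare both metrics against the targets in the table above."""
    ratio = compression_ratio(original_tokens, distilled_tokens)
    coverage = concept_coverage(important_terms, distilled_text)
    return {
        "compression_ratio": ratio,
        "coverage": coverage,
        "ratio_ok": 0.25 <= ratio <= 0.35,
        "coverage_ok": coverage >= 0.90,
    }
```

Code snippet retention and readability are left out because they depend on a judgment of relevance and clarity that a simple counter cannot make.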

Handling Refactor Requests

When quality-reviewer returns a refactor decision:

def handle_refactor(distill_id, instructions):
    """Re-distill based on reviewer feedback."""
    # Load original content and existing distillation
    original = load_raw_content(distill_id)
    existing = load_distilled_content(distill_id)

    # Apply specific improvements based on instructions; the original is
    # passed along so details dropped in the first pass can be restored
    improved = apply_improvements(original, existing, instructions)

    # Update distilled_content
    update_distilled(distill_id, improved)

    # Reset review status so the new version is reviewed again
    set_review_status(distill_id, 'pending')

Integration

| From | Input | To |
|------|-------|----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |
