---
name: content-distiller
description: Analyzes and distills raw crawled content into concise reference materials. Extracts key concepts, code snippets, and creates structured summaries optimized for prompt engineering use cases. Triggers on "distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary".
---

# Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

## Distillation Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization

## Distillation Workflow

### Step 1: Load Raw Content

```python
def load_for_distillation(cursor):
    """Get documents ready for distillation."""
    sql = """
    SELECT d.doc_id, d.title, d.url, d.raw_content_path, 
           d.doc_type, s.source_type, s.credibility_tier
    FROM documents d
    JOIN sources s ON d.source_id = s.source_id
    LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
    WHERE d.crawl_status = 'completed'
    AND dc.distill_id IS NULL
    ORDER BY s.credibility_tier ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()
```

### Step 2: Analyze Content Structure

Identify the content type and select an appropriate distillation strategy:

```python
import re

def analyze_structure(content, doc_type):
    """Analyze document structure for distillation."""
    # Headers are matched once and reused for both the flag and the count.
    headers = re.findall(r'^#+\s', content, re.MULTILINE)
    analysis = {
        "doc_type": doc_type,  # carried through so strategy selection can use it
        "has_code_blocks": bool(re.search(r'```[\s\S]*?```', content)),
        "has_headers": bool(headers),
        "has_lists": bool(re.search(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.search(r'\|.*\|', content)),
        "estimated_tokens": int(len(content.split()) * 1.3),  # Rough estimate
        "section_count": len(headers)
    }
    return analysis
```
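
The analysis result can then drive the choice of strategy. A minimal sketch, assuming hypothetical strategy names (`code_focused`, `sectioned`, `chunked`, `flat`) and a hypothetical 4000-token threshold, none of which are defined by this skill:

```python
def select_strategy(analysis, long_doc_tokens=4000):
    """Pick a distillation strategy name from an analyze_structure() result.

    The strategy names and the default threshold are illustrative
    assumptions, not values defined by this skill.
    """
    if analysis["has_code_blocks"]:
        return "code_focused"   # keep snippets, compress the prose around them
    if analysis["has_headers"] and analysis["section_count"] >= 3:
        return "sectioned"      # summarize section by section
    if analysis["estimated_tokens"] > long_doc_tokens:
        return "chunked"        # split long unstructured text before summarizing
    return "flat"               # short and unstructured: summarize in one pass
```

The checks run in priority order, so a long document with code blocks is still treated as `code_focused`.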

### Step 3: Extract Key Components

**Extract Code Snippets:**
```python
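import re

def extract_code_snippets(content):
    """Extract all code blocks with language tags."""
    # The tag class also accepts +, # and - so fences such as c++ or c# match.
    pattern = r'```([\w+#-]*)[ \t]*\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",  # untagged fences default to "text"
            "code": match.group(2).strip(),
            # get_surrounding_text is a helper that is not defined in this
            # document; it is expected to return nearby prose (200 chars).
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets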
```
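
`get_surrounding_text` is called above but not defined in this skill. One possible sketch, assuming that "context" means the prose that introduces a code block, so only the text before the block is returned:

```python
def get_surrounding_text(content, position, window=200):
    """Return up to `window` characters of text that precede `position`.

    Assumption: only the text before the code block is used as context.
    A variant could also include text after the closing fence.
    """
    start = max(0, position - window)
    return content[start:position].strip()
```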

**Extract Key Concepts:**
```python
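def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    # Truncate here rather than inside the f-string: a "#" comment placed
    # inside the string would be sent to the model as part of the prompt.
    excerpt = content[:8000]  # Limit for context
    prompt = f"""
    Analyze this document and extract key concepts:
    
    Title: {title}
    Content: {excerpt}
    
    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # claude_extract is a project helper (not defined here) that sends the
    # prompt to the Claude API and returns the parsed result.
    return claude_extract(prompt)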
```
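
Whatever client `claude_extract` wraps, the reply has to be parsed as JSON, and a model reply may wrap the JSON in a code fence or a sentence of preamble. A hedged sketch of a defensive parser; `parse_extraction` and its empty-list fallback are hypothetical and not part of this skill:

```python
import json
import re

RESULT_KEYS = ("concepts", "techniques", "best_practices")

def _empty_result():
    """Fresh fallback dict so callers never share list objects."""
    return {key: [] for key in RESULT_KEYS}

def parse_extraction(reply_text):
    """Parse a model reply into the concepts/techniques/best_practices dict.

    Takes the outermost {...} span so a surrounding code fence or a line
    of preamble does not break json.loads. Returns empty lists when no
    valid JSON object is found.
    """
    match = re.search(r'\{[\s\S]*\}', reply_text)
    if not match:
        return _empty_result()
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return _empty_result()
    # Fill in any key the model omitted so callers can index safely.
    return {key: data.get(key, []) for key in RESULT_KEYS}
```

Returning empty lists instead of raising keeps a single malformed reply from stopping a batch run; a stricter pipeline might prefer to raise and retry.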

### Step 4: Create Structured Summary

**Summary Template:**
```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts with brief definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets with context}

## Best Practices
{actionable recommendations}

## Related Topics
{links to related content in library}
```
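
The template can be filled with plain string formatting. A minimal sketch that renders only the header and the first two sections; `render_summary` is a hypothetical helper, and it assumes each document row is a dict keyed by the column names selected in Step 1:

```python
from datetime import date

# Abbreviated copy of the summary template (header plus two sections).
SUMMARY_TEMPLATE = """# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{summary}

## Key Concepts
{concepts}
"""

def render_summary(doc, summary, concepts, today=None):
    """Fill the template from a document row and extracted concepts.

    `concepts` is a list of dicts with "term" and "definition" keys,
    matching the JSON shape requested in Step 3.
    """
    bullets = "\n".join(
        f"- **{c['term']}**: {c['definition']}" for c in concepts
    )
    return SUMMARY_TEMPLATE.format(
        title=doc["title"],
        url=doc["url"],
        source_type=doc["source_type"],
        credibility_tier=doc["credibility_tier"],
        date=(today or date.today()).isoformat(),
        summary=summary,
        concepts=bullets,
    )
```

The remaining sections (techniques, code examples, best practices, related topics) would follow the same pattern.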

### Step 5: Optimize for Tokens

```python
def optimize_content(structured_content, target_ratio=0.30):
    """
    Compress content to target ratio while preserving quality.
    Target: 30% of original token count.
    """
    original_tokens = count_tokens(structured_content)
    target_tokens = int(original_tokens * target_ratio)
    
    # Prioritized compression strategies
    strategies = [
        remove_redundant_explanations,
        condense_examples,
        merge_similar_sections,
        trim_verbose_descriptions
    ]
    
    optimized = structured_content
    for strategy in strategies:
        if count_tokens(optimized) > target_tokens:
            optimized = strategy(optimized)
    
    return optimized
```
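
`count_tokens` and the four strategy functions are not defined in this skill. As a rough sketch, `count_tokens` could reuse the word-count heuristic from Step 2, and a strategy is any function that takes text and returns shorter text. `drop_duplicate_paragraphs` below is a hypothetical stand-in for one such strategy, not the implementation of any function named above:

```python
def count_tokens(text):
    """Approximate token count using 1.3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 1.3)

def drop_duplicate_paragraphs(text):
    """Remove paragraphs that repeat an earlier one, ignoring case and spacing."""
    seen = set()
    kept = []
    for paragraph in text.split("\n\n"):
        key = " ".join(paragraph.lower().split())  # normalized form for comparison
        if key in seen:
            continue
        seen.add(key)
        kept.append(paragraph)
    return "\n\n".join(kept)
```

A real tokenizer would give more accurate counts; the heuristic is only meant to keep the compression loop cheap.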

### Step 6: Store Distilled Content

```python
import json

def store_distilled(cursor, doc_id, summary, key_concepts, 
                    code_snippets, structured_content, 
                    original_tokens, distilled_tokens):
    """Insert one distilled record and return its new row id."""
    sql = """
    INSERT INTO distilled_content 
    (doc_id, summary, key_concepts, code_snippets, structured_content,
     token_count_original, token_count_distilled, distill_model, review_status)
    VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
    """
    # key_concepts and code_snippets are serialized to JSON strings before insertion.
    cursor.execute(sql, (
        doc_id, summary, 
        json.dumps(key_concepts), 
        json.dumps(code_snippets),
        structured_content,
        original_tokens, 
        distilled_tokens
    ))
    return cursor.lastrowid
```

## Distillation Prompts

**For Prompt Engineering Content:**
```
Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied
```

**For API Documentation:**
```
Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices
```

**For Research Papers:**
```
Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
```
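
The three focus lists can be stored in a lookup and appended to the extraction prompt. A small sketch, assuming hypothetical `doc_type` values (`prompt_engineering`, `api_docs`, `research_paper`); the actual values stored in `documents.doc_type` are not specified here:

```python
# Focus lists copied from the prompts above; the dictionary keys are assumed names.
FOCUS_PROMPTS = {
    "prompt_engineering": [
        "Specific techniques with before/after examples",
        "Why techniques work (not just what)",
        "Common pitfalls and how to avoid them",
        "Actionable patterns that can be directly applied",
    ],
    "api_docs": [
        "Endpoint specifications and parameters",
        "Request/response examples",
        "Error codes and handling",
        "Rate limits and best practices",
    ],
    "research_paper": [
        "Key findings and conclusions",
        "Novel techniques introduced",
        "Practical applications",
        "Limitations and caveats",
    ],
}

def build_focus_block(doc_type, default="prompt_engineering"):
    """Return the numbered "Focus on:" block for a document type.

    Unknown types fall back to `default` instead of raising.
    """
    items = FOCUS_PROMPTS.get(doc_type, FOCUS_PROMPTS[default])
    lines = [f"{i}. {item}" for i, item in enumerate(items, start=1)]
    return "Focus on:\n" + "\n".join(lines)
```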

## Quality Metrics

Track compression efficiency and content quality:

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
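
The two numeric targets can be checked mechanically. A hedged sketch with hypothetical helper names; coverage is approximated here by case-insensitive substring matching, which is an assumption and will miss paraphrased terms:

```python
def compression_ratio(original_tokens, distilled_tokens):
    """Distilled size as a fraction of the original size."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return distilled_tokens / original_tokens

def concept_coverage(important_terms, distilled_text):
    """Fraction of important terms that appear in the distilled text."""
    if not important_terms:
        return 1.0  # nothing to cover
    text = distilled_text.lower()
    found = sum(1 for term in important_terms if term.lower() in text)
    return found / len(important_terms)

def meets_targets(ratio, coverage):
    """Check the 25-35% compression and >=90% coverage targets from the table."""
    return 0.25 <= ratio <= 0.35 and coverage >= 0.90
```

Code snippet retention and readability still need a reviewer's judgment and are not covered by this sketch.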

## Handling Refactor Requests

When `quality-reviewer` returns a `refactor` decision:

```python
def handle_refactor(distill_id, instructions):
    """Re-distill based on reviewer feedback."""
    # Load original content and existing distillation
    original = load_raw_content(distill_id)
    existing = load_distilled_content(distill_id)
    
    # Apply specific improvements based on instructions; the original is
    # passed along so details dropped in the first pass can be restored
    improved = apply_improvements(existing, instructions, original=original)
    
    # Update distilled_content
    update_distilled(distill_id, improved)
    
    # Reset review status so the reviewer sees the new version
    set_review_status(distill_id, 'pending')
```

## Integration

| From | Input | To |
|------|-------|-----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |