# Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

## Distillation Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization

## Distillation Workflow

### Step 1: Load Raw Content

```python
def load_for_distillation(cursor):
    """Get documents ready for distillation."""
    # Completed crawls that have no distilled_content row yet,
    # ordered by credibility tier (ascending)
    sql = """
        SELECT d.doc_id, d.title, d.url, d.raw_content_path,
               d.doc_type, s.source_type, s.credibility_tier
        FROM documents d
        JOIN sources s ON d.source_id = s.source_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE d.crawl_status = 'completed'
          AND dc.distill_id IS NULL
        ORDER BY s.credibility_tier ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()
```

### Step 2: Analyze Content Structure

Identify the content type and select an appropriate distillation strategy:

```python
import re

def analyze_structure(content, doc_type):
    """Analyze document structure for distillation."""
    headers = re.findall(r'^#+\s', content, re.MULTILINE)
    analysis = {
        "has_code_blocks": bool(re.search(r'```[\s\S]*?```', content)),
        "has_headers": bool(headers),
        "has_lists": bool(re.search(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.search(r'\|.*\|', content)),
        "estimated_tokens": int(len(content.split()) * 1.3),  # Rough estimate
        "section_count": len(headers),
    }
    return analysis
```

### Step 3: Extract Key Components

**Extract Code Snippets:**

```python
import re

def extract_code_snippets(content):
    """Extract all code blocks with language tags."""
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            # get_surrounding_text is a helper that is not shown here
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
```

**Extract Key Concepts:**

```python
def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    # Limit for context; truncate before building the prompt so the
    # comment does not end up inside the prompt text
    excerpt = content[:8000]
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {excerpt}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # Use Claude API to process (claude_extract is a helper that is not shown here)
    return claude_extract(prompt)
```

### Step 4: Create Structured Summary

**Summary Template:**

```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts with brief definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets with context}

## Best Practices
{actionable recommendations}

## Related Topics
{links to related content in library}
```

### Step 5: Optimize for Tokens

```python
def optimize_content(structured_content, target_ratio=0.30):
    """
    Compress content to target ratio while preserving quality.
    Target: 30% of original token count.
    """
    original_tokens = count_tokens(structured_content)
    target_tokens = int(original_tokens * target_ratio)

    # Prioritized compression strategies
    strategies = [
        remove_redundant_explanations,
        condense_examples,
        merge_similar_sections,
        trim_verbose_descriptions
    ]

    optimized = structured_content
    for strategy in strategies:
        # Stop as soon as the target is met; later strategies are skipped
        if count_tokens(optimized) <= target_tokens:
            break
        optimized = strategy(optimized)

    return optimized
```

### Step 6: Store Distilled Content

```python
import json

def store_distilled(cursor, doc_id, summary, key_concepts, code_snippets,
                    structured_content, original_tokens, distilled_tokens):
    """Insert a distilled record with review_status 'pending'; return its row ID."""
    sql = """
        INSERT INTO distilled_content
        (doc_id, summary, key_concepts, code_snippets, structured_content,
         token_count_original, token_count_distilled, distill_model, review_status)
        VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
    """
    cursor.execute(sql, (
        doc_id, summary,
        json.dumps(key_concepts),   # stored as JSON text
        json.dumps(code_snippets),  # stored as JSON text
        structured_content,
        original_tokens,
        distilled_tokens
    ))
    return cursor.lastrowid
```

## Distillation Prompts

**For Prompt Engineering Content:**

```
Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied
```

**For API Documentation:**

```
Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices
```

**For Research Papers:**

```
Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
```

## Quality Metrics

Track compression efficiency and content quality:

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |

## Handling Refactor Requests

When `quality-reviewer` returns a `refactor` decision:

```python
def handle_refactor(distill_id, instructions):
    """Re-distill based on reviewer feedback."""
    # Load original content and existing distillation
    original = load_raw_content(distill_id)
    existing = load_distilled_content(distill_id)

    # Apply specific improvements based on instructions
    improved = apply_improvements(existing, instructions)

    # Update distilled_content
    update_distilled(distill_id, improved)

    # Reset review status so the reviewer picks the record up again
    set_review_status(distill_id, 'pending')
```

## Integration

| From | Input | To |
|------|-------|-----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |
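
The distiller/reviewer loop in the Integration table can be sketched as a small driver function. This is a minimal illustration under stated assumptions, not the pipeline's actual code: `Review`, `run_review_loop`, the `distill` and `review` callables, the `max_rounds` cap, and the `"approve"` decision name are all hypothetical. Only the `refactor` decision and the hand-off order come from this document.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    """Hypothetical reviewer verdict; the field names are assumptions."""
    decision: str                       # "refactor" triggers another pass
    instructions: Optional[str] = None  # reviewer feedback, if any


def run_review_loop(
    raw: str,
    distill: Callable[[str, Optional[str]], str],
    review: Callable[[str], Review],
    max_rounds: int = 3,
) -> Tuple[str, str, int]:
    """Distill, review, and re-distill while the reviewer asks for a refactor.

    Returns (distilled_text, last_decision, rounds_used).
    """
    instructions: Optional[str] = None
    distilled = ""
    decision = "pending"
    for round_no in range(1, max_rounds + 1):
        # content-distiller: produce (or re-produce) the distilled text
        distilled = distill(raw, instructions)
        # quality-reviewer: judge the result
        verdict = review(distilled)
        decision = verdict.decision
        if decision != "refactor":
            return distilled, decision, round_no
        # Feed the reviewer's instructions into the next pass
        instructions = verdict.instructions
    # The cap was reached while the reviewer still wanted changes
    return distilled, decision, max_rounds


# Stand-in callables, for illustration only
def fake_distill(raw: str, instructions: Optional[str]) -> str:
    limit = 3 if instructions else 5
    return " ".join(raw.split()[:limit])


def fake_review(text: str) -> Review:
    if len(text.split()) > 3:
        return Review("refactor", "shorten")
    return Review("approve")


result = run_review_loop("one two three four five six", fake_distill, fake_review)
capped = run_review_loop("one two three four five six", fake_distill, fake_review, max_rounds=1)
```

Capping the number of rounds keeps a document from cycling indefinitely between the two components. What happens to a record that is still marked `refactor` when the cap is reached is left to the caller.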