---
name: content-distiller
description: Analyzes and distills raw crawled content into concise reference materials. Extracts key concepts, code snippets, and creates structured summaries optimized for prompt engineering use cases. Triggers on "distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary".
---

# Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

## Distillation Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization

## Distillation Workflow

### Step 1: Load Raw Content

```python
def load_for_distillation(cursor):
    """Get documents ready for distillation."""
    # Select crawled documents that have no distilled_content row yet,
    # ordered by the source's credibility tier (ascending).
    sql = """
        SELECT d.doc_id, d.title, d.url, d.raw_content_path,
               d.doc_type, s.source_type, s.credibility_tier
        FROM documents d
        JOIN sources s ON d.source_id = s.source_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE d.crawl_status = 'completed'
          AND dc.distill_id IS NULL
        ORDER BY s.credibility_tier ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()
```

### Step 2: Analyze Content Structure

Identify the content type and select an appropriate distillation strategy:

```python
import re


def analyze_structure(content, doc_type):
    """Analyze document structure for distillation."""
    # doc_type is accepted for strategy selection but is not used in this analysis.
    section_count = len(re.findall(r'^#+\s', content, re.MULTILINE))
    analysis = {
        "has_code_blocks": bool(re.search(r'```[\s\S]*?```', content)),
        "has_headers": section_count > 0,
        "has_lists": bool(re.search(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.search(r'\|.*\|', content)),
        "estimated_tokens": len(content.split()) * 1.3,  # Rough estimate
        "section_count": section_count,
    }
    return analysis
```

### Step 3: Extract Key Components

**Extract Code Snippets:**

```python
import re


def extract_code_snippets(content):
    """Extract all code blocks with language tags."""
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            # get_surrounding_text is a helper defined elsewhere; it is
            # expected to return up to 200 characters around the match.
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
```

**Extract Key Concepts:**

```python
def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    # Truncate the document so the prompt fits in the context window.
    excerpt = content[:8000]
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {excerpt}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # claude_extract is a helper defined elsewhere that sends the prompt
    # to the Claude API and returns the parsed result.
    return claude_extract(prompt)
```

### Step 4: Create Structured Summary

**Summary Template:**

```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts with brief definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets with context}

## Best Practices
{actionable recommendations}

## Related Topics
{links to related content in library}
```

### Step 5: Optimize for Tokens

```python
def optimize_content(structured_content, target_ratio=0.30):
    """
    Compress content to target ratio while preserving quality.
    Target: 30% of original token count.
    """
    # count_tokens and the strategy functions below are defined elsewhere.
    original_tokens = count_tokens(structured_content)
    target_tokens = int(original_tokens * target_ratio)

    # Prioritized compression strategies
    strategies = [
        remove_redundant_explanations,
        condense_examples,
        merge_similar_sections,
        trim_verbose_descriptions
    ]

    optimized = structured_content
    for strategy in strategies:
        # Stop as soon as the content fits within the target.
        if count_tokens(optimized) <= target_tokens:
            break
        optimized = strategy(optimized)

    return optimized
```

### Step 6: Store Distilled Content

```python
import json


def store_distilled(cursor, doc_id, summary, key_concepts,
                    code_snippets, structured_content,
                    original_tokens, distilled_tokens):
    """Insert a distilled record with review status 'pending' and return its row ID."""
    sql = """
        INSERT INTO distilled_content
        (doc_id, summary, key_concepts, code_snippets, structured_content,
         token_count_original, token_count_distilled, distill_model, review_status)
        VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
    """
    # Key concepts and code snippets are stored as JSON strings.
    cursor.execute(sql, (
        doc_id, summary,
        json.dumps(key_concepts),
        json.dumps(code_snippets),
        structured_content,
        original_tokens,
        distilled_tokens
    ))
    return cursor.lastrowid
```

## Distillation Prompts

**For Prompt Engineering Content:**

```
Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied
```

**For API Documentation:**

```
Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices
```

**For Research Papers:**

```
Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
```

## Quality Metrics

Track compression efficiency:

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |

## Handling Refactor Requests

When `quality-reviewer` returns a `refactor` decision:

```python
def handle_refactor(distill_id, instructions):
    """Re-distill based on reviewer feedback."""
    # All helpers called here are defined elsewhere.
    # Load original content and existing distillation
    original = load_raw_content(distill_id)
    existing = load_distilled_content(distill_id)

    # Apply specific improvements based on instructions
    improved = apply_improvements(existing, instructions)

    # Update distilled_content
    update_distilled(distill_id, improved)

    # Reset review status
    set_review_status(distill_id, 'pending')
```

## Integration

| From | Input | To |
|------|-------|-----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |
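
The numeric targets in the Quality Metrics table (25-35% compression, ≥90% key concept coverage) can be checked mechanically before a record goes to `quality-reviewer`. The following is a minimal sketch, not part of the skill itself: the function names `estimate_tokens`, `compression_ratio`, `concept_coverage`, and `check_quality` are hypothetical, the token estimate reuses the rough words × 1.3 heuristic from Step 2, and coverage is approximated by case-insensitive substring matching, which is an assumption.

```python
from __future__ import annotations


def estimate_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words x 1.3 (Step 2 heuristic)."""
    return int(len(text.split()) * 1.3)


def compression_ratio(original: str, distilled: str) -> float:
    """Distilled size as a fraction of the original (0.30 means 30%)."""
    original_tokens = estimate_tokens(original)
    if original_tokens == 0:
        return 0.0
    return estimate_tokens(distilled) / original_tokens


def concept_coverage(important_terms: list[str], distilled: str) -> float:
    """Fraction of important terms found in the distilled text.

    Assumption: a term counts as covered if it appears as a
    case-insensitive substring of the distilled text.
    """
    if not important_terms:
        return 1.0
    lowered = distilled.lower()
    found = sum(1 for term in important_terms if term.lower() in lowered)
    return found / len(important_terms)


def check_quality(original: str, distilled: str,
                  important_terms: list[str]) -> dict:
    """Compare a distillation against the Quality Metrics table targets."""
    ratio = compression_ratio(original, distilled)
    coverage = concept_coverage(important_terms, distilled)
    return {
        "compression_ratio": ratio,
        "concept_coverage": coverage,
        "ratio_in_target": 0.25 <= ratio <= 0.35,   # 25-35% of original
        "coverage_in_target": coverage >= 0.90,     # >=90% of important terms
    }
```

A report with either flag set to `False` could be used to decide whether to re-run Step 5 before submitting for review. Code snippet retention and readability are not covered here because they need a judgment about relevance rather than a count.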