6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
name: content-distiller
description: Analyzes and distills raw crawled content into concise reference materials. Extracts key concepts, code snippets, and creates structured summaries optimized for prompt engineering use cases. Triggers on "distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary".
---

# Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

## Distillation Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization

## Distillation Workflow

### Step 1: Load Raw Content

```python
def load_for_distillation(cursor):
    """Get documents ready for distillation."""
    # Select completed crawls that have no distilled_content row yet,
    # ordered by ascending credibility tier.
    sql = """
        SELECT d.doc_id, d.title, d.url, d.raw_content_path,
               d.doc_type, s.source_type, s.credibility_tier
        FROM documents d
        JOIN sources s ON d.source_id = s.source_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE d.crawl_status = 'completed'
          AND dc.distill_id IS NULL
        ORDER BY s.credibility_tier ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()
```

### Step 2: Analyze Content Structure

Identify the content type and select an appropriate distillation strategy:

```python
import re


def analyze_structure(content, doc_type):
    """Analyze document structure for distillation."""
    # doc_type is accepted for the caller's convenience; this structural
    # pass inspects only the Markdown content itself.
    analysis = {
        "has_code_blocks": bool(re.search(r'```[\s\S]*?```', content)),
        "has_headers": bool(re.search(r'^#+\s', content, re.MULTILINE)),
        "has_lists": bool(re.search(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.search(r'\|.*\|', content)),
        # Rough estimate: about 1.3 tokens per whitespace-separated word
        "estimated_tokens": int(len(content.split()) * 1.3),
        "section_count": len(re.findall(r'^#+\s', content, re.MULTILINE)),
    }
    return analysis
```

### Step 3: Extract Key Components

**Extract Code Snippets:**

```python
import re


def extract_code_snippets(content):
    """Extract all code blocks with language tags."""
    # Group 1 is the optional language tag, group 2 is the code body.
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            # get_surrounding_text is a helper that is not shown in this block
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
```
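`extract_code_snippets` calls `get_surrounding_text`, which is not defined in this skill file. The following is a minimal sketch of one possible definition. It assumes that "surrounding" means the prose immediately before the code block (where a snippet is usually introduced) and that the third argument is a character window; both are assumptions, not confirmed behavior.

```python
def get_surrounding_text(content: str, position: int, window: int = 200) -> str:
    """Return up to `window` characters of text that precede `position`.

    Assumption: the context of a snippet is the prose leading up to it.
    """
    start = max(0, position - window)
    return content[start:position].strip()


# Hypothetical sample document used only for illustration.
doc = "Intro text.\n\nCall the API like this:\n```python\nprint('hi')\n```\n"
fence_at = doc.index("```")
context = get_surrounding_text(doc, fence_at, 200)
```

A symmetric window (text before and after the block) would work just as well; the preceding-text variant is simply the shortest version that gives each snippet a usable caption.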

**Extract Key Concepts:**

```python
def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    # Limit the content to its first 8000 characters to fit the context.
    # (Kept outside the f-string so the note is not sent as part of the prompt.)
    excerpt = content[:8000]
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {excerpt}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # Use Claude API to process (claude_extract is not shown in this block)
    return claude_extract(prompt)
```
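The prompt above asks for JSON with `concepts`, `techniques`, and `best_practices`, but a model reply may arrive wrapped in a Markdown fence or with keys missing. The sketch below shows one way to normalize such a reply before storing it. `parse_concepts_response` is a hypothetical helper, not part of the skill, and it makes no assumption about how `claude_extract` obtains the raw text.

```python
import json
import re


def parse_concepts_response(raw: str) -> dict:
    """Parse a model reply into the concepts/techniques/best_practices shape."""
    # Replies are sometimes wrapped in a Markdown fence; unwrap it if present.
    fenced = re.search(r"```(?:json)?\s*([\s\S]*?)```", raw)
    payload = fenced.group(1) if fenced else raw
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Fill in missing keys so downstream code can rely on the structure.
    return {
        "concepts": data.get("concepts", []),
        "techniques": data.get("techniques", []),
        "best_practices": data.get("best_practices", []),
    }
```

Returning empty lists on malformed input keeps the pipeline running; a stricter variant could raise instead so that the document is retried.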

### Step 4: Create Structured Summary

**Summary Template:**

```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts with brief definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets with context}

## Best Practices
{actionable recommendations}

## Related Topics
{links to related content in library}
```
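As an illustration of how the template might be filled, here is a minimal sketch that renders the header and the first two sections with `str.format`. The function name `render_summary`, the shape of the `doc` dictionary, and the `distilled_on` parameter are assumptions made for this example; the remaining sections would follow the same pattern.

```python
from datetime import date

# Abbreviated version of the summary template (first two sections only).
SUMMARY_TEMPLATE = """# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {distilled_on}

## Executive Summary
{executive_summary}

## Key Concepts
{key_concepts}
"""


def render_summary(doc, executive_summary, concepts, distilled_on=None):
    """Fill the template from a document record and extracted concepts."""
    distilled_on = distilled_on or date.today()
    # One bullet per concept: bold term followed by its definition.
    key_concepts = "\n".join(
        f"- **{c['term']}**: {c['definition']}" for c in concepts
    )
    return SUMMARY_TEMPLATE.format(
        title=doc["title"],
        url=doc["url"],
        source_type=doc["source_type"],
        credibility_tier=doc["credibility_tier"],
        distilled_on=distilled_on.isoformat(),
        executive_summary=executive_summary,
        key_concepts=key_concepts,
    )
```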

### Step 5: Optimize for Tokens

```python
def optimize_content(structured_content, target_ratio=0.30):
    """
    Compress content to target ratio while preserving quality.
    Default target: 30% of original token count.
    """
    # count_tokens and the strategy functions are not shown in this block.
    original_tokens = count_tokens(structured_content)
    target_tokens = int(original_tokens * target_ratio)

    # Prioritized compression strategies
    strategies = [
        remove_redundant_explanations,
        condense_examples,
        merge_similar_sections,
        trim_verbose_descriptions
    ]

    optimized = structured_content
    for strategy in strategies:
        # Apply the next strategy only while the content is still over target.
        if count_tokens(optimized) > target_tokens:
            optimized = strategy(optimized)

    return optimized
```
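`optimize_content` depends on `count_tokens` and on four strategy functions that are not defined in this file. The sketch below gives hypothetical stand-ins for two of them: a word-based token estimate that reuses the 1.3 multiplier from Step 2, and a `remove_redundant_explanations` that drops paragraphs repeating an earlier one. Both are simplifications for illustration; a real tokenizer and a semantic redundancy check would behave differently.

```python
import re


def count_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words times 1.3."""
    return int(len(text.split()) * 1.3)


def remove_redundant_explanations(text: str) -> str:
    """Drop paragraphs that repeat an earlier paragraph.

    Paragraphs are compared case-insensitively with whitespace collapsed,
    so only near-verbatim repeats are removed.
    """
    seen = set()
    kept = []
    for paragraph in re.split(r"\n\s*\n", text):
        key = " ".join(paragraph.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(paragraph.strip())
    return "\n\n".join(kept)
```

Each strategy takes and returns a string, which is what lets `optimize_content` chain them in priority order.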

### Step 6: Store Distilled Content

```python
import json


def store_distilled(cursor, doc_id, summary, key_concepts,
                    code_snippets, structured_content,
                    original_tokens, distilled_tokens):
    """Insert a distilled record and return its row ID."""
    # New rows start with review_status 'pending' so quality-reviewer picks them up.
    sql = """
        INSERT INTO distilled_content
        (doc_id, summary, key_concepts, code_snippets, structured_content,
         token_count_original, token_count_distilled, distill_model, review_status)
        VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
    """
    cursor.execute(sql, (
        doc_id, summary,
        json.dumps(key_concepts),    # stored as JSON text
        json.dumps(code_snippets),   # stored as JSON text
        structured_content,
        original_tokens,
        distilled_tokens
    ))
    return cursor.lastrowid
```

## Distillation Prompts

**For Prompt Engineering Content:**

```
Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied
```

**For API Documentation:**

```
Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices
```

**For Research Papers:**

```
Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
```
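One way to wire these three focus lists into the workflow is a lookup keyed by document type. The sketch below is illustrative only: the key names (`prompt_engineering`, `api_documentation`, `research_paper`) and the fallback to the prompt-engineering list are assumptions, since the actual `doc_type` values stored in the `documents` table are not listed here.

```python
# Focus points copied from the three prompt blocks; the keys are hypothetical.
FOCUS_BY_DOC_TYPE = {
    "prompt_engineering": [
        "Specific techniques with before/after examples",
        "Why techniques work (not just what)",
        "Common pitfalls and how to avoid them",
        "Actionable patterns that can be directly applied",
    ],
    "api_documentation": [
        "Endpoint specifications and parameters",
        "Request/response examples",
        "Error codes and handling",
        "Rate limits and best practices",
    ],
    "research_paper": [
        "Key findings and conclusions",
        "Novel techniques introduced",
        "Practical applications",
        "Limitations and caveats",
    ],
}


def build_focus_prompt(doc_type: str) -> str:
    """Return the numbered 'Focus on:' block for a document type."""
    # Assumed fallback: unknown types use the prompt-engineering focus list.
    points = FOCUS_BY_DOC_TYPE.get(doc_type, FOCUS_BY_DOC_TYPE["prompt_engineering"])
    numbered = [f"{i}. {point}" for i, point in enumerate(points, start=1)]
    return "Focus on:\n" + "\n".join(numbered)
```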

## Quality Metrics

Track compression efficiency:

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
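
The first two targets in the table can be checked numerically. The following sketch assumes that coverage is measured as the fraction of important terms appearing verbatim (case-insensitively) in the distilled text, which is a simplification; the function names are hypothetical and not part of the skill.

```python
def compression_ratio(original_tokens: int, distilled_tokens: int) -> float:
    """Distilled size as a fraction of the original size."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return distilled_tokens / original_tokens


def concept_coverage(important_terms, distilled_text: str) -> float:
    """Fraction of important terms found in the distilled text."""
    terms = [term for term in important_terms if term.strip()]
    if not terms:
        return 1.0  # nothing to cover
    text = distilled_text.lower()
    found = sum(1 for term in terms if term.lower() in text)
    return found / len(terms)


def meets_targets(ratio: float, coverage: float) -> bool:
    """Check the 25-35% compression and >=90% coverage targets."""
    return 0.25 <= ratio <= 0.35 and coverage >= 0.90
```

For example, a 1000-token document distilled to 300 tokens has a ratio of 0.3, which falls inside the target band.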

## Handling Refactor Requests

When `quality-reviewer` returns a `refactor` decision:

```python
def handle_refactor(distill_id, instructions):
    """Re-distill based on reviewer feedback."""
    # The helper functions used here are not shown in this block.
    # Load original content and existing distillation
    original = load_raw_content(distill_id)
    existing = load_distilled_content(distill_id)

    # Apply specific improvements based on instructions
    improved = apply_improvements(existing, instructions)

    # Update distilled_content
    update_distilled(distill_id, improved)

    # Reset review status so the record is reviewed again
    set_review_status(distill_id, 'pending')
```

## Integration

| From | Input | To |
|------|-------|-----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |