feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Committed: 2026-01-29 00:20:27 +07:00
Parent: e80056ae8a
Commit: 6d7a6d7a88
26 changed files with 4486 additions and 1 deletion


@@ -0,0 +1,238 @@
---
name: content-distiller
description: Analyzes and distills raw crawled content into concise reference materials. Extracts key concepts, code snippets, and creates structured summaries optimized for prompt engineering use cases. Triggers on "distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary".
---
# Content Distiller
Transforms raw crawled content into structured, high-quality reference materials.
## Distillation Goals
1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization
## Distillation Workflow
### Step 1: Load Raw Content
```python
def load_for_distillation(cursor):
"""Get documents ready for distillation."""
sql = """
SELECT d.doc_id, d.title, d.url, d.raw_content_path,
d.doc_type, s.source_type, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
WHERE d.crawl_status = 'completed'
AND dc.distill_id IS NULL
ORDER BY s.credibility_tier ASC
"""
cursor.execute(sql)
return cursor.fetchall()
```
### Step 2: Analyze Content Structure
Identify content type and select appropriate distillation strategy:
```python
import re

def analyze_structure(content, doc_type):
    """Analyze document structure for distillation."""
    analysis = {
        "doc_type": doc_type,
        "has_code_blocks": bool(re.search(r'```[\s\S]*?```', content)),
        "has_headers": bool(re.search(r'^#+\s', content, re.MULTILINE)),
        "has_lists": bool(re.search(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.search(r'\|.*\|', content)),
        "estimated_tokens": int(len(content.split()) * 1.3),  # Rough word-based estimate
        "section_count": len(re.findall(r'^#+\s', content, re.MULTILINE))
    }
    return analysis
```
### Step 3: Extract Key Components
**Extract Code Snippets:**
```python
import re

def extract_code_snippets(content):
"""Extract all code blocks with language tags."""
pattern = r'```(\w*)\n([\s\S]*?)```'
snippets = []
for match in re.finditer(pattern, content):
snippets.append({
"language": match.group(1) or "text",
"code": match.group(2).strip(),
"context": get_surrounding_text(content, match.start(), 200)
})
return snippets
```
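The `get_surrounding_text` helper used above is not defined in this skill; a minimal sketch (the backward-only window and whitespace normalization are assumptions):

```python
def get_surrounding_text(content, position, window):
    """Return up to `window` characters of text preceding `position`,
    collapsed to single spaces, as prose context for a code snippet."""
    start = max(0, position - window)
    return " ".join(content[start:position].split())
```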
**Extract Key Concepts:**
```python
def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    excerpt = content[:8000]  # Truncate to fit the model's context window
    prompt = f"""
    Analyze this document and extract key concepts:
    Title: {title}
    Content: {excerpt}
    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # Use Claude API to process the prompt and parse the JSON reply
    return claude_extract(prompt)
```
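`claude_extract` is left undefined above. One way to sketch it is as a thin wrapper that sends the prompt to a model-calling function and parses the JSON reply, tolerating a fenced ```json block; the `call_model` parameter (injected here for testability) and the empty fallback shape are assumptions:

```python
import json
import re

def claude_extract(prompt, call_model):
    """Send `prompt` via `call_model` (a callable returning the model's
    text reply) and parse the reply as JSON, stripping a code fence if
    present. Returns an empty result when the reply is not valid JSON."""
    reply = call_model(prompt)
    match = re.search(r'```(?:json)?\s*([\s\S]*?)```', reply)
    text = match.group(1) if match else reply
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"concepts": [], "techniques": [], "best_practices": []}
```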
### Step 4: Create Structured Summary
**Summary Template:**
```markdown
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}
## Executive Summary
{2-3 sentence overview}
## Key Concepts
{bulleted list of core concepts with brief definitions}
## Techniques & Patterns
{extracted techniques with use cases}
## Code Examples
{relevant code snippets with context}
## Best Practices
{actionable recommendations}
## Related Topics
{links to related content in library}
```
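The template above can be rendered with `str.format`; a minimal sketch covering the header and key-concepts sections (the field names mirror the template, the bullet-joining logic is an assumption):

```python
SUMMARY_TEMPLATE = """# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}
## Executive Summary
{summary}
## Key Concepts
{concepts}
"""

def render_summary(doc, summary, concepts):
    """Render the summary header sections. `concepts` is a list of
    {"term": ..., "definition": ...} dicts as returned by
    extract_key_concepts; `doc` supplies the metadata fields."""
    bullets = "\n".join(f"- **{c['term']}**: {c['definition']}" for c in concepts)
    return SUMMARY_TEMPLATE.format(summary=summary, concepts=bullets, **doc)
```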
### Step 5: Optimize for Tokens
```python
def optimize_content(structured_content, target_ratio=0.30):
"""
Compress content to target ratio while preserving quality.
Target: 30% of original token count.
"""
original_tokens = count_tokens(structured_content)
target_tokens = int(original_tokens * target_ratio)
# Prioritized compression strategies
strategies = [
remove_redundant_explanations,
condense_examples,
merge_similar_sections,
trim_verbose_descriptions
]
optimized = structured_content
for strategy in strategies:
if count_tokens(optimized) > target_tokens:
optimized = strategy(optimized)
return optimized
```
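`count_tokens` is not defined in this skill; absent a real tokenizer, the same word-count heuristic used in `analyze_structure` can serve (the 1.3 multiplier is a rough assumption):

```python
def count_tokens(text):
    """Rough token estimate: ~1.3 tokens per whitespace-delimited word.
    Swap in a real tokenizer for accurate counts."""
    return int(len(text.split()) * 1.3)
```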
### Step 6: Store Distilled Content
```python
import json

def store_distilled(cursor, doc_id, summary, key_concepts,
code_snippets, structured_content,
original_tokens, distilled_tokens):
sql = """
INSERT INTO distilled_content
(doc_id, summary, key_concepts, code_snippets, structured_content,
token_count_original, token_count_distilled, distill_model, review_status)
VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
"""
cursor.execute(sql, (
doc_id, summary,
json.dumps(key_concepts),
json.dumps(code_snippets),
structured_content,
original_tokens,
distilled_tokens
))
return cursor.lastrowid
```
## Distillation Prompts
**For Prompt Engineering Content:**
```
Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied
```
**For API Documentation:**
```
Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices
```
**For Research Papers:**
```
Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
```
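The focus prompts above can be dispatched by document type; a minimal sketch (the type keys and the default choice are assumptions):

```python
FOCUS_PROMPTS = {
    "prompt_engineering": "Focus on: specific techniques with before/after "
                          "examples; why techniques work; common pitfalls; "
                          "actionable patterns.",
    "api_docs": "Focus on: endpoint specifications and parameters; "
                "request/response examples; error codes and handling; "
                "rate limits and best practices.",
    "research_paper": "Focus on: key findings and conclusions; novel "
                      "techniques; practical applications; limitations.",
}

def select_focus_prompt(doc_type):
    """Pick the distillation focus for a document type, defaulting to
    the prompt-engineering focus for unrecognized types."""
    return FOCUS_PROMPTS.get(doc_type, FOCUS_PROMPTS["prompt_engineering"])
```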
## Quality Metrics
Track compression efficiency:
| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
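The first two targets can be checked mechanically after each distillation; a sketch (thresholds mirror the table above, the returned dict shape is an assumption):

```python
def check_quality_metrics(original_tokens, distilled_tokens,
                          concepts_found, concepts_expected):
    """Evaluate a distillation against the target metrics table:
    compression ratio in 25-35% and concept coverage >= 90%."""
    ratio = distilled_tokens / original_tokens if original_tokens else 0.0
    coverage = concepts_found / concepts_expected if concepts_expected else 1.0
    return {
        "compression_ratio": round(ratio, 3),
        "ratio_ok": 0.25 <= ratio <= 0.35,
        "concept_coverage": round(coverage, 3),
        "coverage_ok": coverage >= 0.90,
    }
```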
## Handling Refactor Requests
When `quality-reviewer` returns `refactor` decision:
```python
def handle_refactor(distill_id, instructions):
"""Re-distill based on reviewer feedback."""
# Load original content and existing distillation
original = load_raw_content(distill_id)
existing = load_distilled_content(distill_id)
# Apply specific improvements based on instructions
improved = apply_improvements(existing, instructions)
# Update distilled_content
update_distilled(distill_id, improved)
# Reset review status
set_review_status(distill_id, 'pending')
```
## Integration
| From | Input | To |
|------|-------|-----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |
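The review loop in the table above can be expressed as a small router mapping each `quality-reviewer` decision to the next skill; a sketch (the decision names match the approve/refactor/research routing described in the commit summary):

```python
def route_review_decision(decision):
    """Map a quality-reviewer decision to the next skill in the loop."""
    routes = {
        "approve": "markdown-exporter",    # Publish the distilled content
        "refactor": "content-distiller",   # Re-distill with reviewer feedback
        "research": "reference-discovery", # Find better sources first
    }
    if decision not in routes:
        raise ValueError(f"Unknown review decision: {decision!r}")
    return routes[decision]
```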