Content Distiller
Analyzes raw crawled content and distills it into concise reference material. Extracts key concepts and code snippets, and creates structured summaries.
Trigger Keywords
"distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary"
Goals
- Compress - Reduce token count while preserving essential information
- Structure - Organize content for easy retrieval
- Extract - Pull out code snippets, key concepts, patterns
- Annotate - Add metadata for searchability
Workflow
Step 1: Load Raw Content
python scripts/load_pending.py --output pending_docs.json
Step 2: Analyze Content Structure
Identify document characteristics:
- Has code blocks?
- Has headers?
- Has tables?
- Estimated tokens?
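The four checks above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of any script in this skill: the function name `analyze_structure`, the returned keys, and the 4-characters-per-token estimate are all assumptions made for the example, and the input is assumed to be markdown.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker (built this way to avoid a literal fence here)
HEADER_RE = re.compile(r"^#{1,6}\s+\S")
# Separator row of a pipe table, e.g. |---|---| or | :--- | ---: |
TABLE_SEP_RE = re.compile(r"^\s*\|?(\s*:?-{3,}:?\s*\|)+\s*(:?-{3,}:?)?\s*$")


def analyze_structure(text: str) -> dict:
    """Return the Step 2 characteristics for one markdown document (hypothetical helper)."""
    in_fence = False
    found = {"has_code_blocks": False, "has_headers": False, "has_tables": False}
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            found["has_code_blocks"] = True
            in_fence = not in_fence  # toggle on opening and closing fences
            continue
        if in_fence:
            continue  # ignore '#' comments and pipes inside code
        if HEADER_RE.match(line):
            found["has_headers"] = True
        elif TABLE_SEP_RE.match(line):
            found["has_tables"] = True
    # Assumed heuristic: roughly 4 characters per token for English text.
    found["estimated_tokens"] = len(text) // 4
    return found
```

A real tokenizer would give a more accurate count than the character heuristic; the heuristic is only enough to decide how aggressively to compress.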
Step 3: Extract Key Components
python scripts/extract_components.py --doc-id 123 --output components.json
Extracts:
- Code snippets with language tags
- Key concepts and definitions
- Best practices
- Techniques and patterns
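As an illustration of the first item, here is one way fenced code snippets and their language tags could be pulled out of markdown. It is a sketch under assumptions: `extract_snippets` and the `"text"` fallback tag are hypothetical and not taken from `extract_components.py`, and only backtick fences are handled.

```python
from __future__ import annotations

import re

FENCE = "`" * 3  # markdown code-fence marker

# Opening fence with optional language tag, lazy body, closing fence on its own line.
SNIPPET_RE = re.compile(
    rf"^{FENCE}([\w+-]*)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_snippets(markdown: str) -> list[dict]:
    """Return each fenced block as {'language', 'code'} (hypothetical helper)."""
    return [
        {
            # Untagged fences fall back to an assumed "text" label.
            "language": match.group(1) or "text",
            "code": match.group(2).rstrip("\n"),
        }
        for match in SNIPPET_RE.finditer(markdown)
    ]
```

Concepts, best practices, and patterns need semantic judgment rather than pattern matching, so they are left out of this sketch.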
Step 4: Create Structured Summary
Output template:
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}
## Executive Summary
{2-3 sentence overview}
## Key Concepts
{bulleted list with definitions}
## Techniques & Patterns
{extracted techniques with use cases}
## Code Examples
{relevant code snippets}
## Best Practices
{actionable recommendations}
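The template can be filled with plain `str.format`. The sketch below assumes the extracted components arrive as a dictionary. The free-text placeholders are renamed to hypothetical fields (`summary`, `key_concepts`, and so on) so they can be passed as keyword arguments, and the `example.com` URL in the usage is a stand-in.

```python
from datetime import date

SUMMARY_TEMPLATE = """\
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}
## Executive Summary
{summary}
## Key Concepts
{key_concepts}
## Techniques & Patterns
{techniques}
## Code Examples
{code_examples}
## Best Practices
{best_practices}
"""


def bullets(items) -> str:
    """Render an iterable of strings as a markdown bullet list."""
    return "\n".join(f"- {item}" for item in items)


def render_summary(doc: dict, distilled_on: str = "") -> str:
    """Fill the template from a component dict (field names are assumptions)."""
    return SUMMARY_TEMPLATE.format(
        title=doc["title"],
        url=doc["url"],
        source_type=doc["source_type"],
        credibility_tier=doc["credibility_tier"],
        date=distilled_on or date.today().isoformat(),
        summary=doc["summary"],
        key_concepts=bullets(doc["key_concepts"]),
        techniques=bullets(doc["techniques"]),
        code_examples=doc["code_examples"],
        best_practices=bullets(doc["best_practices"]),
    )


example = {
    "title": "Example Doc",
    "url": "https://example.com/docs",  # placeholder URL
    "source_type": "official_docs",
    "credibility_tier": "tier1",
    "summary": "Explains how to retry failed requests safely.",
    "key_concepts": ["idempotency: safe to retry"],
    "techniques": ["exponential backoff: use for transient errors"],
    "code_examples": "retry(call, attempts=3)",
    "best_practices": ["Cap the total retry time."],
}
print(render_summary(example, distilled_on="2025-01-01"))
```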
Step 5: Optimize for Tokens
Target: 25-35% of original token count
python scripts/optimize_content.py --doc-id 123 --target-ratio 0.30
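To make the target concrete, the ratio is distilled tokens divided by original tokens, so a 4,000-token source should distill to roughly 1,000 to 1,400 tokens. A minimal check, assuming a rough 4-characters-per-token estimate (the helper names are hypothetical and not part of `optimize_content.py`):

```python
def estimate_tokens(text: str) -> int:
    """Assumed heuristic: about 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)


def compression_ratio(original: str, distilled: str) -> float:
    """Distilled token count as a fraction of the original token count."""
    return estimate_tokens(distilled) / estimate_tokens(original)


def within_target(ratio: float, low: float = 0.25, high: float = 0.35) -> bool:
    """True when the ratio falls inside the 25-35% target band."""
    return low <= ratio <= high


original = "x" * 4000   # about 1000 estimated tokens
distilled = "x" * 1200  # about 300 estimated tokens
ratio = compression_ratio(original, distilled)
print(f"{ratio:.2f}", within_target(ratio))
```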
Step 6: Store Distilled Content
python scripts/store_distilled.py --doc-id 123 --content distilled.md
Quality Metrics
| Metric | Target |
|---|---|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
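The two countable metrics in this table can be checked mechanically. The sketch below assumes the list of important terms and the list of relevant snippets have already been produced upstream (for example, during extraction). The function names are hypothetical, and the substring matching is a deliberately simple stand-in for whatever matching the real pipeline uses.

```python
def concept_coverage(important_terms: list, distilled: str) -> float:
    """Fraction of important terms that appear (case-insensitively) in the distilled text."""
    if not important_terms:
        return 1.0
    text = distilled.lower()
    found = sum(1 for term in important_terms if term.lower() in text)
    return found / len(important_terms)


def snippet_retention(relevant_snippets: list, distilled: str) -> float:
    """Fraction of relevant snippets that survive verbatim in the distilled text."""
    if not relevant_snippets:
        return 1.0
    kept = sum(1 for snippet in relevant_snippets if snippet.strip() in distilled)
    return kept / len(relevant_snippets)


def meets_targets(coverage: float, retention: float) -> bool:
    """Apply the table's thresholds: coverage >= 90%, retention = 100%."""
    return coverage >= 0.90 and retention == 1.0


distilled = "Use exponential backoff when retrying.\nretry(call, attempts=3)"
coverage = concept_coverage(["exponential backoff", "retrying"], distilled)
retention = snippet_retention(["retry(call, attempts=3)"], distilled)
print(coverage, retention, meets_targets(coverage, retention))
```

Readability has no threshold in the table and is left to the quality-reviewer.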
Handling Refactor Requests
When quality-reviewer returns a refactor decision, re-run distillation with the reviewer's instructions:
python scripts/refactor_content.py --distill-id 456 --instructions "Add more examples"
Scripts
- scripts/load_pending.py - Load documents pending distillation
- scripts/extract_components.py - Extract code, concepts, patterns
- scripts/optimize_content.py - Token optimization
- scripts/store_distilled.py - Save to database
- scripts/refactor_content.py - Handle refactor requests
Integration
| Direction | Skill | Data |
|---|---|---|
| Input from | content-repository | Raw document records |
| Output to | quality-reviewer | Distilled content |
| Input from | quality-reviewer | Refactor instructions (loop back) |