6 modular skills for curating, processing, and exporting reference docs:

- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:

- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Content Distiller

Analyzes and distills raw crawled content into concise reference materials. Extracts key concepts and code snippets, and creates structured summaries.

## Trigger Keywords
"distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary"

## Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval
3. **Extract** - Pull out code snippets, key concepts, patterns
4. **Annotate** - Add metadata for searchability

## Workflow

### Step 1: Load Raw Content
```bash
python scripts/load_pending.py --output pending_docs.json
```

### Step 2: Analyze Content Structure
Identify document characteristics:
- Has code blocks?
- Has headers?
- Has tables?
- Estimated tokens?

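
The checks listed above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `analyze_structure` and the regex names are hypothetical, the input is assumed to be Markdown, and the token estimate uses the rough rule of thumb of about four characters per token, which a real tokenizer would replace.

```python
import re

# Hypothetical helper names; none of these come from the project's scripts.
FENCED_BLOCK_RE = re.compile(r"^```.*?^```[ \t]*$", re.MULTILINE | re.DOTALL)
HEADER_RE = re.compile(r"^#{1,6} \S", re.MULTILINE)
# A table delimiter row such as |-----|-----| (at least two columns).
TABLE_RULE_RE = re.compile(
    r"^\|?[ \t]*:?-+:?[ \t]*(\|[ \t]*:?-+:?[ \t]*)+\|?[ \t]*$", re.MULTILINE
)


def analyze_structure(text: str) -> dict:
    """Return coarse structural flags for a Markdown document."""
    # Remove fenced code first so '# comment' lines inside code
    # are not mistaken for headers.
    prose = FENCED_BLOCK_RE.sub("", text)
    return {
        "has_code_blocks": FENCED_BLOCK_RE.search(text) is not None,
        "has_headers": HEADER_RE.search(prose) is not None,
        "has_tables": TABLE_RULE_RE.search(prose) is not None,
        # Assumption: ~4 characters per token, a rough heuristic only.
        "estimated_tokens": max(1, len(text) // 4),
    }
```

The flags are deliberately coarse: they are enough to decide which extraction passes are worth running on a document, not to parse it.
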
### Step 3: Extract Key Components
```bash
python scripts/extract_components.py --doc-id 123 --output components.json
```

Extracts:
- Code snippets with language tags
- Key concepts and definitions
- Best practices
- Techniques and patterns

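
As an illustration of the first item, the sketch below pulls fenced code blocks and their language tags out of Markdown. It assumes backtick fences and is not taken from `extract_components.py`; the function name, the returned dictionary keys, and the `"text"` fallback tag are hypothetical.

```python
import re

# Matches a backtick-fenced block; group 1 is the language tag, group 2 the code.
SNIPPET_RE = re.compile(
    r"^```([\w+-]*)[ \t]*\n(.*?)^```[ \t]*$", re.MULTILINE | re.DOTALL
)


def extract_code_snippets(markdown: str) -> list:
    """Return fenced code blocks as {'language', 'code'} dicts, in document order."""
    snippets = []
    for match in SNIPPET_RE.finditer(markdown):
        # Untagged fences get a placeholder tag (an assumption of this sketch).
        language = match.group(1) or "text"
        snippets.append({"language": language, "code": match.group(2).rstrip("\n")})
    return snippets
```

Concepts, best practices, and patterns need more than pattern matching, so they are left out of this sketch.
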
### Step 4: Create Structured Summary
Output template:
```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list with definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}
```

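
One way to fill the template is plain `str.format`. The sketch below is an assumption about how rendering might work, not the project's code. The descriptive placeholders are renamed to hypothetical field names (`executive_summary`, `key_concepts`, `techniques`, `code_examples`, `best_practices`), and the shape of the input dictionary is invented for illustration.

```python
from datetime import date

SUMMARY_TEMPLATE = """\
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{executive_summary}

## Key Concepts
{key_concepts}

## Techniques & Patterns
{techniques}

## Code Examples
{code_examples}

## Best Practices
{best_practices}
"""


def bullet_list(items):
    """Render a list of strings as Markdown bullets."""
    return "\n".join(f"- {item}" for item in items)


def render_summary(doc: dict) -> str:
    """Fill the summary template from a dict of extracted components."""
    return SUMMARY_TEMPLATE.format(
        title=doc["title"],
        url=doc["url"],
        source_type=doc["source_type"],
        credibility_tier=doc["credibility_tier"],
        # Default to today's date when the caller does not supply one.
        date=doc.get("date") or date.today().isoformat(),
        executive_summary=doc["executive_summary"],
        key_concepts=bullet_list(doc["key_concepts"]),
        techniques=bullet_list(doc["techniques"]),
        code_examples=doc["code_examples"],
        best_practices=bullet_list(doc["best_practices"]),
    )
```

Because `str.format` treats braces as placeholders, code examples that contain literal `{` or `}` are safe only when passed in as values, as here, and not when pasted into the template itself.
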
### Step 5: Optimize for Tokens
Target: 25-35% of original token count
```bash
python scripts/optimize_content.py --doc-id 123 --target-ratio 0.30
```

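
To make the target concrete: a 4,000-token source should distill to roughly 1,000-1,400 tokens, with 1,200 as the midpoint at `--target-ratio 0.30`. The helper names below are hypothetical and only sketch the arithmetic. How `optimize_content.py` counts tokens is not specified here.

```python
def token_budget(original_tokens: int, target_ratio: float = 0.30,
                 low: float = 0.25, high: float = 0.35) -> tuple:
    """Return (minimum, target, maximum) token counts for the distilled output."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return (
        round(original_tokens * low),
        round(original_tokens * target_ratio),
        round(original_tokens * high),
    )


def within_target(original_tokens: int, distilled_tokens: int,
                  low: float = 0.25, high: float = 0.35) -> bool:
    """True when the distilled size is 25-35% of the original (bounds inclusive)."""
    return low <= distilled_tokens / original_tokens <= high


print(token_budget(4000))         # (1000, 1200, 1400)
print(within_target(4000, 2000))  # False: 50% is above the 35% ceiling
```
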
### Step 6: Store Distilled Content
```bash
python scripts/store_distilled.py --doc-id 123 --content distilled.md
```

## Quality Metrics

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |

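
The coverage and retention rows can be checked mechanically once a list of important terms and relevant snippets exists. The sketch below assumes both lists are supplied by the caller and uses case-insensitive substring matching for terms, which is a simplification. All function names are hypothetical.

```python
def concept_coverage(important_terms, distilled_text: str) -> float:
    """Fraction of important terms that appear in the distilled text."""
    terms = [term.lower() for term in important_terms]
    if not terms:
        return 1.0
    text = distilled_text.lower()
    return sum(1 for term in terms if term in text) / len(terms)


def snippet_retention(relevant_snippets, distilled_text: str) -> float:
    """Fraction of relevant code snippets that survive verbatim."""
    if not relevant_snippets:
        return 1.0
    kept = sum(1 for snippet in relevant_snippets if snippet.strip() in distilled_text)
    return kept / len(relevant_snippets)


def check_quality(important_terms, relevant_snippets, distilled_text: str) -> dict:
    """Compare both measurements against the targets in the table above."""
    coverage = concept_coverage(important_terms, distilled_text)
    retention = snippet_retention(relevant_snippets, distilled_text)
    return {
        "coverage": coverage,
        "coverage_ok": coverage >= 0.90,    # target: >=90% of important terms
        "retention": retention,
        "retention_ok": retention == 1.0,   # target: 100% of relevant examples
    }
```

Readability has no numeric target in the table, so it is left to the reviewer.
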
## Handling Refactor Requests

When `quality-reviewer` returns `refactor`:
```bash
python scripts/refactor_content.py --distill-id 456 --instructions "Add more examples"
```

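
If the loop is driven from Python, the command above can be assembled from the reviewer's decision. This is only a sketch. `handle_review` and `refactor_command` are hypothetical, and it assumes the decision arrives as a plain string alongside the distill ID and instruction text.

```python
import shlex


def refactor_command(distill_id: int, instructions: str) -> list:
    """Build the argument list for scripts/refactor_content.py."""
    if not instructions.strip():
        raise ValueError("refactor requests need instructions")
    return [
        "python", "scripts/refactor_content.py",
        "--distill-id", str(distill_id),
        "--instructions", instructions,
    ]


def handle_review(decision: str, distill_id: int, instructions: str = ""):
    """Return the command to run for a 'refactor' decision, else None."""
    if decision == "refactor":
        return refactor_command(distill_id, instructions)
    return None


cmd = handle_review("refactor", 456, "Add more examples")
print(shlex.join(cmd))
# python scripts/refactor_content.py --distill-id 456 --instructions 'Add more examples'
```

Keeping the command as an argument list means it can be passed to `subprocess.run` without a shell, so instruction text that contains quotes or spaces needs no manual escaping.
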
## Scripts

- `scripts/load_pending.py` - Load documents pending distillation
- `scripts/extract_components.py` - Extract code, concepts, patterns
- `scripts/optimize_content.py` - Token optimization
- `scripts/store_distilled.py` - Save to database
- `scripts/refactor_content.py` - Handle refactor requests

## Integration

| From | To | Payload |
|------|-----|---------|
| content-repository | content-distiller | Raw document records |
| content-distiller | quality-reviewer | Distilled content |
| quality-reviewer | content-distiller | Refactor instructions (loop back) |