Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00


# Content Distiller
Analyzes raw crawled content and distills it into concise reference material. Extracts key concepts and code snippets, and produces structured summaries.
## Trigger Keywords
"distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary"
## Goals
1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval
3. **Extract** - Pull out code snippets, key concepts, patterns
4. **Annotate** - Add metadata for searchability
## Workflow
### Step 1: Load Raw Content
```bash
python scripts/load_pending.py --output pending_docs.json
```
### Step 2: Analyze Content Structure
Identify document characteristics:
- Has code blocks?
- Has headers?
- Has tables?
- Estimated tokens?
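
The checks above can be sketched in a few lines of Python. This is an illustration only, not the logic of the project's scripts: the function name `analyze_structure`, the returned keys, and the four-characters-per-token estimate are all assumptions.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker
HEADER_RE = re.compile(r"^#{1,6}\s+\S")  # ATX header such as "## Title"
TABLE_SEP_RE = re.compile(r"^\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?$")  # table separator row


def analyze_structure(text: str) -> dict:
    """Report basic structural traits of a Markdown document (hypothetical helper)."""
    in_fence = False
    code_blocks = 0
    header_count = 0
    has_tables = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_fence:
                code_blocks += 1  # count each opening fence once
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # "# comment" lines inside code are not headers
        if HEADER_RE.match(stripped):
            header_count += 1
        elif "|" in stripped and TABLE_SEP_RE.match(stripped):
            has_tables = True
    return {
        "has_code_blocks": code_blocks > 0,
        "code_block_count": code_blocks,
        "has_headers": header_count > 0,
        "header_count": header_count,
        "has_tables": has_tables,
        # Rough heuristic (assumption): about 4 characters per token.
        "estimated_tokens": len(text) // 4,
    }
```

Lines inside fenced blocks are skipped so that shell or Python comments are not miscounted as headers. A real tokenizer would give a more accurate token count than the character heuristic.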
### Step 3: Extract Key Components
```bash
python scripts/extract_components.py --doc-id 123 --output components.json
```
Extracts:
- Code snippets with language tags
- Key concepts and definitions
- Best practices
- Techniques and patterns
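
As an illustration of the first item, fenced code snippets and their language tags can be pulled out with a regular expression. This is a minimal sketch, not the behavior of `extract_components.py`: the helper name `extract_snippets`, the output keys, and the `"text"` fallback for untagged fences are assumptions.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker
SNIPPET_RE = re.compile(
    rf"^{FENCE}([\w+-]*)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_snippets(markdown: str) -> list:
    """Return fenced code blocks as dicts with 'language' and 'code' keys."""
    snippets = []
    for match in SNIPPET_RE.finditer(markdown):
        snippets.append({
            # Untagged fences fall back to "text" (an assumption of this sketch).
            "language": match.group(1) or "text",
            "code": match.group(2).rstrip("\n"),
        })
    return snippets
```

The pattern only handles three-backtick fences that start at the beginning of a line. Indented or tilde-fenced blocks would need a proper Markdown parser.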
### Step 4: Create Structured Summary
Output template:
```markdown
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}
## Executive Summary
{2-3 sentence overview}
## Key Concepts
{bulleted list with definitions}
## Techniques & Patterns
{extracted techniques with use cases}
## Code Examples
{relevant code snippets}
## Best Practices
{actionable recommendations}
```
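
One way to fill this template is Python's `str.format`. The sketch below is hedged: `render_summary`, `bullets`, and the section field names (`summary`, `key_concepts`, and so on) are hypothetical and stand in for the descriptive placeholders above.

```python
from datetime import date

# Header fields mirror the template; section field names are illustrative.
TEMPLATE = """\
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {distilled}
## Executive Summary
{summary}
## Key Concepts
{key_concepts}
## Techniques & Patterns
{techniques}
## Code Examples
{code_examples}
## Best Practices
{best_practices}
"""


def bullets(items) -> str:
    """Render an iterable of strings as a Markdown bulleted list."""
    return "\n".join(f"- {item}" for item in items)


def render_summary(meta, summary, concepts, techniques, code_examples,
                   practices, distilled=None) -> str:
    """Fill the template; 'concepts' maps each term to its definition."""
    return TEMPLATE.format(
        title=meta["title"],
        url=meta["url"],
        source_type=meta["source_type"],
        credibility_tier=meta["credibility_tier"],
        distilled=(distilled or date.today()).isoformat(),
        summary=summary,
        key_concepts=bullets(f"**{t}**: {d}" for t, d in concepts.items()),
        techniques=bullets(techniques),
        code_examples=code_examples,
        best_practices=bullets(practices),
    )
```

Only the template string is parsed for `{...}` fields, so braces inside the inserted code examples are passed through unchanged.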
### Step 5: Optimize for Tokens
Target: 25-35% of original token count
```bash
python scripts/optimize_content.py --doc-id 123 --target-ratio 0.30
```
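
The target can be checked with simple arithmetic: a 12,000-token source distilled to 3,600 tokens has a ratio of 3,600 / 12,000 = 0.30. A minimal sketch follows; the function names and the inclusive bounds are assumptions, not part of `optimize_content.py`.

```python
def compression_ratio(original_tokens: int, distilled_tokens: int) -> float:
    """Distilled size as a fraction of the original size."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return distilled_tokens / original_tokens


def within_target(ratio: float, low: float = 0.25, high: float = 0.35) -> bool:
    """True when the ratio falls in the 25-35% band (bounds inclusive)."""
    return low <= ratio <= high


ratio = compression_ratio(12_000, 3_600)
print(f"{ratio:.2f}", within_target(ratio))  # prints "0.30 True"
```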
### Step 6: Store Distilled Content
```bash
python scripts/store_distilled.py --doc-id 123 --content distilled.md
```
## Quality Metrics
| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
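
Key concept coverage can be estimated by checking how many important terms survive in the distilled text. The sketch below assumes the caller already has a list of important terms, since this document does not say how they are chosen; `concept_coverage` is a hypothetical helper.

```python
import re


def concept_coverage(important_terms, distilled_text):
    """Return (coverage, missing) using case-insensitive whole-phrase matches."""
    # Normalize and de-duplicate terms while keeping their order.
    terms = [t for t in dict.fromkeys(t.strip().lower() for t in important_terms) if t]
    if not terms:
        return 1.0, []
    haystack = distilled_text.lower()
    missing = [
        t for t in terms
        if not re.search(rf"(?<!\w){re.escape(t)}(?!\w)", haystack)
    ]
    return (len(terms) - len(missing)) / len(terms), missing


coverage, missing = concept_coverage(
    ["Token", "compression ratio", "embedding"],
    "The compression ratio counts each token.",
)
print(missing)  # prints ['embedding']
```

A result passes the metric when `coverage >= 0.90`. Matching is literal, so plural or inflected forms (for example "tokens" for "token") count as missing unless the terms are normalized first.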
## Handling Refactor Requests
When `quality-reviewer` returns `refactor`:
```bash
python scripts/refactor_content.py --distill-id 456 --instructions "Add more examples"
```
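
When this command is triggered from code, passing the arguments as a list avoids shell quoting problems with free-form instructions. The helper below is a hypothetical sketch: only the script path and the `--distill-id` and `--instructions` flags come from the command above.

```python
import sys


def build_refactor_command(distill_id: int, instructions: str) -> list:
    """Build the argv list for the refactor script shown above."""
    if not instructions.strip():
        raise ValueError("refactor instructions must not be empty")
    return [
        sys.executable,  # the current Python interpreter
        "scripts/refactor_content.py",
        "--distill-id", str(distill_id),
        "--instructions", instructions,  # passed as one argument, no shell quoting
    ]


cmd = build_refactor_command(456, 'Add more examples for "edge cases"')
# To execute it: subprocess.run(cmd, check=True)
```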
## Scripts
- `scripts/load_pending.py` - Load documents pending distillation
- `scripts/extract_components.py` - Extract code, concepts, patterns
- `scripts/optimize_content.py` - Token optimization
- `scripts/store_distilled.py` - Save to database
- `scripts/refactor_content.py` - Handle refactor requests
## Integration
| From | To | Payload |
|------|-----|---------|
| content-repository | content-distiller | Raw document records |
| content-distiller | quality-reviewer | Distilled content |
| quality-reviewer | content-distiller | Refactor instructions (loop back) |