
Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-29 00:20:27 +07:00

2.6 KiB


Content Distiller

Analyzes raw crawled content and distills it into concise reference material. Extracts key concepts and code snippets, then builds a structured summary.

Trigger Keywords

"distill content", "summarize document", "extract key concepts", "process raw content", "create reference summary"

Goals

Compress - Reduce token count while preserving essential information
Structure - Organize content for easy retrieval
Extract - Pull out code snippets, key concepts, patterns
Annotate - Add metadata for searchability

Workflow

Step 1: Load Raw Content

python scripts/load_pending.py --output pending_docs.json

Step 2: Analyze Content Structure

Identify document characteristics:

Has code blocks?
Has headers?
Has tables?
Estimated tokens?
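
The four checks above can be sketched as a single pass over the raw text. This is a minimal illustration, not the project's analysis code: it assumes the raw content is Markdown, the function name `analyze_structure` is hypothetical, and the token estimate uses the rough rule of thumb of about four characters per token rather than a real tokenizer.

```python
import re

# Built this way so the marker never appears literally inside this example.
FENCE = "`" * 3


def analyze_structure(text: str) -> dict:
    """Report code blocks, headers, tables, and a rough token estimate."""
    in_code = False
    fences = 0
    headers = 0
    table_rows = 0
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            fences += 1
            in_code = not in_code
            continue
        if in_code:
            continue  # ignore "#" comments and pipes inside code
        if re.match(r"#{1,6}\s", line):
            headers += 1
        elif line.lstrip().startswith("|"):
            table_rows += 1
    return {
        "has_code_blocks": fences >= 2,
        "has_headers": headers > 0,
        "has_tables": table_rows >= 2,
        # Assumption: roughly 4 characters per token for English text.
        "estimated_tokens": max(1, len(text) // 4),
    }
```

Skipping lines inside fences keeps shell or Python comments that start with `#` from being counted as headers.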

Step 3: Extract Key Components

python scripts/extract_components.py --doc-id 123 --output components.json

Extracts:

Code snippets with language tags
Key concepts and definitions
Best practices
Techniques and patterns
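
As an illustration of the first item, code snippets and their language tags can be pulled from fenced blocks with a regular expression. This is only a sketch under the assumption that the raw content is Markdown; `extract_code_snippets` and the `"text"` fallback tag are hypothetical and do not describe what `extract_components.py` actually does.

```python
import re

# Built this way so the marker never appears literally inside this example.
FENCE = "`" * 3

# Opening fence with an optional language tag, the body, then a closing fence.
CODE_BLOCK = re.compile(
    rf"^{FENCE}([\w+-]*)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_code_snippets(markdown: str) -> list[dict]:
    """Return every fenced code block together with its language tag."""
    snippets = []
    for match in CODE_BLOCK.finditer(markdown):
        snippets.append(
            {
                # Untagged blocks get a neutral placeholder tag.
                "language": match.group(1) or "text",
                "code": match.group(2).rstrip("\n"),
            }
        )
    return snippets
```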

Step 4: Create Structured Summary

Output template:

# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list with definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}
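
One way to fill this template is plain `str.format`. The sketch below is illustrative only: the header fields (`title`, `url`, `source_type`, `credibility_tier`, `date`) come from the template, while the section field names (`summary`, `key_concepts`, `techniques`, `code_examples`, `best_practices`) and the helper names are assumptions made for the example.

```python
from datetime import date

TEMPLATE = """\
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{summary}

## Key Concepts
{key_concepts}

## Techniques & Patterns
{techniques}

## Code Examples
{code_examples}

## Best Practices
{best_practices}
"""


def bullets(items):
    """Render a list of strings as a Markdown bulleted list."""
    return "\n".join(f"- {item}" for item in items)


def render_summary(doc: dict) -> str:
    """Fill the summary template from a dict of extracted components."""
    return TEMPLATE.format(
        title=doc["title"],
        url=doc["url"],
        source_type=doc["source_type"],
        credibility_tier=doc["credibility_tier"],
        # Default to today's date when the caller does not supply one.
        date=doc.get("date") or date.today().isoformat(),
        summary=doc["summary"],
        key_concepts=bullets(doc["key_concepts"]),
        techniques=bullets(doc["techniques"]),
        code_examples=doc["code_examples"],  # already rendered text
        best_practices=bullets(doc["best_practices"]),
    )
```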

Step 5: Optimize for Tokens

Target: 25-35% of original token count

python scripts/optimize_content.py --doc-id 123 --target-ratio 0.30
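
The target is a simple ratio of distilled tokens to original tokens. As a worked example, a 12,000-token source distilled to 3,600 tokens gives 3,600 / 12,000 = 0.30, which sits inside the 25-35% band. The helper names below are hypothetical and are not taken from `optimize_content.py`.

```python
def compression_ratio(original_tokens: int, distilled_tokens: int) -> float:
    """Return distilled size as a fraction of the original size."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return distilled_tokens / original_tokens


def within_target(ratio: float, low: float = 0.25, high: float = 0.35) -> bool:
    """Check a ratio against the 25-35% target band (inclusive)."""
    return low <= ratio <= high


ratio = compression_ratio(12_000, 3_600)
print(f"{ratio:.2f}", within_target(ratio))
```

A ratio above the band means the summary is still too long; a ratio below it suggests that content may have been cut too aggressively.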

Step 6: Store Distilled Content

python scripts/store_distilled.py --doc-id 123 --content distilled.md

Quality Metrics

Metric	Target
Compression Ratio	25-35% of original
Key Concept Coverage	≥90% of important terms
Code Snippet Retention	100% of relevant examples
Readability	Clear, scannable structure
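
Key concept coverage can be approximated as the share of important terms that still appear in the distilled text. The sketch below assumes a list of important terms is already available (for instance from Step 3) and uses case-insensitive substring matching, which is a simplification; `concept_coverage` is a hypothetical helper, not part of the listed scripts.

```python
def concept_coverage(important_terms, distilled_text: str) -> float:
    """Fraction of important terms that survive in the distilled text."""
    terms = {term.strip().lower() for term in important_terms if term.strip()}
    if not terms:
        return 1.0  # nothing to cover
    haystack = distilled_text.lower()
    found = sum(1 for term in terms if term in haystack)
    return found / len(terms)


terms = ["prompt caching", "TTL", "cache key", "eviction"]
text = "Prompt caching stores prefixes; each cache key has a TTL."
coverage = concept_coverage(terms, text)
print(coverage, coverage >= 0.90)
```

Here three of the four terms survive, so coverage is 0.75 and the document would miss the ≥90% target.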

Handling Refactor Requests

When quality-reviewer returns a refactor decision, rerun distillation with the reviewer's instructions:

python scripts/refactor_content.py --distill-id 456 --instructions "Add more examples"
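
If the refactor step is triggered from Python rather than typed by hand, the command can be assembled as an argument list so that instructions containing quotes or spaces are passed through safely. This is a sketch only: `build_refactor_command` is a hypothetical helper, and the flags are the two shown in the command above.

```python
import sys


def build_refactor_command(distill_id: int, instructions: str) -> list[str]:
    """Build the argument list for a refactor run of one distilled document."""
    cleaned = instructions.strip()
    if not cleaned:
        raise ValueError("refactor instructions must not be empty")
    return [
        sys.executable,  # run with the current Python interpreter
        "scripts/refactor_content.py",
        "--distill-id",
        str(distill_id),
        "--instructions",
        cleaned,
    ]


cmd = build_refactor_command(456, "Add more examples")
print(cmd[1:])
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` without any shell quoting.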

Scripts

scripts/load_pending.py - Load documents pending distillation
scripts/extract_components.py - Extract code, concepts, patterns
scripts/optimize_content.py - Token optimization
scripts/store_distilled.py - Save to database
scripts/refactor_content.py - Handle refactor requests

Integration

Direction	Skill	Data
From	content-repository	Raw document records
To	quality-reviewer	Distilled content
From	quality-reviewer	Refactor instructions (loop back)
