feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Commit 6d7a6d7a88 (parent e80056ae8a), 2026-01-29 00:20:27 +07:00
26 changed files with 4486 additions and 1 deletion


@@ -0,0 +1,92 @@
---
description: Analyze and summarize stored documents. Extracts key concepts, code snippets, and creates structured content.
argument-hint: <doc-id|all-pending> [--focus keywords] [--max-tokens 2000]
allowed-tools: Read, Write, Bash, Glob, Grep
---
# Content Distiller
Analyze, summarize, and extract key information from stored documents.
## Arguments
- `<doc-id|all-pending>`: Specific document ID or process all pending
- `--focus`: Keywords to emphasize in distillation
- `--max-tokens`: Target token count for distilled output (default: 2000)
## Distillation Process
### 1. Load Raw Content
```bash
source ~/.envrc
# Get document path
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -N -e \
"SELECT raw_content_path FROM documents WHERE doc_id = $DOC_ID"
```
### 2. Analyze Content
Extract:
- **Summary**: 2-3 sentence executive summary
- **Key Concepts**: Important terms with definitions
- **Code Snippets**: Relevant code examples
- **Structured Content**: Full distilled markdown
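The code-snippet extraction above can be bootstrapped mechanically before the model fills in descriptions. A minimal sketch (the `extract_code_snippets` helper is illustrative, not part of the skill):

```python
import re

def extract_code_snippets(markdown: str) -> list[dict]:
    """Pull fenced code blocks out of raw markdown into the output schema."""
    pattern = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)
    snippets = []
    for lang, code in pattern.findall(markdown):
        snippets.append({
            "language": lang or "text",
            "description": "",  # filled in during analysis
            "code": code.strip(),
        })
    return snippets
```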
### 3. Output Format
```json
{
"summary": "Executive summary of the document...",
"key_concepts": [
{"term": "System Prompt", "definition": "..."},
{"term": "Context Window", "definition": "..."}
],
"code_snippets": [
{"language": "python", "description": "...", "code": "..."}
],
"structured_content": "# Title\n\n## Overview\n..."
}
```
### 4. Store Distilled Content
```sql
INSERT INTO distilled_content
(doc_id, summary, key_concepts, code_snippets, structured_content,
token_count_original, token_count_distilled, distill_model, review_status)
VALUES
(?, ?, ?, ?, ?, ?, ?, 'claude-opus-4', 'pending');
```
### 5. Calculate Metrics
- `token_count_original`: Tokens in raw content
- `token_count_distilled`: Tokens in output
- `compression_ratio`: Auto-calculated (distilled/original * 100)
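A sketch of the metric calculation, using a whitespace split as a rough token estimate (the real pipeline would presumably use a proper tokenizer):

```python
def compression_metrics(original: str, distilled: str) -> dict:
    """Compute the three metrics stored with each distillation."""
    orig = len(original.split())   # rough token count
    dist = len(distilled.split())
    return {
        "token_count_original": orig,
        "token_count_distilled": dist,
        "compression_ratio": round(dist / orig * 100, 1) if orig else 0.0,
    }
```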
## Distillation Guidelines
**For Prompt Engineering Content:**
- Emphasize techniques and patterns
- Include before/after examples
- Extract actionable best practices
- Note model-specific behaviors
**For API Documentation:**
- Focus on endpoint signatures
- Include request/response examples
- Note rate limits and constraints
- Extract error handling patterns
**For Code Repositories:**
- Summarize architecture
- Extract key functions/classes
- Note dependencies
- Include usage examples
## Example Usage
```
/content-distiller 42
/content-distiller all-pending --focus "system prompts"
/content-distiller 15 --max-tokens 3000
```


@@ -0,0 +1,94 @@
---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: <action> [--doc-id N] [--source-id N]
allowed-tools: Bash, Read, Write, Glob, Grep
---
# Content Repository
Manage crawled content in the MySQL database.
## Arguments
- `<action>`: store | list | get | update | delete | stats
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID
## Database Connection
```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```
## Actions
### store
Store new documents from crawl output:
```bash
# Read crawl manifest
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json
# Insert each document into the documents table
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e \
  "INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
   VALUES (...);"
```
### list
List documents with filters:
```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```
### get
Retrieve specific document:
```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```
### stats
Show repository statistics:
```sql
SELECT
COUNT(*) as total_docs,
SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) as completed,
SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) as pending
FROM documents;
```
## Deduplication
Documents are deduplicated by URL hash:
```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```
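The same hash can be computed client-side before querying, since MySQL's `SHA2(url, 256)` is a standard SHA-256 hex digest:

```python
import hashlib

def url_hash(url: str) -> str:
    """Hex digest matching MySQL's SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

This lets a crawler skip URLs already present without round-tripping the full URL comparison to the database.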
## Version Tracking
When content changes:
```sql
-- Create new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;
-- Mark old as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```
## Schema Reference
Key tables:
- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions
Views:
- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export


@@ -0,0 +1,138 @@
---
description: Export approved content to markdown files or JSONL for fine-tuning. Generates structured output with cross-references.
argument-hint: <format> [--topic slug] [--min-score 0.80]
allowed-tools: Read, Write, Bash, Glob, Grep
---
# Markdown Exporter
Export approved content to structured formats.
## Arguments
- `<format>`: project_files | fine_tuning | knowledge_base
- `--topic`: Filter by topic slug (e.g., "prompt-engineering")
- `--min-score`: Minimum quality score (default: 0.80)
## Export Formats
### project_files (Claude Projects)
Output structure:
```
~/reference-library/exports/
├── INDEX.md # Master index
└── {topic-slug}/
├── _index.md # Topic overview
├── {document-1}.md
└── {document-2}.md
```
**INDEX.md format:**
```markdown
# Reference Library Index
Generated: {timestamp}
Total Documents: {count}
## Topics
### [Prompt Engineering](./prompt-engineering/)
{count} documents | Last updated: {date}
### [Claude Models](./claude-models/)
{count} documents | Last updated: {date}
```
**Document format:**
```markdown
---
source: {source_name}
url: {original_url}
credibility: {tier}
quality_score: {score}
exported: {timestamp}
---
# {title}
{structured_content}
## Related Documents
- [Related Doc 1](./related-1.md)
- [Related Doc 2](./related-2.md)
```
### fine_tuning (JSONL)
Output: `~/reference-library/exports/fine_tuning_{timestamp}.jsonl`
```json
{"messages": [
{"role": "system", "content": "You are an expert on AI and prompt engineering."},
{"role": "user", "content": "Explain {topic}"},
{"role": "assistant", "content": "{structured_content}"}
]}
```
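One JSONL line per approved document can be assembled from the template above. A minimal sketch (the `to_jsonl_line` helper is illustrative; the system prompt is the one shown in the template):

```python
import json

def to_jsonl_line(topic: str, structured_content: str) -> str:
    """Build one fine-tuning record in the format above."""
    record = {"messages": [
        {"role": "system", "content": "You are an expert on AI and prompt engineering."},
        {"role": "user", "content": f"Explain {topic}"},
        {"role": "assistant", "content": structured_content},
    ]}
    return json.dumps(record)
```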
### knowledge_base (Flat)
Single consolidated file with table of contents.
## Export Process
### 1. Query Approved Content
```sql
SELECT dc.*, d.title, d.url, s.source_name, s.credibility_tier, t.topic_slug
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
JOIN sources s ON d.source_id = s.source_id
LEFT JOIN document_topics dt ON d.doc_id = dt.doc_id
LEFT JOIN topics t ON dt.topic_id = t.topic_id
WHERE dc.review_status = 'approved'
AND (SELECT MAX(quality_score) FROM review_logs WHERE distill_id = dc.distill_id) >= ?;
```
### 2. Generate Cross-References
Find related documents by:
- Shared topics
- Overlapping key concepts
- Same source
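A sketch of the relatedness ranking, scoring candidates by shared topics and overlapping key-concept terms (field names here are assumptions, not the actual schema):

```python
def related_docs(doc: dict, candidates: list[dict], min_shared: int = 1) -> list[dict]:
    """Rank candidate documents by topic and key-concept overlap."""
    def overlap(a, b):
        return (len(set(a["topics"]) & set(b["topics"]))
                + len(set(a["concepts"]) & set(b["concepts"])))
    scored = [(overlap(doc, c), c) for c in candidates if c["doc_id"] != doc["doc_id"]]
    # Highest overlap first; drop anything below the threshold
    return [c for s, c in sorted(scored, key=lambda x: -x[0]) if s >= min_shared]
```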
### 3. Write Files
```bash
mkdir -p ~/reference-library/exports/{topic-slug}
```
### 4. Log Export Job
```sql
INSERT INTO export_jobs
(export_name, export_type, output_format, topic_filter,
min_quality_score, output_path, total_documents, status)
VALUES
(?, ?, ?, ?, ?, ?, ?, 'completed');
```
## Configuration
From `~/.config/reference-curator/export_config.yaml`:
```yaml
output:
base_path: ~/reference-library/exports/
project_files:
structure: nested_by_topic
include_metadata: true
quality:
min_score_for_export: 0.80
auto_approve_tier1_sources: true
```
## Example Usage
```
/markdown-exporter project_files
/markdown-exporter fine_tuning --topic prompt-engineering
/markdown-exporter project_files --min-score 0.90
```


@@ -0,0 +1,122 @@
---
description: Review distilled content quality. Multi-criteria scoring with decision routing (approve/refactor/deep_research/reject).
argument-hint: <distill-id|all-pending> [--auto-approve] [--threshold 0.85]
allowed-tools: Read, Write, Bash, Glob, Grep
---
# Quality Reviewer
Review distilled content for quality and route decisions.
## Arguments
- `<distill-id|all-pending>`: Specific distill ID or review all pending
- `--auto-approve`: Auto-approve scores above threshold
- `--threshold`: Approval threshold (default: 0.85)
## Review Criteria
### Scoring Dimensions
| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 25% | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 20% | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 20% | Clear structure, concise language, logical flow |
| **Prompt Engineering Quality** | 25% | Demonstrates techniques, shows before/after, actionable |
| **Usability** | 10% | Easy to reference, searchable keywords, appropriate length |
### Score Calculation
```python
score = (
accuracy * 0.25 +
completeness * 0.20 +
clarity * 0.20 +
prompt_eng_quality * 0.25 +
usability * 0.10
)
```
## Decision Thresholds
| Score | Decision | Action |
|-------|----------|--------|
| ≥ 0.85 | **APPROVE** | Ready for export |
| 0.60-0.84 | **REFACTOR** | Re-distill with feedback |
| 0.40-0.59 | **DEEP_RESEARCH** | Gather more sources |
| < 0.40 | **REJECT** | Archive (low quality) |
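The threshold table maps directly to a routing function:

```python
def route(score: float) -> str:
    """Map an overall quality score to a review decision per the table above."""
    if score >= 0.85:
        return "approve"
    if score >= 0.60:
        return "refactor"
    if score >= 0.40:
        return "deep_research"
    return "reject"
```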
## Review Process
### 1. Load Distilled Content
```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e \
"SELECT * FROM distilled_content WHERE distill_id = $ID"
```
### 2. Evaluate Each Criterion
Score 0.0 to 1.0 for each dimension.
### 3. Generate Assessment
```json
{
"accuracy": 0.90,
"completeness": 0.85,
"clarity": 0.95,
"prompt_engineering_quality": 0.88,
"usability": 0.82,
"overall_score": 0.88,
"decision": "approve",
"feedback": "Well-structured with clear examples...",
"refactor_instructions": null
}
```
### 4. Log Review
```sql
INSERT INTO review_logs
(distill_id, review_round, reviewer_type, quality_score,
assessment, decision, feedback, refactor_instructions)
VALUES
(?, 1, 'claude_review', ?, ?, ?, ?, ?);
```
### 5. Update Status
```sql
UPDATE distilled_content
SET review_status = 'approved'
WHERE distill_id = ?;
```
## Decision Routing
**APPROVE → markdown-exporter**
Content is ready for export.
**REFACTOR → content-distiller**
Re-distill with specific feedback:
```json
{"refactor_instructions": "Add more code examples for the API authentication section"}
```
**DEEP_RESEARCH → web-crawler**
Need more sources:
```json
{"research_queries": ["Claude API authentication examples", "Anthropic SDK best practices"]}
```
**REJECT → Archive**
Mark as rejected, optionally note reason.
## Example Usage
```
/quality-reviewer 15
/quality-reviewer all-pending --auto-approve
/quality-reviewer 42 --threshold 0.80
```


@@ -0,0 +1,72 @@
---
description: Search and discover authoritative reference sources for a topic. Validates credibility, generates URL manifests for crawling.
argument-hint: <topic> [--vendor anthropic|openai|google] [--max-sources 10]
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob
---
# Reference Discovery
Search for authoritative reference sources on a given topic.
## Arguments
- `<topic>`: Required. The subject to find references for (e.g., "Claude system prompts")
- `--vendor`: Filter to specific vendor (anthropic, openai, google)
- `--max-sources`: Maximum sources to discover (default: 10)
## Workflow
### 1. Search Strategy
Use multiple search approaches:
- Official documentation sites
- Engineering blogs
- GitHub repositories
- Research papers
- Community guides
### 2. Source Validation
Evaluate each source for credibility:
| Tier | Description | Examples |
|------|-------------|----------|
| tier1_official | Vendor documentation | docs.anthropic.com |
| tier2_verified | Verified engineering blogs | anthropic.com/news |
| tier3_community | Community resources | GitHub repos, tutorials |
### 3. Output Manifest
Generate `manifest.json` in working directory:
```json
{
"topic": "user provided topic",
"discovered_at": "ISO timestamp",
"sources": [
{
"url": "https://...",
"title": "Page title",
"source_type": "official_docs",
"credibility_tier": "tier1_official",
"vendor": "anthropic"
}
]
}
```
### 4. Store Sources
Insert discovered sources into MySQL:
```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```
Use the `sources` table schema from `~/.config/reference-curator/`.
## Example Usage
```
/reference-discovery Claude's system prompt best practices
/reference-discovery MCP server development --vendor anthropic
/reference-discovery prompt engineering --max-sources 20
```


@@ -0,0 +1,79 @@
---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---
# Web Crawler Orchestrator
Crawl web content with intelligent backend selection.
## Arguments
- `<url|manifest>`: Single URL or path to manifest.json from reference-discovery
- `--crawler`: Force specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)
## Intelligent Crawler Selection
Auto-select based on site characteristics:
| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |
### Detection Flow
```
[URL] → Is SPA/React/Vue? → Firecrawl
→ >200 pages? → Scrapy
→ Needs SEO? → aiohttp
→ Default → Node.js
```
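The detection flow reduces to a small selection function; the input signals (SPA detection, page-count estimate, SEO requirement) are assumed to be computed upstream:

```python
def select_crawler(is_spa: bool, est_pages: int, needs_seo: bool) -> str:
    """Mirror the detection flow above: checks run in priority order."""
    if is_spa:
        return "firecrawl"   # JS-rendered content needs a headless backend
    if est_pages > 200:
        return "scrapy"      # enterprise-scale crawls
    if needs_seo:
        return "aiohttp"     # technical docs with SEO data
    return "nodejs"          # default for small static sites
```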
## Crawler Commands
**Node.js:**
```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```
**Python aiohttp:**
```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```
**Scrapy:**
```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```
**Firecrawl MCP:**
Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`
## Output
Save crawled content to `~/reference-library/raw/YYYY/MM/`:
- One markdown file per page
- Filename: `{url_hash}.md`
Generate crawl manifest:
```json
{
"crawl_date": "ISO timestamp",
"crawler_used": "nodejs",
"total_crawled": 45,
"documents": [...]
}
```
## Rate Limiting
All crawlers respect:
- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
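One common scheme for the backoff policy above is capped exponential delay with jitter; the base and cap values here are assumptions, not mandated by the skill:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before retrying after a 429/5xx response.

    Doubles per attempt, capped, with multiplicative jitter to avoid
    synchronized retries across the 3 concurrent requests.
    """
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
```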