feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,92 @@
---
description: Analyze and summarize stored documents. Extracts key concepts, code snippets, and creates structured content.
argument-hint: <doc-id|all-pending> [--focus keywords] [--max-tokens 2000]
allowed-tools: Read, Write, Bash, Glob, Grep
---

# Content Distiller

Analyze, summarize, and extract key information from stored documents.

## Arguments

- `<doc-id|all-pending>`: Specific document ID or process all pending
- `--focus`: Keywords to emphasize in distillation
- `--max-tokens`: Target token count for distilled output (default: 2000)

## Distillation Process

### 1. Load Raw Content

```bash
source ~/.envrc
# Get document path
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -N -e \
  "SELECT raw_content_path FROM documents WHERE doc_id = $DOC_ID"
```

### 2. Analyze Content

Extract:
- **Summary**: 2-3 sentence executive summary
- **Key Concepts**: Important terms with definitions
- **Code Snippets**: Relevant code examples
- **Structured Content**: Full distilled markdown

### 3. Output Format

```json
{
  "summary": "Executive summary of the document...",
  "key_concepts": [
    {"term": "System Prompt", "definition": "..."},
    {"term": "Context Window", "definition": "..."}
  ],
  "code_snippets": [
    {"language": "python", "description": "...", "code": "..."}
  ],
  "structured_content": "# Title\n\n## Overview\n..."
}
```
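
A minimal sketch of assembling and shape-checking this record before it is stored; the validation rules here are illustrative, not part of the command spec:

```python
import json

def build_distillation_record(summary, key_concepts, code_snippets, structured_content):
    """Assemble the distiller's output record and sanity-check its shape."""
    record = {
        "summary": summary,
        "key_concepts": list(key_concepts),
        "code_snippets": list(code_snippets),
        "structured_content": structured_content,
    }
    for concept in record["key_concepts"]:
        if not {"term", "definition"} <= concept.keys():
            raise ValueError(f"malformed key concept: {concept!r}")
    for snippet in record["code_snippets"]:
        if not {"language", "description", "code"} <= snippet.keys():
            raise ValueError(f"malformed code snippet: {snippet!r}")
    return record

record = build_distillation_record(
    "Short executive summary.",
    [{"term": "System Prompt", "definition": "Instructions given to the model."}],
    [{"language": "python", "description": "demo", "code": "print('hi')"}],
    "# Title\n\n## Overview\n...",
)
print(json.dumps(record)[:40])
```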

### 4. Store Distilled Content

```sql
INSERT INTO distilled_content
  (doc_id, summary, key_concepts, code_snippets, structured_content,
   token_count_original, token_count_distilled, distill_model, review_status)
VALUES
  (?, ?, ?, ?, ?, ?, ?, 'claude-opus-4', 'pending');
```

### 5. Calculate Metrics

- `token_count_original`: Tokens in raw content
- `token_count_distilled`: Tokens in output
- `compression_ratio`: Auto-calculated (distilled/original * 100)
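
A sketch of the metric calculation; the whitespace-split token count is a rough stand-in for a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: whitespace-delimited words.
    return len(text.split())

def compression_ratio(original: str, distilled: str) -> float:
    """Distilled size as a percentage of the original (distilled/original * 100)."""
    orig = estimate_tokens(original)
    if orig == 0:
        return 0.0
    return estimate_tokens(distilled) / orig * 100

print(round(compression_ratio("one two three four", "one two"), 1))  # → 50.0
```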

## Distillation Guidelines

**For Prompt Engineering Content:**
- Emphasize techniques and patterns
- Include before/after examples
- Extract actionable best practices
- Note model-specific behaviors

**For API Documentation:**
- Focus on endpoint signatures
- Include request/response examples
- Note rate limits and constraints
- Extract error handling patterns

**For Code Repositories:**
- Summarize architecture
- Extract key functions/classes
- Note dependencies
- Include usage examples

## Example Usage

```
/content-distiller 42
/content-distiller all-pending --focus "system prompts"
/content-distiller 15 --max-tokens 3000
```
@@ -0,0 +1,94 @@
---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: <action> [--doc-id N] [--source-id N]
allowed-tools: Bash, Read, Write, Glob, Grep
---

# Content Repository

Manage crawled content in the MySQL database.

## Arguments

- `<action>`: store | list | get | update | delete | stats
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID

## Database Connection

```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```

## Actions

### store

Store new documents from crawl output. First read the crawl manifest:

```bash
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json
```

Then insert each document into the `documents` table:

```sql
INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
VALUES (...);
```
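
The store step can be sketched in Python; the per-document manifest fields used here (`source_id`, `title`, `url`, `raw_content_path`) are assumptions about the crawl manifest's shape, not a documented schema:

```python
import json

INSERT_SQL = (
    "INSERT INTO documents "
    "(source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)

def manifest_to_rows(manifest: dict) -> list:
    """Turn crawl-manifest entries into parameter tuples for cursor.executemany()."""
    rows = []
    for doc in manifest.get("documents", []):
        rows.append((
            doc["source_id"],
            doc["title"],
            doc["url"],
            doc.get("doc_type", "web_page"),  # assumed default
            doc["raw_content_path"],
            manifest["crawl_date"],
            "completed",
        ))
    return rows

manifest = json.loads("""{
  "crawl_date": "2025-01-15T00:00:00Z",
  "documents": [{"source_id": 1, "title": "Doc", "url": "https://example.com",
                 "raw_content_path": "raw/2025/01/abc.md"}]
}""")
rows = manifest_to_rows(manifest)
print(len(rows))  # → 1
```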

### list

List documents with filters:

```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```

### get

Retrieve a specific document:

```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```

### stats

Show repository statistics:

```sql
SELECT
  COUNT(*) AS total_docs,
  SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) AS completed,
  SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) AS pending
FROM documents;
```

## Deduplication

Documents are deduplicated by URL hash:

```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```
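
The same hash can be computed client-side: MySQL's `SHA2(url, 256)` is the hex-encoded SHA-256 digest, which matches Python's `hashlib`:

```python
import hashlib

def url_hash(url: str) -> str:
    """Hex SHA-256 of the URL; matches MySQL's SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

print(url_hash("https://docs.anthropic.com")[:16])
```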

## Version Tracking

When content changes:

```sql
-- Create new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;

-- Mark old as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```

## Schema Reference

Key tables:
- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions

Views:
- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export
custom-skills/90-reference-curator/commands/markdown-exporter.md (new file, 138 lines)
@@ -0,0 +1,138 @@
---
description: Export approved content to markdown files or JSONL for fine-tuning. Generates structured output with cross-references.
argument-hint: <format> [--topic slug] [--min-score 0.80]
allowed-tools: Read, Write, Bash, Glob, Grep
---

# Markdown Exporter

Export approved content to structured formats.

## Arguments

- `<format>`: project_files | fine_tuning | knowledge_base
- `--topic`: Filter by topic slug (e.g., "prompt-engineering")
- `--min-score`: Minimum quality score (default: 0.80)

## Export Formats

### project_files (Claude Projects)

Output structure:

```
~/reference-library/exports/
├── INDEX.md            # Master index
└── {topic-slug}/
    ├── _index.md       # Topic overview
    ├── {document-1}.md
    └── {document-2}.md
```

**INDEX.md format:**

```markdown
# Reference Library Index

Generated: {timestamp}
Total Documents: {count}

## Topics

### [Prompt Engineering](./prompt-engineering/)
{count} documents | Last updated: {date}

### [Claude Models](./claude-models/)
{count} documents | Last updated: {date}
```

**Document format:**

```markdown
---
source: {source_name}
url: {original_url}
credibility: {tier}
quality_score: {score}
exported: {timestamp}
---

# {title}

{structured_content}

## Related Documents
- [Related Doc 1](./related-1.md)
- [Related Doc 2](./related-2.md)
```

### fine_tuning (JSONL)

Output: `~/reference-library/exports/fine_tuning_{timestamp}.jsonl`

Each record is a single JSON object per line (shown expanded here for readability):

```json
{"messages": [
  {"role": "system", "content": "You are an expert on AI and prompt engineering."},
  {"role": "user", "content": "Explain {topic}"},
  {"role": "assistant", "content": "{structured_content}"}
]}
```

### knowledge_base (Flat)

Single consolidated file with table of contents.
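
A minimal sketch of emitting one fine-tuning record; `json.dumps` escapes embedded newlines, which is what keeps each record on a single JSONL line:

```python
import json

def to_jsonl_line(topic: str, structured_content: str) -> str:
    """Serialize one fine-tuning record as a single JSONL line."""
    record = {"messages": [
        {"role": "system", "content": "You are an expert on AI and prompt engineering."},
        {"role": "user", "content": f"Explain {topic}"},
        {"role": "assistant", "content": structured_content},
    ]}
    # json.dumps never emits raw newlines, so one record == one line.
    return json.dumps(record)

line = to_jsonl_line("system prompts", "# System Prompts\n...")
print("\n" in line)  # → False
```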
## Export Process

### 1. Query Approved Content

```sql
SELECT dc.*, d.title, d.url, s.source_name, s.credibility_tier, t.topic_slug
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
JOIN sources s ON d.source_id = s.source_id
LEFT JOIN document_topics dt ON d.doc_id = dt.doc_id
LEFT JOIN topics t ON dt.topic_id = t.topic_id
WHERE dc.review_status = 'approved'
  AND (SELECT MAX(quality_score) FROM review_logs WHERE distill_id = dc.distill_id) >= ?;
```
### 2. Generate Cross-References

Find related documents by:
- Shared topics
- Overlapping key concepts
- Same source

### 3. Write Files

```bash
mkdir -p ~/reference-library/exports/{topic-slug}
```
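
The write step can be sketched as follows, using the front-matter fields from the document format above; the record keys (`source_name`, `slug`, and so on) and the slug-based filename are assumptions for illustration:

```python
import tempfile
import datetime
from pathlib import Path

def write_document(base: Path, topic_slug: str, record: dict) -> Path:
    """Write one exported document with YAML front-matter into its topic directory."""
    topic_dir = base / topic_slug
    topic_dir.mkdir(parents=True, exist_ok=True)
    front_matter = "\n".join([
        "---",
        f"source: {record['source_name']}",
        f"url: {record['url']}",
        f"credibility: {record['credibility_tier']}",
        f"quality_score: {record['quality_score']}",
        f"exported: {datetime.datetime.now(datetime.timezone.utc).isoformat()}",
        "---",
    ])
    path = topic_dir / f"{record['slug']}.md"
    path.write_text(f"{front_matter}\n\n# {record['title']}\n\n{record['structured_content']}\n")
    return path

with tempfile.TemporaryDirectory() as tmp:
    path = write_document(Path(tmp), "prompt-engineering", {
        "source_name": "Anthropic Docs", "url": "https://docs.anthropic.com",
        "credibility_tier": "tier1_official", "quality_score": 0.92,
        "slug": "system-prompts", "title": "System Prompts",
        "structured_content": "...",
    })
    print(path.name)  # → system-prompts.md
```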

### 4. Log Export Job

```sql
INSERT INTO export_jobs
  (export_name, export_type, output_format, topic_filter,
   min_quality_score, output_path, total_documents, status)
VALUES
  (?, ?, ?, ?, ?, ?, ?, 'completed');
```

## Configuration

From `~/.config/reference-curator/export_config.yaml`:

```yaml
output:
  base_path: ~/reference-library/exports/
  project_files:
    structure: nested_by_topic
    include_metadata: true
quality:
  min_score_for_export: 0.80
  auto_approve_tier1_sources: true
```

## Example Usage

```
/markdown-exporter project_files
/markdown-exporter fine_tuning --topic prompt-engineering
/markdown-exporter project_files --min-score 0.90
```
custom-skills/90-reference-curator/commands/quality-reviewer.md (new file, 122 lines)
@@ -0,0 +1,122 @@
---
description: Review distilled content quality. Multi-criteria scoring with decision routing (approve/refactor/deep_research/reject).
argument-hint: <distill-id|all-pending> [--auto-approve] [--threshold 0.85]
allowed-tools: Read, Write, Bash, Glob, Grep
---

# Quality Reviewer

Review distilled content for quality and route decisions.

## Arguments

- `<distill-id|all-pending>`: Specific distill ID or review all pending
- `--auto-approve`: Auto-approve scores above threshold
- `--threshold`: Approval threshold (default: 0.85)

## Review Criteria

### Scoring Dimensions

| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 25% | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 20% | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 20% | Clear structure, concise language, logical flow |
| **Prompt Engineering Quality** | 25% | Demonstrates techniques, shows before/after, actionable |
| **Usability** | 10% | Easy to reference, searchable keywords, appropriate length |

### Score Calculation

```python
score = (
    accuracy * 0.25 +
    completeness * 0.20 +
    clarity * 0.20 +
    prompt_eng_quality * 0.25 +
    usability * 0.10
)
```

## Decision Thresholds

| Score | Decision | Action |
|-------|----------|--------|
| ≥ 0.85 | **APPROVE** | Ready for export |
| 0.60-0.84 | **REFACTOR** | Re-distill with feedback |
| 0.40-0.59 | **DEEP_RESEARCH** | Gather more sources |
| < 0.40 | **REJECT** | Archive (low quality) |
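
The weighted score and the threshold table combine into a single routing function; a minimal sketch:

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10,
}

def overall_score(scores: dict) -> float:
    """Weighted sum over the five scoring dimensions."""
    return sum(scores[dim] * w for dim, w in WEIGHTS.items())

def route(score: float) -> str:
    """Map an overall score to a decision per the thresholds above."""
    if score >= 0.85:
        return "approve"
    if score >= 0.60:
        return "refactor"
    if score >= 0.40:
        return "deep_research"
    return "reject"

print(route(0.85), route(0.60), route(0.40), route(0.39))
# → approve refactor deep_research reject
```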
## Review Process

### 1. Load Distilled Content

```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e \
  "SELECT * FROM distilled_content WHERE distill_id = $ID"
```

### 2. Evaluate Each Criterion

Score each dimension from 0.0 to 1.0.

### 3. Generate Assessment

```json
{
  "accuracy": 0.90,
  "completeness": 0.85,
  "clarity": 0.95,
  "prompt_engineering_quality": 0.88,
  "usability": 0.82,
  "overall_score": 0.89,
  "decision": "approve",
  "feedback": "Well-structured with clear examples...",
  "refactor_instructions": null
}
```

### 4. Log Review

```sql
INSERT INTO review_logs
  (distill_id, review_round, reviewer_type, quality_score,
   assessment, decision, feedback, refactor_instructions)
VALUES
  (?, 1, 'claude_review', ?, ?, ?, ?, ?);
```

### 5. Update Status

```sql
UPDATE distilled_content
SET review_status = 'approved'
WHERE distill_id = ?;
```
## Decision Routing

**APPROVE → markdown-exporter**
Content is ready for export.

**REFACTOR → content-distiller**
Re-distill with specific feedback:
```json
{"refactor_instructions": "Add more code examples for the API authentication section"}
```

**DEEP_RESEARCH → web-crawler**
Need more sources:
```json
{"research_queries": ["Claude API authentication examples", "Anthropic SDK best practices"]}
```

**REJECT → Archive**
Mark as rejected, optionally note reason.

## Example Usage

```
/quality-reviewer 15
/quality-reviewer all-pending --auto-approve
/quality-reviewer 42 --threshold 0.80
```
@@ -0,0 +1,72 @@
---
description: Search and discover authoritative reference sources for a topic. Validates credibility, generates URL manifests for crawling.
argument-hint: <topic> [--vendor anthropic|openai|google] [--max-sources 10]
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob
---

# Reference Discovery

Search for authoritative reference sources on a given topic.

## Arguments

- `<topic>`: Required. The subject to find references for (e.g., "Claude system prompts")
- `--vendor`: Filter to specific vendor (anthropic, openai, google)
- `--max-sources`: Maximum sources to discover (default: 10)

## Workflow

### 1. Search Strategy

Use multiple search approaches:
- Official documentation sites
- Engineering blogs
- GitHub repositories
- Research papers
- Community guides
### 2. Source Validation

Evaluate each source for credibility:

| Tier | Description | Examples |
|------|-------------|----------|
| tier1_official | Vendor documentation | docs.anthropic.com |
| tier2_verified | Verified engineering blogs | anthropic.com/news |
| tier3_community | Community resources | GitHub repos, tutorials |
### 3. Output Manifest

Generate `manifest.json` in working directory:

```json
{
  "topic": "user provided topic",
  "discovered_at": "ISO timestamp",
  "sources": [
    {
      "url": "https://...",
      "title": "Page title",
      "source_type": "official_docs",
      "credibility_tier": "tier1_official",
      "vendor": "anthropic"
    }
  ]
}
```
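
A sketch of generating this manifest; the tier validation against the table above is an added check for illustration, not part of the original spec:

```python
import json
from datetime import datetime, timezone

VALID_TIERS = {"tier1_official", "tier2_verified", "tier3_community"}

def build_manifest(topic: str, sources: list) -> str:
    """Serialize a discovery manifest in the shape shown above."""
    for src in sources:
        if src.get("credibility_tier") not in VALID_TIERS:
            raise ValueError(f"unknown credibility tier: {src!r}")
    manifest = {
        "topic": topic,
        "discovered_at": datetime.now(timezone.utc).isoformat(),
        "sources": sources,
    }
    return json.dumps(manifest, indent=2)

print(build_manifest("Claude system prompts", [{
    "url": "https://docs.anthropic.com",
    "title": "Anthropic Docs",
    "source_type": "official_docs",
    "credibility_tier": "tier1_official",
    "vendor": "anthropic",
}])[:30])
```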
### 4. Store Sources

Insert discovered sources into MySQL:

```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```

Use the `sources` table schema from `~/.config/reference-curator/`.

## Example Usage

```
/reference-discovery Claude's system prompt best practices
/reference-discovery MCP server development --vendor anthropic
/reference-discovery prompt engineering --max-sources 20
```
custom-skills/90-reference-curator/commands/web-crawler.md (new file, 79 lines)
@@ -0,0 +1,79 @@
---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---

# Web Crawler Orchestrator

Crawl web content with intelligent backend selection.

## Arguments

- `<url|manifest>`: Single URL or path to manifest.json from reference-discovery
- `--crawler`: Force specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)

## Intelligent Crawler Selection

Auto-select based on site characteristics:

| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |
### Detection Flow

```
[URL] → Is SPA/React/Vue? → Firecrawl
      → >200 pages?       → Scrapy
      → Needs SEO?        → aiohttp
      → Default           → Node.js
```
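
The detection flow above can be sketched as a selection function; the boolean inputs stand in for whatever upstream checks detect SPA frameworks, page counts, and SEO needs:

```python
def select_crawler(is_spa: bool, estimated_pages: int, needs_seo_data: bool) -> str:
    """Pick a crawl backend following the detection flow above."""
    if is_spa:
        return "firecrawl"
    if estimated_pages > 200:
        return "scrapy"
    if needs_seo_data:
        return "aiohttp"
    return "nodejs"

print(select_crawler(False, 45, False))  # → nodejs
```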
## Crawler Commands

**Node.js:**
```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```

**Python aiohttp:**
```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```

**Scrapy:**
```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```

**Firecrawl MCP:**
Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`
## Output

Save crawled content to `~/reference-library/raw/YYYY/MM/`:
- One markdown file per page
- Filename: `{url_hash}.md`

Generate crawl manifest:

```json
{
  "crawl_date": "ISO timestamp",
  "crawler_used": "nodejs",
  "total_crawled": 45,
  "documents": [...]
}
```
## Rate Limiting

All crawlers respect:
- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
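
The backoff rule can be sketched as exponential backoff with full jitter; the base delay and cap here are illustrative, not values taken from the crawlers:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retrying after a 429/5xx: exponential growth with full jitter."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# After a 429 or 5xx response, sleep backoff_delay(attempt) before the next try.
for attempt in range(3):
    print(0 <= backoff_delay(attempt) <= 60.0)
```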