feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Commit 6d7a6d7a88 (parent e80056ae8a), 2026-01-29 00:20:27 +07:00
26 changed files with 4486 additions and 1 deletion


@@ -0,0 +1,92 @@
---
description: Analyze and summarize stored documents. Extracts key concepts, code snippets, and creates structured content.
argument-hint: <doc-id|all-pending> [--focus keywords] [--max-tokens 2000]
allowed-tools: Read, Write, Bash, Glob, Grep
---
# Content Distiller
Analyze, summarize, and extract key information from stored documents.
## Arguments
- `<doc-id|all-pending>`: Specific document ID or process all pending
- `--focus`: Keywords to emphasize in distillation
- `--max-tokens`: Target token count for distilled output (default: 2000)
## Distillation Process
### 1. Load Raw Content
```bash
source ~/.envrc
# Get document path
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -N -e \
"SELECT raw_content_path FROM documents WHERE doc_id = $DOC_ID"
```
### 2. Analyze Content
Extract:
- **Summary**: 2-3 sentence executive summary
- **Key Concepts**: Important terms with definitions
- **Code Snippets**: Relevant code examples
- **Structured Content**: Full distilled markdown
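The code-snippet extraction above can be bootstrapped mechanically before the model fills in descriptions. A minimal sketch (the `extract_code_snippets` helper is illustrative, not part of the skill):

```python
import re

def extract_code_snippets(markdown: str) -> list[dict]:
    """Pull fenced code blocks out of raw markdown into the output schema."""
    pattern = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)
    snippets = []
    for lang, code in pattern.findall(markdown):
        snippets.append({
            "language": lang or "text",
            "description": "",  # filled in during analysis
            "code": code.strip(),
        })
    return snippets
```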
### 3. Output Format
```json
{
"summary": "Executive summary of the document...",
"key_concepts": [
{"term": "System Prompt", "definition": "..."},
{"term": "Context Window", "definition": "..."}
],
"code_snippets": [
{"language": "python", "description": "...", "code": "..."}
],
"structured_content": "# Title\n\n## Overview\n..."
}
```
### 4. Store Distilled Content
```sql
INSERT INTO distilled_content
(doc_id, summary, key_concepts, code_snippets, structured_content,
token_count_original, token_count_distilled, distill_model, review_status)
VALUES
(?, ?, ?, ?, ?, ?, ?, 'claude-opus-4', 'pending');
```
### 5. Calculate Metrics
- `token_count_original`: Tokens in raw content
- `token_count_distilled`: Tokens in output
- `compression_ratio`: Auto-calculated (distilled/original * 100)
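A sketch of the metric calculation, using a whitespace split as a rough token estimate (the real pipeline would presumably use a proper tokenizer):

```python
def compression_metrics(original: str, distilled: str) -> dict:
    """Compute the three metrics stored with each distillation."""
    orig = len(original.split())   # rough token count
    dist = len(distilled.split())
    return {
        "token_count_original": orig,
        "token_count_distilled": dist,
        "compression_ratio": round(dist / orig * 100, 1) if orig else 0.0,
    }
```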
## Distillation Guidelines
**For Prompt Engineering Content:**
- Emphasize techniques and patterns
- Include before/after examples
- Extract actionable best practices
- Note model-specific behaviors
**For API Documentation:**
- Focus on endpoint signatures
- Include request/response examples
- Note rate limits and constraints
- Extract error handling patterns
**For Code Repositories:**
- Summarize architecture
- Extract key functions/classes
- Note dependencies
- Include usage examples
## Example Usage
```
/content-distiller 42
/content-distiller all-pending --focus "system prompts"
/content-distiller 15 --max-tokens 3000
```


@@ -0,0 +1,94 @@
---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: <action> [--doc-id N] [--source-id N]
allowed-tools: Bash, Read, Write, Glob, Grep
---
# Content Repository
Manage crawled content in the MySQL database.
## Arguments
- `<action>`: store | list | get | update | delete | stats
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID
## Database Connection
```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```
## Actions
### store
Store new documents from crawl output:
```bash
# Read crawl manifest
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json
# Insert each document into the documents table
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e \
  "INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
   VALUES (...);"
```
### list
List documents with filters:
```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```
### get
Retrieve specific document:
```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```
### stats
Show repository statistics:
```sql
SELECT
COUNT(*) as total_docs,
SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) as completed,
SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) as pending
FROM documents;
```
## Deduplication
Documents are deduplicated by URL hash:
```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```
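The same hash can be computed client-side before querying, since MySQL's `SHA2(url, 256)` is a standard SHA-256 hex digest:

```python
import hashlib

def url_hash(url: str) -> str:
    """Hex digest matching MySQL's SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

This lets a crawler skip URLs already present without round-tripping the full URL comparison to the database.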
## Version Tracking
When content changes:
```sql
-- Create new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;
-- Mark old as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```
## Schema Reference
Key tables:
- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions
Views:
- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export


@@ -0,0 +1,138 @@
---
description: Export approved content to markdown files or JSONL for fine-tuning. Generates structured output with cross-references.
argument-hint: <format> [--topic slug] [--min-score 0.80]
allowed-tools: Read, Write, Bash, Glob, Grep
---
# Markdown Exporter
Export approved content to structured formats.
## Arguments
- `<format>`: project_files | fine_tuning | knowledge_base
- `--topic`: Filter by topic slug (e.g., "prompt-engineering")
- `--min-score`: Minimum quality score (default: 0.80)
## Export Formats
### project_files (Claude Projects)
Output structure:
```
~/reference-library/exports/
├── INDEX.md # Master index
└── {topic-slug}/
├── _index.md # Topic overview
├── {document-1}.md
└── {document-2}.md
```
**INDEX.md format:**
```markdown
# Reference Library Index
Generated: {timestamp}
Total Documents: {count}
## Topics
### [Prompt Engineering](./prompt-engineering/)
{count} documents | Last updated: {date}
### [Claude Models](./claude-models/)
{count} documents | Last updated: {date}
```
**Document format:**
```markdown
---
source: {source_name}
url: {original_url}
credibility: {tier}
quality_score: {score}
exported: {timestamp}
---
# {title}
{structured_content}
## Related Documents
- [Related Doc 1](./related-1.md)
- [Related Doc 2](./related-2.md)
```
### fine_tuning (JSONL)
Output: `~/reference-library/exports/fine_tuning_{timestamp}.jsonl`
```json
{"messages": [
{"role": "system", "content": "You are an expert on AI and prompt engineering."},
{"role": "user", "content": "Explain {topic}"},
{"role": "assistant", "content": "{structured_content}"}
]}
```
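One JSONL line per approved document can be assembled from the template above. A minimal sketch (the `to_jsonl_line` helper is illustrative; the system prompt is the one shown in the template):

```python
import json

def to_jsonl_line(topic: str, structured_content: str) -> str:
    """Build one fine-tuning record in the format above."""
    record = {"messages": [
        {"role": "system", "content": "You are an expert on AI and prompt engineering."},
        {"role": "user", "content": f"Explain {topic}"},
        {"role": "assistant", "content": structured_content},
    ]}
    return json.dumps(record)
```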
### knowledge_base (Flat)
Single consolidated file with table of contents.
## Export Process
### 1. Query Approved Content
```sql
SELECT dc.*, d.title, d.url, s.source_name, s.credibility_tier, t.topic_slug
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
JOIN sources s ON d.source_id = s.source_id
LEFT JOIN document_topics dt ON d.doc_id = dt.doc_id
LEFT JOIN topics t ON dt.topic_id = t.topic_id
WHERE dc.review_status = 'approved'
AND (SELECT MAX(quality_score) FROM review_logs WHERE distill_id = dc.distill_id) >= ?;
```
### 2. Generate Cross-References
Find related documents by:
- Shared topics
- Overlapping key concepts
- Same source
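A sketch of the relatedness ranking, scoring candidates by shared topics and overlapping key-concept terms (field names here are assumptions, not the actual schema):

```python
def related_docs(doc: dict, candidates: list[dict], min_shared: int = 1) -> list[dict]:
    """Rank candidate documents by topic and key-concept overlap."""
    def overlap(a, b):
        return (len(set(a["topics"]) & set(b["topics"]))
                + len(set(a["concepts"]) & set(b["concepts"])))
    scored = [(overlap(doc, c), c) for c in candidates if c["doc_id"] != doc["doc_id"]]
    # Highest overlap first; drop anything below the threshold
    return [c for s, c in sorted(scored, key=lambda x: -x[0]) if s >= min_shared]
```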
### 3. Write Files
```bash
mkdir -p ~/reference-library/exports/{topic-slug}
```
### 4. Log Export Job
```sql
INSERT INTO export_jobs
(export_name, export_type, output_format, topic_filter,
min_quality_score, output_path, total_documents, status)
VALUES
(?, ?, ?, ?, ?, ?, ?, 'completed');
```
## Configuration
From `~/.config/reference-curator/export_config.yaml`:
```yaml
output:
base_path: ~/reference-library/exports/
project_files:
structure: nested_by_topic
include_metadata: true
quality:
min_score_for_export: 0.80
auto_approve_tier1_sources: true
```
## Example Usage
```
/markdown-exporter project_files
/markdown-exporter fine_tuning --topic prompt-engineering
/markdown-exporter project_files --min-score 0.90
```


@@ -0,0 +1,122 @@
---
description: Review distilled content quality. Multi-criteria scoring with decision routing (approve/refactor/deep_research/reject).
argument-hint: <distill-id|all-pending> [--auto-approve] [--threshold 0.85]
allowed-tools: Read, Write, Bash, Glob, Grep
---
# Quality Reviewer
Review distilled content for quality and route decisions.
## Arguments
- `<distill-id|all-pending>`: Specific distill ID or review all pending
- `--auto-approve`: Auto-approve scores above threshold
- `--threshold`: Approval threshold (default: 0.85)
## Review Criteria
### Scoring Dimensions
| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 25% | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 20% | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 20% | Clear structure, concise language, logical flow |
| **Prompt Engineering Quality** | 25% | Demonstrates techniques, shows before/after, actionable |
| **Usability** | 10% | Easy to reference, searchable keywords, appropriate length |
### Score Calculation
```python
score = (
accuracy * 0.25 +
completeness * 0.20 +
clarity * 0.20 +
prompt_eng_quality * 0.25 +
usability * 0.10
)
```
## Decision Thresholds
| Score | Decision | Action |
|-------|----------|--------|
| ≥ 0.85 | **APPROVE** | Ready for export |
| 0.60-0.84 | **REFACTOR** | Re-distill with feedback |
| 0.40-0.59 | **DEEP_RESEARCH** | Gather more sources |
| < 0.40 | **REJECT** | Archive (low quality) |
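The threshold table maps directly to a routing function:

```python
def route(score: float) -> str:
    """Map an overall quality score to a review decision per the table above."""
    if score >= 0.85:
        return "approve"
    if score >= 0.60:
        return "refactor"
    if score >= 0.40:
        return "deep_research"
    return "reject"
```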
## Review Process
### 1. Load Distilled Content
```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e \
"SELECT * FROM distilled_content WHERE distill_id = $ID"
```
### 2. Evaluate Each Criterion
Score 0.0 to 1.0 for each dimension.
### 3. Generate Assessment
```json
{
"accuracy": 0.90,
"completeness": 0.85,
"clarity": 0.95,
"prompt_engineering_quality": 0.88,
"usability": 0.82,
"overall_score": 0.88,
"decision": "approve",
"feedback": "Well-structured with clear examples...",
"refactor_instructions": null
}
```
### 4. Log Review
```sql
INSERT INTO review_logs
(distill_id, review_round, reviewer_type, quality_score,
assessment, decision, feedback, refactor_instructions)
VALUES
(?, 1, 'claude_review', ?, ?, ?, ?, ?);
```
### 5. Update Status
```sql
UPDATE distilled_content
SET review_status = 'approved'
WHERE distill_id = ?;
```
## Decision Routing
**APPROVE → markdown-exporter**
Content is ready for export.
**REFACTOR → content-distiller**
Re-distill with specific feedback:
```json
{"refactor_instructions": "Add more code examples for the API authentication section"}
```
**DEEP_RESEARCH → web-crawler**
Need more sources:
```json
{"research_queries": ["Claude API authentication examples", "Anthropic SDK best practices"]}
```
**REJECT → Archive**
Mark as rejected, optionally note reason.
## Example Usage
```
/quality-reviewer 15
/quality-reviewer all-pending --auto-approve
/quality-reviewer 42 --threshold 0.80
```


@@ -0,0 +1,72 @@
---
description: Search and discover authoritative reference sources for a topic. Validates credibility, generates URL manifests for crawling.
argument-hint: <topic> [--vendor anthropic|openai|google] [--max-sources 10]
allowed-tools: WebSearch, WebFetch, Read, Write, Bash, Grep, Glob
---
# Reference Discovery
Search for authoritative reference sources on a given topic.
## Arguments
- `<topic>`: Required. The subject to find references for (e.g., "Claude system prompts")
- `--vendor`: Filter to specific vendor (anthropic, openai, google)
- `--max-sources`: Maximum sources to discover (default: 10)
## Workflow
### 1. Search Strategy
Use multiple search approaches:
- Official documentation sites
- Engineering blogs
- GitHub repositories
- Research papers
- Community guides
### 2. Source Validation
Evaluate each source for credibility:
| Tier | Description | Examples |
|------|-------------|----------|
| tier1_official | Vendor documentation | docs.anthropic.com |
| tier2_verified | Verified engineering blogs | anthropic.com/news |
| tier3_community | Community resources | GitHub repos, tutorials |
### 3. Output Manifest
Generate `manifest.json` in working directory:
```json
{
"topic": "user provided topic",
"discovered_at": "ISO timestamp",
"sources": [
{
"url": "https://...",
"title": "Page title",
"source_type": "official_docs",
"credibility_tier": "tier1_official",
"vendor": "anthropic"
}
]
}
```
### 4. Store Sources
Insert discovered sources into MySQL:
```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```
Use the `sources` table schema from `~/.config/reference-curator/`.
## Example Usage
```
/reference-discovery Claude's system prompt best practices
/reference-discovery MCP server development --vendor anthropic
/reference-discovery prompt engineering --max-sources 20
```


@@ -0,0 +1,79 @@
---
description: Crawl URLs with intelligent backend selection. Auto-selects Node.js, Python aiohttp, Scrapy, or Firecrawl based on site characteristics.
argument-hint: <url|manifest> [--crawler nodejs|aiohttp|scrapy|firecrawl] [--max-pages 50]
allowed-tools: Bash, Read, Write, WebFetch, Glob, Grep
---
# Web Crawler Orchestrator
Crawl web content with intelligent backend selection.
## Arguments
- `<url|manifest>`: Single URL or path to manifest.json from reference-discovery
- `--crawler`: Force specific crawler (nodejs, aiohttp, scrapy, firecrawl)
- `--max-pages`: Maximum pages to crawl (default: 50)
## Intelligent Crawler Selection
Auto-select based on site characteristics:
| Crawler | Best For | Auto-Selected When |
|---------|----------|-------------------|
| **Node.js** (default) | Small docs sites | ≤50 pages, static content |
| **Python aiohttp** | Technical docs | ≤200 pages, needs SEO data |
| **Scrapy** | Enterprise crawls | >200 pages, multi-domain |
| **Firecrawl MCP** | Dynamic sites | SPAs, JS-rendered content |
### Detection Flow
```
[URL] → Is SPA/React/Vue? → Firecrawl
→ >200 pages? → Scrapy
→ Needs SEO? → aiohttp
→ Default → Node.js
```
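The detection flow reduces to a small selection function; the input signals (SPA detection, page-count estimate, SEO requirement) are assumed to be computed upstream:

```python
def select_crawler(is_spa: bool, est_pages: int, needs_seo: bool) -> str:
    """Mirror the detection flow above: checks run in priority order."""
    if is_spa:
        return "firecrawl"   # JS-rendered content needs a headless backend
    if est_pages > 200:
        return "scrapy"      # enterprise-scale crawls
    if needs_seo:
        return "aiohttp"     # technical docs with SEO data
    return "nodejs"          # default for small static sites
```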
## Crawler Commands
**Node.js:**
```bash
cd ~/Project/our-seo-agent/util/js-crawler
node src/crawler.js <URL> --max-pages 50
```
**Python aiohttp:**
```bash
cd ~/Project/our-seo-agent
python -m seo_agent.crawler --url <URL> --max-pages 100
```
**Scrapy:**
```bash
cd ~/Project/our-seo-agent
scrapy crawl seo_spider -a start_url=<URL> -a max_pages=500
```
**Firecrawl MCP:**
Use MCP tools: `firecrawl_scrape`, `firecrawl_crawl`, `firecrawl_map`
## Output
Save crawled content to `~/reference-library/raw/YYYY/MM/`:
- One markdown file per page
- Filename: `{url_hash}.md`
Generate crawl manifest:
```json
{
"crawl_date": "ISO timestamp",
"crawler_used": "nodejs",
"total_crawled": 45,
"documents": [...]
}
```
## Rate Limiting
All crawlers respect:
- 20 requests/minute
- 3 concurrent requests
- Exponential backoff on 429/5xx
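One common scheme for the backoff policy above is capped exponential delay with jitter; the base and cap values here are assumptions, not mandated by the skill:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before retrying after a 429/5xx response.

    Doubles per attempt, capped, with multiplicative jitter to avoid
    synchronized retries across the 3 concurrent requests.
    """
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
```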