feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,97 @@

# Content Repository

MySQL storage management for the reference library. Handles document storage, version control, deduplication, and retrieval.

## Trigger Keywords

"store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB"

## Prerequisites

- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized

## Database Setup

```bash
# Initialize database
mysql -u root -p < references/schema.sql

# Verify tables
mysql -u root -p reference_library -e "SHOW TABLES;"
```

## Core Scripts

### Store Document

```bash
python scripts/store_document.py \
  --source-id 1 \
  --title "Prompt Engineering Guide" \
  --url "https://docs.anthropic.com/..." \
  --doc-type webpage \
  --raw-path ~/reference-library/raw/2025/01/abc123.md
```

### Check Duplicate

```bash
python scripts/check_duplicate.py --url "https://docs.anthropic.com/..."
```

### Query by Topic

```bash
python scripts/query_topic.py --topic-slug prompt-engineering --min-quality 0.80
```

## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized sources | source_type, credibility_tier, vendor |
| `documents` | Document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many links | relevance_score |
| `export_jobs` | Export tracking | export_type, status |

## Status Values

**crawl_status:** `pending` → `completed` | `failed` | `stale`

**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`

## Common Queries

### Find Stale Documents

```bash
python scripts/find_stale.py --output stale_docs.json
```

### Get Pending Reviews

```bash
python scripts/pending_reviews.py --output pending.json
```

### Export-Ready Content

```bash
python scripts/export_ready.py --min-score 0.85 --output ready.json
```

## Scripts

- `scripts/store_document.py` - Store new document
- `scripts/check_duplicate.py` - URL deduplication
- `scripts/query_topic.py` - Query by topic
- `scripts/find_stale.py` - Find stale documents
- `scripts/pending_reviews.py` - Get pending reviews
- `scripts/export_ready.py` - Export-ready content
- `scripts/db_utils.py` - Database connection utilities

## Integration

| From | Action | To |
|------|--------|-----|
| crawler-orchestrator | Store crawled content | content-repository |
| content-repository | Query pending docs | content-distiller |
| quality-reviewer | Update review_status | content-repository |
| content-repository | Query approved content | markdown-exporter |
@@ -0,0 +1,162 @@

---
name: content-repository
description: MySQL storage management for reference library. Use when storing crawled content, managing document versions, deduplicating URLs, querying stored references, or tracking document metadata. Triggers on keywords like "store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB".
---

# Content Repository

Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.

## Prerequisites

- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized with schema

## Quick Reference

### Connection Setup

```python
import os
from pathlib import Path

import yaml

def get_db_config():
    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Resolve environment variables
    mysql = config['mysql']
    return {
        'host': mysql['host'],
        'port': mysql['port'],
        'database': mysql['database'],
        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
        'charset': mysql['charset']
    }
```

### Core Operations

**Store New Document:**

```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    # Relies on the UNIQUE key on url_hash (derived from url), so re-crawling
    # the same URL bumps version instead of inserting a new row.
    sql = """
        INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
        VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
        ON DUPLICATE KEY UPDATE
            version = version + 1,
            previous_version_id = doc_id,
            crawl_date = NOW(),
            raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid
```

**Check Duplicate:**

```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```
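
Duplicate checks key on `url_hash`. If you need the same value client-side (for example, to batch-check URLs without a round trip per URL), `SHA2(url, 256)` can be mirrored with the standard library; a minimal sketch, assuming `url_hash` stores the lowercase hex digest of the UTF-8 encoded URL:

```python
import hashlib

def url_hash(url: str) -> str:
    # Mirrors MySQL's SHA2(url, 256): lowercase hex digest of the UTF-8 bytes
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```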

**Get Document by Topic:**

```python
def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
    sql = """
        SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
        FROM documents d
        JOIN document_topics dt ON d.doc_id = dt.doc_id
        JOIN topics t ON dt.topic_id = t.topic_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE t.topic_slug = %s
          AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
          AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
        ORDER BY dt.relevance_score DESC
    """
    cursor.execute(sql, (topic_slug, min_quality))
    return cursor.fetchall()
```

## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision, refactor_instructions |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many linking | relevance_score |
| `export_jobs` | Export tracking | export_type, output_format, status |

## Status Values

**crawl_status:** `pending` → `completed` | `failed` | `stale`

**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`

**decision (review):** `approve` | `refactor` | `deep_research` | `reject`
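
The `review_status` flow above can be enforced in application code before issuing an UPDATE. A minimal sketch; the transition map is read off the diagram as written, treating the three end states as terminal since re-entry after a refactor isn't specified here:

```python
# review_status transitions from the Status Values section
REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
    "approved": set(),
    "needs_refactor": set(),
    "rejected": set(),
}

def can_transition(current: str, new: str) -> bool:
    # Unknown states are rejected along with disallowed moves
    return new in REVIEW_TRANSITIONS.get(current, set())
```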

## Common Queries

### Find Stale Documents (needs re-crawl)

```sql
SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL
        CASE cs.frequency
            WHEN 'daily' THEN 1
            WHEN 'weekly' THEN 7
            WHEN 'biweekly' THEN 14
            WHEN 'monthly' THEN 30
        END DAY)
  AND cs.is_enabled = TRUE;
```
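
The same staleness rule is easy to apply in application code when deciding whether to re-queue a single document; a minimal sketch mirroring the `CASE` mapping above:

```python
from datetime import datetime, timedelta

# Interval mapping from the crawl_schedule CASE expression
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "biweekly": 14, "monthly": 30}

def is_stale(crawl_date: datetime, frequency: str, now: datetime) -> bool:
    # Stale when the last crawl is older than the schedule interval
    return crawl_date < now - timedelta(days=FREQUENCY_DAYS[frequency])
```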

### Get Pending Reviews

```sql
SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;
```

### Export-Ready Content

```sql
SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
  AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;
```

## Workflow Integration

1. **From crawler-orchestrator:** Receive URL + raw content path → `store_document()`
2. **To content-distiller:** Query pending documents → send for processing
3. **From quality-reviewer:** Update `review_status` based on decision
4. **To markdown-exporter:** Query approved content by topic

## Error Handling

- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
- **Missing source_id:** Validate against `sources` table before insert
- **Connection failure:** Implement retry with exponential backoff
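
The connection-failure bullet can be sketched as a small wrapper. A minimal example, generic rather than tied to any particular MySQL driver; substitute the driver's own transient-error class for `ConnectionError`:

```python
import random
import time

def with_retry(fn, attempts=4, base_delay=0.5):
    # Retry fn() on connection errors, doubling the delay each attempt
    # and adding jitter so concurrent clients don't retry in lockstep.
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```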

## Full Schema Reference

See `references/schema.sql` for complete table definitions including indexes and constraints.

## Config File Template

See `references/db_config_template.yaml` for a connection configuration template.
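
As an illustration only, a template shaped to match the keys `get_db_config()` reads; values are placeholders, and `references/db_config_template.yaml` remains the authoritative version:

```yaml
# ~/.config/reference-curator/db_config.yaml
mysql:
  host: localhost
  port: 3306
  database: reference_library
  charset: utf8mb4
  user: ""        # or export MYSQL_USER instead
  password: ""    # or export MYSQL_PASSWORD instead
```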