# Content Repository

Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.

## Prerequisites

- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized with schema

## Quick Reference

### Connection Setup

```python
import os
from pathlib import Path

import yaml

def get_db_config():
    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Environment variables take precedence over credentials in the config file
    mysql = config['mysql']
    return {
        'host': mysql['host'],
        'port': mysql['port'],
        'database': mysql['database'],
        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
        'charset': mysql['charset'],
    }
```
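
To open a connection from this config, here is a minimal sketch assuming the `mysql-connector-python` driver (this skill does not mandate a particular client library):

```python
import mysql.connector

# get_db_config() returns keyword arguments accepted by mysql.connector.connect()
conn = mysql.connector.connect(**get_db_config())
cursor = conn.cursor(dictionary=True)  # dict rows suit the queries below
```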

### Core Operations

**Store New Document:**
```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
        INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
        VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
        ON DUPLICATE KEY UPDATE
            version = version + 1,
            previous_version_id = doc_id,
            crawl_date = NOW(),
            raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid
```
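
A sketch of how a caller might wrap this in a transaction, reusing `conn` and `cursor` from the connection sketch above (the argument values are illustrative, and committing per document is an assumption, not something this skill prescribes):

```python
conn.start_transaction()
try:
    # source_id and the raw content path below are illustrative values
    doc_id = store_document(
        cursor,
        source_id=42,
        title="Example Reference Document",
        url="https://example.com/docs/reference",
        doc_type="documentation",
        raw_content_path="/var/lib/reference-curator/raw/example.html",
    )
    conn.commit()
except Exception:
    conn.rollback()
    raise
```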

**Check Duplicate:**
```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```
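
Matching on `url_hash` rather than the raw `url` keeps the lookup on a short, indexable value; this assumes inserts populate `url_hash` with the same `SHA2(url, 256)` digest (for example via a generated column or insert trigger).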

**Get Documents by Topic:**
```python
def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
    sql = """
        SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
        FROM documents d
        JOIN document_topics dt ON d.doc_id = dt.doc_id
        JOIN topics t ON dt.topic_id = t.topic_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE t.topic_slug = %s
          AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
          AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
        ORDER BY dt.relevance_score DESC
    """
    cursor.execute(sql, (topic_slug, min_quality))
    return cursor.fetchall()
```
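
A quick usage sketch (the topic slug is illustrative; rows come back as dicts when using the dictionary cursor from Connection Setup):

```python
for doc in get_docs_by_topic(cursor, "mysql-replication", min_quality=0.85):
    print(doc['title'], doc['url'])
```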

## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision, refactor_instructions |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many linking | relevance_score |
| `export_jobs` | Export tracking | export_type, output_format, status |

## Status Values

**crawl_status:** `pending` → `completed` | `failed` | `stale`

**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`

**decision (review):** `approve` | `refactor` | `deep_research` | `reject`

## Common Queries

### Find Stale Documents (needs re-crawl)
```sql
SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL
        CASE cs.frequency
            WHEN 'daily' THEN 1
            WHEN 'weekly' THEN 7
            WHEN 'biweekly' THEN 14
            WHEN 'monthly' THEN 30
        END DAY)
  AND cs.is_enabled = TRUE;
```

### Get Pending Reviews
```sql
SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;
```

### Export-Ready Content
```sql
SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
  AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;
```

## Workflow Integration

1. **From crawler-orchestrator:** Receive URL + raw content path → `store_document()`
2. **To content-distiller:** Query pending documents → send for processing
3. **From quality-reviewer:** Update `review_status` based on the decision (see the sketch after this list)
4. **To markdown-exporter:** Query approved content by topic

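A sketch of step 3, assuming the decision-to-status mapping implied by the Status Values section (`apply_review_decision` is a hypothetical helper; the skill does not spell out the exact UPDATE):

```python
# Assumed mapping from review decision to resulting review_status;
# 'deep_research' returning the item to 'pending' is an assumption.
DECISION_TO_STATUS = {
    'approve': 'approved',
    'refactor': 'needs_refactor',
    'reject': 'rejected',
    'deep_research': 'pending',
}

def apply_review_decision(cursor, distill_id, decision):
    cursor.execute(
        "UPDATE distilled_content SET review_status = %s WHERE distill_id = %s",
        (DECISION_TO_STATUS[decision], distill_id),
    )
```
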
## Error Handling

- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
- **Missing source_id:** Validate against the `sources` table before insert
- **Connection failure:** Retry with exponential backoff (see the sketch below)
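
One possible shape for that retry, as a minimal sketch (the attempt count and delays are illustrative, and `mysql-connector-python` is assumed as in Connection Setup):

```python
import time

import mysql.connector

def connect_with_retry(max_attempts=5, base_delay=1.0):
    # Delay doubles on each failed attempt: 1s, 2s, 4s, 8s, ...
    for attempt in range(max_attempts):
        try:
            return mysql.connector.connect(**get_db_config())
        except mysql.connector.Error:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```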

## Full Schema Reference

See `references/schema.sql` for complete table definitions including indexes and constraints.

## Config File Template

See `references/db_config_template.yaml` for the connection configuration template.
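
As a minimal sketch of the shape `get_db_config()` expects (derived from the keys that function reads; the template file remains the authoritative version):

```yaml
mysql:
  host: localhost
  port: 3306
  database: reference_library
  charset: utf8mb4
  # user/password may be left empty here and supplied via the
  # MYSQL_USER / MYSQL_PASSWORD environment variables instead
  user: ""
  password: ""
```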