our-claude-skills/custom-skills/90-reference-curator/03-content-repository/desktop/SKILL.md

---
name: content-repository
description: MySQL storage management for reference library. Use when storing crawled content, managing document versions, deduplicating URLs, querying stored references, or tracking document metadata. Triggers on keywords like "store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB".
---

# Content Repository

Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.

## Prerequisites

- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized with schema

## Quick Reference

### Connection Setup

```python
import yaml
import os
from pathlib import Path

def get_db_config():
    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Credentials set in the environment override the values in the config file
    mysql = config['mysql']
    return {
        'host': mysql['host'],
        'port': mysql['port'],
        'database': mysql['database'],
        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
        # Fall back to the charset named in Prerequisites if the config omits it
        'charset': mysql.get('charset', 'utf8mb4')
    }
```

### Core Operations

**Store New Document:**
```python
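def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    """Insert a document, or bump its version if the URL is already stored."""
    # On the duplicate path MySQL does not report the existing row's id by
    # default. Assigning doc_id = LAST_INSERT_ID(doc_id) makes cursor.lastrowid
    # return it (assumes doc_id is the AUTO_INCREMENT primary key).
    sql = """
    INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
    VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
    ON DUPLICATE KEY UPDATE
        doc_id = LAST_INSERT_ID(doc_id),
        version = version + 1,
        previous_version_id = doc_id,
        crawl_date = NOW(),
        raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid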
```

**Check Duplicate:**
```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```
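
When the hash is needed on the client side (for example, to pre-check a batch of URLs before opening a connection), the same value can be computed in Python. This is a minimal sketch, assuming `url_hash` stores the lowercase hex digest that `SHA2(url, 256)` returns and that the connection charset is utf8mb4, so MySQL hashes the UTF-8 bytes of the URL. `url_hash_hex` is a hypothetical helper name, not part of the skill.

```python
import hashlib


def url_hash_hex(url: str) -> str:
    """Return the SHA-256 hex digest of the URL's UTF-8 bytes (64 lowercase hex chars)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

The hash is computed over the exact string, so `https://example.com` and `https://example.com/` count as different documents unless URLs are normalized before storage.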

**Get Document by Topic:**
```python
def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
    sql = """
    SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
    FROM documents d
    JOIN document_topics dt ON d.doc_id = dt.doc_id
    JOIN topics t ON dt.topic_id = t.topic_id
    LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
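    WHERE t.topic_slug = %s
    AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
    AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
    ORDER BY dt.relevance_score DESC
    """
    # min_quality filters distilled rows; documents not yet distilled are kept
    cursor.execute(sql, (topic_slug, min_quality))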
    return cursor.fetchall()
```

## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision, refactor_instructions |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many linking | relevance_score |
| `export_jobs` | Export tracking | export_type, output_format, status |

## Status Values

**crawl_status:** `pending` → `completed` | `failed` | `stale`

**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`

**decision (review):** `approve` | `refactor` | `deep_research` | `reject`
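
The status flows above can be checked in code before an `UPDATE` is issued. The sketch below reads the arrows literally and assumes the right-hand states are terminal (for example, it does not model a `needs_refactor` item returning to `pending`); the names `CRAWL_TRANSITIONS`, `REVIEW_TRANSITIONS`, and `can_transition` are hypothetical.

```python
# Allowed transitions, transcribed from the status lists above.
CRAWL_TRANSITIONS = {
    "pending": {"completed", "failed", "stale"},
}

REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
}


def can_transition(transitions, current, new):
    """Return True if moving from `current` to `new` is listed as allowed."""
    return new in transitions.get(current, set())
```

For example, `can_transition(REVIEW_TRANSITIONS, "pending", "approved")` is `False` because a review must pass through `in_review` first.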

## Common Queries

### Find Stale Documents (need re-crawl)
```sql
SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL
    CASE cs.frequency
        WHEN 'daily' THEN 1
        WHEN 'weekly' THEN 7
        WHEN 'biweekly' THEN 14
        WHEN 'monthly' THEN 30
    END DAY)
AND cs.is_enabled = TRUE;
```
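
The same staleness rule can be applied in Python to rows that are already loaded, for instance in a unit test. This is a sketch that mirrors the `CASE` expression above; `FREQUENCY_DAYS` and `is_stale` are hypothetical names, and the `now` parameter exists only to make the check deterministic.

```python
from datetime import datetime, timedelta

# Same frequency-to-days mapping as the CASE expression in the query.
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "biweekly": 14, "monthly": 30}


def is_stale(crawl_date, frequency, now=None):
    """Return True if crawl_date is strictly older than the frequency window."""
    days = FREQUENCY_DAYS.get(frequency)
    if days is None:
        # The SQL CASE yields NULL for an unknown frequency, so the query
        # returns no row; treat it as not stale here as well.
        return False
    if now is None:
        now = datetime.now()
    return crawl_date < now - timedelta(days=days)
```

As in the SQL, the comparison is strict: a document crawled exactly seven days ago on a weekly schedule is not yet stale.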

### Get Pending Reviews
```sql
SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;
```

### Export-Ready Content
```sql
SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;
```

## Workflow Integration

1. **From crawler-orchestrator:** Receive URL + raw content path → `store_document()`
2. **To content-distiller:** Query pending documents → send for processing
3. **From quality-reviewer:** Update `review_status` based on decision
4. **To markdown-exporter:** Query approved content by topic
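
Step 3 can be sketched as a small mapping from review decision to `review_status`. The document does not state which status each decision produces, so the mapping below is an assumption based on the matching names. `deep_research` has no evident counterpart among the status values, so the sketch leaves the status unchanged for it rather than guessing. `DECISION_TO_STATUS`, `status_for_decision`, and `apply_decision` are hypothetical names.

```python
# Assumed mapping; the source only says the status is updated "based on decision".
DECISION_TO_STATUS = {
    "approve": "approved",
    "refactor": "needs_refactor",
    "reject": "rejected",
}


def status_for_decision(decision):
    """Return the review_status for a decision, or None if none is defined."""
    if decision == "deep_research":
        # No corresponding review_status is documented; leave it unchanged.
        return None
    try:
        return DECISION_TO_STATUS[decision]
    except KeyError:
        raise ValueError(f"unknown review decision: {decision!r}") from None


def apply_decision(cursor, distill_id, decision):
    """Update distilled_content.review_status; return True if a row update was issued."""
    new_status = status_for_decision(decision)
    if new_status is None:
        return False
    cursor.execute(
        "UPDATE distilled_content SET review_status = %s WHERE distill_id = %s",
        (new_status, distill_id),
    )
    return True
```

The caller is still responsible for committing the transaction on the connection that owns the cursor.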

## Error Handling

- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
- **Missing source_id:** Validate against `sources` table before insert
- **Connection failure:** Implement retry with exponential backoff
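
The retry guidance for connection failures could look like the sketch below. It is driver-agnostic: `connect` is any zero-argument callable that returns a connection, and `retry_on` should be narrowed to the connection-error class of whichever MySQL driver is in use. The function name, parameters, and default values are all assumptions chosen for illustration.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # 0.5s, 1s, 2s, ... capped at max_delay
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)
```

A call might pass something like `lambda: driver.connect(**get_db_config())`, where `driver` stands for the project's MySQL client library. The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.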

## Full Schema Reference

See `references/schema.sql` for complete table definitions including indexes and constraints.

## Config File Template

See `references/db_config_template.yaml` for connection configuration template.