---
name: content-repository
description: |
  MySQL storage manager for the reference library, with versioning and deduplication.
  Triggers: store content, manage repository, document database, content storage.
---

# Content Repository

Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.

## Prerequisites

- MySQL 8.0+ with the utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized with the schema

## Quick Reference

### Connection Setup

```python
import os
from pathlib import Path

import yaml


def get_db_config():
    """Load MySQL connection settings from the curator config file."""
    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Credentials prefer environment variables over values in the config file.
    mysql = config['mysql']
    return {
        'host': mysql['host'],
        'port': mysql['port'],
        'database': mysql['database'],
        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
        'charset': mysql['charset'],
    }
```

### Core Operations

**Store New Document:**

```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    """Insert a document, or bump its version if the URL already exists."""
    # doc_id = LAST_INSERT_ID(doc_id) makes cursor.lastrowid return the
    # existing row's id on the duplicate-key path; without it, lastrowid is
    # only meaningful for fresh inserts.
    sql = """
        INSERT INTO documents
            (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
        VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
        ON DUPLICATE KEY UPDATE
            doc_id = LAST_INSERT_ID(doc_id),
            version = version + 1,
            previous_version_id = doc_id,
            crawl_date = NOW(),
            raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid
```

**Check Duplicate:**

```python
def is_duplicate(cursor, url):
    """Return True if a document with this URL hash already exists."""
    cursor.execute(
        "SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,)
    )
    return cursor.fetchone() is not None
```

**Get Documents by Topic:**

```python
def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
    """Fetch approved (or not-yet-distilled) documents for a topic, best matches first."""
    sql = """
        SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
        FROM
            documents d
        JOIN document_topics dt ON d.doc_id = dt.doc_id
        JOIN topics t ON dt.topic_id = t.topic_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE t.topic_slug = %s
          AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
          AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
        ORDER BY dt.relevance_score DESC
    """
    cursor.execute(sql, (topic_slug, min_quality))
    return cursor.fetchall()
```

## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision, refactor_instructions |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many linking | relevance_score |
| `export_jobs` | Export tracking | export_type, output_format, status |

## Status Values

**crawl_status:** `pending` → `completed` | `failed` | `stale`

**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`

**decision (review):** `approve` | `refactor` | `deep_research` | `reject`

## Common Queries

### Find Stale Documents (need re-crawl)

```sql
SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL
        CASE cs.frequency
            WHEN 'daily' THEN 1
            WHEN 'weekly' THEN 7
            WHEN 'biweekly' THEN 14
            WHEN 'monthly' THEN 30
        END DAY)
  AND cs.is_enabled = TRUE;
```

### Get Pending Reviews

```sql
SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;
```

### Export-Ready Content

```sql
SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id
    = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
  AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;
```

## Workflow Integration

1. **From crawler-orchestrator:** Receive URL + raw content path → `store_document()`
2. **To content-distiller:** Query pending documents → send for processing
3. **From quality-reviewer:** Update `review_status` based on the review decision
4. **To markdown-exporter:** Query approved content by topic

## Error Handling

- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
- **Missing source_id:** Validate against the `sources` table before insert
- **Connection failure:** Retry with exponential backoff

## Full Schema Reference

See `references/schema.sql` for complete table definitions, including indexes and constraints.

## Config File Template

See `references/db_config_template.yaml` for a connection configuration template.
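Workflow step 3 (updating `review_status` from a quality-reviewer decision) can be sketched by pairing the Status Values lists above. This is a minimal sketch: the helper name `apply_review_decision` is illustrative, and routing `deep_research` back to `in_review` is an assumption this document does not specify.

```python
# Mapping derived from the Status Values section; the deep_research -> in_review
# routing is an assumption, not something this document defines.
DECISION_TO_STATUS = {
    'approve': 'approved',
    'refactor': 'needs_refactor',
    'reject': 'rejected',
    'deep_research': 'in_review',
}


def apply_review_decision(cursor, distill_id, decision):
    """Set distilled_content.review_status to match a reviewer's decision."""
    cursor.execute(
        "UPDATE distilled_content SET review_status = %s WHERE distill_id = %s",
        (DECISION_TO_STATUS[decision], distill_id),
    )
```

Keeping the mapping in one dict means an unknown decision raises `KeyError` instead of silently writing a bad status.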
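The connection-failure rule in Error Handling (retry with exponential backoff) can be sketched as a generic helper. The helper takes any zero-argument callable so it stays driver-agnostic; wrapping it around `mysql.connector.connect(**get_db_config())` is one possible use, and that driver choice is an assumption, not something this document mandates.

```python
import random
import time


def with_retries(connect, attempts=5, base_delay=0.5, retryable=(OSError,)):
    """Call connect() with exponential backoff, re-raising after the last attempt.

    One possible use (driver choice is an assumption):
        conn = with_retries(lambda: mysql.connector.connect(**get_db_config()))
    """
    for attempt in range(attempts):
        try:
            return connect()
        except retryable:
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ... plus a little jitter
            # so concurrent workers do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Tune `retryable` to the driver's transient-error exception types rather than catching everything; a bad password should fail immediately, not after five attempts.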