---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: [--doc-id N] [--source-id N]
allowed-tools: Bash, Read, Write, Glob, Grep
---

# Content Repository

Manage crawled content in MySQL database.

## Arguments

- ``: store | list | get | update | delete | stats
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID

## Database Connection

```bash
source ~/.envrc
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```

## Actions

### store

Store new documents from crawl output:

```bash
# Read crawl manifest
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json

# Insert into documents table
INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
VALUES (...);
```

### list

List documents with filters:

```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```

### get

Retrieve specific document:

```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```

### stats

Show repository statistics:

```sql
SELECT
    COUNT(*) as total_docs,
    SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) as completed,
    SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) as pending
FROM documents;
```

## Deduplication

Documents are deduplicated by URL hash:

```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```

## Version Tracking

When content changes:

```sql
-- Create new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;

-- Mark old as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```

## Schema Reference

Key tables:

- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions

Views:

- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export
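
The URL-hash deduplication described under Deduplication can also be checked on the client before any `INSERT` is issued. The following is a minimal sketch, not the repository's actual code: the function names `url_hash` and `split_new_and_duplicate` are hypothetical, and it assumes URLs are hashed as UTF-8 bytes, so that the result is the same lowercase hex string that `SHA2(url, 256)` produces for the same bytes.

```python
import hashlib


def url_hash(url: str) -> str:
    """Return the lowercase hex SHA-256 digest of the URL.

    Assumption: the URL is hashed as UTF-8 bytes, matching what
    SHA2(url, 256) computes for a UTF-8 string.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def split_new_and_duplicate(urls, known_hashes):
    """Partition candidate URLs into (new, duplicates).

    known_hashes holds url_hash values already present in the documents
    table (hypothetical input; how it is fetched is left to the caller).
    A URL repeated within the same batch counts as a duplicate after its
    first occurrence.
    """
    seen = set(known_hashes)
    new, duplicates = [], []
    for url in urls:
        digest = url_hash(url)
        if digest in seen:
            duplicates.append(url)
        else:
            seen.add(digest)
            new.append(url)
    return new, duplicates
```

Only the URLs returned in the first list would then need to go through the `store` action. The rest already exist and are candidates for the version-tracking path if their content has changed.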