| name | description |
|---|---|
| content-repository | MySQL storage management for reference library. Use when storing crawled content, managing document versions, deduplicating URLs, querying stored references, or tracking document metadata. Triggers on keywords like "store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB". |
# Content Repository

Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.
## Prerequisites

- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized with schema
## Quick Reference

### Connection Setup

```python
import os
from pathlib import Path

import yaml

def get_db_config():
    """Load MySQL connection settings; credentials prefer environment variables."""
    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
    with open(config_path) as f:
        config = yaml.safe_load(f)
    mysql = config['mysql']
    return {
        'host': mysql['host'],
        'port': mysql['port'],
        'database': mysql['database'],
        # Environment variables take precedence over values in the YAML file
        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
        'charset': mysql['charset'],
    }
```
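The credential lookup order above (environment variable first, then YAML value, then empty string) can be factored into a small helper for testing in isolation; `resolve_credential` is a hypothetical name, not a function this skill defines elsewhere:

```python
import os

def resolve_credential(env_var, yaml_section, key):
    """Environment variable wins; fall back to the parsed YAML value, then ''.

    Mirrors the user/password lookup order used by get_db_config.
    """
    return os.environ.get(env_var, yaml_section.get(key, ''))
```

Usage would look like `resolve_credential('MYSQL_PASSWORD', config['mysql'], 'password')`.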
### Core Operations

**Store New Document:**

```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
        INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
        VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
        ON DUPLICATE KEY UPDATE
            version = version + 1,
            previous_version_id = doc_id,
            crawl_date = NOW(),
            raw_content_path = VALUES(raw_content_path),
            -- ensure cursor.lastrowid returns the existing row's id on the update path
            doc_id = LAST_INSERT_ID(doc_id)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid
```
**Check Duplicate:**

```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```
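Because `url_hash` is compared with MySQL's `SHA2(url, 256)`, the same hash can be computed client-side with `hashlib`, which is useful for pre-filtering a batch of URLs in one query instead of one round trip each. This sketch assumes `url_hash` stores the lowercase hex digest (which is what `SHA2(..., 256)` produces); `filter_new_urls` is an illustrative helper, not part of this skill:

```python
import hashlib

def url_sha256(url):
    """Hex SHA-256 of a URL, matching MySQL's SHA2(url, 256)."""
    return hashlib.sha256(url.encode('utf-8')).hexdigest()

def filter_new_urls(cursor, urls):
    """Return only the URLs whose hashes are not already in documents."""
    hashes = {url_sha256(u): u for u in urls}
    placeholders = ', '.join(['%s'] * len(hashes))
    cursor.execute(
        f"SELECT url_hash FROM documents WHERE url_hash IN ({placeholders})",
        tuple(hashes),
    )
    existing = {row[0] for row in cursor.fetchall()}
    return [u for h, u in hashes.items() if h not in existing]
```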
**Get Document by Topic:**

```python
def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
    sql = """
        SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
        FROM documents d
        JOIN document_topics dt ON d.doc_id = dt.doc_id
        JOIN topics t ON dt.topic_id = t.topic_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE t.topic_slug = %s
          AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
          AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
        ORDER BY dt.relevance_score DESC
    """
    cursor.execute(sql, (topic_slug, min_quality))
    return cursor.fetchall()
```
## Table Quick Reference

| Table | Purpose | Key Fields |
|---|---|---|
| sources | Authorized content sources | source_type, credibility_tier, vendor |
| documents | Crawled document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision, refactor_instructions |
| topics | Taxonomy | topic_slug, parent_topic_id |
| document_topics | Many-to-many linking | relevance_score |
| export_jobs | Export tracking | export_type, output_format, status |
## Status Values

- `crawl_status`: pending → completed | failed | stale
- `review_status`: pending → in_review → approved | needs_refactor | rejected
- `decision` (review): approve | refactor | deep_research | reject
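The reviewer `decision` values drive `review_status` transitions. A minimal sketch of one plausible mapping follows; the `deep_research` branch re-entering the pipeline as `pending` is an assumption (this skill only defines the status vocabulary, not the routing), and `next_review_status` is a hypothetical helper name:

```python
# Hypothetical mapping from a quality-reviewer decision to the next review_status.
DECISION_TO_STATUS = {
    'approve': 'approved',
    'refactor': 'needs_refactor',
    'reject': 'rejected',
    'deep_research': 'pending',  # assumption: sent back for another research pass
}

def next_review_status(decision):
    """Translate a review decision into the review_status to store."""
    try:
        return DECISION_TO_STATUS[decision]
    except KeyError:
        raise ValueError(f"unknown review decision: {decision!r}")
```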
## Common Queries

### Find Stale Documents (needs re-crawl)

```sql
SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL
    CASE cs.frequency
        WHEN 'daily' THEN 1
        WHEN 'weekly' THEN 7
        WHEN 'biweekly' THEN 14
        WHEN 'monthly' THEN 30
    END DAY)
  AND cs.is_enabled = TRUE;
```
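The CASE expression's frequency-to-days mapping can also be evaluated application-side, e.g. when deciding whether to enqueue a re-crawl without a database round trip. A minimal sketch assuming the same four frequency values:

```python
from datetime import datetime, timedelta

# Same frequency → interval mapping as the SQL CASE expression above
FREQUENCY_DAYS = {'daily': 1, 'weekly': 7, 'biweekly': 14, 'monthly': 30}

def is_stale(crawl_date, frequency, now=None):
    """True if the document is older than its source's re-crawl interval."""
    now = now or datetime.now()
    return crawl_date < now - timedelta(days=FREQUENCY_DAYS[frequency])
```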
### Get Pending Reviews

```sql
SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;
```
### Export-Ready Content

```sql
SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
  AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;
```
## Workflow Integration

- From web-crawler-orchestrator: Receive URL + raw content path → `store_document()`
- To content-distiller: Query pending documents → send for processing
- From quality-reviewer: Update `review_status` based on decision
- To markdown-exporter: Query approved content by topic
## Error Handling

- Duplicate URL: Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
- Missing source_id: Validate against the `sources` table before insert
- Connection failure: Implement retry with exponential backoff
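The connection-failure guidance above can be implemented as a small retry wrapper. This is a sketch: `with_retry` is an illustrative name, and the exception tuple is parameterized so the example stays dependency-free (in practice you would pass `mysql.connector.Error`):

```python
import time

def with_retry(fn, attempts=4, base_delay=0.5, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff
    (base_delay, then 2x, 4x, ...). Re-raises after the final attempt.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

For example: `conn = with_retry(lambda: mysql.connector.connect(**get_db_config()), retry_on=(mysql.connector.Error,))`.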
## Full Schema Reference

See `references/schema.sql` for complete table definitions including indexes and constraints.
## Config File Template

See `references/db_config_template.yaml` for the connection configuration template.