
Content Repository

Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.

Prerequisites

  • MySQL 8.0+ with utf8mb4 charset
  • Config file at ~/.config/reference-curator/db_config.yaml
  • Database reference_library initialized with schema
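A minimal `db_config.yaml` matching the keys that the connection code below reads might look like the following (the host/port values are illustrative placeholders; see `references/db_config_template.yaml` for the authoritative template):

```yaml
mysql:
  host: localhost
  port: 3306
  database: reference_library
  user: ""        # usually supplied via the MYSQL_USER env var instead
  password: ""    # usually supplied via the MYSQL_PASSWORD env var instead
  charset: utf8mb4
```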

Quick Reference

Connection Setup

import yaml
import os
from pathlib import Path

def get_db_config():
    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
    with open(config_path) as f:
        config = yaml.safe_load(f)
    
    # Resolve environment variables
    mysql = config['mysql']
    return {
        'host': mysql['host'],
        'port': mysql['port'],
        'database': mysql['database'],
        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
        'charset': mysql['charset']
    }

Core Operations

Store New Document:

def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
    INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
    VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
    ON DUPLICATE KEY UPDATE
        version = version + 1,
        previous_version_id = doc_id,
        crawl_date = NOW(),
        raw_content_path = VALUES(raw_content_path),
        doc_id = LAST_INSERT_ID(doc_id)  -- makes lastrowid valid on the update path too
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid

Check Duplicate:

def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
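Deduplication hinges on `url_hash` matching MySQL's `SHA2(url, 256)`. Assuming `url_hash` stores the lowercase hex SHA-256 of the raw URL string, the same key can be computed client-side to avoid a database round trip:

```python
import hashlib

def url_hash(url: str) -> str:
    """Client-side equivalent of MySQL's SHA2(url, 256):
    a 64-character lowercase hex digest of the UTF-8 encoded URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```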

Get Document by Topic:

def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
    sql = """
    SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
    FROM documents d
    JOIN document_topics dt ON d.doc_id = dt.doc_id
    JOIN topics t ON dt.topic_id = t.topic_id
    LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
    WHERE t.topic_slug = %s
    AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
    AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
    ORDER BY dt.relevance_score DESC
    """
    cursor.execute(sql, (topic_slug, min_quality))
    return cursor.fetchall()

Table Quick Reference

| Table | Purpose | Key Fields |
| --- | --- | --- |
| sources | Authorized content sources | source_type, credibility_tier, vendor |
| documents | Crawled document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision, refactor_instructions |
| topics | Taxonomy | topic_slug, parent_topic_id |
| document_topics | Many-to-many linking | relevance_score |
| export_jobs | Export tracking | export_type, output_format, status |

Status Values

crawl_status: pending | completed | failed | stale

review_status: pending | in_review | approved | needs_refactor | rejected

decision (review): approve | refactor | deep_research | reject
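Keeping these allowed values as Python constants lets callers fail fast before hitting MySQL's ENUM check. A minimal sketch (the constant and function names are illustrative, not from the schema):

```python
CRAWL_STATUSES = {"pending", "completed", "failed", "stale"}
REVIEW_STATUSES = {"pending", "in_review", "approved", "needs_refactor", "rejected"}
REVIEW_DECISIONS = {"approve", "refactor", "deep_research", "reject"}

def validate_status(value, allowed):
    """Raise ValueError before the bad value ever reaches an INSERT/UPDATE."""
    if value not in allowed:
        raise ValueError(f"{value!r} not in {sorted(allowed)}")
    return value
```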

Common Queries

Find Stale Documents (needs re-crawl)

SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL 
    CASE cs.frequency 
        WHEN 'daily' THEN 1 
        WHEN 'weekly' THEN 7 
        WHEN 'biweekly' THEN 14 
        WHEN 'monthly' THEN 30 
    END DAY)
AND cs.is_enabled = TRUE;
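The CASE mapping in the query above can be mirrored client-side when deciding whether a single document is due for re-crawl. A sketch, assuming the same frequency labels as `crawl_schedule` (the helper name is hypothetical):

```python
from datetime import datetime, timedelta

# Mirrors the CASE expression in the staleness query.
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "biweekly": 14, "monthly": 30}

def is_stale(crawl_date, frequency, now=None):
    """True when the last crawl is older than the schedule's interval."""
    now = now or datetime.now()
    return crawl_date < now - timedelta(days=FREQUENCY_DAYS[frequency])
```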

Get Pending Reviews

SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;

Export-Ready Content

SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;

Workflow Integration

  1. From crawler-orchestrator: Receive URL + raw content path → store_document()
  2. To content-distiller: Query pending documents → send for processing
  3. From quality-reviewer: Update review_status based on decision
  4. To markdown-exporter: Query approved content by topic

Error Handling

  • Duplicate URL: Silent update (version increment) via ON DUPLICATE KEY UPDATE
  • Missing source_id: Validate against sources table before insert
  • Connection failure: Implement retry with exponential backoff
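The backoff bullet could be implemented along these lines. A generic sketch: the retry counts and delays are assumptions, and in practice `exceptions` would be narrowed to the driver's error class (e.g. `mysql.connector.Error`):

```python
import random
import time

def with_retry(fn, retries=4, base_delay=0.5, exceptions=(Exception,)):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```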

Full Schema Reference

See references/schema.sql for complete table definitions including indexes and constraints.

Config File Template

See references/db_config_template.yaml for connection configuration template.