Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00


Content Repository

MySQL storage management for the reference library. Handles document storage, version control, deduplication, and retrieval.

Trigger Keywords

"store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB"

Prerequisites

  • MySQL 8.0+ with utf8mb4 charset
  • Config file at ~/.config/reference-curator/db_config.yaml
  • Database reference_library initialized

Database Setup

# Initialize database
mysql -u root -p < references/schema.sql

# Verify tables
mysql -u root -p reference_library -e "SHOW TABLES;"

Core Scripts

Store Document

python scripts/store_document.py \
  --source-id 1 \
  --title "Prompt Engineering Guide" \
  --url "https://docs.anthropic.com/..." \
  --doc-type webpage \
  --raw-path ~/reference-library/raw/2025/01/abc123.md
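The flags in the command above suggest a small command-line interface. The following is a minimal sketch of how such an interface could be parsed, not the actual contents of scripts/store_document.py. The `build_parser` helper, the flag types, and the default for `--doc-type` are assumptions made for illustration.

```python
import argparse
import os
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(
        description="Store a document in the reference library"
    )
    # Assumed to be an integer key into the sources table.
    parser.add_argument("--source-id", type=int, required=True)
    parser.add_argument("--title", required=True)
    parser.add_argument("--url", required=True)
    # Only "webpage" appears in the source, so no choices list is enforced here.
    parser.add_argument("--doc-type", default="webpage")
    # Expand a leading "~" in case the shell has not already done so.
    parser.add_argument(
        "--raw-path",
        type=lambda p: Path(os.path.expanduser(p)),
        required=True,
    )
    return parser


args = build_parser().parse_args([
    "--source-id", "1",
    "--title", "Prompt Engineering Guide",
    "--url", "https://docs.anthropic.com/...",
    "--doc-type", "webpage",
    "--raw-path", "~/reference-library/raw/2025/01/abc123.md",
])
print(args.source_id, args.doc_type, args.raw_path.name)
```

argparse converts the dashed flag names to attributes with underscores, so `--source-id` is read back as `args.source_id`.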

Check Duplicate

python scripts/check_duplicate.py --url "https://docs.anthropic.com/..."
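The table reference lists a url_hash field used for deduplication. One way such a check could work is sketched below, assuming a SHA-256 digest over a lightly normalized URL. The normalization rules, the hash algorithm, and the helper names `normalize_url`, `url_hash`, and `is_duplicate` are assumptions and may differ from what scripts/check_duplicate.py does.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings hash identically."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Assumes the URL carries no case-sensitive userinfo in the netloc.
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_hash(url: str) -> str:
    """Hex SHA-256 digest of the normalized URL (64 characters)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def is_duplicate(url: str, known_hashes: set[str]) -> bool:
    """True if the URL's hash is already present in the known set."""
    return url_hash(url) in known_hashes


known = {url_hash("https://docs.anthropic.com/guide")}
print(is_duplicate("HTTPS://Docs.Anthropic.com/guide/#intro", known))
print(is_duplicate("https://docs.anthropic.com/other", known))
```

Hashing a normalized form keeps the lookup key a fixed length regardless of URL length, which suits a unique index column.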

Query by Topic

python scripts/query_topic.py --topic-slug prompt-engineering --min-quality 0.80
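A topic query with a quality floor is essentially a join across the topic, document, and review tables. The sketch below illustrates the idea with an in-memory SQLite database standing in for MySQL so it runs without a server. The simplified schema, the `doc_id` and `topic_id` column names, and the sample rows are assumptions; only the table names, topic_slug, relevance_score, quality_score, and decision come from the table reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema; the real one lives in references/schema.sql.
conn.executescript("""
CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, topic_slug TEXT UNIQUE);
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE document_topics (doc_id INTEGER, topic_id INTEGER, relevance_score REAL);
CREATE TABLE review_logs (doc_id INTEGER, quality_score REAL, decision TEXT);
INSERT INTO topics VALUES (1, 'prompt-engineering');
INSERT INTO documents VALUES (1, 'Prompt Engineering Guide'), (2, 'Draft Notes');
INSERT INTO document_topics VALUES (1, 1, 0.95), (2, 1, 0.60);
INSERT INTO review_logs VALUES (1, 0.91, 'approve'), (2, 0.55, 'refactor');
""")


def query_topic(conn, topic_slug, min_quality):
    """Return (title, quality_score) pairs for a topic at or above min_quality."""
    sql = """
        SELECT d.title, r.quality_score
        FROM documents d
        JOIN document_topics dt ON dt.doc_id = d.doc_id
        JOIN topics t ON t.topic_id = dt.topic_id
        JOIN review_logs r ON r.doc_id = d.doc_id
        WHERE t.topic_slug = ? AND r.quality_score >= ?
        ORDER BY r.quality_score DESC
    """
    # Parameters are bound rather than interpolated to avoid SQL injection.
    return conn.execute(sql, (topic_slug, min_quality)).fetchall()


rows = query_topic(conn, "prompt-engineering", 0.80)
print(rows)
```

With a MySQL driver the same query would typically use `%s` placeholders instead of `?`.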

Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| sources | Authorized sources | source_type, credibility_tier, vendor |
| documents | Document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision |
| topics | Taxonomy | topic_slug, parent_topic_id |
| document_topics | Many-to-many links | relevance_score |
| export_jobs | Export tracking | export_type, status |

Status Values

crawl_status: pending → completed | failed | stale

review_status: pending → in_review → approved | needs_refactor | rejected
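The status lines above can be read as small state machines: a crawl starts as pending and ends as completed, failed, or stale, while a review moves from pending through in_review to approved, needs_refactor, or rejected. The sketch below encodes that reading as transition tables. The `can_transition` helper is hypothetical, and the source does not say whether end states can move again (for example completed to stale), so this sketch treats them as terminal.

```python
# Allowed transitions, keyed by current status. This is one reading of the
# status lines; end states have no outgoing edges in this sketch.
CRAWL_TRANSITIONS = {
    "pending": {"completed", "failed", "stale"},
}

REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
}


def can_transition(table: dict, current: str, new: str) -> bool:
    """True if moving from `current` to `new` is allowed by `table`."""
    return new in table.get(current, set())


print(can_transition(REVIEW_TRANSITIONS, "pending", "in_review"))
print(can_transition(REVIEW_TRANSITIONS, "pending", "approved"))
```

Checking a transition before issuing an UPDATE keeps a review from skipping the in_review step.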

Common Queries

Find Stale Documents

python scripts/find_stale.py --output stale_docs.json

Get Pending Reviews

python scripts/pending_reviews.py --output pending.json

Export-Ready Content

python scripts/export_ready.py --min-score 0.85 --output ready.json
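As a rough illustration of what "export-ready" could mean, the sketch below keeps records that are approved and meet the minimum score, then serializes them to JSON. The record shape, the `doc_id` key, and the rule that only approved content qualifies are assumptions; the actual criteria in scripts/export_ready.py may differ.

```python
import json


def export_ready(records, min_score):
    """Keep approved records whose quality_score is at least min_score."""
    return [
        r for r in records
        if r["review_status"] == "approved" and r["quality_score"] >= min_score
    ]


# Hypothetical sample rows, shaped like a join of distilled_content and review_logs.
records = [
    {"doc_id": 1, "review_status": "approved", "quality_score": 0.91},
    {"doc_id": 2, "review_status": "approved", "quality_score": 0.80},
    {"doc_id": 3, "review_status": "needs_refactor", "quality_score": 0.90},
]

ready = export_ready(records, 0.85)
# The real script writes to the --output path; a string is built here instead
# so the sketch has no filesystem side effects.
payload = json.dumps(ready, indent=2)
print(payload)
```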

Scripts

  • scripts/store_document.py - Store new document
  • scripts/check_duplicate.py - URL deduplication
  • scripts/query_topic.py - Query by topic
  • scripts/find_stale.py - Find stale documents
  • scripts/pending_reviews.py - Get pending reviews
  • scripts/export_ready.py - Get export-ready content
  • scripts/db_utils.py - Database connection utilities

Integration

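| From | Action | To |
|------|--------|----|
| crawler-orchestrator | Store crawled content | content-repository |
| content-repository | Query pending docs | content-distiller |
| quality-reviewer | Update review_status | content-repository |
| content-repository | Query approved content | markdown-exporter |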