6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Content Repository
MySQL storage management for the reference library. Handles document storage, version control, deduplication, and retrieval.
Trigger Keywords
"store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB"
Prerequisites
- MySQL 8.0+ with utf8mb4 charset
- Config file at ~/.config/reference-curator/db_config.yaml
- Database reference_library initialized
Database Setup
# Initialize database
mysql -u root -p < references/schema.sql
# Verify tables
mysql -u root -p reference_library -e "SHOW TABLES;"
Core Scripts
Store Document
python scripts/store_document.py \
--source-id 1 \
--title "Prompt Engineering Guide" \
--url "https://docs.anthropic.com/..." \
--doc-type webpage \
--raw-path ~/reference-library/raw/2025/01/abc123.md
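The version tracking mentioned above can be illustrated with a small in-memory sketch. It assumes that re-storing a known URL bumps the version only when the content has changed; the `StoredDocument` class, the `store_document` function, and the content-hash comparison are hypothetical and are not taken from scripts/store_document.py, which writes to MySQL.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredDocument:
    """Hypothetical in-memory stand-in for a row in the documents table."""
    url: str
    title: str
    content_hash: str
    version: int = 1


def store_document(library: dict[str, StoredDocument], url: str,
                   title: str, content: str) -> StoredDocument:
    """Insert a new document, or bump the version if the content changed."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    existing = library.get(url)
    if existing is None:
        # First time this URL is seen: store it as version 1.
        library[url] = StoredDocument(url, title, digest)
    elif existing.content_hash != digest:
        # Same URL, different content: keep the URL and increment the version.
        library[url] = StoredDocument(url, title, digest, existing.version + 1)
    # Unchanged content falls through and returns the stored record as-is.
    return library[url]
```

Comparing a content hash rather than a crawl timestamp keeps an unchanged re-crawl from producing a new version.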
Check Duplicate
python scripts/check_duplicate.py --url "https://docs.anthropic.com/..."
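A minimal sketch of how URL-based deduplication could work, assuming url_hash is a SHA-256 digest of a lightly normalized URL. The normalization rules and the function names `normalize_url`, `url_hash`, and `is_duplicate` are illustrative assumptions, not the actual logic of scripts/check_duplicate.py.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, trim a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def url_hash(url: str) -> str:
    """Return the SHA-256 hex digest of the normalized URL (assumed scheme)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def is_duplicate(url: str, known_hashes: set[str]) -> bool:
    """Check a URL against hashes that are already stored."""
    return url_hash(url) in known_hashes
```

Hashing a normalized form means trivially different spellings of the same address, such as a trailing slash or a fragment, map to one stored record.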
Query by Topic
python scripts/query_topic.py --topic-slug prompt-engineering --min-quality 0.80
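The topic query above can be pictured as a join across the topic and review tables. The sketch below uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server. The column names doc_id and topic_id, the link from review_logs directly to documents, and the sample rows are all assumptions made for illustration; the real layout is defined in references/schema.sql.

```python
import sqlite3

# Hypothetical, simplified schema; the real one lives in references/schema.sql.
SCHEMA = """
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, topic_slug TEXT UNIQUE);
CREATE TABLE document_topics (doc_id INTEGER, topic_id INTEGER, relevance_score REAL);
CREATE TABLE review_logs (doc_id INTEGER, quality_score REAL, decision TEXT);
"""

# SQLite uses "?" placeholders; common MySQL drivers use "%s" instead.
QUERY = """
SELECT d.title, MAX(r.quality_score) AS best_score
FROM documents d
JOIN document_topics dt ON dt.doc_id = d.doc_id
JOIN topics t ON t.topic_id = dt.topic_id
JOIN review_logs r ON r.doc_id = d.doc_id
WHERE t.topic_slug = ?
GROUP BY d.doc_id, d.title
HAVING MAX(r.quality_score) >= ?
ORDER BY best_score DESC
"""


def query_topic(conn: sqlite3.Connection, topic_slug: str, min_quality: float):
    """Return (title, best_score) rows for a topic at or above min_quality."""
    return conn.execute(QUERY, (topic_slug, min_quality)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO documents VALUES (?, ?)",
                     [(1, "Prompt Engineering Guide"), (2, "Draft Notes")])
    conn.execute("INSERT INTO topics VALUES (1, 'prompt-engineering')")
    conn.executemany("INSERT INTO document_topics VALUES (?, ?, ?)",
                     [(1, 1, 0.9), (2, 1, 0.7)])
    conn.executemany("INSERT INTO review_logs VALUES (?, ?, ?)",
                     [(1, 0.92, "approve"), (2, 0.55, "refactor")])
    return conn
```

Calling `query_topic(demo_connection(), "prompt-engineering", 0.80)` returns only the document whose best review score meets the threshold.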
Table Quick Reference
| Table | Purpose | Key Fields |
|---|---|---|
| sources | Authorized sources | source_type, credibility_tier, vendor |
| documents | Document metadata | url_hash (dedup), version, crawl_status |
| distilled_content | Processed summaries | review_status, compression_ratio |
| review_logs | QA decisions | quality_score, decision |
| topics | Taxonomy | topic_slug, parent_topic_id |
| document_topics | Many-to-many links | relevance_score |
| export_jobs | Export tracking | export_type, status |
Status Values
crawl_status: pending → completed | failed | stale
review_status: pending → in_review → approved | needs_refactor | rejected
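The status flows above can be read as small state machines. The sketch below transcribes them literally into lookup tables. It assumes that only the transitions written out are allowed; the real scripts may permit others, for example completed to stale or needs_refactor back to pending. The names `CRAWL_TRANSITIONS`, `REVIEW_TRANSITIONS`, and `can_transition` are hypothetical.

```python
# Allowed transitions, transcribed literally from the status lines above.
CRAWL_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"completed", "failed", "stale"},
}
REVIEW_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
}


def can_transition(table: dict[str, set[str]], current: str, new: str) -> bool:
    """Return True if moving from current to new is listed in the table."""
    return new in table.get(current, set())
```

A check like this could run before any UPDATE of crawl_status or review_status, so that a document cannot jump from pending straight to approved.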
Common Queries
Find Stale Documents
python scripts/find_stale.py --output stale_docs.json
Get Pending Reviews
python scripts/pending_reviews.py --output pending.json
Export-Ready Content
python scripts/export_ready.py --min-score 0.85 --output ready.json
Scripts
- scripts/store_document.py - Store new document
- scripts/check_duplicate.py - URL deduplication
- scripts/query_topic.py - Query by topic
- scripts/find_stale.py - Find stale documents
- scripts/pending_reviews.py - Get pending reviews
- scripts/export_ready.py - Get export-ready content
- scripts/db_utils.py - Database connection utilities
Integration
| From | Action | To |
|---|---|---|
| web-crawler-orchestrator | Store crawled content | → |
| → | Query pending docs | content-distiller |
| quality-reviewer | Update review_status | → |
| → | Query approved content | markdown-exporter |