---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: "<action> [--doc-id N] [--source-id N]"
allowed-tools: Bash, Read, Write, Glob, Grep
---
# Content Repository

Manage crawled content in a MySQL database.
## Arguments

- `<action>`: `store` | `list` | `get` | `update` | `delete` | `stats`
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID
## Database Connection

```bash
source ~/.envrc
mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" reference_library
```
## Actions

### store

Store new documents from crawl output:

```bash
# Read the crawl manifest
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json
```

```sql
-- Insert into the documents table
INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
VALUES (...);
```
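Before issuing the `INSERT`, the manifest can be flattened into parameter tuples for a parameterized `executemany`. This is only a sketch: the manifest field names (`pages`, `title`, `raw_path`, `doc_type`, `crawl_date`) are assumptions about the crawler's output, not a documented schema.

```python
def manifest_to_rows(manifest: dict, source_id: int) -> list[tuple]:
    """Map crawl-manifest entries to (source_id, title, url, doc_type,
    raw_content_path, crawl_date, crawl_status) tuples matching the
    INSERT column order above. Field names are assumed, not documented."""
    return [
        (
            source_id,
            page.get("title", ""),          # title may be absent for raw dumps
            page["url"],
            page.get("doc_type", "html"),   # assumed default
            page["raw_path"],
            manifest.get("crawl_date"),
            "completed",
        )
        for page in manifest.get("pages", [])
    ]
```

The tuples then feed a `cursor.executemany(...)` against the parameterized `INSERT`, which avoids string-interpolating titles or URLs into SQL.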
### list

List documents with filters:

```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```
### get

Retrieve a specific document:

```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```
### stats

Show repository statistics:

```sql
SELECT
    COUNT(*) AS total_docs,
    SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) AS completed,
    SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) AS pending
FROM documents;
```
## Deduplication

Documents are deduplicated by URL hash:

```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```
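The same digest can be computed client-side to check for a duplicate before re-crawling a URL: MySQL's `SHA2(url, 256)` and Python's `hashlib.sha256` both yield the lowercase hex SHA-256 of the string. A minimal sketch:

```python
import hashlib

def url_hash(url: str) -> str:
    """Lowercase hex SHA-256 of the URL, equivalent to MySQL SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

# The digest can then drive a parameterized lookup:
#   SELECT doc_id FROM documents WHERE url_hash = ?
```

Hashing client-side keeps the dedup check a simple indexed equality lookup and avoids interpolating raw URLs into the query.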
## Version Tracking

When content changes:

```sql
-- Create a new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;

-- Mark the old version as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```
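The insert-then-supersede pattern above can be exercised end to end. The sketch below uses an in-memory SQLite database and a trimmed-down `documents` table purely for illustration; the production store is MySQL and the full schema has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        doc_id INTEGER PRIMARY KEY,
        url TEXT,
        version INTEGER DEFAULT 1,
        previous_version_id INTEGER,
        crawl_status TEXT
    )
""")

def supersede(conn: sqlite3.Connection, doc_id: int) -> int:
    """Create a new version row pointing back at doc_id, then mark
    the old row stale. Returns the new row's doc_id."""
    cur = conn.execute(
        "INSERT INTO documents (url, version, previous_version_id, crawl_status) "
        "SELECT url, version + 1, doc_id, 'completed' "
        "FROM documents WHERE doc_id = ?",
        (doc_id,),
    )
    conn.execute(
        "UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?",
        (doc_id,),
    )
    return cur.lastrowid

# Seed version 1, then supersede it with a re-crawled copy.
conn.execute(
    "INSERT INTO documents (url, crawl_status) VALUES (?, 'completed')",
    ("https://example.com/doc",),
)
new_id = supersede(conn, 1)
```

Doing the two statements in one transaction keeps the chain consistent: the new row carries `previous_version_id`, so the full history of a URL can be walked backwards.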
## Schema Reference

Key tables:

- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions

Views:

- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export