---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: "<action> [--doc-id N] [--source-id N]"
allowed-tools: Bash, Read, Write, Glob, Grep
---
# Content Repository

Manage crawled content in a MySQL database.
## Arguments

- `<action>`: `store` | `list` | `get` | `update` | `delete` | `stats`
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID
## Database Connection

```bash
source ~/.envrc
mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" reference_library
```
## Actions

### store

Store new documents from crawl output:

```bash
# Read the crawl manifest
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json
```

```sql
-- Insert into the documents table
INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
VALUES (...);
```
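Before issuing the `INSERT`, the manifest can be flattened into parameter tuples for a parameterized `executemany`. This is only a sketch: the manifest field names (`pages`, `title`, `raw_path`, `doc_type`, `crawl_date`) are assumptions about the crawler's output, not a documented schema.

```python
def manifest_to_rows(manifest: dict, source_id: int) -> list[tuple]:
    """Map crawl-manifest entries to (source_id, title, url, doc_type,
    raw_content_path, crawl_date, crawl_status) tuples matching the
    INSERT column order above. Field names are assumed, not documented."""
    return [
        (
            source_id,
            page.get("title", ""),          # title may be absent for raw dumps
            page["url"],
            page.get("doc_type", "html"),   # assumed default
            page["raw_path"],
            manifest.get("crawl_date"),
            "completed",
        )
        for page in manifest.get("pages", [])
    ]
```

The tuples then feed a `cursor.executemany(...)` against the parameterized `INSERT`, which avoids string-interpolating titles or URLs into SQL.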
### list

List documents with filters:

```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```
### get

Retrieve a specific document:

```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```
### stats

Show repository statistics:

```sql
SELECT
    COUNT(*) AS total_docs,
    SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) AS completed,
    SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) AS pending
FROM documents;
```
## Deduplication

Documents are deduplicated by URL hash:

```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```
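The same digest can be computed client-side to check for a duplicate before re-crawling a URL: MySQL's `SHA2(url, 256)` and Python's `hashlib.sha256` both yield the lowercase hex SHA-256 of the string. A minimal sketch:

```python
import hashlib

def url_hash(url: str) -> str:
    """Lowercase hex SHA-256 of the URL, equivalent to MySQL SHA2(url, 256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

# The digest can then drive a parameterized lookup:
#   SELECT doc_id FROM documents WHERE url_hash = ?
```

Hashing client-side keeps the dedup check a simple indexed equality lookup and avoids interpolating raw URLs into the query.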
## Version Tracking

When content changes:

```sql
-- Create a new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;

-- Mark the old version as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```
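The insert-then-supersede pattern above can be exercised end to end. The sketch below uses an in-memory SQLite database and a trimmed-down `documents` table purely for illustration; the production store is MySQL and the full schema has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        doc_id INTEGER PRIMARY KEY,
        url TEXT,
        version INTEGER DEFAULT 1,
        previous_version_id INTEGER,
        crawl_status TEXT
    )
""")

def supersede(conn: sqlite3.Connection, doc_id: int) -> int:
    """Create a new version row pointing back at doc_id, then mark
    the old row stale. Returns the new row's doc_id."""
    cur = conn.execute(
        "INSERT INTO documents (url, version, previous_version_id, crawl_status) "
        "SELECT url, version + 1, doc_id, 'completed' "
        "FROM documents WHERE doc_id = ?",
        (doc_id,),
    )
    conn.execute(
        "UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?",
        (doc_id,),
    )
    return cur.lastrowid

# Seed version 1, then supersede it with a re-crawled copy.
conn.execute(
    "INSERT INTO documents (url, crawl_status) VALUES (?, 'completed')",
    ("https://example.com/doc",),
)
new_id = supersede(conn, 1)
```

Doing the two statements in one transaction keeps the chain consistent: the new row carries `previous_version_id`, so the full history of a URL can be walked backwards.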
## Schema Reference

Key tables:

- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions

Views:

- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export