feat(reference-curator): Add portable skill suite for reference documentation curation

6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00
parent e80056ae8a
commit 6d7a6d7a88
26 changed files with 4486 additions and 1 deletions


---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: <action> [--doc-id N] [--source-id N]
allowed-tools: Bash, Read, Write, Glob, Grep
---
# Content Repository
Manage crawled content in the `reference_library` MySQL database.
## Arguments
- `<action>`: store | list | get | update | delete | stats
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID
## Database Connection
```bash
source ~/.reference-curator.env
mysql -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library
```
## Actions
### store
Store new documents from crawl output:
```bash
# Read crawl manifest
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json

# Insert into documents table
mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" reference_library <<'SQL'
INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
VALUES (...);
SQL
```
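The manifest-to-INSERT step can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the manifest field names (`pages`, `raw_path`, etc.) are assumptions, since the real manifest schema is not shown here.

```python
# Hypothetical crawl manifest -- field names are illustrative assumptions,
# not a documented contract of the crawler output.
SAMPLE_MANIFEST = {
    "source_id": 3,
    "pages": [
        {"title": "API Overview", "url": "https://example.com/docs/api",
         "doc_type": "reference", "raw_path": "raw/2026/01/api-overview.md"},
    ],
}

# Parameterized form of the INSERT shown above, for use with a MySQL driver.
INSERT_SQL = (
    "INSERT INTO documents "
    "(source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status) "
    "VALUES (%s, %s, %s, %s, %s, CURDATE(), 'completed')"
)

def manifest_to_rows(manifest: dict) -> list[tuple]:
    """Turn a crawl manifest into parameter tuples matching INSERT_SQL."""
    return [
        (manifest["source_id"], page["title"], page["url"],
         page["doc_type"], page["raw_path"])
        for page in manifest["pages"]
    ]

rows = manifest_to_rows(SAMPLE_MANIFEST)
# Each tuple pairs with INSERT_SQL, e.g. cursor.executemany(INSERT_SQL, rows)
```

Using parameter tuples rather than string-built SQL avoids quoting bugs when titles or URLs contain special characters.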
### list
List documents with filters:
```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```
### get
Retrieve specific document:
```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```
### stats
Show repository statistics:
```sql
SELECT
COUNT(*) as total_docs,
SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) as completed,
SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) as pending
FROM documents;
```
## Deduplication
Documents are deduplicated by URL hash:
```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```
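MySQL's `SHA2(url, 256)` is the lowercase hex SHA-256 of the string, so the same hash can be computed client-side to check for duplicates before re-crawling. A small sketch (the pre-crawl check itself is an assumption, not part of the skill as written):

```python
import hashlib

def url_hash(url: str) -> str:
    # Matches MySQL SHA2(url, 256): lowercase hex SHA-256 of the UTF-8 bytes.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

h = url_hash("https://example.com/docs/api")
# Look up duplicates with: SELECT doc_id FROM documents WHERE url_hash = %s
```

Computing the hash in the client keeps the lookup an exact index match on `url_hash` instead of calling `SHA2()` per query.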
## Version Tracking
When a crawled document's content changes, insert a new version row and mark the old one stale:
```sql
-- Create new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;
-- Mark old as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```
## Schema Reference
Key tables:
- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions
Views:
- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export