---
description: Store and manage crawled content in MySQL. Handles versioning, deduplication, and document metadata.
argument-hint: <action> [--doc-id N] [--source-id N]
allowed-tools: Bash, Read, Write, Glob, Grep
---

# Content Repository

Manage crawled content in the MySQL database.

## Arguments

- `<action>`: store | list | get | update | delete | stats
- `--doc-id`: Specific document ID
- `--source-id`: Filter by source ID

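For example, following the argument hint above:

```
/content-repository stats
/content-repository get --doc-id 42
/content-repository list --source-id 3
```
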
## Database Connection

```bash
source ~/.envrc
mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" reference_library
```

## Actions

### store

Store new documents from crawl output:

```bash
# Read the crawl manifest
cat ~/reference-library/raw/YYYY/MM/crawl_manifest.json
```

```sql
-- Insert each document into the documents table
INSERT INTO documents (source_id, title, url, doc_type, raw_content_path, crawl_date, crawl_status)
VALUES (...);
```
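
Generating those `INSERT` statements from the manifest can be scripted. A minimal sketch, assuming `jq` is available and that the manifest holds a top-level `documents` array with `source_id`, `title`, `url`, and `local_path` fields (hypothetical names; the real manifest layout may differ):

```shell
#!/usr/bin/env bash
# Emit one INSERT statement per manifest entry.
# The field names read from the manifest are assumptions, not
# taken from the actual crawl output.
set -euo pipefail

manifest_to_sql() {
  # @json wraps strings in double quotes, which MySQL's default
  # SQL mode accepts as string literals (not under ANSI_QUOTES).
  jq -r '
    .documents[]
    | "INSERT INTO documents (source_id, title, url, raw_content_path, crawl_status) VALUES ("
      + (.source_id | tostring) + ", "
      + (.title | @json) + ", "
      + (.url | @json) + ", "
      + (.local_path | @json) + ", \"pending\");"
  ' "$1"
}
```

The output can be reviewed first, then piped into the `mysql` client from the connection snippet above.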

### list

List documents with filters:

```sql
SELECT doc_id, title, crawl_status, created_at
FROM documents
WHERE source_id = ? AND crawl_status = 'completed'
ORDER BY created_at DESC;
```

### get

Retrieve a specific document:

```sql
SELECT d.*, s.source_name, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
WHERE d.doc_id = ?;
```

### stats

Show repository statistics:

```sql
SELECT
  COUNT(*) AS total_docs,
  SUM(CASE WHEN crawl_status = 'completed' THEN 1 ELSE 0 END) AS completed,
  SUM(CASE WHEN crawl_status = 'pending' THEN 1 ELSE 0 END) AS pending
FROM documents;
```

## Deduplication

Documents are deduplicated by URL hash:

```sql
-- url_hash is auto-generated: SHA2(url, 256)
SELECT * FROM documents WHERE url_hash = SHA2('https://...', 256);
```
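
The duplicate check can also be pushed into the insert itself. A sketch, assuming a UNIQUE index exists on `url_hash` (not confirmed by the schema notes below), with illustrative values:

```sql
-- No-op update on collision: the existing row wins.
INSERT INTO documents (source_id, title, url, crawl_status)
VALUES (1, 'Example Doc', 'https://example.com/doc', 'pending')
ON DUPLICATE KEY UPDATE doc_id = doc_id;
```
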

## Version Tracking

When content changes:

```sql
-- Create new version
INSERT INTO documents (..., version, previous_version_id)
SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = ?;

-- Mark old as superseded
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = ?;
```
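
The two statements above should succeed or fail together, so a failure between them cannot leave both versions live. A sketch wrapping them in a transaction (column list still elided; `@old_id` is an illustrative session variable):

```sql
SET @old_id = 42;
START TRANSACTION;
INSERT INTO documents (..., version, previous_version_id)
  SELECT ..., version + 1, doc_id FROM documents WHERE doc_id = @old_id;
UPDATE documents SET crawl_status = 'stale' WHERE doc_id = @old_id;
COMMIT;
```
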

## Schema Reference

Key tables:

- `sources` - Authoritative source registry
- `documents` - Crawled document storage
- `distilled_content` - Processed summaries
- `review_logs` - QA decisions

Views:

- `v_pending_reviews` - Documents awaiting review
- `v_export_ready` - Approved for export
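
Downstream steps can read straight from the views. A sketch, assuming `v_export_ready` exposes `doc_id`, `title`, and `source_name` (its actual column list is not shown here):

```sql
SELECT doc_id, title, source_name
FROM v_export_ready
ORDER BY doc_id;
```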