our-claude-skills/custom-skills/90-reference-curator/03-content-repository/code/CLAUDE.md
Andrew Yim 6d7a6d7a88 feat(reference-curator): Add portable skill suite for reference documentation curation
6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:20:27 +07:00

# Content Repository
MySQL storage management for the reference library. Handles document storage, version control, deduplication, and retrieval.
## Trigger Keywords
"store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB"
## Prerequisites
- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized
## Database Setup
```bash
# Initialize database
mysql -u root -p < references/schema.sql
# Verify tables
mysql -u root -p reference_library -e "SHOW TABLES;"
```
## Core Scripts
### Store Document
```bash
python scripts/store_document.py \
  --source-id 1 \
  --title "Prompt Engineering Guide" \
  --url "https://docs.anthropic.com/..." \
  --doc-type webpage \
  --raw-path ~/reference-library/raw/2025/01/abc123.md
```
### Check Duplicate
```bash
python scripts/check_duplicate.py --url "https://docs.anthropic.com/..."
```
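The `documents` table lists `url_hash` as its deduplication key, but this file does not say how the hash is computed. A minimal sketch of URL-based deduplication, assuming SHA-256 over a lightly normalized URL. The `normalize_url` and `url_hash` helpers and the normalization rules are hypothetical and are not taken from `check_duplicate.py`:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def url_hash(url: str) -> str:
    """Return the hex SHA-256 digest of the normalized URL (assumed hash choice)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
```

Under these assumptions, two URLs that differ only in host casing, a trailing slash, or a fragment map to the same hash, so a lookup on `url_hash` would treat them as one document.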
### Query by Topic
```bash
python scripts/query_topic.py --topic-slug prompt-engineering --min-quality 0.80
```
## Table Quick Reference
| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized sources | source_type, credibility_tier, vendor |
| `documents` | Document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many links | relevance_score |
| `export_jobs` | Export tracking | export_type, status |
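The many-to-many link between `documents` and `topics` through `document_topics` can be sketched as a join. This is an illustration only: an in-memory SQLite database stands in for MySQL, and every column name except `topic_slug` and `relevance_score` (for example `doc_id`, `topic_id`, `title`) is an assumption rather than the schema in `references/schema.sql`. The `docs_for_topic` helper is likewise hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; column names other than
# topic_slug and relevance_score are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, topic_slug TEXT UNIQUE);
    CREATE TABLE document_topics (
        doc_id INTEGER REFERENCES documents(doc_id),
        topic_id INTEGER REFERENCES topics(topic_id),
        relevance_score REAL,
        PRIMARY KEY (doc_id, topic_id)
    );
    INSERT INTO documents VALUES (1, 'Prompt Engineering Guide'), (2, 'Unrelated Note');
    INSERT INTO topics VALUES (10, 'prompt-engineering');
    INSERT INTO document_topics VALUES (1, 10, 0.92), (2, 10, 0.40);
    """
)


def docs_for_topic(conn, slug, min_relevance=0.0):
    """Return (title, relevance_score) rows for a topic, most relevant first."""
    return conn.execute(
        """
        SELECT d.title, dt.relevance_score
        FROM documents AS d
        JOIN document_topics AS dt ON dt.doc_id = d.doc_id
        JOIN topics AS t ON t.topic_id = dt.topic_id
        WHERE t.topic_slug = ? AND dt.relevance_score >= ?
        ORDER BY dt.relevance_score DESC
        """,
        (slug, min_relevance),
    ).fetchall()


rows = docs_for_topic(conn, "prompt-engineering", min_relevance=0.5)
```

Against MySQL, the same query shape should carry over, but common Python drivers use `%s` placeholders instead of `?`.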
## Status Values
**crawl_status:** `pending` → `completed` | `failed` | `stale`
**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`
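One way to guard `review_status` updates is a small transition table. The sketch below reads the status line as `pending` leading to `in_review`, which then resolves to one of the three outcomes. That reading is an interpretation, and whether an outcome such as `needs_refactor` may re-enter the queue is not stated here, so the outcomes are treated as terminal. `REVIEW_TRANSITIONS` and `can_transition` are hypothetical names.

```python
# Allowed next states for review_status, as interpreted from the status
# values listed above (an assumption, not the schema). Outcomes are
# treated as terminal and map to an empty set.
REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
    "approved": set(),
    "needs_refactor": set(),
    "rejected": set(),
}


def can_transition(table, current, new):
    """Return True if moving from `current` to `new` is allowed by `table`."""
    if current not in table:
        raise ValueError(f"unknown status: {current!r}")
    return new in table[current]
```

A caller could check `can_transition(REVIEW_TRANSITIONS, old, new)` before writing the new status. A similar table could be written for `crawl_status`.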
## Common Queries
### Find Stale Documents
```bash
python scripts/find_stale.py --output stale_docs.json
```
### Get Pending Reviews
```bash
python scripts/pending_reviews.py --output pending.json
```
### Export-Ready Content
```bash
python scripts/export_ready.py --min-score 0.85 --output ready.json
```
## Scripts
- `scripts/store_document.py` - Store new document
- `scripts/check_duplicate.py` - URL deduplication
- `scripts/query_topic.py` - Query by topic
- `scripts/find_stale.py` - Find stale documents
- `scripts/pending_reviews.py` - Get pending reviews
- `scripts/export_ready.py` - List export-ready content
- `scripts/db_utils.py` - Database connection utilities
## Integration
| From | Action | To |
|------|--------|-----|
| crawler-orchestrator | Store crawled content | → |
| → | Query pending docs | content-distiller |
| quality-reviewer | Update review_status | → |
| → | Query approved content | markdown-exporter |