6 modular skills for curating, processing, and exporting reference docs:
- reference-discovery: Search and validate authoritative sources
- web-crawler-orchestrator: Multi-backend crawling (Firecrawl/Node/aiohttp/Scrapy)
- content-repository: MySQL storage with version tracking
- content-distiller: Summarization and key concept extraction
- quality-reviewer: QA loop with approve/refactor/research routing
- markdown-exporter: Structured output for Claude Projects or fine-tuning

Cross-machine installation support:
- Environment-based config (~/.reference-curator.env)
- Commands tracked in repo, symlinked during install
- install.sh with --minimal, --check, --uninstall modes
- Firecrawl MCP as default (always available)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Content Repository

MySQL storage management for the reference library. Handles document storage, version control, deduplication, and retrieval.

## Trigger Keywords
"store content", "save to database", "check duplicates", "version tracking", "document retrieval", "reference library DB"

## Prerequisites

- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized

## Database Setup

```bash
# Initialize database
mysql -u root -p < references/schema.sql

# Verify tables
mysql -u root -p reference_library -e "SHOW TABLES;"
```

## Core Scripts

### Store Document
```bash
python scripts/store_document.py \
  --source-id 1 \
  --title "Prompt Engineering Guide" \
  --url "https://docs.anthropic.com/..." \
  --doc-type webpage \
  --raw-path ~/reference-library/raw/2025/01/abc123.md
```

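To illustrate how storing a document might interact with version tracking, here is a minimal sketch. It is not the code in `scripts/store_document.py`: it uses an in-memory SQLite database as a stand-in for MySQL, and the `content_hash` column, the `open_demo_db` helper, and the "bump the version only when content changes" rule are all assumptions made for the example.

```python
import hashlib
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """In-memory SQLite stand-in; the real skill targets MySQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE documents (
            url_hash     TEXT PRIMARY KEY,
            title        TEXT NOT NULL,
            content_hash TEXT NOT NULL,  -- hypothetical column, not from the schema
            version      INTEGER NOT NULL DEFAULT 1
        )
        """
    )
    return conn


def store_document(conn: sqlite3.Connection, url: str, title: str, content: str) -> int:
    """Insert a new document, or bump its version when the content changed.

    Returns the version now stored for this URL.
    """
    # The raw URL is hashed here; a real pipeline would likely normalize it first.
    u_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    c_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()

    row = conn.execute(
        "SELECT content_hash, version FROM documents WHERE url_hash = ?",
        (u_hash,),
    ).fetchone()

    if row is None:
        conn.execute(
            "INSERT INTO documents (url_hash, title, content_hash) VALUES (?, ?, ?)",
            (u_hash, title, c_hash),
        )
        conn.commit()
        return 1

    old_hash, version = row
    if old_hash == c_hash:
        return version  # unchanged content: keep the existing version

    conn.execute(
        "UPDATE documents SET title = ?, content_hash = ?, version = ? "
        "WHERE url_hash = ?",
        (title, c_hash, version + 1, u_hash),
    )
    conn.commit()
    return version + 1
```

Storing the same URL twice with identical content leaves the version at 1 under these assumptions; storing it again with different content moves it to 2 while keeping a single row per URL.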
### Check Duplicate
```bash
python scripts/check_duplicate.py --url "https://docs.anthropic.com/..."
```

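The `documents` table keys deduplication on a `url_hash` field. As a rough sketch of what such a check could look like, the example below normalizes a URL and hashes it with SHA-256. The normalization rules, the choice of SHA-256, and the helper names `normalize_url`, `url_hash`, and `is_duplicate` are assumptions for illustration; the actual script may differ.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings hash the same."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "/docs/" and "/docs" as the same page; keep a bare "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it does not change which document the server returns.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_hash(url: str) -> str:
    """Hex SHA-256 digest of the normalized URL (hypothetical dedup key)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def is_duplicate(url: str, known_hashes: set[str]) -> bool:
    """True when the URL's hash is already present in the known set."""
    return url_hash(url) in known_hashes


if __name__ == "__main__":
    known = {url_hash("https://Example.com/docs/")}
    print(is_duplicate("https://example.com/docs#intro", known))
    print(is_duplicate("https://example.com/other", known))
```

Hashing a normalized form rather than the raw string keeps a fixed-width key suitable for a unique index while tolerating case differences in the host, trailing slashes, and fragments.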
### Query by Topic
```bash
python scripts/query_topic.py --topic-slug prompt-engineering --min-quality 0.80
```

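A topic query with a quality floor can be pictured as a join across the topic link table and the review logs. The sketch below is an assumption-laden illustration, not the query used by `scripts/query_topic.py`: it runs on in-memory SQLite instead of MySQL, the reduced schema keeps only a few columns, and linking `review_logs` directly to documents by `document_id` is a simplification made for the example.

```python
import sqlite3

# Reduced, hypothetical schema: only the columns the example query needs.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE topics (id INTEGER PRIMARY KEY, topic_slug TEXT UNIQUE);
CREATE TABLE document_topics (document_id INTEGER, topic_id INTEGER, relevance_score REAL);
CREATE TABLE review_logs (document_id INTEGER, quality_score REAL);
"""

# Documents under one topic whose best recorded quality score meets the floor.
QUERY = """
SELECT d.title, MAX(r.quality_score) AS best_score
FROM documents d
JOIN document_topics dt ON dt.document_id = d.id
JOIN topics t ON t.id = dt.topic_id
JOIN review_logs r ON r.document_id = d.id
WHERE t.topic_slug = ?
GROUP BY d.id, d.title
HAVING MAX(r.quality_score) >= ?
ORDER BY best_score DESC
"""


def query_topic(conn: sqlite3.Connection, topic_slug: str, min_quality: float):
    """Return (title, best_score) rows for a topic, highest score first."""
    return conn.execute(QUERY, (topic_slug, min_quality)).fetchall()


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory dataset with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?)",
        [(1, "Doc A"), (2, "Doc B"), (3, "Doc C")],
    )
    conn.executemany(
        "INSERT INTO topics VALUES (?, ?)",
        [(1, "prompt-engineering"), (2, "tool-use")],
    )
    conn.executemany(
        "INSERT INTO document_topics VALUES (?, ?, ?)",
        [(1, 1, 0.9), (2, 1, 0.7), (3, 2, 0.8)],
    )
    conn.executemany(
        "INSERT INTO review_logs VALUES (?, ?)",
        [(1, 0.92), (2, 0.75), (3, 0.95)],
    )
    return conn


if __name__ == "__main__":
    for title, score in query_topic(build_demo_db(), "prompt-engineering", 0.80):
        print(title, score)
```

With the demo rows, only "Doc A" clears a floor of 0.80 for `prompt-engineering`; lowering the floor to 0.70 also returns "Doc B".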
## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized sources | source_type, credibility_tier, vendor |
| `documents` | Document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many links | relevance_score |
| `export_jobs` | Export tracking | export_type, status |

## Status Values

**crawl_status:** `pending` → `completed` | `failed` | `stale`

**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`

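The status flows above can be read as small state machines. The following sketch encodes them literally as transition tables and rejects moves the notation does not list. The names `CRAWL_TRANSITIONS`, `REVIEW_TRANSITIONS`, `can_transition`, and `advance` are hypothetical, and treating the right-hand states as terminal is an assumption: the document does not say whether, for example, `needs_refactor` can return to `pending`.

```python
# Allowed moves, transcribed from the status notation; states not listed
# as keys are treated as terminal in this sketch (an assumption).
CRAWL_TRANSITIONS = {
    "pending": {"completed", "failed", "stale"},
}

REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
}


def can_transition(table: dict, current: str, new: str) -> bool:
    """True when `current -> new` appears in the given transition table."""
    return new in table.get(current, set())


def advance(table: dict, current: str, new: str) -> str:
    """Return the new status, or raise ValueError for an unlisted move."""
    if not can_transition(table, current, new):
        raise ValueError(f"illegal transition: {current!r} -> {new!r}")
    return new


if __name__ == "__main__":
    status = "pending"
    status = advance(REVIEW_TRANSITIONS, status, "in_review")
    status = advance(REVIEW_TRANSITIONS, status, "approved")
    print(status)
```

Checking transitions in one place keeps a script from, say, marking content `approved` without it ever passing through `in_review`.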
## Common Queries

### Find Stale Documents
```bash
python scripts/find_stale.py --output stale_docs.json
```

### Get Pending Reviews
```bash
python scripts/pending_reviews.py --output pending.json
```

### Export-Ready Content
```bash
python scripts/export_ready.py --min-score 0.85 --output ready.json
```

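As a rough picture of what "export-ready" could mean, the sketch below keeps rows that are `approved` and whose `quality_score` meets the minimum, then serializes them to JSON. This definition is an assumption rather than the documented behavior of `scripts/export_ready.py`, and the helper names `select_export_ready` and `to_json` are hypothetical, as is the row shape (plain dictionaries with `id`, `review_status`, and `quality_score`).

```python
import json


def select_export_ready(rows: list[dict], min_score: float) -> list[dict]:
    """Keep approved rows at or above min_score, best score first."""
    ready = [
        row
        for row in rows
        if row["review_status"] == "approved" and row["quality_score"] >= min_score
    ]
    return sorted(ready, key=lambda row: row["quality_score"], reverse=True)


def to_json(rows: list[dict]) -> str:
    """Serialize the selection in the form a --output file might hold."""
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    demo_rows = [
        {"id": 1, "review_status": "approved", "quality_score": 0.90},
        {"id": 2, "review_status": "approved", "quality_score": 0.80},
        {"id": 3, "review_status": "needs_refactor", "quality_score": 0.95},
        {"id": 4, "review_status": "approved", "quality_score": 0.85},
    ]
    print(to_json(select_export_ready(demo_rows, 0.85)))
```

Note that row 3 is excluded despite its high score, because under this assumed rule review status gates export before the score threshold is considered.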
## Scripts

- `scripts/store_document.py` - Store new document
- `scripts/check_duplicate.py` - URL deduplication
- `scripts/query_topic.py` - Query by topic
- `scripts/find_stale.py` - Find stale documents
- `scripts/pending_reviews.py` - Get pending reviews
- `scripts/export_ready.py` - List export-ready content
- `scripts/db_utils.py` - Database connection utilities

## Integration

| From | Action | To |
|------|--------|-----|
| web-crawler-orchestrator | Store crawled content | content-repository |
| content-repository | Query pending docs | content-distiller |
| quality-reviewer | Update review_status | content-repository |
| content-repository | Query approved content | markdown-exporter |