# Reference Curator Skills
Modular Claude Skills for curating, processing, and exporting reference documentation.
## Quick Start
```bash
# Clone and install
git clone https://github.com/ourdigital/our-claude-skills.git
cd our-claude-skills/custom-skills/90-reference-curator
./install.sh
# Or minimal install (Firecrawl only, no MySQL)
./install.sh --minimal
# Check installation status
./install.sh --check
# Uninstall
./install.sh --uninstall
```
---
## Installing from GitHub
### For New Machines
1. **Clone the repository:**
   ```bash
   git clone https://github.com/ourdigital/our-claude-skills.git
   cd our-claude-skills/custom-skills/90-reference-curator
   ```
2. **Run the installer:**
   ```bash
   ./install.sh
   ```
3. **Follow the interactive prompts:**
   - Set storage directory path
   - Configure MySQL credentials (optional)
   - Choose crawler backend (Firecrawl MCP recommended)
4. **Add to shell profile:**
   ```bash
   echo 'source ~/.reference-curator.env' >> ~/.zshrc
   source ~/.reference-curator.env
   ```
5. **Verify installation:**
   ```bash
   ./install.sh --check
   ```
### Installation Modes
| Mode | Command | Description |
|------|---------|-------------|
| **Full** | `./install.sh` | Interactive setup with MySQL and crawlers |
| **Minimal** | `./install.sh --minimal` | Firecrawl MCP only, no database |
| **Check** | `./install.sh --check` | Verify installation status |
| **Claude.ai** | `./install.sh --claude-ai` | Export skills for Claude.ai Projects |
| **Uninstall** | `./install.sh --uninstall` | Remove installation (preserves data) |
### What Gets Installed
| Component | Location | Purpose |
|-----------|----------|---------|
| Environment config | `~/.reference-curator.env` | Credentials and paths |
| Config files | `~/.config/reference-curator/` | YAML configuration |
| Storage directories | `~/reference-library/` | Raw/processed/exports |
| Claude Code commands | `~/.claude/commands/` | Slash commands |
| Claude Desktop skills | `~/.claude/skills/` | Skill symlinks |
| MySQL database | `reference_library` | Document storage |
### Environment Variables
The installer creates `~/.reference-curator.env` with:
```bash
# Storage paths
export REFERENCE_LIBRARY_PATH="~/reference-library"
# MySQL configuration (if enabled)
export MYSQL_HOST="localhost"
export MYSQL_PORT="3306"
export MYSQL_USER="youruser"
export MYSQL_PASSWORD="yourpassword"
# Crawler configuration
export DEFAULT_CRAWLER="firecrawl" # or "nodejs"
export CRAWLER_PROJECT_PATH="" # Path to local crawlers (optional)
```
---
## Claude.ai Projects Installation
To use these skills in Claude.ai (web interface), export the skill files for upload:
```bash
./install.sh --claude-ai
```
This displays available files in `claude-project/` and optionally copies them to a convenient location.
### Files for Upload
| File | Description |
|------|-------------|
| `reference-curator-complete.md` | All 6 skills combined (recommended) |
| `INDEX.md` | Overview and workflow documentation |
| `01-reference-discovery.md` | Source discovery skill |
| `02-web-crawler.md` | Crawling orchestration skill |
| `03-content-repository.md` | Database storage skill |
| `04-content-distiller.md` | Content summarization skill |
| `05-quality-reviewer.md` | QA review skill |
| `06-markdown-exporter.md` | Export skill |
### Upload Instructions
1. Go to [claude.ai](https://claude.ai)
2. Create a new Project or open an existing one
3. Click "Add to project knowledge"
4. Upload `reference-curator-complete.md` (or individual skills as needed)
---
## Architecture
```
                ┌──────────────────────────────┐
                │  reference-curator-pipeline  │  (Orchestrator)
                │ /reference-curator-pipeline  │
                └──────────────┬───────────────┘
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
  [Topic Input]           [URL Input]          [Manifest Input]
        │                      │                      │
        ▼                      │                      │
┌─────────────────────┐        │                      │
│ reference-discovery │ ◄──────┴──────────────────────┘
└──────────┬──────────┘   (skip if URLs/manifest)
           ▼
┌──────────────────────────┐
│ web-crawler-orchestrator │ → Crawl (Firecrawl/Node.js/aiohttp/Scrapy)
└────────────┬─────────────┘
             ▼
┌────────────────────┐
│ content-repository │ → Store in MySQL
└─────────┬──────────┘
          ▼
┌───────────────────┐
│ content-distiller │ → Summarize & extract ◄─────┐
└─────────┬─────────┘                             │
          ▼                                       │
┌──────────────────┐                              │
│ quality-reviewer │ → QA loop                    │
└─────────┬────────┘                              │
          ├── REFACTOR (max 3) ───────────────────┤
          ├── DEEP_RESEARCH (max 2) → crawler ────┘
          ▼ APPROVE
┌───────────────────┐
│ markdown-exporter │ → Project files / Fine-tuning
└───────────────────┘
```
---
## User Guide
### Full Pipeline (Recommended)
Run the complete curation workflow with a single command:
```
# From topic - runs all 6 stages automatically
/reference-curator-pipeline "Claude Code best practices" --max-sources 5
# From URLs - skip discovery, start at crawler
/reference-curator-pipeline https://docs.anthropic.com/en/docs/prompt-caching
# Resume from manifest file
/reference-curator-pipeline ./manifest.json --auto-approve
# Fine-tuning dataset output
/reference-curator-pipeline "MCP servers" --export-format fine_tuning
```
**Pipeline Options:**
- `--max-sources 10` - Max sources to discover (topic mode)
- `--max-pages 50` - Max pages per source to crawl
- `--auto-approve` - Auto-approve scores above threshold
- `--threshold 0.85` - Approval threshold
- `--max-iterations 3` - Max QA loop iterations per document
- `--export-format project_files` - Output format (project_files, fine_tuning, jsonl)
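Conceptually, these flags just override a set of defaults. A minimal Python sketch (option names mirror the flags above; the default values and the `merge_options` helper are illustrative, not the orchestrator's actual code):

```python
# Hypothetical defaults corresponding to the pipeline flags above.
PIPELINE_DEFAULTS = {
    "max_sources": 10,
    "max_pages": 50,
    "auto_approve": False,
    "threshold": 0.85,
    "max_iterations": 3,
    "export_format": "project_files",
}

def merge_options(overrides: dict) -> dict:
    """Apply CLI overrides on top of the defaults, rejecting unknown flags."""
    unknown = set(overrides) - set(PIPELINE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return {**PIPELINE_DEFAULTS, **overrides}
```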
---
### Manual Workflow (Step-by-Step)
**Step 1: Discover References**
```
/reference-discovery Claude's system prompt best practices
```
Claude searches the web, validates sources, and creates a manifest of URLs to crawl.
**Step 2: Crawl Content**
```
/web-crawler https://docs.anthropic.com --max-pages 50
```
The crawler automatically selects the best backend based on site characteristics.
**Step 3: Store in Repository**
```
/content-repository store
```
Documents are saved to MySQL with deduplication and version tracking.
**Step 4: Distill Content**
```
/content-distiller all-pending
```
Claude summarizes, extracts key concepts, and creates structured content.
**Step 5: Quality Review**
```
/quality-reviewer all-pending --auto-approve
```
Automated QA scoring routes each document to one of four outcomes: approve, refactor, deep research, or reject.
**Step 6: Export**
```
/markdown-exporter project_files --topic prompt-engineering
```
Generates markdown files organized by topic with cross-references.
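These six manual steps run in the same order the pipeline orchestrator chains them. A sketch of that stage sequence (the stage names are the skill names used throughout this README; the runner itself is a hypothetical illustration):

```python
# The six manual steps above as an ordered stage list.
STAGES = [
    "reference-discovery",
    "web-crawler-orchestrator",
    "content-repository",
    "content-distiller",
    "quality-reviewer",
    "markdown-exporter",
]

def run_pipeline(run_stage, start_at: str = "reference-discovery") -> None:
    """Run stages in order; start later (e.g. at the crawler) when URLs
    or a manifest are supplied and discovery can be skipped."""
    for stage in STAGES[STAGES.index(start_at):]:
        run_stage(stage)
```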
### Example Prompts
| Task | Command |
|------|---------|
| **Discover sources** | `/reference-discovery MCP server development` |
| **Crawl URL** | `/web-crawler https://docs.anthropic.com` |
| **Check repository** | `/content-repository stats` |
| **Distill document** | `/content-distiller 42` |
| **Review quality** | `/quality-reviewer all-pending` |
| **Export files** | `/markdown-exporter project_files` |
### Crawler Selection
The system automatically selects the optimal crawler:

| Site Type | Crawler | When Auto-Selected |
|-----------|---------|-------------------|
| SPAs/Dynamic | **Firecrawl MCP** | React, Vue, Angular sites (default) |
| Small docs sites | **Node.js** | ≤50 pages, static HTML |
| Technical docs | **Python aiohttp** | ≤200 pages, needs SEO data |
| Enterprise sites | **Scrapy** | >200 pages, multi-domain |
Override with explicit request:
```
/web-crawler https://example.com --crawler firecrawl
```
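The selection table can be read as a small decision function. This Python sketch is only a heuristic illustration of the rules above (the function and its parameters are hypothetical, not the orchestrator's real logic):

```python
def select_crawler(pages: int, is_spa: bool = False, multi_domain: bool = False) -> str:
    """Approximate the crawler-selection table: SPAs go to Firecrawl,
    large or multi-domain crawls to Scrapy, mid-size docs to aiohttp,
    small static sites to the Node.js crawler."""
    if is_spa:
        return "firecrawl"   # dynamic React/Vue/Angular sites (default)
    if multi_domain or pages > 200:
        return "scrapy"      # enterprise-scale, multi-domain crawls
    if pages > 50:
        return "aiohttp"     # technical docs up to ~200 pages
    return "nodejs"          # small static docs sites (<=50 pages)
```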
### Quality Review Decisions
| Score | Decision | What Happens |
|-------|----------|--------------|
| ≥ 0.85 | **Approve** | Ready for export |
| 0.60-0.84 | **Refactor** | Re-distill with feedback |
| 0.40-0.59 | **Deep Research** | Gather more sources |
| < 0.40 | **Reject** | Archive (low quality) |
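The thresholds in this table translate directly into a routing function; a minimal Python sketch (the `route_review` name is hypothetical):

```python
def route_review(score: float) -> str:
    """Map a QA score in [0, 1] to the decisions in the table above."""
    if score >= 0.85:
        return "approve"        # ready for export
    if score >= 0.60:
        return "refactor"       # re-distill with feedback
    if score >= 0.40:
        return "deep_research"  # gather more sources
    return "reject"             # archive (low quality)
```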
### Database Queries
Check your reference library status:
```bash
# Source credentials
source ~/.reference-curator.env
# Count documents by status
mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e "
SELECT crawl_status, COUNT(*) as count
FROM documents GROUP BY crawl_status;"
# View pending reviews
mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e "
SELECT * FROM v_pending_reviews;"
# View export-ready documents
mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library -e "
SELECT * FROM v_export_ready;"
```
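For intuition, the first query's `GROUP BY crawl_status` is the same aggregation as a Python `Counter` over the status column (the rows here are made-up sample data, not the real schema's contents):

```python
from collections import Counter

# Illustrative rows standing in for the documents table.
rows = [
    {"crawl_status": "completed"},
    {"crawl_status": "completed"},
    {"crawl_status": "pending"},
]
counts = Counter(r["crawl_status"] for r in rows)
# counts == Counter({'completed': 2, 'pending': 1})
```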
### Output Formats
**For Claude Projects:**
```
~/reference-library/exports/
├── INDEX.md                  # Master index with all topics
└── prompt-engineering/       # Topic folder
    ├── _index.md             # Topic overview
    ├── system-prompts.md     # Individual document
    └── chain-of-thought.md
```
**For Fine-tuning (JSONL):**
```json
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
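Each JSONL line is one self-contained JSON object in this shape. A quick Python sketch of building and round-tripping a record (the message contents are placeholders):

```python
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a documentation assistant."},
        {"role": "user", "content": "What is prompt caching?"},
        {"role": "assistant", "content": "Prompt caching reuses a processed prompt prefix."},
    ]
}
line = json.dumps(record)  # exactly one JSON object per line
parsed = json.loads(line)
assert [m["role"] for m in parsed["messages"]] == ["system", "user", "assistant"]
```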
---
## Skills & Commands Reference
| # | Skill | Command | Purpose |
|---|-------|---------|---------|
| 01 | reference-discovery | `/reference-discovery` | Search authoritative sources |
| 02 | web-crawler-orchestrator | `/web-crawler` | Multi-backend crawling |
| 03 | content-repository | `/content-repository` | MySQL storage management |
| 04 | content-distiller | `/content-distiller` | Summarize & extract |
| 05 | quality-reviewer | `/quality-reviewer` | QA scoring & routing |
| 06 | markdown-exporter | `/markdown-exporter` | Export to markdown/JSONL |
| 07 | pipeline-orchestrator | `/reference-curator-pipeline` | Full pipeline orchestration |
---
## Configuration
### Environment (`~/.reference-curator.env`)
```bash
# Required for MySQL features
export MYSQL_HOST="localhost"
export MYSQL_USER="youruser"
export MYSQL_PASSWORD="yourpassword"
# Storage location
export REFERENCE_LIBRARY_PATH="~/reference-library"
# Crawler selection
export DEFAULT_CRAWLER="firecrawl" # firecrawl, nodejs, aiohttp, scrapy
```
### Database (`~/.config/reference-curator/db_config.yaml`)
```yaml
mysql:
  host: ${MYSQL_HOST:-localhost}
  port: ${MYSQL_PORT:-3306}
  database: reference_library
  user: ${MYSQL_USER}
  password: ${MYSQL_PASSWORD}
```
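The `${VAR:-default}` placeholders in these templates use shell-style syntax that YAML parsers do not expand on their own, so the loading code presumably substitutes them after reading the file. A minimal Python sketch of that expansion (the `expand` helper is hypothetical):

```python
import os
import re

def expand(value: str) -> str:
    """Expand ${VAR} and ${VAR:-default} placeholders from the environment."""
    def repl(match: re.Match) -> str:
        name, default = match.group(1), match.group(2) or ""
        return os.environ.get(name, default)
    return re.sub(r"\$\{(\w+)(?::-([^}]*))?\}", repl, value)
```

For example, with `MYSQL_HOST` unset, `expand("${MYSQL_HOST:-localhost}")` falls back to `"localhost"`.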
### Crawlers (`~/.config/reference-curator/crawl_config.yaml`)
```yaml
default_crawler: ${DEFAULT_CRAWLER:-firecrawl}
rate_limit:
  requests_per_minute: 20
  concurrent_requests: 3
default_options:
  timeout: 30000
  max_depth: 3
  max_pages: 100
```
### Export (`~/.config/reference-curator/export_config.yaml`)
```yaml
output:
  base_path: ${REFERENCE_LIBRARY_PATH:-~/reference-library}/exports/
quality:
  min_score_for_export: 0.80
  auto_approve_tier1_sources: true
```
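These two quality settings amount to a simple export gate; a sketch under stated assumptions (the function and parameter names are hypothetical, and "tier 1" is taken to mean the source's top trust tier):

```python
def is_exportable(score: float, source_tier: int,
                  min_score: float = 0.80,
                  auto_approve_tier1: bool = True) -> bool:
    """Gate a document for export per the quality settings above."""
    if auto_approve_tier1 and source_tier == 1:
        return True  # trusted tier-1 sources bypass the score check
    return score >= min_score
```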
---
## Troubleshooting
### MySQL Connection Failed
```bash
# Check MySQL is running
brew services list | grep mysql # macOS
systemctl status mysql # Linux
# Start MySQL
brew services start mysql # macOS
sudo systemctl start mysql # Linux
# Verify credentials
source ~/.reference-curator.env
mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" -e "SELECT 1"
```
### Commands Not Found
```bash
# Check commands are registered
ls -la ~/.claude/commands/
# Re-run installer to fix
./install.sh
```
### Crawler Timeout
For slow sites, increase timeout in `~/.config/reference-curator/crawl_config.yaml`:
```yaml
default_options:
  timeout: 60000  # 60 seconds
```
### Skills Not Loading
```bash
# Check symlinks exist
ls -la ~/.claude/skills/
# Re-run installer
./install.sh --uninstall
./install.sh
```
### Database Schema Outdated
```bash
# Re-apply schema (preserves data with IF NOT EXISTS)
source ~/.reference-curator.env
mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library < shared/schema.sql
```
---
## Directory Structure
```
90-reference-curator/
├── README.md                          # This file
├── CHANGELOG.md                       # Version history
├── install.sh                         # Portable installation script
├── claude-project/                    # Files for Claude.ai Projects
│   ├── INDEX.md                       # Overview
│   ├── reference-curator-complete.md  # All skills combined
│   ├── 01-reference-discovery.md
│   ├── 02-web-crawler.md
│   ├── 03-content-repository.md
│   ├── 04-content-distiller.md
│   ├── 05-quality-reviewer.md
│   └── 06-markdown-exporter.md
├── commands/                          # Claude Code commands (tracked in git)
│   ├── reference-discovery.md
│   ├── web-crawler.md
│   ├── content-repository.md
│   ├── content-distiller.md
│   ├── quality-reviewer.md
│   ├── markdown-exporter.md
│   └── reference-curator-pipeline.md
├── 01-reference-discovery/
│   ├── code/CLAUDE.md                 # Claude Code directive
│   └── desktop/SKILL.md               # Claude Desktop directive
├── 02-web-crawler-orchestrator/
│   ├── code/CLAUDE.md
│   └── desktop/SKILL.md
├── 03-content-repository/
│   ├── code/CLAUDE.md
│   └── desktop/SKILL.md
├── 04-content-distiller/
│   ├── code/CLAUDE.md
│   └── desktop/SKILL.md
├── 05-quality-reviewer/
│   ├── code/CLAUDE.md
│   └── desktop/SKILL.md
├── 06-markdown-exporter/
│   ├── code/CLAUDE.md
│   └── desktop/SKILL.md
├── 07-pipeline-orchestrator/
│   ├── code/CLAUDE.md
│   └── desktop/SKILL.md
└── shared/
    ├── schema.sql                     # MySQL schema
    └── config/                        # Config templates
        ├── db_config.yaml
        ├── crawl_config.yaml
        └── export_config.yaml
```
---
## Platform Differences
| Aspect | `code/` (Claude Code) | `desktop/` (Claude Desktop) |
|--------|----------------------|----------------------------|
| Directive | CLAUDE.md | SKILL.md (YAML frontmatter) |
| Commands | `~/.claude/commands/` | Not used |
| Skills | Reference only | `~/.claude/skills/` symlinks |
| Execution | Direct Bash/Python | MCP tools only |
| Best for | Automation, CI/CD | Interactive use |
---
## Prerequisites
### Required
- macOS or Linux
- Claude Code or Claude Desktop
### Optional (for full features)
- MySQL 8.0+ (for database storage)
- Firecrawl MCP server configured
- Node.js 18+ (for Node.js crawler)
- Python 3.12+ (for aiohttp/Scrapy crawlers)