feat(reference-curator): Add pipeline orchestrator and refactor skill format

Pipeline Orchestrator:
- Add 07-pipeline-orchestrator skill with code/CLAUDE.md and desktop/SKILL.md
- Add /reference-curator-pipeline slash command for full workflow automation
- Add pipeline_runs and pipeline_iteration_tracker tables to schema.sql
- Add v_pipeline_status and v_pipeline_iterations views
- Add pipeline_config.yaml configuration template
- Update AGENTS.md with Reference Curator Skills section
- Update claude-project files with pipeline documentation

Skill Format Refactoring:
- Extract YAML frontmatter from SKILL.md files to separate skill.yaml
- Add tools/ directories with MCP tool documentation
- Update SKILL-FORMAT-REQUIREMENTS.md with new structure
- Add migrate-skill-structure.py script for format conversion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 01:01:02 +07:00
parent 243b9d851c
commit d1cd1298a8
91 changed files with 2475 additions and 281 deletions


@@ -0,0 +1,175 @@
# Pipeline Orchestrator
Coordinates the full 6-skill reference curation workflow with automated QA loop handling.
## Trigger Phrases
- "curate references on [topic]"
- "run full curation pipeline"
- "automate reference curation"
- "curate these URLs: [url1, url2]"
## Input Modes
| Mode | Example | Pipeline Start |
|------|---------|----------------|
| **Topic** | "curate references on Claude system prompts" | Stage 1 (discovery) |
| **URLs** | "curate these URLs: https://docs.anthropic.com/..." | Stage 2 (crawler) |
| **Manifest** | "resume curation from manifest.json" | Stage 2 (crawler) |
## Pipeline Stages
```
1. reference-discovery (topic mode only)
2. web-crawler-orchestrator
3. content-repository
4. content-distiller ◄─────────────┐
│ │
▼ │
5. quality-reviewer │
│ │
├── APPROVE → Stage 6 │
├── REFACTOR ───────────────┤
├── DEEP_RESEARCH → Stage 2 ┘
└── REJECT → Archive
6. markdown-exporter
```
## Configuration Options
| Option | Default | Description |
|--------|---------|-------------|
| max_sources | 10 | Maximum sources to discover (topic mode) |
| max_pages | 50 | Maximum pages per source to crawl |
| auto_approve | false | Auto-approve scores above threshold |
| threshold | 0.85 | Quality score threshold for approval |
| max_iterations | 3 | Maximum QA loop iterations per document |
| export_format | project_files | Output format (project_files, fine_tuning, jsonl) |
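As a concrete sketch, the defaults above could be loaded and validated before a run starts. The function and constant names below are illustrative, not part of the skill itself; only the option names and defaults come from the table:

```python
# Default options, mirroring the configuration table above.
DEFAULTS = {
    "max_sources": 10,       # topic mode only
    "max_pages": 50,
    "auto_approve": False,
    "threshold": 0.85,
    "max_iterations": 3,
    "export_format": "project_files",
}

VALID_FORMATS = {"project_files", "fine_tuning", "jsonl"}

def load_config(overrides=None):
    """Merge user overrides onto the defaults, rejecting unknown keys."""
    config = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown option: {key}")
        config[key] = value
    if config["export_format"] not in VALID_FORMATS:
        raise ValueError(f"bad export_format: {config['export_format']}")
    return config
```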
## QA Loop Handling
The orchestrator automatically handles QA decisions:
| Decision | Action | Iteration Limit |
|----------|--------|-----------------|
| **APPROVE** | Proceed to export | - |
| **REFACTOR** | Re-distill with feedback | 3 iterations |
| **DEEP_RESEARCH** | Crawl more sources, re-distill | 2 iterations |
| **REJECT** | Archive with reason | - |
After reaching iteration limits, documents are marked `needs_manual_review`.
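The decision table above can be sketched as a small dispatch function. The helper name, the status strings it returns, and the `counters` shape are assumptions for illustration; the decisions and iteration limits are from the table:

```python
# Iteration limits per QA decision, as in the table above.
LIMITS = {"REFACTOR": 3, "DEEP_RESEARCH": 2}

def handle_decision(doc_id, decision, counters):
    """Apply one QA decision and return the document's next status.

    counters maps (doc_id, decision) -> iterations used so far.
    """
    if decision == "APPROVE":
        return "export"
    if decision == "REJECT":
        return "archived"
    key = (doc_id, decision)
    counters[key] = counters.get(key, 0) + 1
    if counters[key] > LIMITS[decision]:
        # Iteration limit exhausted for this loop type.
        return "needs_manual_review"
    # REFACTOR re-enters the distiller; DEEP_RESEARCH re-enters the crawler.
    return "re_distill" if decision == "REFACTOR" else "re_crawl"
```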
## State Management
### With Database
Pipeline state is tracked in `pipeline_runs` table:
- Run ID, input type, current stage
- Statistics (crawled, distilled, approved, etc.)
- Error handling and resume capability
### File-Based Fallback
State is saved to `~/reference-library/pipeline_state/run_XXX/`:
- `state.json` - Current stage and statistics
- `manifest.json` - Discovered sources
- `review_log.json` - QA decisions
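A minimal sketch of the file-based checkpoint, assuming the layout above; the field names inside `state.json` are illustrative:

```python
import json
from pathlib import Path

def save_state(run_dir, stage, stats):
    """Checkpoint the current stage and statistics to state.json."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    state = {"stage": stage, "stats": stats}
    (run_dir / "state.json").write_text(json.dumps(state, indent=2))

def load_state(run_dir):
    """Load the last checkpoint, or None if the run has no state yet."""
    path = Path(run_dir) / "state.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```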
## Progress Tracking
The orchestrator reports progress at each stage:
```
[Pipeline] Stage 1/6: Discovery - Found 8 sources
[Pipeline] Stage 2/6: Crawling - 45/50 pages complete
[Pipeline] Stage 3/6: Storing - 45 documents saved
[Pipeline] Stage 4/6: Distilling - 45 documents processed
[Pipeline] Stage 5/6: Reviewing - 40 approved, 3 refactored, 2 rejected
[Pipeline] Stage 6/6: Exporting - 40 documents exported
[Pipeline] Complete! See ~/reference-library/exports/
```
## Output Summary
On completion, the orchestrator returns a detailed summary:
```
Pipeline Complete:
- Sources discovered: 5
- Pages crawled: 45
- Documents stored: 45
- Approved: 40
- Refactored: 8
- Deep researched: 2
- Rejected: 3
- Needs manual review: 2
Exports saved to: ~/reference-library/exports/
Format: project_files
```
## Error Handling
If a stage fails:
1. State is checkpointed
2. Error is logged with details
3. Pipeline can be resumed from last successful stage
## Resume Pipeline
To resume a paused or failed pipeline:
- Provide the manifest.json or run state file
- Pipeline continues from last checkpoint
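Resuming can then be a matter of reading the checkpoint and skipping completed stages. The `STAGES` list follows the stage ordering above; the skip-after rule is an assumption about how the checkpoint is interpreted:

```python
# Stage order, matching the Pipeline Stages section.
STAGES = [
    "reference-discovery",
    "web-crawler-orchestrator",
    "content-repository",
    "content-distiller",
    "quality-reviewer",
    "markdown-exporter",
]

def stages_to_run(checkpoint_stage):
    """Return the stages still to execute, resuming after the last
    successfully completed stage recorded in the checkpoint."""
    if checkpoint_stage is None:
        return list(STAGES)
    idx = STAGES.index(checkpoint_stage)
    return STAGES[idx + 1:]
```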
## Example Usage
### Full Pipeline from Topic
```
User: Curate references on Claude Code best practices with 5 sources
Claude: I'll run the full curation pipeline for "Claude Code best practices":
[Pipeline] Stage 1/6: Discovering sources...
Found 5 authoritative sources (3 official, 2 community)
[Pipeline] Stage 2/6: Crawling 5 sources...
Crawled 45 pages total
[Pipeline] Stage 3/6: Storing documents...
45 documents saved to repository
[Pipeline] Stage 4/6: Distilling content...
45 documents processed
[Pipeline] Stage 5/6: Quality review...
- 38 approved
- 5 refactored (re-distilling...)
- 2 rejected (low quality)
[Pipeline] Stage 6/6: Exporting...
40 documents exported to ~/reference-library/exports/
Pipeline complete! 40 reference documents ready for use.
```
### From Specific URLs
```
User: Curate these URLs with auto-approve:
- https://docs.anthropic.com/en/docs/prompt-caching
- https://docs.anthropic.com/en/docs/build-with-claude/tool-use
Claude: Running pipeline from URLs (skipping discovery)...
[Pipeline] Stage 2/6: Crawling 2 sources...
[...continues with remaining stages...]
```


@@ -1,11 +1,27 @@
# Reference Curator - Claude.ai Project Knowledge
This project knowledge enables Claude to curate, process, and export reference documentation through 7 modular skills.
## Quick Start - Pipeline Orchestrator
Run the full curation workflow with a single command:
```
# Full pipeline from topic
curate references on "Claude Code best practices"
# From URLs (skip discovery)
curate these URLs: https://docs.anthropic.com/en/docs/prompt-caching
# With auto-approve
curate references on "MCP servers" with auto-approve
```
## Skills Overview
| Skill | Purpose | Trigger Phrases |
|-------|---------|-----------------|
| **pipeline-orchestrator** | Full 6-skill workflow with QA loops | "curate references", "run full pipeline", "automate curation" |
| **reference-discovery** | Search & validate authoritative sources | "find references", "search documentation", "discover sources" |
| **web-crawler** | Multi-backend crawling orchestration | "crawl URL", "fetch documents", "scrape pages" |
| **content-repository** | MySQL storage management | "store content", "save to database", "check duplicates" |
@@ -16,37 +32,43 @@ This project knowledge enables Claude to curate, process, and export reference d
## Workflow
```
        ┌───────────────────────────┐
        │   pipeline-orchestrator   │  (coordinates all stages)
        └───────────────────────────┘
    ┌───────────────────┼───────────────────┐
    ▼                   ▼                   ▼
[Topic Input]      [URL Input]      [Manifest Input]
    │                   │                   │
    ▼                   │                   │
┌─────────────────────┐ │                   │
│ reference-discovery │ ◄───────┴───────────┘
└─────────────────────┘   (skip if URLs/manifest)
    │
    ▼
┌─────────────────────┐
│ web-crawler         │ → Crawl (Firecrawl/Node.js/aiohttp/Scrapy)
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│ content-repository  │ → Store in MySQL
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│ content-distiller   │ → Summarize & extract  ◄──────────────┐
└─────────────────────┘                                       │
    │                                                         │
    ▼                                                         │
┌─────────────────────┐                                       │
│ quality-reviewer    │ → QA loop                             │
└─────────────────────┘                                       │
    ├── REFACTOR (max 3) ─────────────────────────────────────┤
    ├── DEEP_RESEARCH (max 2) → web-crawler ──────────────────┘
    ▼ APPROVE
┌─────────────────────┐
│ markdown-exporter   │ → Project files / Fine-tuning
└─────────────────────┘
```
@@ -74,16 +96,28 @@ This project knowledge enables Claude to curate, process, and export reference d
## Files in This Project
- `INDEX.md` - This overview file
- `reference-curator-complete.md` - All 7 skills in one file (recommended)
- `01-reference-discovery.md` - Source discovery skill
- `02-web-crawler.md` - Crawling orchestration skill
- `03-content-repository.md` - Database storage skill
- `04-content-distiller.md` - Content summarization skill
- `05-quality-reviewer.md` - QA review skill
- `06-markdown-exporter.md` - Export skill
- `07-pipeline-orchestrator.md` - Full pipeline orchestration
## Usage
Upload all files to a Claude.ai Project, or upload only the skills you need.
For the complete experience, upload `reference-curator-complete.md`, which contains all skills in one file.
## Pipeline Orchestrator Options
| Option | Default | Description |
|--------|---------|-------------|
| max_sources | 10 | Max sources to discover |
| max_pages | 50 | Max pages per source |
| auto_approve | false | Auto-approve above threshold |
| threshold | 0.85 | Approval threshold |
| max_iterations | 3 | Max QA loop iterations |
| export_format | project_files | Output format |


@@ -1,6 +1,87 @@
# Reference Curator - Complete Skill Set
This document contains all 7 skills for curating, processing, and exporting reference documentation.
---
# Pipeline Orchestrator (Recommended Entry Point)
Coordinates the full 6-skill workflow with automated QA loop handling.
## Quick Start
```
# Full pipeline from topic
curate references on "Claude Code best practices"
# From URLs (skip discovery)
curate these URLs: https://docs.anthropic.com/en/docs/prompt-caching
# With auto-approve
curate references on "MCP servers" with auto-approve and fine-tuning output
```
## Configuration Options
| Option | Default | Description |
|--------|---------|-------------|
| max_sources | 10 | Maximum sources to discover |
| max_pages | 50 | Maximum pages per source |
| auto_approve | false | Auto-approve above threshold |
| threshold | 0.85 | Approval threshold |
| max_iterations | 3 | Max QA loop iterations |
| export_format | project_files | Output format |
## Pipeline Flow
```
[Input: Topic | URLs | Manifest]
1. reference-discovery (skip if URLs/manifest)
2. web-crawler
3. content-repository
4. content-distiller ◄─────────────┐
│ │
▼ │
5. quality-reviewer │
│ │
├── APPROVE → export │
├── REFACTOR (max 3) ─────┤
├── DEEP_RESEARCH (max 2) → crawler
└── REJECT → archive
6. markdown-exporter
```
## QA Loop Handling
| Decision | Action | Max Iterations |
|----------|--------|----------------|
| APPROVE | Proceed to export | - |
| REFACTOR | Re-distill with feedback | 3 |
| DEEP_RESEARCH | Crawl more sources | 2 |
| REJECT | Archive with reason | - |
Documents exceeding iteration limits are marked `needs_manual_review`.
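The per-document limits above mirror the role of the `pipeline_iteration_tracker` table; an in-memory sketch (the class and method names are illustrative assumptions) could look like:

```python
class IterationTracker:
    """Tracks QA loop iterations per document, standing in for the
    pipeline_iteration_tracker table in this in-memory sketch."""

    LIMITS = {"REFACTOR": 3, "DEEP_RESEARCH": 2}

    def __init__(self):
        self._counts = {}  # (doc_id, loop_type) -> iterations used

    def record(self, doc_id, loop_type):
        """Record one iteration; return True while the document may
        loop again, False once its limit is exceeded (at which point
        it should be marked needs_manual_review)."""
        key = (doc_id, loop_type)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self.LIMITS[loop_type]
```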
## Output Summary
```
Pipeline Complete:
- Sources discovered: 5
- Pages crawled: 45
- Approved: 40
- Needs manual review: 2
- Exports: ~/reference-library/exports/
```
---
@@ -464,6 +545,7 @@ def add_cross_references(doc, all_docs):
| From | Output | To |
|------|--------|-----|
| **pipeline-orchestrator** | Coordinates all stages | All skills below |
| **reference-discovery** | URL manifest | web-crawler |
| **web-crawler** | Raw content + manifest | content-repository |
| **content-repository** | Document records | content-distiller |
@@ -471,3 +553,25 @@ def add_cross_references(doc, all_docs):
| **quality-reviewer** (approve) | Approved IDs | markdown-exporter |
| **quality-reviewer** (refactor) | Instructions | content-distiller |
| **quality-reviewer** (deep_research) | Queries | web-crawler |
## State Management
The pipeline orchestrator tracks state for resume capability:
**With Database:**
- `pipeline_runs` table tracks run status, current stage, statistics
- `pipeline_iteration_tracker` tracks QA loop iterations per document
**File-Based Fallback:**
```
~/reference-library/pipeline_state/run_XXX/
├── state.json # Current stage and stats
├── manifest.json # Discovered sources
└── review_log.json # QA decisions
```
## Resume Pipeline
To resume a paused or failed pipeline:
1. Provide the run_id or state file path
2. Pipeline continues from last successful checkpoint