feat(reference-curator): Add pipeline orchestrator and refactor skill format

Pipeline Orchestrator: - Add 07-pipeline-orchestrator skill with code/CLAUDE.md and desktop/SKILL.md - Add /reference-curator-pipeline slash command for full workflow automation - Add pipeline_runs and pipeline_iteration_tracker tables to schema.sql - Add v_pipeline_status and v_pipeline_iterations views - Add pipeline_config.yaml configuration template - Update AGENTS.md with Reference Curator Skills section - Update claude-project files with pipeline documentation Skill Format Refactoring: - Extract YAML frontmatter from SKILL.md files to separate skill.yaml - Add tools/ directories with MCP tool documentation - Update SKILL-FORMAT-REQUIREMENTS.md with new structure - Add migrate-skill-structure.py script for format conversion Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 01:01:02 +07:00
parent 243b9d851c
commit d1cd1298a8
91 changed files with 2475 additions and 281 deletions
--- a/custom-skills/90-reference-curator/07-pipeline-orchestrator/code/CLAUDE.md
+++ b/custom-skills/90-reference-curator/07-pipeline-orchestrator/code/CLAUDE.md
@@ -0,0 +1,296 @@
+# Pipeline Orchestrator
+
+Coordinates the full 6-skill reference curation workflow with QA loop handling.
+
+## Trigger Keywords
+"curate references", "full pipeline", "run curation", "reference-curator-pipeline"
+
+## Architecture
+
+```
+[Input] → discovery → crawler → repository → distiller ◄──┐
+                                                    │     │
+                                              reviewer     │
+                                                    │     │
+                    ┌───────────────────────────────┼─────┤
+                    ▼           ▼                   ▼     │
+                 APPROVE     REJECT          REFACTOR ────┤
+                    │           │                         │
+                    ▼           ▼                DEEP_RESEARCH
+                 export      archive                 │
+                                                     ▼
+                                                  crawler ─┘
+```
+
+## Input Detection
+
+Parse input to determine mode:
+
+```python
+def detect_input_mode(input_value):
+    if input_value.endswith('.json') and os.path.exists(input_value):
+        return 'manifest'
+    elif input_value.startswith('http://') or input_value.startswith('https://'):
+        return 'urls'
+    else:
+        return 'topic'
+```
+
+## Pipeline Execution
+
+### Stage 1: Reference Discovery (Topic Mode Only)
+
+```bash
+# Skip if input mode is 'urls' or 'manifest'
+if mode == 'topic':
+    /reference-discovery "$TOPIC" --max-sources $MAX_SOURCES
+    # Output: manifest.json
+```
+
+### Stage 2: Web Crawler
+
+```bash
+# From manifest or URLs
+/web-crawler $INPUT --max-pages $MAX_PAGES
+# Output: crawled files in ~/reference-library/raw/
+```
+
+### Stage 3: Content Repository
+
+```bash
+/content-repository store
+# Output: documents stored in MySQL or file-based storage
+```
+
+### Stage 4: Content Distiller
+
+```bash
+/content-distiller all-pending
+# Output: distilled content records
+```
+
+### Stage 5: Quality Reviewer
+
+```bash
+if auto_approve:
+    /quality-reviewer all-pending --auto-approve --threshold $THRESHOLD
+else:
+    /quality-reviewer all-pending
+```
+
+Handle QA decisions:
+- **APPROVE**: Add to export queue
+- **REFACTOR**: Re-run distiller with feedback (track iteration count)
+- **DEEP_RESEARCH**: Run crawler for additional sources, then distill
+- **REJECT**: Archive with reason
+
+### Stage 6: Markdown Exporter
+
+```bash
+/markdown-exporter $EXPORT_FORMAT
+# Output: files in ~/reference-library/exports/
+```
+
+## State Management
+
+### Initialize Pipeline State
+
+```python
+def init_pipeline_state(run_id, input_value, options):
+    state = {
+        "run_id": run_id,
+        "run_type": detect_input_mode(input_value),
+        "input_value": input_value,
+        "status": "running",
+        "current_stage": "discovery",
+        "options": options,
+        "stats": {
+            "sources_discovered": 0,
+            "pages_crawled": 0,
+            "documents_stored": 0,
+            "documents_distilled": 0,
+            "approved": 0,
+            "refactored": 0,
+            "deep_researched": 0,
+            "rejected": 0,
+            "needs_manual_review": 0
+        },
+        "started_at": datetime.now().isoformat()
+    }
+    save_state(run_id, state)
+    return state
+```
+
+### MySQL State (Preferred)
+
+```sql
+INSERT INTO pipeline_runs (run_type, input_value, options)
+VALUES ('topic', 'Claude system prompts', '{"max_sources": 10}');
+```
+
+### File-Based Fallback
+
+```
+~/reference-library/pipeline_state/run_XXX/
+├── state.json           # Current stage and stats
+├── manifest.json        # Discovered sources
+├── crawl_results.json   # Crawled document paths
+├── review_log.json      # QA decisions per document
+└── errors.log           # Any errors encountered
+```
+
+## QA Loop Logic
+
+```python
+MAX_REFACTOR_ITERATIONS = 3
+MAX_DEEP_RESEARCH_ITERATIONS = 2
+MAX_TOTAL_ITERATIONS = 5
+
+def handle_qa_decision(doc_id, decision, iteration_counts):
+    refactor_count = iteration_counts.get('refactor', 0)
+    research_count = iteration_counts.get('deep_research', 0)
+    total = refactor_count + research_count
+
+    if total >= MAX_TOTAL_ITERATIONS:
+        return 'needs_manual_review'
+
+    if decision == 'refactor':
+        if refactor_count >= MAX_REFACTOR_ITERATIONS:
+            return 'needs_manual_review'
+        iteration_counts['refactor'] = refactor_count + 1
+        return 're_distill'
+
+    if decision == 'deep_research':
+        if research_count >= MAX_DEEP_RESEARCH_ITERATIONS:
+            return 'needs_manual_review'
+        iteration_counts['deep_research'] = research_count + 1
+        return 're_crawl_and_distill'
+
+    return decision  # approve or reject
+```
+
+## Checkpoint Strategy
+
+Save checkpoint after each stage completes:
+
+| Stage | Checkpoint | Resume Point |
+|-------|------------|--------------|
+| discovery | `manifest.json` created | → crawler |
+| crawl | `crawl_results.json` | → repository |
+| store | DB records or file list | → distiller |
+| distill | distilled_content records | → reviewer |
+| review | review_logs records | → exporter or loop |
+| export | final export complete | Done |
+
+## Progress Reporting
+
+Report progress to user at key checkpoints:
+
+```
+[Pipeline] Stage 1/6: Discovery - Found 8 sources
+[Pipeline] Stage 2/6: Crawling - 45/50 pages complete
+[Pipeline] Stage 3/6: Storing - 45 documents saved
+[Pipeline] Stage 4/6: Distilling - 45 documents processed
+[Pipeline] Stage 5/6: Reviewing - 40 approved, 3 refactored, 2 rejected
+[Pipeline] Stage 6/6: Exporting - 40 documents exported
+[Pipeline] Complete! See ~/reference-library/exports/
+```
+
+## Error Handling
+
+```python
+def handle_stage_error(stage, error, state):
+    state['status'] = 'paused'
+    state['error_message'] = str(error)
+    state['error_stage'] = stage
+    save_state(state['run_id'], state)
+
+    # Log to errors.log
+    log_error(state['run_id'], stage, error)
+
+    # Report to user
+    return f"Pipeline paused at {stage}: {error}. Resume with run_id {state['run_id']}"
+```
+
+## Resume Pipeline
+
+```python
+def resume_pipeline(run_id):
+    state = load_state(run_id)
+
+    if state['status'] != 'paused':
+        return f"Pipeline {run_id} is {state['status']}, cannot resume"
+
+    stage = state['current_stage']
+    state['status'] = 'running'
+    state['error_message'] = None
+    save_state(run_id, state)
+
+    # Resume from failed stage
+    return execute_from_stage(stage, state)
+```
+
+## Output Summary
+
+On completion, generate summary:
+
+```json
+{
+  "run_id": 123,
+  "status": "completed",
+  "duration_minutes": 15,
+  "stats": {
+    "sources_discovered": 5,
+    "pages_crawled": 45,
+    "documents_stored": 45,
+    "documents_distilled": 45,
+    "approved": 40,
+    "refactored": 8,
+    "deep_researched": 2,
+    "rejected": 3,
+    "needs_manual_review": 2
+  },
+  "exports": {
+    "format": "project_files",
+    "path": "~/reference-library/exports/",
+    "document_count": 40
+  },
+  "errors": []
+}
+```
+
+## Integration Points
+
+| Skill | Called By | Provides |
+|-------|-----------|----------|
+| reference-discovery | Orchestrator | manifest.json |
+| web-crawler | Orchestrator | Raw crawled files |
+| content-repository | Orchestrator | Stored documents |
+| content-distiller | Orchestrator, QA loop | Distilled content |
+| quality-reviewer | Orchestrator | QA decisions |
+| markdown-exporter | Orchestrator | Final exports |
+
+## Configuration
+
+Read from `~/.config/reference-curator/pipeline_config.yaml`:
+
+```yaml
+pipeline:
+  max_sources: 10
+  max_pages: 50
+  auto_approve: false
+  approval_threshold: 0.85
+
+qa_loop:
+  max_refactor_iterations: 3
+  max_deep_research_iterations: 2
+  max_total_iterations: 5
+
+export:
+  default_format: project_files
+  include_rejected: false
+
+state:
+  backend: mysql  # or 'file'
+  state_directory: ~/reference-library/pipeline_state/
+```
--- a/custom-skills/90-reference-curator/07-pipeline-orchestrator/desktop/SKILL.md
+++ b/custom-skills/90-reference-curator/07-pipeline-orchestrator/desktop/SKILL.md
@@ -0,0 +1,279 @@
+# Pipeline Orchestrator
+
+Coordinates the full reference curation workflow, handling QA loops and state management.
+
+## Pipeline Architecture
+
+```
+[Input: Topic | URLs | Manifest]
+          │
+          ▼
+1. reference-discovery ──────────────────┐
+   (skip if URLs/manifest)               │
+          │                              │
+          ▼                              │
+2. web-crawler-orchestrator              │
+          │                              │
+          ▼                              │
+3. content-repository                    │
+          │                              │
+          ▼                              │
+4. content-distiller ◄───────────────────┤
+          │                              │
+          ▼                              │
+5. quality-reviewer                      │
+          │                              │
+    ┌─────┼─────┬────────────────┐       │
+    ▼     ▼     ▼                ▼       │
+ APPROVE REJECT REFACTOR    DEEP_RESEARCH│
+    │     │        │             │       │
+    │     │        └─────────────┤       │
+    │     │                      └───────┘
+    ▼     ▼
+6. markdown-exporter   archive
+          │
+          ▼
+      [Complete]
+```
+
+## Input Modes
+
+| Mode | Example Input | Pipeline Start |
+|------|--------------|----------------|
+| **Topic** | `"Claude system prompts"` | Stage 1 (discovery) |
+| **URLs** | `["https://docs.anthropic.com/..."]` | Stage 2 (crawler) |
+| **Manifest** | Path to `manifest.json` | Stage 2 (crawler) |
+
+## Configuration Options
+
+```yaml
+pipeline:
+  max_sources: 10        # Discovery limit
+  max_pages: 50          # Pages per source
+  auto_approve: false    # Auto-approve above threshold
+  approval_threshold: 0.85
+
+qa_loop:
+  max_refactor_iterations: 3
+  max_deep_research_iterations: 2
+  max_total_iterations: 5
+
+export:
+  format: project_files  # or fine_tuning, jsonl
+```
+
+## Pipeline Execution
+
+### Stage 1: Reference Discovery
+
+For topic-based input, search and validate authoritative sources:
+
+```python
+def run_discovery(topic, max_sources=10):
+    # Uses WebSearch to find sources
+    # Validates credibility
+    # Outputs manifest.json with source URLs
+    sources = search_authoritative_sources(topic, max_sources)
+    validate_and_rank_sources(sources)
+    write_manifest(sources)
+    return manifest_path
+```
+
+### Stage 2: Web Crawler
+
+Crawl URLs from manifest or direct input:
+
+```python
+def run_crawler(input_source, max_pages=50):
+    # Selects optimal crawler backend
+    # Respects rate limits
+    # Stores raw content
+    urls = load_urls(input_source)
+    for url in urls:
+        crawl_with_best_backend(url, max_pages)
+    return crawl_results
+```
+
+### Stage 3: Content Repository
+
+Store crawled content with deduplication:
+
+```python
+def run_repository(crawl_results):
+    # Deduplicates by URL hash
+    # Tracks versions
+    # Returns stored doc IDs
+    for result in crawl_results:
+        store_document(result)
+    return stored_doc_ids
+```
+
+### Stage 4: Content Distiller
+
+Process raw content into structured summaries:
+
+```python
+def run_distiller(doc_ids, refactor_instructions=None):
+    # Extracts key concepts
+    # Generates summaries
+    # Creates structured markdown
+    for doc_id in doc_ids:
+        distill_document(doc_id, instructions=refactor_instructions)
+    return distilled_ids
+```
+
+### Stage 5: Quality Reviewer
+
+Score and route content based on quality:
+
+```python
+def run_reviewer(distilled_ids, auto_approve=False, threshold=0.85):
+    decisions = {}
+    for distill_id in distilled_ids:
+        score, assessment = score_content(distill_id)
+
+        if auto_approve and score >= threshold:
+            decisions[distill_id] = ('approve', None)
+        elif score >= 0.85:
+            decisions[distill_id] = ('approve', None)
+        elif score >= 0.60:
+            instructions = generate_feedback(assessment)
+            decisions[distill_id] = ('refactor', instructions)
+        elif score >= 0.40:
+            queries = generate_research_queries(assessment)
+            decisions[distill_id] = ('deep_research', queries)
+        else:
+            decisions[distill_id] = ('reject', assessment)
+
+    return decisions
+```
+
+### Stage 6: Markdown Exporter
+
+Export approved content:
+
+```python
+def run_exporter(approved_ids, format='project_files'):
+    # Organizes by topic
+    # Generates INDEX.md
+    # Creates cross-references
+    export_documents(approved_ids, format=format)
+    return export_path
+```
+
+## QA Loop Handling
+
+```python
+def handle_qa_loop(distill_id, decision, iteration_tracker):
+    counts = iteration_tracker.get(distill_id, {'refactor': 0, 'deep_research': 0})
+
+    if decision == 'refactor':
+        if counts['refactor'] >= MAX_REFACTOR:
+            return 'needs_manual_review'
+        counts['refactor'] += 1
+        iteration_tracker[distill_id] = counts
+        return 're_distill'
+
+    if decision == 'deep_research':
+        if counts['deep_research'] >= MAX_DEEP_RESEARCH:
+            return 'needs_manual_review'
+        counts['deep_research'] += 1
+        iteration_tracker[distill_id] = counts
+        return 're_crawl'
+
+    return decision
+```
+
+## State Management
+
+### MySQL Backend (Preferred)
+
+```sql
+SELECT run_id, status, current_stage, stats
+FROM pipeline_runs
+WHERE run_id = ?;
+```
+
+### File-Based Fallback
+
+```
+~/reference-library/pipeline_state/
+├── run_001/
+│   ├── state.json       # Pipeline state
+│   ├── manifest.json    # Discovered sources
+│   ├── crawl_results.json
+│   └── review_log.json  # QA decisions
+```
+
+State JSON format:
+```json
+{
+  "run_id": "run_001",
+  "run_type": "topic",
+  "input_value": "Claude system prompts",
+  "status": "running",
+  "current_stage": "distilling",
+  "stats": {
+    "sources_discovered": 5,
+    "pages_crawled": 45,
+    "approved": 0,
+    "refactored": 0
+  },
+  "started_at": "2026-01-29T10:00:00Z"
+}
+```
+
+## Checkpointing
+
+Checkpoint after each stage to enable resume:
+
+| Checkpoint | Trigger | Resume From |
+|------------|---------|-------------|
+| `discovery_complete` | Manifest saved | → crawler |
+| `crawl_complete` | All pages crawled | → repository |
+| `store_complete` | Docs in database | → distiller |
+| `distill_complete` | Content processed | → reviewer |
+| `review_complete` | Decisions logged | → exporter |
+| `export_complete` | Files generated | Done |
+
+## Output Summary
+
+```json
+{
+  "run_id": 123,
+  "status": "completed",
+  "duration_minutes": 15,
+  "stats": {
+    "sources_discovered": 5,
+    "pages_crawled": 45,
+    "documents_stored": 45,
+    "documents_distilled": 45,
+    "approved": 40,
+    "refactored": 8,
+    "deep_researched": 2,
+    "rejected": 3,
+    "needs_manual_review": 2
+  },
+  "exports": {
+    "format": "project_files",
+    "path": "~/reference-library/exports/",
+    "document_count": 40
+  }
+}
+```
+
+## Error Handling
+
+On stage failure:
+1. Save checkpoint with error state
+2. Log error details
+3. Report to user with resume instructions
+
+```python
+try:
+    run_stage(stage_name)
+    save_checkpoint(stage_name, 'complete')
+except Exception as e:
+    save_checkpoint(stage_name, 'failed', error=str(e))
+    report_error(f"Pipeline paused at {stage_name}: {e}")
+```
--- a/custom-skills/90-reference-curator/07-pipeline-orchestrator/desktop/skill.yaml
+++ b/custom-skills/90-reference-curator/07-pipeline-orchestrator/desktop/skill.yaml
@@ -0,0 +1,9 @@
+# Skill metadata (extracted from SKILL.md frontmatter)
+
+name: pipeline-orchestrator
+description: |
+  Orchestrates the full 6-skill reference curation pipeline as a background task. Coordinates discovery → crawl → store → distill → review → export with QA loop handling. Triggers on "curate references", "run full pipeline", "reference pipeline", "automate curation".
+
+# Optional fields
+
+# triggers: []  # TODO: Extract from description