feat(reference-curator): Add Claude.ai Projects export format

Add claude-project/ folder with skill files formatted for upload to
Claude.ai Projects (web interface):
- reference-curator-complete.md: All 6 skills consolidated
- INDEX.md: Overview and workflow documentation
- Individual skill files (01-06) without YAML frontmatter

Add --claude-ai option to install.sh:
- Lists available files for upload
- Optionally copies to custom destination directory
- Provides upload instructions for Claude.ai

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:33:06 +07:00
parent 8762f68e6e
commit 243b9d851c
10 changed files with 1987 additions and 0 deletions
--- a/custom-skills/90-reference-curator/claude-project/01-reference-discovery.md
+++ b/custom-skills/90-reference-curator/claude-project/01-reference-discovery.md
@@ -0,0 +1,184 @@
+
+# Reference Discovery
+
+Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.
+
+## Source Priority Hierarchy
+
+| Tier | Source Type | Examples |
+|------|-------------|----------|
+| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
+| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
+| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
+| **Tier 2** | Research papers | arxiv.org, papers with citations |
+| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
+| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |
+
+## Discovery Workflow
+
+### Step 1: Define Search Scope
+
+```python
+search_config = {
+    "topic": "prompt engineering",
+    "vendors": ["anthropic", "openai", "google"],
+    "source_types": ["official_docs", "engineering_blog", "github_repo"],
+    "freshness": "past_year",  # past_week, past_month, past_year, any
+    "max_results_per_query": 20
+}
+```
+
+### Step 2: Generate Search Queries
+
+For a given topic, generate targeted queries:
+
+```python
+def generate_queries(topic, vendors):
+    queries = []
+    
+    # Official documentation queries
+    for vendor in vendors:
+        queries.append(f"site:docs.{vendor}.com {topic}")
+        queries.append(f"site:{vendor}.com/docs {topic}")
+    
+    # Engineering blog queries
+    for vendor in vendors:
+        queries.append(f"site:{vendor}.com/blog {topic}")
+        queries.append(f"site:{vendor}.com/news {topic}")
+    
+    # GitHub queries
+    for vendor in vendors:
+        queries.append(f"site:github.com/{vendor} {topic}")
+    
+    # Research queries
+    queries.append(f"site:arxiv.org {topic}")
+    
+    return queries
+```
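The `docs.{vendor}.com` and `github.com/{vendor}` patterns above do not match every host listed under Known Authoritative Sources (for example `platform.openai.com/docs` or `github.com/anthropics`). A minimal sketch of a table-driven alternative follows. The `KNOWN_SOURCES` mapping and the `generate_queries_from_table` name are illustrative assumptions, not part of the skill; the hosts are copied from that table.

```python
# Hypothetical mapping; hosts copied from the "Known Authoritative Sources" table.
KNOWN_SOURCES = {
    "anthropic": {
        "docs": ["docs.anthropic.com", "docs.claude.com"],
        "blog": ["anthropic.com/news"],
        "github": ["github.com/anthropics"],
    },
    "openai": {
        "docs": ["platform.openai.com/docs"],
        "blog": ["openai.com/blog"],
        "github": ["github.com/openai"],
    },
    "google": {
        "docs": ["ai.google.dev/docs"],
        "blog": ["blog.google/technology/ai"],
        "github": ["github.com/google"],
    },
}


def generate_queries_from_table(topic, vendors, sources=KNOWN_SOURCES):
    """Build one site-restricted query per known host for each vendor."""
    queries = []
    for vendor in vendors:
        # Vendors missing from the table are skipped rather than guessed.
        for hosts in sources.get(vendor, {}).values():
            for host in hosts:
                queries.append(f"site:{host} {topic}")
    # Research query, as in generate_queries above
    queries.append(f"site:arxiv.org {topic}")
    return queries


queries = generate_queries_from_table("prompt engineering", ["anthropic", "openai"])
```

Keeping the hosts in one mapping means a new vendor only needs a new entry, and no query is issued against a host that was never validated.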
+
+### Step 3: Execute Search
+
+Use the web search tool for each query:
+
+```python
+def execute_discovery(queries):
+    results = []
+    for query in queries:
+        search_results = web_search(query)
+        for result in search_results:
+            results.append({
+                "url": result.url,
+                "title": result.title,
+                "snippet": result.snippet,
+                "query_used": query
+            })
+    return deduplicate_by_url(results)
+```
+
+### Step 4: Validate and Score Sources
+
+```python
+def score_source(url, title):
+    score = 0.0
+    
+    # Domain credibility
+    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'platform.openai.com/docs']):
+        score += 0.40  # Tier 1 official docs
+    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
+        score += 0.30  # Tier 1 official blog/news
+    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
+        score += 0.30  # Tier 1 official repos
+    elif 'arxiv.org' in url:
+        score += 0.20  # Tier 2 research
+    else:
+        score += 0.10  # Tier 3 community
+    
+    # Freshness signals (from title only; the snippet is not passed in)
+    if any(year in title for year in ['2025', '2024']):
+        score += 0.20
+    elif any(year in title for year in ['2023']):
+        score += 0.10
+    
+    # Relevance signals
+    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
+        score += 0.15
+    
+    return min(score, 1.0)
+
+def assign_credibility_tier(score):
+    if score >= 0.60:
+        return 'tier1_official'
+    elif score >= 0.40:
+        return 'tier2_verified'
+    else:
+        return 'tier3_community'
+```
+
+### Step 5: Output URL Manifest
+
+```python
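+from datetime import datetime
+
+def create_manifest(scored_results, topic):
+    """Build the manifest; each result must already carry "score" and "tier" from Step 4."""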
+    manifest = {
+        "discovery_date": datetime.now().isoformat(),
+        "topic": topic,
+        "total_urls": len(scored_results),
+        "urls": []
+    }
+    
+    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
+        manifest["urls"].append({
+            "url": result["url"],
+            "title": result["title"],
+            "credibility_tier": result["tier"],
+            "credibility_score": result["score"],
+            "source_type": infer_source_type(result["url"]),
+            "vendor": infer_vendor(result["url"])
+        })
+    
+    return manifest
+```
+
+## Output Format
+
+Discovery produces a JSON manifest for the crawler:
+
+```json
+{
+  "discovery_date": "2025-01-28T10:30:00",
+  "topic": "prompt engineering",
+  "total_urls": 15,
+  "urls": [
+    {
+      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
+      "title": "Prompt Engineering Guide",
+      "credibility_tier": "tier1_official",
+      "credibility_score": 0.85,
+      "source_type": "official_docs",
+      "vendor": "anthropic"
+    }
+  ]
+}
+```
+
+## Known Authoritative Sources
+
+Pre-validated sources for common topics:
+
+| Vendor | Documentation | Blog/News | GitHub |
+|--------|--------------|-----------|--------|
+| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
+| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
+| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |
+
+## Integration
+
+**Output:** URL manifest JSON → `web-crawler-orchestrator`
+
+**Database:** Register new sources in `sources` table via `content-repository`
+
+## Deduplication
+
+Before outputting, deduplicate URLs:
+- Normalize URLs (remove trailing slashes, query params)
+- Check against existing `documents` table via `content-repository`
+- Merge duplicate entries, keeping highest credibility score
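
The normalization and merge steps above can be sketched as follows. This is one possible reading of `deduplicate_by_url` (called in Step 3 but not defined there); the `normalize_url` helper is a hypothetical name, and the check against the `documents` table is left out because it depends on `content-repository`.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Drop the query string, fragment, and trailing slash; lowercase scheme and host."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def deduplicate_by_url(results):
    """Keep one entry per normalized URL, preferring the highest score."""
    best = {}
    for result in results:
        key = normalize_url(result["url"])
        current = best.get(key)
        # Entries that have not been scored yet are treated as 0.0.
        if current is None or result.get("score", 0.0) > current.get("score", 0.0):
            best[key] = {**result, "url": key}
    return list(best.values())


results = [
    {"url": "https://docs.anthropic.com/en/docs/", "score": 0.55},
    {"url": "https://docs.anthropic.com/en/docs?ref=search", "score": 0.75},
    {"url": "https://example.com/blog/post", "score": 0.10},
]
unique = deduplicate_by_url(results)
```

Stripping every query parameter is the simplest rule, but it would also merge pages whose content is selected by a parameter, so a real implementation may want an allow-list of parameters to keep.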
--- a/custom-skills/90-reference-curator/claude-project/02-web-crawler.md
+++ b/custom-skills/90-reference-curator/claude-project/02-web-crawler.md
@@ -0,0 +1,230 @@
+
+# Web Crawler Orchestrator
+
+Manages crawling operations using Firecrawl MCP with rate limiting and format handling.
+
+## Prerequisites
+
+- Firecrawl MCP server connected
+- Config file at `~/.config/reference-curator/crawl_config.yaml`
+- Storage directory exists: `~/reference-library/raw/`
+
+## Crawl Configuration
+
+```yaml
+# ~/.config/reference-curator/crawl_config.yaml
+firecrawl:
+  rate_limit:
+    requests_per_minute: 20
+    concurrent_requests: 3
+  default_options:
+    timeout: 30000
+    only_main_content: true
+    include_html: false
+
+processing:
+  max_content_size_mb: 50
+  raw_content_dir: ~/reference-library/raw/
+```
+
+## Crawl Workflow
+
+### Step 1: Load URL Manifest
+
+Receive manifest from `reference-discovery`:
+
+```python
+import json
+
+def load_manifest(manifest_path):
+    with open(manifest_path) as f:
+        manifest = json.load(f)
+    return manifest["urls"]
+```
+
+### Step 2: Determine Crawl Strategy
+
+```python
+def select_strategy(url):
+    """Select optimal crawl strategy based on URL characteristics."""
+    
+    if url.endswith('.pdf'):
+        return 'pdf_extract'
+    elif 'github.com' in url and '/blob/' in url:
+        return 'raw_content'  # Get raw file content
+    elif 'github.com' in url:
+        return 'scrape'  # Repository pages
+    elif any(d in url for d in ['docs.', 'documentation']):
+        return 'scrape'  # Documentation sites
+    else:
+        return 'scrape'  # Default
+```
+
+### Step 3: Execute Firecrawl
+
+Use Firecrawl MCP for crawling:
+
+```python
+# Single page scrape
+firecrawl_scrape(
+    url="https://docs.anthropic.com/en/docs/prompt-engineering",
+    formats=["markdown"],  # markdown | html | screenshot
+    only_main_content=True,
+    timeout=30000
+)
+
+# Multi-page crawl (documentation sites)
+firecrawl_crawl(
+    url="https://docs.anthropic.com/en/docs/",
+    max_depth=2,
+    limit=50,
+    formats=["markdown"],
+    only_main_content=True
+)
+```
+
+### Step 4: Rate Limiting
+
+```python
+import time
+from collections import deque
+
+class RateLimiter:
+    def __init__(self, requests_per_minute=20):
+        self.rpm = requests_per_minute
+        self.request_times = deque()
+    
+    def wait_if_needed(self):
+        now = time.time()
+        # Remove requests older than 1 minute
+        while self.request_times and now - self.request_times[0] > 60:
+            self.request_times.popleft()
+        
+        if len(self.request_times) >= self.rpm:
+            wait_time = 60 - (now - self.request_times[0])
+            if wait_time > 0:
+                time.sleep(wait_time)
+        
+        self.request_times.append(time.time())
+```
+
+### Step 5: Save Raw Content
+
+```python
+import hashlib
+from datetime import datetime
+from pathlib import Path
+
+def save_content(url, content, content_type='markdown'):
+    """Save crawled content to raw storage."""
+    
+    # Generate filename from URL hash
+    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
+    
+    # Determine extension
+    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
+    ext = ext_map.get(content_type, '.txt')
+    
+    # Create dated subdirectory
+    date_dir = datetime.now().strftime('%Y/%m')
+    output_dir = Path.home() / 'reference-library/raw' / date_dir
+    output_dir.mkdir(parents=True, exist_ok=True)
+    
+    # Save file
+    filepath = output_dir / f"{url_hash}{ext}"
+    if content_type == 'pdf':
+        filepath.write_bytes(content)
+    else:
+        filepath.write_text(content, encoding='utf-8')
+    
+    return str(filepath)
+```
+
+### Step 6: Generate Crawl Manifest
+
+```python
+from datetime import datetime
+
+def create_crawl_manifest(results):
+    manifest = {
+        "crawl_date": datetime.now().isoformat(),
+        "total_crawled": len([r for r in results if r["status"] == "success"]),
+        "total_failed": len([r for r in results if r["status"] == "failed"]),
+        "documents": []
+    }
+    
+    for result in results:
+        manifest["documents"].append({
+            "url": result["url"],
+            "status": result["status"],
+            "raw_content_path": result.get("filepath"),
+            "content_size": result.get("size"),
+            "crawl_method": "firecrawl",
+            "error": result.get("error")
+        })
+    
+    return manifest
+```
+
+## Error Handling
+
+| Error | Action |
+|-------|--------|
+| Timeout | Retry once with 2x timeout |
+| Rate limit (429) | Exponential backoff, max 3 retries |
+| Not found (404) | Log and skip |
+| Access denied (403) | Log, mark as `failed` |
+| Connection error | Retry with backoff |
+
+```python
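+import time
+
+def crawl_with_retry(url, max_retries=3, timeout=30000):
+    """Scrape a URL, following the retry policy in the table above.
+
+    NotFoundError and AccessDeniedError are placeholders for whatever
+    exceptions the Firecrawl client in use actually raises.
+    """
+    timeout_retried = False
+    for attempt in range(max_retries):
+        try:
+            result = firecrawl_scrape(url, timeout=timeout)
+            return {"status": "success", "content": result}
+        except NotFoundError:
+            return {"status": "failed", "error": "404 Not Found"}
+        except AccessDeniedError:
+            return {"status": "failed", "error": "403 Access Denied"}
+        except TimeoutError:
+            if timeout_retried:
+                return {"status": "failed", "error": "Timeout"}
+            # Retry once with doubled timeout
+            timeout *= 2
+            timeout_retried = True
+        except Exception as e:
+            # Rate limits (429) and connection errors: exponential backoff
+            if attempt == max_retries - 1:
+                return {"status": "failed", "error": str(e)}
+            time.sleep(2 ** attempt * 10)  # 10, then 20 seconds
+    
+    return {"status": "failed", "error": "Max retries exceeded"}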
+```
+
+## Firecrawl MCP Reference
+
+**scrape** - Single page:
+```
+firecrawl_scrape(url, formats, only_main_content, timeout)
+```
+
+**crawl** - Multi-page:
+```
+firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
+```
+
+**map** - Discover URLs:
+```
+firecrawl_map(url, limit)  # Returns list of URLs on site
+```
+
+## Integration
+
+| From | Input | To |
+|------|-------|-----|
+| reference-discovery | URL manifest | web-crawler-orchestrator |
+| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
+| quality-reviewer (deep_research) | Additional queries | reference-discovery → web-crawler-orchestrator |
+
+## Output Structure
+
+```
+~/reference-library/raw/
+└── 2025/01/
+    ├── a1b2c3d4e5f6a7b8.md   # Markdown content (16 hex chars of the URL's SHA-256)
+    ├── b2c3d4e5f6a7b8c9.md
+    └── c3d4e5f6a7b8c9d0.pdf  # PDF documents
+```
--- a/custom-skills/90-reference-curator/claude-project/03-content-repository.md
+++ b/custom-skills/90-reference-curator/claude-project/03-content-repository.md
@@ -0,0 +1,158 @@
+
+# Content Repository
+
+Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.
+
+## Prerequisites
+
+- MySQL 8.0+ with utf8mb4 charset
+- Config file at `~/.config/reference-curator/db_config.yaml`
+- Database `reference_library` initialized with schema
+
+## Quick Reference
+
+### Connection Setup
+
+```python
+import yaml
+import os
+from pathlib import Path
+
+def get_db_config():
+    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
+    with open(config_path) as f:
+        config = yaml.safe_load(f)
+    
+    # Resolve environment variables
+    mysql = config['mysql']
+    return {
+        'host': mysql['host'],
+        'port': mysql['port'],
+        'database': mysql['database'],
+        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
+        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
+        'charset': mysql['charset']
+    }
+```
+
+### Core Operations
+
+**Store New Document:**
+```python
+def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
+    sql = """
+    INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
+    VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
+    ON DUPLICATE KEY UPDATE 
+        version = version + 1,
+        previous_version_id = doc_id,
+        crawl_date = NOW(),
+        raw_content_path = VALUES(raw_content_path)
+    """
+    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
+    return cursor.lastrowid
+```
+
+**Check Duplicate:**
+```python
+def is_duplicate(cursor, url):
+    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
+    return cursor.fetchone() is not None
+```
+
+**Get Documents by Topic:**
+```python
+def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
+    sql = """
+    SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
+    FROM documents d
+    JOIN document_topics dt ON d.doc_id = dt.doc_id
+    JOIN topics t ON dt.topic_id = t.topic_id
+    LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
+    WHERE t.topic_slug = %s 
+    AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
+    AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
+    ORDER BY dt.relevance_score DESC
+    """
+    cursor.execute(sql, (topic_slug, min_quality))
+    return cursor.fetchall()
+```
+
+## Table Quick Reference
+
+| Table | Purpose | Key Fields |
+|-------|---------|------------|
+| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
+| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
+| `distilled_content` | Processed summaries | review_status, compression_ratio |
+| `review_logs` | QA decisions | quality_score, decision, refactor_instructions |
+| `topics` | Taxonomy | topic_slug, parent_topic_id |
+| `document_topics` | Many-to-many linking | relevance_score |
+| `export_jobs` | Export tracking | export_type, output_format, status |
+
+## Status Values
+
+**crawl_status:** `pending` → `completed` | `failed` | `stale`
+
+**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`
+
+**decision (review):** `approve` | `refactor` | `deep_research` | `reject`
+
+## Common Queries
+
+### Find Stale Documents (need re-crawl)
+```sql
+SELECT d.doc_id, d.title, d.url, d.crawl_date
+FROM documents d
+JOIN crawl_schedule cs ON d.source_id = cs.source_id
+WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL 
+    CASE cs.frequency 
+        WHEN 'daily' THEN 1 
+        WHEN 'weekly' THEN 7 
+        WHEN 'biweekly' THEN 14 
+        WHEN 'monthly' THEN 30 
+    END DAY)
+AND cs.is_enabled = TRUE;
+```
+
+### Get Pending Reviews
+```sql
+SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
+FROM distilled_content dc
+JOIN documents d ON dc.doc_id = d.doc_id
+WHERE dc.review_status = 'pending'
+ORDER BY dc.distill_date ASC;
+```
+
+### Export-Ready Content
+```sql
+SELECT d.title, d.url, dc.structured_content, t.topic_slug
+FROM documents d
+JOIN distilled_content dc ON d.doc_id = dc.doc_id
+JOIN document_topics dt ON d.doc_id = dt.doc_id
+JOIN topics t ON dt.topic_id = t.topic_id
+JOIN review_logs rl ON dc.distill_id = rl.distill_id
+WHERE rl.decision = 'approve'
+AND rl.quality_score >= 0.85
+ORDER BY t.topic_slug, dt.relevance_score DESC;
+```
+
+## Workflow Integration
+
+1. **From web-crawler-orchestrator:** Receive URL + raw content path → `store_document()`
+2. **To content-distiller:** Query pending documents → send for processing
+3. **From quality-reviewer:** Update `review_status` based on decision
+4. **To markdown-exporter:** Query approved content by topic
+
+## Error Handling
+
+- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
+- **Missing source_id:** Validate against `sources` table before insert
+- **Connection failure:** Implement retry with exponential backoff
+
+## Full Schema Reference
+
+See `references/schema.sql` for complete table definitions including indexes and constraints.
+
+## Config File Template
+
+See `references/db_config_template.yaml` for connection configuration template.
--- a/custom-skills/90-reference-curator/claude-project/04-content-distiller.md
+++ b/custom-skills/90-reference-curator/claude-project/04-content-distiller.md
@@ -0,0 +1,234 @@
+
+# Content Distiller
+
+Transforms raw crawled content into structured, high-quality reference materials.
+
+## Distillation Goals
+
+1. **Compress** - Reduce token count while preserving essential information
+2. **Structure** - Organize content for easy retrieval and reference
+3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
+4. **Annotate** - Add metadata for searchability and categorization
+
+## Distillation Workflow
+
+### Step 1: Load Raw Content
+
+```python
+def load_for_distillation(cursor):
+    """Get documents ready for distillation."""
+    sql = """
+    SELECT d.doc_id, d.title, d.url, d.raw_content_path, 
+           d.doc_type, s.source_type, s.credibility_tier
+    FROM documents d
+    JOIN sources s ON d.source_id = s.source_id
+    LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
+    WHERE d.crawl_status = 'completed'
+    AND dc.distill_id IS NULL
+    ORDER BY s.credibility_tier ASC
+    """
+    cursor.execute(sql)
+    return cursor.fetchall()
+```
+
+### Step 2: Analyze Content Structure
+
+Identify content type and select appropriate distillation strategy:
+
+```python
+import re
+
+def analyze_structure(content, doc_type):
+    """Analyze document structure for distillation."""
+    analysis = {
+        "has_code_blocks": bool(re.findall(r'```[\s\S]*?```', content)),
+        "has_headers": bool(re.findall(r'^#+\s', content, re.MULTILINE)),
+        "has_lists": bool(re.findall(r'^\s*[-*]\s', content, re.MULTILINE)),
+        "has_tables": bool(re.findall(r'\|.*\|', content)),
+        "estimated_tokens": len(content.split()) * 1.3,  # Rough estimate
+        "section_count": len(re.findall(r'^#+\s', content, re.MULTILINE))
+    }
+    return analysis
+```
+
+### Step 3: Extract Key Components
+
+**Extract Code Snippets:**
+```python
+def extract_code_snippets(content):
+    """Extract all code blocks with language tags."""
+    pattern = r'```(\w*)\n([\s\S]*?)```'
+    snippets = []
+    for match in re.finditer(pattern, content):
+        snippets.append({
+            "language": match.group(1) or "text",
+            "code": match.group(2).strip(),
+            "context": get_surrounding_text(content, match.start(), 200)
+        })
+    return snippets
+```
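`extract_code_snippets` calls `get_surrounding_text`, which is not defined in this file. A minimal sketch of one plausible behavior is shown below, assuming the intent is to capture the prose that introduces a code block; the signature mirrors the call above, but the implementation is an assumption.

```python
def get_surrounding_text(content, position, window=200):
    """Return up to `window` characters of text immediately before `position`."""
    start = max(0, position - window)
    return content[start:position].strip()


# Build a small document; the fence marker is assembled to keep this example simple to embed.
FENCE = "`" * 3
doc = f"Use few-shot examples to set the format.\n\n{FENCE}python\nprint('hi')\n{FENCE}\n"
fence_start = doc.index(FENCE)
context = get_surrounding_text(doc, fence_start, 200)
```

Only the preceding text is used here because the sentence before a snippet usually explains it; a variant could also include text after the closing fence.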
+
+**Extract Key Concepts:**
+```python
+def extract_key_concepts(content, title):
+    """Use Claude to extract key concepts and definitions."""
+    # Truncate here; a comment inside the f-string would be sent as prompt text
+    excerpt = content[:8000]
+    prompt = f"""
+    Analyze this document and extract key concepts:
+    
+    Title: {title}
+    Content: {excerpt}
+    
+    Return JSON with:
+    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
+    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
+    - best_practices: ["..."]
+    """
+    # Use Claude API to process
+    return claude_extract(prompt)
+
+### Step 4: Create Structured Summary
+
+**Summary Template:**
+```markdown
+# {title}
+
+**Source:** {url}
+**Type:** {source_type} | **Tier:** {credibility_tier}
+**Distilled:** {date}
+
+## Executive Summary
+{2-3 sentence overview}
+
+## Key Concepts
+{bulleted list of core concepts with brief definitions}
+
+## Techniques & Patterns
+{extracted techniques with use cases}
+
+## Code Examples
+{relevant code snippets with context}
+
+## Best Practices
+{actionable recommendations}
+
+## Related Topics
+{links to related content in library}
+```
+
+### Step 5: Optimize for Tokens
+
+```python
+def optimize_content(structured_content, target_ratio=0.30):
+    """
+    Compress content to target ratio while preserving quality.
+    Target: 30% of original token count.
+    """
+    original_tokens = count_tokens(structured_content)
+    target_tokens = int(original_tokens * target_ratio)
+    
+    # Prioritized compression strategies
+    strategies = [
+        remove_redundant_explanations,
+        condense_examples,
+        merge_similar_sections,
+        trim_verbose_descriptions
+    ]
+    
+    optimized = structured_content
+    for strategy in strategies:
+        if count_tokens(optimized) > target_tokens:
+            optimized = strategy(optimized)
+    
+    return optimized
+```
+
+### Step 6: Store Distilled Content
+
+```python
+import json
+
+def store_distilled(cursor, doc_id, summary, key_concepts, 
+                    code_snippets, structured_content, 
+                    original_tokens, distilled_tokens):
+    sql = """
+    INSERT INTO distilled_content 
+    (doc_id, summary, key_concepts, code_snippets, structured_content,
+     token_count_original, token_count_distilled, distill_model, review_status)
+    VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
+    """
+    cursor.execute(sql, (
+        doc_id, summary, 
+        json.dumps(key_concepts), 
+        json.dumps(code_snippets),
+        structured_content,
+        original_tokens, 
+        distilled_tokens
+    ))
+    return cursor.lastrowid
+```
+
+## Distillation Prompts
+
+**For Prompt Engineering Content:**
+```
+Focus on:
+1. Specific techniques with before/after examples
+2. Why techniques work (not just what)
+3. Common pitfalls and how to avoid them
+4. Actionable patterns that can be directly applied
+```
+
+**For API Documentation:**
+```
+Focus on:
+1. Endpoint specifications and parameters
+2. Request/response examples
+3. Error codes and handling
+4. Rate limits and best practices
+```
+
+**For Research Papers:**
+```
+Focus on:
+1. Key findings and conclusions
+2. Novel techniques introduced
+3. Practical applications
+4. Limitations and caveats
+```
+
+## Quality Metrics
+
+Track compression efficiency:
+
+| Metric | Target |
+|--------|--------|
+| Compression Ratio | 25-35% of original |
+| Key Concept Coverage | ≥90% of important terms |
+| Code Snippet Retention | 100% of relevant examples |
+| Readability | Clear, scannable structure |
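
The compression target in the table can be checked with simple arithmetic. The sketch below assumes the ratio is defined as distilled tokens divided by original tokens, matching `target_ratio=0.30` in Step 5; the function names are illustrative.

```python
def compression_ratio(original_tokens, distilled_tokens):
    """Distilled size as a fraction of the original (0.30 means 30%)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return distilled_tokens / original_tokens


def within_target(ratio, low=0.25, high=0.35):
    """True when the ratio falls inside the 25-35% target band."""
    return low <= ratio <= high


# A 12,000-token source distilled to 3,600 tokens lands at 30%.
ratio = compression_ratio(original_tokens=12000, distilled_tokens=3600)
on_target = within_target(ratio)
```

A document distilled from 10,000 to 5,000 tokens (50%) would fall outside the band and could be sent back through the Step 5 strategies.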
+
+## Handling Refactor Requests
+
+When `quality-reviewer` returns a `refactor` decision:
+
+```python
+def handle_refactor(distill_id, instructions):
+    """Re-distill based on reviewer feedback."""
+    # Load original content and existing distillation
+    original = load_raw_content(distill_id)
+    existing = load_distilled_content(distill_id)
+    
+    # Apply specific improvements based on instructions
+    improved = apply_improvements(existing, instructions)
+    
+    # Update distilled_content
+    update_distilled(distill_id, improved)
+    
+    # Reset review status
+    set_review_status(distill_id, 'pending')
+```
+
+## Integration
+
+| From | Input | To |
+|------|-------|-----|
+| content-repository | Raw document records | content-distiller |
+| content-distiller | Distilled content | quality-reviewer |
+| quality-reviewer | Refactor instructions | content-distiller (loop) |
--- a/custom-skills/90-reference-curator/claude-project/05-quality-reviewer.md
+++ b/custom-skills/90-reference-curator/claude-project/05-quality-reviewer.md
@@ -0,0 +1,223 @@
+
+# Quality Reviewer
+
+Evaluates distilled content for quality, routes decisions, and triggers refactoring or additional research when needed.
+
+## Review Workflow
+
+```
+[Distilled Content]
+       │
+       ▼
+┌─────────────────┐
+│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
+└─────────────────┘
+       │
+       ▼
+┌─────────────────┐
+│ Calculate Total │ → weighted average
+└─────────────────┘
+       │
+       ├── ≥ 0.85 → APPROVE → markdown-exporter
+       ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
+       ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler-orchestrator (with queries)
+       └── < 0.40 → REJECT → archive with reason
+```
+
+## Scoring Criteria
+
+| Criterion | Weight | Checks |
+|-----------|--------|--------|
+| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
+| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
+| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
+| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
+| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |
+
+## Decision Thresholds
+
+| Score Range | Decision | Action |
+|-------------|----------|--------|
+| ≥ 0.85 | `approve` | Proceed to export |
+| 0.60 - 0.84 | `refactor` | Return to distiller with feedback |
+| 0.40 - 0.59 | `deep_research` | Gather more sources, then re-distill |
+| < 0.40 | `reject` | Archive, log reason |
+
+## Review Process
+
+### Step 1: Load Content for Review
+
+```python
+def get_pending_reviews(cursor):
+    sql = """
+    SELECT dc.distill_id, dc.doc_id, d.title, d.url, 
+           dc.summary, dc.key_concepts, dc.structured_content,
+           dc.token_count_original, dc.token_count_distilled,
+           s.credibility_tier
+    FROM distilled_content dc
+    JOIN documents d ON dc.doc_id = d.doc_id
+    JOIN sources s ON d.source_id = s.source_id
+    WHERE dc.review_status = 'pending'
+    ORDER BY s.credibility_tier ASC, dc.distill_date ASC
+    """
+    cursor.execute(sql)
+    return cursor.fetchall()
+```
+
+### Step 2: Score Each Criterion
+
+Evaluate content against each criterion using this assessment template:
+
+```python
+assessment_template = {
+    "accuracy": {
+        "score": 0.0,  # 0.00 - 1.00
+        "notes": "",
+        "issues": []   # Specific factual errors if any
+    },
+    "completeness": {
+        "score": 0.0,
+        "notes": "",
+        "missing_topics": []  # Concepts that should be covered
+    },
+    "clarity": {
+        "score": 0.0,
+        "notes": "",
+        "confusing_sections": []  # Sections needing rewrite
+    },
+    "prompt_engineering_quality": {
+        "score": 0.0,
+        "notes": "",
+        "improvements": []  # Specific PE technique gaps
+    },
+    "usability": {
+        "score": 0.0,
+        "notes": "",
+        "suggestions": []
+    }
+}
+```
+
+### Step 3: Calculate Final Score
+
+```python
+WEIGHTS = {
+    "accuracy": 0.25,
+    "completeness": 0.20,
+    "clarity": 0.20,
+    "prompt_engineering_quality": 0.25,
+    "usability": 0.10
+}
+
+def calculate_quality_score(assessment):
+    return sum(
+        assessment[criterion]["score"] * weight
+        for criterion, weight in WEIGHTS.items()
+    )
+```
+
+### Step 4: Route Decision
+
+```python
+def determine_decision(score, assessment):
+    if score >= 0.85:
+        return "approve", None, None
+    elif score >= 0.60:
+        instructions = generate_refactor_instructions(assessment)
+        return "refactor", instructions, None
+    elif score >= 0.40:
+        queries = generate_research_queries(assessment)
+        return "deep_research", None, queries
+    else:
+        return "reject", f"Quality score {score:.2f} below minimum threshold", None
+
+def generate_refactor_instructions(assessment):
+    """Extract actionable feedback from low-scoring criteria."""
+    instructions = []
+    for criterion, data in assessment.items():
+        if data["score"] < 0.80:
+            if data.get("issues"):
+                instructions.extend(data["issues"])
+            if data.get("missing_topics"):
+                instructions.append(f"Add coverage for: {', '.join(data['missing_topics'])}")
+            if data.get("confusing_sections"):
+                instructions.append(f"Rewrite for clarity: {', '.join(data['confusing_sections'])}")
+            if data.get("improvements"):
+                instructions.extend(data["improvements"])
+            if data.get("suggestions"):
+                instructions.extend(data["suggestions"])
+    return "\n".join(instructions)
+
+def generate_research_queries(assessment):
+    """Generate search queries for content gaps."""
+    queries = []
+    if assessment["completeness"]["missing_topics"]:
+        for topic in assessment["completeness"]["missing_topics"]:
+            queries.append(f"{topic} documentation guide")
+    if assessment["accuracy"]["issues"]:
+        queries.append("latest official documentation verification")
+    return queries
+```
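A worked example may make the weighting and routing concrete. The sketch below restates the weights and thresholds from this file in compact form; the per-criterion scores are illustrative values, not output from a real review, and `weighted_score` and `route` are shortened stand-ins for `calculate_quality_score` and `determine_decision`.

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10,
}


def weighted_score(scores):
    """Weighted average of per-criterion scores; the weights sum to 1.0."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


def route(score):
    """Map a total score to a decision using the thresholds in the table."""
    if score >= 0.85:
        return "approve"
    if score >= 0.60:
        return "refactor"
    if score >= 0.40:
        return "deep_research"
    return "reject"


# Illustrative scores only.
scores = {
    "accuracy": 0.90,
    "completeness": 0.80,
    "clarity": 0.85,
    "prompt_engineering_quality": 0.70,
    "usability": 0.90,
}
# 0.225 + 0.160 + 0.170 + 0.175 + 0.090 = 0.820
total = round(weighted_score(scores), 4)
decision = route(total)
```

Rounding before the comparison is a design choice in this sketch: floating-point sums can land a hair under a boundary such as 0.85, and rounding to a few decimals keeps boundary cases on the intended side.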
+
+### Step 5: Log Review Decision
+
+```python
+import json
+
+def log_review(cursor, distill_id, assessment, score, decision, instructions=None, queries=None):
+    # Get current round number
+    cursor.execute(
+        "SELECT COALESCE(MAX(review_round), 0) + 1 FROM review_logs WHERE distill_id = %s",
+        (distill_id,)
+    )
+    review_round = cursor.fetchone()[0]
+    
+    sql = """
+    INSERT INTO review_logs 
+    (distill_id, review_round, reviewer_type, quality_score, assessment, 
+     decision, refactor_instructions, research_queries)
+    VALUES (%s, %s, 'claude_review', %s, %s, %s, %s, %s)
+    """
+    cursor.execute(sql, (
+        distill_id, review_round, score, 
+        json.dumps(assessment), decision, instructions, 
+        json.dumps(queries) if queries else None
+    ))
+    
+    # Update distilled_content status
+    status_map = {
+        "approve": "approved",
+        "refactor": "needs_refactor", 
+        "deep_research": "needs_refactor",
+        "reject": "rejected"
+    }
+    cursor.execute(
+        "UPDATE distilled_content SET review_status = %s WHERE distill_id = %s",
+        (status_map[decision], distill_id)
+    )
+```
+
+## Prompt Engineering Quality Checklist
+
+When scoring `prompt_engineering_quality`, verify:
+
+- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
+- [ ] Shows before/after examples
+- [ ] Explains *why* techniques work, not just *what*
+- [ ] Provides actionable patterns
+- [ ] Includes edge cases and failure modes
+- [ ] References authoritative sources
+
+## Auto-Approve Rules
+
+When enabled in the export config, Tier 1 (official) sources scoring ≥ 0.80 are auto-approved without human review, slightly below the standard 0.85 approve threshold:
+
+```yaml
+# In export_config.yaml
+quality:
+  auto_approve_tier1_sources: true
+  auto_approve_min_score: 0.80
+```
+
+## Integration Points
+
+| From | Action | To |
+|------|--------|-----|
+| content-distiller | Sends distilled content | quality-reviewer |
+| quality-reviewer | APPROVE | markdown-exporter |
+| quality-reviewer | REFACTOR + instructions | content-distiller |
+| quality-reviewer | DEEP_RESEARCH + queries | web-crawler-orchestrator |
--- a/custom-skills/90-reference-curator/claude-project/06-markdown-exporter.md
+++ b/custom-skills/90-reference-curator/claude-project/06-markdown-exporter.md
@@ -0,0 +1,290 @@
+
+# Markdown Exporter
+
+Exports approved content as structured markdown files for Claude Projects or fine-tuning.
+
+## Export Configuration
+
+```yaml
+# ~/.config/reference-curator/export_config.yaml
+output:
+  base_path: ~/reference-library/exports/
+  
+  project_files:
+    structure: nested_by_topic  # flat | nested_by_topic | nested_by_source
+    index_file: INDEX.md
+    include_metadata: true
+  
+  fine_tuning:
+    format: jsonl
+    max_tokens_per_sample: 4096
+    include_system_prompt: true
+
+quality:
+  min_score_for_export: 0.80
+```
+
+## Export Workflow
+
+### Step 1: Query Approved Content
+
+```python
+def get_exportable_content(cursor, min_score=0.80, topic_filter=None):
+    """Get all approved content meeting quality threshold."""
+    sql = """
+    SELECT d.doc_id, d.title, d.url, 
+           dc.summary, dc.key_concepts, dc.code_snippets, dc.structured_content,
+           t.topic_slug, t.topic_name,
+           rl.quality_score, s.credibility_tier, s.vendor
+    FROM documents d
+    JOIN distilled_content dc ON d.doc_id = dc.doc_id
+    JOIN document_topics dt ON d.doc_id = dt.doc_id
+    JOIN topics t ON dt.topic_id = t.topic_id
+    JOIN review_logs rl ON dc.distill_id = rl.distill_id
+    JOIN sources s ON d.source_id = s.source_id
+    WHERE rl.decision = 'approve'
+    AND rl.quality_score >= %s
+    AND rl.review_id = (
+        SELECT MAX(review_id) FROM review_logs 
+        WHERE distill_id = dc.distill_id
+    )
+    """
+    params = [min_score]
+    
+    if topic_filter:
+        sql += " AND t.topic_slug IN (%s)" % ','.join(['%s'] * len(topic_filter))
+        params.extend(topic_filter)
+    
+    sql += " ORDER BY t.topic_slug, rl.quality_score DESC"
+    cursor.execute(sql, params)
+    return cursor.fetchall()
+```
+
+### Step 2: Organize by Structure
+
+**Nested by Topic (recommended):**
+```
+exports/
+├── INDEX.md
+├── prompt-engineering/
+│   ├── _index.md
+│   ├── 01-chain-of-thought.md
+│   ├── 02-few-shot-prompting.md
+│   └── 03-system-prompts.md
+├── claude-models/
+│   ├── _index.md
+│   ├── 01-model-comparison.md
+│   └── 02-context-windows.md
+└── agent-building/
+    ├── _index.md
+    └── 01-tool-use.md
+```
+
+**Flat Structure:**
+```
+exports/
+├── INDEX.md
+├── prompt-engineering-chain-of-thought.md
+├── prompt-engineering-few-shot.md
+└── claude-models-comparison.md
+```
+
+### Step 3: Generate Files
+
+**Document File Template:**
+```python
+def generate_document_file(doc, include_metadata=True):
+    content = []
+    
+    if include_metadata:
+        content.append("---")
+        content.append(f"title: {json.dumps(doc['title'])}")  # quoted so a ':' in the title stays valid YAML
+        content.append(f"source: {doc['url']}")
+        content.append(f"vendor: {doc['vendor']}")
+        content.append(f"tier: {doc['credibility_tier']}")
+        content.append(f"quality_score: {doc['quality_score']:.2f}")
+        content.append(f"exported: {datetime.now().isoformat()}")
+        content.append("---")
+        content.append("")
+    
+    content.append(doc['structured_content'])
+    
+    return "\n".join(content)
+```
+
+**Topic Index Template:**
+```python
+def generate_topic_index(topic_slug, topic_name, documents):
+    content = [
+        f"# {topic_name}",
+        "",
+        f"This section contains {len(documents)} reference documents.",
+        "",
+        "## Contents",
+        ""
+    ]
+    
+    for i, doc in enumerate(documents, 1):
+        filename = f"{i:02d}-{slugify(doc['title'])}.md"  # must match the name used when the file is written
+        content.append(f"{i}. [{doc['title']}]({filename})")
+    
+    return "\n".join(content)
+```
+
+**Root INDEX Template:**
+```python
+def generate_root_index(topics_with_counts, export_date):
+    content = [
+        "# Reference Library",
+        "",
+        f"Exported: {export_date}",
+        "",
+        "## Topics",
+        ""
+    ]
+    
+    for topic in topics_with_counts:
+        content.append(f"- [{topic['name']}]({topic['slug']}/_index.md) ({topic['count']} documents)")
+    
+    content.extend([
+        "",
+        "## Quality Standards",
+        "",
+        "All documents in this library have:",
+        "- Passed quality review (score ≥ 0.80)",
+        "- Been distilled for conciseness",
+        "- Verified source attribution"
+    ])
+    
+    return "\n".join(content)
+```
+
+### Step 4: Write Files
+
+```python
+def export_project_files(content_list, config):
+    base_path = Path(config['output']['base_path']).expanduser()  # configured path starts with ~
+    structure = config['output']['project_files']['structure']
+    
+    # Group by topic
+    by_topic = defaultdict(list)
+    for doc in content_list:
+        by_topic[doc['topic_slug']].append(doc)
+    
+    # Create directories and files
+    for topic_slug, docs in by_topic.items():
+        if structure == 'nested_by_topic':
+            topic_dir = base_path / topic_slug
+            topic_dir.mkdir(parents=True, exist_ok=True)
+            
+            # Write topic index
+            topic_index = generate_topic_index(topic_slug, docs[0]['topic_name'], docs)
+            (topic_dir / '_index.md').write_text(topic_index, encoding='utf-8')
+
+            # Write document files
+            for i, doc in enumerate(docs, 1):
+                filename = f"{i:02d}-{slugify(doc['title'])}.md"
+                file_content = generate_document_file(doc, config['output']['project_files']['include_metadata'])
+                (topic_dir / filename).write_text(file_content, encoding='utf-8')
+    
+    # Write root INDEX
+    topics_summary = [
+        {"slug": slug, "name": docs[0]['topic_name'], "count": len(docs)}
+        for slug, docs in by_topic.items()
+    ]
+    root_index = generate_root_index(topics_summary, datetime.now().isoformat())
+    (base_path / 'INDEX.md').write_text(root_index, encoding='utf-8')  # index text contains non-ASCII (≥)
+```
+
+### Step 5: Fine-tuning Export (Optional)
+
+```python
+def export_fine_tuning_dataset(content_list, config):
+    """Export as JSONL for fine-tuning."""
+    output_path = Path(config['output']['base_path']).expanduser() / 'fine_tuning.jsonl'
+    max_tokens = config['output']['fine_tuning']['max_tokens_per_sample']
+    
+    with open(output_path, 'w') as f:
+        for doc in content_list:
+            sample = {
+                "messages": [
+                    {
+                        "role": "system",
+                        "content": "You are an expert on AI and prompt engineering."
+                    },
+                    {
+                        "role": "user", 
+                        "content": f"Explain {doc['title']}"
+                    },
+                    {
+                        "role": "assistant",
+                        "content": truncate_to_tokens(doc['structured_content'], max_tokens)
+                    }
+                ],
+                "metadata": {
+                    "source": doc['url'],
+                    "topic": doc['topic_slug'],
+                    "quality_score": doc['quality_score']
+                }
+            }
+            f.write(json.dumps(sample) + '\n')
+```
+
+### Step 6: Log Export Job
+
+```python
+def log_export_job(cursor, export_name, export_type, output_path, 
+                   topic_filter, total_docs, total_tokens):
+    sql = """
+    INSERT INTO export_jobs 
+    (export_name, export_type, output_format, topic_filter, output_path,
+     total_documents, total_tokens, status, started_at, completed_at)
+    VALUES (%s, %s, 'markdown', %s, %s, %s, %s, 'completed', NOW(), NOW())
+    """
+    cursor.execute(sql, (
+        export_name, export_type, 
+        json.dumps(topic_filter) if topic_filter else None,
+        str(output_path), total_docs, total_tokens
+    ))
+```
+
+## Cross-Reference Generation
+
+Link each document to up to five others that share at least two key concepts, ranked by overlap:
+
+```python
+def add_cross_references(doc, all_docs):
+    """Find and link related documents."""
+    related = []
+    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])
+    
+    for other in all_docs:
+        if other['doc_id'] == doc['doc_id']:
+            continue
+        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
+        overlap = len(doc_concepts & other_concepts)
+        if overlap >= 2:
+            related.append({
+                "title": other['title'],
+                "path": generate_relative_path(doc, other),
+                "overlap": overlap
+            })
+    
+    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
+```
+
+## Output Verification
+
+After export, verify:
+- [ ] All files readable and valid markdown
+- [ ] INDEX.md links resolve correctly
+- [ ] No broken cross-references
+- [ ] Total token count matches expectation
+- [ ] No duplicate content
+
+## Integration
+
+| From | Input | To |
+|------|-------|-----|
+| quality-reviewer | Approved content IDs | markdown-exporter |
+| markdown-exporter | Structured files | Project knowledge / Fine-tuning |
--- a/custom-skills/90-reference-curator/claude-project/INDEX.md
+++ b/custom-skills/90-reference-curator/claude-project/INDEX.md
@@ -0,0 +1,89 @@
+# Reference Curator - Claude.ai Project Knowledge
+
+This project knowledge enables Claude to curate, process, and export reference documentation through 6 modular skills.
+
+## Skills Overview
+
+| Skill | Purpose | Trigger Phrases |
+|-------|---------|-----------------|
+| **reference-discovery** | Search & validate authoritative sources | "find references", "search documentation", "discover sources" |
+| **web-crawler** | Multi-backend crawling orchestration | "crawl URL", "fetch documents", "scrape pages" |
+| **content-repository** | MySQL storage management | "store content", "save to database", "check duplicates" |
+| **content-distiller** | Summarize & extract key concepts | "distill content", "summarize document", "extract key concepts" |
+| **quality-reviewer** | QA scoring & routing decisions | "review content", "quality check", "assess distilled content" |
+| **markdown-exporter** | Export to markdown/JSONL | "export references", "generate project files", "create markdown output" |
+
+## Workflow
+
+```
+[Topic Input]
+     │
+     ▼
+┌─────────────────────┐
+│ reference-discovery │ → Search & validate sources
+└─────────────────────┘
+     │
+     ▼
+┌─────────────────────┐
+│ web-crawler         │ → Crawl (Firecrawl/Node.js/aiohttp/Scrapy)
+└─────────────────────┘
+     │
+     ▼
+┌─────────────────────┐
+│ content-repository  │ → Store in MySQL
+└─────────────────────┘
+     │
+     ▼
+┌─────────────────────┐
+│ content-distiller   │ → Summarize & extract
+└─────────────────────┘
+     │
+     ▼
+┌─────────────────────┐
+│ quality-reviewer    │ → QA loop
+└─────────────────────┘
+     │
+     ├── REFACTOR → content-distiller
+     ├── DEEP_RESEARCH → web-crawler
+     │
+     ▼ APPROVE
+┌─────────────────────┐
+│ markdown-exporter   │ → Project files / Fine-tuning
+└─────────────────────┘
+```
+
+## Quality Scoring Thresholds
+
+| Score | Decision | Action |
+|-------|----------|--------|
+| ≥ 0.85 | **Approve** | Ready for export |
+| 0.60-0.84 | **Refactor** | Re-distill with feedback |
+| 0.40-0.59 | **Deep Research** | Gather more sources |
+| < 0.40 | **Reject** | Archive (low quality) |
+
+## Source Credibility Tiers
+
+| Tier | Source Type | Examples |
+|------|-------------|----------|
+| **Tier 1** | Official documentation | docs.anthropic.com, platform.openai.com/docs |
+| **Tier 1** | Official engineering blogs | anthropic.com/news, openai.com/blog |
+| **Tier 2** | Research papers | arxiv.org papers with citations |
+| **Tier 2** | Verified community guides | Official cookbooks, tutorials |
+| **Tier 3** | Community content | Blog posts, Stack Overflow |
+
+## Files in This Project
+
+- `INDEX.md` - This overview file
+- `reference-curator-complete.md` - All 6 skills in one file
+- `01-reference-discovery.md` - Source discovery skill
+- `02-web-crawler.md` - Crawling orchestration skill
+- `03-content-repository.md` - Database storage skill
+- `04-content-distiller.md` - Content summarization skill
+- `05-quality-reviewer.md` - QA review skill
+- `06-markdown-exporter.md` - Export skill
+
+## Usage
+
+Upload `INDEX.md` together with the individual skill files you need to a Claude.ai Project.
+
+Alternatively, upload `reference-curator-complete.md`, which consolidates all six skills in one file.
--- a/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md
+++ b/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md
@@ -0,0 +1,473 @@
+# Reference Curator - Complete Skill Set
+
+This document contains all 6 skills for curating, processing, and exporting reference documentation.
+
+---
+
+# 1. Reference Discovery
+
+Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.
+
+## Source Priority Hierarchy
+
+| Tier | Source Type | Examples |
+|------|-------------|----------|
+| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
+| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
+| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
+| **Tier 2** | Research papers | arxiv.org, papers with citations |
+| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
+| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |
+
+## Discovery Workflow
+
+### Step 1: Define Search Scope
+
+```python
+search_config = {
+    "topic": "prompt engineering",
+    "vendors": ["anthropic", "openai", "google"],
+    "source_types": ["official_docs", "engineering_blog", "github_repo"],
+    "freshness": "past_year",
+    "max_results_per_query": 20
+}
+```
+
+### Step 2: Generate Search Queries
+
+```python
+def generate_queries(topic, vendors):
+    queries = []
+    for vendor in vendors:
+        queries.append(f"site:docs.{vendor}.com {topic}")
+        queries.append(f"site:{vendor}.com/docs {topic}")
+        queries.append(f"site:{vendor}.com/blog {topic}")
+        queries.append(f"site:github.com/{vendor} {topic}")
+    queries.append(f"site:arxiv.org {topic}")
+    return queries
+```
+
+### Step 3: Validate and Score Sources
+
+```python
+def score_source(url, title):
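+    score = 0.0  # domain signal only: tops out at 0.40, so the 0.60 tier-1 cutoff below needs further signals
+    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com', 'platform.openai.com/docs']):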
+        score += 0.40  # Tier 1 official docs
+    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
+        score += 0.30  # Tier 1 official blog/news
+    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
+        score += 0.30  # Tier 1 official repos
+    elif 'arxiv.org' in url:
+        score += 0.20  # Tier 2 research
+    else:
+        score += 0.10  # Tier 3 community
+    return min(score, 1.0)
+
+def assign_credibility_tier(score):
+    if score >= 0.60:
+        return 'tier1_official'
+    elif score >= 0.40:
+        return 'tier2_verified'
+    else:
+        return 'tier3_community'
+```
+
+## Output Format
+
+```json
+{
+  "discovery_date": "2025-01-28T10:30:00",
+  "topic": "prompt engineering",
+  "total_urls": 15,
+  "urls": [
+    {
+      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
+      "title": "Prompt Engineering Guide",
+      "credibility_tier": "tier1_official",
+      "credibility_score": 0.85,
+      "source_type": "official_docs",
+      "vendor": "anthropic"
+    }
+  ]
+}
+```
+
+---
+
+# 2. Web Crawler Orchestrator
+
+Manages crawling operations using Firecrawl MCP with rate limiting and format handling.
+
+## Crawl Configuration
+
+```yaml
+firecrawl:
+  rate_limit:
+    requests_per_minute: 20
+    concurrent_requests: 3
+  default_options:
+    timeout: 30000
+    only_main_content: true
+```
+
+## Crawl Workflow
+
+### Determine Crawl Strategy
+
+```python
+def select_strategy(url):
+    if url.endswith('.pdf'):
+        return 'pdf_extract'
+    elif 'github.com' in url and '/blob/' in url:
+        return 'raw_content'
+    elif any(d in url for d in ['docs.', 'documentation']):
+        return 'scrape'
+    else:
+        return 'scrape'
+```
+
+### Execute Firecrawl
+
+```python
+# Single page scrape
+firecrawl_scrape(
+    url="https://docs.anthropic.com/en/docs/prompt-engineering",
+    formats=["markdown"],
+    only_main_content=True,
+    timeout=30000
+)
+
+# Multi-page crawl
+firecrawl_crawl(
+    url="https://docs.anthropic.com/en/docs/",
+    max_depth=2,
+    limit=50,
+    formats=["markdown"]
+)
+```
+
+### Rate Limiting
+
+```python
+class RateLimiter:
+    def __init__(self, requests_per_minute=20):
+        self.rpm = requests_per_minute
+        self.request_times = deque()
+
+    def wait_if_needed(self):
+        now = time.time()
+        while self.request_times and now - self.request_times[0] > 60:
+            self.request_times.popleft()
+        if len(self.request_times) >= self.rpm:
+            wait_time = 60 - (now - self.request_times[0])
+            if wait_time > 0:
+                time.sleep(wait_time)
+        self.request_times.append(time.time())
+```
+
+## Error Handling
+
+| Error | Action |
+|-------|--------|
+| Timeout | Retry once with 2x timeout |
+| Rate limit (429) | Exponential backoff, max 3 retries |
+| Not found (404) | Log and skip |
+| Access denied (403) | Log, mark as `failed` |
+
+---
+
+# 3. Content Repository
+
+Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.
+
+## Core Operations
+
+**Store New Document:**
+```python
+def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
+    sql = """
+    INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
+    VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
+    ON DUPLICATE KEY UPDATE
+        version = version + 1,
+        crawl_date = NOW(),
+        raw_content_path = VALUES(raw_content_path)
+    """
+    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
+    return cursor.lastrowid
+```
+
+**Check Duplicate:**
+```python
+def is_duplicate(cursor, url):
+    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
+    return cursor.fetchone() is not None
+```
+
+## Table Quick Reference
+
+| Table | Purpose | Key Fields |
+|-------|---------|------------|
+| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
+| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
+| `distilled_content` | Processed summaries | review_status, compression_ratio |
+| `review_logs` | QA decisions | quality_score, decision |
+| `topics` | Taxonomy | topic_slug, parent_topic_id |
+
+## Status Values
+
+- **crawl_status:** `pending` → `completed` | `failed` | `stale`
+- **review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`
+- **decision:** `approve` | `refactor` | `deep_research` | `reject`
+
+---
+
+# 4. Content Distiller
+
+Transforms raw crawled content into structured, high-quality reference materials.
+
+## Distillation Goals
+
+1. **Compress** - Reduce token count while preserving essential information
+2. **Structure** - Organize content for easy retrieval and reference
+3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
+4. **Annotate** - Add metadata for searchability and categorization
+
+## Extract Key Components
+
+**Extract Code Snippets:**
+```python
+def extract_code_snippets(content):
+    pattern = r'```(\w*)\n([\s\S]*?)```'
+    snippets = []
+    for match in re.finditer(pattern, content):
+        snippets.append({
+            "language": match.group(1) or "text",
+            "code": match.group(2).strip(),
+            "context": get_surrounding_text(content, match.start(), 200)
+        })
+    return snippets
+```
+
+**Extract Key Concepts:**
+```python
+def extract_key_concepts(content, title):
+    prompt = f"""
+    Analyze this document and extract key concepts:
+
+    Title: {title}
+    Content: {content[:8000]}
+
+    Return JSON with:
+    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
+    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
+    - best_practices: ["..."]
+    """
+    return claude_extract(prompt)
+```
+
+## Summary Template
+
+```markdown
+# {title}
+
+**Source:** {url}
+**Type:** {source_type} | **Tier:** {credibility_tier}
+
+## Executive Summary
+{2-3 sentence overview}
+
+## Key Concepts
+{bulleted list of core concepts}
+
+## Techniques & Patterns
+{extracted techniques with use cases}
+
+## Code Examples
+{relevant code snippets}
+
+## Best Practices
+{actionable recommendations}
+```
+
+## Quality Metrics
+
+| Metric | Target |
+|--------|--------|
+| Compression Ratio | 25-35% of original |
+| Key Concept Coverage | ≥90% of important terms |
+| Code Snippet Retention | 100% of relevant examples |
+
+---
+
+# 5. Quality Reviewer
+
+Evaluates distilled content, routes decisions, and triggers refactoring or additional research.
+
+## Review Workflow
+
+```
+[Distilled Content]
+       │
+       ▼
+┌─────────────────┐
+│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
+└─────────────────┘
+       │
+       ├── ≥ 0.85 → APPROVE → markdown-exporter
+       ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
+       ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
+       └── < 0.40 → REJECT → archive with reason
+```
+
+## Scoring Criteria
+
+| Criterion | Weight | Checks |
+|-----------|--------|--------|
+| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
+| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
+| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
+| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
+| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |
+
+## Calculate Final Score
+
+```python
+WEIGHTS = {
+    "accuracy": 0.25,
+    "completeness": 0.20,
+    "clarity": 0.20,
+    "prompt_engineering_quality": 0.25,
+    "usability": 0.10
+}
+
+def calculate_quality_score(assessment):
+    return sum(
+        assessment[criterion]["score"] * weight
+        for criterion, weight in WEIGHTS.items()
+    )
+```
+
+## Route Decision
+
+```python
+def determine_decision(score, assessment):
+    if score >= 0.85:
+        return "approve", None, None
+    elif score >= 0.60:
+        instructions = generate_refactor_instructions(assessment)
+        return "refactor", instructions, None
+    elif score >= 0.40:
+        queries = generate_research_queries(assessment)
+        return "deep_research", None, queries
+    else:
+        return "reject", f"Quality score {score:.2f} below minimum", None
+```
+
+## Prompt Engineering Quality Checklist
+
+- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
+- [ ] Shows before/after examples
+- [ ] Explains *why* techniques work, not just *what*
+- [ ] Provides actionable patterns
+- [ ] Includes edge cases and failure modes
+- [ ] References authoritative sources
+
+---
+
+# 6. Markdown Exporter
+
+Exports approved content as structured markdown files for Claude Projects or fine-tuning.
+
+## Export Structure
+
+**Nested by Topic (recommended):**
+```
+exports/
+├── INDEX.md
+├── prompt-engineering/
+│   ├── _index.md
+│   ├── 01-chain-of-thought.md
+│   └── 02-few-shot-prompting.md
+├── claude-models/
+│   ├── _index.md
+│   └── 01-model-comparison.md
+└── agent-building/
+    └── 01-tool-use.md
+```
+
+## Document File Template
+
+```python
+def generate_document_file(doc, include_metadata=True):
+    content = []
+    if include_metadata:
+        content.append("---")
+        content.append(f"title: {json.dumps(doc['title'])}")  # quoted so a ':' in the title stays valid YAML
+        content.append(f"source: {doc['url']}")
+        content.append(f"vendor: {doc['vendor']}")
+        content.append(f"tier: {doc['credibility_tier']}")
+        content.append(f"quality_score: {doc['quality_score']:.2f}")
+        content.append("---")
+        content.append("")
+    content.append(doc['structured_content'])
+    return "\n".join(content)
+```
+
+## Fine-tuning Export (JSONL)
+
+```python
+def export_fine_tuning_dataset(content_list, config):
+    with open('fine_tuning.jsonl', 'w') as f:
+        for doc in content_list:
+            sample = {
+                "messages": [
+                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
+                    {"role": "user", "content": f"Explain {doc['title']}"},
+                    {"role": "assistant", "content": doc['structured_content']}
+                ],
+                "metadata": {
+                    "source": doc['url'],
+                    "topic": doc['topic_slug'],
+                    "quality_score": doc['quality_score']
+                }
+            }
+            f.write(json.dumps(sample) + '\n')
+```
+
+## Cross-Reference Generation
+
+```python
+def add_cross_references(doc, all_docs):
+    related = []
+    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])
+
+    for other in all_docs:
+        if other['doc_id'] == doc['doc_id']:
+            continue
+        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
+        overlap = len(doc_concepts & other_concepts)
+        if overlap >= 2:
+            related.append({
+                "title": other['title'],
+                "path": generate_relative_path(doc, other),
+                "overlap": overlap
+            })
+
+    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
+```
+
+---
+
+# Integration Flow
+
+| From | Output | To |
+|------|--------|-----|
+| **reference-discovery** | URL manifest | web-crawler |
+| **web-crawler** | Raw content + manifest | content-repository |
+| **content-repository** | Document records | content-distiller |
+| **content-distiller** | Distilled content | quality-reviewer |
+| **quality-reviewer** (approve) | Approved IDs | markdown-exporter |
+| **quality-reviewer** (refactor) | Instructions | content-distiller |
+| **quality-reviewer** (deep_research) | Queries | web-crawler |