diff --git a/custom-skills/90-reference-curator/README.md b/custom-skills/90-reference-curator/README.md index d58d780..acda312 100644 --- a/custom-skills/90-reference-curator/README.md +++ b/custom-skills/90-reference-curator/README.md @@ -60,6 +60,7 @@ cd our-claude-skills/custom-skills/90-reference-curator | **Full** | `./install.sh` | Interactive setup with MySQL and crawlers | | **Minimal** | `./install.sh --minimal` | Firecrawl MCP only, no database | | **Check** | `./install.sh --check` | Verify installation status | +| **Claude.ai** | `./install.sh --claude-ai` | Export skills for Claude.ai Projects | | **Uninstall** | `./install.sh --uninstall` | Remove installation (preserves data) | ### What Gets Installed @@ -94,6 +95,38 @@ export CRAWLER_PROJECT_PATH="" # Path to local crawlers (optional) --- +## Claude.ai Projects Installation + +To use these skills in Claude.ai (web interface), export the skill files for upload: + +```bash +./install.sh --claude-ai +``` + +This displays available files in `claude-project/` and optionally copies them to a convenient location. + +### Files for Upload + +| File | Description | +|------|-------------| +| `reference-curator-complete.md` | All 6 skills combined (recommended) | +| `INDEX.md` | Overview and workflow documentation | +| `01-reference-discovery.md` | Source discovery skill | +| `02-web-crawler.md` | Crawling orchestration skill | +| `03-content-repository.md` | Database storage skill | +| `04-content-distiller.md` | Content summarization skill | +| `05-quality-reviewer.md` | QA review skill | +| `06-markdown-exporter.md` | Export skill | + +### Upload Instructions + +1. Go to [claude.ai](https://claude.ai) +2. Create a new Project or open existing one +3. Click "Add to project knowledge" +4. 
Upload `reference-curator-complete.md` (or individual skills as needed) + +--- + ## Architecture ``` @@ -386,6 +419,16 @@ mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library < shar ├── CHANGELOG.md # Version history ├── install.sh # Portable installation script │ +├── claude-project/ # Files for Claude.ai Projects +│ ├── INDEX.md # Overview +│ ├── reference-curator-complete.md # All skills combined +│ ├── 01-reference-discovery.md +│ ├── 02-web-crawler.md +│ ├── 03-content-repository.md +│ ├── 04-content-distiller.md +│ ├── 05-quality-reviewer.md +│ └── 06-markdown-exporter.md +│ ├── commands/ # Claude Code commands (tracked in git) │ ├── reference-discovery.md │ ├── web-crawler.md diff --git a/custom-skills/90-reference-curator/claude-project/01-reference-discovery.md b/custom-skills/90-reference-curator/claude-project/01-reference-discovery.md new file mode 100644 index 0000000..c10d2b5 --- /dev/null +++ b/custom-skills/90-reference-curator/claude-project/01-reference-discovery.md @@ -0,0 +1,184 @@ + +# Reference Discovery + +Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling. 
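The discovery workflow in this file calls `deduplicate_by_url` without defining it, and its Deduplication section says URLs are normalized by removing trailing slashes and query params, with duplicates merged by keeping the highest credibility score. A minimal sketch of that step might look like the following. The helper `normalize_url`, the dropping of URL fragments, and the `score` key on result dictionaries are assumptions made for illustration, not part of the original skill.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Drop the query string and fragment, and strip trailing slashes from the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    # Rebuild the URL with an empty query and fragment (fragment removal is an assumption).
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def deduplicate_by_url(results):
    """Keep one entry per normalized URL, preferring the highest 'score' if present."""
    best = {}
    for result in results:
        key = normalize_url(result["url"])
        current = best.get(key)
        # Entries without a score are treated as 0.0, so the first one seen wins ties.
        if current is None or result.get("score", 0.0) > current.get("score", 0.0):
            best[key] = result
    return list(best.values())
```

Checking against the existing `documents` table, which the same section mentions, would still need a database lookup through `content-repository` and is not covered by this sketch.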
+ +## Source Priority Hierarchy + +| Tier | Source Type | Examples | +|------|-------------|----------| +| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs | +| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog | +| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* | +| **Tier 2** | Research papers | arxiv.org, papers with citations | +| **Tier 2** | Verified community guides | Cookbook examples, official tutorials | +| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow | + +## Discovery Workflow + +### Step 1: Define Search Scope + +```python +search_config = { + "topic": "prompt engineering", + "vendors": ["anthropic", "openai", "google"], + "source_types": ["official_docs", "engineering_blog", "github_repo"], + "freshness": "past_year", # past_week, past_month, past_year, any + "max_results_per_query": 20 +} +``` + +### Step 2: Generate Search Queries + +For a given topic, generate targeted queries: + +```python +def generate_queries(topic, vendors): + queries = [] + + # Official documentation queries + for vendor in vendors: + queries.append(f"site:docs.{vendor}.com {topic}") + queries.append(f"site:{vendor}.com/docs {topic}") + + # Engineering blog queries + for vendor in vendors: + queries.append(f"site:{vendor}.com/blog {topic}") + queries.append(f"site:{vendor}.com/news {topic}") + + # GitHub queries + for vendor in vendors: + queries.append(f"site:github.com/{vendor} {topic}") + + # Research queries + queries.append(f"site:arxiv.org {topic}") + + return queries +``` + +### Step 3: Execute Search + +Use web search tool for each query: + +```python +def execute_discovery(queries): + results = [] + for query in queries: + search_results = web_search(query) + for result in search_results: + results.append({ + "url": result.url, + "title": result.title, + "snippet": result.snippet, + "query_used": query + }) + return 
deduplicate_by_url(results)
+```
+
+### Step 4: Validate and Score Sources
+
+```python
+def score_source(url, title):
+    score = 0.0
+
+    # Domain credibility
+    # OpenAI's official docs live at platform.openai.com/docs (see the tier table above)
+    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'platform.openai.com/docs']):
+        score += 0.40  # Tier 1 official docs
+    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
+        score += 0.30  # Tier 1 official blog/news
+    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
+        score += 0.30  # Tier 1 official repos
+    elif 'arxiv.org' in url:
+        score += 0.20  # Tier 2 research
+    else:
+        score += 0.10  # Tier 3 community
+
+    # Freshness signals (from title/snippet)
+    if any(year in title for year in ['2025', '2024']):
+        score += 0.20
+    elif any(year in title for year in ['2023']):
+        score += 0.10
+
+    # Relevance signals
+    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
+        score += 0.15
+
+    return min(score, 1.0)
+
+def assign_credibility_tier(score):
+    if score >= 0.60:
+        return 'tier1_official'
+    elif score >= 0.40:
+        return 'tier2_verified'
+    else:
+        return 'tier3_community'
+```
+
+### Step 5: Output URL Manifest
+
+```python
+from datetime import datetime
+
+def create_manifest(scored_results, topic):
+    manifest = {
+        "discovery_date": datetime.now().isoformat(),
+        "topic": topic,
+        "total_urls": len(scored_results),
+        "urls": []
+    }
+
+    # Highest-scoring sources first
+    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
+        manifest["urls"].append({
+            "url": result["url"],
+            "title": result["title"],
+            "credibility_tier": result["tier"],
+            "credibility_score": result["score"],
+            "source_type": infer_source_type(result["url"]),
+            "vendor": infer_vendor(result["url"])
+        })
+
+    return manifest
+```
+
+## Output Format
+
+Discovery produces a JSON manifest for the crawler:
+
+```json
+{
+  "discovery_date": "2025-01-28T10:30:00",
+  "topic": "prompt engineering",
+  "total_urls": 15,
+  "urls": [
+    {
+      "url": 
"https://docs.anthropic.com/en/docs/prompt-engineering", + "title": "Prompt Engineering Guide", + "credibility_tier": "tier1_official", + "credibility_score": 0.85, + "source_type": "official_docs", + "vendor": "anthropic" + } + ] +} +``` + +## Known Authoritative Sources + +Pre-validated sources for common topics: + +| Vendor | Documentation | Blog/News | GitHub | +|--------|--------------|-----------|--------| +| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics | +| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai | +| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google | + +## Integration + +**Output:** URL manifest JSON → `web-crawler-orchestrator` + +**Database:** Register new sources in `sources` table via `content-repository` + +## Deduplication + +Before outputting, deduplicate URLs: +- Normalize URLs (remove trailing slashes, query params) +- Check against existing `documents` table via `content-repository` +- Merge duplicate entries, keeping highest credibility score diff --git a/custom-skills/90-reference-curator/claude-project/02-web-crawler.md b/custom-skills/90-reference-curator/claude-project/02-web-crawler.md new file mode 100644 index 0000000..a553f40 --- /dev/null +++ b/custom-skills/90-reference-curator/claude-project/02-web-crawler.md @@ -0,0 +1,230 @@ + +# Web Crawler Orchestrator + +Manages crawling operations using Firecrawl MCP with rate limiting and format handling. 
+ +## Prerequisites + +- Firecrawl MCP server connected +- Config file at `~/.config/reference-curator/crawl_config.yaml` +- Storage directory exists: `~/reference-library/raw/` + +## Crawl Configuration + +```yaml +# ~/.config/reference-curator/crawl_config.yaml +firecrawl: + rate_limit: + requests_per_minute: 20 + concurrent_requests: 3 + default_options: + timeout: 30000 + only_main_content: true + include_html: false + +processing: + max_content_size_mb: 50 + raw_content_dir: ~/reference-library/raw/ +``` + +## Crawl Workflow + +### Step 1: Load URL Manifest + +Receive manifest from `reference-discovery`: + +```python +def load_manifest(manifest_path): + with open(manifest_path) as f: + manifest = json.load(f) + return manifest["urls"] +``` + +### Step 2: Determine Crawl Strategy + +```python +def select_strategy(url): + """Select optimal crawl strategy based on URL characteristics.""" + + if url.endswith('.pdf'): + return 'pdf_extract' + elif 'github.com' in url and '/blob/' in url: + return 'raw_content' # Get raw file content + elif 'github.com' in url: + return 'scrape' # Repository pages + elif any(d in url for d in ['docs.', 'documentation']): + return 'scrape' # Documentation sites + else: + return 'scrape' # Default +``` + +### Step 3: Execute Firecrawl + +Use Firecrawl MCP for crawling: + +```python +# Single page scrape +firecrawl_scrape( + url="https://docs.anthropic.com/en/docs/prompt-engineering", + formats=["markdown"], # markdown | html | screenshot + only_main_content=True, + timeout=30000 +) + +# Multi-page crawl (documentation sites) +firecrawl_crawl( + url="https://docs.anthropic.com/en/docs/", + max_depth=2, + limit=50, + formats=["markdown"], + only_main_content=True +) +``` + +### Step 4: Rate Limiting + +```python +import time +from collections import deque + +class RateLimiter: + def __init__(self, requests_per_minute=20): + self.rpm = requests_per_minute + self.request_times = deque() + + def wait_if_needed(self): + now = time.time() + 
# Remove requests older than 1 minute
+        while self.request_times and now - self.request_times[0] > 60:
+            self.request_times.popleft()
+
+        if len(self.request_times) >= self.rpm:
+            wait_time = 60 - (now - self.request_times[0])
+            if wait_time > 0:
+                time.sleep(wait_time)
+
+        self.request_times.append(time.time())
+```
+
+### Step 5: Save Raw Content
+
+```python
+import hashlib
+from datetime import datetime
+from pathlib import Path
+
+def save_content(url, content, content_type='markdown'):
+    """Save crawled content to raw storage."""
+
+    # Generate filename from URL hash
+    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
+
+    # Determine extension
+    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
+    ext = ext_map.get(content_type, '.txt')
+
+    # Create dated subdirectory
+    date_dir = datetime.now().strftime('%Y/%m')
+    output_dir = Path.home() / 'reference-library/raw' / date_dir
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    # Save file (PDF content is bytes, everything else is text)
+    filepath = output_dir / f"{url_hash}{ext}"
+    if content_type == 'pdf':
+        filepath.write_bytes(content)
+    else:
+        filepath.write_text(content, encoding='utf-8')
+
+    return str(filepath)
+```
+
+### Step 6: Generate Crawl Manifest
+
+```python
+def create_crawl_manifest(results):
+    manifest = {
+        "crawl_date": datetime.now().isoformat(),
+        "total_crawled": len([r for r in results if r["status"] == "success"]),
+        "total_failed": len([r for r in results if r["status"] == "failed"]),
+        "documents": []
+    }
+
+    for result in results:
+        manifest["documents"].append({
+            "url": result["url"],
+            "status": result["status"],
+            "raw_content_path": result.get("filepath"),
+            "content_size": result.get("size"),
+            "crawl_method": "firecrawl",
+            "error": result.get("error")
+        })
+
+    return manifest
+```
+
+## Error Handling
+
+| Error | Action |
+|-------|--------|
+| Timeout | Retry once with 2x timeout |
+| Rate limit (429) | Exponential backoff, max 3 retries |
+| Not found (404) | Log and skip |
+| Access denied (403) | Log, mark as `failed` |
+| Connection error | Retry with backoff |
+
+```python
+def crawl_with_retry(url, max_retries=3, timeout=30000):
+    timeout_doubled = False
+    for attempt in range(max_retries):
+        try:
+            result = firecrawl_scrape(url, timeout=timeout)
+            return {"status": "success", "content": result}
+        except RateLimitError:
+            wait = 2 ** attempt * 10  # 10, 20, 40 seconds
+            time.sleep(wait)
+        except TimeoutError:
+            if timeout_doubled:
+                # The doubled timeout also expired; give up on this URL
+                return {"status": "failed", "error": "Timeout"}
+            # Retry once with doubled timeout on the next loop iteration
+            timeout, timeout_doubled = timeout * 2, True
+        except NotFoundError:
+            return {"status": "failed", "error": "404 Not Found"}
+        except Exception as e:
+            if attempt == max_retries - 1:
+                return {"status": "failed", "error": str(e)}
+
+    return {"status": "failed", "error": "Max retries exceeded"}
+```
+
+## Firecrawl MCP Reference
+
+**scrape** - Single page:
+```
+firecrawl_scrape(url, formats, only_main_content, timeout)
+```
+
+**crawl** - Multi-page:
+```
+firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
+```
+
+**map** - Discover URLs:
+```
+firecrawl_map(url, limit)  # Returns list of URLs on site
+```
+
+## Integration
+
+| From | Input | To |
+|------|-------|-----|
+| reference-discovery | URL manifest | web-crawler-orchestrator |
+| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
+| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |
+
+## Output Structure
+
+```
+~/reference-library/raw/
+└── 2025/01/
+    ├── a1b2c3d4e5f60718.md   # Markdown content
+    ├── b2c3d4e5f6071829.md
+    └── c3d4e5f60718293a.pdf  # PDF documents
+```
diff --git a/custom-skills/90-reference-curator/claude-project/03-content-repository.md b/custom-skills/90-reference-curator/claude-project/03-content-repository.md
new file mode 100644
index 0000000..a0ed4eb
--- /dev/null
+++ b/custom-skills/90-reference-curator/claude-project/03-content-repository.md
@@ -0,0 +1,158 @@
+
+# Content Repository
+
+Manages MySQL storage for the reference library system.
Handles document storage, version control, deduplication, and retrieval. + +## Prerequisites + +- MySQL 8.0+ with utf8mb4 charset +- Config file at `~/.config/reference-curator/db_config.yaml` +- Database `reference_library` initialized with schema + +## Quick Reference + +### Connection Setup + +```python +import yaml +import os +from pathlib import Path + +def get_db_config(): + config_path = Path.home() / ".config/reference-curator/db_config.yaml" + with open(config_path) as f: + config = yaml.safe_load(f) + + # Resolve environment variables + mysql = config['mysql'] + return { + 'host': mysql['host'], + 'port': mysql['port'], + 'database': mysql['database'], + 'user': os.environ.get('MYSQL_USER', mysql.get('user', '')), + 'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')), + 'charset': mysql['charset'] + } +``` + +### Core Operations + +**Store New Document:** +```python +def store_document(cursor, source_id, title, url, doc_type, raw_content_path): + sql = """ + INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path) + VALUES (%s, %s, %s, %s, NOW(), 'completed', %s) + ON DUPLICATE KEY UPDATE + version = version + 1, + previous_version_id = doc_id, + crawl_date = NOW(), + raw_content_path = VALUES(raw_content_path) + """ + cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path)) + return cursor.lastrowid +``` + +**Check Duplicate:** +```python +def is_duplicate(cursor, url): + cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,)) + return cursor.fetchone() is not None +``` + +**Get Document by Topic:** +```python +def get_docs_by_topic(cursor, topic_slug, min_quality=0.80): + sql = """ + SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score + FROM documents d + JOIN document_topics dt ON d.doc_id = dt.doc_id + JOIN topics t ON dt.topic_id = t.topic_id + LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id + WHERE t.topic_slug = 
%s + AND (dc.review_status = 'approved' OR dc.review_status IS NULL) + ORDER BY dt.relevance_score DESC + """ + cursor.execute(sql, (topic_slug,)) + return cursor.fetchall() +``` + +## Table Quick Reference + +| Table | Purpose | Key Fields | +|-------|---------|------------| +| `sources` | Authorized content sources | source_type, credibility_tier, vendor | +| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status | +| `distilled_content` | Processed summaries | review_status, compression_ratio | +| `review_logs` | QA decisions | quality_score, decision, refactor_instructions | +| `topics` | Taxonomy | topic_slug, parent_topic_id | +| `document_topics` | Many-to-many linking | relevance_score | +| `export_jobs` | Export tracking | export_type, output_format, status | + +## Status Values + +**crawl_status:** `pending` → `completed` | `failed` | `stale` + +**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected` + +**decision (review):** `approve` | `refactor` | `deep_research` | `reject` + +## Common Queries + +### Find Stale Documents (needs re-crawl) +```sql +SELECT d.doc_id, d.title, d.url, d.crawl_date +FROM documents d +JOIN crawl_schedule cs ON d.source_id = cs.source_id +WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL + CASE cs.frequency + WHEN 'daily' THEN 1 + WHEN 'weekly' THEN 7 + WHEN 'biweekly' THEN 14 + WHEN 'monthly' THEN 30 + END DAY) +AND cs.is_enabled = TRUE; +``` + +### Get Pending Reviews +```sql +SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled +FROM distilled_content dc +JOIN documents d ON dc.doc_id = d.doc_id +WHERE dc.review_status = 'pending' +ORDER BY dc.distill_date ASC; +``` + +### Export-Ready Content +```sql +SELECT d.title, d.url, dc.structured_content, t.topic_slug +FROM documents d +JOIN distilled_content dc ON d.doc_id = dc.doc_id +JOIN document_topics dt ON d.doc_id = dt.doc_id +JOIN topics t ON dt.topic_id = t.topic_id +JOIN review_logs rl ON dc.distill_id = 
rl.distill_id +WHERE rl.decision = 'approve' +AND rl.quality_score >= 0.85 +ORDER BY t.topic_slug, dt.relevance_score DESC; +``` + +## Workflow Integration + +1. **From crawler-orchestrator:** Receive URL + raw content path → `store_document()` +2. **To content-distiller:** Query pending documents → send for processing +3. **From quality-reviewer:** Update `review_status` based on decision +4. **To markdown-exporter:** Query approved content by topic + +## Error Handling + +- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE` +- **Missing source_id:** Validate against `sources` table before insert +- **Connection failure:** Implement retry with exponential backoff + +## Full Schema Reference + +See `references/schema.sql` for complete table definitions including indexes and constraints. + +## Config File Template + +See `references/db_config_template.yaml` for connection configuration template. diff --git a/custom-skills/90-reference-curator/claude-project/04-content-distiller.md b/custom-skills/90-reference-curator/claude-project/04-content-distiller.md new file mode 100644 index 0000000..1b06eb7 --- /dev/null +++ b/custom-skills/90-reference-curator/claude-project/04-content-distiller.md @@ -0,0 +1,234 @@ + +# Content Distiller + +Transforms raw crawled content into structured, high-quality reference materials. + +## Distillation Goals + +1. **Compress** - Reduce token count while preserving essential information +2. **Structure** - Organize content for easy retrieval and reference +3. **Extract** - Pull out code snippets, key concepts, and actionable patterns +4. 
**Annotate** - Add metadata for searchability and categorization + +## Distillation Workflow + +### Step 1: Load Raw Content + +```python +def load_for_distillation(cursor): + """Get documents ready for distillation.""" + sql = """ + SELECT d.doc_id, d.title, d.url, d.raw_content_path, + d.doc_type, s.source_type, s.credibility_tier + FROM documents d + JOIN sources s ON d.source_id = s.source_id + LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id + WHERE d.crawl_status = 'completed' + AND dc.distill_id IS NULL + ORDER BY s.credibility_tier ASC + """ + cursor.execute(sql) + return cursor.fetchall() +``` + +### Step 2: Analyze Content Structure + +Identify content type and select appropriate distillation strategy: + +```python +def analyze_structure(content, doc_type): + """Analyze document structure for distillation.""" + analysis = { + "has_code_blocks": bool(re.findall(r'```[\s\S]*?```', content)), + "has_headers": bool(re.findall(r'^#+\s', content, re.MULTILINE)), + "has_lists": bool(re.findall(r'^\s*[-*]\s', content, re.MULTILINE)), + "has_tables": bool(re.findall(r'\|.*\|', content)), + "estimated_tokens": len(content.split()) * 1.3, # Rough estimate + "section_count": len(re.findall(r'^#+\s', content, re.MULTILINE)) + } + return analysis +``` + +### Step 3: Extract Key Components + +**Extract Code Snippets:** +```python +def extract_code_snippets(content): + """Extract all code blocks with language tags.""" + pattern = r'```(\w*)\n([\s\S]*?)```' + snippets = [] + for match in re.finditer(pattern, content): + snippets.append({ + "language": match.group(1) or "text", + "code": match.group(2).strip(), + "context": get_surrounding_text(content, match.start(), 200) + }) + return snippets +``` + +**Extract Key Concepts:** +```python +def extract_key_concepts(content, title): + """Use Claude to extract key concepts and definitions.""" + prompt = f""" + Analyze this document and extract key concepts: + + Title: {title} + Content: {content[:8000]} # Limit for 
context + + Return JSON with: + - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}] + - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}] + - best_practices: ["..."] + """ + # Use Claude API to process + return claude_extract(prompt) +``` + +### Step 4: Create Structured Summary + +**Summary Template:** +```markdown +# {title} + +**Source:** {url} +**Type:** {source_type} | **Tier:** {credibility_tier} +**Distilled:** {date} + +## Executive Summary +{2-3 sentence overview} + +## Key Concepts +{bulleted list of core concepts with brief definitions} + +## Techniques & Patterns +{extracted techniques with use cases} + +## Code Examples +{relevant code snippets with context} + +## Best Practices +{actionable recommendations} + +## Related Topics +{links to related content in library} +``` + +### Step 5: Optimize for Tokens + +```python +def optimize_content(structured_content, target_ratio=0.30): + """ + Compress content to target ratio while preserving quality. + Target: 30% of original token count. 
+ """ + original_tokens = count_tokens(structured_content) + target_tokens = int(original_tokens * target_ratio) + + # Prioritized compression strategies + strategies = [ + remove_redundant_explanations, + condense_examples, + merge_similar_sections, + trim_verbose_descriptions + ] + + optimized = structured_content + for strategy in strategies: + if count_tokens(optimized) > target_tokens: + optimized = strategy(optimized) + + return optimized +``` + +### Step 6: Store Distilled Content + +```python +def store_distilled(cursor, doc_id, summary, key_concepts, + code_snippets, structured_content, + original_tokens, distilled_tokens): + sql = """ + INSERT INTO distilled_content + (doc_id, summary, key_concepts, code_snippets, structured_content, + token_count_original, token_count_distilled, distill_model, review_status) + VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending') + """ + cursor.execute(sql, ( + doc_id, summary, + json.dumps(key_concepts), + json.dumps(code_snippets), + structured_content, + original_tokens, + distilled_tokens + )) + return cursor.lastrowid +``` + +## Distillation Prompts + +**For Prompt Engineering Content:** +``` +Focus on: +1. Specific techniques with before/after examples +2. Why techniques work (not just what) +3. Common pitfalls and how to avoid them +4. Actionable patterns that can be directly applied +``` + +**For API Documentation:** +``` +Focus on: +1. Endpoint specifications and parameters +2. Request/response examples +3. Error codes and handling +4. Rate limits and best practices +``` + +**For Research Papers:** +``` +Focus on: +1. Key findings and conclusions +2. Novel techniques introduced +3. Practical applications +4. 
Limitations and caveats +``` + +## Quality Metrics + +Track compression efficiency: + +| Metric | Target | +|--------|--------| +| Compression Ratio | 25-35% of original | +| Key Concept Coverage | ≥90% of important terms | +| Code Snippet Retention | 100% of relevant examples | +| Readability | Clear, scannable structure | + +## Handling Refactor Requests + +When `quality-reviewer` returns `refactor` decision: + +```python +def handle_refactor(distill_id, instructions): + """Re-distill based on reviewer feedback.""" + # Load original content and existing distillation + original = load_raw_content(distill_id) + existing = load_distilled_content(distill_id) + + # Apply specific improvements based on instructions + improved = apply_improvements(existing, instructions) + + # Update distilled_content + update_distilled(distill_id, improved) + + # Reset review status + set_review_status(distill_id, 'pending') +``` + +## Integration + +| From | Input | To | +|------|-------|-----| +| content-repository | Raw document records | content-distiller | +| content-distiller | Distilled content | quality-reviewer | +| quality-reviewer | Refactor instructions | content-distiller (loop) | diff --git a/custom-skills/90-reference-curator/claude-project/05-quality-reviewer.md b/custom-skills/90-reference-curator/claude-project/05-quality-reviewer.md new file mode 100644 index 0000000..1479c91 --- /dev/null +++ b/custom-skills/90-reference-curator/claude-project/05-quality-reviewer.md @@ -0,0 +1,223 @@ + +# Quality Reviewer + +Evaluates distilled content for quality, routes decisions, and triggers refactoring or additional research when needed. 
+ +## Review Workflow + +``` +[Distilled Content] + │ + ▼ +┌─────────────────┐ +│ Score Criteria │ → accuracy, completeness, clarity, PE quality, usability +└─────────────────┘ + │ + ▼ +┌─────────────────┐ +│ Calculate Total │ → weighted average +└─────────────────┘ + │ + ├── ≥ 0.85 → APPROVE → markdown-exporter + ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions) + ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler-orchestrator (with queries) + └── < 0.40 → REJECT → archive with reason +``` + +## Scoring Criteria + +| Criterion | Weight | Checks | +|-----------|--------|--------| +| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution | +| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases | +| **Clarity** | 0.20 | Clear structure, concise language, logical flow | +| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why | +| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length | + +## Decision Thresholds + +| Score Range | Decision | Action | +|-------------|----------|--------| +| ≥ 0.85 | `approve` | Proceed to export | +| 0.60 - 0.84 | `refactor` | Return to distiller with feedback | +| 0.40 - 0.59 | `deep_research` | Gather more sources, then re-distill | +| < 0.40 | `reject` | Archive, log reason | + +## Review Process + +### Step 1: Load Content for Review + +```python +def get_pending_reviews(cursor): + sql = """ + SELECT dc.distill_id, dc.doc_id, d.title, d.url, + dc.summary, dc.key_concepts, dc.structured_content, + dc.token_count_original, dc.token_count_distilled, + s.credibility_tier + FROM distilled_content dc + JOIN documents d ON dc.doc_id = d.doc_id + JOIN sources s ON d.source_id = s.source_id + WHERE dc.review_status = 'pending' + ORDER BY s.credibility_tier ASC, dc.distill_date ASC + """ + cursor.execute(sql) + return cursor.fetchall() +``` + +### Step 2: Score Each Criterion + +Evaluate content against each criterion 
using this assessment template: + +```python +assessment_template = { + "accuracy": { + "score": 0.0, # 0.00 - 1.00 + "notes": "", + "issues": [] # Specific factual errors if any + }, + "completeness": { + "score": 0.0, + "notes": "", + "missing_topics": [] # Concepts that should be covered + }, + "clarity": { + "score": 0.0, + "notes": "", + "confusing_sections": [] # Sections needing rewrite + }, + "prompt_engineering_quality": { + "score": 0.0, + "notes": "", + "improvements": [] # Specific PE technique gaps + }, + "usability": { + "score": 0.0, + "notes": "", + "suggestions": [] + } +} +``` + +### Step 3: Calculate Final Score + +```python +WEIGHTS = { + "accuracy": 0.25, + "completeness": 0.20, + "clarity": 0.20, + "prompt_engineering_quality": 0.25, + "usability": 0.10 +} + +def calculate_quality_score(assessment): + return sum( + assessment[criterion]["score"] * weight + for criterion, weight in WEIGHTS.items() + ) +``` + +### Step 4: Route Decision + +```python +def determine_decision(score, assessment): + if score >= 0.85: + return "approve", None, None + elif score >= 0.60: + instructions = generate_refactor_instructions(assessment) + return "refactor", instructions, None + elif score >= 0.40: + queries = generate_research_queries(assessment) + return "deep_research", None, queries + else: + return "reject", f"Quality score {score:.2f} below minimum threshold", None + +def generate_refactor_instructions(assessment): + """Extract actionable feedback from low-scoring criteria.""" + instructions = [] + for criterion, data in assessment.items(): + if data["score"] < 0.80: + if data.get("issues"): + instructions.extend(data["issues"]) + if data.get("missing_topics"): + instructions.append(f"Add coverage for: {', '.join(data['missing_topics'])}") + if data.get("improvements"): + instructions.extend(data["improvements"]) + return "\n".join(instructions) + +def generate_research_queries(assessment): + """Generate search queries for content gaps.""" + queries = [] 
+ if assessment["completeness"]["missing_topics"]: + for topic in assessment["completeness"]["missing_topics"]: + queries.append(f"{topic} documentation guide") + if assessment["accuracy"]["issues"]: + queries.append("latest official documentation verification") + return queries +``` + +### Step 5: Log Review Decision + +```python +def log_review(cursor, distill_id, assessment, score, decision, instructions=None, queries=None): + # Get current round number + cursor.execute( + "SELECT COALESCE(MAX(review_round), 0) + 1 FROM review_logs WHERE distill_id = %s", + (distill_id,) + ) + review_round = cursor.fetchone()[0] + + sql = """ + INSERT INTO review_logs + (distill_id, review_round, reviewer_type, quality_score, assessment, + decision, refactor_instructions, research_queries) + VALUES (%s, %s, 'claude_review', %s, %s, %s, %s, %s) + """ + cursor.execute(sql, ( + distill_id, review_round, score, + json.dumps(assessment), decision, instructions, + json.dumps(queries) if queries else None + )) + + # Update distilled_content status + status_map = { + "approve": "approved", + "refactor": "needs_refactor", + "deep_research": "needs_refactor", + "reject": "rejected" + } + cursor.execute( + "UPDATE distilled_content SET review_status = %s WHERE distill_id = %s", + (status_map[decision], distill_id) + ) +``` + +## Prompt Engineering Quality Checklist + +When scoring `prompt_engineering_quality`, verify: + +- [ ] Demonstrates specific techniques (CoT, few-shot, etc.) 
+- [ ] Shows before/after examples +- [ ] Explains *why* techniques work, not just *what* +- [ ] Provides actionable patterns +- [ ] Includes edge cases and failure modes +- [ ] References authoritative sources + +## Auto-Approve Rules + +Tier 1 (official) sources with score ≥ 0.80 may auto-approve without human review if configured: + +```yaml +# In export_config.yaml +quality: + auto_approve_tier1_sources: true + auto_approve_min_score: 0.80 +``` + +## Integration Points + +| From | Action | To | +|------|--------|-----| +| content-distiller | Sends distilled content | quality-reviewer | +| quality-reviewer | APPROVE | markdown-exporter | +| quality-reviewer | REFACTOR + instructions | content-distiller | +| quality-reviewer | DEEP_RESEARCH + queries | web-crawler-orchestrator | diff --git a/custom-skills/90-reference-curator/claude-project/06-markdown-exporter.md b/custom-skills/90-reference-curator/claude-project/06-markdown-exporter.md new file mode 100644 index 0000000..9a09a4c --- /dev/null +++ b/custom-skills/90-reference-curator/claude-project/06-markdown-exporter.md @@ -0,0 +1,290 @@ + +# Markdown Exporter + +Exports approved content as structured markdown files for Claude Projects or fine-tuning. 
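The exporter's topic index template calls `generate_filename(doc['title'])`, which is not defined in this excerpt, and the nested layout shows names such as `01-chain-of-thought.md`. One possible way to derive such names from titles is sketched below. The optional `position` prefix, the `max_length` cap, and the `untitled` fallback are assumptions made for illustration and may not match the author's helper.

```python
import re
import unicodedata


def generate_filename(title, position=None, max_length=60):
    """Turn a document title into a lowercase, hyphenated .md filename."""
    # Fold accented characters to ASCII and drop anything that cannot be represented.
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    # Cap the length (hypothetical limit) and fall back to a placeholder for empty slugs.
    slug = slug[:max_length].rstrip("-") or "untitled"
    # Optional two-digit ordering prefix, as in the nested-by-topic example names.
    prefix = f"{position:02d}-" if position is not None else ""
    return f"{prefix}{slug}.md"
```

Two titles that differ only in punctuation would map to the same name under this scheme, so a real exporter would probably also need a collision check.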
+ +## Export Configuration + +```yaml +# ~/.config/reference-curator/export_config.yaml +output: + base_path: ~/reference-library/exports/ + + project_files: + structure: nested_by_topic # flat | nested_by_topic | nested_by_source + index_file: INDEX.md + include_metadata: true + + fine_tuning: + format: jsonl + max_tokens_per_sample: 4096 + include_system_prompt: true + +quality: + min_score_for_export: 0.80 +``` + +## Export Workflow + +### Step 1: Query Approved Content + +```python +def get_exportable_content(cursor, min_score=0.80, topic_filter=None): + """Get all approved content meeting quality threshold.""" + sql = """ + SELECT d.doc_id, d.title, d.url, + dc.summary, dc.key_concepts, dc.code_snippets, dc.structured_content, + t.topic_slug, t.topic_name, + rl.quality_score, s.credibility_tier, s.vendor + FROM documents d + JOIN distilled_content dc ON d.doc_id = dc.doc_id + JOIN document_topics dt ON d.doc_id = dt.doc_id + JOIN topics t ON dt.topic_id = t.topic_id + JOIN review_logs rl ON dc.distill_id = rl.distill_id + JOIN sources s ON d.source_id = s.source_id + WHERE rl.decision = 'approve' + AND rl.quality_score >= %s + AND rl.review_id = ( + SELECT MAX(review_id) FROM review_logs + WHERE distill_id = dc.distill_id + ) + """ + params = [min_score] + + if topic_filter: + sql += " AND t.topic_slug IN (%s)" % ','.join(['%s'] * len(topic_filter)) + params.extend(topic_filter) + + sql += " ORDER BY t.topic_slug, rl.quality_score DESC" + cursor.execute(sql, params) + return cursor.fetchall() +``` + +### Step 2: Organize by Structure + +**Nested by Topic (recommended):** +``` +exports/ +├── INDEX.md +├── prompt-engineering/ +│ ├── _index.md +│ ├── 01-chain-of-thought.md +│ ├── 02-few-shot-prompting.md +│ └── 03-system-prompts.md +├── claude-models/ +│ ├── _index.md +│ ├── 01-model-comparison.md +│ └── 02-context-windows.md +└── agent-building/ + ├── _index.md + └── 01-tool-use.md +``` + +**Flat Structure:** +``` +exports/ +├── INDEX.md +├── 
prompt-engineering-chain-of-thought.md +├── prompt-engineering-few-shot.md +└── claude-models-comparison.md +``` + +### Step 3: Generate Files + +**Document File Template:** +```python +def generate_document_file(doc, include_metadata=True): + content = [] + + if include_metadata: + content.append("---") + content.append(f"title: {doc['title']}") + content.append(f"source: {doc['url']}") + content.append(f"vendor: {doc['vendor']}") + content.append(f"tier: {doc['credibility_tier']}") + content.append(f"quality_score: {doc['quality_score']:.2f}") + content.append(f"exported: {datetime.now().isoformat()}") + content.append("---") + content.append("") + + content.append(doc['structured_content']) + + return "\n".join(content) +``` + +**Topic Index Template:** +```python +def generate_topic_index(topic_slug, topic_name, documents): + content = [ + f"# {topic_name}", + "", + f"This section contains {len(documents)} reference documents.", + "", + "## Contents", + "" + ] + + for i, doc in enumerate(documents, 1): + filename = f"{i:02d}-{slugify(doc['title'])}.md" # must match the filenames written in Step 4 + content.append(f"{i}. 
[{doc['title']}]({filename})") + + return "\n".join(content) +``` + +**Root INDEX Template:** +```python +def generate_root_index(topics_with_counts, export_date): + content = [ + "# Reference Library", + "", + f"Exported: {export_date}", + "", + "## Topics", + "" + ] + + for topic in topics_with_counts: + content.append(f"- [{topic['name']}]({topic['slug']}/) ({topic['count']} documents)") + + content.extend([ + "", + "## Quality Standards", + "", + "All documents in this library have:", + "- Passed quality review (score ≥ 0.80)", + "- Been distilled for conciseness", + "- Verified source attribution" + ]) + + return "\n".join(content) +``` + +### Step 4: Write Files + +```python +def export_project_files(content_list, config): + base_path = Path(config['output']['base_path']) + structure = config['output']['project_files']['structure'] + + # Group by topic + by_topic = defaultdict(list) + for doc in content_list: + by_topic[doc['topic_slug']].append(doc) + + # Create directories and files + for topic_slug, docs in by_topic.items(): + if structure == 'nested_by_topic': + topic_dir = base_path / topic_slug + topic_dir.mkdir(parents=True, exist_ok=True) + + # Write topic index + topic_index = generate_topic_index(topic_slug, docs[0]['topic_name'], docs) + (topic_dir / '_index.md').write_text(topic_index) + + # Write document files + for i, doc in enumerate(docs, 1): + filename = f"{i:02d}-{slugify(doc['title'])}.md" + file_content = generate_document_file(doc) + (topic_dir / filename).write_text(file_content) + + # Write root INDEX + topics_summary = [ + {"slug": slug, "name": docs[0]['topic_name'], "count": len(docs)} + for slug, docs in by_topic.items() + ] + root_index = generate_root_index(topics_summary, datetime.now().isoformat()) + (base_path / 'INDEX.md').write_text(root_index) +``` + +### Step 5: Fine-tuning Export (Optional) + +```python +def export_fine_tuning_dataset(content_list, config): + """Export as JSONL for fine-tuning.""" + output_path = 
Path(config['output']['base_path']) / 'fine_tuning.jsonl' + max_tokens = config['output']['fine_tuning']['max_tokens_per_sample'] + + with open(output_path, 'w') as f: + for doc in content_list: + sample = { + "messages": [ + { + "role": "system", + "content": "You are an expert on AI and prompt engineering." + }, + { + "role": "user", + "content": f"Explain {doc['title']}" + }, + { + "role": "assistant", + "content": truncate_to_tokens(doc['structured_content'], max_tokens) + } + ], + "metadata": { + "source": doc['url'], + "topic": doc['topic_slug'], + "quality_score": doc['quality_score'] + } + } + f.write(json.dumps(sample) + '\n') +``` + +### Step 6: Log Export Job + +```python +def log_export_job(cursor, export_name, export_type, output_path, + topic_filter, total_docs, total_tokens): + sql = """ + INSERT INTO export_jobs + (export_name, export_type, output_format, topic_filter, output_path, + total_documents, total_tokens, status, started_at, completed_at) + VALUES (%s, %s, 'markdown', %s, %s, %s, %s, 'completed', NOW(), NOW()) + """ + cursor.execute(sql, ( + export_name, export_type, + json.dumps(topic_filter) if topic_filter else None, + str(output_path), total_docs, total_tokens + )) +``` + +## Cross-Reference Generation + +Link related documents: + +```python +def add_cross_references(doc, all_docs): + """Find and link related documents.""" + related = [] + doc_concepts = set(c['term'].lower() for c in doc['key_concepts']) + + for other in all_docs: + if other['doc_id'] == doc['doc_id']: + continue + other_concepts = set(c['term'].lower() for c in other['key_concepts']) + overlap = len(doc_concepts & other_concepts) + if overlap >= 2: + related.append({ + "title": other['title'], + "path": generate_relative_path(doc, other), + "overlap": overlap + }) + + return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5] +``` + +## Output Verification + +After export, verify: +- [ ] All files readable and valid markdown +- [ ] INDEX.md links resolve 
correctly +- [ ] No broken cross-references +- [ ] Total token count matches expectation +- [ ] No duplicate content + +## Integration + +| From | Input | To | +|------|-------|-----| +| quality-reviewer | Approved content IDs | markdown-exporter | +| markdown-exporter | Structured files | Project knowledge / Fine-tuning | diff --git a/custom-skills/90-reference-curator/claude-project/INDEX.md b/custom-skills/90-reference-curator/claude-project/INDEX.md new file mode 100644 index 0000000..1113c34 --- /dev/null +++ b/custom-skills/90-reference-curator/claude-project/INDEX.md @@ -0,0 +1,89 @@ +# Reference Curator - Claude.ai Project Knowledge + +This project knowledge enables Claude to curate, process, and export reference documentation through 6 modular skills. + +## Skills Overview + +| Skill | Purpose | Trigger Phrases | +|-------|---------|-----------------| +| **reference-discovery** | Search & validate authoritative sources | "find references", "search documentation", "discover sources" | +| **web-crawler** | Multi-backend crawling orchestration | "crawl URL", "fetch documents", "scrape pages" | +| **content-repository** | MySQL storage management | "store content", "save to database", "check duplicates" | +| **content-distiller** | Summarize & extract key concepts | "distill content", "summarize document", "extract key concepts" | +| **quality-reviewer** | QA scoring & routing decisions | "review content", "quality check", "assess distilled content" | +| **markdown-exporter** | Export to markdown/JSONL | "export references", "generate project files", "create markdown output" | + +## Workflow + +``` +[Topic Input] + │ + ▼ +┌─────────────────────┐ +│ reference-discovery │ → Search & validate sources +└─────────────────────┘ + │ + ▼ +┌─────────────────────┐ +│ web-crawler │ → Crawl (Firecrawl/Node.js/aiohttp/Scrapy) +└─────────────────────┘ + │ + ▼ +┌─────────────────────┐ +│ content-repository │ → Store in MySQL +└─────────────────────┘ + │ + ▼ 
+┌─────────────────────┐ +│ content-distiller │ → Summarize & extract +└─────────────────────┘ + │ + ▼ +┌─────────────────────┐ +│ quality-reviewer │ → QA loop +└─────────────────────┘ + │ + ├── REFACTOR → content-distiller + ├── DEEP_RESEARCH → web-crawler + │ + ▼ APPROVE +┌─────────────────────┐ +│ markdown-exporter │ → Project files / Fine-tuning +└─────────────────────┘ +``` + +## Quality Scoring Thresholds + +| Score | Decision | Action | +|-------|----------|--------| +| ≥ 0.85 | **Approve** | Ready for export | +| 0.60-0.84 | **Refactor** | Re-distill with feedback | +| 0.40-0.59 | **Deep Research** | Gather more sources | +| < 0.40 | **Reject** | Archive (low quality) | + +## Source Credibility Tiers + +| Tier | Source Type | Examples | +|------|-------------|----------| +| **Tier 1** | Official documentation | docs.anthropic.com, platform.openai.com/docs | +| **Tier 1** | Official engineering blogs | anthropic.com/news, openai.com/blog | +| **Tier 2** | Research papers | arxiv.org papers with citations | +| **Tier 2** | Verified community guides | Official cookbooks, tutorials | +| **Tier 3** | Community content | Blog posts, Stack Overflow | + +## Files in This Project + +- `INDEX.md` - This overview file +- `reference-curator-complete.md` - All 6 skills in one file +- `01-reference-discovery.md` - Source discovery skill +- `02-web-crawler.md` - Crawling orchestration skill +- `03-content-repository.md` - Database storage skill +- `04-content-distiller.md` - Content summarization skill +- `05-quality-reviewer.md` - QA review skill +- `06-markdown-exporter.md` - Export skill + +## Usage + +Upload all files to a Claude.ai Project, or upload only the skills you need. + +For the complete experience, upload `reference-curator-complete.md` which contains all skills in one file. 
diff --git a/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md b/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md new file mode 100644 index 0000000..5500b37 --- /dev/null +++ b/custom-skills/90-reference-curator/claude-project/reference-curator-complete.md @@ -0,0 +1,473 @@ +# Reference Curator - Complete Skill Set + +This document contains all 6 skills for curating, processing, and exporting reference documentation. + +--- + +# 1. Reference Discovery + +Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling. + +## Source Priority Hierarchy + +| Tier | Source Type | Examples | +|------|-------------|----------| +| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs | +| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog | +| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* | +| **Tier 2** | Research papers | arxiv.org, papers with citations | +| **Tier 2** | Verified community guides | Cookbook examples, official tutorials | +| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow | + +## Discovery Workflow + +### Step 1: Define Search Scope + +```python +search_config = { + "topic": "prompt engineering", + "vendors": ["anthropic", "openai", "google"], + "source_types": ["official_docs", "engineering_blog", "github_repo"], + "freshness": "past_year", + "max_results_per_query": 20 +} +``` + +### Step 2: Generate Search Queries + +```python +def generate_queries(topic, vendors): + queries = [] + for vendor in vendors: + queries.append(f"site:docs.{vendor}.com {topic}") + queries.append(f"site:{vendor}.com/docs {topic}") + queries.append(f"site:{vendor}.com/blog {topic}") + queries.append(f"site:github.com/{vendor} {topic}") + queries.append(f"site:arxiv.org {topic}") + return queries +``` + +### Step 3: Validate and Score Sources + 
+```python +def score_source(url, title): + score = 0.0 + if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']): + score += 0.40 # Tier 1 official docs + elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']): + score += 0.30 # Tier 1 official blog/news + elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']): + score += 0.30 # Tier 1 official repos + elif 'arxiv.org' in url: + score += 0.20 # Tier 2 research + else: + score += 0.10 # Tier 3 community + return min(score, 1.0) + +def assign_credibility_tier(score): + if score >= 0.60: + return 'tier1_official' + elif score >= 0.40: + return 'tier2_verified' + else: + return 'tier3_community' +``` + +## Output Format + +```json +{ + "discovery_date": "2025-01-28T10:30:00", + "topic": "prompt engineering", + "total_urls": 15, + "urls": [ + { + "url": "https://docs.anthropic.com/en/docs/prompt-engineering", + "title": "Prompt Engineering Guide", + "credibility_tier": "tier1_official", + "credibility_score": 0.85, + "source_type": "official_docs", + "vendor": "anthropic" + } + ] +} +``` + +--- + +# 2. Web Crawler Orchestrator + +Manages crawling operations using Firecrawl MCP with rate limiting and format handling. 
+ +## Crawl Configuration + +```yaml +firecrawl: + rate_limit: + requests_per_minute: 20 + concurrent_requests: 3 + default_options: + timeout: 30000 + only_main_content: true +``` + +## Crawl Workflow + +### Determine Crawl Strategy + +```python +def select_strategy(url): + if url.endswith('.pdf'): + return 'pdf_extract' + elif 'github.com' in url and '/blob/' in url: + return 'raw_content' + elif any(d in url for d in ['docs.', 'documentation']): + return 'scrape' + else: + return 'scrape' +``` + +### Execute Firecrawl + +```python +# Single page scrape +firecrawl_scrape( + url="https://docs.anthropic.com/en/docs/prompt-engineering", + formats=["markdown"], + only_main_content=True, + timeout=30000 +) + +# Multi-page crawl +firecrawl_crawl( + url="https://docs.anthropic.com/en/docs/", + max_depth=2, + limit=50, + formats=["markdown"] +) +``` + +### Rate Limiting + +```python +class RateLimiter: + def __init__(self, requests_per_minute=20): + self.rpm = requests_per_minute + self.request_times = deque() + + def wait_if_needed(self): + now = time.time() + while self.request_times and now - self.request_times[0] > 60: + self.request_times.popleft() + if len(self.request_times) >= self.rpm: + wait_time = 60 - (now - self.request_times[0]) + if wait_time > 0: + time.sleep(wait_time) + self.request_times.append(time.time()) +``` + +## Error Handling + +| Error | Action | +|-------|--------| +| Timeout | Retry once with 2x timeout | +| Rate limit (429) | Exponential backoff, max 3 retries | +| Not found (404) | Log and skip | +| Access denied (403) | Log, mark as `failed` | + +--- + +# 3. Content Repository + +Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval. 
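Deduplication in this skill compares a `url_hash` column against `SHA2(url, 256)`. The sketch below shows how the same digest could be computed on the Python side, for example to pre-filter a URL manifest before touching the database. It assumes the column holds the lowercase hex SHA-256 of the URL's UTF-8 bytes. The `normalize_url` step is a hypothetical addition, not something the skill specifies, and it only helps if the same normalization is applied before rows are inserted.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical normalization: lowercase the host, drop the fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # urlsplit already lowercases the scheme; the fragment is discarded.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def url_hash(url: str) -> str:
    """Lowercase hex SHA-256 of the URL's UTF-8 bytes.

    Assumed to equal MySQL's SHA2(url, 256) when the connection uses a UTF-8 charset.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def filter_new_urls(urls, known_hashes):
    """Return URLs whose hash is not in `known_hashes`, skipping repeats within the batch."""
    seen = set(known_hashes)
    fresh = []
    for url in urls:
        digest = url_hash(normalize_url(url))
        if digest not in seen:
            seen.add(digest)
            fresh.append(url)
    return fresh
```

Hashing in Python lets a batch be de-duplicated with a single query for existing hashes rather than one lookup per URL, at the cost of keeping both sides' hashing rules in sync.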
+ +## Core Operations + +**Store New Document:** +```python +def store_document(cursor, source_id, title, url, doc_type, raw_content_path): + sql = """ + INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path) + VALUES (%s, %s, %s, %s, NOW(), 'completed', %s) + ON DUPLICATE KEY UPDATE + version = version + 1, + crawl_date = NOW(), + raw_content_path = VALUES(raw_content_path) + """ + cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path)) + return cursor.lastrowid +``` + +**Check Duplicate:** +```python +def is_duplicate(cursor, url): + cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,)) + return cursor.fetchone() is not None +``` + +## Table Quick Reference + +| Table | Purpose | Key Fields | +|-------|---------|------------| +| `sources` | Authorized content sources | source_type, credibility_tier, vendor | +| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status | +| `distilled_content` | Processed summaries | review_status, compression_ratio | +| `review_logs` | QA decisions | quality_score, decision | +| `topics` | Taxonomy | topic_slug, parent_topic_id | + +## Status Values + +- **crawl_status:** `pending` → `completed` | `failed` | `stale` +- **review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected` +- **decision:** `approve` | `refactor` | `deep_research` | `reject` + +--- + +# 4. Content Distiller + +Transforms raw crawled content into structured, high-quality reference materials. + +## Distillation Goals + +1. **Compress** - Reduce token count while preserving essential information +2. **Structure** - Organize content for easy retrieval and reference +3. **Extract** - Pull out code snippets, key concepts, and actionable patterns +4. 
**Annotate** - Add metadata for searchability and categorization + +## Extract Key Components + +**Extract Code Snippets:** +```python +def extract_code_snippets(content): + pattern = r'```(\w*)\n([\s\S]*?)```' + snippets = [] + for match in re.finditer(pattern, content): + snippets.append({ + "language": match.group(1) or "text", + "code": match.group(2).strip(), + "context": get_surrounding_text(content, match.start(), 200) + }) + return snippets +``` + +**Extract Key Concepts:** +```python +def extract_key_concepts(content, title): + prompt = f""" + Analyze this document and extract key concepts: + + Title: {title} + Content: {content[:8000]} + + Return JSON with: + - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}] + - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}] + - best_practices: ["..."] + """ + return claude_extract(prompt) +``` + +## Summary Template + +```markdown +# {title} + +**Source:** {url} +**Type:** {source_type} | **Tier:** {credibility_tier} + +## Executive Summary +{2-3 sentence overview} + +## Key Concepts +{bulleted list of core concepts} + +## Techniques & Patterns +{extracted techniques with use cases} + +## Code Examples +{relevant code snippets} + +## Best Practices +{actionable recommendations} +``` + +## Quality Metrics + +| Metric | Target | +|--------|--------| +| Compression Ratio | 25-35% of original | +| Key Concept Coverage | ≥90% of important terms | +| Code Snippet Retention | 100% of relevant examples | + +--- + +# 5. Quality Reviewer + +Evaluates distilled content, routes decisions, and triggers refactoring or additional research. 
+ +## Review Workflow + +``` +[Distilled Content] + │ + ▼ +┌─────────────────┐ +│ Score Criteria │ → accuracy, completeness, clarity, PE quality, usability +└─────────────────┘ + │ + ├── ≥ 0.85 → APPROVE → markdown-exporter + ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions) + ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries) + └── < 0.40 → REJECT → archive with reason +``` + +## Scoring Criteria + +| Criterion | Weight | Checks | +|-----------|--------|--------| +| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution | +| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases | +| **Clarity** | 0.20 | Clear structure, concise language, logical flow | +| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why | +| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length | + +## Calculate Final Score + +```python +WEIGHTS = { + "accuracy": 0.25, + "completeness": 0.20, + "clarity": 0.20, + "prompt_engineering_quality": 0.25, + "usability": 0.10 +} + +def calculate_quality_score(assessment): + return sum( + assessment[criterion]["score"] * weight + for criterion, weight in WEIGHTS.items() + ) +``` + +## Route Decision + +```python +def determine_decision(score, assessment): + if score >= 0.85: + return "approve", None, None + elif score >= 0.60: + instructions = generate_refactor_instructions(assessment) + return "refactor", instructions, None + elif score >= 0.40: + queries = generate_research_queries(assessment) + return "deep_research", None, queries + else: + return "reject", f"Quality score {score:.2f} below minimum", None +``` + +## Prompt Engineering Quality Checklist + +- [ ] Demonstrates specific techniques (CoT, few-shot, etc.) 
+- [ ] Shows before/after examples +- [ ] Explains *why* techniques work, not just *what* +- [ ] Provides actionable patterns +- [ ] Includes edge cases and failure modes +- [ ] References authoritative sources + +--- + +# 6. Markdown Exporter + +Exports approved content as structured markdown files for Claude Projects or fine-tuning. + +## Export Structure + +**Nested by Topic (recommended):** +``` +exports/ +├── INDEX.md +├── prompt-engineering/ +│ ├── _index.md +│ ├── 01-chain-of-thought.md +│ └── 02-few-shot-prompting.md +├── claude-models/ +│ ├── _index.md +│ └── 01-model-comparison.md +└── agent-building/ + └── 01-tool-use.md +``` + +## Document File Template + +```python +def generate_document_file(doc, include_metadata=True): + content = [] + if include_metadata: + content.append("---") + content.append(f"title: {doc['title']}") + content.append(f"source: {doc['url']}") + content.append(f"vendor: {doc['vendor']}") + content.append(f"tier: {doc['credibility_tier']}") + content.append(f"quality_score: {doc['quality_score']:.2f}") + content.append("---") + content.append("") + content.append(doc['structured_content']) + return "\n".join(content) +``` + +## Fine-tuning Export (JSONL) + +```python +def export_fine_tuning_dataset(content_list, config): + with open('fine_tuning.jsonl', 'w') as f: + for doc in content_list: + sample = { + "messages": [ + {"role": "system", "content": "You are an expert on AI and prompt engineering."}, + {"role": "user", "content": f"Explain {doc['title']}"}, + {"role": "assistant", "content": doc['structured_content']} + ], + "metadata": { + "source": doc['url'], + "topic": doc['topic_slug'], + "quality_score": doc['quality_score'] + } + } + f.write(json.dumps(sample) + '\n') +``` + +## Cross-Reference Generation + +```python +def add_cross_references(doc, all_docs): + related = [] + doc_concepts = set(c['term'].lower() for c in doc['key_concepts']) + + for other in all_docs: + if other['doc_id'] == doc['doc_id']: + continue + 
other_concepts = set(c['term'].lower() for c in other['key_concepts']) + overlap = len(doc_concepts & other_concepts) + if overlap >= 2: + related.append({ + "title": other['title'], + "path": generate_relative_path(doc, other), + "overlap": overlap + }) + + return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5] +``` + +--- + +# Integration Flow + +| From | Output | To | +|------|--------|-----| +| **reference-discovery** | URL manifest | web-crawler | +| **web-crawler** | Raw content + manifest | content-repository | +| **content-repository** | Document records | content-distiller | +| **content-distiller** | Distilled content | quality-reviewer | +| **quality-reviewer** (approve) | Approved IDs | markdown-exporter | +| **quality-reviewer** (refactor) | Instructions | content-distiller | +| **quality-reviewer** (deep_research) | Queries | web-crawler | diff --git a/custom-skills/90-reference-curator/install.sh b/custom-skills/90-reference-curator/install.sh index a19be10..632231e 100755 --- a/custom-skills/90-reference-curator/install.sh +++ b/custom-skills/90-reference-curator/install.sh @@ -717,6 +717,65 @@ EOF post_install } +# ============================================================================ +# Export for Claude.ai Projects +# ============================================================================ + +export_claude_ai() { + print_header + echo -e "${BOLD}Export for Claude.ai Projects${NC}" + echo "" + + local project_dir="$SCRIPT_DIR/claude-project" + + if [[ ! 
-d "$project_dir" ]]; then + print_error "claude-project directory not found" + echo "Expected: $project_dir" + exit 1 + fi + + echo "Available files for Claude.ai Projects:" + echo "" + echo -e " ${CYAN}Consolidated (single file):${NC}" + echo " reference-curator-complete.md - All 6 skills in one file" + echo "" + echo -e " ${CYAN}Individual skills:${NC}" + ls -1 "$project_dir"/*.md 2>/dev/null | while IFS= read -r file; do + local filename=$(basename "$file") + local size=$(du -h "$file" | cut -f1) + if [[ "$filename" != "INDEX.md" && "$filename" != "reference-curator-complete.md" ]]; then + echo " $filename ($size)" + fi + done + echo "" + + echo -e "${BOLD}Upload to Claude.ai:${NC}" + echo "" + echo " 1. Go to https://claude.ai" + echo " 2. Create a new Project or open an existing one" + echo " 3. Click 'Add to project knowledge'" + echo " 4. Upload files from:" + echo -e " ${CYAN}$project_dir${NC}" + echo "" + echo " Recommended: Upload 'reference-curator-complete.md' for full skill set" + echo "" + + if prompt_yes_no "Copy files to a different location?" "n"; then + prompt_with_default "Destination directory" "$HOME/Desktop/reference-curator-claude-ai" "DEST_DIR" + + mkdir -p "$DEST_DIR" + cp "$project_dir"/*.md "$DEST_DIR/" + + print_success "Files copied to $DEST_DIR" + echo "" + echo "Files ready for upload:" + ls -la "$DEST_DIR"/*.md + fi + + echo "" + echo -e "${GREEN}Done!${NC} Upload the files to your Claude.ai Project." 
+} + # ============================================================================ # Entry Point # ============================================================================ @@ -731,6 +790,9 @@ case "${1:-}" in --minimal) install_minimal ;; + --claude-ai) + export_claude_ai + ;; --help|-h) echo "Reference Curator - Portable Installation Script" echo "" @@ -738,6 +800,7 @@ case "${1:-}" in echo " ./install.sh Interactive installation" echo " ./install.sh --check Check installation status" echo " ./install.sh --minimal Firecrawl-only mode (no MySQL)" + echo " ./install.sh --claude-ai Export skills for Claude.ai Projects" echo " ./install.sh --uninstall Remove installation" echo " ./install.sh --help Show this help" ;;