feat(reference-curator): Add Claude.ai Projects export format

Add claude-project/ folder with skill files formatted for upload to
Claude.ai Projects (web interface):

- reference-curator-complete.md: All 6 skills consolidated
- INDEX.md: Overview and workflow documentation
- Individual skill files (01-06) without YAML frontmatter

Add --claude-ai option to install.sh:
- Lists available files for upload
- Optionally copies files to a custom destination directory
- Provides upload instructions for Claude.ai

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 00:33:06 +07:00
parent 8762f68e6e
commit 243b9d851c
10 changed files with 1987 additions and 0 deletions

@@ -60,6 +60,7 @@ cd our-claude-skills/custom-skills/90-reference-curator
| **Full** | `./install.sh` | Interactive setup with MySQL and crawlers |
| **Minimal** | `./install.sh --minimal` | Firecrawl MCP only, no database |
| **Check** | `./install.sh --check` | Verify installation status |
| **Claude.ai** | `./install.sh --claude-ai` | Export skills for Claude.ai Projects |
| **Uninstall** | `./install.sh --uninstall` | Remove installation (preserves data) |
### What Gets Installed
@@ -94,6 +95,38 @@ export CRAWLER_PROJECT_PATH="" # Path to local crawlers (optional)
---
## Claude.ai Projects Installation
To use these skills in Claude.ai (web interface), export the skill files for upload:
```bash
./install.sh --claude-ai
```
This displays available files in `claude-project/` and optionally copies them to a convenient location.
### Files for Upload
| File | Description |
|------|-------------|
| `reference-curator-complete.md` | All 6 skills combined (recommended) |
| `INDEX.md` | Overview and workflow documentation |
| `01-reference-discovery.md` | Source discovery skill |
| `02-web-crawler.md` | Crawling orchestration skill |
| `03-content-repository.md` | Database storage skill |
| `04-content-distiller.md` | Content summarization skill |
| `05-quality-reviewer.md` | QA review skill |
| `06-markdown-exporter.md` | Export skill |
### Upload Instructions
1. Go to [claude.ai](https://claude.ai)
2. Create a new Project or open an existing one
3. Click "Add to project knowledge"
4. Upload `reference-curator-complete.md` (or individual skills as needed)
---
## Architecture
```
@@ -386,6 +419,16 @@ mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library < shar
├── CHANGELOG.md # Version history
├── install.sh # Portable installation script
├── claude-project/ # Files for Claude.ai Projects
│ ├── INDEX.md # Overview
│ ├── reference-curator-complete.md # All skills combined
│ ├── 01-reference-discovery.md
│ ├── 02-web-crawler.md
│ ├── 03-content-repository.md
│ ├── 04-content-distiller.md
│ ├── 05-quality-reviewer.md
│ └── 06-markdown-exporter.md
├── commands/ # Claude Code commands (tracked in git)
│ ├── reference-discovery.md
│ ├── web-crawler.md

@@ -0,0 +1,184 @@
# Reference Discovery
Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.
## Source Priority Hierarchy
| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |
## Discovery Workflow
### Step 1: Define Search Scope
```python
search_config = {
"topic": "prompt engineering",
"vendors": ["anthropic", "openai", "google"],
"source_types": ["official_docs", "engineering_blog", "github_repo"],
"freshness": "past_year", # past_week, past_month, past_year, any
"max_results_per_query": 20
}
```
### Step 2: Generate Search Queries
For a given topic, generate targeted queries:
```python
def generate_queries(topic, vendors):
queries = []
# Official documentation queries
for vendor in vendors:
queries.append(f"site:docs.{vendor}.com {topic}")
queries.append(f"site:{vendor}.com/docs {topic}")
# Engineering blog queries
for vendor in vendors:
queries.append(f"site:{vendor}.com/blog {topic}")
queries.append(f"site:{vendor}.com/news {topic}")
# GitHub queries
for vendor in vendors:
queries.append(f"site:github.com/{vendor} {topic}")
# Research queries
queries.append(f"site:arxiv.org {topic}")
return queries
```
### Step 3: Execute Search
Use web search tool for each query:
```python
def execute_discovery(queries):
results = []
for query in queries:
search_results = web_search(query)
for result in search_results:
results.append({
"url": result.url,
"title": result.title,
"snippet": result.snippet,
"query_used": query
})
return deduplicate_by_url(results)
```
### Step 4: Validate and Score Sources
```python
def score_source(url, title):
score = 0.0
# Domain credibility
if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
score += 0.40 # Tier 1 official docs
elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
score += 0.30 # Tier 1 official blog/news
elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
score += 0.30 # Tier 1 official repos
elif 'arxiv.org' in url:
score += 0.20 # Tier 2 research
else:
score += 0.10 # Tier 3 community
# Freshness signals (from title/snippet)
if any(year in title for year in ['2025', '2024']):
score += 0.20
elif any(year in title for year in ['2023']):
score += 0.10
# Relevance signals
if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
score += 0.15
return min(score, 1.0)
def assign_credibility_tier(score):
if score >= 0.60:
return 'tier1_official'
elif score >= 0.40:
return 'tier2_verified'
else:
return 'tier3_community'
```
### Step 5: Output URL Manifest
```python
def create_manifest(scored_results, topic):
manifest = {
"discovery_date": datetime.now().isoformat(),
"topic": topic,
"total_urls": len(scored_results),
"urls": []
}
for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
manifest["urls"].append({
"url": result["url"],
"title": result["title"],
"credibility_tier": result["tier"],
"credibility_score": result["score"],
"source_type": infer_source_type(result["url"]),
"vendor": infer_vendor(result["url"])
})
return manifest
```
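The `infer_source_type` and `infer_vendor` helpers are assumed; a best-effort sketch consistent with the tier heuristics above:
```python
def infer_vendor(url):
    """Best-effort vendor detection from the URL (illustrative, not exhaustive)."""
    for vendor in ('anthropic', 'openai', 'google'):
        if vendor in url:
            return vendor
    return 'unknown'

def infer_source_type(url):
    """Map a URL onto the source_type values used in search_config."""
    if 'arxiv.org' in url:
        return 'research_paper'
    if 'github.com' in url:
        return 'github_repo'
    if 'docs.' in url or '/docs' in url:
        return 'official_docs'
    if '/blog' in url or '/news' in url:
        return 'engineering_blog'
    return 'community'
```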
## Output Format
Discovery produces a JSON manifest for the crawler:
```json
{
"discovery_date": "2025-01-28T10:30:00",
"topic": "prompt engineering",
"total_urls": 15,
"urls": [
{
"url": "https://docs.anthropic.com/en/docs/prompt-engineering",
"title": "Prompt Engineering Guide",
"credibility_tier": "tier1_official",
"credibility_score": 0.85,
"source_type": "official_docs",
"vendor": "anthropic"
}
]
}
```
## Known Authoritative Sources
Pre-validated sources for common topics:
| Vendor | Documentation | Blog/News | GitHub |
|--------|--------------|-----------|--------|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |
## Integration
**Output:** URL manifest JSON → `web-crawler-orchestrator`
**Database:** Register new sources in `sources` table via `content-repository`
## Deduplication
Before outputting, deduplicate URLs:
- Normalize URLs (remove trailing slashes, query params)
- Check against existing `documents` table via `content-repository`
- Merge duplicate entries, keeping highest credibility score
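A minimal sketch of this normalization and merge, which also supplies the `deduplicate_by_url` helper referenced in Step 3 (entries without a score yet are treated as 0):
```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonical dedup key: lowercase host, no query, no fragment, no trailing slash."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path.rstrip('/'), '', ''))

def deduplicate_by_url(results):
    """Keep one entry per normalized URL, preferring the highest credibility score."""
    best = {}
    for r in results:
        key = normalize_url(r["url"])
        if key not in best or r.get("score", 0) > best[key].get("score", 0):
            best[key] = r
    return list(best.values())
```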

@@ -0,0 +1,230 @@
# Web Crawler Orchestrator
Manages crawling operations using Firecrawl MCP with rate limiting and format handling.
## Prerequisites
- Firecrawl MCP server connected
- Config file at `~/.config/reference-curator/crawl_config.yaml`
- Storage directory exists: `~/reference-library/raw/`
## Crawl Configuration
```yaml
# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
rate_limit:
requests_per_minute: 20
concurrent_requests: 3
default_options:
timeout: 30000
only_main_content: true
include_html: false
processing:
max_content_size_mb: 50
raw_content_dir: ~/reference-library/raw/
```
## Crawl Workflow
### Step 1: Load URL Manifest
Receive manifest from `reference-discovery`:
```python
def load_manifest(manifest_path):
with open(manifest_path) as f:
manifest = json.load(f)
return manifest["urls"]
```
### Step 2: Determine Crawl Strategy
```python
def select_strategy(url):
"""Select optimal crawl strategy based on URL characteristics."""
if url.endswith('.pdf'):
return 'pdf_extract'
elif 'github.com' in url and '/blob/' in url:
return 'raw_content' # Get raw file content
elif 'github.com' in url:
return 'scrape' # Repository pages
elif any(d in url for d in ['docs.', 'documentation']):
return 'scrape' # Documentation sites
else:
return 'scrape' # Default
```
### Step 3: Execute Firecrawl
Use Firecrawl MCP for crawling:
```python
# Single page scrape
firecrawl_scrape(
url="https://docs.anthropic.com/en/docs/prompt-engineering",
formats=["markdown"], # markdown | html | screenshot
only_main_content=True,
timeout=30000
)
# Multi-page crawl (documentation sites)
firecrawl_crawl(
url="https://docs.anthropic.com/en/docs/",
max_depth=2,
limit=50,
formats=["markdown"],
only_main_content=True
)
```
### Step 4: Rate Limiting
```python
import time
from collections import deque
class RateLimiter:
def __init__(self, requests_per_minute=20):
self.rpm = requests_per_minute
self.request_times = deque()
def wait_if_needed(self):
now = time.time()
# Remove requests older than 1 minute
while self.request_times and now - self.request_times[0] > 60:
self.request_times.popleft()
if len(self.request_times) >= self.rpm:
wait_time = 60 - (now - self.request_times[0])
if wait_time > 0:
time.sleep(wait_time)
self.request_times.append(time.time())
```
### Step 5: Save Raw Content
```python
import hashlib
from datetime import datetime
from pathlib import Path
def save_content(url, content, content_type='markdown'):
"""Save crawled content to raw storage."""
# Generate filename from URL hash
url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
# Determine extension
ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
ext = ext_map.get(content_type, '.txt')
# Create dated subdirectory
date_dir = datetime.now().strftime('%Y/%m')
output_dir = Path.home() / 'reference-library/raw' / date_dir
output_dir.mkdir(parents=True, exist_ok=True)
# Save file
filepath = output_dir / f"{url_hash}{ext}"
if content_type == 'pdf':
filepath.write_bytes(content)
else:
filepath.write_text(content, encoding='utf-8')
return str(filepath)
```
### Step 6: Generate Crawl Manifest
```python
def create_crawl_manifest(results):
manifest = {
"crawl_date": datetime.now().isoformat(),
"total_crawled": len([r for r in results if r["status"] == "success"]),
"total_failed": len([r for r in results if r["status"] == "failed"]),
"documents": []
}
for result in results:
manifest["documents"].append({
"url": result["url"],
"status": result["status"],
"raw_content_path": result.get("filepath"),
"content_size": result.get("size"),
"crawl_method": "firecrawl",
"error": result.get("error")
})
return manifest
```
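A sketch of a driver loop tying Steps 1-6 together; `crawl_with_retry` is defined under Error Handling below, and the other helpers are those introduced above:
```python
def run_crawl(manifest_path):
    """Crawl every URL in a discovery manifest and produce a crawl manifest."""
    urls = load_manifest(manifest_path)
    limiter = RateLimiter(requests_per_minute=20)
    results = []
    for entry in urls:
        limiter.wait_if_needed()
        result = crawl_with_retry(entry["url"])
        result["url"] = entry["url"]
        if result["status"] == "success":
            # Persist raw content and record where it landed
            result["filepath"] = save_content(entry["url"], result["content"])
            result["size"] = len(result["content"])
        results.append(result)
    return create_crawl_manifest(results)
```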
## Error Handling
| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| Connection error | Retry with backoff |
```python
def crawl_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = firecrawl_scrape(url)
            return {"status": "success", "content": result}
        except RateLimitError:
            time.sleep(2 ** attempt * 10)  # 10, 20, 40 seconds
        except TimeoutError:
            if attempt > 0:
                return {"status": "failed", "error": "Timeout"}
            try:
                # Retry once with doubled timeout
                result = firecrawl_scrape(url, timeout=60000)
                return {"status": "success", "content": result}
            except TimeoutError:
                return {"status": "failed", "error": "Timeout after retry"}
        except NotFoundError:
            return {"status": "failed", "error": "404 Not Found"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"status": "failed", "error": str(e)}
    return {"status": "failed", "error": "Max retries exceeded"}
```
## Firecrawl MCP Reference
**scrape** - Single page:
```
firecrawl_scrape(url, formats, only_main_content, timeout)
```
**crawl** - Multi-page:
```
firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
```
**map** - Discover URLs:
```
firecrawl_map(url, limit) # Returns list of URLs on site
```
## Integration
| From | Input | To |
|------|-------|-----|
| reference-discovery | URL manifest | web-crawler-orchestrator |
| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |
## Output Structure
```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4e5f60718.md   # Markdown content
    ├── b2c3d4e5f60718a9.md
    └── c3d4e5f60718a9b0.pdf  # PDF documents
```

@@ -0,0 +1,158 @@
# Content Repository
Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.
## Prerequisites
- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized with schema
## Quick Reference
### Connection Setup
```python
import yaml
import os
from pathlib import Path
def get_db_config():
config_path = Path.home() / ".config/reference-curator/db_config.yaml"
with open(config_path) as f:
config = yaml.safe_load(f)
# Resolve environment variables
mysql = config['mysql']
return {
'host': mysql['host'],
'port': mysql['port'],
'database': mysql['database'],
'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
'charset': mysql['charset']
}
```
### Core Operations
**Store New Document:**
```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
sql = """
INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
ON DUPLICATE KEY UPDATE
version = version + 1,
previous_version_id = doc_id,
crawl_date = NOW(),
raw_content_path = VALUES(raw_content_path)
"""
cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
return cursor.lastrowid
```
**Check Duplicate:**
```python
def is_duplicate(cursor, url):
cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
return cursor.fetchone() is not None
```
**Get Document by Topic:**
```python
def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
sql = """
SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
FROM documents d
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
WHERE t.topic_slug = %s
AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
ORDER BY dt.relevance_score DESC
"""
cursor.execute(sql, (topic_slug,))
return cursor.fetchall()
```
## Table Quick Reference
| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision, refactor_instructions |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many linking | relevance_score |
| `export_jobs` | Export tracking | export_type, output_format, status |
## Status Values
**crawl_status:** `pending` → `completed` | `failed` | `stale`
**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`
**decision (review):** `approve` | `refactor` | `deep_research` | `reject`
## Common Queries
### Find Stale Documents (needs re-crawl)
```sql
SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL
CASE cs.frequency
WHEN 'daily' THEN 1
WHEN 'weekly' THEN 7
WHEN 'biweekly' THEN 14
WHEN 'monthly' THEN 30
END DAY)
AND cs.is_enabled = TRUE;
```
### Get Pending Reviews
```sql
SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;
```
### Export-Ready Content
```sql
SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;
```
## Workflow Integration
1. **From crawler-orchestrator:** Receive URL + raw content path → `store_document()`
2. **To content-distiller:** Query pending documents → send for processing
3. **From quality-reviewer:** Update `review_status` based on decision
4. **To markdown-exporter:** Query approved content by topic
## Error Handling
- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
- **Missing source_id:** Validate against `sources` table before insert
- **Connection failure:** Retry with exponential backoff, as sketched below
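A minimal sketch of that retry using `mysql-connector-python` (the driver choice is an assumption; any DB-API client works the same way):
```python
import time
import mysql.connector

def connect_with_retry(config, max_retries=3):
    """Open a MySQL connection, backing off exponentially between attempts."""
    for attempt in range(max_retries):
        try:
            return mysql.connector.connect(**config)
        except mysql.connector.Error:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
```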
## Full Schema Reference
See `references/schema.sql` for complete table definitions including indexes and constraints.
## Config File Template
See `references/db_config_template.yaml` for connection configuration template.

@@ -0,0 +1,234 @@
# Content Distiller
Transforms raw crawled content into structured, high-quality reference materials.
## Distillation Goals
1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization
## Distillation Workflow
### Step 1: Load Raw Content
```python
def load_for_distillation(cursor):
"""Get documents ready for distillation."""
sql = """
SELECT d.doc_id, d.title, d.url, d.raw_content_path,
d.doc_type, s.source_type, s.credibility_tier
FROM documents d
JOIN sources s ON d.source_id = s.source_id
LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
WHERE d.crawl_status = 'completed'
AND dc.distill_id IS NULL
ORDER BY s.credibility_tier ASC
"""
cursor.execute(sql)
return cursor.fetchall()
```
### Step 2: Analyze Content Structure
Identify content type and select appropriate distillation strategy:
```python
def analyze_structure(content, doc_type):
"""Analyze document structure for distillation."""
analysis = {
"has_code_blocks": bool(re.findall(r'```[\s\S]*?```', content)),
"has_headers": bool(re.findall(r'^#+\s', content, re.MULTILINE)),
"has_lists": bool(re.findall(r'^\s*[-*]\s', content, re.MULTILINE)),
"has_tables": bool(re.findall(r'\|.*\|', content)),
"estimated_tokens": len(content.split()) * 1.3, # Rough estimate
"section_count": len(re.findall(r'^#+\s', content, re.MULTILINE))
}
return analysis
```
### Step 3: Extract Key Components
**Extract Code Snippets:**
```python
def extract_code_snippets(content):
"""Extract all code blocks with language tags."""
pattern = r'```(\w*)\n([\s\S]*?)```'
snippets = []
for match in re.finditer(pattern, content):
snippets.append({
"language": match.group(1) or "text",
"code": match.group(2).strip(),
"context": get_surrounding_text(content, match.start(), 200)
})
return snippets
```
**Extract Key Concepts:**
```python
def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    # Truncate to the first 8000 characters to keep the prompt within context limits
    prompt = f"""
    Analyze this document and extract key concepts:
    Title: {title}
    Content: {content[:8000]}
    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    # Use Claude API to process
    return claude_extract(prompt)
```
### Step 4: Create Structured Summary
**Summary Template:**
```markdown
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}
## Executive Summary
{2-3 sentence overview}
## Key Concepts
{bulleted list of core concepts with brief definitions}
## Techniques & Patterns
{extracted techniques with use cases}
## Code Examples
{relevant code snippets with context}
## Best Practices
{actionable recommendations}
## Related Topics
{links to related content in library}
```
### Step 5: Optimize for Tokens
```python
def optimize_content(structured_content, target_ratio=0.30):
"""
Compress content to target ratio while preserving quality.
Target: 30% of original token count.
"""
original_tokens = count_tokens(structured_content)
target_tokens = int(original_tokens * target_ratio)
# Prioritized compression strategies
strategies = [
remove_redundant_explanations,
condense_examples,
merge_similar_sections,
trim_verbose_descriptions
]
optimized = structured_content
for strategy in strategies:
if count_tokens(optimized) > target_tokens:
optimized = strategy(optimized)
return optimized
```
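`count_tokens` and the strategy functions are assumed helpers; a rough counter consistent with the ~1.3 tokens-per-word estimate from Step 2:
```python
def count_tokens(text):
    """Approximate token count (~1.3 tokens per whitespace-delimited word)."""
    return int(len(text.split()) * 1.3)
```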
### Step 6: Store Distilled Content
```python
def store_distilled(cursor, doc_id, summary, key_concepts,
code_snippets, structured_content,
original_tokens, distilled_tokens):
sql = """
INSERT INTO distilled_content
(doc_id, summary, key_concepts, code_snippets, structured_content,
token_count_original, token_count_distilled, distill_model, review_status)
VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
"""
cursor.execute(sql, (
doc_id, summary,
json.dumps(key_concepts),
json.dumps(code_snippets),
structured_content,
original_tokens,
distilled_tokens
))
return cursor.lastrowid
```
## Distillation Prompts
**For Prompt Engineering Content:**
```
Focus on:
1. Specific techniques with before/after examples
2. Why techniques work (not just what)
3. Common pitfalls and how to avoid them
4. Actionable patterns that can be directly applied
```
**For API Documentation:**
```
Focus on:
1. Endpoint specifications and parameters
2. Request/response examples
3. Error codes and handling
4. Rate limits and best practices
```
**For Research Papers:**
```
Focus on:
1. Key findings and conclusions
2. Novel techniques introduced
3. Practical applications
4. Limitations and caveats
```
## Quality Metrics
Track compression efficiency:
| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
## Handling Refactor Requests
When `quality-reviewer` returns `refactor` decision:
```python
def handle_refactor(distill_id, instructions):
"""Re-distill based on reviewer feedback."""
# Load original content and existing distillation
original = load_raw_content(distill_id)
existing = load_distilled_content(distill_id)
# Apply specific improvements based on instructions
improved = apply_improvements(existing, instructions)
# Update distilled_content
update_distilled(distill_id, improved)
# Reset review status
set_review_status(distill_id, 'pending')
```
## Integration
| From | Input | To |
|------|-------|-----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |

@@ -0,0 +1,223 @@
# Quality Reviewer
Evaluates distilled content for quality, routes decisions, and triggers refactoring or additional research when needed.
## Review Workflow
```
[Distilled Content]
┌─────────────────┐
│ Score Criteria │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
┌─────────────────┐
│ Calculate Total │ → weighted average
└─────────────────┘
├── ≥ 0.85 → APPROVE → markdown-exporter
├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
├── 0.40-0.59 → DEEP_RESEARCH → web-crawler-orchestrator (with queries)
└── < 0.40 → REJECT → archive with reason
```
## Scoring Criteria
| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |
## Decision Thresholds
| Score Range | Decision | Action |
|-------------|----------|--------|
| ≥ 0.85 | `approve` | Proceed to export |
| 0.60 - 0.84 | `refactor` | Return to distiller with feedback |
| 0.40 - 0.59 | `deep_research` | Gather more sources, then re-distill |
| < 0.40 | `reject` | Archive, log reason |
## Review Process
### Step 1: Load Content for Review
```python
def get_pending_reviews(cursor):
sql = """
SELECT dc.distill_id, dc.doc_id, d.title, d.url,
dc.summary, dc.key_concepts, dc.structured_content,
dc.token_count_original, dc.token_count_distilled,
s.credibility_tier
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
JOIN sources s ON d.source_id = s.source_id
WHERE dc.review_status = 'pending'
ORDER BY s.credibility_tier ASC, dc.distill_date ASC
"""
cursor.execute(sql)
return cursor.fetchall()
```
### Step 2: Score Each Criterion
Evaluate content against each criterion using this assessment template:
```python
assessment_template = {
"accuracy": {
"score": 0.0, # 0.00 - 1.00
"notes": "",
"issues": [] # Specific factual errors if any
},
"completeness": {
"score": 0.0,
"notes": "",
"missing_topics": [] # Concepts that should be covered
},
"clarity": {
"score": 0.0,
"notes": "",
"confusing_sections": [] # Sections needing rewrite
},
"prompt_engineering_quality": {
"score": 0.0,
"notes": "",
"improvements": [] # Specific PE technique gaps
},
"usability": {
"score": 0.0,
"notes": "",
"suggestions": []
}
}
```
### Step 3: Calculate Final Score
```python
WEIGHTS = {
"accuracy": 0.25,
"completeness": 0.20,
"clarity": 0.20,
"prompt_engineering_quality": 0.25,
"usability": 0.10
}
def calculate_quality_score(assessment):
return sum(
assessment[criterion]["score"] * weight
for criterion, weight in WEIGHTS.items()
)
```
### Step 4: Route Decision
```python
def determine_decision(score, assessment):
if score >= 0.85:
return "approve", None, None
elif score >= 0.60:
instructions = generate_refactor_instructions(assessment)
return "refactor", instructions, None
elif score >= 0.40:
queries = generate_research_queries(assessment)
return "deep_research", None, queries
else:
return "reject", f"Quality score {score:.2f} below minimum threshold", None
def generate_refactor_instructions(assessment):
"""Extract actionable feedback from low-scoring criteria."""
instructions = []
for criterion, data in assessment.items():
if data["score"] < 0.80:
if data.get("issues"):
instructions.extend(data["issues"])
if data.get("missing_topics"):
instructions.append(f"Add coverage for: {', '.join(data['missing_topics'])}")
if data.get("improvements"):
instructions.extend(data["improvements"])
return "\n".join(instructions)
def generate_research_queries(assessment):
"""Generate search queries for content gaps."""
queries = []
if assessment["completeness"]["missing_topics"]:
for topic in assessment["completeness"]["missing_topics"]:
queries.append(f"{topic} documentation guide")
if assessment["accuracy"]["issues"]:
queries.append("latest official documentation verification")
return queries
```
### Step 5: Log Review Decision
```python
def log_review(cursor, distill_id, assessment, score, decision, instructions=None, queries=None):
# Get current round number
cursor.execute(
"SELECT COALESCE(MAX(review_round), 0) + 1 FROM review_logs WHERE distill_id = %s",
(distill_id,)
)
review_round = cursor.fetchone()[0]
sql = """
INSERT INTO review_logs
(distill_id, review_round, reviewer_type, quality_score, assessment,
decision, refactor_instructions, research_queries)
VALUES (%s, %s, 'claude_review', %s, %s, %s, %s, %s)
"""
cursor.execute(sql, (
distill_id, review_round, score,
json.dumps(assessment), decision, instructions,
json.dumps(queries) if queries else None
))
# Update distilled_content status
status_map = {
"approve": "approved",
"refactor": "needs_refactor",
"deep_research": "needs_refactor",
"reject": "rejected"
}
cursor.execute(
"UPDATE distilled_content SET review_status = %s WHERE distill_id = %s",
(status_map[decision], distill_id)
)
```
## Prompt Engineering Quality Checklist
When scoring `prompt_engineering_quality`, verify:
- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
- [ ] Shows before/after examples
- [ ] Explains *why* techniques work, not just *what*
- [ ] Provides actionable patterns
- [ ] Includes edge cases and failure modes
- [ ] References authoritative sources
## Auto-Approve Rules
Tier 1 (official) sources with score ≥ 0.80 may auto-approve without human review if configured:
```yaml
# In export_config.yaml
quality:
auto_approve_tier1_sources: true
auto_approve_min_score: 0.80
```
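Applied after scoring, the rule might look like the following sketch (field names follow the config above):
```python
def maybe_auto_approve(decision, score, credibility_tier, config):
    """Upgrade the decision to approve for Tier 1 sources when configured."""
    quality = config.get("quality", {})
    if (quality.get("auto_approve_tier1_sources")
            and credibility_tier == "tier1_official"
            and score >= quality.get("auto_approve_min_score", 0.80)):
        return "approve"
    return decision
```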
## Integration Points
| From | Action | To |
|------|--------|-----|
| content-distiller | Sends distilled content | quality-reviewer |
| quality-reviewer | APPROVE | markdown-exporter |
| quality-reviewer | REFACTOR + instructions | content-distiller |
| quality-reviewer | DEEP_RESEARCH + queries | web-crawler-orchestrator |

@@ -0,0 +1,290 @@
# Markdown Exporter
Exports approved content as structured markdown files for Claude Projects or fine-tuning.
## Export Configuration
```yaml
# ~/.config/reference-curator/export_config.yaml
output:
base_path: ~/reference-library/exports/
project_files:
structure: nested_by_topic # flat | nested_by_topic | nested_by_source
index_file: INDEX.md
include_metadata: true
fine_tuning:
format: jsonl
max_tokens_per_sample: 4096
include_system_prompt: true
quality:
min_score_for_export: 0.80
```
## Export Workflow
### Step 1: Query Approved Content
```python
def get_exportable_content(cursor, min_score=0.80, topic_filter=None):
"""Get all approved content meeting quality threshold."""
sql = """
SELECT d.doc_id, d.title, d.url,
dc.summary, dc.key_concepts, dc.code_snippets, dc.structured_content,
t.topic_slug, t.topic_name,
rl.quality_score, s.credibility_tier, s.vendor
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
JOIN sources s ON d.source_id = s.source_id
WHERE rl.decision = 'approve'
AND rl.quality_score >= %s
AND rl.review_id = (
SELECT MAX(review_id) FROM review_logs
WHERE distill_id = dc.distill_id
)
"""
params = [min_score]
if topic_filter:
sql += " AND t.topic_slug IN (%s)" % ','.join(['%s'] * len(topic_filter))
params.extend(topic_filter)
sql += " ORDER BY t.topic_slug, rl.quality_score DESC"
cursor.execute(sql, params)
return cursor.fetchall()
```
### Step 2: Organize by Structure
**Nested by Topic (recommended):**
```
exports/
├── INDEX.md
├── prompt-engineering/
│ ├── _index.md
│ ├── 01-chain-of-thought.md
│ ├── 02-few-shot-prompting.md
│ └── 03-system-prompts.md
├── claude-models/
│ ├── _index.md
│ ├── 01-model-comparison.md
│ └── 02-context-windows.md
└── agent-building/
├── _index.md
└── 01-tool-use.md
```
**Flat Structure:**
```
exports/
├── INDEX.md
├── prompt-engineering-chain-of-thought.md
├── prompt-engineering-few-shot.md
└── claude-models-comparison.md
```
### Step 3: Generate Files
**Document File Template:**
```python
def generate_document_file(doc, include_metadata=True):
content = []
if include_metadata:
content.append("---")
content.append(f"title: {doc['title']}")
content.append(f"source: {doc['url']}")
content.append(f"vendor: {doc['vendor']}")
content.append(f"tier: {doc['credibility_tier']}")
content.append(f"quality_score: {doc['quality_score']:.2f}")
content.append(f"exported: {datetime.now().isoformat()}")
content.append("---")
content.append("")
content.append(doc['structured_content'])
return "\n".join(content)
```
**Topic Index Template:**
```python
def generate_topic_index(topic_slug, topic_name, documents):
content = [
f"# {topic_name}",
"",
f"This section contains {len(documents)} reference documents.",
"",
"## Contents",
""
]
for i, doc in enumerate(documents, 1):
        filename = f"{i:02d}-{slugify(doc['title'])}.md"  # must match names written in Step 4
content.append(f"{i}. [{doc['title']}]({filename})")
return "\n".join(content)
```
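The `slugify` helper used throughout this section is assumed; a minimal sketch:
```python
import re

def slugify(title):
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')
```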
**Root INDEX Template:**
```python
def generate_root_index(topics_with_counts, export_date):
content = [
"# Reference Library",
"",
f"Exported: {export_date}",
"",
"## Topics",
""
]
for topic in topics_with_counts:
content.append(f"- [{topic['name']}]({topic['slug']}/) ({topic['count']} documents)")
content.extend([
"",
"## Quality Standards",
"",
"All documents in this library have:",
"- Passed quality review (score ≥ 0.80)",
"- Been distilled for conciseness",
"- Verified source attribution"
])
return "\n".join(content)
```
### Step 4: Write Files
```python
from collections import defaultdict
from pathlib import Path

def export_project_files(content_list, config):
base_path = Path(config['output']['base_path'])
structure = config['output']['project_files']['structure']
# Group by topic
by_topic = defaultdict(list)
for doc in content_list:
by_topic[doc['topic_slug']].append(doc)
# Create directories and files
for topic_slug, docs in by_topic.items():
if structure == 'nested_by_topic':
topic_dir = base_path / topic_slug
topic_dir.mkdir(parents=True, exist_ok=True)
# Write topic index
topic_index = generate_topic_index(topic_slug, docs[0]['topic_name'], docs)
(topic_dir / '_index.md').write_text(topic_index)
# Write document files
for i, doc in enumerate(docs, 1):
filename = f"{i:02d}-{slugify(doc['title'])}.md"
file_content = generate_document_file(doc)
(topic_dir / filename).write_text(file_content)
# Write root INDEX
topics_summary = [
{"slug": slug, "name": docs[0]['topic_name'], "count": len(docs)}
for slug, docs in by_topic.items()
]
root_index = generate_root_index(topics_summary, datetime.now().isoformat())
(base_path / 'INDEX.md').write_text(root_index)
```
### Step 5: Fine-tuning Export (Optional)
```python
def export_fine_tuning_dataset(content_list, config):
"""Export as JSONL for fine-tuning."""
output_path = Path(config['output']['base_path']) / 'fine_tuning.jsonl'
max_tokens = config['output']['fine_tuning']['max_tokens_per_sample']
with open(output_path, 'w') as f:
for doc in content_list:
sample = {
"messages": [
{
"role": "system",
"content": "You are an expert on AI and prompt engineering."
},
{
"role": "user",
"content": f"Explain {doc['title']}"
},
{
"role": "assistant",
"content": truncate_to_tokens(doc['structured_content'], max_tokens)
}
],
"metadata": {
"source": doc['url'],
"topic": doc['topic_slug'],
"quality_score": doc['quality_score']
}
}
f.write(json.dumps(sample) + '\n')
```
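`truncate_to_tokens` is assumed; a rough sketch using the same words-to-tokens approximation as the distiller:
```python
def truncate_to_tokens(text, max_tokens):
    """Approximate truncation at ~1.3 tokens per whitespace-delimited word."""
    max_words = int(max_tokens / 1.3)
    words = text.split()
    return text if len(words) <= max_words else " ".join(words[:max_words])
```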
### Step 6: Log Export Job
```python
def log_export_job(cursor, export_name, export_type, output_path,
topic_filter, total_docs, total_tokens):
sql = """
INSERT INTO export_jobs
(export_name, export_type, output_format, topic_filter, output_path,
total_documents, total_tokens, status, started_at, completed_at)
VALUES (%s, %s, 'markdown', %s, %s, %s, %s, 'completed', NOW(), NOW())
"""
cursor.execute(sql, (
export_name, export_type,
json.dumps(topic_filter) if topic_filter else None,
str(output_path), total_docs, total_tokens
))
```
## Cross-Reference Generation
Link related documents:
```python
def add_cross_references(doc, all_docs):
"""Find and link related documents."""
related = []
doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])
for other in all_docs:
if other['doc_id'] == doc['doc_id']:
continue
other_concepts = set(c['term'].lower() for c in other['key_concepts'])
overlap = len(doc_concepts & other_concepts)
if overlap >= 2:
related.append({
"title": other['title'],
"path": generate_relative_path(doc, other),
"overlap": overlap
})
return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
```
## Output Verification
After export, verify:
- [ ] All files readable and valid markdown
- [ ] INDEX.md links resolve correctly
- [ ] No broken cross-references
- [ ] Total token count matches expectation
- [ ] No duplicate content
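The link checks can be automated with a small script; a sketch that validates relative markdown links (external URLs and anchored links are skipped):
```python
import re
from pathlib import Path

def find_broken_links(export_root):
    """Return (file, target) pairs whose relative markdown links don't resolve."""
    broken = []
    for md in Path(export_root).rglob('*.md'):
        for target in re.findall(r'\]\(([^)#\s]+)\)', md.read_text(encoding='utf-8')):
            if target.startswith(('http://', 'https://')):
                continue
            if not (md.parent / target).exists():
                broken.append((str(md), target))
    return broken
```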
## Integration
| From | Input | To |
|------|-------|-----|
| quality-reviewer | Approved content IDs | markdown-exporter |
| markdown-exporter | Structured files | Project knowledge / Fine-tuning |

@@ -0,0 +1,89 @@
# Reference Curator - Claude.ai Project Knowledge
This project knowledge enables Claude to curate, process, and export reference documentation through 6 modular skills.
## Skills Overview
| Skill | Purpose | Trigger Phrases |
|-------|---------|-----------------|
| **reference-discovery** | Search & validate authoritative sources | "find references", "search documentation", "discover sources" |
| **web-crawler** | Multi-backend crawling orchestration | "crawl URL", "fetch documents", "scrape pages" |
| **content-repository** | MySQL storage management | "store content", "save to database", "check duplicates" |
| **content-distiller** | Summarize & extract key concepts | "distill content", "summarize document", "extract key concepts" |
| **quality-reviewer** | QA scoring & routing decisions | "review content", "quality check", "assess distilled content" |
| **markdown-exporter** | Export to markdown/JSONL | "export references", "generate project files", "create markdown output" |
## Workflow
```
[Topic Input]
┌─────────────────────┐
│ reference-discovery │ → Search & validate sources
└─────────────────────┘
┌─────────────────────┐
│ web-crawler │ → Crawl (Firecrawl/Node.js/aiohttp/Scrapy)
└─────────────────────┘
┌─────────────────────┐
│ content-repository │ → Store in MySQL
└─────────────────────┘
┌─────────────────────┐
│ content-distiller │ → Summarize & extract
└─────────────────────┘
┌─────────────────────┐
│ quality-reviewer │ → QA loop
└─────────────────────┘
├── REFACTOR → content-distiller
├── DEEP_RESEARCH → web-crawler
▼ APPROVE
┌─────────────────────┐
│ markdown-exporter │ → Project files / Fine-tuning
└─────────────────────┘
```
## Quality Scoring Thresholds
| Score | Decision | Action |
|-------|----------|--------|
| ≥ 0.85 | **Approve** | Ready for export |
| 0.60-0.84 | **Refactor** | Re-distill with feedback |
| 0.40-0.59 | **Deep Research** | Gather more sources |
| < 0.40 | **Reject** | Archive (low quality) |
## Source Credibility Tiers
| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, platform.openai.com/docs |
| **Tier 1** | Official engineering blogs | anthropic.com/news, openai.com/blog |
| **Tier 2** | Research papers | arxiv.org papers with citations |
| **Tier 2** | Verified community guides | Official cookbooks, tutorials |
| **Tier 3** | Community content | Blog posts, Stack Overflow |
## Files in This Project
- `INDEX.md` - This overview file
- `reference-curator-complete.md` - All 6 skills in one file
- `01-reference-discovery.md` - Source discovery skill
- `02-web-crawler.md` - Crawling orchestration skill
- `03-content-repository.md` - Database storage skill
- `04-content-distiller.md` - Content summarization skill
- `05-quality-reviewer.md` - QA review skill
- `06-markdown-exporter.md` - Export skill
## Usage
Upload all files to a Claude.ai Project, or upload only the skills you need.
For the complete experience, upload `reference-curator-complete.md` which contains all skills in one file.

@@ -0,0 +1,473 @@
# Reference Curator - Complete Skill Set
This document contains all 6 skills for curating, processing, and exporting reference documentation.
---
# 1. Reference Discovery
Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.
## Source Priority Hierarchy
| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |
## Discovery Workflow
### Step 1: Define Search Scope
```python
search_config = {
"topic": "prompt engineering",
"vendors": ["anthropic", "openai", "google"],
"source_types": ["official_docs", "engineering_blog", "github_repo"],
"freshness": "past_year",
"max_results_per_query": 20
}
```
### Step 2: Generate Search Queries
```python
def generate_queries(topic, vendors):
queries = []
for vendor in vendors:
queries.append(f"site:docs.{vendor}.com {topic}")
queries.append(f"site:{vendor}.com/docs {topic}")
queries.append(f"site:{vendor}.com/blog {topic}")
queries.append(f"site:github.com/{vendor} {topic}")
queries.append(f"site:arxiv.org {topic}")
return queries
```
### Step 3: Validate and Score Sources
```python
def score_source(url, title):
score = 0.0
if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
score += 0.40 # Tier 1 official docs
elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
score += 0.30 # Tier 1 official blog/news
elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
score += 0.30 # Tier 1 official repos
elif 'arxiv.org' in url:
score += 0.20 # Tier 2 research
else:
score += 0.10 # Tier 3 community
return min(score, 1.0)
def assign_credibility_tier(score):
if score >= 0.60:
return 'tier1_official'
elif score >= 0.40:
return 'tier2_verified'
else:
return 'tier3_community'
```
## Output Format
```json
{
"discovery_date": "2025-01-28T10:30:00",
"topic": "prompt engineering",
"total_urls": 15,
"urls": [
{
"url": "https://docs.anthropic.com/en/docs/prompt-engineering",
"title": "Prompt Engineering Guide",
"credibility_tier": "tier1_official",
"credibility_score": 0.85,
"source_type": "official_docs",
"vendor": "anthropic"
}
]
}
```
---
# 2. Web Crawler Orchestrator
Manages crawling operations using Firecrawl MCP with rate limiting and format handling.
## Crawl Configuration
```yaml
firecrawl:
rate_limit:
requests_per_minute: 20
concurrent_requests: 3
default_options:
timeout: 30000
only_main_content: true
```
## Crawl Workflow
### Determine Crawl Strategy
```python
def select_strategy(url):
if url.endswith('.pdf'):
return 'pdf_extract'
elif 'github.com' in url and '/blob/' in url:
return 'raw_content'
elif any(d in url for d in ['docs.', 'documentation']):
return 'scrape'
else:
return 'scrape'
```
### Execute Firecrawl
```python
# Single page scrape
firecrawl_scrape(
url="https://docs.anthropic.com/en/docs/prompt-engineering",
formats=["markdown"],
only_main_content=True,
timeout=30000
)
# Multi-page crawl
firecrawl_crawl(
url="https://docs.anthropic.com/en/docs/",
max_depth=2,
limit=50,
formats=["markdown"]
)
```
### Rate Limiting
```python
import time
from collections import deque

class RateLimiter:
def __init__(self, requests_per_minute=20):
self.rpm = requests_per_minute
self.request_times = deque()
def wait_if_needed(self):
now = time.time()
while self.request_times and now - self.request_times[0] > 60:
self.request_times.popleft()
if len(self.request_times) >= self.rpm:
wait_time = 60 - (now - self.request_times[0])
if wait_time > 0:
time.sleep(wait_time)
self.request_times.append(time.time())
```
## Error Handling
| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
---
# 3. Content Repository
Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.
## Core Operations
**Store New Document:**
```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
sql = """
INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
ON DUPLICATE KEY UPDATE
version = version + 1,
crawl_date = NOW(),
raw_content_path = VALUES(raw_content_path)
"""
cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
return cursor.lastrowid
```
**Check Duplicate:**
```python
def is_duplicate(cursor, url):
cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
return cursor.fetchone() is not None
```
## Table Quick Reference
| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
## Status Values
- **crawl_status:** `pending` → `completed` | `failed` | `stale`
- **review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`
- **decision:** `approve` | `refactor` | `deep_research` | `reject`
---
# 4. Content Distiller
Transforms raw crawled content into structured, high-quality reference materials.
## Distillation Goals
1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization
## Extract Key Components
**Extract Code Snippets:**
```python
def extract_code_snippets(content):
pattern = r'```(\w*)\n([\s\S]*?)```'
snippets = []
for match in re.finditer(pattern, content):
snippets.append({
"language": match.group(1) or "text",
"code": match.group(2).strip(),
"context": get_surrounding_text(content, match.start(), 200)
})
return snippets
```
**Extract Key Concepts:**
```python
def extract_key_concepts(content, title):
prompt = f"""
Analyze this document and extract key concepts:
Title: {title}
Content: {content[:8000]}
Return JSON with:
- concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
- techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
- best_practices: ["..."]
"""
return claude_extract(prompt)
```
## Summary Template
```markdown
# {title}
**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
## Executive Summary
{2-3 sentence overview}
## Key Concepts
{bulleted list of core concepts}
## Techniques & Patterns
{extracted techniques with use cases}
## Code Examples
{relevant code snippets}
## Best Practices
{actionable recommendations}
```
## Quality Metrics
| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
---
# 5. Quality Reviewer
Evaluates distilled content, routes decisions, and triggers refactoring or additional research.
## Review Workflow
```
[Distilled Content]
┌─────────────────┐
│ Score Criteria │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
├── ≥ 0.85 → APPROVE → markdown-exporter
├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
├── 0.40-0.59 → DEEP_RESEARCH → web-crawler (with queries)
└── < 0.40 → REJECT → archive with reason
```
## Scoring Criteria
| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |
## Calculate Final Score
```python
WEIGHTS = {
"accuracy": 0.25,
"completeness": 0.20,
"clarity": 0.20,
"prompt_engineering_quality": 0.25,
"usability": 0.10
}
def calculate_quality_score(assessment):
return sum(
assessment[criterion]["score"] * weight
for criterion, weight in WEIGHTS.items()
)
```
## Route Decision
```python
def determine_decision(score, assessment):
if score >= 0.85:
return "approve", None, None
elif score >= 0.60:
instructions = generate_refactor_instructions(assessment)
return "refactor", instructions, None
elif score >= 0.40:
queries = generate_research_queries(assessment)
return "deep_research", None, queries
else:
return "reject", f"Quality score {score:.2f} below minimum", None
```
## Prompt Engineering Quality Checklist
- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
- [ ] Shows before/after examples
- [ ] Explains *why* techniques work, not just *what*
- [ ] Provides actionable patterns
- [ ] Includes edge cases and failure modes
- [ ] References authoritative sources
---
# 6. Markdown Exporter
Exports approved content as structured markdown files for Claude Projects or fine-tuning.
## Export Structure
**Nested by Topic (recommended):**
```
exports/
├── INDEX.md
├── prompt-engineering/
│ ├── _index.md
│ ├── 01-chain-of-thought.md
│ └── 02-few-shot-prompting.md
├── claude-models/
│ ├── _index.md
│ └── 01-model-comparison.md
└── agent-building/
└── 01-tool-use.md
```
## Document File Template
```python
def generate_document_file(doc, include_metadata=True):
content = []
if include_metadata:
content.append("---")
content.append(f"title: {doc['title']}")
content.append(f"source: {doc['url']}")
content.append(f"vendor: {doc['vendor']}")
content.append(f"tier: {doc['credibility_tier']}")
content.append(f"quality_score: {doc['quality_score']:.2f}")
content.append("---")
content.append("")
content.append(doc['structured_content'])
return "\n".join(content)
```
## Fine-tuning Export (JSONL)
```python
def export_fine_tuning_dataset(content_list, config):
with open('fine_tuning.jsonl', 'w') as f:
for doc in content_list:
sample = {
"messages": [
{"role": "system", "content": "You are an expert on AI and prompt engineering."},
{"role": "user", "content": f"Explain {doc['title']}"},
{"role": "assistant", "content": doc['structured_content']}
],
"metadata": {
"source": doc['url'],
"topic": doc['topic_slug'],
"quality_score": doc['quality_score']
}
}
f.write(json.dumps(sample) + '\n')
```
## Cross-Reference Generation
```python
def add_cross_references(doc, all_docs):
related = []
doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])
for other in all_docs:
if other['doc_id'] == doc['doc_id']:
continue
other_concepts = set(c['term'].lower() for c in other['key_concepts'])
overlap = len(doc_concepts & other_concepts)
if overlap >= 2:
related.append({
"title": other['title'],
"path": generate_relative_path(doc, other),
"overlap": overlap
})
return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
```
---
# Integration Flow
| From | Output | To |
|------|--------|-----|
| **reference-discovery** | URL manifest | web-crawler |
| **web-crawler** | Raw content + manifest | content-repository |
| **content-repository** | Document records | content-distiller |
| **content-distiller** | Distilled content | quality-reviewer |
| **quality-reviewer** (approve) | Approved IDs | markdown-exporter |
| **quality-reviewer** (refactor) | Instructions | content-distiller |
| **quality-reviewer** (deep_research) | Queries | web-crawler |

@@ -717,6 +717,65 @@ EOF
post_install
}
# ============================================================================
# Export for Claude.ai Projects
# ============================================================================
export_claude_ai() {
print_header
echo -e "${BOLD}Export for Claude.ai Projects${NC}"
echo ""
local project_dir="$SCRIPT_DIR/claude-project"
if [[ ! -d "$project_dir" ]]; then
print_error "claude-project directory not found"
echo "Expected: $project_dir"
exit 1
fi
echo "Available files for Claude.ai Projects:"
echo ""
echo -e " ${CYAN}Consolidated (single file):${NC}"
echo " reference-curator-complete.md - All 6 skills in one file"
echo ""
echo -e " ${CYAN}Individual skills:${NC}"
ls -1 "$project_dir"/*.md 2>/dev/null | while read -r file; do
local filename=$(basename "$file")
local size=$(du -h "$file" | cut -f1)
if [[ "$filename" != "INDEX.md" && "$filename" != "reference-curator-complete.md" ]]; then
echo " $filename ($size)"
fi
done
echo ""
echo -e "${BOLD}Upload to Claude.ai:${NC}"
echo ""
echo " 1. Go to https://claude.ai"
echo " 2. Create a new Project or open existing one"
echo " 3. Click 'Add to project knowledge'"
echo " 4. Upload files from:"
echo -e " ${CYAN}$project_dir${NC}"
echo ""
echo " Recommended: Upload 'reference-curator-complete.md' for full skill set"
echo ""
if prompt_yes_no "Copy files to a different location?" "n"; then
prompt_with_default "Destination directory" "$HOME/Desktop/reference-curator-claude-ai" "DEST_DIR"
mkdir -p "$DEST_DIR"
cp "$project_dir"/*.md "$DEST_DIR/"
print_success "Files copied to $DEST_DIR"
echo ""
echo "Files ready for upload:"
ls -la "$DEST_DIR"/*.md
fi
echo ""
echo -e "${GREEN}Done!${NC} Upload the files to your Claude.ai Project."
}
# ============================================================================
# Entry Point
# ============================================================================
@@ -731,6 +790,9 @@ case "${1:-}" in
--minimal)
install_minimal
;;
--claude-ai)
export_claude_ai
;;
--help|-h)
echo "Reference Curator - Portable Installation Script"
echo ""
@@ -738,6 +800,7 @@ case "${1:-}" in
echo " ./install.sh Interactive installation"
echo " ./install.sh --check Check installation status"
echo " ./install.sh --minimal Firecrawl-only mode (no MySQL)"
echo " ./install.sh --claude-ai Export skills for Claude.ai Projects"
echo " ./install.sh --uninstall Remove installation"
echo " ./install.sh --help Show this help"
;;