feat(reference-curator): Add Claude.ai Projects export format
Add claude-project/ folder with skill files formatted for upload to Claude.ai Projects (web interface):
- reference-curator-complete.md: All 6 skills consolidated
- INDEX.md: Overview and workflow documentation
- Individual skill files (01-06) without YAML frontmatter

Add --claude-ai option to install.sh:
- Lists available files for upload
- Optionally copies to a custom destination directory
- Provides upload instructions for Claude.ai

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -60,6 +60,7 @@ cd our-claude-skills/custom-skills/90-reference-curator
| **Full** | `./install.sh` | Interactive setup with MySQL and crawlers |
| **Minimal** | `./install.sh --minimal` | Firecrawl MCP only, no database |
| **Check** | `./install.sh --check` | Verify installation status |
| **Claude.ai** | `./install.sh --claude-ai` | Export skills for Claude.ai Projects |
| **Uninstall** | `./install.sh --uninstall` | Remove installation (preserves data) |

### What Gets Installed
@@ -94,6 +95,38 @@ export CRAWLER_PROJECT_PATH="" # Path to local crawlers (optional)
---

## Claude.ai Projects Installation

To use these skills in Claude.ai (web interface), export the skill files for upload:

```bash
./install.sh --claude-ai
```

This displays the available files in `claude-project/` and optionally copies them to a convenient location.

### Files for Upload

| File | Description |
|------|-------------|
| `reference-curator-complete.md` | All 6 skills combined (recommended) |
| `INDEX.md` | Overview and workflow documentation |
| `01-reference-discovery.md` | Source discovery skill |
| `02-web-crawler.md` | Crawling orchestration skill |
| `03-content-repository.md` | Database storage skill |
| `04-content-distiller.md` | Content summarization skill |
| `05-quality-reviewer.md` | QA review skill |
| `06-markdown-exporter.md` | Export skill |

### Upload Instructions

1. Go to [claude.ai](https://claude.ai)
2. Create a new Project or open an existing one
3. Click "Add to project knowledge"
4. Upload `reference-curator-complete.md` (or individual skills as needed)

---

## Architecture

```
@@ -386,6 +419,16 @@ mysql -h $MYSQL_HOST -u $MYSQL_USER -p"$MYSQL_PASSWORD" reference_library < shar
├── CHANGELOG.md                       # Version history
├── install.sh                         # Portable installation script
│
├── claude-project/                    # Files for Claude.ai Projects
│   ├── INDEX.md                       # Overview
│   ├── reference-curator-complete.md  # All skills combined
│   ├── 01-reference-discovery.md
│   ├── 02-web-crawler.md
│   ├── 03-content-repository.md
│   ├── 04-content-distiller.md
│   ├── 05-quality-reviewer.md
│   └── 06-markdown-exporter.md
│
├── commands/                          # Claude Code commands (tracked in git)
│   ├── reference-discovery.md
│   ├── web-crawler.md
@@ -0,0 +1,184 @@
# Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",  # past_week, past_month, past_year, any
    "max_results_per_query": 20
}
```
### Step 2: Generate Search Queries

For a given topic, generate targeted queries:

```python
def generate_queries(topic, vendors):
    queries = []

    # Official documentation queries
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")

    # Engineering blog queries
    for vendor in vendors:
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")

    # GitHub queries
    for vendor in vendors:
        queries.append(f"site:github.com/{vendor} {topic}")

    # Research queries
    queries.append(f"site:arxiv.org {topic}")

    return queries
```
### Step 3: Execute Search

Use the web search tool for each query:

```python
def execute_discovery(queries):
    results = []
    for query in queries:
        search_results = web_search(query)
        for result in search_results:
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query
            })
    return deduplicate_by_url(results)
```
### Step 4: Validate and Score Sources

```python
def score_source(url, title):
    score = 0.0

    # Domain credibility
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community

    # Freshness signals (from title/snippet)
    if any(year in title for year in ['2025', '2024']):
        score += 0.20
    elif any(year in title for year in ['2023']):
        score += 0.10

    # Relevance signals
    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
        score += 0.15

    return min(score, 1.0)


def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```
### Step 5: Output URL Manifest

```python
from datetime import datetime

def create_manifest(scored_results, topic):
    manifest = {
        "discovery_date": datetime.now().isoformat(),
        "topic": topic,
        "total_urls": len(scored_results),
        "urls": []
    }

    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
        manifest["urls"].append({
            "url": result["url"],
            "title": result["title"],
            "credibility_tier": result["tier"],
            "credibility_score": result["score"],
            "source_type": infer_source_type(result["url"]),
            "vendor": infer_vendor(result["url"])
        })

    return manifest
```
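The `infer_source_type` and `infer_vendor` helpers referenced by `create_manifest` are not defined in this skill; a minimal sketch, assuming the tier-1 domains listed in the hierarchy above (the exact mappings are assumptions):

```python
def infer_vendor(url):
    """Guess the vendor from the URL's domain (hypothetical helper)."""
    needles_by_vendor = {
        "anthropic": ["anthropic.com", "claude.com", "github.com/anthropics"],
        "openai": ["openai.com", "github.com/openai"],
        "google": ["google.dev", "blog.google", "github.com/google"],
    }
    for vendor, needles in needles_by_vendor.items():
        if any(n in url for n in needles):
            return vendor
    return "unknown"


def infer_source_type(url):
    """Classify the URL into the manifest's source_type values (hypothetical helper)."""
    if "github.com" in url:
        return "github_repo"
    if "arxiv.org" in url:
        return "research_paper"
    if "docs." in url or "/docs" in url:
        return "official_docs"
    if "/blog" in url or "/news" in url:
        return "engineering_blog"
    return "community"
```

The check order matters: GitHub and arXiv URLs are matched before the generic `docs.` substring so repository docs pages classify as repos.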
## Output Format

Discovery produces a JSON manifest for the crawler:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
## Known Authoritative Sources

Pre-validated sources for common topics:

| Vendor | Documentation | Blog/News | GitHub |
|--------|--------------|-----------|--------|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |

## Integration

**Output:** URL manifest JSON → `web-crawler-orchestrator`

**Database:** Register new sources in `sources` table via `content-repository`
## Deduplication

Before outputting, deduplicate URLs:

- Normalize URLs (remove trailing slashes, query params)
- Check against existing `documents` table via `content-repository`
- Merge duplicate entries, keeping the highest credibility score
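The normalization and merge steps above can be sketched as follows; a minimal example, assuming query strings and fragments carry no meaning for these sources:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Normalize a URL for deduplication: lowercase host, drop query/fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def deduplicate_by_url(results):
    """Keep one entry per normalized URL, preferring the highest credibility score."""
    best = {}
    for r in results:
        key = normalize_url(r["url"])
        if key not in best or r.get("score", 0) > best[key].get("score", 0):
            best[key] = r
    return list(best.values())
```

Checking candidates against the `documents` table would happen after this in-memory pass, via `content-repository`.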
@@ -0,0 +1,230 @@
# Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

## Prerequisites

- Firecrawl MCP server connected
- Config file at `~/.config/reference-curator/crawl_config.yaml`
- Storage directory exists: `~/reference-library/raw/`
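The local prerequisites can be verified with a quick preflight sketch (the paths are the ones named above; checking the MCP connection is left to the runtime):

```python
from pathlib import Path

def preflight():
    """Return a list of missing local prerequisites; empty means ready to crawl."""
    config = Path.home() / ".config/reference-curator/crawl_config.yaml"
    raw_dir = Path.home() / "reference-library/raw"
    missing = []
    if not config.is_file():
        missing.append(str(config))
    if not raw_dir.is_dir():
        missing.append(str(raw_dir))
    return missing
```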
## Crawl Configuration
```yaml
# ~/.config/reference-curator/crawl_config.yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
    include_html: false

processing:
  max_content_size_mb: 50
  raw_content_dir: ~/reference-library/raw/
```
## Crawl Workflow

### Step 1: Load URL Manifest

Receive the manifest from `reference-discovery`:

```python
import json

def load_manifest(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest["urls"]
```
### Step 2: Determine Crawl Strategy

```python
def select_strategy(url):
    """Select optimal crawl strategy based on URL characteristics."""

    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'  # Get raw file content
    elif 'github.com' in url:
        return 'scrape'  # Repository pages
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'  # Documentation sites
    else:
        return 'scrape'  # Default
```
### Step 3: Execute Firecrawl

Use Firecrawl MCP for crawling:

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],  # markdown | html | screenshot
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl (documentation sites)
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"],
    only_main_content=True
)
```
### Step 4: Rate Limiting

```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        # Remove requests older than 1 minute
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()

        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)

        self.request_times.append(time.time())
```
### Step 5: Save Raw Content

```python
import hashlib
from datetime import datetime
from pathlib import Path

def save_content(url, content, content_type='markdown'):
    """Save crawled content to raw storage."""

    # Generate filename from URL hash
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]

    # Determine extension
    ext_map = {'markdown': '.md', 'html': '.html', 'pdf': '.pdf'}
    ext = ext_map.get(content_type, '.txt')

    # Create dated subdirectory
    date_dir = datetime.now().strftime('%Y/%m')
    output_dir = Path.home() / 'reference-library/raw' / date_dir
    output_dir.mkdir(parents=True, exist_ok=True)

    # Save file
    filepath = output_dir / f"{url_hash}{ext}"
    if content_type == 'pdf':
        filepath.write_bytes(content)
    else:
        filepath.write_text(content, encoding='utf-8')

    return str(filepath)
```
### Step 6: Generate Crawl Manifest

```python
from datetime import datetime

def create_crawl_manifest(results):
    manifest = {
        "crawl_date": datetime.now().isoformat(),
        "total_crawled": len([r for r in results if r["status"] == "success"]),
        "total_failed": len([r for r in results if r["status"] == "failed"]),
        "documents": []
    }

    for result in results:
        manifest["documents"].append({
            "url": result["url"],
            "status": result["status"],
            "raw_content_path": result.get("filepath"),
            "content_size": result.get("size"),
            "crawl_method": "firecrawl",
            "error": result.get("error")
        })

    return manifest
```
## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
| Connection error | Retry with backoff |
```python
def crawl_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = firecrawl_scrape(url)
            return {"status": "success", "content": result}
        except RateLimitError:
            wait = 2 ** attempt * 10  # 10, 20, 40 seconds
            time.sleep(wait)
        except TimeoutError:
            # Retry once with doubled timeout
            try:
                result = firecrawl_scrape(url, timeout=60000)
                return {"status": "success", "content": result}
            except Exception as e:
                return {"status": "failed", "error": str(e)}
        except NotFoundError:
            return {"status": "failed", "error": "404 Not Found"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"status": "failed", "error": str(e)}

    return {"status": "failed", "error": "Max retries exceeded"}
```
## Firecrawl MCP Reference

**scrape** - Single page:
```
firecrawl_scrape(url, formats, only_main_content, timeout)
```

**crawl** - Multi-page:
```
firecrawl_crawl(url, max_depth, limit, formats, only_main_content)
```

**map** - Discover URLs:
```
firecrawl_map(url, limit)  # Returns list of URLs on site
```
## Integration

| From | Input | To |
|------|-------|-----|
| reference-discovery | URL manifest | web-crawler-orchestrator |
| web-crawler-orchestrator | Crawl manifest + raw files | content-repository |
| quality-reviewer (deep_research) | Additional queries | reference-discovery → here |
## Output Structure

```
~/reference-library/raw/
└── 2025/01/
    ├── a1b2c3d4e5f60718.md   # Markdown content
    ├── b2c3d4e5f6071829.md
    └── c3d4e5f60718293a.pdf  # PDF documents
```
@@ -0,0 +1,158 @@
# Content Repository

Manages MySQL storage for the reference library system. Handles document storage, version control, deduplication, and retrieval.

## Prerequisites

- MySQL 8.0+ with utf8mb4 charset
- Config file at `~/.config/reference-curator/db_config.yaml`
- Database `reference_library` initialized with schema

## Quick Reference

### Connection Setup
```python
import yaml
import os
from pathlib import Path

def get_db_config():
    config_path = Path.home() / ".config/reference-curator/db_config.yaml"
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Resolve environment variables
    mysql = config['mysql']
    return {
        'host': mysql['host'],
        'port': mysql['port'],
        'database': mysql['database'],
        'user': os.environ.get('MYSQL_USER', mysql.get('user', '')),
        'password': os.environ.get('MYSQL_PASSWORD', mysql.get('password', '')),
        'charset': mysql['charset']
    }
```
### Core Operations

**Store New Document:**
```python
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
    sql = """
        INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
        VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
        ON DUPLICATE KEY UPDATE
            version = version + 1,
            previous_version_id = doc_id,
            crawl_date = NOW(),
            raw_content_path = VALUES(raw_content_path)
    """
    cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
    return cursor.lastrowid
```
**Check Duplicate:**
```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```
**Get Document by Topic:**
```python
def get_docs_by_topic(cursor, topic_slug, min_quality=0.80):
    sql = """
        SELECT d.doc_id, d.title, d.url, dc.structured_content, dc.quality_score
        FROM documents d
        JOIN document_topics dt ON d.doc_id = dt.doc_id
        JOIN topics t ON dt.topic_id = t.topic_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE t.topic_slug = %s
          AND (dc.review_status = 'approved' OR dc.review_status IS NULL)
          AND (dc.quality_score >= %s OR dc.quality_score IS NULL)
        ORDER BY dt.relevance_score DESC
    """
    cursor.execute(sql, (topic_slug, min_quality))
    return cursor.fetchall()
```
## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision, refactor_instructions |
| `topics` | Taxonomy | topic_slug, parent_topic_id |
| `document_topics` | Many-to-many linking | relevance_score |
| `export_jobs` | Export tracking | export_type, output_format, status |
## Status Values

**crawl_status:** `pending` → `completed` | `failed` | `stale`

**review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`

**decision (review):** `approve` | `refactor` | `deep_research` | `reject`
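A sketch of enforcing these transitions before issuing an UPDATE; the map mirrors the arrows above, and the `needs_refactor → pending` edge is an assumption (refactored content re-enters the review queue):

```python
REVIEW_TRANSITIONS = {
    "pending": {"in_review"},
    "in_review": {"approved", "needs_refactor", "rejected"},
    "needs_refactor": {"pending"},  # Assumption: refactored content is re-reviewed
    "approved": set(),
    "rejected": set(),
}

def can_transition(current, new):
    """Return True if review_status may move from current to new."""
    return new in REVIEW_TRANSITIONS.get(current, set())
```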
## Common Queries

### Find Stale Documents (needs re-crawl)
```sql
SELECT d.doc_id, d.title, d.url, d.crawl_date
FROM documents d
JOIN crawl_schedule cs ON d.source_id = cs.source_id
WHERE d.crawl_date < DATE_SUB(NOW(), INTERVAL
    CASE cs.frequency
        WHEN 'daily' THEN 1
        WHEN 'weekly' THEN 7
        WHEN 'biweekly' THEN 14
        WHEN 'monthly' THEN 30
    END DAY)
AND cs.is_enabled = TRUE;
```
### Get Pending Reviews
```sql
SELECT dc.distill_id, d.title, d.url, dc.token_count_distilled
FROM distilled_content dc
JOIN documents d ON dc.doc_id = d.doc_id
WHERE dc.review_status = 'pending'
ORDER BY dc.distill_date ASC;
```
### Export-Ready Content
```sql
SELECT d.title, d.url, dc.structured_content, t.topic_slug
FROM documents d
JOIN distilled_content dc ON d.doc_id = dc.doc_id
JOIN document_topics dt ON d.doc_id = dt.doc_id
JOIN topics t ON dt.topic_id = t.topic_id
JOIN review_logs rl ON dc.distill_id = rl.distill_id
WHERE rl.decision = 'approve'
AND rl.quality_score >= 0.85
ORDER BY t.topic_slug, dt.relevance_score DESC;
```
## Workflow Integration

1. **From crawler-orchestrator:** Receive URL + raw content path → `store_document()`
2. **To content-distiller:** Query pending documents → send for processing
3. **From quality-reviewer:** Update `review_status` based on decision
4. **To markdown-exporter:** Query approved content by topic
## Error Handling

- **Duplicate URL:** Silent update (version increment) via `ON DUPLICATE KEY UPDATE`
- **Missing source_id:** Validate against `sources` table before insert
- **Connection failure:** Implement retry with exponential backoff
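The connection-failure policy can be sketched as a generic retry wrapper; a minimal sketch, deliberately driver-agnostic (pass a callable that opens the connection or runs the query):

```python
import time

def with_retries(operation, max_attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Run operation(), retrying with exponential backoff on the given error types."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Wrapping the connect call, e.g. `with_retries(lambda: connector.connect(**get_db_config()))`, keeps the backoff policy in one place.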
## Full Schema Reference

See `references/schema.sql` for complete table definitions including indexes and constraints.

## Config File Template

See `references/db_config_template.yaml` for the connection configuration template.
@@ -0,0 +1,234 @@
# Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

## Distillation Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization

## Distillation Workflow

### Step 1: Load Raw Content
```python
def load_for_distillation(cursor):
    """Get documents ready for distillation."""
    sql = """
        SELECT d.doc_id, d.title, d.url, d.raw_content_path,
               d.doc_type, s.source_type, s.credibility_tier
        FROM documents d
        JOIN sources s ON d.source_id = s.source_id
        LEFT JOIN distilled_content dc ON d.doc_id = dc.doc_id
        WHERE d.crawl_status = 'completed'
          AND dc.distill_id IS NULL
        ORDER BY s.credibility_tier ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()
```
### Step 2: Analyze Content Structure

Identify the content type and select an appropriate distillation strategy:

```python
import re

def analyze_structure(content, doc_type):
    """Analyze document structure for distillation."""
    analysis = {
        "has_code_blocks": bool(re.findall(r'```[\s\S]*?```', content)),
        "has_headers": bool(re.findall(r'^#+\s', content, re.MULTILINE)),
        "has_lists": bool(re.findall(r'^\s*[-*]\s', content, re.MULTILINE)),
        "has_tables": bool(re.findall(r'\|.*\|', content)),
        "estimated_tokens": len(content.split()) * 1.3,  # Rough estimate
        "section_count": len(re.findall(r'^#+\s', content, re.MULTILINE))
    }
    return analysis
```
### Step 3: Extract Key Components

**Extract Code Snippets:**
```python
import re

def extract_code_snippets(content):
    """Extract all code blocks with language tags."""
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
```
**Extract Key Concepts:**
```python
def extract_key_concepts(content, title):
    """Use Claude to extract key concepts and definitions."""
    excerpt = content[:8000]  # Limit for context window
    prompt = f"""
Analyze this document and extract key concepts:

Title: {title}
Content: {excerpt}

Return JSON with:
- concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
- techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
- best_practices: ["..."]
"""
    # Use Claude API to process
    return claude_extract(prompt)
```
### Step 4: Create Structured Summary

**Summary Template:**
```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts with brief definitions}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets with context}

## Best Practices
{actionable recommendations}

## Related Topics
{links to related content in library}
```
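Filling the template header can be as simple as `str.format`; a minimal sketch, assuming the document record carries the field names used as placeholders above:

```python
SUMMARY_HEADER = """# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}
**Distilled:** {date}

## Executive Summary
{executive_summary}
"""

def render_summary_header(doc, executive_summary):
    """Render the header portion of the structured summary from a document record."""
    return SUMMARY_HEADER.format(
        title=doc["title"],
        url=doc["url"],
        source_type=doc["source_type"],
        credibility_tier=doc["credibility_tier"],
        date=doc["date"],
        executive_summary=executive_summary,
    )
```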
### Step 5: Optimize for Tokens

```python
def optimize_content(structured_content, target_ratio=0.30):
    """
    Compress content to the target ratio while preserving quality.
    Target: 30% of the original token count.
    """
    original_tokens = count_tokens(structured_content)
    target_tokens = int(original_tokens * target_ratio)

    # Prioritized compression strategies
    strategies = [
        remove_redundant_explanations,
        condense_examples,
        merge_similar_sections,
        trim_verbose_descriptions
    ]

    optimized = structured_content
    for strategy in strategies:
        if count_tokens(optimized) > target_tokens:
            optimized = strategy(optimized)

    return optimized
```
### Step 6: Store Distilled Content

```python
import json

def store_distilled(cursor, doc_id, summary, key_concepts,
                    code_snippets, structured_content,
                    original_tokens, distilled_tokens):
    sql = """
        INSERT INTO distilled_content
            (doc_id, summary, key_concepts, code_snippets, structured_content,
             token_count_original, token_count_distilled, distill_model, review_status)
        VALUES (%s, %s, %s, %s, %s, %s, %s, 'claude-opus-4-5', 'pending')
    """
    cursor.execute(sql, (
        doc_id, summary,
        json.dumps(key_concepts),
        json.dumps(code_snippets),
        structured_content,
        original_tokens,
        distilled_tokens
    ))
    return cursor.lastrowid
```
## Distillation Prompts
|
||||||
|
|
||||||
|
**For Prompt Engineering Content:**
|
||||||
|
```
|
||||||
|
Focus on:
|
||||||
|
1. Specific techniques with before/after examples
|
||||||
|
2. Why techniques work (not just what)
|
||||||
|
3. Common pitfalls and how to avoid them
|
||||||
|
4. Actionable patterns that can be directly applied
|
||||||
|
```
|
||||||
|
|
||||||
|
**For API Documentation:**
|
||||||
|
```
|
||||||
|
Focus on:
|
||||||
|
1. Endpoint specifications and parameters
|
||||||
|
2. Request/response examples
|
||||||
|
3. Error codes and handling
|
||||||
|
4. Rate limits and best practices
|
||||||
|
```
|
||||||
|
|
||||||
|
**For Research Papers:**
|
||||||
|
```
|
||||||
|
Focus on:
|
||||||
|
1. Key findings and conclusions
|
||||||
|
2. Novel techniques introduced
|
||||||
|
3. Practical applications
|
||||||
|
4. Limitations and caveats
|
||||||
|
```
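
The focus templates above can be assembled into a full distillation prompt before calling the model. A minimal sketch, assuming illustrative names (`FOCUS_TEMPLATES`, `build_distillation_prompt` are not part of the skill's defined API):

```python
# Illustrative assembly of a distillation prompt from a content-type template.
FOCUS_TEMPLATES = {
    "prompt_engineering": [
        "Specific techniques with before/after examples",
        "Why techniques work (not just what)",
        "Common pitfalls and how to avoid them",
        "Actionable patterns that can be directly applied",
    ],
    "api_documentation": [
        "Endpoint specifications and parameters",
        "Request/response examples",
        "Error codes and handling",
        "Rate limits and best practices",
    ],
}

def build_distillation_prompt(content_type: str, raw_content: str,
                              target_ratio: float = 0.30) -> str:
    # Number the focus lines and wrap the raw document in explicit tags
    focus = "\n".join(f"{i}. {line}"
                      for i, line in enumerate(FOCUS_TEMPLATES[content_type], 1))
    return (
        f"Distill the following document to roughly {target_ratio:.0%} of its length.\n"
        f"Focus on:\n{focus}\n\n"
        f"<document>\n{raw_content}\n</document>"
    )
```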

## Quality Metrics

Track compression efficiency:

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |
| Readability | Clear, scannable structure |
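
The compression-ratio target can be checked directly from the token counts already stored with each distillation. A small sketch (the helper name is illustrative):

```python
def check_compression_metrics(original_tokens: int, distilled_tokens: int) -> dict:
    # Compression ratio = distilled size as a fraction of the original;
    # the 0.25-0.35 band mirrors the target in the table above.
    ratio = distilled_tokens / original_tokens
    return {
        "compression_ratio": round(ratio, 3),
        "within_target": 0.25 <= ratio <= 0.35,
    }
```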

## Handling Refactor Requests

When `quality-reviewer` returns a `refactor` decision:

```python
def handle_refactor(distill_id, instructions):
    """Re-distill based on reviewer feedback."""
    # Load original content and existing distillation
    original = load_raw_content(distill_id)
    existing = load_distilled_content(distill_id)

    # Apply specific improvements based on instructions
    improved = apply_improvements(existing, instructions)

    # Update distilled_content
    update_distilled(distill_id, improved)

    # Reset review status
    set_review_status(distill_id, 'pending')
```

## Integration

| From | Input | To |
|------|-------|-----|
| content-repository | Raw document records | content-distiller |
| content-distiller | Distilled content | quality-reviewer |
| quality-reviewer | Refactor instructions | content-distiller (loop) |

@@ -0,0 +1,223 @@

# Quality Reviewer

Evaluates distilled content for quality, routes decisions, and triggers refactoring or additional research when needed.

## Review Workflow

```
[Distilled Content]
        │
        ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ Calculate Total │ → weighted average
└─────────────────┘
        │
        ├── ≥ 0.85    → APPROVE → markdown-exporter
        ├── 0.60-0.84 → REFACTOR → content-distiller (with instructions)
        ├── 0.40-0.59 → DEEP_RESEARCH → web-crawler-orchestrator (with queries)
        └── < 0.40    → REJECT → archive with reason
```

## Scoring Criteria

| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |

## Decision Thresholds

| Score Range | Decision | Action |
|-------------|----------|--------|
| ≥ 0.85 | `approve` | Proceed to export |
| 0.60 - 0.84 | `refactor` | Return to distiller with feedback |
| 0.40 - 0.59 | `deep_research` | Gather more sources, then re-distill |
| < 0.40 | `reject` | Archive, log reason |

## Review Process

### Step 1: Load Content for Review

```python
def get_pending_reviews(cursor):
    sql = """
        SELECT dc.distill_id, dc.doc_id, d.title, d.url,
               dc.summary, dc.key_concepts, dc.structured_content,
               dc.token_count_original, dc.token_count_distilled,
               s.credibility_tier
        FROM distilled_content dc
        JOIN documents d ON dc.doc_id = d.doc_id
        JOIN sources s ON d.source_id = s.source_id
        WHERE dc.review_status = 'pending'
        ORDER BY s.credibility_tier ASC, dc.distill_date ASC
    """
    cursor.execute(sql)
    return cursor.fetchall()
```

### Step 2: Score Each Criterion

Evaluate content against each criterion using this assessment template:

```python
assessment_template = {
    "accuracy": {
        "score": 0.0,  # 0.00 - 1.00
        "notes": "",
        "issues": []   # Specific factual errors if any
    },
    "completeness": {
        "score": 0.0,
        "notes": "",
        "missing_topics": []  # Concepts that should be covered
    },
    "clarity": {
        "score": 0.0,
        "notes": "",
        "confusing_sections": []  # Sections needing rewrite
    },
    "prompt_engineering_quality": {
        "score": 0.0,
        "notes": "",
        "improvements": []  # Specific PE technique gaps
    },
    "usability": {
        "score": 0.0,
        "notes": "",
        "suggestions": []
    }
}
```

### Step 3: Calculate Final Score

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )
```

### Step 4: Route Decision

```python
def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum threshold", None

def generate_refactor_instructions(assessment):
    """Extract actionable feedback from low-scoring criteria."""
    instructions = []
    for criterion, data in assessment.items():
        if data["score"] < 0.80:
            if data.get("issues"):
                instructions.extend(data["issues"])
            if data.get("missing_topics"):
                instructions.append(f"Add coverage for: {', '.join(data['missing_topics'])}")
            if data.get("improvements"):
                instructions.extend(data["improvements"])
    return "\n".join(instructions)

def generate_research_queries(assessment):
    """Generate search queries for content gaps."""
    queries = []
    if assessment["completeness"]["missing_topics"]:
        for topic in assessment["completeness"]["missing_topics"]:
            queries.append(f"{topic} documentation guide")
    if assessment["accuracy"]["issues"]:
        queries.append("latest official documentation verification")
    return queries
```

### Step 5: Log Review Decision

```python
def log_review(cursor, distill_id, assessment, score, decision, instructions=None, queries=None):
    # Get current round number
    cursor.execute(
        "SELECT COALESCE(MAX(review_round), 0) + 1 FROM review_logs WHERE distill_id = %s",
        (distill_id,)
    )
    review_round = cursor.fetchone()[0]

    sql = """
        INSERT INTO review_logs
            (distill_id, review_round, reviewer_type, quality_score, assessment,
             decision, refactor_instructions, research_queries)
        VALUES (%s, %s, 'claude_review', %s, %s, %s, %s, %s)
    """
    cursor.execute(sql, (
        distill_id, review_round, score,
        json.dumps(assessment), decision, instructions,
        json.dumps(queries) if queries else None
    ))

    # Update distilled_content status
    status_map = {
        "approve": "approved",
        "refactor": "needs_refactor",
        "deep_research": "needs_refactor",
        "reject": "rejected"
    }
    cursor.execute(
        "UPDATE distilled_content SET review_status = %s WHERE distill_id = %s",
        (status_map[decision], distill_id)
    )
```

## Prompt Engineering Quality Checklist

When scoring `prompt_engineering_quality`, verify:

- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
- [ ] Shows before/after examples
- [ ] Explains *why* techniques work, not just *what*
- [ ] Provides actionable patterns
- [ ] Includes edge cases and failure modes
- [ ] References authoritative sources

## Auto-Approve Rules

Tier 1 (official) sources with score ≥ 0.80 may auto-approve without human review if configured:

```yaml
# In export_config.yaml
quality:
  auto_approve_tier1_sources: true
  auto_approve_min_score: 0.80
```
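
A minimal sketch of how that rule could be evaluated in code. The config keys match `export_config.yaml` above and the tier value matches the discovery skill's `tier1_official`; the function name itself is illustrative:

```python
def may_auto_approve(credibility_tier: str, score: float, config: dict) -> bool:
    # Auto-approve only Tier 1 sources, only when the feature is enabled,
    # and only at or above the configured minimum score.
    quality = config.get("quality", {})
    return (
        quality.get("auto_approve_tier1_sources", False)
        and credibility_tier == "tier1_official"
        and score >= quality.get("auto_approve_min_score", 0.80)
    )
```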

## Integration Points

| From | Action | To |
|------|--------|-----|
| content-distiller | Sends distilled content | quality-reviewer |
| quality-reviewer | APPROVE | markdown-exporter |
| quality-reviewer | REFACTOR + instructions | content-distiller |
| quality-reviewer | DEEP_RESEARCH + queries | web-crawler-orchestrator |

@@ -0,0 +1,290 @@

# Markdown Exporter

Exports approved content as structured markdown files for Claude Projects or fine-tuning.

## Export Configuration

```yaml
# ~/.config/reference-curator/export_config.yaml
output:
  base_path: ~/reference-library/exports/

  project_files:
    structure: nested_by_topic  # flat | nested_by_topic | nested_by_source
    index_file: INDEX.md
    include_metadata: true

  fine_tuning:
    format: jsonl
    max_tokens_per_sample: 4096
    include_system_prompt: true

quality:
  min_score_for_export: 0.80
```

## Export Workflow

### Step 1: Query Approved Content

```python
def get_exportable_content(cursor, min_score=0.80, topic_filter=None):
    """Get all approved content meeting quality threshold."""
    sql = """
        SELECT d.doc_id, d.title, d.url,
               dc.summary, dc.key_concepts, dc.code_snippets, dc.structured_content,
               t.topic_slug, t.topic_name,
               rl.quality_score, s.credibility_tier, s.vendor
        FROM documents d
        JOIN distilled_content dc ON d.doc_id = dc.doc_id
        JOIN document_topics dt ON d.doc_id = dt.doc_id
        JOIN topics t ON dt.topic_id = t.topic_id
        JOIN review_logs rl ON dc.distill_id = rl.distill_id
        JOIN sources s ON d.source_id = s.source_id
        WHERE rl.decision = 'approve'
          AND rl.quality_score >= %s
          AND rl.review_id = (
              SELECT MAX(review_id) FROM review_logs
              WHERE distill_id = dc.distill_id
          )
    """
    params = [min_score]

    if topic_filter:
        sql += " AND t.topic_slug IN (%s)" % ','.join(['%s'] * len(topic_filter))
        params.extend(topic_filter)

    sql += " ORDER BY t.topic_slug, rl.quality_score DESC"
    cursor.execute(sql, params)
    return cursor.fetchall()
```

### Step 2: Organize by Structure

**Nested by Topic (recommended):**
```
exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   ├── 02-few-shot-prompting.md
│   └── 03-system-prompts.md
├── claude-models/
│   ├── _index.md
│   ├── 01-model-comparison.md
│   └── 02-context-windows.md
└── agent-building/
    ├── _index.md
    └── 01-tool-use.md
```

**Flat Structure:**
```
exports/
├── INDEX.md
├── prompt-engineering-chain-of-thought.md
├── prompt-engineering-few-shot.md
└── claude-models-comparison.md
```
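
Both layouts build their filenames from document titles via `slugify`, which the file-writing code below calls but this skill does not define. A minimal sketch of one reasonable implementation:

```python
import re

def slugify(title: str) -> str:
    # Lowercase, collapse every run of non-alphanumerics to a single hyphen,
    # and trim stray hyphens from the ends.
    slug = title.lower()
    slug = re.sub(r'[^a-z0-9]+', '-', slug)
    return slug.strip('-')
```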

### Step 3: Generate Files

**Document File Template:**
```python
def generate_document_file(doc, include_metadata=True):
    content = []

    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append(f"exported: {datetime.now().isoformat()}")
        content.append("---")
        content.append("")

    content.append(doc['structured_content'])

    return "\n".join(content)
```

**Topic Index Template:**
```python
def generate_topic_index(topic_slug, topic_name, documents):
    content = [
        f"# {topic_name}",
        "",
        f"This section contains {len(documents)} reference documents.",
        "",
        "## Contents",
        ""
    ]

    for i, doc in enumerate(documents, 1):
        filename = generate_filename(doc['title'])
        content.append(f"{i}. [{doc['title']}]({filename})")

    return "\n".join(content)
```

**Root INDEX Template:**
```python
def generate_root_index(topics_with_counts, export_date):
    content = [
        "# Reference Library",
        "",
        f"Exported: {export_date}",
        "",
        "## Topics",
        ""
    ]

    for topic in topics_with_counts:
        content.append(f"- [{topic['name']}]({topic['slug']}/) ({topic['count']} documents)")

    content.extend([
        "",
        "## Quality Standards",
        "",
        "All documents in this library have:",
        "- Passed quality review (score ≥ 0.80)",
        "- Been distilled for conciseness",
        "- Verified source attribution"
    ])

    return "\n".join(content)
```

### Step 4: Write Files

```python
def export_project_files(content_list, config):
    base_path = Path(config['output']['base_path'])
    structure = config['output']['project_files']['structure']

    # Group by topic
    by_topic = defaultdict(list)
    for doc in content_list:
        by_topic[doc['topic_slug']].append(doc)

    # Create directories and files
    for topic_slug, docs in by_topic.items():
        if structure == 'nested_by_topic':
            topic_dir = base_path / topic_slug
            topic_dir.mkdir(parents=True, exist_ok=True)

            # Write topic index
            topic_index = generate_topic_index(topic_slug, docs[0]['topic_name'], docs)
            (topic_dir / '_index.md').write_text(topic_index)

            # Write document files
            for i, doc in enumerate(docs, 1):
                filename = f"{i:02d}-{slugify(doc['title'])}.md"
                file_content = generate_document_file(doc)
                (topic_dir / filename).write_text(file_content)

    # Write root INDEX
    topics_summary = [
        {"slug": slug, "name": docs[0]['topic_name'], "count": len(docs)}
        for slug, docs in by_topic.items()
    ]
    root_index = generate_root_index(topics_summary, datetime.now().isoformat())
    (base_path / 'INDEX.md').write_text(root_index)
```

### Step 5: Fine-tuning Export (Optional)

```python
def export_fine_tuning_dataset(content_list, config):
    """Export as JSONL for fine-tuning."""
    output_path = Path(config['output']['base_path']) / 'fine_tuning.jsonl'
    max_tokens = config['output']['fine_tuning']['max_tokens_per_sample']

    with open(output_path, 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {
                        "role": "system",
                        "content": "You are an expert on AI and prompt engineering."
                    },
                    {
                        "role": "user",
                        "content": f"Explain {doc['title']}"
                    },
                    {
                        "role": "assistant",
                        "content": truncate_to_tokens(doc['structured_content'], max_tokens)
                    }
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')
```
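
`truncate_to_tokens` is used above but not defined in this skill. A crude whitespace-token sketch; a production implementation would count tokens with the target model's tokenizer instead:

```python
def truncate_to_tokens(text: str, max_tokens: int) -> str:
    # Approximate tokens as whitespace-separated words and cut at the limit.
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])
```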

### Step 6: Log Export Job

```python
def log_export_job(cursor, export_name, export_type, output_path,
                   topic_filter, total_docs, total_tokens):
    sql = """
        INSERT INTO export_jobs
            (export_name, export_type, output_format, topic_filter, output_path,
             total_documents, total_tokens, status, started_at, completed_at)
        VALUES (%s, %s, 'markdown', %s, %s, %s, %s, 'completed', NOW(), NOW())
    """
    cursor.execute(sql, (
        export_name, export_type,
        json.dumps(topic_filter) if topic_filter else None,
        str(output_path), total_docs, total_tokens
    ))
```

## Cross-Reference Generation

Link related documents:

```python
def add_cross_references(doc, all_docs):
    """Find and link related documents."""
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])

    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })

    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
```

## Output Verification

After export, verify:

- [ ] All files readable and valid markdown
- [ ] INDEX.md links resolve correctly
- [ ] No broken cross-references
- [ ] Total token count matches expectation
- [ ] No duplicate content
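
The "links resolve" check can be automated. A sketch under the assumption that links are standard markdown `[text](target)` pairs; `verify_index_links` is an illustrative helper, not part of the exporter's defined API:

```python
import re
from pathlib import Path

def verify_index_links(index_path: Path) -> list:
    """Return relative markdown link targets in a file that do not exist on disk."""
    text = index_path.read_text()
    broken = []
    for target in re.findall(r'\]\(([^)]+)\)', text):
        # Only local paths are checked; web links and anchors are skipped.
        if target.startswith(('http://', 'https://', '#')):
            continue
        if not (index_path.parent / target).exists():
            broken.append(target)
    return broken
```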

## Integration

| From | Input | To |
|------|-------|-----|
| quality-reviewer | Approved content IDs | markdown-exporter |
| markdown-exporter | Structured files | Project knowledge / Fine-tuning |

custom-skills/90-reference-curator/claude-project/INDEX.md
@@ -0,0 +1,89 @@

# Reference Curator - Claude.ai Project Knowledge

This project knowledge enables Claude to curate, process, and export reference documentation through 6 modular skills.

## Skills Overview

| Skill | Purpose | Trigger Phrases |
|-------|---------|-----------------|
| **reference-discovery** | Search & validate authoritative sources | "find references", "search documentation", "discover sources" |
| **web-crawler** | Multi-backend crawling orchestration | "crawl URL", "fetch documents", "scrape pages" |
| **content-repository** | MySQL storage management | "store content", "save to database", "check duplicates" |
| **content-distiller** | Summarize & extract key concepts | "distill content", "summarize document", "extract key concepts" |
| **quality-reviewer** | QA scoring & routing decisions | "review content", "quality check", "assess distilled content" |
| **markdown-exporter** | Export to markdown/JSONL | "export references", "generate project files", "create markdown output" |

## Workflow

```
[Topic Input]
      │
      ▼
┌─────────────────────┐
│ reference-discovery │ → Search & validate sources
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│ web-crawler         │ → Crawl (Firecrawl/Node.js/aiohttp/Scrapy)
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│ content-repository  │ → Store in MySQL
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│ content-distiller   │ → Summarize & extract
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│ quality-reviewer    │ → QA loop
└─────────────────────┘
      │
      ├── REFACTOR → content-distiller
      ├── DEEP_RESEARCH → web-crawler
      │
      ▼ APPROVE
┌─────────────────────┐
│ markdown-exporter   │ → Project files / Fine-tuning
└─────────────────────┘
```

## Quality Scoring Thresholds

| Score | Decision | Action |
|-------|----------|--------|
| ≥ 0.85 | **Approve** | Ready for export |
| 0.60-0.84 | **Refactor** | Re-distill with feedback |
| 0.40-0.59 | **Deep Research** | Gather more sources |
| < 0.40 | **Reject** | Archive (low quality) |

## Source Credibility Tiers

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, platform.openai.com/docs |
| **Tier 1** | Official engineering blogs | anthropic.com/news, openai.com/blog |
| **Tier 2** | Research papers | arxiv.org papers with citations |
| **Tier 2** | Verified community guides | Official cookbooks, tutorials |
| **Tier 3** | Community content | Blog posts, Stack Overflow |

## Files in This Project

- `INDEX.md` - This overview file
- `reference-curator-complete.md` - All 6 skills in one file
- `01-reference-discovery.md` - Source discovery skill
- `02-web-crawler.md` - Crawling orchestration skill
- `03-content-repository.md` - Database storage skill
- `04-content-distiller.md` - Content summarization skill
- `05-quality-reviewer.md` - QA review skill
- `06-markdown-exporter.md` - Export skill

## Usage

Upload all files to a Claude.ai Project, or upload only the skills you need.

For the complete experience, upload `reference-curator-complete.md`, which contains all skills in one file.

@@ -0,0 +1,473 @@

# Reference Curator - Complete Skill Set

This document contains all 6 skills for curating, processing, and exporting reference documentation.

---

# 1. Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",
    "max_results_per_query": 20
}
```

### Step 2: Generate Search Queries

```python
def generate_queries(topic, vendors):
    queries = []
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:github.com/{vendor} {topic}")
    queries.append(f"site:arxiv.org {topic}")
    return queries
```

### Step 3: Validate and Score Sources

```python
def score_source(url, title):
    score = 0.0
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```

## Output Format

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```

---

# 2. Web Crawler Orchestrator

Manages crawling operations using Firecrawl MCP with rate limiting and format handling.

## Crawl Configuration

```yaml
firecrawl:
  rate_limit:
    requests_per_minute: 20
    concurrent_requests: 3
  default_options:
    timeout: 30000
    only_main_content: true
```

## Crawl Workflow

### Determine Crawl Strategy

```python
def select_strategy(url):
    if url.endswith('.pdf'):
        return 'pdf_extract'
    elif 'github.com' in url and '/blob/' in url:
        return 'raw_content'
    elif any(d in url for d in ['docs.', 'documentation']):
        return 'scrape'
    else:
        return 'scrape'
```

### Execute Firecrawl

```python
# Single page scrape
firecrawl_scrape(
    url="https://docs.anthropic.com/en/docs/prompt-engineering",
    formats=["markdown"],
    only_main_content=True,
    timeout=30000
)

# Multi-page crawl
firecrawl_crawl(
    url="https://docs.anthropic.com/en/docs/",
    max_depth=2,
    limit=50,
    formats=["markdown"]
)
```

### Rate Limiting

```python
class RateLimiter:
    def __init__(self, requests_per_minute=20):
        self.rpm = requests_per_minute
        self.request_times = deque()

    def wait_if_needed(self):
        now = time.time()
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        if len(self.request_times) >= self.rpm:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                time.sleep(wait_time)
        self.request_times.append(time.time())
```

## Error Handling

| Error | Action |
|-------|--------|
| Timeout | Retry once with 2x timeout |
| Rate limit (429) | Exponential backoff, max 3 retries |
| Not found (404) | Log and skip |
| Access denied (403) | Log, mark as `failed` |
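
The policy in the table can be sketched as a retry wrapper. Everything here is illustrative: `fetch` stands in for the actual crawl backend, and the exception classes are hypothetical stand-ins for the backend's 429/404/403 errors:

```python
import time

class RateLimitError(Exception): pass     # stand-in for HTTP 429
class NotFoundError(Exception): pass      # stand-in for HTTP 404
class AccessDeniedError(Exception): pass  # stand-in for HTTP 403

def crawl_with_retries(fetch, url, timeout=30, max_retries=3):
    for attempt in range(max_retries):
        try:
            return fetch(url, timeout=timeout)
        except TimeoutError:
            if attempt == 0:
                timeout *= 2              # retry once with doubled timeout
            else:
                return None
        except RateLimitError:
            time.sleep(2 ** attempt)      # exponential backoff, max 3 attempts
        except NotFoundError:
            return None                   # 404: log and skip
        except AccessDeniedError:
            return None                   # 403: log, mark as failed
    return None
```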
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# 3. Content Repository
|
||||||
|
|
||||||
|
Manages MySQL storage for the reference library. Handles document storage, version control, deduplication, and retrieval.
|
||||||
|
|
||||||
|
## Core Operations
|
||||||
|
|
||||||
|
**Store New Document:**
|
||||||
|
```python
|
||||||
|
def store_document(cursor, source_id, title, url, doc_type, raw_content_path):
|
||||||
|
sql = """
|
||||||
|
INSERT INTO documents (source_id, title, url, doc_type, crawl_date, crawl_status, raw_content_path)
|
||||||
|
VALUES (%s, %s, %s, %s, NOW(), 'completed', %s)
|
||||||
|
ON DUPLICATE KEY UPDATE
|
||||||
|
version = version + 1,
|
||||||
|
crawl_date = NOW(),
|
||||||
|
raw_content_path = VALUES(raw_content_path)
|
||||||
|
"""
|
||||||
|
cursor.execute(sql, (source_id, title, url, doc_type, raw_content_path))
|
||||||
|
return cursor.lastrowid
|
||||||
|
```
|
||||||
|
|
||||||
|
**Check Duplicate:**
```python
def is_duplicate(cursor, url):
    cursor.execute("SELECT doc_id FROM documents WHERE url_hash = SHA2(%s, 256)", (url,))
    return cursor.fetchone() is not None
```

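Since deduplication keys on `SHA2(url, 256)`, the same hash can be computed client-side to pre-filter a batch of URLs before touching the database. A small convenience sketch, not part of the schema; `filter_new_urls` is a hypothetical helper:

```python
import hashlib

def url_hash(url):
    # Matches MySQL's SHA2(url, 256): lowercase hex SHA-256 of the UTF-8 URL
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def filter_new_urls(urls, known_hashes):
    # known_hashes: set of url_hash values already present in `documents`
    return [u for u in urls if url_hash(u) not in known_hashes]
```
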
## Table Quick Reference

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `sources` | Authorized content sources | source_type, credibility_tier, vendor |
| `documents` | Crawled document metadata | url_hash (dedup), version, crawl_status |
| `distilled_content` | Processed summaries | review_status, compression_ratio |
| `review_logs` | QA decisions | quality_score, decision |
| `topics` | Taxonomy | topic_slug, parent_topic_id |

## Status Values

- **crawl_status:** `pending` → `completed` | `failed` | `stale`
- **review_status:** `pending` → `in_review` → `approved` | `needs_refactor` | `rejected`
- **decision:** `approve` | `refactor` | `deep_research` | `reject`

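A transition guard can be encoded directly from the arrows above. This sketch assumes the listed transitions are the only legal ones (the lists do not state whether, e.g., `needs_refactor` may return to `in_review`):

```python
# Legal state transitions, taken literally from the lists above
TRANSITIONS = {
    "crawl_status": {"pending": {"completed", "failed", "stale"}},
    "review_status": {
        "pending": {"in_review"},
        "in_review": {"approved", "needs_refactor", "rejected"},
    },
}

def can_transition(field, current, new):
    return new in TRANSITIONS.get(field, {}).get(current, set())
```
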
---

# 4. Content Distiller

Transforms raw crawled content into structured, high-quality reference materials.

## Distillation Goals

1. **Compress** - Reduce token count while preserving essential information
2. **Structure** - Organize content for easy retrieval and reference
3. **Extract** - Pull out code snippets, key concepts, and actionable patterns
4. **Annotate** - Add metadata for searchability and categorization

## Extract Key Components

**Extract Code Snippets:**

```python
import re

def extract_code_snippets(content):
    pattern = r'```(\w*)\n([\s\S]*?)```'
    snippets = []
    for match in re.finditer(pattern, content):
        snippets.append({
            "language": match.group(1) or "text",
            "code": match.group(2).strip(),
            # get_surrounding_text is a helper defined elsewhere in this skill
            "context": get_surrounding_text(content, match.start(), 200)
        })
    return snippets
```

**Extract Key Concepts:**

```python
def extract_key_concepts(content, title):
    prompt = f"""
    Analyze this document and extract key concepts:

    Title: {title}
    Content: {content[:8000]}

    Return JSON with:
    - concepts: [{{"term": "...", "definition": "...", "importance": "high|medium|low"}}]
    - techniques: [{{"name": "...", "description": "...", "use_case": "..."}}]
    - best_practices: ["..."]
    """
    return claude_extract(prompt)
```

## Summary Template

```markdown
# {title}

**Source:** {url}
**Type:** {source_type} | **Tier:** {credibility_tier}

## Executive Summary
{2-3 sentence overview}

## Key Concepts
{bulleted list of core concepts}

## Techniques & Patterns
{extracted techniques with use cases}

## Code Examples
{relevant code snippets}

## Best Practices
{actionable recommendations}
```

## Quality Metrics

| Metric | Target |
|--------|--------|
| Compression Ratio | 25-35% of original |
| Key Concept Coverage | ≥90% of important terms |
| Code Snippet Retention | 100% of relevant examples |

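These targets can be checked mechanically after each distillation pass. A simple sketch, using character counts as a proxy for tokens (an assumption; a real check would count tokens):

```python
def check_distillation_metrics(original, distilled, snippets_before, snippets_after):
    ratio = len(distilled) / len(original)
    return {
        "compression_ratio": ratio,
        "compression_ok": 0.25 <= ratio <= 0.35,     # 25-35% of original
        "snippet_retention": snippets_after / snippets_before if snippets_before else 1.0,
    }
```
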
---

# 5. Quality Reviewer

Evaluates distilled content, routes decisions, and triggers refactoring or additional research.

## Review Workflow

```
[Distilled Content]
        │
        ▼
┌─────────────────┐
│ Score Criteria  │ → accuracy, completeness, clarity, PE quality, usability
└─────────────────┘
        │
        ├── ≥ 0.85     → APPROVE       → markdown-exporter
        ├── 0.60-0.84  → REFACTOR      → content-distiller (with instructions)
        ├── 0.40-0.59  → DEEP_RESEARCH → web-crawler (with queries)
        └── < 0.40     → REJECT        → archive with reason
```

## Scoring Criteria

| Criterion | Weight | Checks |
|-----------|--------|--------|
| **Accuracy** | 0.25 | Factual correctness, up-to-date info, proper attribution |
| **Completeness** | 0.20 | Covers key concepts, includes examples, addresses edge cases |
| **Clarity** | 0.20 | Clear structure, concise language, logical flow |
| **PE Quality** | 0.25 | Demonstrates techniques, before/after examples, explains why |
| **Usability** | 0.10 | Easy to reference, searchable keywords, appropriate length |

## Calculate Final Score

```python
WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.20,
    "clarity": 0.20,
    "prompt_engineering_quality": 0.25,
    "usability": 0.10
}

def calculate_quality_score(assessment):
    return sum(
        assessment[criterion]["score"] * weight
        for criterion, weight in WEIGHTS.items()
    )
```

## Route Decision

```python
def determine_decision(score, assessment):
    if score >= 0.85:
        return "approve", None, None
    elif score >= 0.60:
        instructions = generate_refactor_instructions(assessment)
        return "refactor", instructions, None
    elif score >= 0.40:
        queries = generate_research_queries(assessment)
        return "deep_research", None, queries
    else:
        return "reject", f"Quality score {score:.2f} below minimum", None
```

## Prompt Engineering Quality Checklist

- [ ] Demonstrates specific techniques (CoT, few-shot, etc.)
- [ ] Shows before/after examples
- [ ] Explains *why* techniques work, not just *what*
- [ ] Provides actionable patterns
- [ ] Includes edge cases and failure modes
- [ ] References authoritative sources

---

# 6. Markdown Exporter

Exports approved content as structured markdown files for Claude Projects or fine-tuning.

## Export Structure

**Nested by Topic (recommended):**
```
exports/
├── INDEX.md
├── prompt-engineering/
│   ├── _index.md
│   ├── 01-chain-of-thought.md
│   └── 02-few-shot-prompting.md
├── claude-models/
│   ├── _index.md
│   └── 01-model-comparison.md
└── agent-building/
    └── 01-tool-use.md
```

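The nested layout can be produced with a small path helper. A sketch; the topic slug, sequence number, and title slug are assumed to be precomputed upstream:

```python
from pathlib import Path

def export_path(root, topic_slug, seq, title_slug):
    # e.g. exports/prompt-engineering/01-chain-of-thought.md
    return Path(root) / topic_slug / f"{seq:02d}-{title_slug}.md"
```
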
## Document File Template

```python
def generate_document_file(doc, include_metadata=True):
    content = []
    if include_metadata:
        content.append("---")
        content.append(f"title: {doc['title']}")
        content.append(f"source: {doc['url']}")
        content.append(f"vendor: {doc['vendor']}")
        content.append(f"tier: {doc['credibility_tier']}")
        content.append(f"quality_score: {doc['quality_score']:.2f}")
        content.append("---")
        content.append("")
    content.append(doc['structured_content'])
    return "\n".join(content)
```

## Fine-tuning Export (JSONL)

```python
import json

def export_fine_tuning_dataset(content_list, config):
    with open('fine_tuning.jsonl', 'w') as f:
        for doc in content_list:
            sample = {
                "messages": [
                    {"role": "system", "content": "You are an expert on AI and prompt engineering."},
                    {"role": "user", "content": f"Explain {doc['title']}"},
                    {"role": "assistant", "content": doc['structured_content']}
                ],
                "metadata": {
                    "source": doc['url'],
                    "topic": doc['topic_slug'],
                    "quality_score": doc['quality_score']
                }
            }
            f.write(json.dumps(sample) + '\n')
```

## Cross-Reference Generation

```python
def add_cross_references(doc, all_docs):
    related = []
    doc_concepts = set(c['term'].lower() for c in doc['key_concepts'])

    for other in all_docs:
        if other['doc_id'] == doc['doc_id']:
            continue
        other_concepts = set(c['term'].lower() for c in other['key_concepts'])
        overlap = len(doc_concepts & other_concepts)
        if overlap >= 2:
            related.append({
                "title": other['title'],
                "path": generate_relative_path(doc, other),
                "overlap": overlap
            })

    return sorted(related, key=lambda x: x['overlap'], reverse=True)[:5]
```

---

# Integration Flow

| From | Output | To |
|------|--------|-----|
| **reference-discovery** | URL manifest | web-crawler |
| **web-crawler** | Raw content + manifest | content-repository |
| **content-repository** | Document records | content-distiller |
| **content-distiller** | Distilled content | quality-reviewer |
| **quality-reviewer** (approve) | Approved IDs | markdown-exporter |
| **quality-reviewer** (refactor) | Instructions | content-distiller |
| **quality-reviewer** (deep_research) | Queries | web-crawler |

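The handoffs in the table above can be wired into a single driver loop. A hedged sketch: the callables stand in for the skills, and the two-tuple `(decision, payload)` review result is an assumption for illustration.

```python
def run_pipeline(url, crawl, distill, review, export, max_passes=3):
    raw = crawl(url)
    instructions = None
    for _ in range(max_passes):
        distilled = distill(raw, instructions)
        decision, payload = review(distilled)
        if decision == "approve":
            return export(distilled)
        if decision == "refactor":
            instructions = payload        # reviewer instructions back to the distiller
        elif decision == "deep_research":
            raw = crawl(payload)          # payload carries the research queries
            instructions = None
        else:                             # reject: archive with reason
            return None
    return None
```
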
@@ -717,6 +717,65 @@ EOF
post_install
}

# ============================================================================
# Export for Claude.ai Projects
# ============================================================================

export_claude_ai() {
    print_header
    echo -e "${BOLD}Export for Claude.ai Projects${NC}"
    echo ""

    local project_dir="$SCRIPT_DIR/claude-project"

    if [[ ! -d "$project_dir" ]]; then
        print_error "claude-project directory not found"
        echo "Expected: $project_dir"
        exit 1
    fi

    echo "Available files for Claude.ai Projects:"
    echo ""
    echo -e "  ${CYAN}Consolidated (single file):${NC}"
    echo "    reference-curator-complete.md - All 6 skills in one file"
    echo ""
    echo -e "  ${CYAN}Individual skills:${NC}"
    ls -1 "$project_dir"/*.md 2>/dev/null | while read -r file; do
        local filename=$(basename "$file")
        local size=$(du -h "$file" | cut -f1)
        if [[ "$filename" != "INDEX.md" && "$filename" != "reference-curator-complete.md" ]]; then
            echo "    $filename ($size)"
        fi
    done
    echo ""

    echo -e "${BOLD}Upload to Claude.ai:${NC}"
    echo ""
    echo "  1. Go to https://claude.ai"
    echo "  2. Create a new Project or open an existing one"
    echo "  3. Click 'Add to project knowledge'"
    echo "  4. Upload files from:"
    echo -e "     ${CYAN}$project_dir${NC}"
    echo ""
    echo "  Recommended: Upload 'reference-curator-complete.md' for the full skill set"
    echo ""

    if prompt_yes_no "Copy files to a different location?" "n"; then
        prompt_with_default "Destination directory" "$HOME/Desktop/reference-curator-claude-ai" "DEST_DIR"

        mkdir -p "$DEST_DIR"
        cp "$project_dir"/*.md "$DEST_DIR/"

        print_success "Files copied to $DEST_DIR"
        echo ""
        echo "Files ready for upload:"
        ls -la "$DEST_DIR"/*.md
    fi

    echo ""
    echo -e "${GREEN}Done!${NC} Upload the files to your Claude.ai Project."
}

# ============================================================================
# Entry Point
# ============================================================================
@@ -731,6 +790,9 @@ case "${1:-}" in

    --minimal)
        install_minimal
        ;;
    --claude-ai)
        export_claude_ai
        ;;
    --help|-h)
        echo "Reference Curator - Portable Installation Script"
        echo ""
@@ -738,6 +800,7 @@ case "${1:-}" in

        echo "  ./install.sh              Interactive installation"
        echo "  ./install.sh --check      Check installation status"
        echo "  ./install.sh --minimal    Firecrawl-only mode (no MySQL)"
        echo "  ./install.sh --claude-ai  Export skills for Claude.ai Projects"
        echo "  ./install.sh --uninstall  Remove installation"
        echo "  ./install.sh --help       Show this help"
        ;;
|
||||||
|
|||||||
Reference in New Issue
Block a user