feat(reference-curator): Add Claude.ai Projects export format

Add claude-project/ folder with skill files formatted for upload to
Claude.ai Projects (web interface):
- reference-curator-complete.md: All 6 skills consolidated
- INDEX.md: Overview and workflow documentation
- Individual skill files (01-06) without YAML frontmatter

Add --claude-ai option to install.sh:
- Lists available files for upload
- Optionally copies to custom destination directory
- Provides upload instructions for Claude.ai

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,184 @@
# Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",  # past_week, past_month, past_year, any
    "max_results_per_query": 20,
}
```
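A config like this is easy to mistype, so it may help to check it before any queries are generated. The following is a minimal sketch, assuming the five keys shown above are all required and that `freshness` accepts only the four values listed in the comment; `validate_search_config`, `ALLOWED_FRESHNESS`, and `REQUIRED_KEYS` are hypothetical names, not part of the skill.

```python
# Hypothetical helper: the key set and allowed values are read off the example config.
ALLOWED_FRESHNESS = {"past_week", "past_month", "past_year", "any"}
REQUIRED_KEYS = {"topic", "vendors", "source_types", "freshness", "max_results_per_query"}


def validate_search_config(config):
    """Return a list of problems found in a search config (empty if none)."""
    problems = []

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    if config.get("freshness") not in ALLOWED_FRESHNESS:
        problems.append(f"unknown freshness: {config.get('freshness')!r}")

    if not config.get("vendors"):
        problems.append("vendors must be a non-empty list")

    max_results = config.get("max_results_per_query")
    if not isinstance(max_results, int) or max_results <= 0:
        problems.append("max_results_per_query must be a positive integer")

    return problems
```

Returning a list of problems rather than raising on the first one lets the caller report every issue in a single pass.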

### Step 2: Generate Search Queries

For a given topic, generate targeted queries:

```python
def generate_queries(topic, vendors):
    queries = []

    # Official documentation queries
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")

    # Engineering blog queries
    for vendor in vendors:
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")

    # GitHub queries
    for vendor in vendors:
        queries.append(f"site:github.com/{vendor} {topic}")

    # Research queries
    queries.append(f"site:arxiv.org {topic}")

    return queries
```
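The `docs.{vendor}.com` pattern does not hold for every vendor: the tables in this file list OpenAI's documentation at platform.openai.com/docs and Google's at ai.google.dev/docs. One possible alternative is to build queries from an explicit per-vendor map. In this sketch `VENDOR_SITES` and `generate_queries_from_map` are hypothetical names, and the domains are copied from the Known Authoritative Sources table.

```python
# Hypothetical per-vendor site map, copied from the Known Authoritative Sources table.
VENDOR_SITES = {
    "anthropic": [
        "docs.anthropic.com",
        "docs.claude.com",
        "anthropic.com/news",
        "github.com/anthropics",
    ],
    "openai": ["platform.openai.com/docs", "openai.com/blog", "github.com/openai"],
    "google": ["ai.google.dev/docs", "blog.google/technology/ai", "github.com/google"],
}


def generate_queries_from_map(topic, vendors, vendor_sites=VENDOR_SITES):
    """Build one site-restricted query per known site, plus an arXiv query."""
    queries = []
    for vendor in vendors:
        # Vendors missing from the map contribute no queries.
        for site in vendor_sites.get(vendor, []):
            queries.append(f"site:{site} {topic}")
    queries.append(f"site:arxiv.org {topic}")
    return queries
```

The trade-off is that new vendors need a map entry before they produce any queries, whereas the pattern-based version guesses domains that may not exist.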

### Step 3: Execute Search

Use the web search tool for each query:

```python
def execute_discovery(queries):
    # web_search is the web search tool; deduplicate_by_url is not defined in this file.
    results = []
    for query in queries:
        search_results = web_search(query)
        for result in search_results:
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query,
            })
    return deduplicate_by_url(results)
```
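`deduplicate_by_url` is called above but not defined in this file. A minimal sketch, assuming exact string comparison on the `url` field and that the first result seen for a URL wins (the behavior of the real helper is not shown here):

```python
def deduplicate_by_url(results):
    """Keep the first result seen for each URL, preserving input order."""
    seen = set()
    unique = []
    for result in results:
        url = result["url"]
        if url in seen:
            continue  # a later query returned a URL we already have
        seen.add(url)
        unique.append(result)
    return unique
```

Because the first occurrence is kept, the surviving `query_used` value is the earliest query that surfaced the URL.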

### Step 4: Validate and Score Sources

```python
def score_source(url, title):
    score = 0.0

    # Domain credibility
    # Documentation hosts listed in the source tables
    official_docs = ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com',
                     'platform.openai.com/docs', 'ai.google.dev/docs']
    if any(d in url for d in official_docs):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community

    # Freshness signals (from the title only)
    if any(year in title for year in ['2025', '2024']):
        score += 0.20
    elif '2023' in title:
        score += 0.10

    # Relevance signals
    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
        score += 0.15

    return min(score, 1.0)


def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```
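Step 5 reads `score` and `tier` fields from each result, so something has to attach them after scoring. One way this could look, as a sketch: `annotate_results` is a hypothetical helper that takes the scoring and tiering functions as parameters, so it does not depend on any particular implementation of them.

```python
def annotate_results(results, score_fn, tier_fn):
    """Return copies of the results with 'score' and 'tier' fields added.

    score_fn(url, title) -> float and tier_fn(score) -> str are supplied by
    the caller; the input dictionaries are left unmodified.
    """
    annotated = []
    for result in results:
        score = score_fn(result["url"], result["title"])
        annotated.append({**result, "score": score, "tier": tier_fn(score)})
    return annotated
```

With the functions defined in this step, the call would be `annotate_results(results, score_source, assign_credibility_tier)`.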

### Step 5: Output URL Manifest

```python
from datetime import datetime


def create_manifest(scored_results, topic):
    # infer_source_type and infer_vendor are helpers not defined in this file.
    manifest = {
        "discovery_date": datetime.now().isoformat(),
        "topic": topic,
        "total_urls": len(scored_results),
        "urls": [],
    }

    # Highest-scoring sources first
    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
        manifest["urls"].append({
            "url": result["url"],
            "title": result["title"],
            "credibility_tier": result["tier"],
            "credibility_score": result["score"],
            "source_type": infer_source_type(result["url"]),
            "vendor": infer_vendor(result["url"]),
        })

    return manifest
```
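`infer_source_type` and `infer_vendor` are used above without a definition. The sketch below shows one heuristic way they might work, based on the hostname and path. The `official_docs`, `engineering_blog`, and `github_repo` labels come from the Step 1 config; `research_paper`, `community`, and `unknown` are assumed labels, and `VENDOR_DOMAINS`, `GITHUB_OWNERS`, and `_host_and_path` are hypothetical names.

```python
from urllib.parse import urlparse

# Assumed mappings, derived from the domains listed in this file.
VENDOR_DOMAINS = {
    "anthropic": ("anthropic.com", "claude.com"),
    "openai": ("openai.com",),
    "google": ("google.dev", "blog.google"),
}
GITHUB_OWNERS = {"anthropics": "anthropic", "openai": "openai", "google": "google"}


def _host_and_path(url):
    parsed = urlparse(url)
    return (parsed.hostname or "").lower(), parsed.path.lower()


def infer_vendor(url):
    host, path = _host_and_path(url)
    if host == "github.com":
        owner = path.strip("/").split("/")[0]
        return GITHUB_OWNERS.get(owner, "unknown")
    for vendor, domains in VENDOR_DOMAINS.items():
        # Match the domain itself or any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in domains):
            return vendor
    return "unknown"


def infer_source_type(url):
    host, path = _host_and_path(url)
    if host == "github.com":
        return "github_repo"
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        return "research_paper"
    if host.startswith("docs.") or path.startswith("/docs"):
        return "official_docs"
    if host.startswith("blog.") or path.startswith(("/blog", "/news")):
        return "engineering_blog"
    return "community"
```

Matching on the parsed hostname avoids the false positives that plain substring checks on the full URL can produce, though `infer_source_type` alone will still label any `docs.*` host as documentation regardless of vendor.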

## Output Format

Discovery produces a JSON manifest for the crawler:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
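A consumer of the manifest may want to confirm its shape before crawling. The following sketch assumes the top-level keys and per-URL fields shown in the example are all required; `load_manifest` and `REQUIRED_URL_FIELDS` are hypothetical names, and how `web-crawler-orchestrator` actually reads the manifest is not specified here.

```python
import json

# Field names taken from the example manifest.
REQUIRED_TOP_LEVEL = ("discovery_date", "topic", "total_urls", "urls")
REQUIRED_URL_FIELDS = {
    "url", "title", "credibility_tier", "credibility_score", "source_type", "vendor",
}


def load_manifest(text):
    """Parse a manifest JSON string and raise ValueError if fields are missing."""
    manifest = json.loads(text)
    for key in REQUIRED_TOP_LEVEL:
        if key not in manifest:
            raise ValueError(f"manifest missing key: {key}")
    for i, entry in enumerate(manifest["urls"]):
        missing = REQUIRED_URL_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"urls[{i}] missing fields: {sorted(missing)}")
    return manifest
```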

## Known Authoritative Sources

Pre-validated sources for common topics:

| Vendor | Documentation | Blog/News | GitHub |
|--------|--------------|-----------|--------|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |

## Integration

**Output:** URL manifest JSON → `web-crawler-orchestrator`

**Database:** Register new sources in the `sources` table via `content-repository`

## Deduplication

Before outputting, deduplicate URLs:

- Normalize URLs (remove trailing slashes and query parameters)
- Check against the existing `documents` table via `content-repository`
- Merge duplicate entries, keeping the highest credibility score
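
The normalization and merge steps above can be sketched as follows. This is an illustration under stated assumptions: results carry a `score` field as in Step 5, lowercasing the scheme and host and dropping the fragment are additions beyond the two normalizations listed, and `normalize_url` and `merge_duplicates` are hypothetical names. The check against the `documents` table is left out because the `content-repository` interface is not shown in this file.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Strip the query string, fragment, and trailing slashes; lowercase scheme and host."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def merge_duplicates(results):
    """Collapse results that normalize to the same URL, keeping the highest score."""
    best = {}
    for result in results:
        key = normalize_url(result["url"])
        current = best.get(key)
        if current is None or result["score"] > current["score"]:
            # Store the normalized URL so downstream steps see one canonical form.
            best[key] = {**result, "url": key}
    return list(best.values())
```

Dropping query parameters can merge pages that are genuinely distinct on sites that route by query string, so a stricter rule may be needed for such sources.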