# Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",  # one of: past_week, past_month, past_year, any
    "max_results_per_query": 20,
}
```

### Step 2: Generate Search Queries

For a given topic, generate targeted queries:

```python
# GitHub organization names can differ from the vendor name
# (this document lists github.com/anthropics for Anthropic).
GITHUB_ORGS = {"anthropic": "anthropics"}


def generate_queries(topic, vendors):
    """Build site-restricted search queries for a topic across vendors."""
    queries = []

    # Official documentation queries
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")

    # Engineering blog queries
    for vendor in vendors:
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")

    # GitHub queries (fall back to the vendor name when no override is listed)
    for vendor in vendors:
        org = GITHUB_ORGS.get(vendor, vendor)
        queries.append(f"site:github.com/{org} {topic}")

    # Research queries
    queries.append(f"site:arxiv.org {topic}")

    return queries
```

### Step 3: Execute Search

Run each query through the web search tool:

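```python
def execute_discovery(queries):
    """Run every query and collect the hits, tagged with the query that found them."""
    results = []
    for query in queries:
        # web_search is the agent's web search tool; each hit is assumed to
        # expose .url, .title, and .snippet attributes.
        search_results = web_search(query)
        for result in search_results:
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query
            })
    # deduplicate_by_url is a helper that is not defined here; see Deduplication.
    return deduplicate_by_url(results)
```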

### Step 4: Validate and Score Sources

```python
def score_source(url, title):
    """Score a search hit from 0.0 to 1.0 using domain, freshness, and relevance signals."""
    score = 0.0

    # Domain credibility (documentation hosts match the Known Authoritative Sources table)
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com',
                              'platform.openai.com/docs', 'ai.google.dev/docs']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community

    # Freshness signals (year mentioned in the title)
    if any(year in title for year in ['2025', '2024']):
        score += 0.20
    elif '2023' in title:
        score += 0.10

    # Relevance signals
    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
        score += 0.15

    return min(score, 1.0)


def assign_credibility_tier(score):
    """Map a numeric score to a credibility tier label."""
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```
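Step 5 reads `result["score"]` and `result["tier"]`, so the search results need those fields attached before the manifest is built. A minimal sketch of that glue step follows. The function name `attach_scores` and its callback parameters are hypothetical; the scoring and tier functions are passed in so the sketch stays self-contained.

```python
from typing import Callable, Dict, List


def attach_scores(
    results: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
    tier_fn: Callable[[float], str],
) -> List[dict]:
    """Return copies of the search results with 'score' and 'tier' added.

    Hypothetical helper: score_fn(url, title) and tier_fn(score) stand in for
    the scoring and tier functions of Step 4.
    """
    scored = []
    for result in results:
        # Round first so float noise (e.g. 0.6000000000000001) cannot affect
        # which tier threshold the score falls under.
        score = round(score_fn(result["url"], result["title"]), 2)
        scored.append({**result, "score": score, "tier": tier_fn(score)})
    return scored
```

With the Step 4 functions in scope, a call would look like `attach_scores(results, score_source, assign_credibility_tier)`.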

### Step 5: Output URL Manifest

```python
from datetime import datetime


def create_manifest(scored_results, topic):
    """Build the URL manifest, ordered from highest to lowest credibility score."""
    manifest = {
        "discovery_date": datetime.now().isoformat(),
        "topic": topic,
        "total_urls": len(scored_results),
        "urls": []
    }

    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
        manifest["urls"].append({
            "url": result["url"],
            "title": result["title"],
            "credibility_tier": result["tier"],
            "credibility_score": result["score"],
            # infer_source_type and infer_vendor are helpers that are not defined here
            "source_type": infer_source_type(result["url"]),
            "vendor": infer_vendor(result["url"])
        })

    return manifest
```
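`create_manifest` calls `infer_source_type` and `infer_vendor`, which are not shown. One possible shape for them is sketched below, assuming URLs carry a scheme (`https://...`) and using only the domains named in this document. The lookup tables, the `"unknown"` fallback, and the `"research_paper"` and `"community"` labels are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables built from the vendor domains listed in this document.
VENDOR_DOMAINS = {
    "anthropic": ("anthropic.com", "claude.com"),
    "openai": ("openai.com",),
    "google": ("google.dev", "blog.google"),
}
GITHUB_ORGS = {"anthropics": "anthropic", "openai": "openai", "google": "google"}


def _host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def infer_vendor(url):
    """Return the vendor name for a URL, or 'unknown' (assumed fallback)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if _host_matches(host, "github.com"):
        # The first path segment of a GitHub URL is the organization.
        segments = [s for s in parsed.path.split("/") if s]
        org = segments[0].lower() if segments else ""
        return GITHUB_ORGS.get(org, "unknown")
    for vendor, domains in VENDOR_DOMAINS.items():
        if any(_host_matches(host, d) for d in domains):
            return vendor
    return "unknown"


def infer_source_type(url):
    """Classify a URL by host and path; labels beyond the Step 1 ones are assumed."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    if _host_matches(host, "github.com"):
        return "github_repo"
    if _host_matches(host, "arxiv.org"):
        return "research_paper"
    if infer_vendor(url) != "unknown":
        if host.startswith("docs.") or path.startswith("/docs"):
            return "official_docs"
        if host.startswith("blog.") or path.startswith(("/blog", "/news")):
            return "engineering_blog"
    return "community"
```

Matching on the parsed hostname rather than on substrings of the whole URL avoids false positives such as a community post whose path happens to contain a vendor domain.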

## Output Format

Discovery produces a JSON manifest for the crawler:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
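Before handing the manifest to the crawler, it may help to check that it has the shape shown above. The sketch below is one way to do that with plain Python. `validate_manifest` is a hypothetical helper, and the required keys and tier labels are taken from this document rather than from a published schema.

```python
REQUIRED_TOP_LEVEL = ("discovery_date", "topic", "total_urls", "urls")
REQUIRED_ENTRY = (
    "url", "title", "credibility_tier",
    "credibility_score", "source_type", "vendor",
)
KNOWN_TIERS = {"tier1_official", "tier2_verified", "tier3_community"}


def validate_manifest(manifest):
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in manifest]

    for index, entry in enumerate(manifest.get("urls", [])):
        for key in REQUIRED_ENTRY:
            if key not in entry:
                problems.append(f"urls[{index}] missing key: {key}")

        tier = entry.get("credibility_tier")
        if tier is not None and tier not in KNOWN_TIERS:
            problems.append(f"urls[{index}] unknown tier: {tier}")

        score = entry.get("credibility_score")
        if score is not None and not 0.0 <= score <= 1.0:
            problems.append(f"urls[{index}] score out of range: {score}")

    return problems
```

A manifest loaded with `json.load` can be passed in directly; the function reports every problem it finds instead of stopping at the first one.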

## Known Authoritative Sources

Pre-validated sources for common topics:

| Vendor | Documentation | Blog/News | GitHub |
|--------|--------------|-----------|--------|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |

## Integration

**Output:** URL manifest JSON → `web-crawler-orchestrator`

**Database:** Register new sources in `sources` table via `content-repository`

## Deduplication

Before writing the manifest, deduplicate URLs:

- Normalize URLs (strip trailing slashes and query parameters)
- Check against the existing `documents` table via `content-repository`
- Merge duplicate entries, keeping the highest credibility score
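
The normalization and merge steps can be sketched as follows. This is a minimal illustration under stated assumptions: `normalize_url` and `merge_duplicates` are hypothetical names, entries are manifest-style dicts with a `credibility_score` field, and the check against the `documents` table is left out because it depends on `content-repository`.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop query string and fragment, strip trailing slashes."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def merge_duplicates(entries):
    """Keep one entry per normalized URL, preferring the highest credibility score."""
    best = {}
    for entry in entries:
        key = normalize_url(entry["url"])
        current = best.get(key)
        if current is None or entry["credibility_score"] > current["credibility_score"]:
            # Store the normalized URL so downstream consumers see one canonical form.
            best[key] = {**entry, "url": key}
    return list(best.values())
```

Dropping query parameters is lossy: sites that select a page through a query string (for example `?id=...`) would collapse into a single entry, so an allowlist of significant parameters may be needed for such hosts.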