# Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",  # past_week, past_month, past_year, any
    "max_results_per_query": 20
}
```

### Step 2: Generate Search Queries

For a given topic, generate targeted queries:

```python
def generate_queries(topic, vendors):
    queries = []
    
    # Official documentation queries
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")
    
    # Engineering blog queries
    for vendor in vendors:
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")
    
    # GitHub queries
    for vendor in vendors:
        queries.append(f"site:github.com/{vendor} {topic}")
    
    # Research queries
    queries.append(f"site:arxiv.org {topic}")
    
    return queries
```

### Step 3: Execute Search

Use web search tool for each query:

```python
def execute_discovery(queries):
    results = []
    for query in queries:
        search_results = web_search(query)
        for result in search_results:
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query
            })
    return deduplicate_by_url(results)
```

### Step 4: Validate and Score Sources

```python
def score_source(url, title):
    score = 0.0
    
    # Domain credibility
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community
    
    # Freshness signals (from title/snippet)
    if any(year in title for year in ['2025', '2024']):
        score += 0.20
    elif any(year in title for year in ['2023']):
        score += 0.10
    
    # Relevance signals
    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
        score += 0.15
    
    return min(score, 1.0)

def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```

### Step 5: Output URL Manifest

```python
def create_manifest(scored_results, topic):
    manifest = {
        "discovery_date": datetime.now().isoformat(),
        "topic": topic,
        "total_urls": len(scored_results),
        "urls": []
    }
    
    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
        manifest["urls"].append({
            "url": result["url"],
            "title": result["title"],
            "credibility_tier": result["tier"],
            "credibility_score": result["score"],
            "source_type": infer_source_type(result["url"]),
            "vendor": infer_vendor(result["url"])
        })
    
    return manifest
```

## Output Format

Discovery produces a JSON manifest for the crawler:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```

## Known Authoritative Sources

Pre-validated sources for common topics:

| Vendor | Documentation | Blog/News | GitHub |
|--------|--------------|-----------|--------|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |

## Integration

**Output:** URL manifest JSON → `web-crawler-orchestrator`

**Database:** Register new sources in `sources` table via `content-repository`

## Deduplication

Before outputting, deduplicate URLs:
- Normalize URLs (remove trailing slashes, query params)
- Check against existing `documents` table via `content-repository`
- Merge duplicate entries, keeping highest credibility score