# Reference Discovery

Searches for and identifies authoritative sources of reference material. Validates source credibility, prioritizes results by relevance, and outputs a curated URL list with metadata.

## Trigger Keywords
"find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources"

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers cited by other work |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |
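
As a rough illustration of how the hierarchy could be applied to a URL, the sketch below maps a URL to a tier number. The lookup tables are hypothetical and only mirror the examples in the table; `classify_tier` is an illustrative name, not part of the skill's scripts.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables derived from the examples in the tier table.
TIER1_HOSTS = {"docs.anthropic.com", "docs.claude.com"}
TIER1_PREFIXES = (
    "platform.openai.com/docs",
    "anthropic.com/news",
    "openai.com/blog",
    "github.com/anthropics/",
    "github.com/openai/",
)
TIER2_HOSTS = {"arxiv.org"}


def classify_tier(url: str) -> int:
    """Return 1, 2, or 3 for the tier a URL falls into (illustrative rules only)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if host.startswith("www."):
        host = host[4:]
    location = host + parsed.path
    if host in TIER1_HOSTS or any(location.startswith(p) for p in TIER1_PREFIXES):
        return 1
    if host in TIER2_HOSTS:
        return 2
    # Anything unrecognized is treated as general community content.
    return 3
```

Prefix matching is deliberately simple here; a real classifier would likely need a verified allowlist to separate Tier 2 community guides from Tier 3 content.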

## Workflow

### Step 1: Define Search Scope
Gather the topic, target vendors, and freshness requirements from the user's input.

### Step 2: Execute Web Search
Use the WebSearch tool with targeted, site-restricted queries, substituting the search topic for `{topic}`:
```
site:docs.anthropic.com {topic}
site:github.com/anthropics {topic}
site:arxiv.org {topic}
```
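
The query templates above could be generated from the topic and vendor list along these lines. This is a minimal sketch: `VENDOR_SITES`, `COMMON_SITES`, and `build_queries` are hypothetical names, and only the sites shown in the templates are included.

```python
# Hypothetical vendor-to-site mapping; the sites come from the query templates.
VENDOR_SITES = {
    "anthropic": ["docs.anthropic.com", "github.com/anthropics"],
}
# Vendor-neutral sites searched for every topic (assumption).
COMMON_SITES = ["arxiv.org"]


def build_queries(topic: str, vendors: list[str]) -> list[str]:
    """Build one site-restricted query string per site."""
    sites: list[str] = []
    for vendor in vendors:
        # Unknown vendors contribute no sites rather than raising.
        sites.extend(VENDOR_SITES.get(vendor.lower(), []))
    sites.extend(COMMON_SITES)
    return [f"site:{site} {topic}" for site in sites]


queries = build_queries("prompt engineering", ["anthropic"])
```

Each resulting string would then be passed to the WebSearch tool as a separate query.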

### Step 3: Score and Validate Sources
Apply credibility scoring. Each signal contributes points within the range shown:
- Domain credibility (0.10 - 0.40)
- Freshness signals (0.10 - 0.20)
- Relevance signals (0.15)
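
One possible reading of these signals is an additive score, sketched below. The point values per tier, the 365-day freshness cutoff, and the title-match relevance check are all assumptions chosen to fit the stated ranges; the skill's actual rules may differ.

```python
from datetime import date

# Assumed mapping of tier number to domain credibility points (range 0.10 to 0.40).
DOMAIN_POINTS = {1: 0.40, 2: 0.25, 3: 0.10}


def credibility_score(tier: int, published: date, today: date,
                      title: str, topic: str) -> float:
    """Sum the three signals into one score (assumed additive model)."""
    domain = DOMAIN_POINTS.get(tier, 0.10)
    # Assumption: content published within the last year counts as fresh.
    age_days = (today - published).days
    freshness = 0.20 if age_days <= 365 else 0.10
    # Assumption: relevance is awarded when the topic appears in the title.
    relevance = 0.15 if topic.lower() in title.lower() else 0.0
    return round(domain + freshness + relevance, 2)
```

Under these assumed point values the maximum is 0.75. The manifest example in Step 4 shows 0.85, so the actual scoring presumably includes signals or weights not listed here.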

### Step 4: Output URL Manifest
Generate a JSON manifest for the crawler skill:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```
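
A manifest in this shape could be assembled as follows. This is a sketch only: `build_manifest` and `REQUIRED_KEYS` are hypothetical names, and the required-field check assumes every entry carries exactly the fields shown in the example.

```python
import json
from datetime import datetime

# Fields each URL entry carries in the example manifest.
REQUIRED_KEYS = {
    "url", "title", "credibility_tier",
    "credibility_score", "source_type", "vendor",
}


def build_manifest(topic: str, entries: list[dict], now: datetime) -> str:
    """Serialize scored entries into the manifest JSON shown in the example."""
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {entry.get('url')!r} is missing {sorted(missing)}")
    manifest = {
        # Second-resolution ISO 8601 timestamp, matching the example format.
        "discovery_date": now.replace(microsecond=0).isoformat(),
        "topic": topic,
        "total_urls": len(entries),
        "urls": entries,
    }
    return json.dumps(manifest, indent=2)
```

Deriving `total_urls` from the entry list keeps the count and the `urls` array from drifting apart.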

## Scripts

### `discover_sources.py`
Main discovery script. Usage:
```bash
python scripts/discover_sources.py --topic "prompt engineering" --vendors anthropic,openai --output manifest.json
```
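
For orientation, the flags in the command above could be parsed with `argparse` roughly as shown. This is not the script's actual source: the defaults and the `parse_args` helper are assumptions, and only the three flags from the usage line are covered.

```python
import argparse
from typing import Optional


def _split_vendors(value: str) -> list[str]:
    """Turn 'anthropic,openai' into ['anthropic', 'openai']."""
    return [v.strip() for v in value.split(",") if v.strip()]


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Discover reference sources for a topic.")
    parser.add_argument("--topic", required=True, help="Topic to search for")
    # Assumption: omitting --vendors means no vendor-specific sites.
    parser.add_argument("--vendors", type=_split_vendors, default=[],
                        help="Comma-separated vendor names")
    # Assumption: the manifest path defaults to manifest.json.
    parser.add_argument("--output", default="manifest.json",
                        help="Path of the manifest file to write")
    return parser.parse_args(argv)


args = parse_args(["--topic", "prompt engineering",
                   "--vendors", "anthropic,openai",
                   "--output", "manifest.json"])
```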

## Output
- `manifest.json`: handed off to `02-web-crawler-orchestrator`
- New sources: registered in the `sources` table via `03-content-repository`

## Deduplication
Before writing the manifest:
- Normalize URLs (strip trailing slashes and query parameters)
- Check each URL against the existing `documents` table
- Merge duplicates, keeping the entry with the highest credibility score
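
The deduplication steps can be sketched as below. The function names are hypothetical, the `existing` set stands in for a lookup against the `documents` table (whose schema is not described here), and dropping URL fragments is an added assumption beyond the rules listed.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Strip the trailing slash and query string from a URL."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    # Assumption: fragments are dropped along with query parameters.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def merge_duplicates(entries: list[dict], existing: frozenset = frozenset()) -> list[dict]:
    """Keep one entry per normalized URL, preferring the highest score.

    `existing` holds normalized URLs already stored; it is a stand-in
    for a query against the `documents` table.
    """
    best: dict[str, dict] = {}
    for entry in entries:
        key = normalize_url(entry["url"])
        if key in existing:
            continue  # already crawled or stored, so skip it
        current = best.get(key)
        if current is None or entry["credibility_score"] > current["credibility_score"]:
            best[key] = {**entry, "url": key}
    return list(best.values())
```

Stripping every query parameter can merge pages that differ only by query string, so sites that route content through parameters may need an exception.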