# Reference Discovery Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling. ## Source Priority Hierarchy | Tier | Source Type | Examples | |------|-------------|----------| | **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs | | **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog | | **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* | | **Tier 2** | Research papers | arxiv.org, papers with citations | | **Tier 2** | Verified community guides | Cookbook examples, official tutorials | | **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow | ## Discovery Workflow ### Step 1: Define Search Scope ```python search_config = { "topic": "prompt engineering", "vendors": ["anthropic", "openai", "google"], "source_types": ["official_docs", "engineering_blog", "github_repo"], "freshness": "past_year", # past_week, past_month, past_year, any "max_results_per_query": 20 } ``` ### Step 2: Generate Search Queries For a given topic, generate targeted queries: ```python def generate_queries(topic, vendors): queries = [] # Official documentation queries for vendor in vendors: queries.append(f"site:docs.{vendor}.com {topic}") queries.append(f"site:{vendor}.com/docs {topic}") # Engineering blog queries for vendor in vendors: queries.append(f"site:{vendor}.com/blog {topic}") queries.append(f"site:{vendor}.com/news {topic}") # GitHub queries for vendor in vendors: queries.append(f"site:github.com/{vendor} {topic}") # Research queries queries.append(f"site:arxiv.org {topic}") return queries ``` ### Step 3: Execute Search Use web search tool for each query: ```python def execute_discovery(queries): results = [] for query in queries: search_results = web_search(query) for result in search_results: results.append({ "url": result.url, "title": result.title, "snippet": result.snippet, "query_used": query }) return deduplicate_by_url(results) ``` ### Step 4: Validate and Score Sources ```python def score_source(url, title): score = 0.0 # Domain credibility if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']): score += 0.40 # Tier 1 official docs elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']): score += 0.30 # Tier 1 official blog/news elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']): score += 0.30 # Tier 1 official repos elif 'arxiv.org' in url: score += 0.20 # Tier 2 research else: score += 0.10 # Tier 3 community # Freshness signals (from title/snippet) if any(year in title for year in ['2025', '2024']): score += 0.20 elif any(year in title for year in ['2023']): score += 0.10 # Relevance signals if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']): score += 0.15 return min(score, 1.0) def assign_credibility_tier(score): if score >= 0.60: return 'tier1_official' elif score >= 0.40: return 'tier2_verified' else: return 'tier3_community' ``` ### Step 5: Output URL Manifest ```python def create_manifest(scored_results, topic): manifest = { "discovery_date": datetime.now().isoformat(), "topic": topic, "total_urls": len(scored_results), "urls": [] } for result in sorted(scored_results, key=lambda x: x['score'], reverse=True): manifest["urls"].append({ "url": result["url"], "title": result["title"], "credibility_tier": result["tier"], "credibility_score": result["score"], "source_type": infer_source_type(result["url"]), "vendor": infer_vendor(result["url"]) }) return manifest ``` ## Output Format Discovery produces a JSON manifest for the crawler: ```json { "discovery_date": "2025-01-28T10:30:00", "topic": "prompt engineering", "total_urls": 15, "urls": [ { "url": "https://docs.anthropic.com/en/docs/prompt-engineering", "title": "Prompt Engineering Guide", "credibility_tier": "tier1_official", "credibility_score": 0.85, "source_type": "official_docs", "vendor": "anthropic" } ] } ``` ## Known Authoritative Sources Pre-validated sources for common topics: | Vendor | Documentation | Blog/News | GitHub | |--------|--------------|-----------|--------| | Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics | | OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai | | Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google | ## Integration **Output:** URL manifest JSON → `web-crawler-orchestrator` **Database:** Register new sources in `sources` table via `content-repository` ## Deduplication Before outputting, deduplicate URLs: - Normalize URLs (remove trailing slashes, query params) - Check against existing `documents` table via `content-repository` - Merge duplicate entries, keeping highest credibility score