---
name: reference-discovery
description: Search and identify authoritative sources for reference materials. Validates source credibility, prioritizes by relevance, and outputs curated URL lists with metadata. Triggers on "find references", "search documentation", "discover sources", "find authoritative materials", "research topic sources".
---

# Reference Discovery

Searches for authoritative sources, validates credibility, and produces curated URL lists for crawling.

## Source Priority Hierarchy

| Tier | Source Type | Examples |
|------|-------------|----------|
| **Tier 1** | Official documentation | docs.anthropic.com, docs.claude.com, platform.openai.com/docs |
| **Tier 1** | Engineering blogs (official) | anthropic.com/news, openai.com/blog |
| **Tier 1** | Official GitHub repos | github.com/anthropics/*, github.com/openai/* |
| **Tier 2** | Research papers | arxiv.org, papers with citations |
| **Tier 2** | Verified community guides | Cookbook examples, official tutorials |
| **Tier 3** | Community content | Blog posts, tutorials, Stack Overflow |

## Discovery Workflow

### Step 1: Define Search Scope

```python
search_config = {
    "topic": "prompt engineering",
    "vendors": ["anthropic", "openai", "google"],
    "source_types": ["official_docs", "engineering_blog", "github_repo"],
    "freshness": "past_year",  # past_week, past_month, past_year, any
    "max_results_per_query": 20
}
```

### Step 2: Generate Search Queries

For a given topic, generate targeted queries:

```python
def generate_queries(topic, vendors):
    queries = []

    # Official documentation queries
    for vendor in vendors:
        queries.append(f"site:docs.{vendor}.com {topic}")
        queries.append(f"site:{vendor}.com/docs {topic}")

    # Engineering blog queries
    for vendor in vendors:
        queries.append(f"site:{vendor}.com/blog {topic}")
        queries.append(f"site:{vendor}.com/news {topic}")

    # GitHub queries
    for vendor in vendors:
        queries.append(f"site:github.com/{vendor} {topic}")

    # Research queries
    queries.append(f"site:arxiv.org {topic}")

    return queries
```

### Step 3: Execute Search

Use the web search tool for each query:

```python
def execute_discovery(queries):
    results = []
    for query in queries:
        search_results = web_search(query)
        for result in search_results:
            results.append({
                "url": result.url,
                "title": result.title,
                "snippet": result.snippet,
                "query_used": query
            })
    return deduplicate_by_url(results)
```

### Step 4: Validate and Score Sources

```python
def score_source(url, title):
    score = 0.0

    # Domain credibility
    if any(d in url for d in ['docs.anthropic.com', 'docs.claude.com', 'docs.openai.com']):
        score += 0.40  # Tier 1 official docs
    elif any(d in url for d in ['anthropic.com', 'openai.com', 'google.dev']):
        score += 0.30  # Tier 1 official blog/news
    elif 'github.com' in url and any(v in url for v in ['anthropics', 'openai', 'google']):
        score += 0.30  # Tier 1 official repos
    elif 'arxiv.org' in url:
        score += 0.20  # Tier 2 research
    else:
        score += 0.10  # Tier 3 community

    # Freshness signals (from title/snippet)
    if any(year in title for year in ['2025', '2024']):
        score += 0.20
    elif '2023' in title:
        score += 0.10

    # Relevance signals
    if any(kw in title.lower() for kw in ['guide', 'documentation', 'tutorial', 'best practices']):
        score += 0.15

    return min(score, 1.0)


def assign_credibility_tier(score):
    if score >= 0.60:
        return 'tier1_official'
    elif score >= 0.40:
        return 'tier2_verified'
    else:
        return 'tier3_community'
```

### Step 5: Output URL Manifest

```python
from datetime import datetime

def create_manifest(scored_results, topic):
    manifest = {
        "discovery_date": datetime.now().isoformat(),
        "topic": topic,
        "total_urls": len(scored_results),
        "urls": []
    }
    for result in sorted(scored_results, key=lambda x: x['score'], reverse=True):
        manifest["urls"].append({
            "url": result["url"],
            "title": result["title"],
            "credibility_tier": result["tier"],
            "credibility_score": result["score"],
            "source_type": infer_source_type(result["url"]),
            "vendor": infer_vendor(result["url"])
        })
    return manifest
```

## Output Format
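The `source_type` and `vendor` fields in the manifest below are produced by the `infer_source_type` and `infer_vendor` helpers that Step 5 calls but does not define. A minimal sketch of both, assuming a domain-based mapping derived from the source priority table (the exact mapping and the `"unknown"`/`"community"` fallbacks are assumptions, not fixed by this skill):

```python
from urllib.parse import urlparse

# Hypothetical domain mapping, based on the tier table above.
VENDOR_DOMAINS = {
    "anthropic": ("anthropic.com", "claude.com", "github.com/anthropics"),
    "openai": ("openai.com", "github.com/openai"),
    "google": ("google.dev", "blog.google", "github.com/google"),
}

def infer_vendor(url):
    # Return the first vendor whose domains appear in the URL.
    for vendor, needles in VENDOR_DOMAINS.items():
        if any(n in url for n in needles):
            return vendor
    return "unknown"

def infer_source_type(url):
    # Classify by URL shape, in priority order.
    host = urlparse(url).netloc
    if "github.com" in url:
        return "github_repo"
    if host.startswith("docs.") or "/docs" in url:
        return "official_docs"
    if "arxiv.org" in host:
        return "research_paper"
    if "/blog" in url or "/news" in url:
        return "engineering_blog"
    return "community"
```

For example, `infer_vendor("https://docs.anthropic.com/en/docs/prompt-engineering")` returns `"anthropic"` and `infer_source_type` on the same URL returns `"official_docs"`.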
Discovery produces a JSON manifest for the crawler:

```json
{
  "discovery_date": "2025-01-28T10:30:00",
  "topic": "prompt engineering",
  "total_urls": 15,
  "urls": [
    {
      "url": "https://docs.anthropic.com/en/docs/prompt-engineering",
      "title": "Prompt Engineering Guide",
      "credibility_tier": "tier1_official",
      "credibility_score": 0.85,
      "source_type": "official_docs",
      "vendor": "anthropic"
    }
  ]
}
```

## Known Authoritative Sources

Pre-validated sources for common topics:

| Vendor | Documentation | Blog/News | GitHub |
|--------|--------------|-----------|--------|
| Anthropic | docs.anthropic.com, docs.claude.com | anthropic.com/news | github.com/anthropics |
| OpenAI | platform.openai.com/docs | openai.com/blog | github.com/openai |
| Google | ai.google.dev/docs | blog.google/technology/ai | github.com/google |

## Integration

- **Output:** URL manifest JSON → `web-crawler-orchestrator`
- **Database:** Register new sources in the `sources` table via `content-repository`

## Deduplication

Before outputting, deduplicate URLs:

- Normalize URLs (strip trailing slashes and query parameters)
- Check against the existing `documents` table via `content-repository`
- Merge duplicate entries, keeping the highest credibility score
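The normalization and merge rules above can be sketched as follows. This is a minimal in-memory illustration only; the actual check against the `documents` table goes through `content-repository`, and the `normalize_url`/`deduplicate` names are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Drop query params and fragments, strip the trailing slash,
    # and lowercase the host so near-identical URLs compare equal.
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

def deduplicate(entries):
    # Merge entries by normalized URL, keeping the highest credibility score.
    best = {}
    for entry in entries:
        key = normalize_url(entry["url"])
        if key not in best or entry["credibility_score"] > best[key]["credibility_score"]:
            best[key] = entry
    return list(best.values())
```

With this sketch, `.../prompt-engineering/` and `.../prompt-engineering?ref=x` collapse to one manifest entry, and the entry with the higher `credibility_score` survives.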