---
name: seo-crawl-budget
description: |
  Crawl budget optimization and server log analysis for search engine bots.
  Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste, orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.
---

# Crawl Budget Optimizer

Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).

## Capabilities

### Log Analysis

- Parse Nginx combined, Apache combined, and CloudFront log formats
- Support for gzip/bzip2 compressed logs
- Streaming parser for files >1GB
- Date range filtering
- Custom format via regex

### Bot Profiling

- Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
- Per-bot metrics: requests/day, requests/hour, unique URLs crawled
- Status code distribution per bot (200, 301, 404, 500)
- Crawl depth distribution
- Crawl pattern analysis (time of day, days of week)
- Most crawled URLs per bot

### Waste Detection

- **Parameter URLs**: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
- **Redirect chains**: Multiple redirects consuming crawl slots
- **Soft 404s**: 200 status pages with error/empty content
- **Duplicate URLs**: www/non-www, http/https, trailing slash variants
- **Low-value pages**: Thin content pages, noindex pages being crawled

### Orphan Page Detection

- Pages in sitemap but never crawled by bots
- Pages crawled but not in sitemap
- Crawled pages with no internal links pointing to them

## Workflow

### Step 1: Obtain Server Access Logs

Request or locate server access logs from the target site.

Supported formats:

- Nginx: `/var/log/nginx/access.log`
- Apache: `/var/log/apache2/access.log`
- CloudFront: Downloaded from S3 or CloudWatch

### Step 2: Parse Access Logs

```bash
# Parse a plain-text log and emit JSON
python scripts/log_parser.py --log-file access.log --json
# Stream a compressed log instead of loading it into memory
python scripts/log_parser.py --log-file access.log.gz --streaming --json
# Restrict the output to a single bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json
```

### Step 3: Crawl Budget Analysis

```bash
# Full analysis, cross-referenced against the sitemap
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
# Waste detection only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
# Orphan page detection only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
# Per-bot profiling only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json
```

### Step 4: Cross-Reference with External Data (Optional)

Use the `our-seo-agent` CLI, or provide pre-fetched JSON via `--input`, to compare indexed pages against crawled pages. WebSearch can supplement this with current indexing data.

### Step 5: Generate Recommendations

Prioritized action items:

1. robots.txt optimization (block parameter URLs, low-value paths)
2. URL parameter handling (canonical tags and consistent internal links; the Google Search Console URL Parameters tool was retired in 2022)
3. Noindex/nofollow for low-value pages
4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
5. Internal linking improvements for orphan pages

### Step 6: Report to Notion

Save a Korean-language report to the SEO Audit Log database.


| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on efficiency score |
| Found Date | Date | Analysis date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |

## Data Sources

| Source | Purpose |
|--------|---------|
| `our-seo-agent` CLI | Future primary data source; use `--input` for pre-fetched JSON |
| Notion MCP | Save audit report to database |
| WebSearch | Current bot documentation and best practices |

## Output Format

```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
```

## Korean Output Example

```
# 크롤 예산 분석 보고서 - example.com

## 분석 기간: 2025-01-01 ~ 2025-01-31

### 봇별 크롤 현황

| 봇 | 요청 수 | 고유 URL | 일 평균 |
|----|---------|---------|---------|
| Googlebot | 80,000 | 12,000 | 2,580 |
| Yeti (Naver) | 35,000 | 8,000 | 1,129 |

### 크롤 낭비 요인

- 파라미터 URL: 5,000건 (3.3%)
- 리다이렉트 체인: 2,000건 (1.3%)
- 소프트 404: 1,500건 (1.0%)

### 효율성 점수: 72/100
```

## Limitations

- Requires actual server access logs (not available via standard web crawling)
- Log format auto-detection may need manual format specification for custom formats
- CloudFront logs have a different field structure than
Nginx/Apache
- Large log files (>10GB) may need pre-filtering before analysis
- Bot identification relies on User-Agent strings which can be spoofed

## Notion Output (Required)

All audit reports MUST be saved to the OurDigital SEO Audit Log:

- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **Category**: Crawl Budget
- **Audit ID Format**: CRAWL-YYYYMMDD-NNN
- **Language**: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)

## Reference Scripts

Located in `code/scripts/`:

- `log_parser.py`: Server access log parser with bot identification
- `crawl_budget_analyzer.py`: Crawl budget efficiency analysis
- `base_client.py`: Shared async client utilities
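
The User-Agent matching under Bot Profiling and the parameter-URL check under Waste Detection can be illustrated with a small standalone sketch. This is not the code in `log_parser.py` or `crawl_budget_analyzer.py`; the regular expression, the `BOT_TOKENS` table, and the function names `identify_bot`, `is_parameter_waste`, and `summarize` are all assumptions made for illustration. It assumes Nginx/Apache combined log lines and uses naive substring matching on the User-Agent, which is subject to the spoofing caveat noted under Limitations.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Combined log format (assumed):
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical token table: lowercase User-Agent substring -> bot label.
BOT_TOKENS = {
    "googlebot": "googlebot",
    "yeti": "yeti",
    "bingbot": "bingbot",
    "daumoa": "daumoa",
}

# Query parameters treated as crawl waste, taken from the Waste Detection list.
WASTE_PARAMS = {"sort", "filter", "page"}


def identify_bot(user_agent):
    """Return a bot label if a known token appears in the User-Agent, else None."""
    ua = user_agent.lower()
    for token, label in BOT_TOKENS.items():
        if token in ua:
            return label
    return None


def is_parameter_waste(path):
    """True if the request path carries a sort/filter/page or utm_* parameter."""
    keys = parse_qs(urlsplit(path).query, keep_blank_values=True).keys()
    return any(k in WASTE_PARAMS or k.startswith("utm_") for k in keys)


def summarize(lines):
    """Count bot requests and parameter-URL hits per bot from raw log lines."""
    requests = Counter()
    waste = Counter()
    for line in lines:
        match = COMBINED_RE.match(line)
        if match is None:
            continue  # skip lines that are not in combined format
        bot = identify_bot(match["ua"])
        if bot is None:
            continue  # not a tracked bot
        requests[bot] += 1
        if is_parameter_waste(match["path"]):
            waste[bot] += 1
    return {
        bot: {
            "requests": count,
            "parameter_urls": waste[bot],
            "pct_of_crawls": round(100 * waste[bot] / count, 1),
        }
        for bot, count in requests.items()
    }


if __name__ == "__main__":
    sample = [
        '66.249.66.1 - - [01/Jan/2025:00:00:01 +0000] "GET /products?sort=price HTTP/1.1" '
        '200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
        '66.249.66.1 - - [01/Jan/2025:00:00:02 +0000] "GET /products HTTP/1.1" '
        '200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    ]
    print(summarize(sample))
```

The per-bot `pct_of_crawls` here is computed against that bot's own request count, whereas the Output Format example reports waste as a share of all bot requests. A production parser would also need the CloudFront field layout and a way to verify that a claimed bot is genuine.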