# CLAUDE.md

## Overview

Crawl budget optimization tool for analyzing server access logs and identifying crawl budget waste. Parses Apache/Nginx/CloudFront access logs, identifies search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao), profiles per-bot crawl behavior, detects crawl waste (parameter URLs, low-value pages, redirect chains), identifies orphan pages, and generates crawl efficiency recommendations. Uses a streaming parser for large log files.

## Quick Start

```bash
pip install -r scripts/requirements.txt

# Parse access logs
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file /var/log/nginx/access.log --sitemap https://example.com/sitemap.xml --json
```

## Scripts

| Script | Purpose | Key Output |
|--------|---------|------------|
| `log_parser.py` | Parse server access logs, identify bots, extract crawl data | Bot identification, request patterns, status codes |
| `crawl_budget_analyzer.py` | Analyze crawl budget efficiency and generate recommendations | Waste identification, orphan pages, optimization plan |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |

## Log Parser

```bash
# Parse Nginx combined log format
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Parse Apache combined log format
python scripts/log_parser.py --log-file /var/log/apache2/access.log --format apache --json

# Parse CloudFront log
python scripts/log_parser.py --log-file cloudfront-log.gz --format cloudfront --json

# Filter by specific bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json

# Parse gzipped logs
python scripts/log_parser.py --log-file access.log.gz --json

# Process large files in streaming mode
python scripts/log_parser.py --log-file access.log --streaming --json
```

**Capabilities**:

- Support for common log formats:
  - Nginx combined format
  - Apache combined format
  - CloudFront format
  - Custom format via regex
- Bot identification by User-Agent:
  - Googlebot (and variants: Googlebot-Image, Googlebot-News, Googlebot-Video, AdsBot-Google)
  - Yeti (Naver's crawler)
  - Bingbot
  - Daumoa (Kakao/Daum crawler)
  - Other bots (Applebot, DuckDuckBot, Baiduspider, etc.)
- Request data extraction (timestamp, IP, URL, status code, response size, user-agent, referer)
- Streaming parser for files >1GB
- Gzip/bzip2 compressed log support
- Date range filtering

## Crawl Budget Analyzer

```bash
# Full crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json

# Waste identification only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json

# Orphan page detection
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --scope orphans --json

# Per-bot profiling
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

# With external page history comparison
python scripts/crawl_budget_analyzer.py --log-file access.log --url https://example.com --input pages.json --json
```

**Capabilities**:

- Crawl budget waste identification:
  - Parameter URLs consuming crawl budget (`?sort=`, `?filter=`, `?page=`, `?utm_*`)
  - Low-value pages (thin content, noindex pages being crawled)
  - Redirect chains consuming multiple crawls
  - Soft 404 pages (200 status but error content)
  - Duplicate URLs (www/non-www, http/https, trailing slash variants)
- Per-bot behavior profiling:
  - Crawl frequency (requests/day, requests/hour)
  - Crawl depth distribution
  - Status code distribution per bot
  - Most crawled URLs per bot
  - Crawl pattern analysis (time of day, days of week)
- Orphan page detection:
  - Pages in sitemap but never crawled by bots
  - Pages crawled but not in sitemap
  - Crawled pages with no internal links
- Crawl efficiency recommendations:
  - robots.txt optimization suggestions
  - URL parameter handling recommendations
  - Noindex/nofollow suggestions for low-value pages
  - Redirect chain resolution priorities
  - Internal linking improvements for orphan pages

## Data Sources

| Source | Purpose |
|--------|---------|
| Server access logs | Primary crawl data |
| XML sitemap | Reference for expected crawlable pages |
| `our-seo-agent` CLI | Compare indexed pages with crawled pages (future) |

## Output Format

```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000},
      "top_crawled_urls": [...]
    },
    "yeti": {"requests": 35000, ...},
    "bingbot": {"requests": 20000, ...},
    "daumoa": {"requests": 15000, ...}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [...],
    "crawled_not_in_sitemap": [...]
  },
  "recommendations": [...],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
```

## Notion Output (Required)

**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.
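As a rough illustration of the User-Agent-based bot identification described under Log Parser, here is a minimal sketch. The `BOT_PATTERNS` table and `identify_bot` helper are illustrative names, not the actual `log_parser.py` API, and the real patterns may differ:

```python
import re

# Illustrative UA substrings per bot family; the patterns actually used
# in log_parser.py may be stricter (assumption).
BOT_PATTERNS = {
    "googlebot": re.compile(r"Googlebot|AdsBot-Google", re.IGNORECASE),
    "yeti": re.compile(r"Yeti", re.IGNORECASE),        # Naver's crawler
    "bingbot": re.compile(r"bingbot", re.IGNORECASE),
    "daumoa": re.compile(r"Daumoa", re.IGNORECASE),    # Kakao/Daum crawler
}

def identify_bot(user_agent: str):
    """Return the bot key matching a User-Agent string, or None for non-bot traffic."""
    for name, pattern in BOT_PATTERNS.items():
        if pattern.search(user_agent):
            return name
    return None

# Googlebot variants (Googlebot-Image, Googlebot-News, ...) map to one family:
print(identify_bot("Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)"))
# -> googlebot
```

Note that UA matching alone can be spoofed; production crawl-log tooling often additionally verifies Googlebot via reverse DNS lookup.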
### Database Configuration

| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |

### Required Properties

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on waste percentage |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |

### Language Guidelines

- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Crawl Budget, Googlebot, robots.txt)
- URLs and code remain unchanged
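The `CRAWL-YYYYMMDD-NNN` Audit ID format above can be produced with a small helper. This is a sketch; `make_audit_id` is a hypothetical name, and how the `NNN` sequence number is allocated (here passed in by the caller) is an assumption:

```python
from datetime import date

def make_audit_id(audit_date: date, seq: int) -> str:
    """Format an Audit ID as CRAWL-YYYYMMDD-NNN (seq zero-padded to 3 digits)."""
    return f"CRAWL-{audit_date:%Y%m%d}-{seq:03d}"

print(make_audit_id(date(2025, 1, 31), 7))  # -> CRAWL-20250131-007
```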