Overview

Crawl budget optimization tool for analyzing server access logs and identifying crawl budget waste. It parses Apache/Nginx/CloudFront access logs, identifies search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao), profiles per-bot crawl behavior, detects crawl waste (parameter URLs, low-value pages, redirect chains), flags orphan pages, and generates crawl efficiency recommendations. A streaming parser handles large log files without loading them into memory.

Quick Start

pip install -r scripts/requirements.txt

# Parse access logs
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file /var/log/nginx/access.log --sitemap https://example.com/sitemap.xml --json

Scripts

| Script | Purpose | Key Output |
| --- | --- | --- |
| log_parser.py | Parse server access logs, identify bots, extract crawl data | Bot identification, request patterns, status codes |
| crawl_budget_analyzer.py | Analyze crawl budget efficiency and generate recommendations | Waste identification, orphan pages, optimization plan |
| base_client.py | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
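
The shared utilities are not documented further here; as a rough sketch of what a RateLimiter in base_client.py could look like (the interface and sliding-window behavior are assumptions, not the actual implementation):

import asyncio
import time

class RateLimiter:
    """Sliding-window async rate limiter (illustrative sketch only;
    the real base_client.RateLimiter may differ in interface)."""

    def __init__(self, max_calls: int, period: float = 1.0):
        self.max_calls = max_calls      # allowed calls per window
        self.period = period            # window length in seconds
        self._calls: list[float] = []   # timestamps of recent calls
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # Drop timestamps that have left the current window.
            self._calls = [t for t in self._calls if now - t < self.period]
            if len(self._calls) >= self.max_calls:
                # Wait until the oldest call exits the window.
                await asyncio.sleep(self.period - (now - self._calls[0]))
            self._calls.append(time.monotonic())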

Log Parser

# Parse Nginx combined log format
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Parse Apache combined log format
python scripts/log_parser.py --log-file /var/log/apache2/access.log --format apache --json

# Parse CloudFront log
python scripts/log_parser.py --log-file cloudfront-log.gz --format cloudfront --json

# Filter by specific bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json

# Parse gzipped logs
python scripts/log_parser.py --log-file access.log.gz --json

# Process large files in streaming mode
python scripts/log_parser.py --log-file access.log --streaming --json

Capabilities:

  • Support for common log formats:
    • Nginx combined format
    • Apache combined format
    • CloudFront format
    • Custom format via regex
  • Bot identification by User-Agent:
    • Googlebot (and variants: Googlebot-Image, Googlebot-News, Googlebot-Video, AdsBot-Google)
    • Yeti (Naver's crawler)
    • Bingbot
    • Daumoa (Kakao/Daum crawler)
    • Other bots (Applebot, DuckDuckBot, Baiduspider, etc.)
  • Request data extraction (timestamp, IP, URL, status code, response size, user-agent, referer)
  • Streaming parser for files >1GB (see the sketch after this list)
  • Gzip/bzip2 compressed log support
  • Date range filtering
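
The sketch below illustrates the streaming and bot-identification behavior described above: it reads a (possibly gzipped) Nginx combined-format log line by line and tags known bots by User-Agent substring. The regex and marker strings are simplified assumptions, not the script's actual implementation.

import gzip
import re

# Nginx "combined" format (simplified pattern; the real parser may differ).
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# User-Agent substrings for the bots this tool tracks.
BOT_MARKERS = {
    "googlebot": "Googlebot",
    "yeti": "Yeti",          # Naver
    "bingbot": "bingbot",
    "daumoa": "Daumoa",      # Kakao/Daum
}

def identify_bot(user_agent: str) -> str | None:
    ua = user_agent.lower()
    for name, marker in BOT_MARKERS.items():
        if marker.lower() in ua:
            return name
    return None

def stream_bot_requests(path: str):
    """Yield (bot, url, status) without loading the whole file into memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            m = COMBINED.match(line)
            if not m:
                continue
            bot = identify_bot(m["ua"])
            if bot:
                yield bot, m["url"], int(m["status"])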

Crawl Budget Analyzer

# Full crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json

# Waste identification only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json

# Orphan page detection
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --scope orphans --json

# Per-bot profiling
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

# With external page history comparison
python scripts/crawl_budget_analyzer.py --log-file access.log --url https://example.com --input pages.json --json

Capabilities:

  • Crawl budget waste identification (see the sketch after this list):
    • Parameter URLs consuming crawl budget (?sort=, ?filter=, ?page=, ?utm_*)
    • Low-value pages (thin content, noindex pages being crawled)
    • Redirect chains consuming multiple crawls
    • Soft 404 pages (200 status but error content)
    • Duplicate URLs (www/non-www, http/https, trailing slash variants)
  • Per-bot behavior profiling:
    • Crawl frequency (requests/day, requests/hour)
    • Crawl depth distribution
    • Status code distribution per bot
    • Most crawled URLs per bot
    • Crawl pattern analysis (time of day, days of week)
  • Orphan page detection:
    • Pages in sitemap but never crawled by bots
    • Pages crawled but not in sitemap
    • Crawled pages with no internal links
  • Crawl efficiency recommendations:
    • robots.txt optimization suggestions
    • URL parameter handling recommendations
    • Noindex/nofollow suggestions for low-value pages
    • Redirect chain resolution priorities
    • Internal linking improvements for orphan pages
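
To make two of the detections above concrete, here is a minimal sketch of classifying parameter URLs (?sort=, ?filter=, ?page=, ?utm_*) and computing the sitemap/crawl set differences behind the orphan report. The helper names and the returned shape are illustrative, not the analyzer's actual code.

from urllib.parse import urlsplit, parse_qsl

# Parameter keys commonly treated as crawl-budget waste (assumed list).
WASTE_PARAMS = {"sort", "filter", "page"}

def is_parameter_waste(url: str) -> bool:
    """True if the URL carries sort/filter/pagination or utm_* parameters."""
    query = urlsplit(url).query
    keys = {k for k, _ in parse_qsl(query, keep_blank_values=True)}
    return bool(keys & WASTE_PARAMS) or any(k.startswith("utm_") for k in keys)

def waste_and_orphans(crawled_urls: list[str], sitemap_urls: set[str]) -> dict:
    total = len(crawled_urls)
    param_hits = sum(1 for u in crawled_urls if is_parameter_waste(u))
    crawled = set(crawled_urls)
    return {
        "parameter_urls": {
            "count": param_hits,
            "pct_of_crawls": round(100 * param_hits / total, 1) if total else 0.0,
        },
        # Sitemap/crawl set differences behind the orphan report.
        "in_sitemap_not_crawled": sorted(sitemap_urls - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap_urls),
    }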

Data Sources

| Source | Purpose |
| --- | --- |
| Server access logs | Primary crawl data |
| XML sitemap | Reference for expected crawlable pages |
| our-seo-agent CLI | Compare indexed pages with crawled pages (future) |

Output Format

{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000},
      "top_crawled_urls": [...]
    },
    "yeti": {"requests": 35000, ...},
    "bingbot": {"requests": 20000, ...},
    "daumoa": {"requests": 15000, ...}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [...],
    "crawled_not_in_sitemap": [...]
  },
  "recommendations": [...],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
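
Downstream tooling can consume this JSON directly. A minimal sketch, assuming the report was saved to crawl_budget_report.json and using an arbitrary 10% waste threshold:

import json

# Load a saved analyzer report (file name is illustrative).
with open("crawl_budget_report.json") as fh:
    report = json.load(fh)

# Flag the site if more than 10% of bot crawls are wasted (assumed threshold).
if report["waste"]["total_waste_pct"] > 10.0:
    top_bot = max(report["bots"].items(), key=lambda kv: kv[1]["requests"])[0]
    print(f"High crawl waste ({report['waste']['total_waste_pct']}%); "
          f"busiest bot: {top_bot}")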

Notion Output (Required)

IMPORTANT: All audit reports MUST be saved to the OurDigital SEO Audit Log database.

Database Configuration

| Field | Value |
| --- | --- |
| Database ID | 2c8581e5-8a1e-8035-880b-e38cefc2f3ef |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |

Required Properties

| Property | Type | Description |
| --- | --- | --- |
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on waste percentage |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |
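
A minimal sketch of saving an entry with the official notion-client Python SDK, assuming the property names above match the database schema exactly; the title text, priority value, and audit ID are illustrative:

import os
from notion_client import Client  # pip install notion-client

notion = Client(auth=os.environ["NOTION_TOKEN"])

notion.pages.create(
    parent={"database_id": "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"},
    properties={
        "Issue": {"title": [{"text": {"content": "Crawl Budget 감사 2025-01-31"}}]},
        "Site": {"url": "https://example.com"},
        "Category": {"select": {"name": "Crawl Budget"}},
        "Priority": {"select": {"name": "High"}},  # based on waste percentage
        "Found Date": {"date": {"start": "2025-01-31"}},
        "Audit ID": {"rich_text": [{"text": {"content": "CRAWL-20250131-001"}}]},
    },
)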

Language Guidelines

  • Report content in Korean (한국어)
  • Keep technical English terms as-is (e.g., Crawl Budget, Googlebot, robots.txt)
  • URLs and code remain unchanged