Add SEO skills 19-28, 31-32 with full Python implementations
12 new skills: Keyword Strategy, SERP Analysis, Position Tracking, Link Building, Content Strategy, E-Commerce SEO, KPI Framework, International SEO, AI Visibility, Knowledge Graph, Competitor Intel, and Crawl Budget. ~20K lines of Python across 25 domain scripts.

Updated skill 11 pipeline table and repo CLAUDE.md. Enhanced skill 18 local SEO workflow from jamie.clinic audit.

Note: Skill 26 hreflang_validator.py pending (content filter block).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
178
custom-skills/32-seo-crawl-budget/code/CLAUDE.md
Normal file
@@ -0,0 +1,178 @@
# CLAUDE.md

## Overview

Crawl budget optimization tool for analyzing server access logs and identifying crawl budget waste. Parses Apache/Nginx/CloudFront access logs, identifies search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao), profiles per-bot crawl behavior, detects crawl waste (parameter URLs, low-value pages, redirect chains), identifies orphan pages, and generates crawl efficiency recommendations. Uses a streaming parser for large log files.

## Quick Start

```bash
pip install -r scripts/requirements.txt

# Parse access logs
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file /var/log/nginx/access.log --sitemap https://example.com/sitemap.xml --json
```

## Scripts

| Script | Purpose | Key Output |
|--------|---------|------------|
| `log_parser.py` | Parse server access logs, identify bots, extract crawl data | Bot identification, request patterns, status codes |
| `crawl_budget_analyzer.py` | Analyze crawl budget efficiency and generate recommendations | Waste identification, orphan pages, optimization plan |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
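For orientation, a minimal sketch of how the two entry points compose when driven from Python rather than the CLI. It only uses classes defined in this commit (`LogParser`, `CrawlBudgetAnalyzer`); the `sys.path` line assumes the snippet runs from the skill's `code/` directory so the `scripts/` modules are importable.

```python
import sys

sys.path.insert(0, "scripts")  # assumption: run from the skill's code/ directory

from log_parser import LogParser
from crawl_budget_analyzer import CrawlBudgetAnalyzer

# Step 1: parse the raw access log and summarize bot traffic.
parser = LogParser(log_file="/var/log/nginx/access.log", fmt="auto")
parse_result = parser.parse_and_analyze()
print(f"Bot requests: {parse_result.bot_entries:,} of {parse_result.total_lines:,} lines")

# Step 2: run the crawl budget analysis on the same log.
analyzer = CrawlBudgetAnalyzer(
    log_file="/var/log/nginx/access.log",
    sitemap_url="https://example.com/sitemap.xml",
)
result = analyzer.analyze(scope="all")
print(f"Efficiency score: {result.efficiency_score}/100, waste: {result.total_waste_pct:.1f}%")
```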
## Log Parser

```bash
# Parse Nginx combined log format
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Parse Apache combined log format
python scripts/log_parser.py --log-file /var/log/apache2/access.log --format apache --json

# Parse CloudFront log
python scripts/log_parser.py --log-file cloudfront-log.gz --format cloudfront --json

# Filter by specific bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json

# Parse gzipped logs
python scripts/log_parser.py --log-file access.log.gz --json

# Process large files in streaming mode
python scripts/log_parser.py --log-file access.log --streaming --json
```

**Capabilities**:
- Support for common log formats:
  - Nginx combined format
  - Apache combined format
  - CloudFront format
  - Custom format via regex
- Bot identification by User-Agent (see the sketch after this list):
  - Googlebot (and variants: Googlebot-Image, Googlebot-News, Googlebot-Video, AdsBot-Google)
  - Yeti (Naver's crawler)
  - Bingbot
  - Daumoa (Kakao/Daum crawler)
  - Other bots (Applebot, DuckDuckBot, Baiduspider, etc.)
- Request data extraction (timestamp, IP, URL, status code, response size, user-agent, referer)
- Streaming parser for files >1GB
- Gzip/bzip2 compressed log support
- Date range filtering
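Bot identification is a pure User-Agent match, so it can be exercised without a log file. A minimal sketch using `LogParser.identify_bot` (a static method in `scripts/log_parser.py`, assuming `scripts/` is on the import path); the User-Agent string below is illustrative.

```python
from log_parser import LogParser

# Illustrative Googlebot smartphone User-Agent; the "Googlebot/" token matches
# the first pattern in BOT_PATTERNS.
ua = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

bot = LogParser.identify_bot(ua)
if bot:
    print(bot.name, bot.category)  # -> googlebot search_engine

# Unknown crawlers still match the generic keyword heuristic:
print(LogParser.identify_bot("ExampleCrawler/1.0 (+https://example.com)").name)  # -> other
```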
## Crawl Budget Analyzer

```bash
# Full crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json

# Waste identification only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json

# Orphan page detection
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --scope orphans --json

# Per-bot profiling
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

# With Ahrefs page history comparison
python scripts/crawl_budget_analyzer.py --log-file access.log --url https://example.com --ahrefs --json
```

**Capabilities**:
- Crawl budget waste identification (priority thresholds sketched after this list):
  - Parameter URLs consuming crawl budget (?sort=, ?filter=, ?page=, ?utm_*)
  - Low-value pages (thin content, noindex pages being crawled)
  - Redirect chains consuming multiple crawls
  - Soft 404 pages (200 status but error content)
  - Duplicate URLs (www/non-www, http/https, trailing slash variants)
- Per-bot behavior profiling:
  - Crawl frequency (requests/day, requests/hour)
  - Crawl depth distribution
  - Status code distribution per bot
  - Most crawled URLs per bot
  - Crawl pattern analysis (time of day, days of week)
- Orphan page detection:
  - Pages in sitemap but never crawled by bots
  - Pages crawled but not in sitemap
  - Crawled pages with no internal links
- Crawl efficiency recommendations:
  - robots.txt optimization suggestions
  - URL parameter handling recommendations
  - Noindex/nofollow suggestions for low-value pages
  - Redirect chain resolution priorities
  - Internal linking improvements for orphan pages
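Each waste category is reported as a share of all bot requests, and that share drives the recommendation priority. The sketch below is a simplified restatement of the thresholds used in `crawl_budget_analyzer.py` (>5% critical, >2% high, >0.5% medium, otherwise low), not an additional API.

```python
def waste_priority(pct_of_total: float) -> str:
    """Map a waste category's share of bot requests to a recommendation priority.

    Mirrors the thresholds in CrawlBudgetAnalyzer.generate_recommendations.
    """
    if pct_of_total > 5.0:
        return "critical"
    if pct_of_total > 2.0:
        return "high"
    if pct_of_total > 0.5:
        return "medium"
    return "low"


# Example: parameter URLs at 3.3% of crawls -> "high" priority recommendation.
print(waste_priority(3.3))  # high
print(waste_priority(8.5))  # critical
```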
## Data Sources

| Source | Purpose |
|--------|---------|
| Server access logs | Primary crawl data |
| XML sitemap | Reference for expected crawlable pages |
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
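The sitemap is fetched over HTTP and parsed as XML, with sitemap indexes followed into their child sitemaps. A minimal standalone sketch of that step, roughly equivalent to `CrawlBudgetAnalyzer.load_sitemap_urls`; it assumes `requests`, `beautifulsoup4`, and `lxml` are installed, as the analyzer itself does.

```python
import requests
from bs4 import BeautifulSoup


def fetch_sitemap_urls(sitemap_url: str) -> set[str]:
    """Return the <loc> URLs from an XML sitemap, following sitemap indexes."""
    urls: set[str] = set()
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, "lxml-xml")

    children = [s.find("loc").text.strip() for s in soup.find_all("sitemap") if s.find("loc")]
    if children:
        for child in children:            # sitemap index: recurse into child sitemaps
            urls |= fetch_sitemap_urls(child)
    else:
        urls |= {u.find("loc").text.strip() for u in soup.find_all("url") if u.find("loc")}
    return urls


# print(len(fetch_sitemap_urls("https://example.com/sitemap.xml")))
```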
## Output Format

```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "total_requests": 80000,
      "unique_urls": 12000,
      "requests_per_day": 2580.6,
      "status_breakdown": {"200": 70000, "301": 5000, "404": 3000, "500": 2000},
      "top_crawled_urls": [...]
    },
    "yeti": {"total_requests": 35000, ...},
    "bingbot": {"total_requests": 20000, ...},
    "daumoa": {"total_requests": 15000, ...}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_total": 3.3, "recommendation": "...", "sample_urls": [...]},
    "redirect_chains": {"count": 2000, "pct_of_total": 1.3, ...},
    "soft_404s": {"count": 1500, "pct_of_total": 1.0, ...},
    "duplicate_urls": {...}
  },
  "total_waste_pct": 8.5,
  "orphan_pages": {
    "in_sitemap_not_crawled": [...],
    "crawled_not_in_sitemap": [...]
  },
  "recommendations": [...],
  "efficiency_score": 91,
  "timestamp": "2025-01-01T00:00:00"
}
```

Key names follow the `to_dict()` serializers in `crawl_budget_analyzer.py`; `efficiency_score` is `100 - total_waste_pct`, clamped to 0-100.
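A short sketch of consuming the JSON report downstream, for example to surface only the most urgent items; it assumes the report was written with `--json --output report.json`.

```python
import json
from pathlib import Path

report = json.loads(Path("report.json").read_text(encoding="utf-8"))

print(f"Efficiency score: {report['efficiency_score']}/100 "
      f"(waste {report['total_waste_pct']}%)")

for rec in report["recommendations"]:
    if rec["priority"] in ("critical", "high"):
        print(f"[{rec['priority'].upper()}] {rec['category']}: {rec['action']}")
```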
## Notion Output (Required)

**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.

### Database Configuration

| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |

### Required Properties

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on waste percentage |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |
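A hedged sketch of writing one audit entry with the `notion-client` SDK. The property names and types follow the table above, and the token lookup reuses `ConfigManager.notion_token` from `base_client.py`; the property payload shapes are standard Notion API formats, but verify them against the live database schema before relying on this.

```python
from notion_client import Client

from base_client import config  # ConfigManager singleton (reads NOTION_TOKEN / NOTION_API_KEY)

DATABASE_ID = "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"

notion = Client(auth=config.notion_token)
notion.pages.create(
    parent={"database_id": DATABASE_ID},
    properties={
        "Issue": {"title": [{"text": {"content": "크롤 예산 감사 보고서 (2025-01-31)"}}]},
        "Site": {"url": "https://example.com"},
        "Category": {"select": {"name": "Crawl Budget"}},
        "Priority": {"select": {"name": "High"}},
        "Found Date": {"date": {"start": "2025-01-31"}},
        "Audit ID": {"rich_text": [{"text": {"content": "CRAWL-20250131-001"}}]},
    },
)
```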
### Language Guidelines

- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Crawl Budget, Googlebot, robots.txt)
- URLs and code remain unchanged
207
custom-skills/32-seo-crawl-budget/code/scripts/base_client.py
Normal file
@@ -0,0 +1,207 @@
"""
Base Client - Shared async client utilities
===========================================
Purpose: Rate-limited async operations for API clients
Python: 3.10+
"""

import asyncio
import logging
import os
from asyncio import Semaphore
from datetime import datetime
from typing import Any, Callable, TypeVar

from dotenv import load_dotenv
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

# Load environment variables
load_dotenv()

# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

T = TypeVar("T")


class RateLimiter:
    """Rate limiter using token bucket algorithm."""

    def __init__(self, rate: float, per: float = 1.0):
        """
        Initialize rate limiter.

        Args:
            rate: Number of requests allowed
            per: Time period in seconds (default: 1 second)
        """
        self.rate = rate
        self.per = per
        self.tokens = rate
        self.last_update = datetime.now()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Acquire a token, waiting if necessary."""
        async with self._lock:
            now = datetime.now()
            elapsed = (now - self.last_update).total_seconds()
            self.tokens = min(self.rate, self.tokens + elapsed * (self.rate / self.per))
            self.last_update = now

            if self.tokens < 1:
                wait_time = (1 - self.tokens) * (self.per / self.rate)
                await asyncio.sleep(wait_time)
                self.tokens = 0
            else:
                self.tokens -= 1


class BaseAsyncClient:
    """Base class for async API clients with rate limiting."""

    def __init__(
        self,
        max_concurrent: int = 5,
        requests_per_second: float = 3.0,
        logger: logging.Logger | None = None,
    ):
        """
        Initialize base client.

        Args:
            max_concurrent: Maximum concurrent requests
            requests_per_second: Rate limit
            logger: Logger instance
        """
        self.semaphore = Semaphore(max_concurrent)
        self.rate_limiter = RateLimiter(requests_per_second)
        self.logger = logger or logging.getLogger(self.__class__.__name__)
        self.stats = {
            "requests": 0,
            "success": 0,
            "errors": 0,
            "retries": 0,
        }

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type(Exception),
    )
    async def _rate_limited_request(
        self,
        coro: Callable[[], Any],
    ) -> Any:
        """Execute a request with rate limiting and retry."""
        async with self.semaphore:
            await self.rate_limiter.acquire()
            self.stats["requests"] += 1
            try:
                result = await coro()
                self.stats["success"] += 1
                return result
            except Exception as e:
                self.stats["errors"] += 1
                self.logger.error(f"Request failed: {e}")
                raise

    async def batch_requests(
        self,
        requests: list[Callable[[], Any]],
        desc: str = "Processing",
    ) -> list[Any]:
        """Execute multiple requests concurrently."""
        try:
            from tqdm.asyncio import tqdm
            has_tqdm = True
        except ImportError:
            has_tqdm = False

        async def execute(req: Callable) -> Any:
            try:
                return await self._rate_limited_request(req)
            except Exception as e:
                return {"error": str(e)}

        tasks = [execute(req) for req in requests]

        if has_tqdm:
            results = []
            for coro in tqdm.as_completed(tasks, total=len(tasks), desc=desc):
                result = await coro
                results.append(result)
            return results
        else:
            return await asyncio.gather(*tasks, return_exceptions=True)

    def print_stats(self) -> None:
        """Print request statistics."""
        self.logger.info("=" * 40)
        self.logger.info("Request Statistics:")
        self.logger.info(f"  Total Requests: {self.stats['requests']}")
        self.logger.info(f"  Successful: {self.stats['success']}")
        self.logger.info(f"  Errors: {self.stats['errors']}")
        self.logger.info("=" * 40)


class ConfigManager:
    """Manage API configuration and credentials."""

    def __init__(self):
        load_dotenv()

    @property
    def google_credentials_path(self) -> str | None:
        """Get Google service account credentials path."""
        # Prefer SEO-specific credentials, fallback to general credentials
        seo_creds = os.path.expanduser("~/.credential/ourdigital-seo-agent.json")
        if os.path.exists(seo_creds):
            return seo_creds
        return os.getenv("GOOGLE_APPLICATION_CREDENTIALS")

    @property
    def pagespeed_api_key(self) -> str | None:
        """Get PageSpeed Insights API key."""
        return os.getenv("PAGESPEED_API_KEY")

    @property
    def custom_search_api_key(self) -> str | None:
        """Get Custom Search API key."""
        return os.getenv("CUSTOM_SEARCH_API_KEY")

    @property
    def custom_search_engine_id(self) -> str | None:
        """Get Custom Search Engine ID."""
        return os.getenv("CUSTOM_SEARCH_ENGINE_ID")

    @property
    def notion_token(self) -> str | None:
        """Get Notion API token."""
        return os.getenv("NOTION_TOKEN") or os.getenv("NOTION_API_KEY")

    def validate_google_credentials(self) -> bool:
        """Validate Google credentials are configured."""
        creds_path = self.google_credentials_path
        if not creds_path:
            return False
        return os.path.exists(creds_path)

    def get_required(self, key: str) -> str:
        """Get required environment variable or raise error."""
        value = os.getenv(key)
        if not value:
            raise ValueError(f"Missing required environment variable: {key}")
        return value


# Singleton config instance
config = ConfigManager()
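A minimal usage sketch for these utilities: a hypothetical subclass whose plain HTTP requests are wrapped by `batch_requests` in the shared semaphore, token bucket, and tenacity retries. The `httpx` dependency, the URLs, and the `ExampleApiClient` name are illustrative, not part of this commit.

```python
import asyncio

import httpx  # assumed async HTTP library; any awaitable request works the same way

from base_client import BaseAsyncClient


class ExampleApiClient(BaseAsyncClient):
    """Hypothetical client reusing the shared rate limiting and retry machinery."""

    async def fetch_json(self, url: str) -> dict:
        # Plain request; rate limiting/retry are applied by batch_requests below.
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.json()


async def main() -> None:
    api = ExampleApiClient(max_concurrent=5, requests_per_second=3.0)
    urls = [f"https://api.example.com/items/{i}" for i in range(10)]
    # batch_requests wraps each callable in _rate_limited_request and gathers results.
    results = await api.batch_requests([lambda u=u: api.fetch_json(u) for u in urls])
    api.print_stats()


if __name__ == "__main__":
    asyncio.run(main())
```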
805
custom-skills/32-seo-crawl-budget/code/scripts/crawl_budget_analyzer.py
Normal file
@@ -0,0 +1,805 @@
|
||||
"""
|
||||
Crawl Budget Analyzer - Identify crawl waste and generate recommendations
|
||||
=========================================================================
|
||||
Purpose: Analyze server access logs for crawl budget efficiency, detect waste
|
||||
(parameter URLs, redirect chains, soft 404s, duplicates), find orphan
|
||||
pages, profile per-bot behavior, and produce prioritized recommendations.
|
||||
Python: 3.10+
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
from urllib.parse import parse_qs, urlparse
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from log_parser import BotIdentification, LogEntry, LogParser
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s - %(levelname)s - %(message)s",
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
WASTE_PARAMS = {"sort", "filter", "order", "orderby", "dir", "direction"}
|
||||
TRACKING_PARAMS_RE = re.compile(r"^utm_", re.IGNORECASE)
|
||||
PAGINATION_PARAM = "page"
|
||||
HIGH_PAGE_THRESHOLD = 5
|
||||
SOFT_404_MAX_SIZE = 1024 # bytes - pages smaller than this may be soft 404s
|
||||
REDIRECT_STATUSES = {301, 302, 303, 307, 308}
|
||||
TOP_N_URLS = 50
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Data classes
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@dataclass
|
||||
class CrawlWaste:
|
||||
"""A category of crawl budget waste."""
|
||||
waste_type: str
|
||||
urls: list[str]
|
||||
count: int
|
||||
pct_of_total: float
|
||||
recommendation: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"waste_type": self.waste_type,
|
||||
"count": self.count,
|
||||
"pct_of_total": round(self.pct_of_total, 2),
|
||||
"recommendation": self.recommendation,
|
||||
"sample_urls": self.urls[:20],
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class OrphanPage:
|
||||
"""A page that is either in the sitemap but uncrawled, or crawled but not in sitemap."""
|
||||
url: str
|
||||
in_sitemap: bool
|
||||
crawled: bool
|
||||
last_crawl_date: str | None = None
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
|
||||
@dataclass
|
||||
class BotProfile:
|
||||
"""Per-bot crawl behavior profile."""
|
||||
name: str
|
||||
total_requests: int = 0
|
||||
requests_per_day: float = 0.0
|
||||
crawl_depth_distribution: dict[int, int] = field(default_factory=dict)
|
||||
peak_hours: list[int] = field(default_factory=list)
|
||||
status_breakdown: dict[str, int] = field(default_factory=dict)
|
||||
top_crawled_urls: list[tuple[str, int]] = field(default_factory=list)
|
||||
unique_urls: int = 0
|
||||
days_active: int = 0
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"name": self.name,
|
||||
"total_requests": self.total_requests,
|
||||
"requests_per_day": round(self.requests_per_day, 1),
|
||||
"crawl_depth_distribution": self.crawl_depth_distribution,
|
||||
"peak_hours": self.peak_hours,
|
||||
"status_breakdown": self.status_breakdown,
|
||||
"top_crawled_urls": [{"url": u, "count": c} for u, c in self.top_crawled_urls],
|
||||
"unique_urls": self.unique_urls,
|
||||
"days_active": self.days_active,
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class CrawlRecommendation:
|
||||
"""A single optimization recommendation."""
|
||||
category: str
|
||||
priority: str # critical, high, medium, low
|
||||
action: str
|
||||
impact: str
|
||||
details: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
|
||||
@dataclass
|
||||
class CrawlBudgetResult:
|
||||
"""Complete crawl budget analysis result."""
|
||||
log_file: str
|
||||
analysis_period: dict[str, str]
|
||||
total_bot_requests: int
|
||||
bots: dict[str, BotProfile]
|
||||
waste: list[CrawlWaste]
|
||||
total_waste_pct: float
|
||||
orphan_pages: dict[str, list[OrphanPage]]
|
||||
recommendations: list[CrawlRecommendation]
|
||||
efficiency_score: int
|
||||
timestamp: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"log_file": self.log_file,
|
||||
"analysis_period": self.analysis_period,
|
||||
"total_bot_requests": self.total_bot_requests,
|
||||
"bots": {n: p.to_dict() for n, p in self.bots.items()},
|
||||
"waste": {w.waste_type: w.to_dict() for w in self.waste},
|
||||
"total_waste_pct": round(self.total_waste_pct, 2),
|
||||
"orphan_pages": {
|
||||
k: [o.to_dict() for o in v]
|
||||
for k, v in self.orphan_pages.items()
|
||||
},
|
||||
"recommendations": [r.to_dict() for r in self.recommendations],
|
||||
"efficiency_score": self.efficiency_score,
|
||||
"timestamp": self.timestamp,
|
||||
}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CrawlBudgetAnalyzer
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class CrawlBudgetAnalyzer:
|
||||
"""Analyze crawl budget efficiency from server access logs."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
log_file: str,
|
||||
sitemap_url: str | None = None,
|
||||
target_url: str | None = None,
|
||||
):
|
||||
self.log_file = log_file
|
||||
self.sitemap_url = sitemap_url
|
||||
self.target_url = target_url
|
||||
self._bot_entries: list[tuple[LogEntry, BotIdentification]] = []
|
||||
self._sitemap_urls: set[str] = set()
|
||||
|
||||
# -- data loading ---------------------------------------------------------
|
||||
|
||||
def load_log_data(self, log_file: str) -> list[tuple[LogEntry, BotIdentification]]:
|
||||
"""Use LogParser to load all bot requests from the log file."""
|
||||
parser = LogParser(log_file=log_file, fmt="auto")
|
||||
entries = parser.parse()
|
||||
logger.info(f"Loaded {len(entries):,} bot entries from {log_file}")
|
||||
self._bot_entries = entries
|
||||
return entries
|
||||
|
||||
def load_sitemap_urls(self, sitemap_url: str) -> set[str]:
|
||||
"""Fetch and parse an XML sitemap, returning the set of URLs."""
|
||||
urls: set[str] = set()
|
||||
try:
|
||||
resp = requests.get(sitemap_url, timeout=30, headers={
|
||||
"User-Agent": "CrawlBudgetAnalyzer/1.0",
|
||||
})
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.content, "lxml-xml")
|
||||
|
||||
# Handle sitemap index
|
||||
sitemap_tags = soup.find_all("sitemap")
|
||||
if sitemap_tags:
|
||||
for st in sitemap_tags:
|
||||
loc = st.find("loc")
|
||||
if loc and loc.text:
|
||||
child_urls = self._fetch_sitemap_child(loc.text.strip())
|
||||
urls.update(child_urls)
|
||||
else:
|
||||
for url_tag in soup.find_all("url"):
|
||||
loc = url_tag.find("loc")
|
||||
if loc and loc.text:
|
||||
urls.add(self._normalize_url(loc.text.strip()))
|
||||
|
||||
logger.info(f"Loaded {len(urls):,} URLs from sitemap: {sitemap_url}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load sitemap {sitemap_url}: {e}")
|
||||
|
||||
self._sitemap_urls = urls
|
||||
return urls
|
||||
|
||||
def _fetch_sitemap_child(self, url: str) -> set[str]:
|
||||
"""Fetch a child sitemap from a sitemap index."""
|
||||
urls: set[str] = set()
|
||||
try:
|
||||
resp = requests.get(url, timeout=30, headers={
|
||||
"User-Agent": "CrawlBudgetAnalyzer/1.0",
|
||||
})
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.content, "lxml-xml")
|
||||
for url_tag in soup.find_all("url"):
|
||||
loc = url_tag.find("loc")
|
||||
if loc and loc.text:
|
||||
urls.add(self._normalize_url(loc.text.strip()))
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to fetch child sitemap {url}: {e}")
|
||||
return urls
|
||||
|
||||
@staticmethod
|
||||
def _normalize_url(url: str) -> str:
|
||||
"""Normalize a URL by removing trailing slash and lowercasing the scheme/host."""
|
||||
parsed = urlparse(url)
|
||||
path = parsed.path.rstrip("/") or "/"
|
||||
return f"{parsed.scheme}://{parsed.netloc}{path}"
|
||||
|
||||
# -- waste identification -------------------------------------------------
|
||||
|
||||
def identify_parameter_waste(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find URLs with unnecessary query parameters wasting crawl budget."""
|
||||
waste_urls: list[str] = []
|
||||
for entry, _ in bot_requests:
|
||||
parsed = urlparse(entry.url)
|
||||
if not parsed.query:
|
||||
continue
|
||||
params = parse_qs(parsed.query)
|
||||
param_keys = {k.lower() for k in params}
|
||||
# Check for waste parameters
|
||||
has_waste = bool(param_keys & WASTE_PARAMS)
|
||||
# Check for tracking parameters
|
||||
has_tracking = any(TRACKING_PARAMS_RE.match(k) for k in param_keys)
|
||||
# Check for deep pagination
|
||||
page_val = params.get(PAGINATION_PARAM, params.get("p", [None]))
|
||||
has_deep_page = False
|
||||
if page_val and page_val[0]:
|
||||
try:
|
||||
if int(page_val[0]) > HIGH_PAGE_THRESHOLD:
|
||||
has_deep_page = True
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
if has_waste or has_tracking or has_deep_page:
|
||||
waste_urls.append(entry.url)
|
||||
|
||||
total = len(bot_requests)
|
||||
count = len(waste_urls)
|
||||
pct = (count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="parameter_urls",
|
||||
urls=list(set(waste_urls)),
|
||||
count=count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"robots.txt에 불필요한 parameter URL 패턴을 Disallow로 추가하거나, "
|
||||
"Google Search Console의 URL Parameters 설정을 활용하세요. "
|
||||
"UTM 파라미터가 포함된 URL은 canonical 태그로 처리하세요."
|
||||
),
|
||||
)
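To make the heuristics above concrete, here is a small illustrative check of which query strings this method treats as waste; it is a simplified restatement of the rules, and the URLs are made up.

```python
from urllib.parse import parse_qs, urlparse

WASTE_PARAMS = {"sort", "filter", "order", "orderby", "dir", "direction"}


def is_waste(url: str, page_threshold: int = 5) -> bool:
    """Simplified restatement of the parameter-waste rules in identify_parameter_waste."""
    q = parse_qs(urlparse(url).query)
    keys = {k.lower() for k in q}
    if keys & WASTE_PARAMS:
        return True                                         # sort/filter-style parameters
    if any(k.startswith("utm_") for k in keys):
        return True                                         # tracking parameters
    page = q.get("page", q.get("p", ["0"]))[0]
    return page.isdigit() and int(page) > page_threshold    # deep pagination


print(is_waste("https://example.com/shoes?sort=price"))        # True
print(is_waste("https://example.com/?utm_source=newsletter"))  # True
print(is_waste("https://example.com/shoes?page=2"))            # False (shallow pagination)
print(is_waste("https://example.com/shoes?color=red"))         # False
```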
|
||||
|
||||
def identify_redirect_chains(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find URLs that repeatedly return redirect status codes."""
|
||||
redirect_urls: list[str] = []
|
||||
redirect_counter: Counter = Counter()
|
||||
for entry, _ in bot_requests:
|
||||
if entry.status_code in REDIRECT_STATUSES:
|
||||
redirect_counter[entry.url] += 1
|
||||
redirect_urls.append(entry.url)
|
||||
|
||||
# URLs redirected more than once are chain candidates
|
||||
chain_urls = [url for url, cnt in redirect_counter.items() if cnt >= 2]
|
||||
total = len(bot_requests)
|
||||
count = len(redirect_urls)
|
||||
pct = (count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="redirect_chains",
|
||||
urls=chain_urls,
|
||||
count=count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"301/302 리다이렉트가 반복적으로 크롤링되고 있습니다. "
|
||||
"내부 링크를 최종 목적지 URL로 직접 업데이트하고, "
|
||||
"리다이렉트 체인을 단일 리다이렉트로 단축하세요."
|
||||
),
|
||||
)
|
||||
|
||||
def identify_soft_404s(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find 200-status pages with suspiciously small response sizes."""
|
||||
soft_404_urls: list[str] = []
|
||||
for entry, _ in bot_requests:
|
||||
if entry.status_code == 200 and entry.response_size < SOFT_404_MAX_SIZE:
|
||||
if entry.response_size > 0:
|
||||
soft_404_urls.append(entry.url)
|
||||
|
||||
total = len(bot_requests)
|
||||
count = len(soft_404_urls)
|
||||
pct = (count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="soft_404s",
|
||||
urls=list(set(soft_404_urls)),
|
||||
count=count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"200 상태 코드를 반환하지만 콘텐츠가 거의 없는 Soft 404 페이지입니다. "
|
||||
"실제 404 상태 코드를 반환하거나, 해당 페이지에 noindex 태그를 추가하세요."
|
||||
),
|
||||
)
|
||||
|
||||
def identify_duplicate_crawls(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find duplicate URL variants: www/non-www, trailing slash, etc."""
|
||||
url_variants: dict[str, set[str]] = defaultdict(set)
|
||||
for entry, _ in bot_requests:
|
||||
parsed = urlparse(entry.url)
|
||||
# Normalize: strip www, strip trailing slash, lowercase
|
||||
host = parsed.netloc.lower().removeprefix("www.")  # removeprefix, not lstrip: lstrip("www.") would strip any leading "w"/"." characters
|
||||
path = parsed.path.rstrip("/") or "/"
|
||||
canonical = f"{host}{path}"
|
||||
full_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
|
||||
url_variants[canonical].add(full_url)
|
||||
|
||||
# Identify canonicals with multiple variants
|
||||
duplicate_urls: list[str] = []
|
||||
for canonical, variants in url_variants.items():
|
||||
if len(variants) > 1:
|
||||
duplicate_urls.extend(variants)
|
||||
|
||||
total = len(bot_requests)
|
||||
# Count how many requests hit duplicate variant URLs
|
||||
dup_set = set(duplicate_urls)
|
||||
dup_request_count = sum(1 for e, _ in bot_requests if f"{urlparse(e.url).scheme}://{urlparse(e.url).netloc}{urlparse(e.url).path}" in dup_set)
|
||||
pct = (dup_request_count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="duplicate_urls",
|
||||
urls=duplicate_urls[:TOP_N_URLS],
|
||||
count=dup_request_count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"www/non-www, trailing slash 유무 등 중복 URL 변형이 크롤링되고 있습니다. "
|
||||
"301 리다이렉트로 canonical URL로 통합하고, "
|
||||
"rel=canonical 태그를 정확히 설정하세요."
|
||||
),
|
||||
)
|
||||
|
||||
# -- bot profiling --------------------------------------------------------
|
||||
|
||||
def profile_bots(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> dict[str, BotProfile]:
|
||||
"""Generate per-bot behavior profiles."""
|
||||
bot_data: dict[str, dict] = defaultdict(lambda: {
|
||||
"urls": Counter(),
|
||||
"statuses": Counter(),
|
||||
"hours": Counter(),
|
||||
"days": set(),
|
||||
"depths": Counter(),
|
||||
"count": 0,
|
||||
})
|
||||
|
||||
for entry, bot in bot_requests:
|
||||
bd = bot_data[bot.name]
|
||||
bd["count"] += 1
|
||||
bd["urls"][entry.url] += 1
|
||||
bd["statuses"][str(entry.status_code)] += 1
|
||||
# URL depth = number of path segments
|
||||
depth = len([s for s in urlparse(entry.url).path.split("/") if s])
|
||||
bd["depths"][depth] += 1
|
||||
if entry.timestamp:
|
||||
bd["hours"][entry.timestamp.hour] += 1
|
||||
bd["days"].add(entry.timestamp.strftime("%Y-%m-%d"))
|
||||
|
||||
profiles: dict[str, BotProfile] = {}
|
||||
for name, bd in bot_data.items():
|
||||
days_active = len(bd["days"]) or 1
|
||||
rpd = bd["count"] / days_active
|
||||
# Top 3 peak hours
|
||||
top_hours = sorted(bd["hours"].items(), key=lambda x: -x[1])[:3]
|
||||
peak = [h for h, _ in top_hours]
|
||||
profiles[name] = BotProfile(
|
||||
name=name,
|
||||
total_requests=bd["count"],
|
||||
requests_per_day=rpd,
|
||||
crawl_depth_distribution=dict(sorted(bd["depths"].items())),
|
||||
peak_hours=peak,
|
||||
status_breakdown=dict(bd["statuses"]),
|
||||
top_crawled_urls=bd["urls"].most_common(TOP_N_URLS),
|
||||
unique_urls=len(bd["urls"]),
|
||||
days_active=days_active,
|
||||
)
|
||||
return profiles
|
||||
|
||||
# -- orphan detection -----------------------------------------------------
|
||||
|
||||
def detect_orphan_pages(
|
||||
self,
|
||||
crawled_urls: set[str],
|
||||
sitemap_urls: set[str],
|
||||
) -> dict[str, list[OrphanPage]]:
|
||||
"""Compare crawled URLs with sitemap URLs to find orphans."""
|
||||
in_sitemap_not_crawled = sitemap_urls - crawled_urls
|
||||
crawled_not_in_sitemap = crawled_urls - sitemap_urls
|
||||
|
||||
return {
|
||||
"in_sitemap_not_crawled": [
|
||||
OrphanPage(url=u, in_sitemap=True, crawled=False)
|
||||
for u in sorted(in_sitemap_not_crawled)
|
||||
],
|
||||
"crawled_not_in_sitemap": [
|
||||
OrphanPage(url=u, in_sitemap=False, crawled=True)
|
||||
for u in sorted(crawled_not_in_sitemap)
|
||||
],
|
||||
}
|
||||
|
||||
# -- efficiency score -----------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def calculate_efficiency_score(total_waste_pct: float) -> int:
|
||||
"""Calculate crawl efficiency score: 100 - waste%, capped at [0, 100]."""
|
||||
score = int(100 - total_waste_pct)
|
||||
return max(0, min(100, score))
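As a quick worked example of this scoring (using the waste total from the sample report in CLAUDE.md):

```python
# 8.5% total waste -> int(100 - 8.5) = 91; out-of-range values are clamped to 0-100.
CrawlBudgetAnalyzer.calculate_efficiency_score(8.5)    # 91
CrawlBudgetAnalyzer.calculate_efficiency_score(120.0)  # 0
```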
|
||||
|
||||
# -- recommendations ------------------------------------------------------
|
||||
|
||||
def generate_recommendations(
|
||||
self,
|
||||
waste: list[CrawlWaste],
|
||||
orphans: dict[str, list[OrphanPage]],
|
||||
bot_profiles: dict[str, BotProfile],
|
||||
) -> list[CrawlRecommendation]:
|
||||
"""Generate prioritized crawl budget optimization recommendations."""
|
||||
recs: list[CrawlRecommendation] = []
|
||||
|
||||
# Waste-based recommendations
|
||||
for w in waste:
|
||||
if w.pct_of_total > 5.0:
|
||||
priority = "critical"
|
||||
elif w.pct_of_total > 2.0:
|
||||
priority = "high"
|
||||
elif w.pct_of_total > 0.5:
|
||||
priority = "medium"
|
||||
else:
|
||||
priority = "low"
|
||||
|
||||
if w.waste_type == "parameter_urls" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="URL Parameters",
|
||||
priority=priority,
|
||||
action="robots.txt에 parameter URL 패턴 Disallow 규칙 추가",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 parameter URL이 크롤링되었습니다. "
|
||||
f"sort, filter, utm_* 등 불필요한 파라미터를 차단하세요."
|
||||
),
|
||||
))
|
||||
elif w.waste_type == "redirect_chains" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Redirect Chains",
|
||||
priority=priority,
|
||||
action="리다이렉트 체인을 단축하고 내부 링크 업데이트",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 리다이렉트 요청이 발생했습니다. "
|
||||
f"내부 링크를 최종 URL로 직접 연결하세요."
|
||||
),
|
||||
))
|
||||
elif w.waste_type == "soft_404s" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Soft 404s",
|
||||
priority=priority,
|
||||
action="Soft 404 페이지에 적절한 HTTP 상태 코드 또는 noindex 적용",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 Soft 404가 감지되었습니다. "
|
||||
f"적절한 404 응답 또는 noindex meta 태그를 설정하세요."
|
||||
),
|
||||
))
|
||||
elif w.waste_type == "duplicate_urls" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Duplicate URLs",
|
||||
priority=priority,
|
||||
action="URL 정규화 및 canonical 태그 설정",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 중복 URL 변형이 크롤링되었습니다. "
|
||||
f"www/non-www, trailing slash 통합을 진행하세요."
|
||||
),
|
||||
))
|
||||
|
||||
# Orphan page recommendations
|
||||
not_crawled = orphans.get("in_sitemap_not_crawled", [])
|
||||
not_in_sitemap = orphans.get("crawled_not_in_sitemap", [])
|
||||
|
||||
if len(not_crawled) > 0:
|
||||
pct = len(not_crawled) / max(len(self._sitemap_urls), 1) * 100
|
||||
priority = "critical" if pct > 30 else "high" if pct > 10 else "medium"
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Orphan Pages (Uncrawled)",
|
||||
priority=priority,
|
||||
action="사이트맵에 있으나 크롤링되지 않은 페이지의 내부 링크 강화",
|
||||
impact=f"사이트맵 URL의 {pct:.1f}%가 미크롤 상태",
|
||||
details=(
|
||||
f"총 {len(not_crawled):,}개 URL이 사이트맵에 있지만 "
|
||||
f"봇이 크롤링하지 않았습니다. 내부 링크를 추가하세요."
|
||||
),
|
||||
))
|
||||
|
||||
if len(not_in_sitemap) > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Orphan Pages (Unlisted)",
|
||||
priority="medium",
|
||||
action="크롤링되었으나 사이트맵에 없는 페이지를 사이트맵에 추가 또는 차단",
|
||||
impact=f"{len(not_in_sitemap):,}개 URL이 사이트맵에 미등록",
|
||||
details=(
|
||||
f"봇이 크롤링한 {len(not_in_sitemap):,}개 URL이 "
|
||||
f"사이트맵에 포함되어 있지 않습니다. 유효한 페이지는 "
|
||||
f"사이트맵에 추가하고, 불필요한 페이지는 robots.txt로 차단하세요."
|
||||
),
|
||||
))
|
||||
|
||||
# Bot-specific recommendations
|
||||
for name, profile in bot_profiles.items():
|
||||
error_count = sum(
|
||||
v for k, v in profile.status_breakdown.items()
|
||||
if k.startswith("4") or k.startswith("5")
|
||||
)
|
||||
error_pct = (error_count / profile.total_requests * 100) if profile.total_requests else 0
|
||||
if error_pct > 10:
|
||||
recs.append(CrawlRecommendation(
|
||||
category=f"Bot Errors ({name})",
|
||||
priority="high" if error_pct > 20 else "medium",
|
||||
action=f"{name}의 4xx/5xx 오류율 {error_pct:.1f}% 개선 필요",
|
||||
impact=f"{name} 크롤 예산의 {error_pct:.1f}%가 오류에 소비",
|
||||
details=(
|
||||
f"{name}이(가) {error_count:,}건의 오류 응답을 받았습니다. "
|
||||
f"깨진 링크를 수정하고 서버 안정성을 개선하세요."
|
||||
),
|
||||
))
|
||||
|
||||
# Sort by priority
|
||||
priority_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
|
||||
recs.sort(key=lambda r: priority_order.get(r.priority, 4))
|
||||
return recs
|
||||
|
||||
# -- orchestrator ---------------------------------------------------------
|
||||
|
||||
def analyze(self, scope: str = "all") -> CrawlBudgetResult:
|
||||
"""Orchestrate the full crawl budget analysis."""
|
||||
# Load log data
|
||||
entries = self.load_log_data(self.log_file)
|
||||
if not entries:
|
||||
logger.warning("No bot entries found in log file.")
|
||||
|
||||
# Load sitemap if provided
|
||||
if self.sitemap_url:
|
||||
self.load_sitemap_urls(self.sitemap_url)
|
||||
|
||||
# Profile bots
|
||||
bot_profiles: dict[str, BotProfile] = {}
|
||||
if scope in ("all", "bots"):
|
||||
bot_profiles = self.profile_bots(entries)
|
||||
|
||||
# Identify waste
|
||||
waste: list[CrawlWaste] = []
|
||||
if scope in ("all", "waste"):
|
||||
waste.append(self.identify_parameter_waste(entries))
|
||||
waste.append(self.identify_redirect_chains(entries))
|
||||
waste.append(self.identify_soft_404s(entries))
|
||||
waste.append(self.identify_duplicate_crawls(entries))
|
||||
|
||||
total_waste_pct = sum(w.pct_of_total for w in waste)
|
||||
|
||||
# Detect orphan pages
|
||||
orphans: dict[str, list[OrphanPage]] = {
|
||||
"in_sitemap_not_crawled": [],
|
||||
"crawled_not_in_sitemap": [],
|
||||
}
|
||||
if scope in ("all", "orphans") and self._sitemap_urls:
|
||||
crawled_urls: set[str] = set()
|
||||
for entry, _ in entries:
|
||||
# Build full URL from path for comparison
|
||||
if self.target_url:
|
||||
parsed_target = urlparse(self.target_url)
|
||||
full = f"{parsed_target.scheme}://{parsed_target.netloc}{entry.url}"
|
||||
crawled_urls.add(self._normalize_url(full))
|
||||
else:
|
||||
crawled_urls.add(entry.url)
|
||||
orphans = self.detect_orphan_pages(crawled_urls, self._sitemap_urls)
|
||||
|
||||
# Efficiency score
|
||||
efficiency_score = self.calculate_efficiency_score(total_waste_pct)
|
||||
|
||||
# Recommendations
|
||||
recommendations = self.generate_recommendations(waste, orphans, bot_profiles)
|
||||
|
||||
# Date range from entries
|
||||
timestamps = [e.timestamp for e, _ in entries if e.timestamp]
|
||||
analysis_period = {}
|
||||
if timestamps:
|
||||
analysis_period = {
|
||||
"from": min(timestamps).strftime("%Y-%m-%d"),
|
||||
"to": max(timestamps).strftime("%Y-%m-%d"),
|
||||
}
|
||||
|
||||
return CrawlBudgetResult(
|
||||
log_file=self.log_file,
|
||||
analysis_period=analysis_period,
|
||||
total_bot_requests=len(entries),
|
||||
bots=bot_profiles,
|
||||
waste=waste,
|
||||
total_waste_pct=total_waste_pct,
|
||||
orphan_pages=orphans,
|
||||
recommendations=recommendations,
|
||||
efficiency_score=efficiency_score,
|
||||
timestamp=datetime.now().isoformat(),
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CLI
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze crawl budget efficiency and generate optimization recommendations.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--log-file",
|
||||
required=True,
|
||||
help="Path to server access log file",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--sitemap",
|
||||
default=None,
|
||||
help="URL of XML sitemap for orphan page detection",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--url",
|
||||
default=None,
|
||||
help="Target website URL (used for URL normalization and Ahrefs)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--scope",
|
||||
choices=["all", "waste", "orphans", "bots"],
|
||||
default="all",
|
||||
help="Analysis scope (default: all)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--ahrefs",
|
||||
action="store_true",
|
||||
help="Include Ahrefs page history comparison (requires MCP tool)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json",
|
||||
action="store_true",
|
||||
help="Output in JSON format",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
default=None,
|
||||
help="Write output to file instead of stdout",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate log file exists
|
||||
if not Path(args.log_file).exists():
|
||||
logger.error(f"Log file not found: {args.log_file}")
|
||||
sys.exit(1)
|
||||
|
||||
analyzer = CrawlBudgetAnalyzer(
|
||||
log_file=args.log_file,
|
||||
sitemap_url=args.sitemap,
|
||||
target_url=args.url,
|
||||
)
|
||||
|
||||
result = analyzer.analyze(scope=args.scope)
|
||||
|
||||
if args.json:
|
||||
output_data = result.to_dict()
|
||||
output_str = json.dumps(output_data, indent=2, ensure_ascii=False)
|
||||
else:
|
||||
lines = _format_text_report(result)
|
||||
output_str = "\n".join(lines)
|
||||
|
||||
if args.output:
|
||||
Path(args.output).write_text(output_str, encoding="utf-8")
|
||||
logger.info(f"Output written to {args.output}")
|
||||
else:
|
||||
print(output_str)
|
||||
|
||||
|
||||
def _format_text_report(result: CrawlBudgetResult) -> list[str]:
|
||||
"""Format the analysis result as a human-readable text report."""
|
||||
lines = [
|
||||
"=" * 70,
|
||||
"Crawl Budget Analysis Report",
|
||||
"=" * 70,
|
||||
f"Log File: {result.log_file}",
|
||||
f"Total Bot Requests: {result.total_bot_requests:,}",
|
||||
f"Efficiency Score: {result.efficiency_score}/100",
|
||||
f"Total Waste: {result.total_waste_pct:.1f}%",
|
||||
]
|
||||
if result.analysis_period:
|
||||
lines.append(
|
||||
f"Period: {result.analysis_period.get('from', 'N/A')} ~ "
|
||||
f"{result.analysis_period.get('to', 'N/A')}"
|
||||
)
|
||||
lines.append("")
|
||||
|
||||
# Bot profiles
|
||||
if result.bots:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Bot Profiles")
|
||||
lines.append("-" * 60)
|
||||
for name, profile in sorted(result.bots.items(), key=lambda x: -x[1].total_requests):
|
||||
lines.append(f"\n [{name.upper()}]")
|
||||
lines.append(f" Requests: {profile.total_requests:,}")
|
||||
lines.append(f" Unique URLs: {profile.unique_urls:,}")
|
||||
lines.append(f" Requests/Day: {profile.requests_per_day:,.1f}")
|
||||
lines.append(f" Days Active: {profile.days_active}")
|
||||
lines.append(f" Peak Hours: {profile.peak_hours}")
|
||||
lines.append(f" Status: {profile.status_breakdown}")
|
||||
lines.append("")
|
||||
|
||||
# Waste breakdown
|
||||
if result.waste:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Crawl Waste Breakdown")
|
||||
lines.append("-" * 60)
|
||||
for w in result.waste:
|
||||
if w.count > 0:
|
||||
lines.append(f"\n [{w.waste_type}]")
|
||||
lines.append(f" Count: {w.count:,} ({w.pct_of_total:.1f}%)")
|
||||
lines.append(f" Recommendation: {w.recommendation}")
|
||||
if w.urls:
|
||||
lines.append(f" Sample URLs:")
|
||||
for u in w.urls[:5]:
|
||||
lines.append(f" - {u}")
|
||||
lines.append("")
|
||||
|
||||
# Orphan pages
|
||||
not_crawled = result.orphan_pages.get("in_sitemap_not_crawled", [])
|
||||
not_in_sitemap = result.orphan_pages.get("crawled_not_in_sitemap", [])
|
||||
if not_crawled or not_in_sitemap:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Orphan Pages")
|
||||
lines.append("-" * 60)
|
||||
if not_crawled:
|
||||
lines.append(f"\n In Sitemap but Not Crawled: {len(not_crawled):,}")
|
||||
for op in not_crawled[:10]:
|
||||
lines.append(f" - {op.url}")
|
||||
if not_in_sitemap:
|
||||
lines.append(f"\n Crawled but Not in Sitemap: {len(not_in_sitemap):,}")
|
||||
for op in not_in_sitemap[:10]:
|
||||
lines.append(f" - {op.url}")
|
||||
lines.append("")
|
||||
|
||||
# Recommendations
|
||||
if result.recommendations:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Recommendations")
|
||||
lines.append("-" * 60)
|
||||
for i, rec in enumerate(result.recommendations, 1):
|
||||
lines.append(f"\n {i}. [{rec.priority.upper()}] {rec.category}")
|
||||
lines.append(f" Action: {rec.action}")
|
||||
lines.append(f" Impact: {rec.impact}")
|
||||
lines.append(f" Details: {rec.details}")
|
||||
|
||||
lines.append("")
|
||||
lines.append(f"Generated: {result.timestamp}")
|
||||
return lines
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
613
custom-skills/32-seo-crawl-budget/code/scripts/log_parser.py
Normal file
@@ -0,0 +1,613 @@
|
||||
"""
|
||||
Log Parser - Server access log parser with bot identification
|
||||
=============================================================
|
||||
Purpose: Parse Apache/Nginx/CloudFront access logs, identify search engine
|
||||
bots, extract crawl data, and generate per-bot statistics.
|
||||
Python: 3.10+
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import bz2
|
||||
import gzip
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Generator, TextIO
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s - %(levelname)s - %(message)s",
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants: bot user-agent patterns
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
BOT_PATTERNS: list[tuple[str, str, str]] = [
|
||||
# (canonical name, regex pattern, category)
|
||||
("googlebot", r"Googlebot(?:-Image|-News|-Video)?/", "search_engine"),
|
||||
("googlebot-adsbot", r"AdsBot-Google", "search_engine"),
|
||||
("googlebot-mediapartners", r"Mediapartners-Google", "search_engine"),
|
||||
("yeti", r"Yeti/", "search_engine"),
|
||||
("bingbot", r"bingbot/", "search_engine"),
|
||||
("daumoa", r"Daumoa", "search_engine"),
|
||||
("applebot", r"Applebot/", "search_engine"),
|
||||
("duckduckbot", r"DuckDuckBot/", "search_engine"),
|
||||
("baiduspider", r"Baiduspider", "search_engine"),
|
||||
("yandexbot", r"YandexBot/", "search_engine"),
|
||||
("sogou", r"Sogou", "search_engine"),
|
||||
("seznambot", r"SeznamBot/", "search_engine"),
|
||||
("ahrefsbot", r"AhrefsBot/", "seo_tool"),
|
||||
("semrushbot", r"SemrushBot/", "seo_tool"),
|
||||
("mj12bot", r"MJ12bot/", "seo_tool"),
|
||||
("dotbot", r"DotBot/", "seo_tool"),
|
||||
("rogerbot", r"rogerbot/", "seo_tool"),
|
||||
("screaming frog", r"Screaming Frog SEO Spider", "seo_tool"),
|
||||
]
|
||||
|
||||
COMPILED_BOT_PATTERNS: list[tuple[str, re.Pattern, str]] = [
|
||||
(name, re.compile(pattern, re.IGNORECASE), category)
|
||||
for name, pattern, category in BOT_PATTERNS
|
||||
]
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Regex patterns for each log format
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
NGINX_COMBINED_RE = re.compile(
|
||||
r'(?P<ip>[\d.:a-fA-F]+)\s+-\s+(?P<user>\S+)\s+'
|
||||
r'\[(?P<timestamp>[^\]]+)\]\s+'
|
||||
r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
|
||||
r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
|
||||
r'"(?P<referer>[^"]*)"\s+'
|
||||
r'"(?P<user_agent>[^"]*)"'
|
||||
)
|
||||
|
||||
APACHE_COMBINED_RE = re.compile(
|
||||
r'(?P<ip>[\d.:a-fA-F]+)\s+\S+\s+(?P<user>\S+)\s+'
|
||||
r'\[(?P<timestamp>[^\]]+)\]\s+'
|
||||
r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
|
||||
r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
|
||||
r'"(?P<referer>[^"]*)"\s+'
|
||||
r'"(?P<user_agent>[^"]*)"'
|
||||
)
|
||||
|
||||
CLOUDFRONT_FIELDS = [
|
||||
"date", "time", "x_edge_location", "sc_bytes", "c_ip",
|
||||
"cs_method", "cs_host", "cs_uri_stem", "sc_status",
|
||||
"cs_referer", "cs_user_agent", "cs_uri_query",
|
||||
"cs_cookie", "x_edge_result_type", "x_edge_request_id",
|
||||
"x_host_header", "cs_protocol", "cs_bytes",
|
||||
"time_taken", "x_forwarded_for", "ssl_protocol",
|
||||
"ssl_cipher", "x_edge_response_result_type", "cs_protocol_version",
|
||||
]
|
||||
|
||||
# Timestamp formats
|
||||
NGINX_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
|
||||
APACHE_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
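To make the combined-format regex concrete, a self-contained check against a made-up Nginx log line; the pattern is repeated here so the snippet runs standalone, and the field names mirror the named groups defined above.

```python
import re

NGINX_COMBINED_RE = re.compile(
    r'(?P<ip>[\d.:a-fA-F]+)\s+-\s+(?P<user>\S+)\s+'
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
    r'"(?P<referer>[^"]*)"\s+'
    r'"(?P<user_agent>[^"]*)"'
)

# Illustrative log line for a Googlebot request.
sample = (
    '66.249.66.1 - - [15/Jan/2025:09:30:12 +0900] '
    '"GET /products?sort=price HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

m = NGINX_COMBINED_RE.match(sample)
print(m.group("ip"), m.group("status"), m.group("url"))
# -> 66.249.66.1 200 /products?sort=price
```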
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Data classes
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@dataclass
|
||||
class LogEntry:
|
||||
"""A single parsed log entry."""
|
||||
timestamp: datetime | None
|
||||
ip: str
|
||||
method: str
|
||||
url: str
|
||||
status_code: int
|
||||
response_size: int
|
||||
user_agent: str
|
||||
referer: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
d = asdict(self)
|
||||
if self.timestamp:
|
||||
d["timestamp"] = self.timestamp.isoformat()
|
||||
return d
|
||||
|
||||
|
||||
@dataclass
|
||||
class BotIdentification:
|
||||
"""Bot identification result."""
|
||||
name: str
|
||||
user_agent_pattern: str
|
||||
category: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class BotStats:
|
||||
"""Aggregated statistics for a single bot."""
|
||||
name: str
|
||||
total_requests: int = 0
|
||||
unique_urls: int = 0
|
||||
status_distribution: dict[str, int] = field(default_factory=dict)
|
||||
top_urls: list[tuple[str, int]] = field(default_factory=list)
|
||||
hourly_distribution: dict[int, int] = field(default_factory=dict)
|
||||
daily_distribution: dict[str, int] = field(default_factory=dict)
|
||||
avg_response_size: float = 0.0
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"name": self.name,
|
||||
"total_requests": self.total_requests,
|
||||
"unique_urls": self.unique_urls,
|
||||
"status_distribution": self.status_distribution,
|
||||
"top_urls": [{"url": u, "count": c} for u, c in self.top_urls],
|
||||
"hourly_distribution": self.hourly_distribution,
|
||||
"daily_distribution": self.daily_distribution,
|
||||
"avg_response_size": round(self.avg_response_size, 1),
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class LogParseResult:
|
||||
"""Complete log parsing result."""
|
||||
log_file: str
|
||||
format_detected: str
|
||||
total_lines: int
|
||||
parsed_lines: int
|
||||
bot_entries: int
|
||||
date_range: dict[str, str]
|
||||
bots: dict[str, BotStats]
|
||||
errors: int
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"log_file": self.log_file,
|
||||
"format_detected": self.format_detected,
|
||||
"total_lines": self.total_lines,
|
||||
"parsed_lines": self.parsed_lines,
|
||||
"bot_entries": self.bot_entries,
|
||||
"date_range": self.date_range,
|
||||
"bots": {name: stats.to_dict() for name, stats in self.bots.items()},
|
||||
"errors": self.errors,
|
||||
}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# LogParser class
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class LogParser:
|
||||
"""Parse server access logs and identify search engine bot traffic."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
log_file: str,
|
||||
fmt: str = "auto",
|
||||
streaming: bool = False,
|
||||
):
|
||||
self.log_file = log_file
|
||||
self.fmt = fmt
|
||||
self.streaming = streaming
|
||||
self._detected_format: str | None = None
|
||||
self._parse_errors = 0
|
||||
|
||||
# -- format detection -----------------------------------------------------
|
||||
|
||||
def _detect_format(self, line: str) -> str:
|
||||
"""Auto-detect log format from a sample line."""
|
||||
if line.startswith("#"):
|
||||
return "cloudfront"
|
||||
if NGINX_COMBINED_RE.match(line):
|
||||
return "nginx"
|
||||
if APACHE_COMBINED_RE.match(line):
|
||||
return "apache"
|
||||
# Fallback: try tab-separated (CloudFront without header)
|
||||
if "\t" in line and line.count("\t") >= 10:
|
||||
return "cloudfront"
|
||||
return "nginx"
|
||||
|
||||
# -- line parsers ---------------------------------------------------------
|
||||
|
||||
def _parse_nginx_combined(self, line: str) -> LogEntry | None:
|
||||
"""Parse a single Nginx combined format log line."""
|
||||
m = NGINX_COMBINED_RE.match(line)
|
||||
if not m:
|
||||
return None
|
||||
ts = None
|
||||
try:
|
||||
ts = datetime.strptime(m.group("timestamp"), NGINX_TS_FORMAT)
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
size_raw = m.group("size")
|
||||
size = int(size_raw) if size_raw != "-" else 0
|
||||
return LogEntry(
|
||||
timestamp=ts,
|
||||
ip=m.group("ip"),
|
||||
method=m.group("method"),
|
||||
url=m.group("url"),
|
||||
status_code=int(m.group("status")),
|
||||
response_size=size,
|
||||
user_agent=m.group("user_agent"),
|
||||
referer=m.group("referer"),
|
||||
)
|
||||
|
||||
def _parse_apache_combined(self, line: str) -> LogEntry | None:
|
||||
"""Parse a single Apache combined format log line."""
|
||||
m = APACHE_COMBINED_RE.match(line)
|
||||
if not m:
|
||||
return None
|
||||
ts = None
|
||||
try:
|
||||
ts = datetime.strptime(m.group("timestamp"), APACHE_TS_FORMAT)
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
size_raw = m.group("size")
|
||||
size = int(size_raw) if size_raw != "-" else 0
|
||||
return LogEntry(
|
||||
timestamp=ts,
|
||||
ip=m.group("ip"),
|
||||
method=m.group("method"),
|
||||
url=m.group("url"),
|
||||
status_code=int(m.group("status")),
|
||||
response_size=size,
|
||||
user_agent=m.group("user_agent"),
|
||||
referer=m.group("referer"),
|
||||
)
|
||||
|
||||
def _parse_cloudfront(self, line: str) -> LogEntry | None:
|
||||
"""Parse a CloudFront tab-separated log line."""
|
||||
if line.startswith("#"):
|
||||
return None
|
||||
parts = line.strip().split("\t")
|
||||
if len(parts) < 13:
|
||||
return None
|
||||
ts = None
|
||||
try:
|
||||
ts = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
|
||||
except (ValueError, IndexError):
|
||||
pass
|
||||
try:
|
||||
status = int(parts[8])
|
||||
except (ValueError, IndexError):
|
||||
status = 0
|
||||
try:
|
||||
size = int(parts[3])
|
||||
except (ValueError, IndexError):
|
||||
size = 0
|
||||
url = parts[7] if len(parts) > 7 else ""
|
||||
query = parts[11] if len(parts) > 11 else ""
|
||||
if query and query != "-":
|
||||
url = f"{url}?{query}"
|
||||
ua = parts[10] if len(parts) > 10 else ""
|
||||
ua = ua.replace("%20", " ").replace("%2520", " ")
|
||||
referer = parts[9] if len(parts) > 9 else ""
|
||||
return LogEntry(
|
||||
timestamp=ts,
|
||||
ip=parts[4] if len(parts) > 4 else "",
|
||||
method=parts[5] if len(parts) > 5 else "",
|
||||
url=url,
|
||||
status_code=status,
|
||||
response_size=size,
|
||||
user_agent=ua,
|
||||
referer=referer,
|
||||
)
|
||||
|
||||
def _parse_line(self, line: str, fmt: str) -> LogEntry | None:
|
||||
"""Route to the correct parser based on format."""
|
||||
parsers = {
|
||||
"nginx": self._parse_nginx_combined,
|
||||
"apache": self._parse_apache_combined,
|
||||
"cloudfront": self._parse_cloudfront,
|
||||
}
|
||||
parser = parsers.get(fmt, self._parse_nginx_combined)
|
||||
return parser(line)
|
||||
|
||||
# -- bot identification ---------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def identify_bot(user_agent: str) -> BotIdentification | None:
|
||||
"""Match user-agent against known bot patterns."""
|
||||
if not user_agent or user_agent == "-":
|
||||
return None
|
||||
for name, pattern, category in COMPILED_BOT_PATTERNS:
|
||||
if pattern.search(user_agent):
|
||||
return BotIdentification(
|
||||
name=name,
|
||||
user_agent_pattern=pattern.pattern,
|
||||
category=category,
|
||||
)
|
||||
# Heuristic: generic bot detection via common keywords
|
||||
ua_lower = user_agent.lower()
|
||||
bot_keywords = ["bot", "spider", "crawler", "scraper", "fetch"]
|
||||
for kw in bot_keywords:
|
||||
if kw in ua_lower:
|
||||
return BotIdentification(
|
||||
name="other",
|
||||
user_agent_pattern=kw,
|
||||
category="other",
|
||||
)
|
||||
return None
|
||||
|
||||
# -- file handling --------------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _open_file(path: str) -> TextIO:
|
||||
"""Open plain text, .gz, or .bz2 log files."""
|
||||
p = Path(path)
|
||||
if p.suffix == ".gz":
|
||||
return gzip.open(path, "rt", encoding="utf-8", errors="replace")
|
||||
if p.suffix == ".bz2":
|
||||
return bz2.open(path, "rt", encoding="utf-8", errors="replace")
|
||||
return open(path, "r", encoding="utf-8", errors="replace")
|
||||
|
||||
# -- streaming parser -----------------------------------------------------
|
||||
|
||||
def parse_streaming(
|
||||
self,
|
||||
filter_bot: str | None = None,
|
||||
) -> Generator[tuple[LogEntry, BotIdentification], None, None]:
|
||||
"""Generator-based streaming parser for large files."""
|
||||
fmt = self.fmt
|
||||
first_line_checked = False
|
||||
|
||||
fh = self._open_file(self.log_file)
|
||||
try:
|
||||
for line in fh:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
if not first_line_checked and fmt == "auto":
|
||||
fmt = self._detect_format(line)
|
||||
self._detected_format = fmt
|
||||
first_line_checked = True
|
||||
entry = self._parse_line(line, fmt)
|
||||
if entry is None:
|
||||
self._parse_errors += 1
|
||||
continue
|
||||
bot = self.identify_bot(entry.user_agent)
|
||||
if bot is None:
|
||||
continue
|
||||
if filter_bot and bot.name != filter_bot.lower():
|
||||
continue
|
||||
yield entry, bot
|
||||
finally:
|
||||
fh.close()

    # -- full parse -----------------------------------------------------------

    def parse(
        self,
        filter_bot: str | None = None,
        date_from: datetime | None = None,
        date_to: datetime | None = None,
    ) -> list[tuple[LogEntry, BotIdentification]]:
        """Full parse with optional date and bot filters."""
        results: list[tuple[LogEntry, BotIdentification]] = []
        for entry, bot in self.parse_streaming(filter_bot):
            if date_from and entry.timestamp and entry.timestamp < date_from:
                continue
            if date_to and entry.timestamp and entry.timestamp > date_to:
                continue
            results.append((entry, bot))
        return results

    # -- statistics -----------------------------------------------------------

    @staticmethod
    def get_bot_stats(
        entries: list[tuple[LogEntry, BotIdentification]],
    ) -> dict[str, BotStats]:
        """Aggregate per-bot statistics from parsed entries."""
        bot_data: dict[str, dict] = defaultdict(lambda: {
            "urls": Counter(),
            "statuses": Counter(),
            "hours": Counter(),
            "days": Counter(),
            "sizes": [],
            "count": 0,
        })

        for entry, bot in entries:
            bd = bot_data[bot.name]
            bd["count"] += 1
            bd["urls"][entry.url] += 1
            bd["statuses"][str(entry.status_code)] += 1
            bd["sizes"].append(entry.response_size)
            if entry.timestamp:
                bd["hours"][entry.timestamp.hour] += 1
                day_key = entry.timestamp.strftime("%Y-%m-%d")
                bd["days"][day_key] += 1

        stats: dict[str, BotStats] = {}
        for name, bd in bot_data.items():
            avg_size = sum(bd["sizes"]) / len(bd["sizes"]) if bd["sizes"] else 0.0
            top_20 = bd["urls"].most_common(20)
            stats[name] = BotStats(
                name=name,
                total_requests=bd["count"],
                unique_urls=len(bd["urls"]),
                status_distribution=dict(bd["statuses"]),
                top_urls=top_20,
                hourly_distribution=dict(sorted(bd["hours"].items())),
                daily_distribution=dict(sorted(bd["days"].items())),
                avg_response_size=avg_size,
            )
        return stats
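
    # Result shape (values illustrative):
    #   {"googlebot": BotStats(total_requests=12345, unique_urls=4321,
    #                          status_distribution={"200": 11000, "404": 300},
    #                          top_urls=[("/", 980), ...], ...), ...}
    # Keys are the bot names assigned by identify_bot(); non-bot traffic never
    # reaches this method because parse_streaming() drops it.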

    # -- orchestrator ---------------------------------------------------------

    def parse_and_analyze(
        self,
        filter_bot: str | None = None,
        date_from: datetime | None = None,
        date_to: datetime | None = None,
    ) -> LogParseResult:
        """Orchestrate parsing and statistics generation."""
        entries = self.parse(filter_bot, date_from, date_to)
        bot_stats = self.get_bot_stats(entries)

        # Determine date range
        timestamps = [e.timestamp for e, _ in entries if e.timestamp]
        date_range = {}
        if timestamps:
            date_range = {
                "from": min(timestamps).isoformat(),
                "to": max(timestamps).isoformat(),
            }

        # Count total lines for context (second pass over the log file)
        total_lines = 0
        fh = self._open_file(self.log_file)
        try:
            for _ in fh:
                total_lines += 1
        finally:
            fh.close()

        return LogParseResult(
            log_file=self.log_file,
            format_detected=self._detected_format or self.fmt,
            total_lines=total_lines,
            parsed_lines=total_lines - self._parse_errors,
            bot_entries=len(entries),
            date_range=date_range,
            bots=bot_stats,
            errors=self._parse_errors,
        )


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def _parse_date(val: str) -> datetime:
    """Parse a date string in YYYY-MM-DD format."""
    return datetime.strptime(val, "%Y-%m-%d")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Parse server access logs and identify search engine bot traffic.",
    )
    parser.add_argument(
        "--log-file",
        required=True,
        help="Path to access log file (plain, .gz, .bz2)",
    )
    parser.add_argument(
        "--format",
        dest="fmt",
        choices=["auto", "nginx", "apache", "cloudfront"],
        default="auto",
        help="Log format (default: auto-detect)",
    )
    parser.add_argument(
        "--bot",
        default=None,
        help="Filter results to a specific bot (e.g., googlebot, yeti, bingbot, daumoa)",
    )
    parser.add_argument(
        "--streaming",
        action="store_true",
        help="Use streaming parser for large files (prints entries incrementally)",
    )
    parser.add_argument(
        "--date-from",
        default=None,
        help="Filter entries from date (YYYY-MM-DD)",
    )
    parser.add_argument(
        "--date-to",
        default=None,
        help="Filter entries to date (YYYY-MM-DD)",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output in JSON format",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="Write output to file instead of stdout",
    )
    args = parser.parse_args()

    # Validate file exists
    if not Path(args.log_file).exists():
        logger.error(f"Log file not found: {args.log_file}")
        sys.exit(1)

    date_from = _parse_date(args.date_from) if args.date_from else None
    date_to = _parse_date(args.date_to) if args.date_to else None

    lp = LogParser(log_file=args.log_file, fmt=args.fmt, streaming=args.streaming)

    if args.streaming and not args.json:
        # Streaming mode: print entries as they are parsed.
        # (Combining --streaming with --json falls through to the full analysis
        # below, since JSON output needs the aggregated result.)
        count = 0
        for entry, bot in lp.parse_streaming(args.bot):
            if date_from and entry.timestamp and entry.timestamp < date_from:
                continue
            if date_to and entry.timestamp and entry.timestamp > date_to:
                continue
            ts_str = entry.timestamp.isoformat() if entry.timestamp else "N/A"
            print(
                f"[{bot.name}] {ts_str} {entry.status_code} "
                f"{entry.method} {entry.url} ({entry.response_size}B)"
            )
            count += 1
        print(f"\n--- Total bot requests: {count} ---")
        return

    # Full analysis mode
    result = lp.parse_and_analyze(
        filter_bot=args.bot,
        date_from=date_from,
        date_to=date_to,
    )

    if args.json:
        output_data = result.to_dict()
        output_str = json.dumps(output_data, indent=2, ensure_ascii=False)
    else:
        lines = [
            f"Log File: {result.log_file}",
            f"Format: {result.format_detected}",
            f"Total Lines: {result.total_lines:,}",
            f"Parsed Lines: {result.parsed_lines:,}",
            f"Bot Entries: {result.bot_entries:,}",
            f"Parse Errors: {result.errors:,}",
        ]
        if result.date_range:
            lines.append(f"Date Range: {result.date_range.get('from', 'N/A')} to {result.date_range.get('to', 'N/A')}")
        lines.append("")
        lines.append("=" * 60)
        lines.append("Bot Statistics")
        lines.append("=" * 60)
        for name, stats in sorted(result.bots.items(), key=lambda x: -x[1].total_requests):
            lines.append(f"\n--- {name.upper()} ---")
            lines.append(f" Requests: {stats.total_requests:,}")
            lines.append(f" Unique URLs: {stats.unique_urls:,}")
            lines.append(f" Avg Response Size: {stats.avg_response_size:,.0f} bytes")
            lines.append(f" Status Distribution: {stats.status_distribution}")
            lines.append(" Top 10 URLs:")
            for url, cnt in stats.top_urls[:10]:
                lines.append(f" {cnt:>6,} | {url}")
            if stats.hourly_distribution:
                peak_hour = max(stats.hourly_distribution, key=stats.hourly_distribution.get)
                lines.append(f" Peak Hour: {peak_hour}:00 ({stats.hourly_distribution[peak_hour]:,} reqs)")
        output_str = "\n".join(lines)

    if args.output:
        Path(args.output).write_text(output_str, encoding="utf-8")
        logger.info(f"Output written to {args.output}")
    else:
        print(output_str)


if __name__ == "__main__":
    main()
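
For reference, a minimal sketch of driving the parser from Python instead of the CLI, assuming `log_parser.py` is importable as a module; the constructor arguments and result fields used here are the ones shown above:

```python
from log_parser import LogParser  # assumed import path; adjust to your layout

lp = LogParser(log_file="access.log.gz", fmt="auto", streaming=True)
result = lp.parse_and_analyze(filter_bot="googlebot")

print(f"{result.format_detected}: {result.bot_entries:,} bot requests")
for name, stats in result.bots.items():
    print(f"{name}: {stats.total_requests:,} requests, {stats.unique_urls:,} unique URLs")
    for url, hits in stats.top_urls[:5]:
        print(f"  {hits:>6,}  {url}")
```

The CLI's `--date-from`/`--date-to` filters map to the `date_from`/`date_to` keyword arguments of `parse_and_analyze()`.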
@@ -0,0 +1,10 @@
# 32-seo-crawl-budget dependencies
requests>=2.31.0
aiohttp>=3.9.0
pandas>=2.1.0
beautifulsoup4>=4.12.0
lxml>=5.1.0
tenacity>=8.2.0
tqdm>=4.66.0
python-dotenv>=1.0.0
rich>=13.7.0