Add SEO skills 19-28, 31-32 with full Python implementations

12 new skills: Keyword Strategy, SERP Analysis, Position Tracking,
Link Building, Content Strategy, E-Commerce SEO, KPI Framework,
International SEO, AI Visibility, Knowledge Graph, Competitor Intel,
and Crawl Budget. ~20K lines of Python across 25 domain scripts.
Updated skill 11 pipeline table and repo CLAUDE.md.
Enhanced skill 18 local SEO workflow from jamie.clinic audit.

Note: Skill 26 hreflang_validator.py pending (content filter block).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 12:05:59 +09:00
parent 159f7ec3f7
commit a3ff965b87
125 changed files with 25948 additions and 173 deletions

View File

@@ -0,0 +1,178 @@
# CLAUDE.md
## Overview
Crawl budget optimization tool that analyzes server access logs to identify crawl budget waste. It parses Apache/Nginx/CloudFront access logs, identifies search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao), profiles per-bot crawl behavior, detects crawl waste (parameter URLs, low-value pages, redirect chains), flags orphan pages, and generates crawl efficiency recommendations. A streaming parser handles large log files.
## Quick Start
```bash
pip install -r scripts/requirements.txt
# Parse access logs
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json
# Crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file /var/log/nginx/access.log --sitemap https://example.com/sitemap.xml --json
```
## Scripts
| Script | Purpose | Key Output |
|--------|---------|------------|
| `log_parser.py` | Parse server access logs, identify bots, extract crawl data | Bot identification, request patterns, status codes |
| `crawl_budget_analyzer.py` | Analyze crawl budget efficiency and generate recommendations | Waste identification, orphan pages, optimization plan |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
## Log Parser
```bash
# Parse Nginx combined log format
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json
# Parse Apache combined log format
python scripts/log_parser.py --log-file /var/log/apache2/access.log --format apache --json
# Parse CloudFront log
python scripts/log_parser.py --log-file cloudfront-log.gz --format cloudfront --json
# Filter by specific bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json
# Parse gzipped logs
python scripts/log_parser.py --log-file access.log.gz --json
# Process large files in streaming mode
python scripts/log_parser.py --log-file access.log --streaming --json
```
**Capabilities**:
- Support for common log formats:
- Nginx combined format
- Apache combined format
- CloudFront format
- Custom format via regex
- Bot identification by User-Agent:
- Googlebot (and variants: Googlebot-Image, Googlebot-News, Googlebot-Video, AdsBot-Google)
- Yeti (Naver's crawler)
- Bingbot
- Daumoa (Kakao/Daum crawler)
- Other bots (Applebot, DuckDuckBot, Baiduspider, etc.)
- Request data extraction (timestamp, IP, URL, status code, response size, user-agent, referer)
- Streaming parser for files >1GB
- Gzip/bzip2 compressed log support
- Date range filtering
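The capabilities above can be sketched end to end. The following is a minimal illustration, not the shipped `log_parser.py`: it streams a possibly gzipped Nginx combined-format log and tallies requests per known bot. The regex and the bot patterns are simplified assumptions for illustration only.
```python
# Sketch only: stream a (possibly .gz) Nginx combined log, count bot requests.
import gzip
import re
from collections import Counter

# Simplified Nginx combined-format pattern (assumption, not the real parser's).
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<ua>[^"]*)"'
)
BOT_PATTERNS = {
    "googlebot": re.compile(r"Googlebot", re.I),
    "yeti": re.compile(r"Yeti", re.I),        # Naver
    "bingbot": re.compile(r"bingbot", re.I),
    "daumoa": re.compile(r"Daumoa", re.I),    # Kakao/Daum
}

def iter_lines(path: str):
    """Yield decoded log lines, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from fh  # streaming: one line at a time, O(1) memory

def count_bot_requests(path: str) -> Counter:
    """Count requests per identified bot via User-Agent matching."""
    counts: Counter = Counter()
    for line in iter_lines(path):
        m = COMBINED_RE.match(line)
        if not m:
            continue
        ua = m.group("ua")
        for name, pat in BOT_PATTERNS.items():
            if pat.search(ua):
                counts[name] += 1
                break
    return counts
```
The real parser adds per-format regexes, date filtering, and bzip2 support on top of the same streaming pattern.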
## Crawl Budget Analyzer
```bash
# Full crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
# Waste identification only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
# Orphan page detection
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --scope orphans --json
# Per-bot profiling
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json
# With Ahrefs page history comparison
python scripts/crawl_budget_analyzer.py --log-file access.log --url https://example.com --ahrefs --json
```
**Capabilities**:
- Crawl budget waste identification:
- Parameter URLs consuming crawl budget (?sort=, ?filter=, ?page=, ?utm_*)
- Low-value pages (thin content, noindex pages being crawled)
- Redirect chains consuming multiple crawls
- Soft 404 pages (200 status but error content)
- Duplicate URLs (www/non-www, http/https, trailing slash variants)
- Per-bot behavior profiling:
- Crawl frequency (requests/day, requests/hour)
- Crawl depth distribution
- Status code distribution per bot
- Most crawled URLs per bot
- Crawl pattern analysis (time of day, days of week)
- Orphan page detection:
- Pages in sitemap but never crawled by bots
- Pages crawled but not in sitemap
- Crawled pages with no internal links
- Crawl efficiency recommendations:
- robots.txt optimization suggestions
- URL parameter handling recommendations
- Noindex/nofollow suggestions for low-value pages
- Redirect chain resolution priorities
- Internal linking improvements for orphan pages
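For the robots.txt optimization suggestions above, the resulting rules typically look like the fragment below. This is illustrative only; the exact patterns must be adapted to the site's actual URL structure, and wildcard support varies by crawler (Googlebot honors `*` and `$`).
```text
# Illustrative robots.txt fragment — adapt patterns to the audited site.
User-agent: *
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*utm_
```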
## Data Sources
| Source | Purpose |
|--------|---------|
| Server access logs | Primary crawl data |
| XML sitemap | Reference for expected crawlable pages |
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
## Output Format
```json
{
"log_file": "access.log",
"analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
"total_bot_requests": 150000,
"bots": {
"googlebot": {
"requests": 80000,
"unique_urls": 12000,
"avg_requests_per_day": 2580,
"status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000},
"top_crawled_urls": [...]
},
"yeti": {"requests": 35000, ...},
"bingbot": {"requests": 20000, ...},
"daumoa": {"requests": 15000, ...}
},
"waste": {
"parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
"redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
"soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
"total_waste_pct": 8.5
},
"orphan_pages": {
"in_sitemap_not_crawled": [...],
"crawled_not_in_sitemap": [...]
},
"recommendations": [...],
"efficiency_score": 72,
"timestamp": "2025-01-01T00:00:00"
}
```
## Notion Output (Required)
**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.
### Database Configuration
| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |
### Required Properties
| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on waste percentage |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |
### Language Guidelines
- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Crawl Budget, Googlebot, robots.txt)
- URLs and code remain unchanged
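The required properties above map onto a Notion API payload roughly as follows. This is a hypothetical helper, not part of the shipped scripts; property names and types mirror the table in this file, and the priority thresholds are assumptions. The actual page creation via the official `notion-client` package is shown commented for shape only.
```python
# Hypothetical helper: build the Notion properties payload for an audit report.
from datetime import date

def build_audit_properties(site: str, waste_pct: float, seq: int = 1) -> dict:
    today = date.today()
    # Threshold mapping is an assumption ("Priority: Based on waste percentage").
    priority = "High" if waste_pct > 5 else "Medium" if waste_pct > 2 else "Low"
    return {
        "Issue": {"title": [{"text": {"content": f"크롤 예산 감사 {today.isoformat()}"}}]},
        "Site": {"url": site},
        "Category": {"select": {"name": "Crawl Budget"}},
        "Priority": {"select": {"name": priority}},
        "Found Date": {"date": {"start": today.isoformat()}},
        "Audit ID": {"rich_text": [{"text": {"content": f"CRAWL-{today:%Y%m%d}-{seq:03d}"}}]},
    }

# Usage sketch (requires NOTION_TOKEN; network call shown for shape only):
# from notion_client import Client
# Client(auth=os.environ["NOTION_TOKEN"]).pages.create(
#     parent={"database_id": "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"},
#     properties=build_audit_properties("https://example.com", 8.5),
# )
```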

View File

@@ -0,0 +1,207 @@
"""
Base Client - Shared async client utilities
===========================================
Purpose: Rate-limited async operations for API clients
Python: 3.10+
"""
import asyncio
import logging
import os
from asyncio import Semaphore
from datetime import datetime
from typing import Any, Callable, TypeVar
from dotenv import load_dotenv
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
)
# Load environment variables
load_dotenv()
# Logging setup
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
T = TypeVar("T")
class RateLimiter:
"""Rate limiter using token bucket algorithm."""
def __init__(self, rate: float, per: float = 1.0):
"""
Initialize rate limiter.
Args:
rate: Number of requests allowed
per: Time period in seconds (default: 1 second)
"""
self.rate = rate
self.per = per
self.tokens = rate
self.last_update = datetime.now()
self._lock = asyncio.Lock()
async def acquire(self) -> None:
"""Acquire a token, waiting if necessary."""
async with self._lock:
now = datetime.now()
elapsed = (now - self.last_update).total_seconds()
self.tokens = min(self.rate, self.tokens + elapsed * (self.rate / self.per))
self.last_update = now
if self.tokens < 1:
wait_time = (1 - self.tokens) * (self.per / self.rate)
await asyncio.sleep(wait_time)
self.tokens = 0
else:
self.tokens -= 1
class BaseAsyncClient:
"""Base class for async API clients with rate limiting."""
def __init__(
self,
max_concurrent: int = 5,
requests_per_second: float = 3.0,
logger: logging.Logger | None = None,
):
"""
Initialize base client.
Args:
max_concurrent: Maximum concurrent requests
requests_per_second: Rate limit
logger: Logger instance
"""
self.semaphore = Semaphore(max_concurrent)
self.rate_limiter = RateLimiter(requests_per_second)
self.logger = logger or logging.getLogger(self.__class__.__name__)
self.stats = {
"requests": 0,
"success": 0,
"errors": 0,
"retries": 0,
}
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=10),
retry=retry_if_exception_type(Exception),
)
async def _rate_limited_request(
self,
coro: Callable[[], Any],
) -> Any:
"""Execute a request with rate limiting and retry."""
async with self.semaphore:
await self.rate_limiter.acquire()
self.stats["requests"] += 1
try:
result = await coro()
self.stats["success"] += 1
return result
except Exception as e:
self.stats["errors"] += 1
self.logger.error(f"Request failed: {e}")
raise
async def batch_requests(
self,
requests: list[Callable[[], Any]],
desc: str = "Processing",
) -> list[Any]:
"""Execute multiple requests concurrently."""
try:
from tqdm.asyncio import tqdm
has_tqdm = True
except ImportError:
has_tqdm = False
async def execute(req: Callable) -> Any:
try:
return await self._rate_limited_request(req)
except Exception as e:
return {"error": str(e)}
tasks = [execute(req) for req in requests]
if has_tqdm:
results = []
for coro in tqdm.as_completed(tasks, total=len(tasks), desc=desc):
result = await coro
results.append(result)
return results
else:
return await asyncio.gather(*tasks, return_exceptions=True)
def print_stats(self) -> None:
"""Print request statistics."""
self.logger.info("=" * 40)
self.logger.info("Request Statistics:")
self.logger.info(f" Total Requests: {self.stats['requests']}")
self.logger.info(f" Successful: {self.stats['success']}")
self.logger.info(f" Errors: {self.stats['errors']}")
self.logger.info("=" * 40)
class ConfigManager:
"""Manage API configuration and credentials."""
def __init__(self):
load_dotenv()
@property
def google_credentials_path(self) -> str | None:
"""Get Google service account credentials path."""
# Prefer SEO-specific credentials, fallback to general credentials
seo_creds = os.path.expanduser("~/.credential/ourdigital-seo-agent.json")
if os.path.exists(seo_creds):
return seo_creds
return os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
@property
def pagespeed_api_key(self) -> str | None:
"""Get PageSpeed Insights API key."""
return os.getenv("PAGESPEED_API_KEY")
@property
def custom_search_api_key(self) -> str | None:
"""Get Custom Search API key."""
return os.getenv("CUSTOM_SEARCH_API_KEY")
@property
def custom_search_engine_id(self) -> str | None:
"""Get Custom Search Engine ID."""
return os.getenv("CUSTOM_SEARCH_ENGINE_ID")
@property
def notion_token(self) -> str | None:
"""Get Notion API token."""
return os.getenv("NOTION_TOKEN") or os.getenv("NOTION_API_KEY")
def validate_google_credentials(self) -> bool:
"""Validate Google credentials are configured."""
creds_path = self.google_credentials_path
if not creds_path:
return False
return os.path.exists(creds_path)
def get_required(self, key: str) -> str:
"""Get required environment variable or raise error."""
value = os.getenv(key)
if not value:
raise ValueError(f"Missing required environment variable: {key}")
return value
# Singleton config instance
config = ConfigManager()

View File

@@ -0,0 +1,805 @@
"""
Crawl Budget Analyzer - Identify crawl waste and generate recommendations
=========================================================================
Purpose: Analyze server access logs for crawl budget efficiency, detect waste
(parameter URLs, redirect chains, soft 404s, duplicates), find orphan
pages, profile per-bot behavior, and produce prioritized recommendations.
Python: 3.10+
"""
import argparse
import json
import logging
import re
import sys
from collections import Counter, defaultdict
from dataclasses import asdict, dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Any
from urllib.parse import parse_qs, urlparse
import requests
from bs4 import BeautifulSoup
from log_parser import BotIdentification, LogEntry, LogParser
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
WASTE_PARAMS = {"sort", "filter", "order", "orderby", "dir", "direction"}
TRACKING_PARAMS_RE = re.compile(r"^utm_", re.IGNORECASE)
PAGINATION_PARAM = "page"
HIGH_PAGE_THRESHOLD = 5
SOFT_404_MAX_SIZE = 1024 # bytes - pages smaller than this may be soft 404s
REDIRECT_STATUSES = {301, 302, 303, 307, 308}
TOP_N_URLS = 50
# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------
@dataclass
class CrawlWaste:
"""A category of crawl budget waste."""
waste_type: str
urls: list[str]
count: int
pct_of_total: float
recommendation: str
def to_dict(self) -> dict:
return {
"waste_type": self.waste_type,
"count": self.count,
"pct_of_total": round(self.pct_of_total, 2),
"recommendation": self.recommendation,
"sample_urls": self.urls[:20],
}
@dataclass
class OrphanPage:
"""A page that is either in the sitemap but uncrawled, or crawled but not in sitemap."""
url: str
in_sitemap: bool
crawled: bool
last_crawl_date: str | None = None
def to_dict(self) -> dict:
return asdict(self)
@dataclass
class BotProfile:
"""Per-bot crawl behavior profile."""
name: str
total_requests: int = 0
requests_per_day: float = 0.0
crawl_depth_distribution: dict[int, int] = field(default_factory=dict)
peak_hours: list[int] = field(default_factory=list)
status_breakdown: dict[str, int] = field(default_factory=dict)
top_crawled_urls: list[tuple[str, int]] = field(default_factory=list)
unique_urls: int = 0
days_active: int = 0
def to_dict(self) -> dict:
return {
"name": self.name,
"total_requests": self.total_requests,
"requests_per_day": round(self.requests_per_day, 1),
"crawl_depth_distribution": self.crawl_depth_distribution,
"peak_hours": self.peak_hours,
"status_breakdown": self.status_breakdown,
"top_crawled_urls": [{"url": u, "count": c} for u, c in self.top_crawled_urls],
"unique_urls": self.unique_urls,
"days_active": self.days_active,
}
@dataclass
class CrawlRecommendation:
"""A single optimization recommendation."""
category: str
priority: str # critical, high, medium, low
action: str
impact: str
details: str
def to_dict(self) -> dict:
return asdict(self)
@dataclass
class CrawlBudgetResult:
"""Complete crawl budget analysis result."""
log_file: str
analysis_period: dict[str, str]
total_bot_requests: int
bots: dict[str, BotProfile]
waste: list[CrawlWaste]
total_waste_pct: float
orphan_pages: dict[str, list[OrphanPage]]
recommendations: list[CrawlRecommendation]
efficiency_score: int
timestamp: str
def to_dict(self) -> dict:
return {
"log_file": self.log_file,
"analysis_period": self.analysis_period,
"total_bot_requests": self.total_bot_requests,
"bots": {n: p.to_dict() for n, p in self.bots.items()},
"waste": {w.waste_type: w.to_dict() for w in self.waste},
"total_waste_pct": round(self.total_waste_pct, 2),
"orphan_pages": {
k: [o.to_dict() for o in v]
for k, v in self.orphan_pages.items()
},
"recommendations": [r.to_dict() for r in self.recommendations],
"efficiency_score": self.efficiency_score,
"timestamp": self.timestamp,
}
# ---------------------------------------------------------------------------
# CrawlBudgetAnalyzer
# ---------------------------------------------------------------------------
class CrawlBudgetAnalyzer:
"""Analyze crawl budget efficiency from server access logs."""
def __init__(
self,
log_file: str,
sitemap_url: str | None = None,
target_url: str | None = None,
):
self.log_file = log_file
self.sitemap_url = sitemap_url
self.target_url = target_url
self._bot_entries: list[tuple[LogEntry, BotIdentification]] = []
self._sitemap_urls: set[str] = set()
# -- data loading ---------------------------------------------------------
def load_log_data(self, log_file: str) -> list[tuple[LogEntry, BotIdentification]]:
"""Use LogParser to load all bot requests from the log file."""
parser = LogParser(log_file=log_file, fmt="auto")
entries = parser.parse()
logger.info(f"Loaded {len(entries):,} bot entries from {log_file}")
self._bot_entries = entries
return entries
def load_sitemap_urls(self, sitemap_url: str) -> set[str]:
"""Fetch and parse an XML sitemap, returning the set of URLs."""
urls: set[str] = set()
try:
resp = requests.get(sitemap_url, timeout=30, headers={
"User-Agent": "CrawlBudgetAnalyzer/1.0",
})
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml-xml")
# Handle sitemap index
sitemap_tags = soup.find_all("sitemap")
if sitemap_tags:
for st in sitemap_tags:
loc = st.find("loc")
if loc and loc.text:
child_urls = self._fetch_sitemap_child(loc.text.strip())
urls.update(child_urls)
else:
for url_tag in soup.find_all("url"):
loc = url_tag.find("loc")
if loc and loc.text:
urls.add(self._normalize_url(loc.text.strip()))
logger.info(f"Loaded {len(urls):,} URLs from sitemap: {sitemap_url}")
except Exception as e:
logger.error(f"Failed to load sitemap {sitemap_url}: {e}")
self._sitemap_urls = urls
return urls
def _fetch_sitemap_child(self, url: str) -> set[str]:
"""Fetch a child sitemap from a sitemap index."""
urls: set[str] = set()
try:
resp = requests.get(url, timeout=30, headers={
"User-Agent": "CrawlBudgetAnalyzer/1.0",
})
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml-xml")
for url_tag in soup.find_all("url"):
loc = url_tag.find("loc")
if loc and loc.text:
urls.add(self._normalize_url(loc.text.strip()))
except Exception as e:
logger.warning(f"Failed to fetch child sitemap {url}: {e}")
return urls
@staticmethod
def _normalize_url(url: str) -> str:
"""Normalize a URL by lowercasing the scheme/host and removing the trailing slash."""
parsed = urlparse(url)
path = parsed.path.rstrip("/") or "/"
return f"{parsed.scheme.lower()}://{parsed.netloc.lower()}{path}"
# -- waste identification -------------------------------------------------
def identify_parameter_waste(
self,
bot_requests: list[tuple[LogEntry, BotIdentification]],
) -> CrawlWaste:
"""Find URLs with unnecessary query parameters wasting crawl budget."""
waste_urls: list[str] = []
for entry, _ in bot_requests:
parsed = urlparse(entry.url)
if not parsed.query:
continue
params = parse_qs(parsed.query)
param_keys = {k.lower() for k in params}
# Check for waste parameters
has_waste = bool(param_keys & WASTE_PARAMS)
# Check for tracking parameters
has_tracking = any(TRACKING_PARAMS_RE.match(k) for k in param_keys)
# Check for deep pagination
page_val = params.get(PAGINATION_PARAM, params.get("p", [None]))
has_deep_page = False
if page_val and page_val[0]:
try:
if int(page_val[0]) > HIGH_PAGE_THRESHOLD:
has_deep_page = True
except (ValueError, TypeError):
pass
if has_waste or has_tracking or has_deep_page:
waste_urls.append(entry.url)
total = len(bot_requests)
count = len(waste_urls)
pct = (count / total * 100) if total else 0.0
return CrawlWaste(
waste_type="parameter_urls",
urls=list(set(waste_urls)),
count=count,
pct_of_total=pct,
recommendation=(
"robots.txt에 불필요한 parameter URL 패턴을 Disallow로 추가하거나, "
"Google Search Console의 URL Parameters 설정을 활용하세요. "
"UTM 파라미터가 포함된 URL은 canonical 태그로 처리하세요."
),
)
def identify_redirect_chains(
self,
bot_requests: list[tuple[LogEntry, BotIdentification]],
) -> CrawlWaste:
"""Find URLs that repeatedly return redirect status codes."""
redirect_urls: list[str] = []
redirect_counter: Counter = Counter()
for entry, _ in bot_requests:
if entry.status_code in REDIRECT_STATUSES:
redirect_counter[entry.url] += 1
redirect_urls.append(entry.url)
# URLs redirected more than once are chain candidates
chain_urls = [url for url, cnt in redirect_counter.items() if cnt >= 2]
total = len(bot_requests)
count = len(redirect_urls)
pct = (count / total * 100) if total else 0.0
return CrawlWaste(
waste_type="redirect_chains",
urls=chain_urls,
count=count,
pct_of_total=pct,
recommendation=(
"301/302 리다이렉트가 반복적으로 크롤링되고 있습니다. "
"내부 링크를 최종 목적지 URL로 직접 업데이트하고, "
"리다이렉트 체인을 단일 리다이렉트로 단축하세요."
),
)
def identify_soft_404s(
self,
bot_requests: list[tuple[LogEntry, BotIdentification]],
) -> CrawlWaste:
"""Find 200-status pages with suspiciously small response sizes."""
soft_404_urls: list[str] = []
for entry, _ in bot_requests:
if entry.status_code == 200 and entry.response_size < SOFT_404_MAX_SIZE:
if entry.response_size > 0:
soft_404_urls.append(entry.url)
total = len(bot_requests)
count = len(soft_404_urls)
pct = (count / total * 100) if total else 0.0
return CrawlWaste(
waste_type="soft_404s",
urls=list(set(soft_404_urls)),
count=count,
pct_of_total=pct,
recommendation=(
"200 상태 코드를 반환하지만 콘텐츠가 거의 없는 Soft 404 페이지입니다. "
"실제 404 상태 코드를 반환하거나, 해당 페이지에 noindex 태그를 추가하세요."
),
)
def identify_duplicate_crawls(
self,
bot_requests: list[tuple[LogEntry, BotIdentification]],
) -> CrawlWaste:
"""Find duplicate URL variants: www/non-www, trailing slash, etc."""
url_variants: dict[str, set[str]] = defaultdict(set)
for entry, _ in bot_requests:
parsed = urlparse(entry.url)
# Normalize: strip "www." prefix, strip trailing slash, lowercase
host = parsed.netloc.lower().removeprefix("www.")  # lstrip("www.") would strip characters, not the prefix
path = parsed.path.rstrip("/") or "/"
canonical = f"{host}{path}"
full_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
url_variants[canonical].add(full_url)
# Identify canonicals with multiple variants
duplicate_urls: list[str] = []
for canonical, variants in url_variants.items():
if len(variants) > 1:
duplicate_urls.extend(variants)
total = len(bot_requests)
# Count how many requests hit duplicate variant URLs
dup_set = set(duplicate_urls)
dup_request_count = 0
for e, _ in bot_requests:
p = urlparse(e.url)
if f"{p.scheme}://{p.netloc}{p.path}" in dup_set:
dup_request_count += 1
pct = (dup_request_count / total * 100) if total else 0.0
return CrawlWaste(
waste_type="duplicate_urls",
urls=duplicate_urls[:TOP_N_URLS],
count=dup_request_count,
pct_of_total=pct,
recommendation=(
"www/non-www, trailing slash 유무 등 중복 URL 변형이 크롤링되고 있습니다. "
"301 리다이렉트로 canonical URL로 통합하고, "
"rel=canonical 태그를 정확히 설정하세요."
),
)
# -- bot profiling --------------------------------------------------------
def profile_bots(
self,
bot_requests: list[tuple[LogEntry, BotIdentification]],
) -> dict[str, BotProfile]:
"""Generate per-bot behavior profiles."""
bot_data: dict[str, dict] = defaultdict(lambda: {
"urls": Counter(),
"statuses": Counter(),
"hours": Counter(),
"days": set(),
"depths": Counter(),
"count": 0,
})
for entry, bot in bot_requests:
bd = bot_data[bot.name]
bd["count"] += 1
bd["urls"][entry.url] += 1
bd["statuses"][str(entry.status_code)] += 1
# URL depth = number of path segments
depth = len([s for s in urlparse(entry.url).path.split("/") if s])
bd["depths"][depth] += 1
if entry.timestamp:
bd["hours"][entry.timestamp.hour] += 1
bd["days"].add(entry.timestamp.strftime("%Y-%m-%d"))
profiles: dict[str, BotProfile] = {}
for name, bd in bot_data.items():
days_active = len(bd["days"]) or 1
rpd = bd["count"] / days_active
# Top 3 peak hours
top_hours = sorted(bd["hours"].items(), key=lambda x: -x[1])[:3]
peak = [h for h, _ in top_hours]
profiles[name] = BotProfile(
name=name,
total_requests=bd["count"],
requests_per_day=rpd,
crawl_depth_distribution=dict(sorted(bd["depths"].items())),
peak_hours=peak,
status_breakdown=dict(bd["statuses"]),
top_crawled_urls=bd["urls"].most_common(TOP_N_URLS),
unique_urls=len(bd["urls"]),
days_active=days_active,
)
return profiles
# -- orphan detection -----------------------------------------------------
def detect_orphan_pages(
self,
crawled_urls: set[str],
sitemap_urls: set[str],
) -> dict[str, list[OrphanPage]]:
"""Compare crawled URLs with sitemap URLs to find orphans."""
in_sitemap_not_crawled = sitemap_urls - crawled_urls
crawled_not_in_sitemap = crawled_urls - sitemap_urls
return {
"in_sitemap_not_crawled": [
OrphanPage(url=u, in_sitemap=True, crawled=False)
for u in sorted(in_sitemap_not_crawled)
],
"crawled_not_in_sitemap": [
OrphanPage(url=u, in_sitemap=False, crawled=True)
for u in sorted(crawled_not_in_sitemap)
],
}
# -- efficiency score -----------------------------------------------------
@staticmethod
def calculate_efficiency_score(total_waste_pct: float) -> int:
"""Calculate crawl efficiency score: 100 - waste%, capped at [0, 100]."""
score = int(100 - total_waste_pct)
return max(0, min(100, score))
# -- recommendations ------------------------------------------------------
def generate_recommendations(
self,
waste: list[CrawlWaste],
orphans: dict[str, list[OrphanPage]],
bot_profiles: dict[str, BotProfile],
) -> list[CrawlRecommendation]:
"""Generate prioritized crawl budget optimization recommendations."""
recs: list[CrawlRecommendation] = []
# Waste-based recommendations
for w in waste:
if w.pct_of_total > 5.0:
priority = "critical"
elif w.pct_of_total > 2.0:
priority = "high"
elif w.pct_of_total > 0.5:
priority = "medium"
else:
priority = "low"
if w.waste_type == "parameter_urls" and w.count > 0:
recs.append(CrawlRecommendation(
category="URL Parameters",
priority=priority,
action="robots.txt에 parameter URL 패턴 Disallow 규칙 추가",
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
details=(
f"{w.count:,}건의 parameter URL이 크롤링되었습니다. "
f"sort, filter, utm_* 등 불필요한 파라미터를 차단하세요."
),
))
elif w.waste_type == "redirect_chains" and w.count > 0:
recs.append(CrawlRecommendation(
category="Redirect Chains",
priority=priority,
action="리다이렉트 체인을 단축하고 내부 링크 업데이트",
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
details=(
f"{w.count:,}건의 리다이렉트 요청이 발생했습니다. "
f"내부 링크를 최종 URL로 직접 연결하세요."
),
))
elif w.waste_type == "soft_404s" and w.count > 0:
recs.append(CrawlRecommendation(
category="Soft 404s",
priority=priority,
action="Soft 404 페이지에 적절한 HTTP 상태 코드 또는 noindex 적용",
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
details=(
f"{w.count:,}건의 Soft 404가 감지되었습니다. "
f"적절한 404 응답 또는 noindex meta 태그를 설정하세요."
),
))
elif w.waste_type == "duplicate_urls" and w.count > 0:
recs.append(CrawlRecommendation(
category="Duplicate URLs",
priority=priority,
action="URL 정규화 및 canonical 태그 설정",
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
details=(
f"{w.count:,}건의 중복 URL 변형이 크롤링되었습니다. "
f"www/non-www, trailing slash 통합을 진행하세요."
),
))
# Orphan page recommendations
not_crawled = orphans.get("in_sitemap_not_crawled", [])
not_in_sitemap = orphans.get("crawled_not_in_sitemap", [])
if len(not_crawled) > 0:
pct = len(not_crawled) / max(len(self._sitemap_urls), 1) * 100
priority = "critical" if pct > 30 else "high" if pct > 10 else "medium"
recs.append(CrawlRecommendation(
category="Orphan Pages (Uncrawled)",
priority=priority,
action="사이트맵에 있으나 크롤링되지 않은 페이지의 내부 링크 강화",
impact=f"사이트맵 URL의 {pct:.1f}%가 미크롤 상태",
details=(
f"{len(not_crawled):,}개 URL이 사이트맵에 있지만 "
f"봇이 크롤링하지 않았습니다. 내부 링크를 추가하세요."
),
))
if len(not_in_sitemap) > 0:
recs.append(CrawlRecommendation(
category="Orphan Pages (Unlisted)",
priority="medium",
action="크롤링되었으나 사이트맵에 없는 페이지를 사이트맵에 추가 또는 차단",
impact=f"{len(not_in_sitemap):,}개 URL이 사이트맵에 미등록",
details=(
f"봇이 크롤링한 {len(not_in_sitemap):,}개 URL이 "
f"사이트맵에 포함되어 있지 않습니다. 유효한 페이지는 "
f"사이트맵에 추가하고, 불필요한 페이지는 robots.txt로 차단하세요."
),
))
# Bot-specific recommendations
for name, profile in bot_profiles.items():
error_count = sum(
v for k, v in profile.status_breakdown.items()
if k.startswith("4") or k.startswith("5")
)
error_pct = (error_count / profile.total_requests * 100) if profile.total_requests else 0
if error_pct > 10:
recs.append(CrawlRecommendation(
category=f"Bot Errors ({name})",
priority="high" if error_pct > 20 else "medium",
action=f"{name}의 4xx/5xx 오류율 {error_pct:.1f}% 개선 필요",
impact=f"{name} 크롤 예산의 {error_pct:.1f}%가 오류에 소비",
details=(
f"{name}이(가) {error_count:,}건의 오류 응답을 받았습니다. "
f"깨진 링크를 수정하고 서버 안정성을 개선하세요."
),
))
# Sort by priority
priority_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
recs.sort(key=lambda r: priority_order.get(r.priority, 4))
return recs
# -- orchestrator ---------------------------------------------------------
def analyze(self, scope: str = "all") -> CrawlBudgetResult:
"""Orchestrate the full crawl budget analysis."""
# Load log data
entries = self.load_log_data(self.log_file)
if not entries:
logger.warning("No bot entries found in log file.")
# Load sitemap if provided
if self.sitemap_url:
self.load_sitemap_urls(self.sitemap_url)
# Profile bots
bot_profiles: dict[str, BotProfile] = {}
if scope in ("all", "bots"):
bot_profiles = self.profile_bots(entries)
# Identify waste
waste: list[CrawlWaste] = []
if scope in ("all", "waste"):
waste.append(self.identify_parameter_waste(entries))
waste.append(self.identify_redirect_chains(entries))
waste.append(self.identify_soft_404s(entries))
waste.append(self.identify_duplicate_crawls(entries))
total_waste_pct = sum(w.pct_of_total for w in waste)
# Detect orphan pages
orphans: dict[str, list[OrphanPage]] = {
"in_sitemap_not_crawled": [],
"crawled_not_in_sitemap": [],
}
if scope in ("all", "orphans") and self._sitemap_urls:
crawled_urls: set[str] = set()
for entry, _ in entries:
# Build full URL from path for comparison
if self.target_url:
parsed_target = urlparse(self.target_url)
full = f"{parsed_target.scheme}://{parsed_target.netloc}{entry.url}"
crawled_urls.add(self._normalize_url(full))
else:
crawled_urls.add(entry.url)
orphans = self.detect_orphan_pages(crawled_urls, self._sitemap_urls)
# Efficiency score
efficiency_score = self.calculate_efficiency_score(total_waste_pct)
# Recommendations
recommendations = self.generate_recommendations(waste, orphans, bot_profiles)
# Date range from entries
timestamps = [e.timestamp for e, _ in entries if e.timestamp]
analysis_period = {}
if timestamps:
analysis_period = {
"from": min(timestamps).strftime("%Y-%m-%d"),
"to": max(timestamps).strftime("%Y-%m-%d"),
}
return CrawlBudgetResult(
log_file=self.log_file,
analysis_period=analysis_period,
total_bot_requests=len(entries),
bots=bot_profiles,
waste=waste,
total_waste_pct=total_waste_pct,
orphan_pages=orphans,
recommendations=recommendations,
efficiency_score=efficiency_score,
timestamp=datetime.now().isoformat(),
)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main() -> None:
parser = argparse.ArgumentParser(
description="Analyze crawl budget efficiency and generate optimization recommendations.",
)
parser.add_argument(
"--log-file",
required=True,
help="Path to server access log file",
)
parser.add_argument(
"--sitemap",
default=None,
help="URL of XML sitemap for orphan page detection",
)
parser.add_argument(
"--url",
default=None,
help="Target website URL (used for URL normalization and Ahrefs)",
)
parser.add_argument(
"--scope",
choices=["all", "waste", "orphans", "bots"],
default="all",
help="Analysis scope (default: all)",
)
parser.add_argument(
"--ahrefs",
action="store_true",
help="Include Ahrefs page history comparison (requires MCP tool)",
)
parser.add_argument(
"--json",
action="store_true",
help="Output in JSON format",
)
parser.add_argument(
"--output",
default=None,
help="Write output to file instead of stdout",
)
args = parser.parse_args()
# Validate log file exists
if not Path(args.log_file).exists():
logger.error(f"Log file not found: {args.log_file}")
sys.exit(1)
analyzer = CrawlBudgetAnalyzer(
log_file=args.log_file,
sitemap_url=args.sitemap,
target_url=args.url,
)
result = analyzer.analyze(scope=args.scope)
if args.json:
output_data = result.to_dict()
output_str = json.dumps(output_data, indent=2, ensure_ascii=False)
else:
lines = _format_text_report(result)
output_str = "\n".join(lines)
if args.output:
Path(args.output).write_text(output_str, encoding="utf-8")
logger.info(f"Output written to {args.output}")
else:
print(output_str)
def _format_text_report(result: CrawlBudgetResult) -> list[str]:
"""Format the analysis result as a human-readable text report."""
lines = [
"=" * 70,
"Crawl Budget Analysis Report",
"=" * 70,
f"Log File: {result.log_file}",
f"Total Bot Requests: {result.total_bot_requests:,}",
f"Efficiency Score: {result.efficiency_score}/100",
f"Total Waste: {result.total_waste_pct:.1f}%",
]
if result.analysis_period:
lines.append(
f"Period: {result.analysis_period.get('from', 'N/A')} ~ "
f"{result.analysis_period.get('to', 'N/A')}"
)
lines.append("")
# Bot profiles
if result.bots:
lines.append("-" * 60)
lines.append("Bot Profiles")
lines.append("-" * 60)
for name, profile in sorted(result.bots.items(), key=lambda x: -x[1].total_requests):
lines.append(f"\n [{name.upper()}]")
lines.append(f" Requests: {profile.total_requests:,}")
lines.append(f" Unique URLs: {profile.unique_urls:,}")
lines.append(f" Requests/Day: {profile.requests_per_day:,.1f}")
lines.append(f" Days Active: {profile.days_active}")
lines.append(f" Peak Hours: {profile.peak_hours}")
lines.append(f" Status: {profile.status_breakdown}")
lines.append("")
# Waste breakdown
if result.waste:
lines.append("-" * 60)
lines.append("Crawl Waste Breakdown")
lines.append("-" * 60)
for w in result.waste:
if w.count > 0:
lines.append(f"\n [{w.waste_type}]")
lines.append(f" Count: {w.count:,} ({w.pct_of_total:.1f}%)")
lines.append(f" Recommendation: {w.recommendation}")
if w.urls:
lines.append(f" Sample URLs:")
for u in w.urls[:5]:
lines.append(f" - {u}")
lines.append("")
# Orphan pages
not_crawled = result.orphan_pages.get("in_sitemap_not_crawled", [])
not_in_sitemap = result.orphan_pages.get("crawled_not_in_sitemap", [])
if not_crawled or not_in_sitemap:
lines.append("-" * 60)
lines.append("Orphan Pages")
lines.append("-" * 60)
if not_crawled:
lines.append(f"\n In Sitemap but Not Crawled: {len(not_crawled):,}")
for op in not_crawled[:10]:
lines.append(f" - {op.url}")
if not_in_sitemap:
lines.append(f"\n Crawled but Not in Sitemap: {len(not_in_sitemap):,}")
for op in not_in_sitemap[:10]:
lines.append(f" - {op.url}")
lines.append("")
# Recommendations
if result.recommendations:
lines.append("-" * 60)
lines.append("Recommendations")
lines.append("-" * 60)
for i, rec in enumerate(result.recommendations, 1):
lines.append(f"\n {i}. [{rec.priority.upper()}] {rec.category}")
lines.append(f" Action: {rec.action}")
lines.append(f" Impact: {rec.impact}")
lines.append(f" Details: {rec.details}")
lines.append("")
lines.append(f"Generated: {result.timestamp}")
return lines
if __name__ == "__main__":
main()


@@ -0,0 +1,613 @@
"""
Log Parser - Server access log parser with bot identification
=============================================================
Purpose: Parse Apache/Nginx/CloudFront access logs, identify search engine
bots, extract crawl data, and generate per-bot statistics.
Python: 3.10+
"""
import argparse
import bz2
import gzip
import json
import logging
import re
import sys
from collections import Counter, defaultdict
from dataclasses import asdict, dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Generator, TextIO
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Constants: bot user-agent patterns
# ---------------------------------------------------------------------------
BOT_PATTERNS: list[tuple[str, str, str]] = [
# (canonical name, regex pattern, category)
("googlebot", r"Googlebot(?:-Image|-News|-Video)?/", "search_engine"),
("googlebot-adsbot", r"AdsBot-Google", "search_engine"),
("googlebot-mediapartners", r"Mediapartners-Google", "search_engine"),
("yeti", r"Yeti/", "search_engine"),
("bingbot", r"bingbot/", "search_engine"),
("daumoa", r"Daumoa", "search_engine"),
("applebot", r"Applebot/", "search_engine"),
("duckduckbot", r"DuckDuckBot/", "search_engine"),
("baiduspider", r"Baiduspider", "search_engine"),
("yandexbot", r"YandexBot/", "search_engine"),
("sogou", r"Sogou", "search_engine"),
("seznambot", r"SeznamBot/", "search_engine"),
("ahrefsbot", r"AhrefsBot/", "seo_tool"),
("semrushbot", r"SemrushBot/", "seo_tool"),
("mj12bot", r"MJ12bot/", "seo_tool"),
("dotbot", r"DotBot/", "seo_tool"),
("rogerbot", r"rogerbot/", "seo_tool"),
("screaming frog", r"Screaming Frog SEO Spider", "seo_tool"),
]
COMPILED_BOT_PATTERNS: list[tuple[str, re.Pattern, str]] = [
(name, re.compile(pattern, re.IGNORECASE), category)
for name, pattern, category in BOT_PATTERNS
]
# ---------------------------------------------------------------------------
# Regex patterns for each log format
# ---------------------------------------------------------------------------
NGINX_COMBINED_RE = re.compile(
r'(?P<ip>[\d.:a-fA-F]+)\s+-\s+(?P<user>\S+)\s+'
r'\[(?P<timestamp>[^\]]+)\]\s+'
r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
r'"(?P<referer>[^"]*)"\s+'
r'"(?P<user_agent>[^"]*)"'
)
APACHE_COMBINED_RE = re.compile(
r'(?P<ip>[\d.:a-fA-F]+)\s+\S+\s+(?P<user>\S+)\s+'
r'\[(?P<timestamp>[^\]]+)\]\s+'
r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
r'"(?P<referer>[^"]*)"\s+'
r'"(?P<user_agent>[^"]*)"'
)
CLOUDFRONT_FIELDS = [
"date", "time", "x_edge_location", "sc_bytes", "c_ip",
"cs_method", "cs_host", "cs_uri_stem", "sc_status",
"cs_referer", "cs_user_agent", "cs_uri_query",
"cs_cookie", "x_edge_result_type", "x_edge_request_id",
"x_host_header", "cs_protocol", "cs_bytes",
"time_taken", "x_forwarded_for", "ssl_protocol",
"ssl_cipher", "x_edge_response_result_type", "cs_protocol_version",
]
# Timestamp formats
NGINX_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
APACHE_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------
@dataclass
class LogEntry:
"""A single parsed log entry."""
timestamp: datetime | None
ip: str
method: str
url: str
status_code: int
response_size: int
user_agent: str
referer: str
def to_dict(self) -> dict:
d = asdict(self)
if self.timestamp:
d["timestamp"] = self.timestamp.isoformat()
return d
@dataclass
class BotIdentification:
"""Bot identification result."""
name: str
user_agent_pattern: str
category: str
@dataclass
class BotStats:
"""Aggregated statistics for a single bot."""
name: str
total_requests: int = 0
unique_urls: int = 0
status_distribution: dict[str, int] = field(default_factory=dict)
top_urls: list[tuple[str, int]] = field(default_factory=list)
hourly_distribution: dict[int, int] = field(default_factory=dict)
daily_distribution: dict[str, int] = field(default_factory=dict)
avg_response_size: float = 0.0
def to_dict(self) -> dict:
return {
"name": self.name,
"total_requests": self.total_requests,
"unique_urls": self.unique_urls,
"status_distribution": self.status_distribution,
"top_urls": [{"url": u, "count": c} for u, c in self.top_urls],
"hourly_distribution": self.hourly_distribution,
"daily_distribution": self.daily_distribution,
"avg_response_size": round(self.avg_response_size, 1),
}
@dataclass
class LogParseResult:
"""Complete log parsing result."""
log_file: str
format_detected: str
total_lines: int
parsed_lines: int
bot_entries: int
date_range: dict[str, str]
bots: dict[str, BotStats]
errors: int
def to_dict(self) -> dict:
return {
"log_file": self.log_file,
"format_detected": self.format_detected,
"total_lines": self.total_lines,
"parsed_lines": self.parsed_lines,
"bot_entries": self.bot_entries,
"date_range": self.date_range,
"bots": {name: stats.to_dict() for name, stats in self.bots.items()},
"errors": self.errors,
}
# ---------------------------------------------------------------------------
# LogParser class
# ---------------------------------------------------------------------------
class LogParser:
"""Parse server access logs and identify search engine bot traffic."""
def __init__(
self,
log_file: str,
fmt: str = "auto",
streaming: bool = False,
):
self.log_file = log_file
self.fmt = fmt
self.streaming = streaming
self._detected_format: str | None = None
self._parse_errors = 0
# -- format detection -----------------------------------------------------
def _detect_format(self, line: str) -> str:
"""Auto-detect log format from a sample line."""
if line.startswith("#"):
return "cloudfront"
if NGINX_COMBINED_RE.match(line):
return "nginx"
if APACHE_COMBINED_RE.match(line):
return "apache"
# Fallback: try tab-separated (CloudFront without header)
if "\t" in line and line.count("\t") >= 10:
return "cloudfront"
return "nginx"
# -- line parsers ---------------------------------------------------------
def _parse_nginx_combined(self, line: str) -> LogEntry | None:
"""Parse a single Nginx combined format log line."""
m = NGINX_COMBINED_RE.match(line)
if not m:
return None
ts = None
try:
ts = datetime.strptime(m.group("timestamp"), NGINX_TS_FORMAT)
except (ValueError, TypeError):
pass
size_raw = m.group("size")
size = int(size_raw) if size_raw != "-" else 0
return LogEntry(
timestamp=ts,
ip=m.group("ip"),
method=m.group("method"),
url=m.group("url"),
status_code=int(m.group("status")),
response_size=size,
user_agent=m.group("user_agent"),
referer=m.group("referer"),
)
def _parse_apache_combined(self, line: str) -> LogEntry | None:
"""Parse a single Apache combined format log line."""
m = APACHE_COMBINED_RE.match(line)
if not m:
return None
ts = None
try:
ts = datetime.strptime(m.group("timestamp"), APACHE_TS_FORMAT)
except (ValueError, TypeError):
pass
size_raw = m.group("size")
size = int(size_raw) if size_raw != "-" else 0
return LogEntry(
timestamp=ts,
ip=m.group("ip"),
method=m.group("method"),
url=m.group("url"),
status_code=int(m.group("status")),
response_size=size,
user_agent=m.group("user_agent"),
referer=m.group("referer"),
)
def _parse_cloudfront(self, line: str) -> LogEntry | None:
"""Parse a CloudFront tab-separated log line."""
if line.startswith("#"):
return None
parts = line.strip().split("\t")
if len(parts) < 13:
return None
ts = None
try:
ts = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
except (ValueError, IndexError):
pass
try:
status = int(parts[8])
except (ValueError, IndexError):
status = 0
try:
size = int(parts[3])
except (ValueError, IndexError):
size = 0
url = parts[7] if len(parts) > 7 else ""
query = parts[11] if len(parts) > 11 else ""
if query and query != "-":
url = f"{url}?{query}"
ua = parts[10] if len(parts) > 10 else ""
ua = ua.replace("%20", " ").replace("%2520", " ")
referer = parts[9] if len(parts) > 9 else ""
return LogEntry(
timestamp=ts,
ip=parts[4] if len(parts) > 4 else "",
method=parts[5] if len(parts) > 5 else "",
url=url,
status_code=status,
response_size=size,
user_agent=ua,
referer=referer,
)
def _parse_line(self, line: str, fmt: str) -> LogEntry | None:
"""Route to the correct parser based on format."""
parsers = {
"nginx": self._parse_nginx_combined,
"apache": self._parse_apache_combined,
"cloudfront": self._parse_cloudfront,
}
parser = parsers.get(fmt, self._parse_nginx_combined)
return parser(line)
# -- bot identification ---------------------------------------------------
@staticmethod
def identify_bot(user_agent: str) -> BotIdentification | None:
"""Match user-agent against known bot patterns."""
if not user_agent or user_agent == "-":
return None
for name, pattern, category in COMPILED_BOT_PATTERNS:
if pattern.search(user_agent):
return BotIdentification(
name=name,
user_agent_pattern=pattern.pattern,
category=category,
)
# Heuristic: generic bot detection via common keywords
ua_lower = user_agent.lower()
bot_keywords = ["bot", "spider", "crawler", "scraper", "fetch"]
for kw in bot_keywords:
if kw in ua_lower:
return BotIdentification(
name="other",
user_agent_pattern=kw,
category="other",
)
return None
# -- file handling --------------------------------------------------------
@staticmethod
def _open_file(path: str) -> TextIO:
"""Open plain text, .gz, or .bz2 log files."""
p = Path(path)
if p.suffix == ".gz":
return gzip.open(path, "rt", encoding="utf-8", errors="replace")
if p.suffix == ".bz2":
return bz2.open(path, "rt", encoding="utf-8", errors="replace")
return open(path, "r", encoding="utf-8", errors="replace")
# -- streaming parser -----------------------------------------------------
def parse_streaming(
self,
filter_bot: str | None = None,
) -> Generator[tuple[LogEntry, BotIdentification], None, None]:
"""Generator-based streaming parser for large files."""
fmt = self.fmt
first_line_checked = False
fh = self._open_file(self.log_file)
try:
for line in fh:
line = line.strip()
if not line:
continue
if not first_line_checked and fmt == "auto":
fmt = self._detect_format(line)
self._detected_format = fmt
first_line_checked = True
entry = self._parse_line(line, fmt)
if entry is None:
self._parse_errors += 1
continue
bot = self.identify_bot(entry.user_agent)
if bot is None:
continue
if filter_bot and bot.name != filter_bot.lower():
continue
yield entry, bot
finally:
fh.close()
# -- full parse -----------------------------------------------------------
def parse(
self,
filter_bot: str | None = None,
date_from: datetime | None = None,
date_to: datetime | None = None,
) -> list[tuple[LogEntry, BotIdentification]]:
"""Full parse with optional date and bot filters."""
results: list[tuple[LogEntry, BotIdentification]] = []
for entry, bot in self.parse_streaming(filter_bot):
if date_from and entry.timestamp and entry.timestamp < date_from:
continue
if date_to and entry.timestamp and entry.timestamp > date_to:
continue
results.append((entry, bot))
return results
# -- statistics -----------------------------------------------------------
@staticmethod
def get_bot_stats(
entries: list[tuple[LogEntry, BotIdentification]],
) -> dict[str, BotStats]:
"""Aggregate per-bot statistics from parsed entries."""
bot_data: dict[str, dict] = defaultdict(lambda: {
"urls": Counter(),
"statuses": Counter(),
"hours": Counter(),
"days": Counter(),
"sizes": [],
"count": 0,
})
for entry, bot in entries:
bd = bot_data[bot.name]
bd["count"] += 1
bd["urls"][entry.url] += 1
bd["statuses"][str(entry.status_code)] += 1
bd["sizes"].append(entry.response_size)
if entry.timestamp:
bd["hours"][entry.timestamp.hour] += 1
day_key = entry.timestamp.strftime("%Y-%m-%d")
bd["days"][day_key] += 1
stats: dict[str, BotStats] = {}
for name, bd in bot_data.items():
avg_size = sum(bd["sizes"]) / len(bd["sizes"]) if bd["sizes"] else 0.0
top_20 = bd["urls"].most_common(20)
stats[name] = BotStats(
name=name,
total_requests=bd["count"],
unique_urls=len(bd["urls"]),
status_distribution=dict(bd["statuses"]),
top_urls=top_20,
hourly_distribution=dict(sorted(bd["hours"].items())),
daily_distribution=dict(sorted(bd["days"].items())),
avg_response_size=avg_size,
)
return stats
# -- orchestrator ---------------------------------------------------------
def parse_and_analyze(
self,
filter_bot: str | None = None,
date_from: datetime | None = None,
date_to: datetime | None = None,
) -> LogParseResult:
"""Orchestrate parsing and statistics generation."""
entries = self.parse(filter_bot, date_from, date_to)
bot_stats = self.get_bot_stats(entries)
# Determine date range
timestamps = [e.timestamp for e, _ in entries if e.timestamp]
date_range = {}
if timestamps:
date_range = {
"from": min(timestamps).isoformat(),
"to": max(timestamps).isoformat(),
}
# Count total lines for context
total_lines = 0
fh = self._open_file(self.log_file)
try:
for _ in fh:
total_lines += 1
finally:
fh.close()
return LogParseResult(
log_file=self.log_file,
format_detected=self._detected_format or self.fmt,
total_lines=total_lines,
parsed_lines=total_lines - self._parse_errors,
bot_entries=len(entries),
date_range=date_range,
bots=bot_stats,
errors=self._parse_errors,
)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def _parse_date(val: str) -> datetime:
"""Parse a date string in YYYY-MM-DD format."""
return datetime.strptime(val, "%Y-%m-%d")
def main() -> None:
parser = argparse.ArgumentParser(
description="Parse server access logs and identify search engine bot traffic.",
)
parser.add_argument(
"--log-file",
required=True,
help="Path to access log file (plain, .gz, .bz2)",
)
parser.add_argument(
"--format",
dest="fmt",
choices=["auto", "nginx", "apache", "cloudfront"],
default="auto",
help="Log format (default: auto-detect)",
)
parser.add_argument(
"--bot",
default=None,
help="Filter results to a specific bot (e.g., googlebot, yeti, bingbot, daumoa)",
)
parser.add_argument(
"--streaming",
action="store_true",
help="Use streaming parser for large files (prints entries incrementally)",
)
parser.add_argument(
"--date-from",
default=None,
help="Filter entries from date (YYYY-MM-DD)",
)
parser.add_argument(
"--date-to",
default=None,
help="Filter entries to date (YYYY-MM-DD)",
)
parser.add_argument(
"--json",
action="store_true",
help="Output in JSON format",
)
parser.add_argument(
"--output",
default=None,
help="Write output to file instead of stdout",
)
args = parser.parse_args()
# Validate file exists
if not Path(args.log_file).exists():
logger.error(f"Log file not found: {args.log_file}")
sys.exit(1)
date_from = _parse_date(args.date_from) if args.date_from else None
date_to = _parse_date(args.date_to) if args.date_to else None
lp = LogParser(log_file=args.log_file, fmt=args.fmt, streaming=args.streaming)
if args.streaming and not args.json:
# Streaming mode: print entries as they are parsed
count = 0
for entry, bot in lp.parse_streaming(args.bot):
if date_from and entry.timestamp and entry.timestamp < date_from:
continue
if date_to and entry.timestamp and entry.timestamp > date_to:
continue
ts_str = entry.timestamp.isoformat() if entry.timestamp else "N/A"
print(
f"[{bot.name}] {ts_str} {entry.status_code} "
f"{entry.method} {entry.url} ({entry.response_size}B)"
)
count += 1
print(f"\n--- Total bot requests: {count} ---")
return
# Full analysis mode
result = lp.parse_and_analyze(
filter_bot=args.bot,
date_from=date_from,
date_to=date_to,
)
if args.json:
output_data = result.to_dict()
output_str = json.dumps(output_data, indent=2, ensure_ascii=False)
else:
lines = [
f"Log File: {result.log_file}",
f"Format: {result.format_detected}",
f"Total Lines: {result.total_lines:,}",
f"Parsed Lines: {result.parsed_lines:,}",
f"Bot Entries: {result.bot_entries:,}",
f"Parse Errors: {result.errors:,}",
]
if result.date_range:
lines.append(f"Date Range: {result.date_range.get('from', 'N/A')} to {result.date_range.get('to', 'N/A')}")
lines.append("")
lines.append("=" * 60)
lines.append("Bot Statistics")
lines.append("=" * 60)
for name, stats in sorted(result.bots.items(), key=lambda x: -x[1].total_requests):
lines.append(f"\n--- {name.upper()} ---")
lines.append(f" Requests: {stats.total_requests:,}")
lines.append(f" Unique URLs: {stats.unique_urls:,}")
lines.append(f" Avg Response Size: {stats.avg_response_size:,.0f} bytes")
lines.append(f" Status Distribution: {stats.status_distribution}")
lines.append(f" Top 10 URLs:")
for url, cnt in stats.top_urls[:10]:
lines.append(f" {cnt:>6,} | {url}")
if stats.hourly_distribution:
peak_hour = max(stats.hourly_distribution, key=stats.hourly_distribution.get)
lines.append(f" Peak Hour: {peak_hour}:00 ({stats.hourly_distribution[peak_hour]:,} reqs)")
output_str = "\n".join(lines)
if args.output:
Path(args.output).write_text(output_str, encoding="utf-8")
logger.info(f"Output written to {args.output}")
else:
print(output_str)
if __name__ == "__main__":
main()
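As a quick sanity check, the Nginx regex and bot-pattern matching above can be exercised on a single sample line. The log line below is fabricated for illustration; the regex and timestamp format are copied from the module:

```python
import re
from datetime import datetime

# Hypothetical Nginx combined-format line (not real traffic)
SAMPLE = (
    '66.249.66.1 - - [10/Feb/2026:08:15:23 +0900] '
    '"GET /products?page=2 HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

NGINX_COMBINED_RE = re.compile(
    r'(?P<ip>[\d.:a-fA-F]+)\s+-\s+(?P<user>\S+)\s+'
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
    r'"(?P<referer>[^"]*)"\s+'
    r'"(?P<user_agent>[^"]*)"'
)

m = NGINX_COMBINED_RE.match(SAMPLE)
assert m is not None
print(m.group("url"))     # /products?page=2
print(m.group("status"))  # 200

# The timestamp parses with the module's NGINX_TS_FORMAT
ts = datetime.strptime(m.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
print(ts.hour)  # 8

# The UA matches the first entry in BOT_PATTERNS (googlebot)
assert re.search(r"Googlebot(?:-Image|-News|-Video)?/", m.group("user_agent"))
```

Note that `LogParser.identify_bot` applies these patterns in declaration order, so the specific Googlebot variants fire before the generic "bot"/"spider" keyword fallback.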


@@ -0,0 +1,10 @@
# 32-seo-crawl-budget dependencies
requests>=2.31.0
aiohttp>=3.9.0
pandas>=2.1.0
beautifulsoup4>=4.12.0
lxml>=5.1.0
tenacity>=8.2.0
tqdm>=4.66.0
python-dotenv>=1.0.0
rich>=13.7.0


@@ -0,0 +1,39 @@
---
name: seo-crawl-budget
description: |
Crawl budget optimization and log analysis. Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste, orphan pages, crawl efficiency.
---
# Crawl Budget Optimizer
Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots.
## Capabilities
1. **Log Analysis**: Parse Nginx/Apache/CloudFront access logs to extract bot crawl data
2. **Bot Profiling**: Per-bot behavior analysis (Googlebot, Yeti, Bingbot, Daumoa)
3. **Waste Detection**: Parameter URLs, redirect chains, soft 404s, duplicate URL variants
4. **Orphan Pages**: Pages in sitemap but uncrawled, and crawled pages not in sitemap
5. **Recommendations**: Prioritized action items for crawl budget optimization
## Workflow
1. Parse server access log with `log_parser.py`
2. Run crawl budget analysis with `crawl_budget_analyzer.py`
3. Compare with sitemap URLs for orphan page detection
4. Optionally compare with Ahrefs page history data
5. Generate Korean-language report with recommendations
6. Save to Notion SEO Audit Log database
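Step 3's orphan detection reduces to set arithmetic between the sitemap's URLs and the URLs bots actually requested in the log. A minimal self-contained sketch (all URLs are hypothetical):

```python
# URLs declared in the XML sitemap
sitemap_urls = {
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/about",        # never requested by any bot
}

# URLs observed in bot entries from the access log
crawled_urls = {
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/tmp/old-page",  # crawled but absent from sitemap
}

# In sitemap but never crawled -> bots may not be discovering these pages
in_sitemap_not_crawled = sorted(sitemap_urls - crawled_urls)
# Crawled but not in sitemap -> potential crawl waste or stale internal links
crawled_not_in_sitemap = sorted(crawled_urls - sitemap_urls)

print(in_sitemap_not_crawled)   # ['https://example.com/about']
print(crawled_not_in_sitemap)   # ['https://example.com/tmp/old-page']
```

These two buckets correspond to the `in_sitemap_not_crawled` and `crawled_not_in_sitemap` keys in the analyzer's orphan-pages output.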
## Tools Used
- **Ahrefs**: `site-explorer-pages-history` for indexed page comparison
- **Notion**: Save audit report to database `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **WebSearch**: Current best practices and bot documentation
## Output
All reports are saved to the OurDigital SEO Audit Log with:
- Category: Crawl Budget
- Audit ID format: CRAWL-YYYYMMDD-NNN
- Content in Korean with technical English terms preserved


@@ -0,0 +1,8 @@
name: seo-crawl-budget
description: |
Crawl budget optimization and log analysis. Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste, orphan pages, crawl efficiency.
allowed-tools:
- mcp__ahrefs__*
- mcp__notion__*
- WebSearch
- WebFetch


@@ -0,0 +1,17 @@
# Ahrefs MCP Tools
## site-explorer-pages-history
Get historical page data for a domain to compare indexed pages with crawled pages.
```
mcp__ahrefs__site-explorer-pages-history
```
**Parameters**:
- `target` (string, required): Domain or URL to analyze
- `date_from` (string): Start date (YYYY-MM-DD)
- `date_to` (string): End date (YYYY-MM-DD)
- `mode` (string): "domain", "prefix", "exact", "subdomains"
**Use case**: Compare Ahrefs indexed page counts with server log crawl data to identify indexing gaps and crawl budget inefficiencies.


@@ -0,0 +1,21 @@
# Notion MCP Tools
## notion-create-pages
Create a new page in the OurDigital SEO Audit Log database.
```
mcp__notion__notion-create-pages
```
**Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
**Required Properties**:
- `Issue` (Title): Report title in Korean with date
- `Site` (URL): Audited website URL
- `Category` (Select): "Crawl Budget"
- `Priority` (Select): Based on waste percentage (Critical >20%, High >10%, Medium >5%, Low <5%)
- `Found Date` (Date): Audit date (YYYY-MM-DD)
- `Audit ID` (Rich Text): Format CRAWL-YYYYMMDD-NNN
**Content**: Full crawl budget report in Korean with technical English terms preserved.
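The Priority thresholds and Audit ID format above can be expressed as a small helper. This is an illustrative sketch, not code from the committed scripts; boundary values (exactly 5/10/20%) follow the ">" reading of the thresholds:

```python
from datetime import date

def priority_from_waste(waste_pct: float) -> str:
    """Map total crawl waste percentage to the Notion Priority select."""
    if waste_pct > 20:
        return "Critical"
    if waste_pct > 10:
        return "High"
    if waste_pct > 5:
        return "Medium"
    return "Low"

def audit_id(audit_date: date, seq: int) -> str:
    """Audit ID in CRAWL-YYYYMMDD-NNN format."""
    return f"CRAWL-{audit_date:%Y%m%d}-{seq:03d}"

print(priority_from_waste(12.4))       # High
print(audit_id(date(2026, 2, 13), 1))  # CRAWL-20260213-001
```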


@@ -0,0 +1,18 @@
# WebSearch Tool
## Purpose
Search the web for current crawl budget best practices, search engine bot documentation, and robots.txt guidelines.
## Usage
```
WebSearch(query="Googlebot crawl budget optimization 2025")
```
**Common queries**:
- Search engine bot crawl rate documentation
- robots.txt best practices for crawl budget
- URL parameter handling for search engines
- Crawl budget optimization techniques
- Search engine bot user-agent strings