Add SEO skills 19-28, 31-32 with full Python implementations
12 new skills: Keyword Strategy, SERP Analysis, Position Tracking, Link Building, Content Strategy, E-Commerce SEO, KPI Framework, International SEO, AI Visibility, Knowledge Graph, Competitor Intel, and Crawl Budget. ~20K lines of Python across 25 domain scripts.

Updated skill 11 pipeline table and repo CLAUDE.md. Enhanced skill 18 local SEO workflow from jamie.clinic audit.

Note: Skill 26 hreflang_validator.py pending (content filter block).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
178
custom-skills/32-seo-crawl-budget/code/CLAUDE.md
Normal file
@@ -0,0 +1,178 @@
# CLAUDE.md

## Overview

Crawl budget optimization tool for analyzing server access logs and identifying crawl budget waste. Parses Apache/Nginx/CloudFront access logs, identifies search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao), profiles per-bot crawl behavior, detects crawl waste (parameter URLs, low-value pages, redirect chains), identifies orphan pages, and generates crawl efficiency recommendations. Uses a streaming parser for large log files.

## Quick Start

```bash
pip install -r scripts/requirements.txt

# Parse access logs
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file /var/log/nginx/access.log --sitemap https://example.com/sitemap.xml --json
```

## Scripts

| Script | Purpose | Key Output |
|--------|---------|------------|
| `log_parser.py` | Parse server access logs, identify bots, extract crawl data | Bot identification, request patterns, status codes |
| `crawl_budget_analyzer.py` | Analyze crawl budget efficiency and generate recommendations | Waste identification, orphan pages, optimization plan |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
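For orientation, a minimal sketch of how the two entry points compose when driven from Python rather than the CLI. It only uses classes defined in this commit (`LogParser`, `CrawlBudgetAnalyzer`); the `sys.path` line assumes the snippet runs from the skill's `code/` directory so the `scripts/` modules are importable.

```python
import sys

sys.path.insert(0, "scripts")  # assumption: run from the skill's code/ directory

from log_parser import LogParser
from crawl_budget_analyzer import CrawlBudgetAnalyzer

# Step 1: parse the raw access log and summarize bot traffic.
parser = LogParser(log_file="/var/log/nginx/access.log", fmt="auto")
parse_result = parser.parse_and_analyze()
print(f"Bot requests: {parse_result.bot_entries:,} of {parse_result.total_lines:,} lines")

# Step 2: run the crawl budget analysis on the same log.
analyzer = CrawlBudgetAnalyzer(
    log_file="/var/log/nginx/access.log",
    sitemap_url="https://example.com/sitemap.xml",
)
result = analyzer.analyze(scope="all")
print(f"Efficiency score: {result.efficiency_score}/100, waste: {result.total_waste_pct:.1f}%")
```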
## Log Parser

```bash
# Parse Nginx combined log format
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Parse Apache combined log format
python scripts/log_parser.py --log-file /var/log/apache2/access.log --format apache --json

# Parse CloudFront log
python scripts/log_parser.py --log-file cloudfront-log.gz --format cloudfront --json

# Filter by specific bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json

# Parse gzipped logs
python scripts/log_parser.py --log-file access.log.gz --json

# Process large files in streaming mode
python scripts/log_parser.py --log-file access.log --streaming --json
```

**Capabilities**:
- Support for common log formats:
  - Nginx combined format
  - Apache combined format
  - CloudFront format
  - Custom format via regex
- Bot identification by User-Agent (see the sketch after this list):
  - Googlebot (and variants: Googlebot-Image, Googlebot-News, Googlebot-Video, AdsBot-Google)
  - Yeti (Naver's crawler)
  - Bingbot
  - Daumoa (Kakao/Daum crawler)
  - Other bots (Applebot, DuckDuckBot, Baiduspider, etc.)
- Request data extraction (timestamp, IP, URL, status code, response size, user-agent, referer)
- Streaming parser for files >1GB
- Gzip/bzip2 compressed log support
- Date range filtering
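Bot identification is a pure User-Agent match, so it can be exercised without a log file. A minimal sketch using `LogParser.identify_bot` (a static method in `scripts/log_parser.py`, assuming `scripts/` is on the import path); the User-Agent string below is illustrative.

```python
from log_parser import LogParser

# Illustrative Googlebot smartphone User-Agent; the "Googlebot/" token matches
# the first pattern in BOT_PATTERNS.
ua = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

bot = LogParser.identify_bot(ua)
if bot:
    print(bot.name, bot.category)  # -> googlebot search_engine

# Unknown crawlers still match the generic keyword heuristic:
print(LogParser.identify_bot("ExampleCrawler/1.0 (+https://example.com)").name)  # -> other
```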
## Crawl Budget Analyzer

```bash
# Full crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json

# Waste identification only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json

# Orphan page detection
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --scope orphans --json

# Per-bot profiling
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

# With Ahrefs page history comparison
python scripts/crawl_budget_analyzer.py --log-file access.log --url https://example.com --ahrefs --json
```

**Capabilities**:
- Crawl budget waste identification (priority thresholds sketched after this list):
  - Parameter URLs consuming crawl budget (?sort=, ?filter=, ?page=, ?utm_*)
  - Low-value pages (thin content, noindex pages being crawled)
  - Redirect chains consuming multiple crawls
  - Soft 404 pages (200 status but error content)
  - Duplicate URLs (www/non-www, http/https, trailing slash variants)
- Per-bot behavior profiling:
  - Crawl frequency (requests/day, requests/hour)
  - Crawl depth distribution
  - Status code distribution per bot
  - Most crawled URLs per bot
  - Crawl pattern analysis (time of day, days of week)
- Orphan page detection:
  - Pages in sitemap but never crawled by bots
  - Pages crawled but not in sitemap
  - Crawled pages with no internal links
- Crawl efficiency recommendations:
  - robots.txt optimization suggestions
  - URL parameter handling recommendations
  - Noindex/nofollow suggestions for low-value pages
  - Redirect chain resolution priorities
  - Internal linking improvements for orphan pages
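Each waste category is reported as a share of all bot requests, and that share drives the recommendation priority. The sketch below is a simplified restatement of the thresholds used in `crawl_budget_analyzer.py` (>5% critical, >2% high, >0.5% medium, otherwise low), not an additional API.

```python
def waste_priority(pct_of_total: float) -> str:
    """Map a waste category's share of bot requests to a recommendation priority.

    Mirrors the thresholds in CrawlBudgetAnalyzer.generate_recommendations.
    """
    if pct_of_total > 5.0:
        return "critical"
    if pct_of_total > 2.0:
        return "high"
    if pct_of_total > 0.5:
        return "medium"
    return "low"


# Example: parameter URLs at 3.3% of crawls -> "high" priority recommendation.
print(waste_priority(3.3))  # high
print(waste_priority(8.5))  # critical
```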
## Data Sources

| Source | Purpose |
|--------|---------|
| Server access logs | Primary crawl data |
| XML sitemap | Reference for expected crawlable pages |
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
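The sitemap is fetched over HTTP and parsed as XML, with sitemap indexes followed into their child sitemaps. A minimal standalone sketch of that step, roughly equivalent to `CrawlBudgetAnalyzer.load_sitemap_urls`; it assumes `requests`, `beautifulsoup4`, and `lxml` are installed, as the analyzer itself does.

```python
import requests
from bs4 import BeautifulSoup


def fetch_sitemap_urls(sitemap_url: str) -> set[str]:
    """Return the <loc> URLs from an XML sitemap, following sitemap indexes."""
    urls: set[str] = set()
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, "lxml-xml")

    children = [s.find("loc").text.strip() for s in soup.find_all("sitemap") if s.find("loc")]
    if children:
        for child in children:            # sitemap index: recurse into child sitemaps
            urls |= fetch_sitemap_urls(child)
    else:
        urls |= {u.find("loc").text.strip() for u in soup.find_all("url") if u.find("loc")}
    return urls


# print(len(fetch_sitemap_urls("https://example.com/sitemap.xml")))
```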
## Output Format

```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "total_requests": 80000,
      "unique_urls": 12000,
      "requests_per_day": 2580.6,
      "status_breakdown": {"200": 70000, "301": 5000, "404": 3000, "500": 2000},
      "top_crawled_urls": [...]
    },
    "yeti": {"total_requests": 35000, ...},
    "bingbot": {"total_requests": 20000, ...},
    "daumoa": {"total_requests": 15000, ...}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_total": 3.3, "recommendation": "...", "sample_urls": [...]},
    "redirect_chains": {"count": 2000, "pct_of_total": 1.3, ...},
    "soft_404s": {"count": 1500, "pct_of_total": 1.0, ...},
    "duplicate_urls": {...}
  },
  "total_waste_pct": 8.5,
  "orphan_pages": {
    "in_sitemap_not_crawled": [...],
    "crawled_not_in_sitemap": [...]
  },
  "recommendations": [...],
  "efficiency_score": 91,
  "timestamp": "2025-01-01T00:00:00"
}
```

Key names follow the `to_dict()` serializers in `crawl_budget_analyzer.py`; `efficiency_score` is `100 - total_waste_pct`, clamped to 0-100.
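A short sketch of consuming the JSON report downstream, for example to surface only the most urgent items; it assumes the report was written with `--json --output report.json`.

```python
import json
from pathlib import Path

report = json.loads(Path("report.json").read_text(encoding="utf-8"))

print(f"Efficiency score: {report['efficiency_score']}/100 "
      f"(waste {report['total_waste_pct']}%)")

for rec in report["recommendations"]:
    if rec["priority"] in ("critical", "high"):
        print(f"[{rec['priority'].upper()}] {rec['category']}: {rec['action']}")
```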
## Notion Output (Required)

**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.

### Database Configuration

| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |

### Required Properties

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on waste percentage |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |
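A hedged sketch of writing one audit entry with the `notion-client` SDK. The property names and types follow the table above, and the token lookup reuses `ConfigManager.notion_token` from `base_client.py`; the property payload shapes are standard Notion API formats, but verify them against the live database schema before relying on this.

```python
from notion_client import Client

from base_client import config  # ConfigManager singleton (reads NOTION_TOKEN / NOTION_API_KEY)

DATABASE_ID = "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"

notion = Client(auth=config.notion_token)
notion.pages.create(
    parent={"database_id": DATABASE_ID},
    properties={
        "Issue": {"title": [{"text": {"content": "크롤 예산 감사 보고서 (2025-01-31)"}}]},
        "Site": {"url": "https://example.com"},
        "Category": {"select": {"name": "Crawl Budget"}},
        "Priority": {"select": {"name": "High"}},
        "Found Date": {"date": {"start": "2025-01-31"}},
        "Audit ID": {"rich_text": [{"text": {"content": "CRAWL-20250131-001"}}]},
    },
)
```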
### Language Guidelines

- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Crawl Budget, Googlebot, robots.txt)
- URLs and code remain unchanged
207
custom-skills/32-seo-crawl-budget/code/scripts/base_client.py
Normal file
@@ -0,0 +1,207 @@
"""
Base Client - Shared async client utilities
===========================================
Purpose: Rate-limited async operations for API clients
Python: 3.10+
"""

import asyncio
import logging
import os
from asyncio import Semaphore
from datetime import datetime
from typing import Any, Callable, TypeVar

from dotenv import load_dotenv
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

# Load environment variables
load_dotenv()

# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

T = TypeVar("T")


class RateLimiter:
    """Rate limiter using token bucket algorithm."""

    def __init__(self, rate: float, per: float = 1.0):
        """
        Initialize rate limiter.

        Args:
            rate: Number of requests allowed
            per: Time period in seconds (default: 1 second)
        """
        self.rate = rate
        self.per = per
        self.tokens = rate
        self.last_update = datetime.now()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Acquire a token, waiting if necessary."""
        async with self._lock:
            now = datetime.now()
            elapsed = (now - self.last_update).total_seconds()
            self.tokens = min(self.rate, self.tokens + elapsed * (self.rate / self.per))
            self.last_update = now

            if self.tokens < 1:
                wait_time = (1 - self.tokens) * (self.per / self.rate)
                await asyncio.sleep(wait_time)
                self.tokens = 0
            else:
                self.tokens -= 1


class BaseAsyncClient:
    """Base class for async API clients with rate limiting."""

    def __init__(
        self,
        max_concurrent: int = 5,
        requests_per_second: float = 3.0,
        logger: logging.Logger | None = None,
    ):
        """
        Initialize base client.

        Args:
            max_concurrent: Maximum concurrent requests
            requests_per_second: Rate limit
            logger: Logger instance
        """
        self.semaphore = Semaphore(max_concurrent)
        self.rate_limiter = RateLimiter(requests_per_second)
        self.logger = logger or logging.getLogger(self.__class__.__name__)
        self.stats = {
            "requests": 0,
            "success": 0,
            "errors": 0,
            "retries": 0,
        }

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type(Exception),
    )
    async def _rate_limited_request(
        self,
        coro: Callable[[], Any],
    ) -> Any:
        """Execute a request with rate limiting and retry."""
        async with self.semaphore:
            await self.rate_limiter.acquire()
            self.stats["requests"] += 1
            try:
                result = await coro()
                self.stats["success"] += 1
                return result
            except Exception as e:
                self.stats["errors"] += 1
                self.logger.error(f"Request failed: {e}")
                raise

    async def batch_requests(
        self,
        requests: list[Callable[[], Any]],
        desc: str = "Processing",
    ) -> list[Any]:
        """Execute multiple requests concurrently."""
        try:
            from tqdm.asyncio import tqdm
            has_tqdm = True
        except ImportError:
            has_tqdm = False

        async def execute(req: Callable) -> Any:
            try:
                return await self._rate_limited_request(req)
            except Exception as e:
                return {"error": str(e)}

        tasks = [execute(req) for req in requests]

        if has_tqdm:
            results = []
            for coro in tqdm.as_completed(tasks, total=len(tasks), desc=desc):
                result = await coro
                results.append(result)
            return results
        else:
            return await asyncio.gather(*tasks, return_exceptions=True)

    def print_stats(self) -> None:
        """Print request statistics."""
        self.logger.info("=" * 40)
        self.logger.info("Request Statistics:")
        self.logger.info(f"  Total Requests: {self.stats['requests']}")
        self.logger.info(f"  Successful: {self.stats['success']}")
        self.logger.info(f"  Errors: {self.stats['errors']}")
        self.logger.info("=" * 40)


class ConfigManager:
    """Manage API configuration and credentials."""

    def __init__(self):
        load_dotenv()

    @property
    def google_credentials_path(self) -> str | None:
        """Get Google service account credentials path."""
        # Prefer SEO-specific credentials, fallback to general credentials
        seo_creds = os.path.expanduser("~/.credential/ourdigital-seo-agent.json")
        if os.path.exists(seo_creds):
            return seo_creds
        return os.getenv("GOOGLE_APPLICATION_CREDENTIALS")

    @property
    def pagespeed_api_key(self) -> str | None:
        """Get PageSpeed Insights API key."""
        return os.getenv("PAGESPEED_API_KEY")

    @property
    def custom_search_api_key(self) -> str | None:
        """Get Custom Search API key."""
        return os.getenv("CUSTOM_SEARCH_API_KEY")

    @property
    def custom_search_engine_id(self) -> str | None:
        """Get Custom Search Engine ID."""
        return os.getenv("CUSTOM_SEARCH_ENGINE_ID")

    @property
    def notion_token(self) -> str | None:
        """Get Notion API token."""
        return os.getenv("NOTION_TOKEN") or os.getenv("NOTION_API_KEY")

    def validate_google_credentials(self) -> bool:
        """Validate Google credentials are configured."""
        creds_path = self.google_credentials_path
        if not creds_path:
            return False
        return os.path.exists(creds_path)

    def get_required(self, key: str) -> str:
        """Get required environment variable or raise error."""
        value = os.getenv(key)
        if not value:
            raise ValueError(f"Missing required environment variable: {key}")
        return value


# Singleton config instance
config = ConfigManager()
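A minimal usage sketch for these utilities: a hypothetical subclass whose plain HTTP requests are wrapped by `batch_requests` in the shared semaphore, token bucket, and tenacity retries. The `httpx` dependency, the URLs, and the `ExampleApiClient` name are illustrative, not part of this commit.

```python
import asyncio

import httpx  # assumed async HTTP library; any awaitable request works the same way

from base_client import BaseAsyncClient


class ExampleApiClient(BaseAsyncClient):
    """Hypothetical client reusing the shared rate limiting and retry machinery."""

    async def fetch_json(self, url: str) -> dict:
        # Plain request; rate limiting/retry are applied by batch_requests below.
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.json()


async def main() -> None:
    api = ExampleApiClient(max_concurrent=5, requests_per_second=3.0)
    urls = [f"https://api.example.com/items/{i}" for i in range(10)]
    # batch_requests wraps each callable in _rate_limited_request and gathers results.
    results = await api.batch_requests([lambda u=u: api.fetch_json(u) for u in urls])
    api.print_stats()


if __name__ == "__main__":
    asyncio.run(main())
```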
805
custom-skills/32-seo-crawl-budget/code/scripts/crawl_budget_analyzer.py
Normal file
@@ -0,0 +1,805 @@
|
||||
"""
|
||||
Crawl Budget Analyzer - Identify crawl waste and generate recommendations
|
||||
=========================================================================
|
||||
Purpose: Analyze server access logs for crawl budget efficiency, detect waste
|
||||
(parameter URLs, redirect chains, soft 404s, duplicates), find orphan
|
||||
pages, profile per-bot behavior, and produce prioritized recommendations.
|
||||
Python: 3.10+
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
from urllib.parse import parse_qs, urlparse
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from log_parser import BotIdentification, LogEntry, LogParser
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s - %(levelname)s - %(message)s",
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
WASTE_PARAMS = {"sort", "filter", "order", "orderby", "dir", "direction"}
|
||||
TRACKING_PARAMS_RE = re.compile(r"^utm_", re.IGNORECASE)
|
||||
PAGINATION_PARAM = "page"
|
||||
HIGH_PAGE_THRESHOLD = 5
|
||||
SOFT_404_MAX_SIZE = 1024 # bytes - pages smaller than this may be soft 404s
|
||||
REDIRECT_STATUSES = {301, 302, 303, 307, 308}
|
||||
TOP_N_URLS = 50
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Data classes
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@dataclass
|
||||
class CrawlWaste:
|
||||
"""A category of crawl budget waste."""
|
||||
waste_type: str
|
||||
urls: list[str]
|
||||
count: int
|
||||
pct_of_total: float
|
||||
recommendation: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"waste_type": self.waste_type,
|
||||
"count": self.count,
|
||||
"pct_of_total": round(self.pct_of_total, 2),
|
||||
"recommendation": self.recommendation,
|
||||
"sample_urls": self.urls[:20],
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class OrphanPage:
|
||||
"""A page that is either in the sitemap but uncrawled, or crawled but not in sitemap."""
|
||||
url: str
|
||||
in_sitemap: bool
|
||||
crawled: bool
|
||||
last_crawl_date: str | None = None
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
|
||||
@dataclass
|
||||
class BotProfile:
|
||||
"""Per-bot crawl behavior profile."""
|
||||
name: str
|
||||
total_requests: int = 0
|
||||
requests_per_day: float = 0.0
|
||||
crawl_depth_distribution: dict[int, int] = field(default_factory=dict)
|
||||
peak_hours: list[int] = field(default_factory=list)
|
||||
status_breakdown: dict[str, int] = field(default_factory=dict)
|
||||
top_crawled_urls: list[tuple[str, int]] = field(default_factory=list)
|
||||
unique_urls: int = 0
|
||||
days_active: int = 0
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"name": self.name,
|
||||
"total_requests": self.total_requests,
|
||||
"requests_per_day": round(self.requests_per_day, 1),
|
||||
"crawl_depth_distribution": self.crawl_depth_distribution,
|
||||
"peak_hours": self.peak_hours,
|
||||
"status_breakdown": self.status_breakdown,
|
||||
"top_crawled_urls": [{"url": u, "count": c} for u, c in self.top_crawled_urls],
|
||||
"unique_urls": self.unique_urls,
|
||||
"days_active": self.days_active,
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class CrawlRecommendation:
|
||||
"""A single optimization recommendation."""
|
||||
category: str
|
||||
priority: str # critical, high, medium, low
|
||||
action: str
|
||||
impact: str
|
||||
details: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
|
||||
@dataclass
|
||||
class CrawlBudgetResult:
|
||||
"""Complete crawl budget analysis result."""
|
||||
log_file: str
|
||||
analysis_period: dict[str, str]
|
||||
total_bot_requests: int
|
||||
bots: dict[str, BotProfile]
|
||||
waste: list[CrawlWaste]
|
||||
total_waste_pct: float
|
||||
orphan_pages: dict[str, list[OrphanPage]]
|
||||
recommendations: list[CrawlRecommendation]
|
||||
efficiency_score: int
|
||||
timestamp: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"log_file": self.log_file,
|
||||
"analysis_period": self.analysis_period,
|
||||
"total_bot_requests": self.total_bot_requests,
|
||||
"bots": {n: p.to_dict() for n, p in self.bots.items()},
|
||||
"waste": {w.waste_type: w.to_dict() for w in self.waste},
|
||||
"total_waste_pct": round(self.total_waste_pct, 2),
|
||||
"orphan_pages": {
|
||||
k: [o.to_dict() for o in v]
|
||||
for k, v in self.orphan_pages.items()
|
||||
},
|
||||
"recommendations": [r.to_dict() for r in self.recommendations],
|
||||
"efficiency_score": self.efficiency_score,
|
||||
"timestamp": self.timestamp,
|
||||
}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CrawlBudgetAnalyzer
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class CrawlBudgetAnalyzer:
|
||||
"""Analyze crawl budget efficiency from server access logs."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
log_file: str,
|
||||
sitemap_url: str | None = None,
|
||||
target_url: str | None = None,
|
||||
):
|
||||
self.log_file = log_file
|
||||
self.sitemap_url = sitemap_url
|
||||
self.target_url = target_url
|
||||
self._bot_entries: list[tuple[LogEntry, BotIdentification]] = []
|
||||
self._sitemap_urls: set[str] = set()
|
||||
|
||||
# -- data loading ---------------------------------------------------------
|
||||
|
||||
def load_log_data(self, log_file: str) -> list[tuple[LogEntry, BotIdentification]]:
|
||||
"""Use LogParser to load all bot requests from the log file."""
|
||||
parser = LogParser(log_file=log_file, fmt="auto")
|
||||
entries = parser.parse()
|
||||
logger.info(f"Loaded {len(entries):,} bot entries from {log_file}")
|
||||
self._bot_entries = entries
|
||||
return entries
|
||||
|
||||
def load_sitemap_urls(self, sitemap_url: str) -> set[str]:
|
||||
"""Fetch and parse an XML sitemap, returning the set of URLs."""
|
||||
urls: set[str] = set()
|
||||
try:
|
||||
resp = requests.get(sitemap_url, timeout=30, headers={
|
||||
"User-Agent": "CrawlBudgetAnalyzer/1.0",
|
||||
})
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.content, "lxml-xml")
|
||||
|
||||
# Handle sitemap index
|
||||
sitemap_tags = soup.find_all("sitemap")
|
||||
if sitemap_tags:
|
||||
for st in sitemap_tags:
|
||||
loc = st.find("loc")
|
||||
if loc and loc.text:
|
||||
child_urls = self._fetch_sitemap_child(loc.text.strip())
|
||||
urls.update(child_urls)
|
||||
else:
|
||||
for url_tag in soup.find_all("url"):
|
||||
loc = url_tag.find("loc")
|
||||
if loc and loc.text:
|
||||
urls.add(self._normalize_url(loc.text.strip()))
|
||||
|
||||
logger.info(f"Loaded {len(urls):,} URLs from sitemap: {sitemap_url}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load sitemap {sitemap_url}: {e}")
|
||||
|
||||
self._sitemap_urls = urls
|
||||
return urls
|
||||
|
||||
def _fetch_sitemap_child(self, url: str) -> set[str]:
|
||||
"""Fetch a child sitemap from a sitemap index."""
|
||||
urls: set[str] = set()
|
||||
try:
|
||||
resp = requests.get(url, timeout=30, headers={
|
||||
"User-Agent": "CrawlBudgetAnalyzer/1.0",
|
||||
})
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.content, "lxml-xml")
|
||||
for url_tag in soup.find_all("url"):
|
||||
loc = url_tag.find("loc")
|
||||
if loc and loc.text:
|
||||
urls.add(self._normalize_url(loc.text.strip()))
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to fetch child sitemap {url}: {e}")
|
||||
return urls
|
||||
|
||||
@staticmethod
|
||||
def _normalize_url(url: str) -> str:
|
||||
"""Normalize a URL by removing trailing slash and lowercasing the scheme/host."""
|
||||
parsed = urlparse(url)
|
||||
path = parsed.path.rstrip("/") or "/"
|
||||
return f"{parsed.scheme}://{parsed.netloc}{path}"
|
||||
|
||||
# -- waste identification -------------------------------------------------
|
||||
|
||||
def identify_parameter_waste(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find URLs with unnecessary query parameters wasting crawl budget."""
|
||||
waste_urls: list[str] = []
|
||||
for entry, _ in bot_requests:
|
||||
parsed = urlparse(entry.url)
|
||||
if not parsed.query:
|
||||
continue
|
||||
params = parse_qs(parsed.query)
|
||||
param_keys = {k.lower() for k in params}
|
||||
# Check for waste parameters
|
||||
has_waste = bool(param_keys & WASTE_PARAMS)
|
||||
# Check for tracking parameters
|
||||
has_tracking = any(TRACKING_PARAMS_RE.match(k) for k in param_keys)
|
||||
# Check for deep pagination
|
||||
page_val = params.get(PAGINATION_PARAM, params.get("p", [None]))
|
||||
has_deep_page = False
|
||||
if page_val and page_val[0]:
|
||||
try:
|
||||
if int(page_val[0]) > HIGH_PAGE_THRESHOLD:
|
||||
has_deep_page = True
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
if has_waste or has_tracking or has_deep_page:
|
||||
waste_urls.append(entry.url)
|
||||
|
||||
total = len(bot_requests)
|
||||
count = len(waste_urls)
|
||||
pct = (count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="parameter_urls",
|
||||
urls=list(set(waste_urls)),
|
||||
count=count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"robots.txt에 불필요한 parameter URL 패턴을 Disallow로 추가하거나, "
|
||||
"Google Search Console의 URL Parameters 설정을 활용하세요. "
|
||||
"UTM 파라미터가 포함된 URL은 canonical 태그로 처리하세요."
|
||||
),
|
||||
)
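To make the heuristics above concrete, here is a small illustrative check of which query strings this method treats as waste; it is a simplified restatement of the rules, and the URLs are made up.

```python
from urllib.parse import parse_qs, urlparse

WASTE_PARAMS = {"sort", "filter", "order", "orderby", "dir", "direction"}


def is_waste(url: str, page_threshold: int = 5) -> bool:
    """Simplified restatement of the parameter-waste rules in identify_parameter_waste."""
    q = parse_qs(urlparse(url).query)
    keys = {k.lower() for k in q}
    if keys & WASTE_PARAMS:
        return True                                         # sort/filter-style parameters
    if any(k.startswith("utm_") for k in keys):
        return True                                         # tracking parameters
    page = q.get("page", q.get("p", ["0"]))[0]
    return page.isdigit() and int(page) > page_threshold    # deep pagination


print(is_waste("https://example.com/shoes?sort=price"))        # True
print(is_waste("https://example.com/?utm_source=newsletter"))  # True
print(is_waste("https://example.com/shoes?page=2"))            # False (shallow pagination)
print(is_waste("https://example.com/shoes?color=red"))         # False
```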
|
||||
|
||||
def identify_redirect_chains(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find URLs that repeatedly return redirect status codes."""
|
||||
redirect_urls: list[str] = []
|
||||
redirect_counter: Counter = Counter()
|
||||
for entry, _ in bot_requests:
|
||||
if entry.status_code in REDIRECT_STATUSES:
|
||||
redirect_counter[entry.url] += 1
|
||||
redirect_urls.append(entry.url)
|
||||
|
||||
# URLs redirected more than once are chain candidates
|
||||
chain_urls = [url for url, cnt in redirect_counter.items() if cnt >= 2]
|
||||
total = len(bot_requests)
|
||||
count = len(redirect_urls)
|
||||
pct = (count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="redirect_chains",
|
||||
urls=chain_urls,
|
||||
count=count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"301/302 리다이렉트가 반복적으로 크롤링되고 있습니다. "
|
||||
"내부 링크를 최종 목적지 URL로 직접 업데이트하고, "
|
||||
"리다이렉트 체인을 단일 리다이렉트로 단축하세요."
|
||||
),
|
||||
)
|
||||
|
||||
def identify_soft_404s(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find 200-status pages with suspiciously small response sizes."""
|
||||
soft_404_urls: list[str] = []
|
||||
for entry, _ in bot_requests:
|
||||
if entry.status_code == 200 and entry.response_size < SOFT_404_MAX_SIZE:
|
||||
if entry.response_size > 0:
|
||||
soft_404_urls.append(entry.url)
|
||||
|
||||
total = len(bot_requests)
|
||||
count = len(soft_404_urls)
|
||||
pct = (count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="soft_404s",
|
||||
urls=list(set(soft_404_urls)),
|
||||
count=count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"200 상태 코드를 반환하지만 콘텐츠가 거의 없는 Soft 404 페이지입니다. "
|
||||
"실제 404 상태 코드를 반환하거나, 해당 페이지에 noindex 태그를 추가하세요."
|
||||
),
|
||||
)
|
||||
|
||||
def identify_duplicate_crawls(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> CrawlWaste:
|
||||
"""Find duplicate URL variants: www/non-www, trailing slash, etc."""
|
||||
url_variants: dict[str, set[str]] = defaultdict(set)
|
||||
for entry, _ in bot_requests:
|
||||
parsed = urlparse(entry.url)
|
||||
# Normalize: strip www, strip trailing slash, lowercase
|
||||
host = parsed.netloc.lower().removeprefix("www.")  # removeprefix, not lstrip: lstrip("www.") would strip any leading "w"/"." characters
|
||||
path = parsed.path.rstrip("/") or "/"
|
||||
canonical = f"{host}{path}"
|
||||
full_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
|
||||
url_variants[canonical].add(full_url)
|
||||
|
||||
# Identify canonicals with multiple variants
|
||||
duplicate_urls: list[str] = []
|
||||
for canonical, variants in url_variants.items():
|
||||
if len(variants) > 1:
|
||||
duplicate_urls.extend(variants)
|
||||
|
||||
total = len(bot_requests)
|
||||
# Count how many requests hit duplicate variant URLs
|
||||
dup_set = set(duplicate_urls)
|
||||
dup_request_count = sum(1 for e, _ in bot_requests if f"{urlparse(e.url).scheme}://{urlparse(e.url).netloc}{urlparse(e.url).path}" in dup_set)
|
||||
pct = (dup_request_count / total * 100) if total else 0.0
|
||||
return CrawlWaste(
|
||||
waste_type="duplicate_urls",
|
||||
urls=duplicate_urls[:TOP_N_URLS],
|
||||
count=dup_request_count,
|
||||
pct_of_total=pct,
|
||||
recommendation=(
|
||||
"www/non-www, trailing slash 유무 등 중복 URL 변형이 크롤링되고 있습니다. "
|
||||
"301 리다이렉트로 canonical URL로 통합하고, "
|
||||
"rel=canonical 태그를 정확히 설정하세요."
|
||||
),
|
||||
)
|
||||
|
||||
# -- bot profiling --------------------------------------------------------
|
||||
|
||||
def profile_bots(
|
||||
self,
|
||||
bot_requests: list[tuple[LogEntry, BotIdentification]],
|
||||
) -> dict[str, BotProfile]:
|
||||
"""Generate per-bot behavior profiles."""
|
||||
bot_data: dict[str, dict] = defaultdict(lambda: {
|
||||
"urls": Counter(),
|
||||
"statuses": Counter(),
|
||||
"hours": Counter(),
|
||||
"days": set(),
|
||||
"depths": Counter(),
|
||||
"count": 0,
|
||||
})
|
||||
|
||||
for entry, bot in bot_requests:
|
||||
bd = bot_data[bot.name]
|
||||
bd["count"] += 1
|
||||
bd["urls"][entry.url] += 1
|
||||
bd["statuses"][str(entry.status_code)] += 1
|
||||
# URL depth = number of path segments
|
||||
depth = len([s for s in urlparse(entry.url).path.split("/") if s])
|
||||
bd["depths"][depth] += 1
|
||||
if entry.timestamp:
|
||||
bd["hours"][entry.timestamp.hour] += 1
|
||||
bd["days"].add(entry.timestamp.strftime("%Y-%m-%d"))
|
||||
|
||||
profiles: dict[str, BotProfile] = {}
|
||||
for name, bd in bot_data.items():
|
||||
days_active = len(bd["days"]) or 1
|
||||
rpd = bd["count"] / days_active
|
||||
# Top 3 peak hours
|
||||
top_hours = sorted(bd["hours"].items(), key=lambda x: -x[1])[:3]
|
||||
peak = [h for h, _ in top_hours]
|
||||
profiles[name] = BotProfile(
|
||||
name=name,
|
||||
total_requests=bd["count"],
|
||||
requests_per_day=rpd,
|
||||
crawl_depth_distribution=dict(sorted(bd["depths"].items())),
|
||||
peak_hours=peak,
|
||||
status_breakdown=dict(bd["statuses"]),
|
||||
top_crawled_urls=bd["urls"].most_common(TOP_N_URLS),
|
||||
unique_urls=len(bd["urls"]),
|
||||
days_active=days_active,
|
||||
)
|
||||
return profiles
|
||||
|
||||
# -- orphan detection -----------------------------------------------------
|
||||
|
||||
def detect_orphan_pages(
|
||||
self,
|
||||
crawled_urls: set[str],
|
||||
sitemap_urls: set[str],
|
||||
) -> dict[str, list[OrphanPage]]:
|
||||
"""Compare crawled URLs with sitemap URLs to find orphans."""
|
||||
in_sitemap_not_crawled = sitemap_urls - crawled_urls
|
||||
crawled_not_in_sitemap = crawled_urls - sitemap_urls
|
||||
|
||||
return {
|
||||
"in_sitemap_not_crawled": [
|
||||
OrphanPage(url=u, in_sitemap=True, crawled=False)
|
||||
for u in sorted(in_sitemap_not_crawled)
|
||||
],
|
||||
"crawled_not_in_sitemap": [
|
||||
OrphanPage(url=u, in_sitemap=False, crawled=True)
|
||||
for u in sorted(crawled_not_in_sitemap)
|
||||
],
|
||||
}
|
||||
|
||||
# -- efficiency score -----------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def calculate_efficiency_score(total_waste_pct: float) -> int:
|
||||
"""Calculate crawl efficiency score: 100 - waste%, capped at [0, 100]."""
|
||||
score = int(100 - total_waste_pct)
|
||||
return max(0, min(100, score))
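As a quick worked example of this scoring (using the waste total from the sample report in CLAUDE.md):

```python
# 8.5% total waste -> int(100 - 8.5) = 91; out-of-range values are clamped to 0-100.
CrawlBudgetAnalyzer.calculate_efficiency_score(8.5)    # 91
CrawlBudgetAnalyzer.calculate_efficiency_score(120.0)  # 0
```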
|
||||
|
||||
# -- recommendations ------------------------------------------------------
|
||||
|
||||
def generate_recommendations(
|
||||
self,
|
||||
waste: list[CrawlWaste],
|
||||
orphans: dict[str, list[OrphanPage]],
|
||||
bot_profiles: dict[str, BotProfile],
|
||||
) -> list[CrawlRecommendation]:
|
||||
"""Generate prioritized crawl budget optimization recommendations."""
|
||||
recs: list[CrawlRecommendation] = []
|
||||
|
||||
# Waste-based recommendations
|
||||
for w in waste:
|
||||
if w.pct_of_total > 5.0:
|
||||
priority = "critical"
|
||||
elif w.pct_of_total > 2.0:
|
||||
priority = "high"
|
||||
elif w.pct_of_total > 0.5:
|
||||
priority = "medium"
|
||||
else:
|
||||
priority = "low"
|
||||
|
||||
if w.waste_type == "parameter_urls" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="URL Parameters",
|
||||
priority=priority,
|
||||
action="robots.txt에 parameter URL 패턴 Disallow 규칙 추가",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 parameter URL이 크롤링되었습니다. "
|
||||
f"sort, filter, utm_* 등 불필요한 파라미터를 차단하세요."
|
||||
),
|
||||
))
|
||||
elif w.waste_type == "redirect_chains" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Redirect Chains",
|
||||
priority=priority,
|
||||
action="리다이렉트 체인을 단축하고 내부 링크 업데이트",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 리다이렉트 요청이 발생했습니다. "
|
||||
f"내부 링크를 최종 URL로 직접 연결하세요."
|
||||
),
|
||||
))
|
||||
elif w.waste_type == "soft_404s" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Soft 404s",
|
||||
priority=priority,
|
||||
action="Soft 404 페이지에 적절한 HTTP 상태 코드 또는 noindex 적용",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 Soft 404가 감지되었습니다. "
|
||||
f"적절한 404 응답 또는 noindex meta 태그를 설정하세요."
|
||||
),
|
||||
))
|
||||
elif w.waste_type == "duplicate_urls" and w.count > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Duplicate URLs",
|
||||
priority=priority,
|
||||
action="URL 정규화 및 canonical 태그 설정",
|
||||
impact=f"크롤 요청 {w.pct_of_total:.1f}% 절감 가능",
|
||||
details=(
|
||||
f"총 {w.count:,}건의 중복 URL 변형이 크롤링되었습니다. "
|
||||
f"www/non-www, trailing slash 통합을 진행하세요."
|
||||
),
|
||||
))
|
||||
|
||||
# Orphan page recommendations
|
||||
not_crawled = orphans.get("in_sitemap_not_crawled", [])
|
||||
not_in_sitemap = orphans.get("crawled_not_in_sitemap", [])
|
||||
|
||||
if len(not_crawled) > 0:
|
||||
pct = len(not_crawled) / max(len(self._sitemap_urls), 1) * 100
|
||||
priority = "critical" if pct > 30 else "high" if pct > 10 else "medium"
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Orphan Pages (Uncrawled)",
|
||||
priority=priority,
|
||||
action="사이트맵에 있으나 크롤링되지 않은 페이지의 내부 링크 강화",
|
||||
impact=f"사이트맵 URL의 {pct:.1f}%가 미크롤 상태",
|
||||
details=(
|
||||
f"총 {len(not_crawled):,}개 URL이 사이트맵에 있지만 "
|
||||
f"봇이 크롤링하지 않았습니다. 내부 링크를 추가하세요."
|
||||
),
|
||||
))
|
||||
|
||||
if len(not_in_sitemap) > 0:
|
||||
recs.append(CrawlRecommendation(
|
||||
category="Orphan Pages (Unlisted)",
|
||||
priority="medium",
|
||||
action="크롤링되었으나 사이트맵에 없는 페이지를 사이트맵에 추가 또는 차단",
|
||||
impact=f"{len(not_in_sitemap):,}개 URL이 사이트맵에 미등록",
|
||||
details=(
|
||||
f"봇이 크롤링한 {len(not_in_sitemap):,}개 URL이 "
|
||||
f"사이트맵에 포함되어 있지 않습니다. 유효한 페이지는 "
|
||||
f"사이트맵에 추가하고, 불필요한 페이지는 robots.txt로 차단하세요."
|
||||
),
|
||||
))
|
||||
|
||||
# Bot-specific recommendations
|
||||
for name, profile in bot_profiles.items():
|
||||
error_count = sum(
|
||||
v for k, v in profile.status_breakdown.items()
|
||||
if k.startswith("4") or k.startswith("5")
|
||||
)
|
||||
error_pct = (error_count / profile.total_requests * 100) if profile.total_requests else 0
|
||||
if error_pct > 10:
|
||||
recs.append(CrawlRecommendation(
|
||||
category=f"Bot Errors ({name})",
|
||||
priority="high" if error_pct > 20 else "medium",
|
||||
action=f"{name}의 4xx/5xx 오류율 {error_pct:.1f}% 개선 필요",
|
||||
impact=f"{name} 크롤 예산의 {error_pct:.1f}%가 오류에 소비",
|
||||
details=(
|
||||
f"{name}이(가) {error_count:,}건의 오류 응답을 받았습니다. "
|
||||
f"깨진 링크를 수정하고 서버 안정성을 개선하세요."
|
||||
),
|
||||
))
|
||||
|
||||
# Sort by priority
|
||||
priority_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
|
||||
recs.sort(key=lambda r: priority_order.get(r.priority, 4))
|
||||
return recs
|
||||
|
||||
# -- orchestrator ---------------------------------------------------------
|
||||
|
||||
def analyze(self, scope: str = "all") -> CrawlBudgetResult:
|
||||
"""Orchestrate the full crawl budget analysis."""
|
||||
# Load log data
|
||||
entries = self.load_log_data(self.log_file)
|
||||
if not entries:
|
||||
logger.warning("No bot entries found in log file.")
|
||||
|
||||
# Load sitemap if provided
|
||||
if self.sitemap_url:
|
||||
self.load_sitemap_urls(self.sitemap_url)
|
||||
|
||||
# Profile bots
|
||||
bot_profiles: dict[str, BotProfile] = {}
|
||||
if scope in ("all", "bots"):
|
||||
bot_profiles = self.profile_bots(entries)
|
||||
|
||||
# Identify waste
|
||||
waste: list[CrawlWaste] = []
|
||||
if scope in ("all", "waste"):
|
||||
waste.append(self.identify_parameter_waste(entries))
|
||||
waste.append(self.identify_redirect_chains(entries))
|
||||
waste.append(self.identify_soft_404s(entries))
|
||||
waste.append(self.identify_duplicate_crawls(entries))
|
||||
|
||||
total_waste_pct = sum(w.pct_of_total for w in waste)
|
||||
|
||||
# Detect orphan pages
|
||||
orphans: dict[str, list[OrphanPage]] = {
|
||||
"in_sitemap_not_crawled": [],
|
||||
"crawled_not_in_sitemap": [],
|
||||
}
|
||||
if scope in ("all", "orphans") and self._sitemap_urls:
|
||||
crawled_urls: set[str] = set()
|
||||
for entry, _ in entries:
|
||||
# Build full URL from path for comparison
|
||||
if self.target_url:
|
||||
parsed_target = urlparse(self.target_url)
|
||||
full = f"{parsed_target.scheme}://{parsed_target.netloc}{entry.url}"
|
||||
crawled_urls.add(self._normalize_url(full))
|
||||
else:
|
||||
crawled_urls.add(entry.url)
|
||||
orphans = self.detect_orphan_pages(crawled_urls, self._sitemap_urls)
|
||||
|
||||
# Efficiency score
|
||||
efficiency_score = self.calculate_efficiency_score(total_waste_pct)
|
||||
|
||||
# Recommendations
|
||||
recommendations = self.generate_recommendations(waste, orphans, bot_profiles)
|
||||
|
||||
# Date range from entries
|
||||
timestamps = [e.timestamp for e, _ in entries if e.timestamp]
|
||||
analysis_period = {}
|
||||
if timestamps:
|
||||
analysis_period = {
|
||||
"from": min(timestamps).strftime("%Y-%m-%d"),
|
||||
"to": max(timestamps).strftime("%Y-%m-%d"),
|
||||
}
|
||||
|
||||
return CrawlBudgetResult(
|
||||
log_file=self.log_file,
|
||||
analysis_period=analysis_period,
|
||||
total_bot_requests=len(entries),
|
||||
bots=bot_profiles,
|
||||
waste=waste,
|
||||
total_waste_pct=total_waste_pct,
|
||||
orphan_pages=orphans,
|
||||
recommendations=recommendations,
|
||||
efficiency_score=efficiency_score,
|
||||
timestamp=datetime.now().isoformat(),
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CLI
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze crawl budget efficiency and generate optimization recommendations.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--log-file",
|
||||
required=True,
|
||||
help="Path to server access log file",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--sitemap",
|
||||
default=None,
|
||||
help="URL of XML sitemap for orphan page detection",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--url",
|
||||
default=None,
|
||||
help="Target website URL (used for URL normalization and Ahrefs)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--scope",
|
||||
choices=["all", "waste", "orphans", "bots"],
|
||||
default="all",
|
||||
help="Analysis scope (default: all)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--ahrefs",
|
||||
action="store_true",
|
||||
help="Include Ahrefs page history comparison (requires MCP tool)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json",
|
||||
action="store_true",
|
||||
help="Output in JSON format",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
default=None,
|
||||
help="Write output to file instead of stdout",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Validate log file exists
|
||||
if not Path(args.log_file).exists():
|
||||
logger.error(f"Log file not found: {args.log_file}")
|
||||
sys.exit(1)
|
||||
|
||||
analyzer = CrawlBudgetAnalyzer(
|
||||
log_file=args.log_file,
|
||||
sitemap_url=args.sitemap,
|
||||
target_url=args.url,
|
||||
)
|
||||
|
||||
result = analyzer.analyze(scope=args.scope)
|
||||
|
||||
if args.json:
|
||||
output_data = result.to_dict()
|
||||
output_str = json.dumps(output_data, indent=2, ensure_ascii=False)
|
||||
else:
|
||||
lines = _format_text_report(result)
|
||||
output_str = "\n".join(lines)
|
||||
|
||||
if args.output:
|
||||
Path(args.output).write_text(output_str, encoding="utf-8")
|
||||
logger.info(f"Output written to {args.output}")
|
||||
else:
|
||||
print(output_str)
|
||||
|
||||
|
||||
def _format_text_report(result: CrawlBudgetResult) -> list[str]:
|
||||
"""Format the analysis result as a human-readable text report."""
|
||||
lines = [
|
||||
"=" * 70,
|
||||
"Crawl Budget Analysis Report",
|
||||
"=" * 70,
|
||||
f"Log File: {result.log_file}",
|
||||
f"Total Bot Requests: {result.total_bot_requests:,}",
|
||||
f"Efficiency Score: {result.efficiency_score}/100",
|
||||
f"Total Waste: {result.total_waste_pct:.1f}%",
|
||||
]
|
||||
if result.analysis_period:
|
||||
lines.append(
|
||||
f"Period: {result.analysis_period.get('from', 'N/A')} ~ "
|
||||
f"{result.analysis_period.get('to', 'N/A')}"
|
||||
)
|
||||
lines.append("")
|
||||
|
||||
# Bot profiles
|
||||
if result.bots:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Bot Profiles")
|
||||
lines.append("-" * 60)
|
||||
for name, profile in sorted(result.bots.items(), key=lambda x: -x[1].total_requests):
|
||||
lines.append(f"\n [{name.upper()}]")
|
||||
lines.append(f" Requests: {profile.total_requests:,}")
|
||||
lines.append(f" Unique URLs: {profile.unique_urls:,}")
|
||||
lines.append(f" Requests/Day: {profile.requests_per_day:,.1f}")
|
||||
lines.append(f" Days Active: {profile.days_active}")
|
||||
lines.append(f" Peak Hours: {profile.peak_hours}")
|
||||
lines.append(f" Status: {profile.status_breakdown}")
|
||||
lines.append("")
|
||||
|
||||
# Waste breakdown
|
||||
if result.waste:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Crawl Waste Breakdown")
|
||||
lines.append("-" * 60)
|
||||
for w in result.waste:
|
||||
if w.count > 0:
|
||||
lines.append(f"\n [{w.waste_type}]")
|
||||
lines.append(f" Count: {w.count:,} ({w.pct_of_total:.1f}%)")
|
||||
lines.append(f" Recommendation: {w.recommendation}")
|
||||
if w.urls:
|
||||
lines.append(f" Sample URLs:")
|
||||
for u in w.urls[:5]:
|
||||
lines.append(f" - {u}")
|
||||
lines.append("")
|
||||
|
||||
# Orphan pages
|
||||
not_crawled = result.orphan_pages.get("in_sitemap_not_crawled", [])
|
||||
not_in_sitemap = result.orphan_pages.get("crawled_not_in_sitemap", [])
|
||||
if not_crawled or not_in_sitemap:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Orphan Pages")
|
||||
lines.append("-" * 60)
|
||||
if not_crawled:
|
||||
lines.append(f"\n In Sitemap but Not Crawled: {len(not_crawled):,}")
|
||||
for op in not_crawled[:10]:
|
||||
lines.append(f" - {op.url}")
|
||||
if not_in_sitemap:
|
||||
lines.append(f"\n Crawled but Not in Sitemap: {len(not_in_sitemap):,}")
|
||||
for op in not_in_sitemap[:10]:
|
||||
lines.append(f" - {op.url}")
|
||||
lines.append("")
|
||||
|
||||
# Recommendations
|
||||
if result.recommendations:
|
||||
lines.append("-" * 60)
|
||||
lines.append("Recommendations")
|
||||
lines.append("-" * 60)
|
||||
for i, rec in enumerate(result.recommendations, 1):
|
||||
lines.append(f"\n {i}. [{rec.priority.upper()}] {rec.category}")
|
||||
lines.append(f" Action: {rec.action}")
|
||||
lines.append(f" Impact: {rec.impact}")
|
||||
lines.append(f" Details: {rec.details}")
|
||||
|
||||
lines.append("")
|
||||
lines.append(f"Generated: {result.timestamp}")
|
||||
return lines
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
613
custom-skills/32-seo-crawl-budget/code/scripts/log_parser.py
Normal file
@@ -0,0 +1,613 @@
|
||||
"""
|
||||
Log Parser - Server access log parser with bot identification
|
||||
=============================================================
|
||||
Purpose: Parse Apache/Nginx/CloudFront access logs, identify search engine
|
||||
bots, extract crawl data, and generate per-bot statistics.
|
||||
Python: 3.10+
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import bz2
|
||||
import gzip
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Generator, TextIO
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s - %(levelname)s - %(message)s",
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants: bot user-agent patterns
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
BOT_PATTERNS: list[tuple[str, str, str]] = [
|
||||
# (canonical name, regex pattern, category)
|
||||
("googlebot", r"Googlebot(?:-Image|-News|-Video)?/", "search_engine"),
|
||||
("googlebot-adsbot", r"AdsBot-Google", "search_engine"),
|
||||
("googlebot-mediapartners", r"Mediapartners-Google", "search_engine"),
|
||||
("yeti", r"Yeti/", "search_engine"),
|
||||
("bingbot", r"bingbot/", "search_engine"),
|
||||
("daumoa", r"Daumoa", "search_engine"),
|
||||
("applebot", r"Applebot/", "search_engine"),
|
||||
("duckduckbot", r"DuckDuckBot/", "search_engine"),
|
||||
("baiduspider", r"Baiduspider", "search_engine"),
|
||||
("yandexbot", r"YandexBot/", "search_engine"),
|
||||
("sogou", r"Sogou", "search_engine"),
|
||||
("seznambot", r"SeznamBot/", "search_engine"),
|
||||
("ahrefsbot", r"AhrefsBot/", "seo_tool"),
|
||||
("semrushbot", r"SemrushBot/", "seo_tool"),
|
||||
("mj12bot", r"MJ12bot/", "seo_tool"),
|
||||
("dotbot", r"DotBot/", "seo_tool"),
|
||||
("rogerbot", r"rogerbot/", "seo_tool"),
|
||||
("screaming frog", r"Screaming Frog SEO Spider", "seo_tool"),
|
||||
]
|
||||
|
||||
COMPILED_BOT_PATTERNS: list[tuple[str, re.Pattern, str]] = [
|
||||
(name, re.compile(pattern, re.IGNORECASE), category)
|
||||
for name, pattern, category in BOT_PATTERNS
|
||||
]
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Regex patterns for each log format
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
NGINX_COMBINED_RE = re.compile(
|
||||
r'(?P<ip>[\d.:a-fA-F]+)\s+-\s+(?P<user>\S+)\s+'
|
||||
r'\[(?P<timestamp>[^\]]+)\]\s+'
|
||||
r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
|
||||
r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
|
||||
r'"(?P<referer>[^"]*)"\s+'
|
||||
r'"(?P<user_agent>[^"]*)"'
|
||||
)
|
||||
|
||||
APACHE_COMBINED_RE = re.compile(
|
||||
r'(?P<ip>[\d.:a-fA-F]+)\s+\S+\s+(?P<user>\S+)\s+'
|
||||
r'\[(?P<timestamp>[^\]]+)\]\s+'
|
||||
r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
|
||||
r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
|
||||
r'"(?P<referer>[^"]*)"\s+'
|
||||
r'"(?P<user_agent>[^"]*)"'
|
||||
)
|
||||
|
||||
CLOUDFRONT_FIELDS = [
|
||||
"date", "time", "x_edge_location", "sc_bytes", "c_ip",
|
||||
"cs_method", "cs_host", "cs_uri_stem", "sc_status",
|
||||
"cs_referer", "cs_user_agent", "cs_uri_query",
|
||||
"cs_cookie", "x_edge_result_type", "x_edge_request_id",
|
||||
"x_host_header", "cs_protocol", "cs_bytes",
|
||||
"time_taken", "x_forwarded_for", "ssl_protocol",
|
||||
"ssl_cipher", "x_edge_response_result_type", "cs_protocol_version",
|
||||
]
|
||||
|
||||
# Timestamp formats
|
||||
NGINX_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
|
||||
APACHE_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
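To make the combined-format regex concrete, a self-contained check against a made-up Nginx log line; the pattern is repeated here so the snippet runs standalone, and the field names mirror the named groups defined above.

```python
import re

NGINX_COMBINED_RE = re.compile(
    r'(?P<ip>[\d.:a-fA-F]+)\s+-\s+(?P<user>\S+)\s+'
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
    r'"(?P<referer>[^"]*)"\s+'
    r'"(?P<user_agent>[^"]*)"'
)

# Illustrative log line for a Googlebot request.
sample = (
    '66.249.66.1 - - [15/Jan/2025:09:30:12 +0900] '
    '"GET /products?sort=price HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

m = NGINX_COMBINED_RE.match(sample)
print(m.group("ip"), m.group("status"), m.group("url"))
# -> 66.249.66.1 200 /products?sort=price
```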
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Data classes
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@dataclass
|
||||
class LogEntry:
|
||||
"""A single parsed log entry."""
|
||||
timestamp: datetime | None
|
||||
ip: str
|
||||
method: str
|
||||
url: str
|
||||
status_code: int
|
||||
response_size: int
|
||||
user_agent: str
|
||||
referer: str
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
d = asdict(self)
|
||||
if self.timestamp:
|
||||
d["timestamp"] = self.timestamp.isoformat()
|
||||
return d
|
||||
|
||||
|
||||
@dataclass
|
||||
class BotIdentification:
|
||||
"""Bot identification result."""
|
||||
name: str
|
||||
user_agent_pattern: str
|
||||
category: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class BotStats:
|
||||
"""Aggregated statistics for a single bot."""
|
||||
name: str
|
||||
total_requests: int = 0
|
||||
unique_urls: int = 0
|
||||
status_distribution: dict[str, int] = field(default_factory=dict)
|
||||
top_urls: list[tuple[str, int]] = field(default_factory=list)
|
||||
hourly_distribution: dict[int, int] = field(default_factory=dict)
|
||||
daily_distribution: dict[str, int] = field(default_factory=dict)
|
||||
avg_response_size: float = 0.0
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"name": self.name,
|
||||
"total_requests": self.total_requests,
|
||||
"unique_urls": self.unique_urls,
|
||||
"status_distribution": self.status_distribution,
|
||||
"top_urls": [{"url": u, "count": c} for u, c in self.top_urls],
|
||||
"hourly_distribution": self.hourly_distribution,
|
||||
"daily_distribution": self.daily_distribution,
|
||||
"avg_response_size": round(self.avg_response_size, 1),
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class LogParseResult:
|
||||
"""Complete log parsing result."""
|
||||
log_file: str
|
||||
format_detected: str
|
||||
total_lines: int
|
||||
parsed_lines: int
|
||||
bot_entries: int
|
||||
date_range: dict[str, str]
|
||||
bots: dict[str, BotStats]
|
||||
errors: int
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"log_file": self.log_file,
|
||||
"format_detected": self.format_detected,
|
||||
"total_lines": self.total_lines,
|
||||
"parsed_lines": self.parsed_lines,
|
||||
"bot_entries": self.bot_entries,
|
||||
"date_range": self.date_range,
|
||||
"bots": {name: stats.to_dict() for name, stats in self.bots.items()},
|
||||
"errors": self.errors,
|
||||
}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# LogParser class
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class LogParser:
|
||||
"""Parse server access logs and identify search engine bot traffic."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
log_file: str,
|
||||
fmt: str = "auto",
|
||||
streaming: bool = False,
|
||||
):
|
||||
self.log_file = log_file
|
||||
self.fmt = fmt
|
||||
self.streaming = streaming
|
||||
self._detected_format: str | None = None
|
||||
self._parse_errors = 0
|
||||
|
||||
# -- format detection -----------------------------------------------------
|
||||
|
||||
def _detect_format(self, line: str) -> str:
|
||||
"""Auto-detect log format from a sample line."""
|
||||
if line.startswith("#"):
|
||||
return "cloudfront"
|
||||
if NGINX_COMBINED_RE.match(line):
|
||||
return "nginx"
|
||||
if APACHE_COMBINED_RE.match(line):
|
||||
return "apache"
|
||||
# Fallback: try tab-separated (CloudFront without header)
|
||||
if "\t" in line and line.count("\t") >= 10:
|
||||
return "cloudfront"
|
||||
return "nginx"
|
||||
|
||||
# -- line parsers ---------------------------------------------------------
|
||||
|
||||
def _parse_nginx_combined(self, line: str) -> LogEntry | None:
|
||||
"""Parse a single Nginx combined format log line."""
|
||||
m = NGINX_COMBINED_RE.match(line)
|
||||
if not m:
|
||||
return None
|
||||
ts = None
|
||||
try:
|
||||
ts = datetime.strptime(m.group("timestamp"), NGINX_TS_FORMAT)
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
size_raw = m.group("size")
|
||||
size = int(size_raw) if size_raw != "-" else 0
|
||||
return LogEntry(
|
||||
timestamp=ts,
|
||||
ip=m.group("ip"),
|
||||
method=m.group("method"),
|
||||
url=m.group("url"),
|
||||
status_code=int(m.group("status")),
|
||||
response_size=size,
|
||||
user_agent=m.group("user_agent"),
|
||||
referer=m.group("referer"),
|
||||
)
|
||||
|
||||
def _parse_apache_combined(self, line: str) -> LogEntry | None:
|
||||
"""Parse a single Apache combined format log line."""
|
||||
m = APACHE_COMBINED_RE.match(line)
|
||||
if not m:
|
||||
return None
|
||||
ts = None
|
||||
try:
|
||||
ts = datetime.strptime(m.group("timestamp"), APACHE_TS_FORMAT)
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
size_raw = m.group("size")
|
||||
size = int(size_raw) if size_raw != "-" else 0
|
||||
return LogEntry(
|
||||
timestamp=ts,
|
||||
ip=m.group("ip"),
|
||||
method=m.group("method"),
|
||||
url=m.group("url"),
|
||||
status_code=int(m.group("status")),
|
||||
response_size=size,
|
||||
user_agent=m.group("user_agent"),
|
||||
referer=m.group("referer"),
|
||||
)
|
||||
|
||||
def _parse_cloudfront(self, line: str) -> LogEntry | None:
|
||||
"""Parse a CloudFront tab-separated log line."""
|
||||
if line.startswith("#"):
|
||||
return None
|
||||
parts = line.strip().split("\t")
|
||||
if len(parts) < 13:
|
||||
return None
|
||||
ts = None
|
||||
try:
|
||||
ts = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
|
||||
except (ValueError, IndexError):
|
||||
pass
|
||||
try:
|
||||
status = int(parts[8])
|
||||
except (ValueError, IndexError):
|
||||
status = 0
|
||||
try:
|
||||
size = int(parts[3])
|
||||
except (ValueError, IndexError):
|
||||
size = 0
|
||||
url = parts[7] if len(parts) > 7 else ""
|
||||
query = parts[11] if len(parts) > 11 else ""
|
||||
if query and query != "-":
|
||||
url = f"{url}?{query}"
|
||||
ua = parts[10] if len(parts) > 10 else ""
|
||||
ua = ua.replace("%20", " ").replace("%2520", " ")
|
||||
referer = parts[9] if len(parts) > 9 else ""
|
||||
return LogEntry(
|
||||
timestamp=ts,
|
||||
ip=parts[4] if len(parts) > 4 else "",
|
||||
method=parts[5] if len(parts) > 5 else "",
|
||||
url=url,
|
||||
status_code=status,
|
||||
response_size=size,
|
||||
user_agent=ua,
|
||||
referer=referer,
|
||||
)
|
||||
|
||||
def _parse_line(self, line: str, fmt: str) -> LogEntry | None:
|
||||
"""Route to the correct parser based on format."""
|
||||
parsers = {
|
||||
"nginx": self._parse_nginx_combined,
|
||||
"apache": self._parse_apache_combined,
|
||||
"cloudfront": self._parse_cloudfront,
|
||||
}
|
||||
parser = parsers.get(fmt, self._parse_nginx_combined)
|
||||
return parser(line)
|
||||
|
||||
# -- bot identification ---------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def identify_bot(user_agent: str) -> BotIdentification | None:
|
||||
"""Match user-agent against known bot patterns."""
|
||||
if not user_agent or user_agent == "-":
|
||||
return None
|
||||
for name, pattern, category in COMPILED_BOT_PATTERNS:
|
||||
if pattern.search(user_agent):
|
||||
return BotIdentification(
|
||||
name=name,
|
||||
user_agent_pattern=pattern.pattern,
|
||||
category=category,
|
||||
)
|
||||
# Heuristic: generic bot detection via common keywords
|
||||
ua_lower = user_agent.lower()
|
||||
bot_keywords = ["bot", "spider", "crawler", "scraper", "fetch"]
|
||||
for kw in bot_keywords:
|
||||
if kw in ua_lower:
|
||||
return BotIdentification(
|
||||
name="other",
|
||||
user_agent_pattern=kw,
|
||||
category="other",
|
||||
)
|
||||
return None
|
||||
|
||||
# -- file handling --------------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _open_file(path: str) -> TextIO:
|
||||
"""Open plain text, .gz, or .bz2 log files."""
|
||||
p = Path(path)
|
||||
if p.suffix == ".gz":
|
||||
return gzip.open(path, "rt", encoding="utf-8", errors="replace")
|
||||
if p.suffix == ".bz2":
|
||||
return bz2.open(path, "rt", encoding="utf-8", errors="replace")
|
||||
return open(path, "r", encoding="utf-8", errors="replace")
|
||||
|
||||
# -- streaming parser -----------------------------------------------------
|
||||
|
||||
def parse_streaming(
|
||||
self,
|
||||
filter_bot: str | None = None,
|
||||
) -> Generator[tuple[LogEntry, BotIdentification], None, None]:
|
||||
"""Generator-based streaming parser for large files."""
|
||||
fmt = self.fmt
|
||||
first_line_checked = False
|
||||
|
||||
fh = self._open_file(self.log_file)
|
||||
try:
|
||||
for line in fh:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
if not first_line_checked and fmt == "auto":
|
||||
fmt = self._detect_format(line)
|
||||
self._detected_format = fmt
|
||||
first_line_checked = True
|
||||
entry = self._parse_line(line, fmt)
|
||||
if entry is None:
|
||||
self._parse_errors += 1
|
||||
continue
|
||||
bot = self.identify_bot(entry.user_agent)
|
||||
if bot is None:
|
||||
continue
|
||||
if filter_bot and bot.name != filter_bot.lower():
|
||||
continue
|
||||
yield entry, bot
|
||||
finally:
|
||||
fh.close()

    # -- full parse -----------------------------------------------------------

    def parse(
        self,
        filter_bot: str | None = None,
        date_from: datetime | None = None,
        date_to: datetime | None = None,
    ) -> list[tuple[LogEntry, BotIdentification]]:
        """Full parse with optional date and bot filters."""
        results: list[tuple[LogEntry, BotIdentification]] = []
        for entry, bot in self.parse_streaming(filter_bot):
            if date_from and entry.timestamp and entry.timestamp < date_from:
                continue
            if date_to and entry.timestamp and entry.timestamp > date_to:
                continue
            results.append((entry, bot))
        return results

    # -- statistics -----------------------------------------------------------

    @staticmethod
    def get_bot_stats(
        entries: list[tuple[LogEntry, BotIdentification]],
    ) -> dict[str, BotStats]:
        """Aggregate per-bot statistics from parsed entries."""
        bot_data: dict[str, dict] = defaultdict(lambda: {
            "urls": Counter(),
            "statuses": Counter(),
            "hours": Counter(),
            "days": Counter(),
            "sizes": [],
            "count": 0,
        })

        for entry, bot in entries:
            bd = bot_data[bot.name]
            bd["count"] += 1
            bd["urls"][entry.url] += 1
            bd["statuses"][str(entry.status_code)] += 1
            bd["sizes"].append(entry.response_size)
            if entry.timestamp:
                bd["hours"][entry.timestamp.hour] += 1
                day_key = entry.timestamp.strftime("%Y-%m-%d")
                bd["days"][day_key] += 1

        stats: dict[str, BotStats] = {}
        for name, bd in bot_data.items():
            avg_size = sum(bd["sizes"]) / len(bd["sizes"]) if bd["sizes"] else 0.0
            top_20 = bd["urls"].most_common(20)
            stats[name] = BotStats(
                name=name,
                total_requests=bd["count"],
                unique_urls=len(bd["urls"]),
                status_distribution=dict(bd["statuses"]),
                top_urls=top_20,
                hourly_distribution=dict(sorted(bd["hours"].items())),
                daily_distribution=dict(sorted(bd["days"].items())),
                avg_response_size=avg_size,
            )
        return stats
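
    # Result shape (values illustrative):
    #   {"googlebot": BotStats(total_requests=12345, unique_urls=4321,
    #                          status_distribution={"200": 11000, "404": 300},
    #                          top_urls=[("/", 980), ...], ...), ...}
    # Keys are the bot names assigned by identify_bot(); non-bot traffic never
    # reaches this method because parse_streaming() drops it.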

    # -- orchestrator ---------------------------------------------------------

    def parse_and_analyze(
        self,
        filter_bot: str | None = None,
        date_from: datetime | None = None,
        date_to: datetime | None = None,
    ) -> LogParseResult:
        """Orchestrate parsing and statistics generation."""
        entries = self.parse(filter_bot, date_from, date_to)
        bot_stats = self.get_bot_stats(entries)

        # Determine date range
        timestamps = [e.timestamp for e, _ in entries if e.timestamp]
        date_range = {}
        if timestamps:
            date_range = {
                "from": min(timestamps).isoformat(),
                "to": max(timestamps).isoformat(),
            }

        # Count total lines for context (second pass over the log file)
        total_lines = 0
        fh = self._open_file(self.log_file)
        try:
            for _ in fh:
                total_lines += 1
        finally:
            fh.close()

        return LogParseResult(
            log_file=self.log_file,
            format_detected=self._detected_format or self.fmt,
            total_lines=total_lines,
            parsed_lines=total_lines - self._parse_errors,
            bot_entries=len(entries),
            date_range=date_range,
            bots=bot_stats,
            errors=self._parse_errors,
        )


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def _parse_date(val: str) -> datetime:
    """Parse a date string in YYYY-MM-DD format."""
    return datetime.strptime(val, "%Y-%m-%d")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Parse server access logs and identify search engine bot traffic.",
    )
    parser.add_argument(
        "--log-file",
        required=True,
        help="Path to access log file (plain, .gz, .bz2)",
    )
    parser.add_argument(
        "--format",
        dest="fmt",
        choices=["auto", "nginx", "apache", "cloudfront"],
        default="auto",
        help="Log format (default: auto-detect)",
    )
    parser.add_argument(
        "--bot",
        default=None,
        help="Filter results to a specific bot (e.g., googlebot, yeti, bingbot, daumoa)",
    )
    parser.add_argument(
        "--streaming",
        action="store_true",
        help="Use streaming parser for large files (prints entries incrementally)",
    )
    parser.add_argument(
        "--date-from",
        default=None,
        help="Filter entries from date (YYYY-MM-DD)",
    )
    parser.add_argument(
        "--date-to",
        default=None,
        help="Filter entries to date (YYYY-MM-DD)",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output in JSON format",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="Write output to file instead of stdout",
    )
    args = parser.parse_args()

    # Validate file exists
    if not Path(args.log_file).exists():
        logger.error(f"Log file not found: {args.log_file}")
        sys.exit(1)

    date_from = _parse_date(args.date_from) if args.date_from else None
    date_to = _parse_date(args.date_to) if args.date_to else None

    lp = LogParser(log_file=args.log_file, fmt=args.fmt, streaming=args.streaming)

    if args.streaming and not args.json:
        # Streaming mode: print entries as they are parsed.
        # (Combining --streaming with --json falls through to the full analysis
        # below, since JSON output needs the aggregated result.)
        count = 0
        for entry, bot in lp.parse_streaming(args.bot):
            if date_from and entry.timestamp and entry.timestamp < date_from:
                continue
            if date_to and entry.timestamp and entry.timestamp > date_to:
                continue
            ts_str = entry.timestamp.isoformat() if entry.timestamp else "N/A"
            print(
                f"[{bot.name}] {ts_str} {entry.status_code} "
                f"{entry.method} {entry.url} ({entry.response_size}B)"
            )
            count += 1
        print(f"\n--- Total bot requests: {count} ---")
        return

    # Full analysis mode
    result = lp.parse_and_analyze(
        filter_bot=args.bot,
        date_from=date_from,
        date_to=date_to,
    )

    if args.json:
        output_data = result.to_dict()
        output_str = json.dumps(output_data, indent=2, ensure_ascii=False)
    else:
        lines = [
            f"Log File: {result.log_file}",
            f"Format: {result.format_detected}",
            f"Total Lines: {result.total_lines:,}",
            f"Parsed Lines: {result.parsed_lines:,}",
            f"Bot Entries: {result.bot_entries:,}",
            f"Parse Errors: {result.errors:,}",
        ]
        if result.date_range:
            lines.append(f"Date Range: {result.date_range.get('from', 'N/A')} to {result.date_range.get('to', 'N/A')}")
        lines.append("")
        lines.append("=" * 60)
        lines.append("Bot Statistics")
        lines.append("=" * 60)
        for name, stats in sorted(result.bots.items(), key=lambda x: -x[1].total_requests):
            lines.append(f"\n--- {name.upper()} ---")
            lines.append(f" Requests: {stats.total_requests:,}")
            lines.append(f" Unique URLs: {stats.unique_urls:,}")
            lines.append(f" Avg Response Size: {stats.avg_response_size:,.0f} bytes")
            lines.append(f" Status Distribution: {stats.status_distribution}")
            lines.append(" Top 10 URLs:")
            for url, cnt in stats.top_urls[:10]:
                lines.append(f" {cnt:>6,} | {url}")
            if stats.hourly_distribution:
                peak_hour = max(stats.hourly_distribution, key=stats.hourly_distribution.get)
                lines.append(f" Peak Hour: {peak_hour}:00 ({stats.hourly_distribution[peak_hour]:,} reqs)")
        output_str = "\n".join(lines)

    if args.output:
        Path(args.output).write_text(output_str, encoding="utf-8")
        logger.info(f"Output written to {args.output}")
    else:
        print(output_str)


if __name__ == "__main__":
    main()
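
For reference, a minimal sketch of driving the parser from Python instead of the CLI, assuming `log_parser.py` is importable as a module; the constructor arguments and result fields used here are the ones shown above:

```python
from log_parser import LogParser  # assumed import path; adjust to your layout

lp = LogParser(log_file="access.log.gz", fmt="auto", streaming=True)
result = lp.parse_and_analyze(filter_bot="googlebot")

print(f"{result.format_detected}: {result.bot_entries:,} bot requests")
for name, stats in result.bots.items():
    print(f"{name}: {stats.total_requests:,} requests, {stats.unique_urls:,} unique URLs")
    for url, hits in stats.top_urls[:5]:
        print(f"  {hits:>6,}  {url}")
```

The CLI's `--date-from`/`--date-to` filters map to the `date_from`/`date_to` keyword arguments of `parse_and_analyze()`.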
@@ -0,0 +1,10 @@
# 32-seo-crawl-budget dependencies
requests>=2.31.0
aiohttp>=3.9.0
pandas>=2.1.0
beautifulsoup4>=4.12.0
lxml>=5.1.0
tenacity>=8.2.0
tqdm>=4.66.0
python-dotenv>=1.0.0
rich>=13.7.0