Add SEO skills 19-28, 31-32 with full Python implementations
12 new skills: Keyword Strategy, SERP Analysis, Position Tracking, Link Building, Content Strategy, E-Commerce SEO, KPI Framework, International SEO, AI Visibility, Knowledge Graph, Competitor Intel, and Crawl Budget. ~20K lines of Python across 25 domain scripts. Updated skill 11 pipeline table and repo CLAUDE.md. Enhanced skill 18 local SEO workflow from jamie.clinic audit. Note: Skill 26 hreflang_validator.py pending (content filter block). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
custom-skills/32-seo-crawl-budget/code/CLAUDE.md (new file, 178 lines)
# CLAUDE.md

## Overview

Crawl budget optimization tool for analyzing server access logs and identifying crawl budget waste. Parses Apache/Nginx/CloudFront access logs, identifies search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao), profiles per-bot crawl behavior, detects crawl waste (parameter URLs, low-value pages, redirect chains), identifies orphan pages, and generates crawl efficiency recommendations. Uses a streaming parser for large log files.
## Quick Start

```bash
pip install -r scripts/requirements.txt

# Parse access logs
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file /var/log/nginx/access.log --sitemap https://example.com/sitemap.xml --json
```
## Scripts

| Script | Purpose | Key Output |
|--------|---------|------------|
| `log_parser.py` | Parse server access logs, identify bots, extract crawl data | Bot identification, request patterns, status codes |
| `crawl_budget_analyzer.py` | Analyze crawl budget efficiency and generate recommendations | Waste identification, orphan pages, optimization plan |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
## Log Parser

```bash
# Parse Nginx combined log format
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Parse Apache combined log format
python scripts/log_parser.py --log-file /var/log/apache2/access.log --format apache --json

# Parse CloudFront logs
python scripts/log_parser.py --log-file cloudfront-log.gz --format cloudfront --json

# Filter by specific bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json

# Parse gzipped logs
python scripts/log_parser.py --log-file access.log.gz --json

# Process large files in streaming mode
python scripts/log_parser.py --log-file access.log --streaming --json
```
**Capabilities**:
- Support for common log formats:
  - Nginx combined format
  - Apache combined format
  - CloudFront format
  - Custom format via regex
- Bot identification by User-Agent:
  - Googlebot (and variants: Googlebot-Image, Googlebot-News, Googlebot-Video, AdsBot-Google)
  - Yeti (Naver's crawler)
  - Bingbot
  - Daumoa (Kakao/Daum crawler)
  - Other bots (Applebot, DuckDuckBot, Baiduspider, etc.)
- Request data extraction (timestamp, IP, URL, status code, response size, user-agent, referer)
- Streaming parser for files >1GB
- Gzip/bzip2 compressed log support
- Date range filtering
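The combined-format parsing and User-Agent matching above can be sketched as follows. The regex, the `BOT_SIGNATURES` table, and the function names are illustrative, not the actual `log_parser.py` internals; note also that User-Agent strings are trivially spoofed, so production bot verification would add reverse-DNS checks on top of this:

```python
import re
from typing import Optional

# Nginx/Apache "combined" format:
# IP - user [time] "METHOD url PROTO" status size "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# User-Agent substring -> canonical bot name (illustrative subset)
BOT_SIGNATURES = {
    "adsbot-google": "googlebot",
    "googlebot": "googlebot",
    "yeti": "yeti",        # Naver's crawler
    "bingbot": "bingbot",
    "daumoa": "daumoa",    # Kakao/Daum crawler
    "applebot": "other",
    "duckduckbot": "other",
    "baiduspider": "other",
}

def identify_bot(user_agent: str) -> Optional[str]:
    """Return a canonical bot name, or None for non-bot traffic."""
    ua = user_agent.lower()
    for signature, name in BOT_SIGNATURES.items():
        if signature in ua:
            return name
    return None

def parse_line(line: str) -> Optional[dict]:
    """Parse one combined-format log line into a dict, tagging the bot."""
    m = COMBINED_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["bot"] = identify_bot(rec["ua"])
    return rec
```

For large files, the same `parse_line` can be applied line-by-line over a file iterator (or `gzip.open`), which is what "streaming mode" amounts to: constant memory regardless of log size.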
## Crawl Budget Analyzer

```bash
# Full crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json

# Waste identification only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json

# Orphan page detection
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --scope orphans --json

# Per-bot profiling
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

# With Ahrefs page history comparison
python scripts/crawl_budget_analyzer.py --log-file access.log --url https://example.com --ahrefs --json
```
**Capabilities**:
- Crawl budget waste identification:
  - Parameter URLs consuming crawl budget (?sort=, ?filter=, ?page=, ?utm_*)
  - Low-value pages (thin content, noindex pages being crawled)
  - Redirect chains consuming multiple crawls
  - Soft 404 pages (200 status but error content)
  - Duplicate URLs (www/non-www, http/https, trailing-slash variants)
- Per-bot behavior profiling:
  - Crawl frequency (requests/day, requests/hour)
  - Crawl depth distribution
  - Status code distribution per bot
  - Most crawled URLs per bot
  - Crawl pattern analysis (time of day, day of week)
- Orphan page detection:
  - Pages in sitemap but never crawled by bots
  - Pages crawled but not in sitemap
  - Crawled pages with no internal links
- Crawl efficiency recommendations:
  - robots.txt optimization suggestions
  - URL parameter handling recommendations
  - Noindex/nofollow suggestions for low-value pages
  - Redirect chain resolution priorities
  - Internal linking improvements for orphan pages
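The parameter-URL waste check above reduces to inspecting each crawled URL's query string. A minimal sketch, assuming the parameter names listed in the capabilities (`sort`, `filter`, `page`, `utm_*`) and hypothetical function names:

```python
from urllib.parse import urlsplit, parse_qsl

# Query parameters that typically generate crawlable duplicates
# (illustrative set matching the examples above, not the tool's full list)
WASTE_PARAMS = {"sort", "filter", "page"}

def is_parameter_waste(url: str) -> bool:
    """True if the query string suggests a faceted/tracking duplicate."""
    query = urlsplit(url).query
    for key, _value in parse_qsl(query, keep_blank_values=True):
        if key in WASTE_PARAMS or key.startswith("utm_"):
            return True
    return False

def waste_summary(crawled_urls: list) -> dict:
    """Count wasted crawls and their share of all crawls."""
    wasted = [u for u in crawled_urls if is_parameter_waste(u)]
    pct = round(100 * len(wasted) / len(crawled_urls), 1) if crawled_urls else 0.0
    return {"count": len(wasted), "pct_of_crawls": pct}
```

Running this over the bot-filtered request log yields the `parameter_urls` entry of the `waste` object shown in the output format below; redirect-chain and soft-404 detection need status codes and content heuristics rather than URL inspection alone.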
## Data Sources

| Source | Purpose |
|--------|---------|
| Server access logs | Primary crawl data |
| XML sitemap | Reference for expected crawlable pages |
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
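Combining the first two sources, the orphan checks reduce to set differences between sitemap URLs and bot-crawled URLs. A sketch with stdlib XML parsing; `sitemap_urls` and `orphan_report` are illustrative names, and sitemap-index files (nested sitemaps) are not handled here:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> set:
    """Extract <loc> values from a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)}

def orphan_report(sitemap: set, crawled: set) -> dict:
    """Two-way diff between sitemap URLs and URLs bots actually requested."""
    return {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
```

The third orphan category (crawled pages with no internal links) needs link data from a site crawl, which logs and sitemaps alone cannot provide.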
## Output Format

```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000},
      "top_crawled_urls": [...]
    },
    "yeti": {"requests": 35000, ...},
    "bingbot": {"requests": 20000, ...},
    "daumoa": {"requests": 15000, ...}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [...],
    "crawled_not_in_sitemap": [...]
  },
  "recommendations": [...],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
```
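The document does not define how `efficiency_score` is derived. One plausible shape, purely as an assumption, is a weighted penalty on the waste percentages, with heavier weights for costlier waste types; both the formula and the default weights below are illustrative, not the tool's actual calibration:

```python
def efficiency_score(waste: dict, weights: dict = None) -> int:
    """ASSUMED scoring sketch: 100 minus weighted waste percentages, clamped.

    Skips scalar entries like "total_waste_pct" and only reads the
    per-category sub-objects from the waste block.
    """
    weights = weights or {"parameter_urls": 2.0, "redirect_chains": 3.0, "soft_404s": 4.0}
    penalty = sum(
        weights.get(kind, 1.0) * detail.get("pct_of_crawls", 0.0)
        for kind, detail in waste.items()
        if isinstance(detail, dict)
    )
    return max(0, min(100, round(100 - penalty)))
```

Whatever the real formula, clamping to 0-100 and rounding to an integer matches the `"efficiency_score": 72` shape in the JSON above.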
## Notion Output (Required)

**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.

### Database Configuration

| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |

### Required Properties

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on waste percentage |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |

### Language Guidelines

- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Crawl Budget, Googlebot, robots.txt)
- URLs and code remain unchanged
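A small helper matching the `CRAWL-YYYYMMDD-NNN` Audit ID format above; the assumption that `NNN` is a zero-padded per-day counter, and the function name itself, are illustrative:

```python
from datetime import date

def audit_id(day: date, sequence: int) -> str:
    """Format an Audit ID like CRAWL-20250101-001.

    `sequence` is assumed to be a per-day counter starting at 1.
    """
    return f"CRAWL-{day:%Y%m%d}-{sequence:03d}"
```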