Add SEO skills 19-28, 31-32 with full Python implementations
12 new skills: Keyword Strategy, SERP Analysis, Position Tracking, Link Building, Content Strategy, E-Commerce SEO, KPI Framework, International SEO, AI Visibility, Knowledge Graph, Competitor Intel, and Crawl Budget. ~20K lines of Python across 25 domain scripts. Updated skill 11 pipeline table and repo CLAUDE.md. Enhanced skill 18 local SEO workflow from jamie.clinic audit. Note: Skill 26 hreflang_validator.py pending (content filter block). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
custom-skills/32-seo-crawl-budget/code/CLAUDE.md (new file, 178 lines)
# CLAUDE.md

## Overview

Crawl budget optimization tool for analyzing server access logs and identifying crawl budget waste. Parses Apache/Nginx/CloudFront access logs, identifies search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao), profiles per-bot crawl behavior, detects crawl waste (parameter URLs, low-value pages, redirect chains), identifies orphan pages, and generates crawl efficiency recommendations. Uses a streaming parser for large log files.
## Quick Start

```bash
pip install -r scripts/requirements.txt

# Parse access logs
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file /var/log/nginx/access.log --sitemap https://example.com/sitemap.xml --json
```
## Scripts

| Script | Purpose | Key Output |
|--------|---------|------------|
| `log_parser.py` | Parse server access logs, identify bots, extract crawl data | Bot identification, request patterns, status codes |
| `crawl_budget_analyzer.py` | Analyze crawl budget efficiency and generate recommendations | Waste identification, orphan pages, optimization plan |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
## Log Parser

```bash
# Parse Nginx combined log format
python scripts/log_parser.py --log-file /var/log/nginx/access.log --json

# Parse Apache combined log format
python scripts/log_parser.py --log-file /var/log/apache2/access.log --format apache --json

# Parse CloudFront logs
python scripts/log_parser.py --log-file cloudfront-log.gz --format cloudfront --json

# Filter by specific bot
python scripts/log_parser.py --log-file access.log --bot googlebot --json

# Parse gzipped logs
python scripts/log_parser.py --log-file access.log.gz --json

# Process large files in streaming mode
python scripts/log_parser.py --log-file access.log --streaming --json
```
**Capabilities**:
- Support for common log formats:
  - Nginx combined format
  - Apache combined format
  - CloudFront format
  - Custom format via regex
- Bot identification by User-Agent:
  - Googlebot (and variants: Googlebot-Image, Googlebot-News, Googlebot-Video, AdsBot-Google)
  - Yeti (Naver's crawler)
  - Bingbot
  - Daumoa (Kakao/Daum crawler)
  - Other bots (Applebot, DuckDuckBot, Baiduspider, etc.)
- Request data extraction (timestamp, IP, URL, status code, response size, user-agent, referer)
- Streaming parser for files >1GB
- Gzip/bzip2 compressed log support
- Date range filtering
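The combined-format parsing and User-Agent matching above can be sketched as follows. The regex, the `BOT_SIGNATURES` table, and the function names are illustrative, not the actual `log_parser.py` internals; note also that User-Agent strings are trivially spoofed, so production bot verification would add reverse-DNS checks on top of this:

```python
import re
from typing import Optional

# Nginx/Apache "combined" format:
# IP - user [time] "METHOD url PROTO" status size "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# User-Agent substring -> canonical bot name (illustrative subset)
BOT_SIGNATURES = {
    "adsbot-google": "googlebot",
    "googlebot": "googlebot",
    "yeti": "yeti",        # Naver's crawler
    "bingbot": "bingbot",
    "daumoa": "daumoa",    # Kakao/Daum crawler
    "applebot": "other",
    "duckduckbot": "other",
    "baiduspider": "other",
}

def identify_bot(user_agent: str) -> Optional[str]:
    """Return a canonical bot name, or None for non-bot traffic."""
    ua = user_agent.lower()
    for signature, name in BOT_SIGNATURES.items():
        if signature in ua:
            return name
    return None

def parse_line(line: str) -> Optional[dict]:
    """Parse one combined-format log line into a dict, tagging the bot."""
    m = COMBINED_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["bot"] = identify_bot(rec["ua"])
    return rec
```

For large files, the same `parse_line` can be applied line-by-line over a file iterator (or `gzip.open`), which is what "streaming mode" amounts to: constant memory regardless of log size.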
## Crawl Budget Analyzer

```bash
# Full crawl budget analysis
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json

# Waste identification only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json

# Orphan page detection
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --scope orphans --json

# Per-bot profiling
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

# With Ahrefs page history comparison
python scripts/crawl_budget_analyzer.py --log-file access.log --url https://example.com --ahrefs --json
```
**Capabilities**:
- Crawl budget waste identification:
  - Parameter URLs consuming crawl budget (?sort=, ?filter=, ?page=, ?utm_*)
  - Low-value pages (thin content, noindex pages being crawled)
  - Redirect chains consuming multiple crawls
  - Soft 404 pages (200 status but error content)
  - Duplicate URLs (www/non-www, http/https, trailing-slash variants)
- Per-bot behavior profiling:
  - Crawl frequency (requests/day, requests/hour)
  - Crawl depth distribution
  - Status code distribution per bot
  - Most crawled URLs per bot
  - Crawl pattern analysis (time of day, day of week)
- Orphan page detection:
  - Pages in sitemap but never crawled by bots
  - Pages crawled but not in sitemap
  - Crawled pages with no internal links
- Crawl efficiency recommendations:
  - robots.txt optimization suggestions
  - URL parameter handling recommendations
  - Noindex/nofollow suggestions for low-value pages
  - Redirect chain resolution priorities
  - Internal linking improvements for orphan pages
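The parameter-URL waste check above reduces to inspecting each crawled URL's query string. A minimal sketch, assuming the parameter names listed in the capabilities (`sort`, `filter`, `page`, `utm_*`) and hypothetical function names:

```python
from urllib.parse import urlsplit, parse_qsl

# Query parameters that typically generate crawlable duplicates
# (illustrative set matching the examples above, not the tool's full list)
WASTE_PARAMS = {"sort", "filter", "page"}

def is_parameter_waste(url: str) -> bool:
    """True if the query string suggests a faceted/tracking duplicate."""
    query = urlsplit(url).query
    for key, _value in parse_qsl(query, keep_blank_values=True):
        if key in WASTE_PARAMS or key.startswith("utm_"):
            return True
    return False

def waste_summary(crawled_urls: list) -> dict:
    """Count wasted crawls and their share of all crawls."""
    wasted = [u for u in crawled_urls if is_parameter_waste(u)]
    pct = round(100 * len(wasted) / len(crawled_urls), 1) if crawled_urls else 0.0
    return {"count": len(wasted), "pct_of_crawls": pct}
```

Running this over the bot-filtered request log yields the `parameter_urls` entry of the `waste` object shown in the output format below; redirect-chain and soft-404 detection need status codes and content heuristics rather than URL inspection alone.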
## Data Sources

| Source | Purpose |
|--------|---------|
| Server access logs | Primary crawl data |
| XML sitemap | Reference for expected crawlable pages |
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
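Combining the first two sources, the orphan checks reduce to set differences between sitemap URLs and bot-crawled URLs. A sketch with stdlib XML parsing; `sitemap_urls` and `orphan_report` are illustrative names, and sitemap-index files (nested sitemaps) are not handled here:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> set:
    """Extract <loc> values from a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)}

def orphan_report(sitemap: set, crawled: set) -> dict:
    """Two-way diff between sitemap URLs and URLs bots actually requested."""
    return {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
```

The third orphan category (crawled pages with no internal links) needs link data from a site crawl, which logs and sitemaps alone cannot provide.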
## Output Format

```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000},
      "top_crawled_urls": [...]
    },
    "yeti": {"requests": 35000, ...},
    "bingbot": {"requests": 20000, ...},
    "daumoa": {"requests": 15000, ...}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [...],
    "crawled_not_in_sitemap": [...]
  },
  "recommendations": [...],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
```
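The document does not define how `efficiency_score` is derived. One plausible shape, purely as an assumption, is a weighted penalty on the waste percentages, with heavier weights for costlier waste types; both the formula and the default weights below are illustrative, not the tool's actual calibration:

```python
def efficiency_score(waste: dict, weights: dict = None) -> int:
    """ASSUMED scoring sketch: 100 minus weighted waste percentages, clamped.

    Skips scalar entries like "total_waste_pct" and only reads the
    per-category sub-objects from the waste block.
    """
    weights = weights or {"parameter_urls": 2.0, "redirect_chains": 3.0, "soft_404s": 4.0}
    penalty = sum(
        weights.get(kind, 1.0) * detail.get("pct_of_crawls", 0.0)
        for kind, detail in waste.items()
        if isinstance(detail, dict)
    )
    return max(0, min(100, round(100 - penalty)))
```

Whatever the real formula, clamping to 0-100 and rounding to an integer matches the `"efficiency_score": 72` shape in the JSON above.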
## Notion Output (Required)

**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.

### Database Configuration

| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |

### Required Properties

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on waste percentage |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |

### Language Guidelines

- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Crawl Budget, Googlebot, robots.txt)
- URLs and code remain unchanged
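A small helper matching the `CRAWL-YYYYMMDD-NNN` Audit ID format above; the assumption that `NNN` is a zero-padded per-day counter, and the function name itself, are illustrative:

```python
from datetime import date

def audit_id(day: date, sequence: int) -> str:
    """Format an Audit ID like CRAWL-20250101-001.

    `sequence` is assumed to be a per-day counter starting at 1.
    """
    return f"CRAWL-{day:%Y%m%d}-{sequence:03d}"
```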