our-claude-skills/custom-skills/32-seo-crawl-budget/desktop/SKILL.md
Andrew Yim d2d0a2d460 Add SEO skills 33-34 and fix bugs in skills 19-34
New skills:
- Skill 33: Site migration planner with redirect mapping and monitoring
- Skill 34: Reporting dashboard with HTML charts and Korean executive reports

Bug fixes (Skill 34 - report_aggregator.py):
- Add audit_type fallback for skill identification (was only using audit_id prefix)
- Extract health scores from nested data dict (technical_score, onpage_score, etc.)
- Support subdomain matching in domain filter (blog.ourdigital.org matches ourdigital.org)
- Skip self-referencing DASH- aggregated reports

Bug fixes (Skill 20 - naver_serp_analyzer.py):
- Remove VIEW tab selectors (removed by Naver in 2026)
- Add new section detectors: books (도서), shortform (숏폼), influencer (인플루언서)

Improvements (Skill 34 - dashboard/executive report):
- Add Korean category labels for Chart.js charts (기술 SEO, 온페이지, etc.)
- Add Korean trend labels (개선 중 ↑, 안정 →, 하락 중 ↓)
- Add English→Korean issue description translation layer (20 common patterns)

Documentation improvements:
- Add Korean triggers to 4 skill descriptions (19, 25, 28, 31)
- Expand Skill 32 SKILL.md from 40→143 lines (was 6/10, added workflow, output format, limitations)
- Add output format examples to Skills 27 and 28 SKILL.md
- Add limitations sections to Skills 27 and 28
- Update README.md, CLAUDE.md, AGENTS.md for skills 33-34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:01:00 +09:00

---
name: seo-crawl-budget
description: |
  Crawl budget optimization and server log analysis for search engine bots.
  Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste,
  orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.
---
# Crawl Budget Optimizer
Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).
## Capabilities
### Log Analysis
- Parse Nginx combined, Apache combined, and CloudFront log formats
- Support for gzip/bzip2 compressed logs
- Streaming parser for files >1GB
- Date range filtering
- Custom format via regex
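
To make the parsing step concrete, here is a minimal sketch of reading the combined log format with a regex while streaming a plain or gzip-compressed file line by line. It is not the code in `log_parser.py`; the names `COMBINED_RE`, `parse_line`, and `iter_entries` are hypothetical and exist only for this illustration.

```python
import gzip
import re
from datetime import datetime
from typing import Iterator, Optional

# Combined layout shared by Nginx and Apache:
# ip ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: \S+)?" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line: str) -> Optional[dict]:
    """Return the parsed fields of one log line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Example timestamp: 10/Jan/2025:13:55:36 +0900
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

def iter_entries(path: str) -> Iterator[dict]:
    """Yield entries one at a time so large files never load fully into memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            entry = parse_line(line)
            if entry is not None:
                yield entry
```

Because `iter_entries` is a generator, date-range or bot filters can be applied as each entry is yielded rather than after the whole file has been read.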
### Bot Profiling
- Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
- Per-bot metrics: requests/day, requests/hour, unique URLs crawled
- Status code distribution per bot (200, 301, 404, 500)
- Crawl depth distribution
- Crawl pattern analysis (time of day, days of week)
- Most crawled URLs per bot
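
The per-bot metrics above can be sketched as a single pass over parsed entries. This is an illustrative sketch, not the shipped implementation: the token table and the functions `identify_bot` and `profile_bots` are assumptions, and each entry is assumed to be a dict with `user_agent`, `url`, and `status` keys.

```python
from collections import Counter, defaultdict

# Lower-cased User-Agent tokens mapped to a bot label (illustrative, not exhaustive).
BOT_TOKENS = {
    "googlebot": "googlebot",
    "yeti": "yeti",
    "bingbot": "bingbot",
    "daumoa": "daumoa",
    "applebot": "applebot",
    "duckduckbot": "duckduckbot",
    "baiduspider": "baiduspider",
}

def identify_bot(user_agent):
    """Return a bot label for a User-Agent string, or None for non-bot traffic."""
    ua = user_agent.lower()
    for token, label in BOT_TOKENS.items():
        if token in ua:
            return label
    return None

def profile_bots(entries):
    """Aggregate request count, unique URLs, and status codes per bot."""
    profiles = defaultdict(lambda: {"requests": 0, "urls": set(), "status": Counter()})
    for entry in entries:
        bot = identify_bot(entry["user_agent"])
        if bot is None:
            continue
        profile = profiles[bot]
        profile["requests"] += 1
        profile["urls"].add(entry["url"])
        profile["status"][str(entry["status"])] += 1
    return {
        bot: {
            "requests": p["requests"],
            "unique_urls": len(p["urls"]),
            "status_distribution": dict(p["status"]),
        }
        for bot, p in profiles.items()
    }
```

Substring matching keeps Googlebot variants such as `Googlebot-Image` under one label; splitting variants out would need a more specific token table.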
### Waste Detection
- **Parameter URLs**: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
- **Redirect chains**: Multiple redirects consuming crawl slots
- **Soft 404s**: 200 status pages with error/empty content
- **Duplicate URLs**: www/non-www, http/https, trailing slash variants
- **Low-value pages**: Thin content pages, noindex pages being crawled
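
Two of these checks, parameter URLs and duplicate URL variants, can be approximated from the URL alone. The sketch below assumes the parameter names listed above; `WASTE_PARAMS`, `is_parameter_waste`, and `canonical_key` are hypothetical names, and the real analyzer may apply different rules.

```python
from urllib.parse import parse_qsl, urlsplit

# Query parameters treated as crawl waste in this sketch (taken from the list above).
WASTE_PARAMS = {"sort", "filter", "page"}

def is_parameter_waste(url):
    """True if the URL carries a sort/filter/page parameter or any utm_* parameter."""
    query = urlsplit(url).query
    for name, _ in parse_qsl(query, keep_blank_values=True):
        if name in WASTE_PARAMS or name.startswith("utm_"):
            return True
    return False

def canonical_key(url):
    """Collapse http/https, www/non-www, and trailing-slash variants onto one key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host + path
```

Grouping crawled URLs by `canonical_key` and flagging groups with more than one member is one way to surface duplicate variants.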
### Orphan Page Detection
- Pages in sitemap but never crawled by bots
- Pages crawled but not in sitemap
- Crawled pages with no internal links pointing to them
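
The three checks above reduce to set differences once the URL lists are available. A minimal sketch follows; `find_orphans` and the `crawled_no_internal_links` key are hypothetical, and the internal-link list is assumed to come from a separate site crawl, since access logs do not contain link data.

```python
def find_orphans(sitemap_urls, crawled_urls, linked_urls=None):
    """Compare sitemap, crawled, and (optionally) internally linked URL sets."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    result = {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
    if linked_urls is not None:
        # Only computed when link data from a site crawl is supplied.
        result["crawled_no_internal_links"] = sorted(crawled - set(linked_urls))
    return result
```

URLs should be normalized the same way on both sides before comparison; otherwise a trailing slash is enough to produce a false orphan.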
## Workflow
### Step 1: Obtain Server Access Logs
Request or locate server access logs for the target site. Typical locations for each supported format:
- Nginx: `/var/log/nginx/access.log`
- Apache: `/var/log/apache2/access.log`
- CloudFront: Downloaded from S3 or CloudWatch
### Step 2: Parse Access Logs
```bash
python scripts/log_parser.py --log-file access.log --json
python scripts/log_parser.py --log-file access.log.gz --streaming --json
python scripts/log_parser.py --log-file access.log --bot googlebot --json
```
### Step 3: Crawl Budget Analysis
```bash
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json
```
### Step 4: Cross-Reference with Ahrefs (Optional)
Use `site-explorer-pages-history` to compare indexed pages vs crawled pages.
### Step 5: Generate Recommendations
Prioritized action items:
1. robots.txt optimization (block parameter URLs, low-value paths)
2. URL parameter handling (canonical tags and consistent internal links; the Google Search Console URL Parameters tool was retired in 2022)
3. Noindex/nofollow for low-value pages (keep these crawlable; a page blocked in robots.txt never has its noindex read)
4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
5. Internal linking improvements for orphan pages
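
For item 4, flattening a chain means pointing every source at its final destination. The sketch below assumes a redirect map has already been collected (source URL to immediate target); `resolve_redirects` is a hypothetical helper, not part of the skill's scripts.

```python
def resolve_redirects(redirects):
    """Map each redirect source to its final target; loops resolve to None."""
    flattened = {}
    for source, target in redirects.items():
        seen = {source}
        # Follow the chain until the target is no longer itself a redirect source.
        while target in redirects:
            if target in seen:
                target = None  # redirect loop, no final destination
                break
            seen.add(target)
            target = redirects[target]
        flattened[source] = target
    return flattened
```

Given `{"/old": "/interim", "/interim": "/new"}`, both `/old` and `/interim` resolve directly to `/new`, which is the 301 → 200 shape the recommendation aims for.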
### Step 6: Report to Notion
Save the Korean-language report to the OurDigital SEO Audit Log database (see Notion Output below).
## MCP Tools Used
| Tool | Purpose |
|------|---------|
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
| Notion | Save audit report to database |
| WebSearch | Current bot documentation and best practices |
## Output Format
```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
```
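
The per-category `pct_of_crawls` values in this example are consistent with dividing each count by `total_bot_requests` and rounding to one decimal place. That reading is inferred from the numbers, not from documented behavior, and the sketch below does not attempt to reproduce `total_waste_pct` or `efficiency_score`, whose formulas are not stated here. `waste_summary` is a hypothetical helper.

```python
def waste_summary(total_bot_requests, waste_counts):
    """Attach a percentage of all bot requests to each waste category count."""
    summary = {}
    for category, count in waste_counts.items():
        if total_bot_requests:
            pct = round(count / total_bot_requests * 100, 1)
        else:
            pct = 0.0  # avoid division by zero on empty logs
        summary[category] = {"count": count, "pct_of_crawls": pct}
    return summary

sample = waste_summary(
    150000,
    {"parameter_urls": 5000, "redirect_chains": 2000, "soft_404s": 1500},
)
```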
## Limitations
- Requires actual server access logs (not available via standard web crawling)
- Log format auto-detection covers only the built-in formats; custom formats may need an explicit format specification
- CloudFront logs have a different field structure than Nginx/Apache
- Large log files (>10GB) may need pre-filtering before analysis
- Bot identification relies on User-Agent strings, which can be spoofed
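
Where spoofed User-Agents are a concern, a common mitigation is the reverse-then-forward DNS check that Google documents for verifying Googlebot. The sketch below is an assumption about how such a check could be added, not something the skill's scripts are known to do. `verify_googlebot` is a hypothetical function, the suffix list covers only the two best-known Google crawler domains, and `socket.gethostbyname_ex` handles IPv4 only.

```python
import socket

# Hostname suffixes accepted for Googlebot in this sketch; other bots use other domains.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def verify_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then confirm the forward lookup."""
    # Lookups are injectable so the check can be tested without network access.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLEBOT_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP to rule out forged PTR records.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

DNS lookups are slow, so in practice results would be cached per IP rather than resolved for every log line.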
## Notion Output (Required)
All audit reports MUST be saved to the OurDigital SEO Audit Log:
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **Category**: Crawl Budget
- **Audit ID Format**: CRAWL-YYYYMMDD-NNN
- **Language**: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)
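
The Audit ID format can be produced with a small helper. This sketch assumes `NNN` is a zero-padded sequence number from 001 to 999; the source does not say how the sequence is assigned, and `make_audit_id` is a hypothetical name.

```python
from datetime import date

def make_audit_id(audit_date, sequence):
    """Build an ID in the CRAWL-YYYYMMDD-NNN format."""
    if not 1 <= sequence <= 999:
        raise ValueError("sequence must be between 1 and 999")
    return f"CRAWL-{audit_date:%Y%m%d}-{sequence:03d}"
```

For example, `make_audit_id(date(2025, 1, 31), 7)` returns `CRAWL-20250131-007`.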
## Reference Scripts
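Located in `code/scripts/` (the `scripts/...` paths in the workflow commands appear to assume the `code/` directory as the working directory):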
- `log_parser.py` — Server access log parser with bot identification
- `crawl_budget_analyzer.py` — Crawl budget efficiency analysis
- `base_client.py` — Shared async client utilities