Add SEO skills 33-34 and fix bugs in skills 19-34

New skills:
- Skill 33: Site migration planner with redirect mapping and monitoring
- Skill 34: Reporting dashboard with HTML charts and Korean executive reports

Bug fixes (Skill 34 - report_aggregator.py):
- Add audit_type fallback for skill identification (was only using audit_id prefix)
- Extract health scores from nested data dict (technical_score, onpage_score, etc.)
- Support subdomain matching in domain filter (blog.ourdigital.org matches ourdigital.org)
- Skip self-referencing DASH- aggregated reports

Bug fixes (Skill 20 - naver_serp_analyzer.py):
- Remove VIEW tab selectors (removed by Naver in 2026)
- Add new section detectors: books (도서), shortform (숏폼), influencer (인플루언서)

Improvements (Skill 34 - dashboard/executive report):
- Add Korean category labels for Chart.js charts (기술 SEO, 온페이지, etc.)
- Add Korean trend labels (개선 중 ↑, 안정 →, 하락 중 ↓)
- Add English→Korean issue description translation layer (20 common patterns)

Documentation improvements:
- Add Korean triggers to 4 skill descriptions (19, 25, 28, 31)
- Expand Skill 32 SKILL.md from 40→143 lines (was 6/10, added workflow, output format, limitations)
- Add output format examples to Skills 27 and 28 SKILL.md
- Add limitations sections to Skills 27 and 28
- Update README.md, CLAUDE.md, AGENTS.md for skills 33-34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:01:00 +09:00
parent dbfaa883cd
commit d2d0a2d460
37 changed files with 5462 additions and 56 deletions


---
name: seo-crawl-budget
description: |
Crawl budget optimization and server log analysis for search engine bots.
Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste,
orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.
---
# Crawl Budget Optimizer
Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).
## Capabilities
### Log Analysis
- Parse Nginx combined, Apache combined, and CloudFront log formats
- Support for gzip/bzip2 compressed logs
- Streaming parser for files >1GB
- Date range filtering
- Custom format via regex
### Bot Profiling
- Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
- Per-bot metrics: requests/day, requests/hour, unique URLs crawled
- Status code distribution per bot (200, 301, 404, 500)
- Crawl depth distribution
- Crawl pattern analysis (time of day, days of week)
- Most crawled URLs per bot
### Waste Detection
- **Parameter URLs**: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
- **Redirect chains**: Multiple redirects consuming crawl slots
- **Soft 404s**: 200 status pages with error/empty content
- **Duplicate URLs**: www/non-www, http/https, trailing slash variants
- **Low-value pages**: Thin content pages, noindex pages being crawled
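The parameter-URL and duplicate-URL bullets can be illustrated with `urllib.parse`. The parameter set below is an assumed default for the sketch, not the analyzer's real configuration:

```python
from urllib.parse import urlsplit, parse_qsl

# Query parameters the section flags as typical crawl-budget waste;
# a real audit should tune this list per site.
WASTE_PARAMS = {"sort", "filter", "page"}

def is_parameter_waste(url: str) -> bool:
    """True if the URL spends crawl budget on a known waste parameter."""
    keys = {k.lower() for k, _ in parse_qsl(urlsplit(url).query,
                                            keep_blank_values=True)}
    return any(k in WASTE_PARAMS or k.startswith("utm_") for k in keys)

def canonical_key(url: str) -> str:
    """Map www/non-www, http/https and trailing-slash variants of the
    same page to one key, so duplicate crawls can be grouped."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    return host + path
```

Grouping crawled URLs by `canonical_key` and counting groups with more than one member gives the duplicate-variant waste figure.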
### Orphan Page Detection
- Pages in sitemap but never crawled by bots
- Pages crawled but not in sitemap
- Crawled pages with no internal links pointing to them
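Behind the first two bullets is simple set arithmetic between sitemap URLs and crawled URLs. A hedged sketch using the standard sitemap XML namespace (function names here are illustrative):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> set[str]:
    """Extract <loc> URLs from a standard sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)}

def find_orphans(sitemap: set[str], crawled: set[str]) -> dict[str, list[str]]:
    """Both directions of the mismatch, as in the report's orphan_pages."""
    return {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
```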
## Workflow
### Step 1: Obtain Server Access Logs
Request or locate server access logs from the target site. Supported formats:
- Nginx: `/var/log/nginx/access.log`
- Apache: `/var/log/apache2/access.log`
- CloudFront: Downloaded from S3 or CloudWatch
### Step 2: Parse Access Logs
```bash
python scripts/log_parser.py --log-file access.log --json
python scripts/log_parser.py --log-file access.log.gz --streaming --json
python scripts/log_parser.py --log-file access.log --bot googlebot --json
```
### Step 3: Crawl Budget Analysis
```bash
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json
```
### Step 4: Cross-Reference with Ahrefs (Optional)
Use `site-explorer-pages-history` to compare indexed pages vs crawled pages.
### Step 5: Generate Recommendations
Prioritized action items:
1. robots.txt optimization (block parameter URLs, low-value paths)
2. URL parameter handling via canonical tags and robots.txt rules (Google Search Console's URL Parameters tool was retired in 2022)
3. Noindex/nofollow for low-value pages
4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
5. Internal linking improvements for orphan pages
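As one illustration, item 1 could be mechanized by emitting `Disallow` rules for the detected waste parameters. This is a sketch, not the skill's actual output:

```python
def robots_rules(waste_params: list[str]) -> str:
    """Build robots.txt rules blocking wasteful query parameters.
    The * wildcard is honored by Googlebot, Yeti and Bingbot but was
    not part of the original 1994 robots.txt convention."""
    lines = ["User-agent: *"]
    for p in sorted(set(waste_params)):
        lines.append(f"Disallow: /*?{p}=")   # parameter first in the query
        lines.append(f"Disallow: /*&{p}=")   # parameter later in the query
    return "\n".join(lines)

print(robots_rules(["utm_source", "sort"]))
```

Rules like these should be verified in Search Console's robots.txt tester before deployment, since an over-broad pattern can block legitimate pages.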
### Step 6: Report to Notion
Save Korean-language report to SEO Audit Log database.
## MCP Tools Used
| Tool | Purpose |
|------|---------|
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
| Notion | Save audit report to database |
| WebSearch | Current bot documentation and best practices |
## Output Format
```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
```
## Limitations
- Requires actual server access logs (not available via standard web crawling)
- Log format auto-detection may need manual format specification for custom formats
- CloudFront logs have a different field structure than Nginx/Apache
- Large log files (>10GB) may need pre-filtering before analysis
- Bot identification relies on User-Agent strings which can be spoofed
## Notion Output (Required)
All audit reports MUST be saved to the OurDigital SEO Audit Log:
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **Category**: Crawl Budget
- **Audit ID Format**: CRAWL-YYYYMMDD-NNN
- **Language**: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)
## Reference Scripts
Located in `code/scripts/`:
- `log_parser.py` — Server access log parser with bot identification
- `crawl_budget_analyzer.py` — Crawl budget efficiency analysis
- `base_client.py` — Shared async client utilities