Add SEO skills 33-34 and fix bugs in skills 19-34
New skills:
- Skill 33: Site migration planner with redirect mapping and monitoring
- Skill 34: Reporting dashboard with HTML charts and Korean executive reports

Bug fixes (Skill 34 - report_aggregator.py):
- Add audit_type fallback for skill identification (was only using audit_id prefix)
- Extract health scores from nested data dict (technical_score, onpage_score, etc.)
- Support subdomain matching in domain filter (blog.ourdigital.org matches ourdigital.org)
- Skip self-referencing DASH- aggregated reports

Bug fixes (Skill 20 - naver_serp_analyzer.py):
- Remove VIEW tab selectors (removed by Naver in 2026)
- Add new section detectors: books (도서), shortform (숏폼), influencer (인플루언서)

Improvements (Skill 34 - dashboard/executive report):
- Add Korean category labels for Chart.js charts (기술 SEO, 온페이지, etc.)
- Add Korean trend labels (개선 중 ↑, 안정 →, 하락 중 ↓)
- Add English→Korean issue description translation layer (20 common patterns)

Documentation improvements:
- Add Korean triggers to 4 skill descriptions (19, 25, 28, 31)
- Expand Skill 32 SKILL.md from 40→143 lines (was 6/10, added workflow, output format, limitations)
- Add output format examples to Skills 27 and 28 SKILL.md
- Add limitations sections to Skills 27 and 28
- Update README.md, CLAUDE.md, AGENTS.md for skills 33-34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -1,39 +1,142 @@
---
name: seo-crawl-budget
description: |
  Crawl budget optimization and server log analysis for search engine bots.
  Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste,
  orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.
---

# Crawl Budget Optimizer

Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).

## Capabilities

### Log Analysis

- Parse Nginx combined, Apache combined, and CloudFront log formats
- Support for gzip/bzip2 compressed logs
- Streaming parser for files >1GB
- Date range filtering
- Custom format via regex

### Bot Profiling
|
||||
- Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
|
||||
- Per-bot metrics: requests/day, requests/hour, unique URLs crawled
|
||||
- Status code distribution per bot (200, 301, 404, 500)
|
||||
- Crawl depth distribution
|
||||
- Crawl pattern analysis (time of day, days of week)
|
||||
- Most crawled URLs per bot
|
||||
|
||||
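Given parsed records, the per-bot metrics above reduce to a single aggregation pass. A sketch with assumed record fields (`bot`, `url`, `status`), not the shipped analyzer:

```python
from collections import Counter, defaultdict

def profile_bots(records):
    """Aggregate per-bot request counts, unique URLs, and status distribution."""
    profiles = defaultdict(lambda: {"requests": 0, "urls": set(), "status": Counter()})
    for rec in records:
        if rec.get("bot") is None:
            continue  # non-bot traffic is out of scope
        p = profiles[rec["bot"]]
        p["requests"] += 1
        p["urls"].add(rec["url"])
        p["status"][rec["status"]] += 1
    return {
        bot: {"requests": p["requests"], "unique_urls": len(p["urls"]),
              "status_distribution": dict(p["status"])}
        for bot, p in profiles.items()
    }

records = [
    {"bot": "googlebot", "url": "/a", "status": "200"},
    {"bot": "googlebot", "url": "/a", "status": "200"},
    {"bot": "googlebot", "url": "/old", "status": "404"},
    {"bot": "yeti", "url": "/a", "status": "200"},
    {"bot": None, "url": "/a", "status": "200"},
]
print(profile_bots(records)["googlebot"])
# {'requests': 3, 'unique_urls': 2, 'status_distribution': {'200': 2, '404': 1}}
```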
### Waste Detection

- **Parameter URLs**: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
- **Redirect chains**: Multiple redirects consuming crawl slots
- **Soft 404s**: 200-status pages with error/empty content
- **Duplicate URLs**: www/non-www, http/https, trailing slash variants
- **Low-value pages**: Thin content pages, noindex pages being crawled

### Orphan Page Detection

- Pages in sitemap but never crawled by bots
- Pages crawled but not in sitemap
- Crawled pages with no internal links pointing to them

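The first two orphan lists reduce to set differences between sitemap URLs and the URLs seen in the log. A minimal sketch (`find_orphans` is a hypothetical helper, named after the output-format keys, not part of the shipped scripts):

```python
def find_orphans(sitemap_urls, crawled_urls):
    """Orphan analysis as a pair of set differences."""
    sitemap, crawled = set(sitemap_urls), set(crawled_urls)
    return {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }

result = find_orphans(
    sitemap_urls=["/", "/about", "/blog/post-1", "/blog/post-2"],
    crawled_urls=["/", "/about", "/blog/post-1", "/tag/seo"],
)
print(result)
# {'in_sitemap_not_crawled': ['/blog/post-2'], 'crawled_not_in_sitemap': ['/tag/seo']}
```

The third check (no internal links) additionally requires a site crawl to build the link graph.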
## Workflow

### Step 1: Obtain Server Access Logs
|
||||
Request or locate server access logs from the target site. Supported formats:
|
||||
- Nginx: `/var/log/nginx/access.log`
|
||||
- Apache: `/var/log/apache2/access.log`
|
||||
- CloudFront: Downloaded from S3 or CloudWatch
|
||||
|
||||
## Tools Used
|
||||
### Step 2: Parse Access Logs

```bash
python scripts/log_parser.py --log-file access.log --json
python scripts/log_parser.py --log-file access.log.gz --streaming --json
python scripts/log_parser.py --log-file access.log --bot googlebot --json
```

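The `--streaming` flag implies line-by-line iteration rather than loading the file into memory, which is what makes >1GB and gzip-compressed logs workable. A minimal sketch of that pattern (illustrative only, not the parser's actual code):

```python
import gzip
import os
import tempfile

def iter_log_lines(path: str):
    """Yield log lines one at a time; transparently handles .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")

# Demo: a multi-gigabyte file would be processed the same way, one line at a time
path = os.path.join(tempfile.mkdtemp(), "access.log.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    fh.write('66.249.66.1 - - [...] "GET / HTTP/1.1" 200 512 "-" "Googlebot"\n')
print(sum(1 for _ in iter_log_lines(path)))  # 1
```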
### Step 3: Crawl Budget Analysis

```bash
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json
```

### Step 4: Cross-Reference with Ahrefs (Optional)

Use `site-explorer-pages-history` to compare indexed pages vs crawled pages.

### Step 5: Generate Recommendations

Prioritized action items:

1. robots.txt optimization (block parameter URLs, low-value paths)
2. URL parameter handling (Google Search Console settings)
3. Noindex/nofollow for low-value pages
4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
5. Internal linking improvements for orphan pages

### Step 6: Report to Notion

Save the Korean-language report to the SEO Audit Log database.

## MCP Tools Used

| Tool | Purpose |
|------|---------|
| Ahrefs `site-explorer-pages-history` | Compare indexed pages with crawled pages |
| Notion | Save audit report to database |
| WebSearch | Current bot documentation and best practices |

## Output Format
|
||||
|
||||
```json
|
||||
{
|
||||
"log_file": "access.log",
|
||||
"analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
|
||||
"total_bot_requests": 150000,
|
||||
"bots": {
|
||||
"googlebot": {
|
||||
"requests": 80000,
|
||||
"unique_urls": 12000,
|
||||
"avg_requests_per_day": 2580,
|
||||
"status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
|
||||
},
|
||||
"yeti": {"requests": 35000},
|
||||
"bingbot": {"requests": 20000},
|
||||
"daumoa": {"requests": 15000}
|
||||
},
|
||||
"waste": {
|
||||
"parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
|
||||
"redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
|
||||
"soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
|
||||
"total_waste_pct": 8.5
|
||||
},
|
||||
"orphan_pages": {
|
||||
"in_sitemap_not_crawled": [],
|
||||
"crawled_not_in_sitemap": []
|
||||
},
|
||||
"recommendations": [],
|
||||
"efficiency_score": 72,
|
||||
"timestamp": "2025-01-01T00:00:00"
|
||||
}
|
||||
```
|
||||
|
||||
## Limitations

- Requires actual server access logs (not obtainable via standard web crawling)
- Log format auto-detection may need manual format specification for custom formats
- CloudFront logs have a different field structure than Nginx/Apache
- Large log files (>10GB) may need pre-filtering before analysis
- Bot identification relies on User-Agent strings, which can be spoofed

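The last limitation has a standard mitigation: Google and Bing both document verifying crawler IPs via reverse DNS plus a forward-confirming lookup. A sketch (the suffix list here is an assumption, and `verify_bot_ip` performs live DNS lookups):

```python
import socket

# Hostname suffixes the engines document for their crawlers (assumed, non-exhaustive list)
VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_verified(hostname: str) -> bool:
    """True if a reverse-DNS hostname belongs to a known crawler domain."""
    return hostname.endswith(VERIFIED_SUFFIXES)

def verify_bot_ip(ip: str) -> bool:
    """Reverse-DNS the IP, check the domain, then forward-confirm it maps back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
        return hostname_is_verified(hostname) and \
            ip in socket.gethostbyname_ex(hostname)[2]      # forward confirmation
    except OSError:
        return False

print(hostname_is_verified("crawl-66-249-66-1.googlebot.com"))  # True
```

Because this requires one or two DNS round-trips per IP, it is best applied only to the distinct IPs seen in the log, not per request.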
## Notion Output (Required)
|
||||
|
||||
All audit reports MUST be saved to the OurDigital SEO Audit Log:
|
||||
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
|
||||
- **Category**: Crawl Budget
|
||||
- **Audit ID Format**: CRAWL-YYYYMMDD-NNN
|
||||
- **Language**: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)
|
||||
|
||||
## Reference Scripts
|
||||
|
||||
Located in `code/scripts/`:
|
||||
- `log_parser.py` — Server access log parser with bot identification
|
||||
- `crawl_budget_analyzer.py` — Crawl budget efficiency analysis
|
||||
- `base_client.py` — Shared async client utilities
|
||||
|
||||