our-claude-skills/custom-skills/32-seo-crawl-budget/desktop/SKILL.md
Andrew Yim d2d0a2d460 Add SEO skills 33-34 and fix bugs in skills 19-34
New skills:
- Skill 33: Site migration planner with redirect mapping and monitoring
- Skill 34: Reporting dashboard with HTML charts and Korean executive reports

Bug fixes (Skill 34 - report_aggregator.py):
- Add audit_type fallback for skill identification (was only using audit_id prefix)
- Extract health scores from nested data dict (technical_score, onpage_score, etc.)
- Support subdomain matching in domain filter (blog.ourdigital.org matches ourdigital.org)
- Skip self-referencing DASH- aggregated reports

Bug fixes (Skill 20 - naver_serp_analyzer.py):
- Remove VIEW tab selectors (removed by Naver in 2026)
- Add new section detectors: books (도서), shortform (숏폼), influencer (인플루언서)

Improvements (Skill 34 - dashboard/executive report):
- Add Korean category labels for Chart.js charts (기술 SEO "Technical SEO", 온페이지 "On-Page", etc.)
- Add Korean trend labels (개선 중 ↑ "improving", 안정 → "stable", 하락 중 ↓ "declining")
- Add English→Korean issue description translation layer (20 common patterns)

Documentation improvements:
- Add Korean triggers to 4 skill descriptions (19, 25, 28, 31)
- Expand Skill 32 SKILL.md from 40→143 lines (was 6/10, added workflow, output format, limitations)
- Add output format examples to Skills 27 and 28 SKILL.md
- Add limitations sections to Skills 27 and 28
- Update README.md, CLAUDE.md, AGENTS.md for skills 33-34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:01:00 +09:00

name: seo-crawl-budget
description: Crawl budget optimization and server log analysis for search engine bots. Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste, orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.

Crawl Budget Optimizer

Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).

Capabilities

Log Analysis

  • Parse Nginx combined, Apache combined, and CloudFront log formats
  • Support for gzip/bzip2 compressed logs
  • Streaming parser for files >1GB
  • Date range filtering
  • Custom format via regex
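A minimal streaming parser for the Nginx/Apache combined format might look like the following sketch (the regex and field names are illustrative, not log_parser.py's actual implementation):

```python
import gzip
import re

# Nginx/Apache "combined" log format (illustrative regex, not the shipped parser)
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def iter_entries(path):
    """Yield parsed entries one line at a time; handles .gz transparently,
    so multi-gigabyte logs never need to fit in memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            match = COMBINED.match(line)
            if match:
                yield match.groupdict()
```

CloudFront's tab-separated format would need a different pattern, which is why the skill exposes custom formats via regex.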

Bot Profiling

  • Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
  • Per-bot metrics: requests/day, requests/hour, unique URLs crawled
  • Status code distribution per bot (200, 301, 404, 500)
  • Crawl depth distribution
  • Crawl pattern analysis (time of day, days of week)
  • Most crawled URLs per bot
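Bot identification by User-Agent substring could be sketched as below (the token table is an assumption drawn from the bot list above, not log_parser.py's actual mapping):

```python
# Ordered (substring, canonical-name) pairs; matched case-insensitively.
BOT_TOKENS = [
    ("googlebot", "googlebot"),
    ("yeti", "yeti"),              # Naver
    ("bingbot", "bingbot"),
    ("daum", "daumoa"),            # Kakao
    ("applebot", "applebot"),
    ("duckduckbot", "duckduckbot"),
    ("baiduspider", "baiduspider"),
]

def classify_bot(user_agent):
    """Return a canonical bot name, or None for non-bot traffic."""
    ua = user_agent.lower()
    for token, name in BOT_TOKENS:
        if token in ua:
            return name
    return None
```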

Waste Detection

  • Parameter URLs: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
  • Redirect chains: Multiple redirects consuming crawl slots
  • Soft 404s: 200 status pages with error/empty content
  • Duplicate URLs: www/non-www, http/https, trailing slash variants
  • Low-value pages: Thin content pages, noindex pages being crawled
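The parameter-URL check above could be implemented roughly like this (the WASTE_PARAMS set is illustrative; the real analyzer's list may differ):

```python
from urllib.parse import parse_qsl, urlsplit

# Query parameters assumed to generate crawl waste (sorting, faceting, paging)
WASTE_PARAMS = {"sort", "filter", "page"}

def is_waste_url(url):
    """True if any query parameter is a known waste pattern or a utm_* tag."""
    query = urlsplit(url).query
    for key, _value in parse_qsl(query, keep_blank_values=True):
        if key in WASTE_PARAMS or key.startswith("utm_"):
            return True
    return False
```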

Orphan Page Detection

  • Pages in sitemap but never crawled by bots
  • Pages crawled but not in sitemap
  • Crawled pages with no internal links pointing to them
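The first two checks reduce to set differences between sitemap URLs and URLs bots actually requested; a sketch (names are mine, not crawl_budget_analyzer.py's):

```python
def orphan_report(sitemap_urls, crawled_urls):
    """Compare sitemap membership against bot-crawled URLs from the logs."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    return {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
```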

Workflow

Step 1: Obtain Server Access Logs

Request or locate server access logs from the target site. Supported formats:

  • Nginx: /var/log/nginx/access.log
  • Apache: /var/log/apache2/access.log
  • CloudFront: Downloaded from S3 or CloudWatch

Step 2: Parse Access Logs

python scripts/log_parser.py --log-file access.log --json
python scripts/log_parser.py --log-file access.log.gz --streaming --json
python scripts/log_parser.py --log-file access.log --bot googlebot --json

Step 3: Crawl Budget Analysis

python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

Step 4: Cross-Reference with Ahrefs (Optional)

Use site-explorer-pages-history to compare indexed pages vs crawled pages.

Step 5: Generate Recommendations

Prioritized action items:

  1. robots.txt optimization (block parameter URLs, low-value paths)
  2. URL parameter handling via canonical tags and robots.txt rules (the Google Search Console URL Parameters tool was retired in 2022)
  3. Noindex/nofollow for low-value pages
  4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
  5. Internal linking improvements for orphan pages
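Item 1 might look like the following robots.txt fragment. The wildcard patterns are illustrative only; overly broad rules can block legitimate pages, so adapt them to the site's real parameters and verify before deploying.

```
User-agent: *
# Block sorting/faceting/pagination parameters that waste crawl budget
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
# Block tracking-tagged duplicates
Disallow: /*utm_
```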

Step 6: Report to Notion

Save Korean-language report to SEO Audit Log database.

MCP Tools Used

  • Ahrefs site-explorer-pages-history — Compare indexed pages with crawled pages
  • Notion — Save audit report to database
  • WebSearch — Current bot documentation and best practices

Output Format

{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
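The pct_of_crawls fields are each waste category's share of total bot requests; a sketch of the arithmetic (the rounding convention is assumed, not the analyzer's exact behavior):

```python
def waste_pct(waste_count, total_bot_requests):
    """Percentage of all bot requests spent on one waste category."""
    return round(100 * waste_count / total_bot_requests, 1)
```

With the sample numbers above, 5,000 parameter-URL crawls out of 150,000 bot requests yields 3.3.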

Limitations

  • Requires actual server access logs (not available via standard web crawling)
  • Format auto-detection covers common formats only; custom log formats need a manually specified regex
  • CloudFront logs have a different field structure than Nginx/Apache
  • Large log files (>10GB) may need pre-filtering before analysis
  • Bot identification relies on User-Agent strings which can be spoofed
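The spoofing caveat can be mitigated with the reverse-DNS verification Google documents for Googlebot: resolve the IP to a hostname, check the domain, then forward-resolve to confirm. A sketch (function names are mine; verify_googlebot requires live DNS access):

```python
import socket

# Domains Google documents for genuine Googlebot reverse-DNS hostnames
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname):
    """Pure suffix check applied after the reverse lookup."""
    return hostname.endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-confirm
    that the hostname resolves back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_is_google(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```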

Notion Output (Required)

All audit reports MUST be saved to the OurDigital SEO Audit Log:

  • Database ID: 2c8581e5-8a1e-8035-880b-e38cefc2f3ef
  • Category: Crawl Budget
  • Audit ID Format: CRAWL-YYYYMMDD-NNN
  • Language: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)

Reference Scripts

Located in code/scripts/:

  • log_parser.py — Server access log parser with bot identification
  • crawl_budget_analyzer.py — Crawl budget efficiency analysis
  • base_client.py — Shared async client utilities