our-claude-skills/custom-skills/32-seo-crawl-budget/desktop/SKILL.md
Andrew Yim d2d0a2d460 Add SEO skills 33-34 and fix bugs in skills 19-34
New skills:
- Skill 33: Site migration planner with redirect mapping and monitoring
- Skill 34: Reporting dashboard with HTML charts and Korean executive reports

Bug fixes (Skill 34 - report_aggregator.py):
- Add audit_type fallback for skill identification (was only using audit_id prefix)
- Extract health scores from nested data dict (technical_score, onpage_score, etc.)
- Support subdomain matching in domain filter (blog.ourdigital.org matches ourdigital.org)
- Skip self-referencing DASH- aggregated reports

Bug fixes (Skill 20 - naver_serp_analyzer.py):
- Remove VIEW tab selectors (removed by Naver in 2026)
- Add new section detectors: books (도서), shortform (숏폼), influencer (인플루언서)

Improvements (Skill 34 - dashboard/executive report):
- Add Korean category labels for Chart.js charts (기술 SEO "Technical SEO", 온페이지 "On-Page", etc.)
- Add Korean trend labels (개선 중 ↑ "improving", 안정 → "stable", 하락 중 ↓ "declining")
- Add English→Korean issue description translation layer (20 common patterns)

Documentation improvements:
- Add Korean triggers to 4 skill descriptions (19, 25, 28, 31)
- Expand Skill 32 SKILL.md from 40→143 lines (was 6/10, added workflow, output format, limitations)
- Add output format examples to Skills 27 and 28 SKILL.md
- Add limitations sections to Skills 27 and 28
- Update README.md, CLAUDE.md, AGENTS.md for skills 33-34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:01:00 +09:00

name: seo-crawl-budget
description: Crawl budget optimization and server log analysis for search engine bots. Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste, orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.

Crawl Budget Optimizer

Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).

Capabilities

Log Analysis

  • Parse Nginx combined, Apache combined, and CloudFront log formats
  • Support for gzip/bzip2 compressed logs
  • Streaming parser for files >1GB
  • Date range filtering
  • Custom format via regex
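A minimal streaming parser for the Nginx/Apache combined format might look like the following sketch (the regex and field names are illustrative, not log_parser.py's actual implementation):

```python
import gzip
import re

# Nginx/Apache "combined" log format (illustrative regex, not the shipped parser)
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def iter_entries(path):
    """Yield parsed entries one line at a time; handles .gz transparently,
    so multi-gigabyte logs never need to fit in memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            match = COMBINED.match(line)
            if match:
                yield match.groupdict()
```

CloudFront's tab-separated format would need a different pattern, which is why the skill exposes custom formats via regex.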

Bot Profiling

  • Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
  • Per-bot metrics: requests/day, requests/hour, unique URLs crawled
  • Status code distribution per bot (200, 301, 404, 500)
  • Crawl depth distribution
  • Crawl pattern analysis (time of day, days of week)
  • Most crawled URLs per bot
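Bot identification by User-Agent substring could be sketched as below (the token table is an assumption drawn from the bot list above, not log_parser.py's actual mapping):

```python
# Ordered (substring, canonical-name) pairs; matched case-insensitively.
BOT_TOKENS = [
    ("googlebot", "googlebot"),
    ("yeti", "yeti"),              # Naver
    ("bingbot", "bingbot"),
    ("daum", "daumoa"),            # Kakao
    ("applebot", "applebot"),
    ("duckduckbot", "duckduckbot"),
    ("baiduspider", "baiduspider"),
]

def classify_bot(user_agent):
    """Return a canonical bot name, or None for non-bot traffic."""
    ua = user_agent.lower()
    for token, name in BOT_TOKENS:
        if token in ua:
            return name
    return None
```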

Waste Detection

  • Parameter URLs: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
  • Redirect chains: Multiple redirects consuming crawl slots
  • Soft 404s: 200 status pages with error/empty content
  • Duplicate URLs: www/non-www, http/https, trailing slash variants
  • Low-value pages: Thin content pages, noindex pages being crawled
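The parameter-URL check above could be implemented roughly like this (the WASTE_PARAMS set is illustrative; the real analyzer's list may differ):

```python
from urllib.parse import parse_qsl, urlsplit

# Query parameters assumed to generate crawl waste (sorting, faceting, paging)
WASTE_PARAMS = {"sort", "filter", "page"}

def is_waste_url(url):
    """True if any query parameter is a known waste pattern or a utm_* tag."""
    query = urlsplit(url).query
    for key, _value in parse_qsl(query, keep_blank_values=True):
        if key in WASTE_PARAMS or key.startswith("utm_"):
            return True
    return False
```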

Orphan Page Detection

  • Pages in sitemap but never crawled by bots
  • Pages crawled but not in sitemap
  • Crawled pages with no internal links pointing to them
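The first two checks reduce to set differences between sitemap URLs and URLs bots actually requested; a sketch (names are mine, not crawl_budget_analyzer.py's):

```python
def orphan_report(sitemap_urls, crawled_urls):
    """Compare sitemap membership against bot-crawled URLs from the logs."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    return {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
```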

Workflow

Step 1: Obtain Server Access Logs

Request or locate server access logs from the target site. Supported formats:

  • Nginx: /var/log/nginx/access.log
  • Apache: /var/log/apache2/access.log
  • CloudFront: Downloaded from S3 or CloudWatch

Step 2: Parse Access Logs

python scripts/log_parser.py --log-file access.log --json
python scripts/log_parser.py --log-file access.log.gz --streaming --json
python scripts/log_parser.py --log-file access.log --bot googlebot --json

Step 3: Crawl Budget Analysis

python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

Step 4: Cross-Reference with Ahrefs (Optional)

Use site-explorer-pages-history to compare indexed pages vs crawled pages.

Step 5: Generate Recommendations

Prioritized action items:

  1. robots.txt optimization (block parameter URLs, low-value paths)
  2. URL parameter handling via canonical tags and robots.txt rules (the Google Search Console URL Parameters tool was retired in 2022)
  3. Noindex/nofollow for low-value pages
  4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
  5. Internal linking improvements for orphan pages
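Item 1 might look like the following robots.txt fragment. The wildcard patterns are illustrative only; overly broad rules can block legitimate pages, so adapt them to the site's real parameters and verify before deploying.

```
User-agent: *
# Block sorting/faceting/pagination parameters that waste crawl budget
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
# Block tracking-tagged duplicates
Disallow: /*utm_
```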

Step 6: Report to Notion

Save Korean-language report to SEO Audit Log database.

MCP Tools Used

  • Ahrefs site-explorer-pages-history — Compare indexed pages with crawled pages
  • Notion — Save audit report to database
  • WebSearch — Current bot documentation and best practices

Output Format

{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
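The pct_of_crawls fields are each waste category's share of total bot requests; a sketch of the arithmetic (the rounding convention is assumed, not the analyzer's exact behavior):

```python
def waste_pct(waste_count, total_bot_requests):
    """Percentage of all bot requests spent on one waste category."""
    return round(100 * waste_count / total_bot_requests, 1)
```

With the sample numbers above, 5,000 parameter-URL crawls out of 150,000 bot requests yields 3.3.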

Limitations

  • Requires actual server access logs (not available via standard web crawling)
  • Format auto-detection covers common formats only; custom log formats need a manually specified regex
  • CloudFront logs have a different field structure than Nginx/Apache
  • Large log files (>10GB) may need pre-filtering before analysis
  • Bot identification relies on User-Agent strings which can be spoofed
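The spoofing caveat can be mitigated with the reverse-DNS verification Google documents for Googlebot: resolve the IP to a hostname, check the domain, then forward-resolve to confirm. A sketch (function names are mine; verify_googlebot requires live DNS access):

```python
import socket

# Domains Google documents for genuine Googlebot reverse-DNS hostnames
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname):
    """Pure suffix check applied after the reverse lookup."""
    return hostname.endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-confirm
    that the hostname resolves back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_is_google(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```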

Notion Output (Required)

All audit reports MUST be saved to the OurDigital SEO Audit Log:

  • Database ID: 2c8581e5-8a1e-8035-880b-e38cefc2f3ef
  • Category: Crawl Budget
  • Audit ID Format: CRAWL-YYYYMMDD-NNN
  • Language: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)

Reference Scripts

Located in code/scripts/:

  • log_parser.py — Server access log parser with bot identification
  • crawl_budget_analyzer.py — Crawl budget efficiency analysis
  • base_client.py — Shared async client utilities