name: seo-crawl-budget
description: Crawl budget optimization and server log analysis for search engine bots. Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste, orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.

Crawl Budget Optimizer

Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).

Capabilities

Log Analysis

  • Parse Nginx combined, Apache combined, and CloudFront log formats
  • Support for gzip/bzip2 compressed logs
  • Streaming parser for files >1GB
  • Date range filtering
  • Custom format via regex
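
The parsing capabilities above can be sketched roughly as follows. This is a minimal illustration, not the contents of log_parser.py: the names `COMBINED_RE`, `open_log`, `parse_line`, and `iter_entries` are hypothetical, compression is inferred from the file extension, and only the combined format shared by Nginx and Apache is covered (CloudFront logs would need a different pattern).

```python
import bz2
import gzip
import re
from datetime import datetime

# Combined log format, as written by both Nginx and Apache:
# ip ident user [time] "method path protocol" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def open_log(path):
    """Open a plain, gzip, or bzip2 log as text, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def parse_line(line):
    """Return a dict for one combined-format line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Timestamps look like 10/Jan/2025:13:55:36 +0900 and parse as timezone-aware.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def iter_entries(path, start=None, end=None):
    """Yield entries one line at a time so large files are never fully loaded.

    start and end are optional timezone-aware datetimes for date range filtering.
    """
    with open_log(path) as handle:
        for line in handle:
            entry = parse_line(line)
            if entry is None:
                continue
            if start is not None and entry["time"] < start:
                continue
            if end is not None and entry["time"] > end:
                continue
            yield entry
```

Because `iter_entries` is a generator, downstream counters can consume a multi-gigabyte file with roughly constant memory. A custom format would swap in a different compiled pattern with the same group names.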

Bot Profiling

  • Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
  • Per-bot metrics: requests/day, requests/hour, unique URLs crawled
  • Status code distribution per bot (200, 301, 404, 500)
  • Crawl depth distribution
  • Crawl pattern analysis (time of day, days of week)
  • Most crawled URLs per bot
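
As a rough sketch of how these per-bot metrics could be accumulated, the example below matches lowercase tokens against the User-Agent and counts per bot. It assumes each entry is a dict with `user_agent`, `path`, `status`, and `time` (a `datetime`) keys; those key names, `BOT_TOKENS`, and the function names are illustrative assumptions. Plain substring matching catches variants such as Googlebot-Image but can also produce false positives.

```python
from collections import Counter, defaultdict

# Lowercase substrings searched for in the User-Agent header (assumed tokens).
BOT_TOKENS = ("googlebot", "yeti", "bingbot", "daumoa",
              "applebot", "duckduckbot", "baiduspider")


def identify_bot(user_agent):
    """Return the first matching bot token, or None for non-bot traffic."""
    ua = user_agent.lower()
    for token in BOT_TOKENS:
        if token in ua:
            return token
    return None


def crawl_depth(path):
    """Count non-empty path segments, ignoring the query string."""
    return len([seg for seg in path.split("?", 1)[0].split("/") if seg])


def profile_bots(entries, top_n=10):
    """Aggregate request counts, URLs, status codes, depth, and hour per bot."""
    raw = defaultdict(lambda: {"requests": 0, "urls": Counter(),
                               "status": Counter(), "depth": Counter(),
                               "hour": Counter()})
    for entry in entries:
        bot = identify_bot(entry["user_agent"])
        if bot is None:
            continue
        stats = raw[bot]
        stats["requests"] += 1
        stats["urls"][entry["path"]] += 1
        stats["status"][str(entry["status"])] += 1
        stats["depth"][crawl_depth(entry["path"])] += 1
        stats["hour"][entry["time"].hour] += 1
    return {
        bot: {
            "requests": s["requests"],
            "unique_urls": len(s["urls"]),
            "status_distribution": dict(s["status"]),
            "depth_distribution": dict(s["depth"]),
            "requests_by_hour": dict(s["hour"]),
            "top_urls": s["urls"].most_common(top_n),
        }
        for bot, s in raw.items()
    }
```

Requests per day can then be derived by dividing `requests` by the number of days in the analysis period.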

Waste Detection

  • Parameter URLs: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
  • Redirect chains: Multi-hop redirects where every hop costs a separate bot request
  • Soft 404s: Pages that return status 200 but serve error or empty content
  • Duplicate URLs: www/non-www, http/https, trailing slash variants
  • Low-value pages: Thin content pages, noindex pages being crawled
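
Two of the checks above, parameter URLs and duplicate URL variants, lend themselves to a short sketch. The names `WASTE_PARAMS`, `has_waste_params`, `canonical_key`, and `find_duplicates` are hypothetical, and the duplicate check assumes absolute URLs are available (combined-format logs record only the path, so host and scheme would have to come from per-host log files or a custom format).

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter names listed in the waste checks; utm_* is matched by prefix.
WASTE_PARAMS = {"sort", "filter", "page"}


def has_waste_params(url):
    """True if the query string carries a sort/filter/page/utm_* parameter."""
    query = urlsplit(url).query
    for key, _value in parse_qsl(query, keep_blank_values=True):
        if key in WASTE_PARAMS or key.startswith("utm_"):
            return True
    return False


def canonical_key(url):
    """Collapse scheme, www prefix, and trailing slash so variants compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def find_duplicates(urls):
    """Group URLs that differ only by scheme, www prefix, or trailing slash."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_key(url), set()).add(url)
    return {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}
```

Soft 404s and thin content cannot be detected from logs alone, since they depend on the response body; those checks would need a fetch of the page or a response-size heuristic.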

Orphan Page Detection

  • Pages in sitemap but never crawled by bots
  • Pages crawled but not in sitemap
  • Crawled pages with no internal links pointing to them
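
The three comparisons above reduce to set differences. The sketch below assumes the sitemap follows the sitemaps.org 0.9 schema and that all URL lists use the same normalization; `sitemap_locs`, `find_orphans`, and the `crawled_without_internal_links` key are hypothetical names, and the list of internally linked URLs would have to come from a separate site crawl.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org 0.9 schema).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text):
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def find_orphans(sitemap_urls, crawled_urls, linked_urls=None):
    """Compare sitemap, bot-crawled, and (optionally) internally linked URLs."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    result = {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
    if linked_urls is not None:
        # Crawled pages that no internal link points to.
        result["crawled_without_internal_links"] = sorted(crawled - set(linked_urls))
    return result
```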

Workflow

Step 1: Obtain Server Access Logs

Request or locate server access logs from the target site. Supported formats:

  • Nginx: /var/log/nginx/access.log
  • Apache: /var/log/apache2/access.log
  • CloudFront: Downloaded from S3 or CloudWatch

Step 2: Parse Access Logs

python scripts/log_parser.py --log-file access.log --json
python scripts/log_parser.py --log-file access.log.gz --streaming --json
python scripts/log_parser.py --log-file access.log --bot googlebot --json

Step 3: Crawl Budget Analysis

python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

Step 4: Cross-Reference with External Data (Optional)

Use the our-seo-agent CLI, or pass pre-fetched JSON via --input, to compare indexed pages against crawled pages. WebSearch can supplement this with current indexing data.

Step 5: Generate Recommendations

Prioritized action items:

  1. robots.txt optimization (block parameter URLs, low-value paths)
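  2. URL parameter handling (canonical tags and consistent internal links; the Google Search Console URL Parameters tool was retired in 2022)
  3. Noindex/nofollow for low-value pages (a bot must still fetch a page to see noindex, so do not also block that page in robots.txt)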
  4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
  5. Internal linking improvements for orphan pages
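
For item 4, one way to plan the fix is to flatten a map of known redirects so that every source points directly at its final destination. This is an illustrative sketch: `flatten_redirects` is a hypothetical helper, and the redirect map is assumed to have been collected separately (for example from server configuration or a crawl).

```python
def flatten_redirects(redirects, max_hops=10):
    """Map each redirecting URL straight to its final destination.

    redirects maps source URL -> immediate redirect target.
    Loops and chains longer than max_hops map to None for manual review.
    """
    flattened = {}
    for source, target in redirects.items():
        seen = {source}
        hops = 1
        # Follow the chain until the target no longer redirects.
        while target in redirects and target not in seen and hops < max_hops:
            seen.add(target)
            target = redirects[target]
            hops += 1
        if target in seen or target in redirects:
            flattened[source] = None  # loop, or chain too long to resolve
        else:
            flattened[source] = target
    return flattened
```

Given `{"/old": "/interim", "/interim": "/new"}`, both sources resolve to `/new`, which turns the 301 → 301 → 200 pattern into 301 → 200 once the server rules are updated.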

Step 6: Report to Notion

Save Korean-language report to SEO Audit Log database.

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on efficiency score |
| Found Date | Date | Analysis date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |
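
As an illustration of how these properties might be assembled, the sketch below builds a property payload in the shape the Notion API uses for title, URL, select, date, and rich text values. It assumes the property names are Issue, Site, Category, Priority, Found Date, and Audit ID; the helper names are hypothetical, and the priority value is passed in because the mapping from efficiency score to priority is not specified here.

```python
from datetime import date


def make_audit_id(found, sequence):
    """Format an audit ID as CRAWL-YYYYMMDD-NNN."""
    return f"CRAWL-{found:%Y%m%d}-{sequence:03d}"


def build_properties(title, site_url, priority, found, sequence):
    """Build the page properties for one crawl budget audit report."""
    return {
        "Issue": {"title": [{"text": {"content": title}}]},
        "Site": {"url": site_url},
        "Category": {"select": {"name": "Crawl Budget"}},
        "Priority": {"select": {"name": priority}},
        "Found Date": {"date": {"start": found.isoformat()}},
        "Audit ID": {"rich_text": [{"text": {"content": make_audit_id(found, sequence)}}]},
    }
```

For example, `make_audit_id(date(2025, 1, 31), 7)` yields `CRAWL-20250131-007`.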

Data Sources

| Source | Purpose |
|--------|---------|
| our-seo-agent CLI | Future primary data source; use --input for pre-fetched JSON |
| Notion MCP | Save audit report to database |
| WebSearch | Current bot documentation and best practices |

Output Format

{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 5.7
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
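
The `pct_of_crawls` fields in the `waste` object can be derived from the category counts and `total_bot_requests`. The sketch below shows that arithmetic under two assumptions: percentages are rounded to one decimal place, and each wasted request is counted in only one category. `summarize_waste` is a hypothetical helper; the formula for `efficiency_score` is not defined in this document and is left out.

```python
def pct_of_crawls(count, total_bot_requests):
    """Share of all bot requests, as a percentage rounded to one decimal."""
    if total_bot_requests == 0:
        return 0.0
    return round(100 * count / total_bot_requests, 1)


def summarize_waste(counts, total_bot_requests):
    """Build the waste object from per-category request counts."""
    summary = {
        name: {"count": count, "pct_of_crawls": pct_of_crawls(count, total_bot_requests)}
        for name, count in counts.items()
    }
    # Total is computed from summed counts, not from the rounded percentages.
    summary["total_waste_pct"] = pct_of_crawls(sum(counts.values()), total_bot_requests)
    return summary
```

Computing the total from the summed counts avoids compounding the rounding error of the individual percentages.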

Korean Output Example

# 크롤 예산 분석 보고서 - example.com

## 분석 기간: 2025-01-01 ~ 2025-01-31

### 봇별 크롤 현황
| 봇 | 요청 수 | 고유 URL | 일 평균 |
|----|---------|---------|---------|
| Googlebot | 80,000 | 12,000 | 2,580 |
| Yeti (Naver) | 35,000 | 8,000 | 1,129 |

### 크롤 낭비 요인
- 파라미터 URL: 5,000건 (3.3%)
- 리다이렉트 체인: 2,000건 (1.3%)
- 소프트 404: 1,500건 (1.0%)

### 효율성 점수: 72/100

Limitations

  • Requires actual server access logs (not available via standard web crawling)
  • Log format auto-detection can fail on custom formats; specify the format manually (custom regex) in that case
  • CloudFront logs have a different field structure than Nginx/Apache
  • Large log files (>10GB) may need pre-filtering before analysis
  • Bot identification relies on User-Agent strings which can be spoofed
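
The spoofing limitation can be reduced with a reverse-then-forward DNS check, the verification method Google and Bing document for their crawlers. The sketch below is an assumption-laden illustration, not part of the reference scripts: `VERIFIED_SUFFIXES` and `verify_bot_ip` are hypothetical names, only Googlebot and Bingbot suffixes are listed, and the forward lookup used here resolves IPv4 addresses only.

```python
import socket

# Hostname suffixes that the crawler operators publish for verification.
VERIFIED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def verify_bot_ip(bot, ip,
                  reverse_lookup=socket.gethostbyaddr,
                  forward_lookup=socket.gethostbyname_ex):
    """Check that an IP really belongs to the bot its User-Agent claims.

    Returns True or False, or None when no suffix rule is known for the bot.
    The lookup functions are injectable so results can be cached or faked.
    """
    suffixes = VERIFIED_SUFFIXES.get(bot)
    if suffixes is None:
        return None
    try:
        # Step 1: reverse DNS must land in the operator's domain.
        hostname = reverse_lookup(ip)[0]
        if not hostname.lower().endswith(suffixes):
            return False
        # Step 2: forward DNS of that hostname must include the original IP.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

DNS lookups are slow compared with log parsing, so in practice the check would be run once per distinct IP and cached.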

Notion Output (Required)

All audit reports MUST be saved to the OurDigital SEO Audit Log:

  • Database ID: 2c8581e5-8a1e-8035-880b-e38cefc2f3ef
  • Category: Crawl Budget
  • Audit ID Format: CRAWL-YYYYMMDD-NNN
  • Language: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)

Reference Scripts

Located in code/scripts/:

  • log_parser.py — Server access log parser with bot identification
  • crawl_budget_analyzer.py — Crawl budget efficiency analysis
  • base_client.py — Shared async client utilities