
Andrew Yim a28bfbf847 Fix SEO skill 34 bugs, Korean labels, and transition Ahrefs refs to our-seo-agent (#2 )

2026-02-14 01:09:35 +09:00

6.0 KiB


name	description
seo-crawl-budget	Crawl budget optimization and server log analysis for search engine bots. Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste, orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.

Crawl Budget Optimizer

Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).

Capabilities

Log Analysis

Parse Nginx combined, Apache combined, and CloudFront log formats
Support for gzip/bzip2 compressed logs
Streaming parser for files >1GB
Date range filtering
Custom format via regex
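
As an illustration of the parsing capabilities listed above, here is a minimal sketch of a streaming parser for the Nginx/Apache combined format. It is not the code in `log_parser.py`; the names `COMBINED_RE`, `parse_line`, `open_log`, and `stream_records` are hypothetical, and the sketch assumes the compression type can be inferred from the file suffix.

```python
import bz2
import gzip
import re
from typing import Iterator, Optional

# Nginx and Apache "combined" format:
# IP - user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one combined-format line; return None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def open_log(path: str):
    """Open a plain, gzip, or bzip2 log in text mode, chosen by file suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def stream_records(path: str) -> Iterator[dict]:
    """Yield parsed records one by one so the file is never fully loaded."""
    with open_log(path) as handle:
        for line in handle:
            record = parse_line(line)
            if record is not None:
                yield record
```

Because `stream_records` is a generator, memory use stays roughly constant regardless of file size, which is the property the streaming mode for files >1GB depends on. A custom format would swap in a different regular expression with the same group names.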

Bot Profiling

Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
Per-bot metrics: requests/day, requests/hour, unique URLs crawled
Status code distribution per bot (200, 301, 404, 500)
Crawl depth distribution
Crawl pattern analysis (time of day, days of week)
Most crawled URLs per bot
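
The per-bot metrics above could be computed along these lines. This is a sketch under assumptions, not the actual implementation: it assumes each record is a dict with `user_agent`, `path`, and `status` keys, and the names `BOT_TOKENS`, `identify_bot`, and `profile_bots` are hypothetical. Matching on a lowercase substring is deliberately simple and can produce false positives for short tokens such as `yeti`.

```python
from collections import Counter, defaultdict
from typing import Iterable, Optional

# Lowercase User-Agent tokens mapped to the bot names used in this document.
BOT_TOKENS = {
    "googlebot": "googlebot",  # also matches variants such as Googlebot-Image
    "yeti": "yeti",
    "bingbot": "bingbot",
    "daumoa": "daumoa",
    "applebot": "applebot",
    "duckduckbot": "duckduckbot",
    "baiduspider": "baiduspider",
}


def identify_bot(user_agent: str) -> Optional[str]:
    """Return the bot name for a User-Agent string, or None for non-bots."""
    lowered = user_agent.lower()
    for token, name in BOT_TOKENS.items():
        if token in lowered:
            return name
    return None


def profile_bots(records: Iterable[dict]) -> dict:
    """Aggregate request count, unique URLs, and status codes per bot."""
    raw = defaultdict(lambda: {"requests": 0, "urls": set(), "status": Counter()})
    for record in records:
        bot = identify_bot(record["user_agent"])
        if bot is None:
            continue  # ordinary visitors and unknown agents are ignored
        entry = raw[bot]
        entry["requests"] += 1
        entry["urls"].add(record["path"])
        entry["status"][str(record["status"])] += 1
    return {
        bot: {
            "requests": entry["requests"],
            "unique_urls": len(entry["urls"]),
            "status_distribution": dict(entry["status"]),
        }
        for bot, entry in raw.items()
    }
```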

Waste Detection

Parameter URLs: ?sort=, ?filter=, ?page=, ?utm_* consuming crawl budget
Redirect chains: Multiple redirects consuming crawl slots
Soft 404s: 200 status pages with error/empty content
Duplicate URLs: www/non-www, http/https, trailing slash variants
Low-value pages: Thin content pages, noindex pages being crawled
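
Two of the checks above, parameter URLs and duplicate URL variants, lend themselves to a short sketch. The function names and the `WASTE_PARAMS` set are hypothetical, and treating every `?page=` URL as waste is an assumption taken from the list above; on sites where pagination is the only path to deep content, that parameter may be legitimate.

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter names from the list above; utm_* is handled as a prefix.
WASTE_PARAMS = {"sort", "filter", "page"}
WASTE_PREFIXES = ("utm_",)


def is_parameter_waste(url: str) -> bool:
    """True if the URL carries a sorting, filtering, paging, or tracking parameter."""
    query = urlsplit(url).query
    for key, _value in parse_qsl(query, keep_blank_values=True):
        lowered = key.lower()
        if lowered in WASTE_PARAMS or lowered.startswith(WASTE_PREFIXES):
            return True
    return False


def duplicate_key(url: str) -> str:
    """Collapse www/non-www, http/https, and trailing-slash variants to one key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query
```

URLs that share a `duplicate_key` but differ as raw strings are candidates for the duplicate-URL report, for example `http://www.example.com/a/` and `https://example.com/a`.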

Orphan Page Detection

Pages in sitemap but never crawled by bots
Pages crawled but not in sitemap
Crawled pages with no internal links pointing to them
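
Each of the three orphan categories reduces to a set difference, as in the sketch below. The function name and the `crawled_without_internal_links` key are hypothetical; the first two keys follow the Output Format section. It also assumes the set of internally linked URLs comes from a separate site crawl, since access logs alone do not contain link data.

```python
from typing import Iterable


def find_orphans(
    sitemap_urls: Iterable[str],
    crawled_urls: Iterable[str],
    linked_urls: Iterable[str],
) -> dict:
    """Compare sitemap, bot-crawled, and internally linked URL sets."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    linked = set(linked_urls)
    return {
        # Listed in the sitemap, but no bot requested them in the log period.
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        # Requested by bots, but missing from the sitemap.
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
        # Requested by bots, but no internal link points at them.
        "crawled_without_internal_links": sorted(crawled - linked),
    }
```

The comparison is only meaningful if all three inputs are normalized the same way (scheme, host, trailing slash) before being passed in.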

Workflow

Step 1: Obtain Server Access Logs

Request or locate server access logs from the target site. Typical sources by server type:

Nginx: /var/log/nginx/access.log
Apache: /var/log/apache2/access.log
CloudFront: Downloaded from S3 or CloudWatch

Step 2: Parse Access Logs

python scripts/log_parser.py --log-file access.log --json
python scripts/log_parser.py --log-file access.log.gz --streaming --json
python scripts/log_parser.py --log-file access.log --bot googlebot --json

Step 3: Crawl Budget Analysis

python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json

Step 4: Cross-Reference with External Data (Optional)

Use our-seo-agent CLI or provide pre-fetched JSON via --input to compare indexed pages vs crawled pages. WebSearch can supplement with current indexing data.

Step 5: Generate Recommendations

Prioritized action items:

robots.txt optimization (block parameter URLs, low-value paths)
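URL parameter handling (canonical tags and consistent internal links; Google Search Console's URL Parameters tool was retired in 2022)
Noindex/nofollow for low-value pages (a noindex page must still be crawled for the directive to be seen, so do not also block the same URL in robots.txt)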
Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
Internal linking improvements for orphan pages
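
For the redirect chain item, one way to find the single-hop target for each redirecting URL is to follow the chain in a mapping of observed redirects. This is a sketch, not part of the reference scripts: `resolve_final_targets` is a hypothetical name, and it assumes the source-to-target mapping has already been collected (for example from a site crawl), since access logs record the 301 status but not the `Location` header.

```python
from typing import Dict, Optional


def resolve_final_targets(
    redirects: Dict[str, str], max_hops: int = 10
) -> Dict[str, Optional[str]]:
    """Map every redirecting URL to its final destination.

    A value of None marks a redirect loop or a chain longer than max_hops,
    which needs manual review instead of an automatic rewrite.
    """
    final: Dict[str, Optional[str]] = {}
    for source, target in redirects.items():
        seen = {source}
        hops = 1
        current: Optional[str] = target
        while current in redirects:
            if current in seen or hops >= max_hops:
                current = None
                break
            seen.add(current)
            current = redirects[current]
            hops += 1
        final[source] = current
    return final
```

With `{"/old": "/mid", "/mid": "/new"}` as input, both `/old` and `/mid` resolve to `/new`, so the `/old` rule can be rewritten to point straight at `/new`.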

Step 6: Report to Notion

Save the Korean-language report to the SEO Audit Log database.

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on efficiency score |
| Found Date | Date | Analysis date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |
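
The property values above can be assembled before they are handed to Notion, as in the sketch below. It builds a plain dictionary, not a Notion API payload. The function names are hypothetical, and reading `NNN` as a zero-padded three-digit sequence number is an assumption. The mapping from efficiency score to priority is not specified here, so the priority is passed in by the caller.

```python
from datetime import date


def make_audit_id(found: date, sequence: int) -> str:
    """Format an Audit ID as CRAWL-YYYYMMDD-NNN (NNN assumed zero-padded)."""
    return f"CRAWL-{found:%Y%m%d}-{sequence:03d}"


def build_report_properties(
    title: str, site_url: str, priority: str, found: date, sequence: int
) -> dict:
    """Collect the report properties under the names used in the table."""
    return {
        "Issue": title,
        "Site": site_url,
        "Category": "Crawl Budget",
        "Priority": priority,
        "Found Date": found.isoformat(),  # YYYY-MM-DD
        "Audit ID": make_audit_id(found, sequence),
    }
```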

Data Sources

| Source | Purpose |
|--------|---------|
| `our-seo-agent` CLI | Future primary data source; use `--input` for pre-fetched JSON |
| Notion MCP | Save audit report to database |
| WebSearch | Current bot documentation and best practices |

Output Format

{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 5.7
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
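
The derived fields in this example follow from simple arithmetic on the raw counts. The sketch below reproduces `avg_requests_per_day` and `pct_of_crawls`. The helper names mirror the JSON keys, but the functions themselves are hypothetical. Integer division and one-decimal rounding are assumptions inferred from the example values, not confirmed behavior of `crawl_budget_analyzer.py`.

```python
from datetime import date


def avg_requests_per_day(requests: int, start: date, end: date) -> int:
    """Average daily requests over an inclusive date range, rounded down."""
    days = (end - start).days + 1
    return requests // days


def pct_of_crawls(count: int, total_bot_requests: int) -> float:
    """Share of all bot requests, as a percentage with one decimal place."""
    return round(count / total_bot_requests * 100, 1)


period_start, period_end = date(2025, 1, 1), date(2025, 1, 31)
print(avg_requests_per_day(80000, period_start, period_end))  # 2580
print(pct_of_crawls(5000, 150000))  # 3.3
print(pct_of_crawls(2000, 150000))  # 1.3
print(pct_of_crawls(1500, 150000))  # 1.0
```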

Korean Output Example

# 크롤 예산 분석 보고서 - example.com

## 분석 기간: 2025-01-01 ~ 2025-01-31

### 봇별 크롤 현황
| 봇 | 요청 수 | 고유 URL | 일 평균 |
|----|---------|---------|---------|
| Googlebot | 80,000 | 12,000 | 2,580 |
| Yeti (Naver) | 35,000 | 8,000 | 1,129 |

### 크롤 낭비 요인
- 파라미터 URL: 5,000건 (3.3%)
- 리다이렉트 체인: 2,000건 (1.3%)
- 소프트 404: 1,500건 (1.0%)

### 효율성 점수: 72/100

Limitations

Requires actual server access logs (not available via standard web crawling)
Log format auto-detection can fail on custom formats; specify the format manually (custom regex) in that case
CloudFront standard logs use a tab-separated W3C extended format with a #Fields header, not the Nginx/Apache combined layout
Large log files (>10GB) may need pre-filtering before analysis
Bot identification relies on User-Agent strings, which can be spoofed
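
The spoofing limitation can be reduced for Googlebot by the reverse-then-forward DNS check that Google documents for verifying its crawlers. The sketch below is not part of the reference scripts: `is_verified_googlebot` is a hypothetical name, the suffix list is an assumption that should be checked against Google's current documentation, and `gethostbyname_ex` only covers IPv4. The resolver functions are parameters so that the logic can be exercised without network access.

```python
import socket
from typing import Callable

# Hostname suffixes assumed for Google crawlers; confirm against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable = socket.gethostbyaddr,
    forward_lookup: Callable = socket.gethostbyname_ex,
) -> bool:
    """Check that an IP reverse-resolves to a Google host that maps back to it."""
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Covers socket.herror and socket.gaierror (failed lookups).
        return False
```

DNS lookups are slow compared with log parsing, so in practice the check would be run once per distinct IP address and the result cached, not once per log line.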

Notion Output (Required)

All audit reports MUST be saved to the OurDigital SEO Audit Log:

Database ID: 2c8581e5-8a1e-8035-880b-e38cefc2f3ef
Category: Crawl Budget
Audit ID Format: CRAWL-YYYYMMDD-NNN
Language: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)

Reference Scripts

Located in code/scripts/:

log_parser.py — Server access log parser with bot identification
crawl_budget_analyzer.py — Crawl budget efficiency analysis
base_client.py — Shared async client utilities
