---
name: seo-crawl-budget
description: |
  Crawl budget optimization and server log analysis for search engine bots.
  Triggers: crawl budget, log analysis, bot crawling, Googlebot, crawl waste,
  orphan pages, crawl efficiency, 크롤 예산, 로그 분석, 크롤 최적화.
---

# Crawl Budget Optimizer

Analyze server access logs to identify crawl budget waste and generate optimization recommendations for search engine bots (Googlebot, Yeti/Naver, Bingbot, Daumoa/Kakao).

## Capabilities

### Log Analysis
- Parse Nginx combined, Apache combined, and CloudFront log formats
- Support for gzip/bzip2 compressed logs
- Streaming parser for files >1GB
- Date range filtering
- Custom format via regex

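As a rough illustration of the regex-based parsing listed above, the sketch below matches a single line in the Nginx/Apache combined format. The pattern and the `parse_line` helper are assumptions made for this example; they are not taken from `log_parser.py`, and the sample line is invented.

```python
import re
from datetime import datetime

# Combined format: ip identity user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict for one combined-format line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Timestamps look like 10/Jan/2025:13:55:36 +0900
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

# Invented sample line for illustration only
sample = (
    '66.249.66.1 - - [10/Jan/2025:13:55:36 +0900] '
    '"GET /products?sort=price HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
print(record["url"], record["status"])
```

Returning `None` for non-matching lines lets a caller count and skip malformed entries instead of aborting on the first one.
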
### Bot Profiling
- Identify bots by User-Agent: Googlebot (and variants), Yeti (Naver), Bingbot, Daumoa (Kakao), Applebot, DuckDuckBot, Baiduspider
- Per-bot metrics: requests/day, requests/hour, unique URLs crawled
- Status code distribution per bot (200, 301, 404, 500)
- Crawl depth distribution
- Crawl pattern analysis (time of day, days of week)
- Most crawled URLs per bot

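The per-bot metrics above can be sketched as a single pass over parsed records. This is a minimal example under stated assumptions: the substring tokens, the `identify_bot` and `profile` helpers, and the record keys (`user_agent`, `url`, `status`) are all hypothetical and may differ from what the actual scripts use.

```python
from collections import Counter, defaultdict

# Lowercase substrings looked up in the User-Agent (illustrative, not exhaustive).
# Plain substring matching can misfire on unrelated agents that contain a token.
BOT_TOKENS = ("googlebot", "yeti", "bingbot", "daumoa",
              "applebot", "duckduckbot", "baiduspider")

def identify_bot(user_agent):
    """Return the first matching bot token, or None for non-bot traffic."""
    ua = user_agent.lower()
    for token in BOT_TOKENS:
        if token in ua:
            return token
    return None

def profile(records):
    """Aggregate request count, unique URLs, and status codes per bot."""
    stats = defaultdict(lambda: {"requests": 0, "urls": set(), "status": Counter()})
    for rec in records:
        bot = identify_bot(rec["user_agent"])
        if bot is None:
            continue
        entry = stats[bot]
        entry["requests"] += 1
        entry["urls"].add(rec["url"])
        entry["status"][str(rec["status"])] += 1
    return {
        bot: {
            "requests": e["requests"],
            "unique_urls": len(e["urls"]),
            "status_distribution": dict(e["status"]),
        }
        for bot, e in stats.items()
    }
```

Because Googlebot variants such as `Googlebot-Image` contain the same token, they are grouped under `googlebot` here; separating them would need more specific tokens.
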
### Waste Detection
- **Parameter URLs**: `?sort=`, `?filter=`, `?page=`, `?utm_*` consuming crawl budget
- **Redirect chains**: Multiple redirects consuming crawl slots
- **Soft 404s**: 200 status pages with error/empty content
- **Duplicate URLs**: www/non-www, http/https, trailing slash variants
- **Low-value pages**: Thin content pages, noindex pages being crawled

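Two of the checks above, parameter URLs and duplicate URL variants, can be approximated from the URL alone. The sketch below assumes the parameter names listed in this section; the `has_waste_params` and `canonical_key` helpers are hypothetical. Soft 404s and thin content are not covered because they require the response body, which access logs do not contain.

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter names taken from the list above; utm_* is handled as a prefix.
WASTE_PARAMS = ("sort", "filter", "page")

def has_waste_params(url):
    """True if the query string contains sort/filter/page or any utm_* key."""
    query = urlsplit(url).query
    keys = [key.lower() for key, _ in parse_qsl(query, keep_blank_values=True)]
    return any(key in WASTE_PARAMS or key.startswith("utm_") for key in keys)

def canonical_key(url):
    """Collapse www/non-www, http/https, and trailing-slash variants to one key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host + path + ("?" + parts.query if parts.query else "")
```

Grouping crawled URLs by `canonical_key` and reporting groups with more than one member gives a first-pass list of duplicate variants.
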
### Orphan Page Detection
- Pages in sitemap but never crawled by bots
- Pages crawled but not in sitemap
- Crawled pages with no internal links pointing to them

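The three orphan checks above reduce to set differences once the URL lists are available. A minimal sketch, assuming the sitemap, crawl, and internal-link URL lists have already been normalized to the same form; `find_orphans` and the `crawled_without_internal_links` key are hypothetical names, while the other two keys mirror the sample output in this document.

```python
def find_orphans(sitemap_urls, crawled_urls, linked_urls=None):
    """Compare sitemap, crawled, and (optionally) internally linked URL sets."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    result = {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }
    if linked_urls is not None:
        # Hypothetical key: crawled pages that no internal link points to
        result["crawled_without_internal_links"] = sorted(crawled - set(linked_urls))
    return result

print(find_orphans(["/a", "/b"], ["/b", "/c"], linked_urls=["/b"]))
```

The link check is optional because internal-link data comes from a site crawl, not from the access logs themselves.
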
## Workflow

### Step 1: Obtain Server Access Logs
Request or locate server access logs for the target site. Typical locations for the supported formats:
- Nginx: `/var/log/nginx/access.log`
- Apache: `/var/log/apache2/access.log` (Debian/Ubuntu default)
- CloudFront: Downloaded from S3 or CloudWatch

### Step 2: Parse Access Logs
```bash
# Parse a plain-text log and emit JSON
python scripts/log_parser.py --log-file access.log --json
# Parse a gzip-compressed log with the streaming parser
python scripts/log_parser.py --log-file access.log.gz --streaming --json
# Restrict the output to Googlebot requests
python scripts/log_parser.py --log-file access.log --bot googlebot --json
```

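For context on what streaming a compressed log involves, the sketch below opens plain, gzip, or bzip2 files as text and yields one line at a time. It is an assumption about how such a reader could work, not the implementation behind `--streaming`; `open_log` and `iter_lines` are hypothetical names.

```python
import bz2
import gzip

def open_log(path):
    """Open a plain, gzip, or bzip2 log as a text stream, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")

def iter_lines(path):
    """Yield lines one at a time so memory use stays flat for large files."""
    with open_log(path) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Using `errors="replace"` keeps a single badly encoded request line from stopping the whole run.
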
### Step 3: Crawl Budget Analysis
```bash
# Full analysis, cross-referenced against the sitemap
python scripts/crawl_budget_analyzer.py --log-file access.log --sitemap https://example.com/sitemap.xml --json
# Waste detection only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope waste --json
# Orphan page detection only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope orphans --json
# Per-bot profiling only
python scripts/crawl_budget_analyzer.py --log-file access.log --scope bots --json
```

### Step 4: Cross-Reference with External Data (Optional)
Use the `our-seo-agent` CLI, or supply pre-fetched JSON via `--input`, to compare indexed pages against crawled pages. WebSearch can supplement this with current indexing data.

### Step 5: Generate Recommendations
Prioritized action items:
1. robots.txt optimization (block parameter URLs, low-value paths)
2. URL parameter handling (canonical tags and consistent internal links; the Google Search Console URL Parameters tool was retired in 2022)
3. Noindex/nofollow for low-value pages (a page must stay crawlable for its `noindex` to be seen, so do not also block it in robots.txt)
4. Redirect chain resolution (reduce 301 → 301 → 200 to 301 → 200)
5. Internal linking improvements for orphan pages

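The first recommendation can be drafted mechanically from the parameter names found during waste detection. The sketch below is one possible way to generate candidate rules, assuming wildcard (`*`) support in robots.txt as documented by Google; `robots_rules` is a hypothetical helper, and its output should be reviewed by hand before deployment.

```python
def robots_rules(param_names, blocked_paths=(), user_agent="*"):
    """Build candidate robots.txt lines that block parameter URLs and given paths."""
    lines = [f"User-agent: {user_agent}"]
    for name in param_names:
        # Cover the parameter both as the first (?name=) and a later (&name=) key
        lines.append(f"Disallow: /*?{name}=")
        lines.append(f"Disallow: /*&{name}=")
    for path in blocked_paths:
        lines.append(f"Disallow: {path}")
    return "\n".join(lines)

print(robots_rules(["sort", "filter"], blocked_paths=["/search/"]))
```

Blocking a URL pattern stops crawlers from fetching it at all, so it should not be combined with `noindex` or canonical tags on the same pages.
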
### Step 6: Report to Notion
Save the Korean-language report to the SEO Audit Log database.

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Audited website URL |
| Category | Select | Crawl Budget |
| Priority | Select | Based on efficiency score |
| Found Date | Date | Analysis date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: CRAWL-YYYYMMDD-NNN |

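The Audit ID format in the table can be produced with standard date formatting. In this sketch, `make_audit_id` is a hypothetical helper, and treating `NNN` as a zero-padded per-day sequence number is an assumption, since the document only gives the pattern.

```python
from datetime import date

def make_audit_id(found_date, sequence):
    """Format an ID as CRAWL-YYYYMMDD-NNN (NNN assumed to be a 1-999 counter)."""
    if not 1 <= sequence <= 999:
        raise ValueError("sequence must be between 1 and 999")
    return f"CRAWL-{found_date:%Y%m%d}-{sequence:03d}"

print(make_audit_id(date(2025, 1, 31), 1))  # CRAWL-20250131-001
```
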
## Data Sources

| Source | Purpose |
|--------|---------|
| `our-seo-agent` CLI | Future primary data source; use `--input` for pre-fetched JSON |
| Notion MCP | Save audit report to database |
| WebSearch | Current bot documentation and best practices |

## Output Format

```json
{
  "log_file": "access.log",
  "analysis_period": {"from": "2025-01-01", "to": "2025-01-31"},
  "total_bot_requests": 150000,
  "bots": {
    "googlebot": {
      "requests": 80000,
      "unique_urls": 12000,
      "avg_requests_per_day": 2580,
      "status_distribution": {"200": 70000, "301": 5000, "404": 3000, "500": 2000}
    },
    "yeti": {"requests": 35000},
    "bingbot": {"requests": 20000},
    "daumoa": {"requests": 15000}
  },
  "waste": {
    "parameter_urls": {"count": 5000, "pct_of_crawls": 3.3},
    "redirect_chains": {"count": 2000, "pct_of_crawls": 1.3},
    "soft_404s": {"count": 1500, "pct_of_crawls": 1.0},
    "total_waste_pct": 8.5
  },
  "orphan_pages": {
    "in_sitemap_not_crawled": [],
    "crawled_not_in_sitemap": []
  },
  "recommendations": [],
  "efficiency_score": 72,
  "timestamp": "2025-01-01T00:00:00"
}
```

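The `pct_of_crawls` values in the sample are consistent with each category's count divided by `total_bot_requests`. The sketch below reproduces that arithmetic; `waste_summary` is a hypothetical helper. The formulas behind `total_waste_pct` and `efficiency_score` are not specified in this document, so they are not reproduced here.

```python
def waste_summary(counts, total_bot_requests):
    """Attach a percentage of all bot requests to each waste category count."""
    summary = {}
    for name, count in counts.items():
        pct = round(count / total_bot_requests * 100, 1) if total_bot_requests else 0.0
        summary[name] = {"count": count, "pct_of_crawls": pct}
    return summary

sample_counts = {"parameter_urls": 5000, "redirect_chains": 2000, "soft_404s": 1500}
print(waste_summary(sample_counts, 150000))
```
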
## Korean Output Example

```
# 크롤 예산 분석 보고서 - example.com

## 분석 기간: 2025-01-01 ~ 2025-01-31

### 봇별 크롤 현황
| 봇 | 요청 수 | 고유 URL | 일 평균 |
|----|---------|---------|---------|
| Googlebot | 80,000 | 12,000 | 2,580 |
| Yeti (Naver) | 35,000 | 8,000 | 1,129 |

### 크롤 낭비 요인
- 파라미터 URL: 5,000건 (3.3%)
- 리다이렉트 체인: 2,000건 (1.3%)
- 소프트 404: 1,500건 (1.0%)

### 효율성 점수: 72/100
```

## Limitations

- Requires actual server access logs (not available via standard web crawling)
- Log format auto-detection can fail on custom formats; specify the format manually (custom regex) in that case
- CloudFront logs have a different field structure than Nginx/Apache
- Large log files (>10GB) may need pre-filtering before analysis
- Bot identification relies on User-Agent strings, which can be spoofed

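One common mitigation for spoofed User-Agent strings is the reverse-then-forward DNS check that Google documents for verifying Googlebot. The sketch below is a hedged illustration and not part of the scripts described here: `verify_crawler_ip` is a hypothetical helper, the suffix list covers only two Google domains, and the lookups are injectable so the logic can be exercised without network access.

```python
import socket

# Illustrative suffixes for Googlebot; other bots publish their own domains.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def verify_crawler_ip(ip, suffixes=TRUSTED_SUFFIXES,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.lower().endswith(suffixes):
            return False
        # The forward lookup must map the hostname back to the same IP
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror)
        return False
```

DNS lookups are slow compared with log parsing, so in practice the result would be cached per IP. `socket.gethostbyname_ex` resolves IPv4 only, so IPv6 crawler addresses would need a different forward lookup.
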
## Notion Output (Required)

All audit reports MUST be saved to the OurDigital SEO Audit Log:
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **Category**: Crawl Budget
- **Audit ID Format**: CRAWL-YYYYMMDD-NNN
- **Language**: Korean with technical English terms (Crawl Budget, Googlebot, robots.txt)

## Reference Scripts

Located in `code/scripts/`:
- `log_parser.py`: Server access log parser with bot identification
- `crawl_budget_analyzer.py`: Crawl budget efficiency analysis
- `base_client.py`: Shared async client utilities