# CLAUDE.md

## Overview

Technical SEO auditor for crawlability fundamentals: robots.txt validation, XML sitemap analysis, and URL accessibility checking.

## Quick Start

```bash
# Install dependencies
pip install -r scripts/requirements.txt

# Robots.txt analysis
python scripts/robots_checker.py --url https://example.com

# Sitemap validation
python scripts/sitemap_validator.py --url https://example.com/sitemap.xml

# Async URL crawl (check accessibility of sitemap URLs)
python scripts/sitemap_crawler.py --sitemap https://example.com/sitemap.xml
```

## Scripts

| Script | Purpose | Key Output |
|--------|---------|------------|
| `robots_checker.py` | Parse and validate robots.txt | User-agent rules, disallow patterns, sitemap declarations |
| `sitemap_validator.py` | Validate XML sitemap structure | URL count, lastmod dates, size limits, syntax errors |
| `sitemap_crawler.py` | Asynchronously check URL accessibility | HTTP status codes, response times, broken links |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |

## Robots.txt Checker

```bash
# Basic analysis
python scripts/robots_checker.py --url https://example.com

# Test a specific URL against the rules
python scripts/robots_checker.py --url https://example.com --test-url /admin/

# Output JSON
python scripts/robots_checker.py --url https://example.com --json
```

**Checks performed**:
- Syntax validation
- User-agent rule parsing
- Disallow/Allow pattern analysis
- Sitemap declarations
- Critical resource access (CSS/JS/images)

## Sitemap Validator

```bash
# Validate sitemap
python scripts/sitemap_validator.py --url https://example.com/sitemap.xml

# Include sitemap index parsing
python scripts/sitemap_validator.py --url https://example.com/sitemap_index.xml --follow-index
```

**Validation rules**:
- XML syntax correctness
- URL count limit (50,000 max per sitemap)
- File size limit (50 MB max uncompressed)
- Lastmod date format validation
- Sitemap index structure

## Sitemap Crawler

```bash
# Crawl all URLs in the sitemap
python scripts/sitemap_crawler.py --sitemap https://example.com/sitemap.xml

# Limit concurrent requests
python scripts/sitemap_crawler.py --sitemap https://example.com/sitemap.xml --concurrency 10

# Sample mode (check a subset of URLs)
python scripts/sitemap_crawler.py --sitemap https://example.com/sitemap.xml --sample 100
```

**Output includes**:
- HTTP status codes per URL
- Response times
- Redirect chains
- Broken links (4xx, 5xx)

## Output Format

All scripts support a `--json` flag for structured output:

```json
{
  "url": "https://example.com",
  "status": "valid|invalid|warning",
  "issues": [
    {
      "type": "error|warning|info",
      "message": "Description",
      "location": "Line or URL"
    }
  ],
  "summary": {}
}
```

## Common Issues Detected

| Category | Issue | Severity |
|----------|-------|----------|
| Robots.txt | Missing sitemap declaration | Medium |
| Robots.txt | Blocking CSS/JS resources | High |
| Robots.txt | Overly broad disallow rules | Medium |
| Sitemap | URLs returning 404 | High |
| Sitemap | Missing lastmod dates | Low |
| Sitemap | Exceeds 50,000 URL limit | High |
| Sitemap | Non-canonical URLs included | Medium |

## Configuration

Environment variables (optional):

```bash
# Rate limiting
CRAWL_DELAY=1.0      # Seconds between requests
MAX_CONCURRENT=20    # Async concurrency limit
REQUEST_TIMEOUT=30   # Request timeout in seconds
```
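
The sketch below shows roughly how these settings feed an async accessibility check like the one `sitemap_crawler.py` performs. It is an illustration, not the actual `base_client.py` implementation: the `check_url`/`check_urls` helpers are hypothetical, the use of `aiohttp` and `os.getenv` with the documented defaults is assumed, and the real ConfigManager/RateLimiter may work differently.

```python
import asyncio
import os

import aiohttp  # assumed to be listed in scripts/requirements.txt

# Assumed mapping of the documented environment variables to runtime settings.
CRAWL_DELAY = float(os.getenv("CRAWL_DELAY", "1.0"))       # seconds between requests
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT", "20"))     # async concurrency limit
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))   # per-request timeout, seconds


async def check_url(session: aiohttp.ClientSession,
                    semaphore: asyncio.Semaphore, url: str) -> tuple[str, int]:
    """Fetch one URL under the concurrency limit and return (url, status)."""
    async with semaphore:
        async with session.get(url, allow_redirects=True) as resp:
            status = resp.status
        await asyncio.sleep(CRAWL_DELAY)  # politeness delay between requests
    return url, status


async def check_urls(urls: list[str]) -> list[tuple[str, int]]:
    """Check a batch of URLs concurrently, bounded by MAX_CONCURRENT."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(check_url(session, semaphore, u) for u in urls))


if __name__ == "__main__":
    for url, status in asyncio.run(check_urls(["https://example.com/"])):
        print(status, url)
```

The semaphore caps in-flight requests while the per-request sleep spaces them out, which is the usual way these two settings interact in an async crawler.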
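
Because the `--json` reports follow the schema shown under Output Format, they are easy to post-process. A minimal sketch, assuming the report is written to stdout (not stated above) and using a hypothetical `run_robots_audit` helper that is not part of the toolkit:

```python
import json
import subprocess
import sys


def run_robots_audit(site: str) -> dict:
    """Run robots_checker.py with the documented --json flag and parse the report.

    Assumes the JSON report is printed to stdout; the report shape follows the
    Output Format section: url, status, issues, summary.
    """
    proc = subprocess.run(
        [sys.executable, "scripts/robots_checker.py", "--url", site, "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    report = run_robots_audit("https://example.com")
    errors = [i for i in report.get("issues", []) if i.get("type") == "error"]
    print(f"{report['url']}: {report['status']}, {len(errors)} error(s)")
    for issue in errors:
        print(f"  - {issue['message']} ({issue.get('location', 'n/a')})")
```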