directory changes and restructuring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-22 02:01:41 +09:00
parent eea49f9f8c
commit 236be6c580
598 changed files with 0 additions and 0 deletions

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Skill Overview
**ourdigital-seo-audit** is a comprehensive SEO audit skill that performs technical SEO analysis, schema validation, sitemap/robots.txt checks, and Core Web Vitals measurement. Results are exported to a Notion database.
## Architecture
```
12-ourdigital-seo-audit/
├── SKILL.md # Skill definition with YAML frontmatter
├── scripts/ # Python automation scripts
│ ├── base_client.py # Shared utilities: RateLimiter, ConfigManager
│ ├── full_audit.py # Main orchestrator (SEOAuditor class)
│ ├── gsc_client.py # Google Search Console API
│ ├── pagespeed_client.py # PageSpeed Insights API
│ ├── schema_validator.py # JSON-LD/Microdata extraction & validation
│ ├── schema_generator.py # Generate schema markup from templates
│ ├── sitemap_validator.py # XML sitemap validation
│ ├── sitemap_crawler.py # Async sitemap URL crawler
│ ├── robots_checker.py # Robots.txt parser & rule tester
│ ├── page_analyzer.py # On-page SEO analysis
│ └── notion_reporter.py # Notion database integration
├── templates/
│ ├── schema_templates/ # JSON-LD templates (article, faq, product, etc.)
│ └── notion_database_schema.json
├── reference.md # API documentation
└── USER_GUIDE.md # End-user documentation
```
## Script Relationships
```
full_audit.py (orchestrator)
├── robots_checker.py → RobotsChecker.analyze()
├── sitemap_validator.py → SitemapValidator.validate()
├── schema_validator.py → SchemaValidator.validate()
├── pagespeed_client.py → PageSpeedClient.analyze()
└── notion_reporter.py → NotionReporter.create_audit_report()
All scripts use:
└── base_client.py → ConfigManager (credentials), RateLimiter, BaseAsyncClient
```
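The shared `RateLimiter` in `base_client.py` is not reproduced here; as an illustration only, a minimal sliding-window sketch of the pattern such a class typically implements (hypothetical names, not the skill's actual code) might look like:

```python
import asyncio
import time

class RateLimiter:
    """Hypothetical sketch: allow at most `rate` calls per `period` seconds."""
    def __init__(self, rate: int, period: float = 1.0):
        self.rate = rate
        self.period = period
        self._calls: list[float] = []  # timestamps of recent calls
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window
            self._calls = [t for t in self._calls if now - t < self.period]
            if len(self._calls) >= self.rate:
                # Wait until the oldest call leaves the window
                await asyncio.sleep(self.period - (now - self._calls[0]))
            self._calls.append(time.monotonic())

async def demo() -> int:
    limiter = RateLimiter(rate=3, period=1.0)
    for _ in range(3):
        await limiter.acquire()
    return len(limiter._calls)
```

Each client would call `await limiter.acquire()` before issuing a request, so bursts are smoothed to the configured rate.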
## Common Commands
### Install Dependencies
```bash
pip install -r scripts/requirements.txt
```
### Run Full SEO Audit
```bash
python scripts/full_audit.py --url https://example.com --output console
python scripts/full_audit.py --url https://example.com --output notion
python scripts/full_audit.py --url https://example.com --json
```
### Individual Script Usage
```bash
# Robots.txt analysis
python scripts/robots_checker.py --url https://example.com
# Sitemap validation
python scripts/sitemap_validator.py --url https://example.com/sitemap.xml
# Schema validation
python scripts/schema_validator.py --url https://example.com
# Schema generation
python scripts/schema_generator.py --type organization --url https://example.com
# Core Web Vitals
python scripts/pagespeed_client.py --url https://example.com --strategy mobile
# Search Console data
python scripts/gsc_client.py --site sc-domain:example.com --action summary
```
## Key Classes and Data Flow
### AuditResult (full_audit.py)
Central dataclass holding all audit findings:
- `robots`, `sitemap`, `schema`, `performance` - Raw results from each checker
- `findings: list[SEOFinding]` - Normalized issues for Notion export
- `summary` - Aggregated statistics
### SEOFinding (notion_reporter.py)
Standard format for all audit issues:
```python
@dataclass
class SEOFinding:
issue: str # Issue title
category: str # Technical SEO, Performance, Schema, etc.
priority: str # Critical, High, Medium, Low
url: str | None # Affected URL
recommendation: str # How to fix
audit_id: str # Groups findings from same session
```
### NotionReporter
Creates findings in Notion with two modes:
1. Individual pages per finding in default database
2. Summary page with checklist table via `create_audit_report()`
Default database: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
## Google API Configuration
**Service Account**: `~/.credential/ourdigital-seo-agent.json`
| API | Authentication | Usage |
|-----|---------------|-------|
| Search Console | Service account | `gsc_client.py` |
| PageSpeed Insights | API key (`PAGESPEED_API_KEY`) | `pagespeed_client.py` |
| GA4 Analytics | Service account | Traffic data |
Environment variables are loaded from `~/Workspaces/claude-workspace/.env`.
## MCP Tool Integration
The skill uses MCP tools as primary data sources (Tier 1):
- `mcp__firecrawl__scrape/crawl` - Web page content extraction
- `mcp__perplexity__search` - Competitor research
- `mcp__notion__*` - Database operations
Python scripts are Tier 2 for Google API data collection.
## Extending the Skill
### Adding a New Schema Type
1. Add JSON template to `templates/schema_templates/`
2. Update `REQUIRED_PROPERTIES` and `RECOMMENDED_PROPERTIES` in `schema_validator.py`
3. Add type-specific validation in `_validate_type_specific()`
### Adding a New Audit Check
1. Create checker class following pattern in existing scripts
2. Return dataclass with `to_dict()` method and `issues` list
3. Add processing method in `SEOAuditor` (`_process_*_findings`)
4. Wire into `run_audit()` in `full_audit.py`
## Rate Limits
| Service | Limit | Handled By |
|---------|-------|------------|
| Firecrawl | Per plan | MCP |
| PageSpeed | 25,000/day | `base_client.py` RateLimiter |
| Search Console | 1,200/min | Manual delays |
| Notion | 3 req/sec | Semaphore in reporter |

View File

@@ -0,0 +1,330 @@
---
name: ourdigital-seo-audit
description: Comprehensive SEO audit skill for technical SEO, on-page optimization, content analysis, local SEO, Core Web Vitals assessment, schema markup generation/validation, sitemap validation, and robots.txt analysis. Use when user asks for SEO audit, website analysis, search performance review, schema markup, structured data, sitemap check, robots.txt analysis, or optimization recommendations. Activates for keywords like SEO, audit, search console, rankings, crawlability, indexing, meta tags, Core Web Vitals, local SEO, schema, structured data, sitemap, robots.txt.
allowed-tools: mcp__firecrawl__*, mcp__perplexity__*, mcp__notion__*, mcp__google-drive__*, mcp__memory__*, Read, Write, Edit, Bash(python:*), Bash(pip:*)
---
# OurDigital SEO Audit Skill
## Purpose
Comprehensive SEO audit capability for:
- Technical SEO analysis (crawlability, indexing, site structure)
- On-page SEO optimization (meta tags, headings, content)
- Content quality assessment
- Local SEO evaluation
- Core Web Vitals performance
- Schema markup generation and validation
- XML sitemap validation
- Robots.txt analysis
## Execution Strategy: Three-Tier Approach
Always follow this priority order:
### Tier 1: MCP Tools (Primary)
Use built-in MCP tools first for real-time analysis:
| Tool | Purpose |
|------|---------|
| `mcp__firecrawl__scrape` | Scrape page content and structure |
| `mcp__firecrawl__crawl` | Crawl entire website |
| `mcp__firecrawl__extract` | Extract structured data |
| `mcp__perplexity__search` | Research competitors, best practices |
| `mcp__notion__create-database` | Create findings database |
| `mcp__notion__create-page` | Add audit findings |
| `mcp__google-drive__search` | Access Sheets for output |
| `mcp__memory__create_entities` | Track audit state |
### Tier 2: Python Scripts (Data Collection)
For Google API data and specialized analysis:
- `gsc_client.py` - Search Console performance data
- `pagespeed_client.py` - Core Web Vitals metrics
- `ga4_client.py` - Traffic and user behavior
- `schema_validator.py` - Validate structured data
- `sitemap_validator.py` - Validate XML sitemaps
- `robots_checker.py` - Analyze robots.txt
### Tier 3: Manual Fallback
For data requiring special access:
- Export data for offline analysis
- Manual GBP data entry (API requires enterprise approval)
- Third-party tool integration
## Google API Configuration
### Service Account Credentials
The skill uses `ourdigital-seo-agent` service account for authenticated APIs:
```
Credentials: ~/.credential/ourdigital-seo-agent.json
Service Account: ourdigital-seo-agent@ourdigital-insights.iam.gserviceaccount.com
Project: ourdigital-insights
```
### API Status & Configuration
| API | Status | Authentication | Notes |
|-----|--------|----------------|-------|
| Search Console | **WORKING** | Service account | Domain: sc-domain:ourdigital.org |
| PageSpeed Insights | **WORKING** | API key | Higher quotas with key |
| Analytics Data (GA4) | **WORKING** | Service account | Properties: Lab, Journal, Blog |
| Google Trends | **WORKING** | None (pytrends) | No auth required |
| Custom Search JSON | **WORKING** | API key | cx: e5f27994f2bab4bf2 |
| Knowledge Graph | **WORKING** | API key | Entity search |
| Google Sheets | **WORKING** | Service account | Share sheet with service account |
### Environment Variables (Configured)
Located in `~/Workspaces/claude-workspace/.env`:
```bash
# Google Service Account (auto-detected)
# ~/.credential/ourdigital-seo-agent.json
# Google API Key (PageSpeed, Custom Search, Knowledge Graph)
GOOGLE_API_KEY=your-api-key
PAGESPEED_API_KEY=your-api-key
CUSTOM_SEARCH_API_KEY=your-api-key
CUSTOM_SEARCH_ENGINE_ID=e5f27994f2bab4bf2
```
### Enabled APIs in Google Cloud Console (ourdigital-insights)
- Search Console API
- PageSpeed Insights API
- Google Analytics Admin API
- Google Analytics Data API
- Custom Search API
- Knowledge Graph Search API
## Audit Categories
### 1. Technical SEO
- HTTPS/SSL implementation
- Canonical URL setup
- Redirect chains/loops
- 404 error pages
- Server response times
- Mobile-friendliness
- Crawlability assessment
- Hreflang tags
### 2. On-page SEO
- Title tags (length, uniqueness, keywords)
- Meta descriptions
- Heading hierarchy (H1-H6)
- Image alt attributes
- Internal linking structure
- URL structure
- Open Graph / Twitter Card tags
### 3. Content SEO
- Content quality assessment
- Thin content identification
- Duplicate content detection
- Keyword relevance
- Content freshness
- E-E-A-T signals
### 4. Local SEO
- Google Business Profile optimization
- NAP consistency
- Local citations
- Review management
- LocalBusiness schema markup
### 5. Core Web Vitals
- Largest Contentful Paint (LCP) < 2.5s
- First Input Delay (FID) < 100ms (deprecated in favor of INP)
- Cumulative Layout Shift (CLS) < 0.1
- Interaction to Next Paint (INP) < 200ms
- Time to First Byte (TTFB)
- First Contentful Paint (FCP)
### 6. Schema/Structured Data
- Extract existing schema (JSON-LD, Microdata, RDFa)
- Validate against schema.org vocabulary
- Check Google Rich Results compatibility
- Generate missing schema markup
- Support: Organization, LocalBusiness, Product, Article, FAQ, Breadcrumb, WebSite
### 7. Sitemap Validation
- XML syntax validation
- URL accessibility (HTTP status)
- URL count limits (50,000 max)
- File size limits (50MB max)
- Lastmod dates validity
- Index sitemap structure
### 8. Robots.txt Analysis
- Syntax validation
- User-agent rules review
- Disallow/Allow patterns
- Sitemap declarations
- Critical resources access
- URL testing against rules
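URL testing against rules can be sketched with the standard library's parser (the skill's `robots_checker.py` may implement its own matching; this is only an illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() takes the file's lines directly, so no network fetch is needed
rp.parse("""\
User-agent: *
Allow: /admin/help
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
""".splitlines())

blocked = not rp.can_fetch("Googlebot", "https://example.com/admin/users")
allowed = rp.can_fetch("Googlebot", "https://example.com/admin/help")
sitemaps = rp.site_maps()  # list of declared sitemap URLs
```

Note that Python's parser applies rules in file order (first match wins), so the more specific `Allow` line is placed before the broader `Disallow`.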
## Report Output
### Default Notion Database
All SEO audit findings are stored in the centralized **OurDigital SEO Audit Log**:
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **URL**: https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef
### Notion Database Schema
**Database Properties (Metadata)**
| Property | Type | Values | Description |
|----------|------|--------|-------------|
| Issue | Title | Issue description | Primary identifier |
| Site | URL | Website URL | Audited site (e.g., https://blog.ourdigital.org) |
| Category | Select | Technical SEO, On-page SEO, Content, Local SEO, Performance, Schema/Structured Data, Sitemap, Robots.txt | Issue classification |
| Priority | Select | Critical, High, Medium, Low | Fix priority |
| Status | Status | Not started, In progress, Done | Tracking status |
| URL | URL | Affected URL | Specific page with issue |
| Found Date | Date | Discovery date | When issue was found |
| Audit ID | Rich Text | Audit identifier | Groups findings from same audit session |
**Page Content Template**
Each finding page contains structured content blocks:
```
## Description
[Detailed explanation of the issue]
## Impact
⚠️ [Business/ranking impact callout]
## Recommendation
💡 [Actionable solution callout]
```
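Mapping a finding onto the database properties above can be sketched as a Notion `create page` payload. This is an illustrative mapping, not the reporter's actual code; property names must match the Notion schema exactly:

```python
from datetime import date

DATABASE_ID = "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"

def finding_to_notion_payload(finding: dict) -> dict:
    """Build a Notion pages.create payload for one finding (sketch)."""
    return {
        "parent": {"database_id": DATABASE_ID},
        "properties": {
            "Issue": {"title": [{"text": {"content": finding["issue"]}}]},
            "Site": {"url": finding["site"]},
            "Category": {"select": {"name": finding["category"]}},
            "Priority": {"select": {"name": finding["priority"]}},
            "Status": {"status": {"name": "Not started"}},
            "URL": {"url": finding.get("url")},
            "Found Date": {"date": {"start": date.today().isoformat()}},
            "Audit ID": {"rich_text": [{"text": {"content": finding["audit_id"]}}]},
        },
    }

payload = finding_to_notion_payload({
    "issue": "Missing H1",
    "site": "https://example.com",
    "category": "On-page SEO",
    "priority": "High",
    "url": "https://example.com/",
    "audit_id": "abc123",
})
```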
### Report Categories
1. **Required Actions** (Critical/High Priority)
- Security issues, indexing blocks, major errors
2. **Quick Wins** (Easy fixes with high impact)
- Missing meta tags, schema markup, image optimization
3. **Further Investigation**
- Complex issues needing deeper analysis
4. **Items to Monitor**
- Performance metrics, ranking changes, crawl stats
## Operational Guidelines
### Before Any Audit
1. **Gather context**: Ask for target URL, business type, priorities
2. **Check access**: Verify MCP tools are available
3. **Set scope**: Full site vs specific pages
### During Audit
1. Use Firecrawl for initial site analysis
2. Run Python scripts for Google API data
3. Validate schema, sitemap, robots.txt
4. Document findings in Notion
### Rate Limits
| Service | Limit | Strategy |
|---------|-------|----------|
| Firecrawl | Per plan | Use crawl for site-wide |
| PageSpeed | 25,000/day | Batch critical pages |
| Search Console | 1,200/min | Use async with delays |
| Notion | 3 req/sec | Implement semaphore |
## Quick Commands
### Full Site Audit
```
Perform a comprehensive SEO audit for [URL]
```
### Technical SEO Check
```
Check technical SEO for [URL] including crawlability and indexing
```
### Schema Generation
```
Generate [type] schema markup for [URL/content]
```
### Schema Validation
```
Validate existing schema markup on [URL]
```
### Sitemap Check
```
Validate the sitemap at [sitemap URL]
```
### Robots.txt Analysis
```
Analyze robots.txt for [domain]
```
### Core Web Vitals
```
Check Core Web Vitals for [URL]
```
### Local SEO Assessment
```
Perform local SEO audit for [business name] in [location]
```
## Script Usage
### Schema Generator
```bash
python scripts/schema_generator.py --type organization --url https://example.com
```
### Schema Validator
```bash
python scripts/schema_validator.py --url https://example.com
```
### Sitemap Validator
```bash
python scripts/sitemap_validator.py --url https://example.com/sitemap.xml
```
### Robots.txt Checker
```bash
python scripts/robots_checker.py --url https://example.com/robots.txt
```
### Full Audit
```bash
python scripts/full_audit.py --url https://example.com --output notion
```
## Limitations
- Google Business Profile API requires enterprise approval
- Some competitive analysis limited to public data
- Large sites (10,000+ pages) require extended crawl time
- Real-time ranking data requires third-party tools
## Related Resources
- `reference.md` - Detailed API documentation
- `examples.md` - Usage examples
- `templates/` - Schema and report templates
- `scripts/` - Python automation scripts

# OurDigital SEO Audit - User Guide
## Table of Contents
1. [Overview](#overview)
2. [Prerequisites](#prerequisites)
3. [Quick Start](#quick-start)
4. [Features](#features)
5. [Usage Examples](#usage-examples)
6. [API Reference](#api-reference)
7. [Notion Integration](#notion-integration)
8. [Troubleshooting](#troubleshooting)
---
## Overview
The **ourdigital-seo-audit** skill is a comprehensive SEO audit tool for Claude Code that:
- Performs technical SEO analysis
- Validates schema markup (JSON-LD, Microdata, RDFa)
- Checks sitemap and robots.txt configuration
- Measures Core Web Vitals performance
- Integrates with Google APIs (Search Console, Analytics, PageSpeed)
- Exports findings to Notion database
### Supported Audit Categories
| Category | Description |
|----------|-------------|
| Technical SEO | HTTPS, canonicals, redirects, crawlability |
| On-page SEO | Meta tags, headings, images, internal links |
| Content | Quality, duplication, freshness, E-E-A-T |
| Local SEO | NAP consistency, citations, LocalBusiness schema |
| Performance | Core Web Vitals and lab metrics (LCP, CLS, TBT, FCP) |
| Schema/Structured Data | JSON-LD validation, rich results eligibility |
| Sitemap | XML validation, URL accessibility |
| Robots.txt | Directive analysis, blocking issues |
---
## Prerequisites
### 1. Python Environment
```bash
# Install required packages
cd ~/.claude/skills/ourdigital-seo-audit/scripts
pip install -r requirements.txt
```
### 2. Google Cloud Service Account
The skill uses a service account for authenticated APIs:
```
File: ~/.credential/ourdigital-seo-agent.json
Service Account: ourdigital-seo-agent@ourdigital-insights.iam.gserviceaccount.com
Project: ourdigital-insights
```
**Required permissions:**
- Search Console: Add service account email as user in [Search Console](https://search.google.com/search-console)
- GA4: Add service account as Viewer in [Google Analytics](https://analytics.google.com)
### 3. API Keys
Located in `~/Workspaces/claude-workspace/.env`:
```bash
# Google API Key (for PageSpeed, Custom Search, Knowledge Graph)
GOOGLE_API_KEY=your-api-key
PAGESPEED_API_KEY=your-api-key
CUSTOM_SEARCH_API_KEY=your-api-key
CUSTOM_SEARCH_ENGINE_ID=your-cx-id
```
### 4. Enabled Google Cloud APIs
Enable these APIs in [Google Cloud Console](https://console.cloud.google.com/apis/library):
- Search Console API
- PageSpeed Insights API
- Google Analytics Admin API
- Google Analytics Data API
- Custom Search API
- Knowledge Graph Search API
---
## Quick Start
### Run a Full SEO Audit
```
Perform a comprehensive SEO audit for https://example.com
```
### Check Core Web Vitals
```
Check Core Web Vitals for https://example.com
```
### Validate Schema Markup
```
Validate schema markup on https://example.com
```
### Check Sitemap
```
Validate the sitemap at https://example.com/sitemap.xml
```
### Analyze Robots.txt
```
Analyze robots.txt for example.com
```
---
## Features
### 1. Technical SEO Analysis
Checks for:
- HTTPS/SSL implementation
- Canonical URL configuration
- Redirect chains and loops
- 404 error pages
- Mobile-friendliness
- Hreflang tags (international sites)
**Command:**
```
Check technical SEO for https://example.com
```
### 2. Schema Markup Validation
Extracts and validates structured data:
- JSON-LD (recommended)
- Microdata
- RDFa
Supported schema types:
- Organization / LocalBusiness
- Product / Offer
- Article / BlogPosting
- FAQPage / HowTo
- BreadcrumbList
- WebSite / WebPage
**Command:**
```
Validate existing schema markup on https://example.com
```
### 3. Schema Generation
Generate JSON-LD markup for your pages:
**Command:**
```
Generate Organization schema for https://example.com
```
**Available types:**
- `organization` - Company/organization info
- `local_business` - Physical business location
- `product` - E-commerce products
- `article` - Blog posts and news
- `faq` - FAQ pages
- `breadcrumb` - Navigation breadcrumbs
- `website` - Site-level schema with search
### 4. Sitemap Validation
Validates XML sitemaps for:
- XML syntax errors
- URL accessibility (HTTP status)
- URL count limits (50,000 max)
- File size limits (50MB max)
- Lastmod date validity
- Index sitemap structure
**Command:**
```
Validate the sitemap at https://example.com/sitemap.xml
```
### 5. Robots.txt Analysis
Analyzes robots.txt for:
- Syntax validation
- User-agent rules
- Disallow/Allow patterns
- Sitemap declarations
- Critical resource blocking
- URL testing against rules
**Command:**
```
Analyze robots.txt for example.com
```
### 6. Core Web Vitals
Measures performance metrics:
| Metric | Good | Needs Improvement | Poor |
|--------|------|-------------------|------|
| LCP (Largest Contentful Paint) | < 2.5s | 2.5s - 4.0s | > 4.0s |
| CLS (Cumulative Layout Shift) | < 0.1 | 0.1 - 0.25 | > 0.25 |
| TBT (Total Blocking Time) | < 200ms | 200ms - 600ms | > 600ms |
| FCP (First Contentful Paint) | < 1.8s | 1.8s - 3.0s | > 3.0s |
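The thresholds above can be applied mechanically; a small classifier (illustrative, not part of the skill's scripts) shows the boundary logic:

```python
# Thresholds from the table above, as (good_below, poor_above) pairs
THRESHOLDS = {
    "LCP": (2.5, 4.0),   # seconds
    "CLS": (0.1, 0.25),  # unitless
    "TBT": (200, 600),   # milliseconds
    "FCP": (1.8, 3.0),   # seconds
}

def classify(metric: str, value: float) -> str:
    """Map a lab metric value to Good / Needs Improvement / Poor."""
    good_below, poor_above = THRESHOLDS[metric]
    if value < good_below:
        return "Good"
    if value > poor_above:
        return "Poor"
    return "Needs Improvement"
```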
**Command:**
```
Check Core Web Vitals for https://example.com
```
### 7. Search Console Integration
Access search performance data:
- Top queries and pages
- Click-through rates
- Average positions
- Indexing status
**Command:**
```
Get Search Console data for sc-domain:example.com
```
### 8. GA4 Analytics Integration
Access traffic and behavior data:
- Page views
- User sessions
- Traffic sources
- Engagement metrics
**Available properties:**
- OurDigital Lab (218477407)
- OurDigital Journal (413643875)
- OurDigital Blog (489750460)
**Command:**
```
Get GA4 traffic data for OurDigital Blog
```
---
## Usage Examples
### Example 1: Full Site Audit with Notion Export
```
Perform a comprehensive SEO audit for https://blog.ourdigital.org and export findings to Notion
```
This will:
1. Check robots.txt configuration
2. Validate sitemap
3. Analyze schema markup
4. Run PageSpeed analysis
5. Export all findings to the OurDigital SEO Audit Log database
### Example 2: Schema Audit Only
```
Check schema markup on https://blog.ourdigital.org and identify any issues
```
### Example 3: Performance Deep Dive
```
Analyze Core Web Vitals for https://blog.ourdigital.org and provide optimization recommendations
```
### Example 4: Competitive Analysis
```
Compare SEO performance between https://blog.ourdigital.org and https://competitor.com
```
### Example 5: Local SEO Audit
```
Perform local SEO audit for OurDigital in Seoul, Korea
```
---
## API Reference
### Python Scripts
All scripts are located in `~/.claude/skills/ourdigital-seo-audit/scripts/`
#### robots_checker.py
```bash
python robots_checker.py --url https://example.com
```
Options:
- `--url` - Base URL to check
- `--test-url` - Specific URL to test against rules
- `--user-agent` - User agent to test (default: Googlebot)
#### sitemap_validator.py
```bash
python sitemap_validator.py --url https://example.com/sitemap.xml
```
Options:
- `--url` - Sitemap URL
- `--check-urls` - Verify URL accessibility (slower)
- `--limit` - Max URLs to check
#### schema_validator.py
```bash
python schema_validator.py --url https://example.com
```
Options:
- `--url` - Page URL to validate
- `--check-rich-results` - Check Google Rich Results eligibility
#### schema_generator.py
```bash
python schema_generator.py --type organization --url https://example.com
```
Options:
- `--type` - Schema type (organization, local_business, product, article, faq, breadcrumb, website)
- `--url` - Target URL
- `--output` - Output file (default: stdout)
#### pagespeed_client.py
```bash
python pagespeed_client.py --url https://example.com --strategy mobile
```
Options:
- `--url` - URL to analyze
- `--strategy` - mobile, desktop, or both
- `--json` - Output as JSON
- `--cwv-only` - Show only Core Web Vitals
#### gsc_client.py
```bash
python gsc_client.py --site sc-domain:example.com --action summary
```
Options:
- `--site` - Site URL (sc-domain: or https://)
- `--action` - summary, queries, pages, sitemaps, inspect
- `--days` - Days of data (default: 30)
#### full_audit.py
```bash
python full_audit.py --url https://example.com --output notion
```
Options:
- `--url` - URL to audit
- `--output` - console, notion, or json
- `--no-robots` - Skip robots.txt check
- `--no-sitemap` - Skip sitemap validation
- `--no-schema` - Skip schema validation
- `--no-performance` - Skip PageSpeed analysis
---
## Notion Integration
### Default Database
All findings are stored in the **OurDigital SEO Audit Log**:
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **URL**: https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef
### Database Properties
| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Issue title |
| Site | URL | Audited site URL |
| Category | Select | Issue category |
| Priority | Select | Critical / High / Medium / Low |
| Status | Status | Not started / In progress / Done |
| URL | URL | Specific page with issue |
| Found Date | Date | Discovery date |
| Audit ID | Text | Groups findings by session |
### Page Content Template
Each finding page contains:
```
## Description
[Detailed explanation of the issue]
## Impact
⚠️ [Business/ranking impact]
## Recommendation
💡 [Actionable solution]
```
### Filtering Findings
Use Notion filters to view:
- **By Site**: Filter by Site property
- **By Category**: Filter by Category (Schema, Performance, etc.)
- **By Priority**: Filter by Priority (Critical first)
- **By Audit**: Filter by Audit ID to see all findings from one session
---
## Troubleshooting
### API Authentication Errors
**Error: "Invalid JWT Signature"**
- Check that the service account key file exists
- Verify the file path in `~/.credential/ourdigital-seo-agent.json`
- Regenerate the key in Google Cloud Console if corrupted
**Error: "Requests from referer are blocked"**
- Go to [API Credentials](https://console.cloud.google.com/apis/credentials)
- Click on your API key
- Set "Application restrictions" to "None"
- Save and wait 1-2 minutes
**Error: "API has not been enabled"**
- Enable the required API in [Google Cloud Console](https://console.cloud.google.com/apis/library)
- Wait a few minutes for propagation
### Search Console Issues
**Error: "Site not found"**
- Use domain property format: `sc-domain:example.com`
- Or URL prefix format: `https://www.example.com/`
- Add service account email as user in Search Console
### PageSpeed Rate Limiting
**Error: "429 Too Many Requests"**
- Wait a few minutes before retrying
- Use an API key for higher quotas
- Batch requests with delays
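A common way to implement the retry-with-delay advice is exponential backoff with jitter. A generic sketch (not taken from the skill's clients) for any callable that raises on a 429:

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a zero-argument callable on rate-limit errors,
    doubling the wait each attempt and adding jitter (sketch)."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in for detecting an HTTP 429 response
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt
                       + random.uniform(0, base_delay / 10))
```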
### Notion Integration Issues
**Error: "Failed to create page"**
- Verify NOTION_API_KEY is set
- Check that the integration has access to the database
- Ensure database properties match expected schema
### Python Import Errors
**Error: "ModuleNotFoundError"**
```bash
cd ~/.claude/skills/ourdigital-seo-audit/scripts
pip install -r requirements.txt
```
---
## Support
For issues or feature requests:
1. Check this guide first
2. Review the [SKILL.md](SKILL.md) for technical details
3. Check [examples.md](examples.md) for more usage examples
---
*Last updated: December 2024*

# SEO Audit Quick Reference Card
## Instant Commands
### Site Audits
```
Perform full SEO audit for https://example.com
Check technical SEO for https://example.com
Analyze robots.txt for example.com
Validate sitemap at https://example.com/sitemap.xml
```
### Schema Operations
```
Validate schema markup on https://example.com
Generate Organization schema for [Company Name], URL: [URL]
Generate LocalBusiness schema for [Business Name], Address: [Address], Hours: [Hours]
Generate Article schema for [Title], Author: [Name], Published: [Date]
Generate FAQPage schema with these Q&As: [Questions and Answers]
```
### Performance
```
Check Core Web Vitals for https://example.com
Analyze page speed issues on https://example.com
```
### Local SEO
```
Local SEO audit for [Business Name] in [City]
Check NAP consistency for [Business Name]
```
### Competitive Analysis
```
Compare SEO between https://site1.com and https://site2.com
Analyze top 10 results for "[keyword]"
```
### Export & Reporting
```
Export findings to Notion
Create SEO audit report for [URL]
Summarize audit findings with priorities
```
## Schema Validation Checklist
### Organization
- [ ] name (required)
- [ ] url (required)
- [ ] logo (recommended)
- [ ] sameAs (recommended)
- [ ] contactPoint (recommended)
### LocalBusiness
- [ ] name (required)
- [ ] address (required)
- [ ] telephone (recommended)
- [ ] openingHours (recommended)
- [ ] geo (recommended)
### Article
- [ ] headline (required)
- [ ] author (required)
- [ ] datePublished (required)
- [ ] image (recommended)
- [ ] publisher (recommended)
### Product
- [ ] name (required)
- [ ] image (recommended)
- [ ] description (recommended)
- [ ] offers (recommended)
### FAQPage
- [ ] mainEntity (required)
- [ ] Question with name (required)
- [ ] acceptedAnswer (required)
## Core Web Vitals Targets
| Metric | Good | Improve | Poor |
|--------|------|---------|------|
| LCP | <2.5s | 2.5-4s | >4s |
| CLS | <0.1 | 0.1-0.25 | >0.25 |
| INP | <200ms | 200-500ms | >500ms |
| FCP | <1.8s | 1.8-3s | >3s |
| TTFB | <800ms | 800ms-1.8s | >1.8s |
## Priority Levels
| Priority | Examples |
|----------|----------|
| Critical | Site blocked, 5xx errors, no HTTPS |
| High | Missing titles, schema errors, broken links |
| Medium | Missing alt text, thin content, no OG tags |
| Low | URL structure, minor schema issues |
## Notion Database ID
**OurDigital SEO Audit Log**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
## MCP Tools Reference
| Tool | Usage |
|------|-------|
| firecrawl_scrape | Single page analysis |
| firecrawl_map | Site structure discovery |
| firecrawl_crawl | Full site crawl |
| perplexity search | Research & competitive analysis |
| notion API-post-page | Create findings |
| fetch | Get robots.txt, sitemap |

# OurDigital SEO Audit - Claude Desktop Project Knowledge
## Overview
This project knowledge file enables Claude Desktop to perform comprehensive SEO audits using MCP tools. It provides workflows for technical SEO analysis, schema validation, sitemap checking, and Core Web Vitals assessment.
## Available MCP Tools
### Primary Tools
| Tool | Purpose |
|------|---------|
| `firecrawl` | Website crawling, scraping, structured data extraction |
| `perplexity` | AI-powered research, competitive analysis |
| `notion` | Store audit findings in database |
| `fetch` | Fetch web pages and resources |
| `sequential-thinking` | Complex multi-step analysis |
### Tool Usage Patterns
#### Firecrawl - Web Scraping
```
firecrawl_scrape: Scrape single page content
firecrawl_crawl: Crawl entire website
firecrawl_extract: Extract structured data
firecrawl_map: Get site structure
```
#### Notion - Output Storage
```
API-post-search: Find existing databases
API-post-database-query: Query database
API-post-page: Create finding pages
API-patch-page: Update findings
```
## SEO Audit Workflows
### 1. Full Site Audit
**User prompt:** "Perform SEO audit for https://example.com"
**Workflow:**
1. Use `firecrawl_scrape` to get homepage content
2. Use `firecrawl_map` to discover site structure
3. Check robots.txt at /robots.txt
4. Check sitemap at /sitemap.xml
5. Extract and validate schema markup
6. Use `perplexity` for competitive insights
7. Store findings in Notion database
### 2. Schema Markup Validation
**User prompt:** "Validate schema on https://example.com"
**Workflow:**
1. Use `firecrawl_scrape` with extractSchema option
2. Look for JSON-LD in `<script type="application/ld+json">`
3. Check for Microdata (`itemscope`, `itemtype`, `itemprop`)
4. Validate against schema.org requirements
5. Check Rich Results eligibility
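Step 2 — pulling JSON-LD out of the page — can be sketched with the standard-library HTML parser (a minimal extractor for illustration; real pages may need more robust handling):

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        self._in_jsonld = (tag == "script"
                           and dict(attrs).get("type") == "application/ld+json")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"}
</script>
</head></html>"""
extractor = JsonLdExtractor()
extractor.feed(page)
```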
**Schema Types to Check:**
- Organization / LocalBusiness
- Product / Offer
- Article / BlogPosting
- FAQPage / HowTo
- BreadcrumbList
- WebSite / WebPage
**Required Properties by Type:**
| Type | Required | Recommended |
|------|----------|-------------|
| Organization | name, url | logo, sameAs, contactPoint |
| LocalBusiness | name, address | telephone, openingHours, geo |
| Product | name | image, description, offers, brand |
| Article | headline, author, datePublished | image, dateModified, publisher |
| FAQPage | mainEntity (Question + Answer) | - |
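The table above translates directly into a required-property check; a sketch of step 4 (validation against schema.org requirements) on an extracted JSON-LD object:

```python
# Mirrors the Required column of the table above
REQUIRED = {
    "Organization": ["name", "url"],
    "LocalBusiness": ["name", "address"],
    "Product": ["name"],
    "Article": ["headline", "author", "datePublished"],
    "FAQPage": ["mainEntity"],
}

def missing_required(schema: dict) -> list[str]:
    """Return required schema.org properties absent from a JSON-LD object."""
    required = REQUIRED.get(schema.get("@type", ""), [])
    return [prop for prop in required if prop not in schema]
```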
### 3. Robots.txt Analysis
**User prompt:** "Check robots.txt for example.com"
**Workflow:**
1. Fetch https://example.com/robots.txt
2. Parse directives:
- User-agent rules
- Disallow patterns
- Allow patterns
- Crawl-delay
- Sitemap declarations
3. Check for issues:
- Blocking CSS/JS resources
- Missing sitemap reference
- Overly restrictive rules
**Sample Analysis Output:**
```
Robots.txt Analysis
==================
User-agents: 3 defined (*, Googlebot, Bingbot)
Directives:
- Disallow: /admin/, /private/, /tmp/
- Allow: /public/, /blog/
- Sitemap: https://example.com/sitemap.xml
Issues Found:
- WARNING: CSS/JS files may be blocked (/assets/)
- OK: Sitemap is declared
- INFO: Crawl-delay set to 10s
```
### 4. Sitemap Validation
**User prompt:** "Validate sitemap at https://example.com/sitemap.xml"
**Workflow:**
1. Fetch sitemap XML
2. Parse and validate structure
3. Check:
- XML syntax validity
- URL count (max 50,000)
- Lastmod date formats
- URL accessibility (sample)
4. For sitemap index, check child sitemaps
**Validation Criteria:**
- Valid XML syntax
- `<urlset>` or `<sitemapindex>` root element
- Each `<url>` has `<loc>` element
- `<lastmod>` in W3C datetime format
- File size under 50MB uncompressed
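These criteria can be checked mechanically. A minimal structural sketch with `xml.etree.ElementTree` (`check_sitemap` is an illustrative helper, not the repo's `sitemap_validator.py`):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def check_sitemap(xml_text: str) -> dict:
    """Minimal structural checks: root element, <loc> presence, URL count."""
    root = ET.fromstring(xml_text)  # raises ParseError on invalid XML
    urls = [u.findtext(f"{NS}loc") for u in root.findall(f"{NS}url")]
    return {
        "root": root.tag.replace(NS, ""),   # "urlset" or "sitemapindex"
        "url_count": len(urls),
        "missing_loc": sum(1 for u in urls if not u),
        "over_limit": len(urls) > 50_000,
    }

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod></url>
</urlset>"""
report = check_sitemap(sample)
```

URL accessibility and lastmod format checks would layer on top of this.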
### 5. Core Web Vitals Check
**User prompt:** "Check Core Web Vitals for https://example.com"
**Workflow:**
1. Use PageSpeed Insights (if API available)
2. Or analyze page with firecrawl for common issues
3. Check for:
- Large images without optimization
- Render-blocking resources
- Layout shift causes
- JavaScript execution time
**Metrics & Thresholds:**
| Metric | Good | Needs Improvement | Poor |
|--------|------|-------------------|------|
| LCP | < 2.5s | 2.5s - 4.0s | > 4.0s |
| CLS | < 0.1 | 0.1 - 0.25 | > 0.25 |
| INP | < 200ms | 200ms - 500ms | > 500ms |
| FID (deprecated) | < 100ms | 100ms - 300ms | > 300ms |
### 6. Technical SEO Check
**User prompt:** "Check technical SEO for https://example.com"
**Workflow:**
1. Check HTTPS implementation
2. Verify canonical tags
3. Check meta robots tags
4. Analyze heading structure (H1-H6)
5. Check image alt attributes
6. Verify Open Graph / Twitter Cards
7. Check mobile-friendliness indicators
**Checklist:**
- [ ] HTTPS enabled
- [ ] Single canonical URL per page
- [ ] Proper robots meta tags
- [ ] One H1 per page
- [ ] All images have alt text
- [ ] OG tags present (og:title, og:description, og:image)
- [ ] Twitter Card tags present
- [ ] Viewport meta tag for mobile
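Checks 4 and 5 above (heading structure and alt text) reduce to simple HTML parsing. A standard-library sketch, separate from the fuller `page_analyzer.py`:

```python
from html.parser import HTMLParser

class OnPageChecker(HTMLParser):
    """Counts H1 tags and flags <img> elements without alt text."""
    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1_count += 1
        elif tag == "img" and not dict(attrs).get("alt"):
            self.images_missing_alt += 1

checker = OnPageChecker()
checker.feed('<h1>Title</h1><img src="a.png"><img src="b.png" alt="B">')
```

A page passes when `h1_count == 1` and `images_missing_alt == 0`.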
### 7. Local SEO Audit
**User prompt:** "Local SEO audit for [Business Name] in [Location]"
**Workflow:**
1. Search for business citations with `perplexity`
2. Check for LocalBusiness schema
3. Verify NAP (Name, Address, Phone) consistency
4. Look for review signals
5. Check Google Business Profile (manual)
## Notion Database Integration
### Default Database
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
- **Name**: OurDigital SEO Audit Log
### Database Properties
| Property | Type | Values |
|----------|------|--------|
| Issue | Title | Finding title |
| Site | URL | Audited site URL |
| Category | Select | Technical SEO, On-page SEO, Content, Local SEO, Performance, Schema/Structured Data, Sitemap, Robots.txt |
| Priority | Select | Critical, High, Medium, Low |
| Status | Status | Not started, In progress, Done |
| URL | URL | Specific page with issue |
| Found Date | Date | Discovery date |
| Audit ID | Rich Text | Groups findings from same audit |
### Page Content Template
Each finding page should contain:
```markdown
## Description
[Detailed explanation of the issue]
## Impact
[Business/ranking impact with warning callout]
## Recommendation
[Actionable solution with lightbulb callout]
```
### Creating Findings
Use Notion MCP to create pages:
1. Query database to check for existing entries
2. Create new page with properties
3. Add content blocks (Description, Impact, Recommendation)
## Schema Markup Templates
### Organization Schema
```json
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "[Company Name]",
"url": "[Website URL]",
"logo": "[Logo URL]",
"sameAs": [
"[Social Media URLs]"
],
"contactPoint": {
"@type": "ContactPoint",
"telephone": "[Phone]",
"contactType": "customer service"
}
}
```
### LocalBusiness Schema
```json
{
"@context": "https://schema.org",
"@type": "LocalBusiness",
"name": "[Business Name]",
"address": {
"@type": "PostalAddress",
"streetAddress": "[Street]",
"addressLocality": "[City]",
"addressCountry": "[Country Code]"
},
"telephone": "[Phone]",
"openingHoursSpecification": [{
"@type": "OpeningHoursSpecification",
"dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
"opens": "09:00",
"closes": "18:00"
}]
}
```
### Article Schema
```json
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "[Article Title]",
"author": {
"@type": "Person",
"name": "[Author Name]"
},
"datePublished": "[ISO Date]",
"dateModified": "[ISO Date]",
"publisher": {
"@type": "Organization",
"name": "[Publisher Name]",
"logo": {
"@type": "ImageObject",
"url": "[Logo URL]"
}
}
}
```
### FAQPage Schema
```json
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [{
"@type": "Question",
"name": "[Question Text]",
"acceptedAnswer": {
"@type": "Answer",
"text": "[Answer Text]"
}
}]
}
```
### BreadcrumbList Schema
```json
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [{
"@type": "ListItem",
"position": 1,
"name": "Home",
"item": "[Homepage URL]"
}, {
"@type": "ListItem",
"position": 2,
"name": "[Category]",
"item": "[Category URL]"
}]
}
```
## Common SEO Issues Reference
### Critical Priority
- Site not accessible (5xx errors)
- Robots.txt blocking entire site
- No HTTPS implementation
- Duplicate content across domain
- Sitemap returning errors
### High Priority
- Missing or duplicate title tags
- No meta descriptions
- Schema markup errors
- Broken internal links
- Missing canonical tags
- Core Web Vitals failing
### Medium Priority
- Missing alt text on images
- Thin content pages
- Missing Open Graph tags
- Suboptimal heading structure
- Missing breadcrumb schema
### Low Priority
- Missing Twitter Card tags
- Suboptimal URL structure
- Missing FAQ schema
- Review schema not implemented
## Quick Commands Reference
| Task | Prompt |
|------|--------|
| Full audit | "Perform SEO audit for [URL]" |
| Schema check | "Validate schema on [URL]" |
| Sitemap check | "Validate sitemap at [URL]" |
| Robots.txt | "Analyze robots.txt for [domain]" |
| Performance | "Check Core Web Vitals for [URL]" |
| Generate schema | "Generate [type] schema for [details]" |
| Export to Notion | "Export findings to Notion" |
| Local SEO | "Local SEO audit for [business] in [location]" |
| Competitive | "Compare SEO of [URL1] vs [URL2]" |
## Tips for Best Results
1. **Be specific** - Provide full URLs including https://
2. **One site at a time** - Audit one domain per session for clarity
3. **Check Notion** - Review existing findings before creating duplicates
4. **Prioritize fixes** - Focus on Critical/High issues first
5. **Validate changes** - Re-audit after implementing fixes
## Limitations
- No direct Python script execution (use MCP tools instead)
- PageSpeed API requires separate configuration
- Google Search Console data requires authenticated access
- GA4 data requires service account setup
- Large sites may require multiple sessions

# Claude Desktop SEO Audit - Setup Guide
## Prerequisites
### 1. Claude Desktop Application
- Download from https://claude.ai/download
- Sign in with your Anthropic account
- Pro subscription recommended for extended usage
### 2. Required MCP Servers
Configure these MCP servers in Claude Desktop settings:
#### Firecrawl (Web Scraping)
```json
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "your-firecrawl-api-key"
}
}
}
}
```
Get API key from: https://firecrawl.dev
#### Notion (Database Storage)
```json
{
"mcpServers": {
"notion": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-notion"],
"env": {
"NOTION_API_KEY": "your-notion-api-key"
}
}
}
}
```
Get API key from: https://www.notion.so/my-integrations
#### Perplexity (Research)
```json
{
"mcpServers": {
"perplexity": {
"command": "npx",
"args": ["-y", "perplexity-mcp"],
"env": {
"PERPLEXITY_API_KEY": "your-perplexity-api-key"
}
}
}
}
```
Get API key from: https://www.perplexity.ai/settings/api
### 3. Notion Database Setup
#### Option A: Use Existing Database
The OurDigital SEO Audit Log database is already configured:
- **Database ID**: `2c8581e5-8a1e-8035-880b-e38cefc2f3ef`
Ensure your Notion integration has access to this database.
#### Option B: Create New Database
Create a database with these properties:
| Property | Type | Options |
|----------|------|---------|
| Issue | Title | - |
| Site | URL | - |
| Category | Select | Technical SEO, On-page SEO, Content, Local SEO, Performance, Schema/Structured Data, Sitemap, Robots.txt |
| Priority | Select | Critical, High, Medium, Low |
| Status | Status | Not started, In progress, Done |
| URL | URL | - |
| Found Date | Date | - |
| Audit ID | Rich Text | - |
## Configuration File Location
### macOS
```
~/Library/Application Support/Claude/claude_desktop_config.json
```
### Windows
```
%APPDATA%\Claude\claude_desktop_config.json
```
### Linux
```
~/.config/Claude/claude_desktop_config.json
```
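Before restarting Claude Desktop, a broken config can be caught with a quick syntax check (a sketch; `list_servers` is a hypothetical helper, and in practice you would read the file from the platform path above):

```python
import json

def list_servers(config_text: str) -> list[str]:
    """Parse claude_desktop_config.json text; a json.JSONDecodeError here
    pinpoints the line and column of any syntax mistake."""
    config = json.loads(config_text)
    return sorted(config.get("mcpServers", {}))

names = list_servers('{"mcpServers": {"firecrawl": {}, "notion": {}}}')
```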
## Complete Configuration Example
```json
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "fc-your-key-here"
}
},
"notion": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-notion"],
"env": {
"NOTION_API_KEY": "ntn_your-key-here"
}
},
"perplexity": {
"command": "npx",
"args": ["-y", "perplexity-mcp"],
"env": {
"PERPLEXITY_API_KEY": "pplx-your-key-here"
}
},
"fetch": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-fetch"]
}
}
}
```
## Adding Project Knowledge
### Step 1: Create a Project
1. Open Claude Desktop
2. Click on the project selector (top left)
3. Click "New Project"
4. Name it "SEO Audit"
### Step 2: Add Knowledge Files
1. In your project, click the paperclip icon or "Add content"
2. Select "Add files"
3. Add these files from `~/.claude/desktop-projects/seo-audit/`:
- `SEO_AUDIT_KNOWLEDGE.md` (main knowledge file)
- `QUICK_REFERENCE.md` (quick commands)
### Step 3: Verify Setup
Start a new conversation and ask:
```
What SEO audit capabilities do you have?
```
Claude should describe the available audit features.
## Testing the Setup
### Test 1: Firecrawl
```
Scrape https://example.com and show me the page structure
```
### Test 2: Notion
```
Search for "SEO Audit" in Notion
```
### Test 3: Perplexity
```
Research current SEO best practices for 2024
```
### Test 4: Full Audit
```
Perform a quick SEO audit for https://blog.ourdigital.org
```
## Troubleshooting
### MCP Server Not Connecting
1. Restart Claude Desktop
2. Check config file JSON syntax
3. Verify API keys are correct
4. Check Node.js is installed (`node --version`)
### Notion Permission Error
1. Go to Notion integration settings
2. Add the database to integration access
3. Ensure integration has read/write permissions
### Firecrawl Rate Limit
1. Wait a few minutes between requests
2. Consider upgrading Firecrawl plan
3. Use `firecrawl_map` for discovery, then targeted scrapes
### Knowledge Files Not Loading
1. Ensure files are in supported formats (.md, .txt)
2. Keep file sizes under 10MB
3. Restart the project conversation
## Usage Tips
1. **Start with Quick Reference** - Use the commands from QUICK_REFERENCE.md
2. **One site per conversation** - Keep context focused
3. **Export regularly** - Save findings to Notion frequently
4. **Check existing findings** - Query Notion before creating duplicates
5. **Prioritize Critical issues** - Fix showstoppers first
## Differences from Claude Code Version
| Feature | Claude Code | Claude Desktop |
|---------|-------------|----------------|
| Python Scripts | Direct execution | Not available |
| Google APIs | Service account auth | Manual or via MCP |
| File System | Full access | Limited to uploads |
| Automation | Bash commands | MCP tools only |
| Scheduling | Possible via cron | Manual only |
## Support
For issues with:
- **Claude Desktop**: https://support.anthropic.com
- **Firecrawl**: https://docs.firecrawl.dev
- **Notion API**: https://developers.notion.com
- **MCP Protocol**: https://modelcontextprotocol.io

# OurDigital SEO Audit - Usage Examples
## Quick Reference
| Task | Command |
|------|---------|
| Full audit | `Perform SEO audit for [URL]` |
| Schema check | `Validate schema on [URL]` |
| PageSpeed | `Check Core Web Vitals for [URL]` |
| Sitemap | `Validate sitemap at [URL]` |
| Robots.txt | `Analyze robots.txt for [domain]` |
| Export to Notion | `Export findings to Notion` |
---
## Full Site Audit
### Basic Full Audit
**User prompt:**
> Perform a comprehensive SEO audit for https://example.com
**Agent workflow:**
1. Use Firecrawl to crawl the site and gather page data
2. Run PageSpeed Insights for Core Web Vitals
3. Validate schema markup, sitemap, and robots.txt
4. Query Search Console for performance data (if access granted)
5. Create Notion database with findings
6. Generate prioritized recommendations
---
## Technical SEO Check
### Crawlability Analysis
**User prompt:**
> Check technical SEO for https://example.com including crawlability and indexing
**Agent workflow:**
1. Fetch robots.txt and analyze directives
2. Validate sitemap.xml structure
3. Check HTTP response codes for key pages
4. Verify canonical tags and redirects
5. Test mobile-friendliness
---
## Schema Markup
### Generate Organization Schema
**User prompt:**
> Generate Organization schema markup for my company: OurDigital Corp, website https://ourdigital.com, founded 2020
**Agent response:**
```json
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "OurDigital Corp",
"url": "https://ourdigital.com",
"foundingDate": "2020",
"sameAs": []
}
```
### Generate LocalBusiness Schema
**User prompt:**
> Generate LocalBusiness schema for a restaurant: Seoul Cafe, 123 Gangnam-daero, Seoul, open Mon-Fri 9am-6pm
**Agent response:**
```json
{
"@context": "https://schema.org",
"@type": "Restaurant",
"name": "Seoul Cafe",
"address": {
"@type": "PostalAddress",
"streetAddress": "123 Gangnam-daero",
"addressLocality": "Seoul",
"addressCountry": "KR"
},
"openingHoursSpecification": [{
"@type": "OpeningHoursSpecification",
"dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
"opens": "09:00",
"closes": "18:00"
}]
}
```
### Validate Existing Schema
**User prompt:**
> Validate existing schema markup on https://example.com
**Agent workflow:**
1. Scrape page with Firecrawl
2. Extract JSON-LD, Microdata, and RDFa
3. Validate against schema.org vocabulary
4. Check for required properties
5. Test Rich Results eligibility
6. Report issues and recommendations
---
## Sitemap Validation
### Check Sitemap
**User prompt:**
> Validate the sitemap at https://example.com/sitemap.xml
**Agent workflow:**
1. Fetch and parse XML sitemap
2. Validate XML syntax
3. Check URL count (max 50,000)
4. Verify lastmod dates
5. Test sample URLs for accessibility
6. Report issues found
**Sample output:**
```
Sitemap Validation Report
=========================
URL: https://example.com/sitemap.xml
Total URLs: 1,234
Valid URLs: 1,200
Issues Found:
- 34 URLs returning 404
- 12 URLs with invalid lastmod format
- Missing sitemap index (recommended for 1000+ URLs)
```
---
## Robots.txt Analysis
### Analyze Robots.txt
**User prompt:**
> Analyze robots.txt for example.com
**Agent workflow:**
1. Fetch /robots.txt
2. Parse all directives
3. Check for blocking issues
4. Verify sitemap declaration
5. Test specific URLs
6. Compare user-agent rules
**Sample output:**
```
Robots.txt Analysis
==================
URL: https://example.com/robots.txt
User-agents defined: 3 (*, Googlebot, Bingbot)
Issues:
- WARNING: CSS/JS files blocked (/assets/)
- INFO: Crawl-delay set to 10 seconds (may slow indexing)
- OK: Sitemap declared
Rules Summary:
- Disallowed: /admin/, /private/, /tmp/
- Allowed: /public/, /blog/
```
---
## Core Web Vitals
### Performance Analysis
**User prompt:**
> Check Core Web Vitals for https://example.com
**Agent workflow:**
1. Run PageSpeed Insights API (mobile + desktop)
2. Extract Core Web Vitals metrics
3. Compare against thresholds
4. Identify optimization opportunities
5. Prioritize recommendations
**Sample output:**
```
Core Web Vitals Report
=====================
URL: https://example.com
Strategy: Mobile
Metrics:
- LCP: 3.2s (NEEDS IMPROVEMENT - target <2.5s)
- FID: 45ms (GOOD - target <100ms)
- CLS: 0.15 (NEEDS IMPROVEMENT - target <0.1)
- INP: 180ms (GOOD - target <200ms)
Top Opportunities:
1. Serve images in next-gen formats (-1.5s)
2. Eliminate render-blocking resources (-0.8s)
3. Reduce unused CSS (-0.3s)
```
---
## Local SEO Assessment
### Local SEO Audit
**User prompt:**
> Perform local SEO audit for Seoul Dental Clinic in Gangnam
**Agent workflow:**
1. Search for existing citations (Perplexity)
2. Check for LocalBusiness schema
3. Analyze NAP consistency
4. Review Google Business Profile (manual check)
5. Identify missing citations
6. Recommend improvements
---
## Keyword Research
### Trend Analysis
**User prompt:**
> Research keyword trends for "digital marketing" in Korea over the past year
**Agent workflow:**
1. Query Google Trends (pytrends)
2. Get related queries
3. Identify seasonal patterns
4. Compare with related terms
5. Generate insights
---
## Competitive Analysis
### SERP Analysis
**User prompt:**
> Analyze top 10 search results for "best coffee shops Seoul"
**Agent workflow:**
1. Use Custom Search API
2. Extract title, description, URL
3. Analyze common patterns
4. Check for schema markup
5. Identify content gaps
---
## CLI Script Usage
### Schema Generator
```bash
# Generate Organization schema
python scripts/schema_generator.py \
--type organization \
--name "OurDigital Corp" \
--url "https://ourdigital.com"
# Generate Product schema
python scripts/schema_generator.py \
--type product \
--name "SEO Tool" \
--price 29900 \
--currency KRW
```
### Schema Validator
```bash
# Validate schema on a URL
python scripts/schema_validator.py \
--url https://example.com \
--output report.json
# Validate local JSON-LD file
python scripts/schema_validator.py \
--file schema.json
```
### Sitemap Validator
```bash
# Validate sitemap
python scripts/sitemap_validator.py \
--url https://example.com/sitemap.xml \
--check-urls \
--output sitemap_report.json
```
### Robots.txt Checker
```bash
# Analyze robots.txt
python scripts/robots_checker.py \
--url https://example.com/robots.txt
# Test specific URL
python scripts/robots_checker.py \
--url https://example.com/robots.txt \
--test-url /admin/dashboard \
--user-agent Googlebot
```
### Full Audit
```bash
# Run complete audit
python scripts/full_audit.py \
--url https://example.com \
--output notion \
--notion-page-id abc123
# Export to Google Sheets
python scripts/full_audit.py \
--url https://example.com \
--output sheets \
--spreadsheet-id xyz789
```
---
## Output to Notion
### Create Findings Database
**User prompt:**
> Create an SEO audit findings database in Notion for example.com
**Agent workflow:**
1. Search for existing SEO audit pages
2. Create new database with schema
3. Add initial findings from audit
4. Set up views (by priority, by category)
5. Share database link with user
---
## Batch Operations
### Audit Multiple Pages
**User prompt:**
> Check schema markup on these URLs: url1.com, url2.com, url3.com
**Agent workflow:**
1. Queue URLs for processing
2. Validate each URL sequentially
3. Aggregate findings
4. Generate comparison report
---
## Integration with Search Console
### Performance Report
**User prompt:**
> Get Search Console performance data for the last 30 days
**Agent workflow:**
1. Verify Search Console access
2. Query search analytics API
3. Get top queries and pages
4. Calculate CTR and position changes
5. Identify opportunities
**Sample output:**
```
Search Console Performance (Last 30 Days)
========================================
Total Clicks: 12,345
Total Impressions: 456,789
Average CTR: 2.7%
Average Position: 15.3
Top Queries:
1. "example product" - 1,234 clicks, position 3.2
2. "example service" - 987 clicks, position 5.1
3. "example review" - 654 clicks, position 8.4
Pages with Opportunities:
- /product-page: High impressions, low CTR (improve title)
- /service-page: Good CTR, position 11 (push to page 1)
```
---
## Real-World Examples (OurDigital)
### Example: Audit blog.ourdigital.org
**User prompt:**
> Perform SEO audit for https://blog.ourdigital.org and export to Notion
**Actual Results:**
```
=== SEO Audit: blog.ourdigital.org ===
Robots.txt: ✓ Valid
- 6 disallow rules
- Sitemap declared
Sitemap: ✓ Valid
- 126 posts indexed
- All URLs accessible
Schema Markup: ⚠ Issues Found
- Organization missing 'url' property (High)
- WebPage missing 'name' property (High)
- Missing SearchAction on WebSite (Medium)
- Missing sameAs on Organization (Medium)
Core Web Vitals (Mobile):
- Performance: 53/100
- SEO: 100/100
- LCP: 5.91s ✗ Poor
- CLS: 0.085 ✓ Good
- TBT: 651ms ✗ Poor
Findings exported to Notion: 6 issues
```
### Example: GA4 Traffic Analysis
**User prompt:**
> Get traffic data for OurDigital Blog from GA4
**Actual Results:**
```
GA4 Property: OurDigital Blog (489750460)
Period: Last 30 days
Top Pages by Views:
1. / (Homepage): 86 views
2. /google-business-profile-ownership-authentication: 59 views
3. /information-overload/: 37 views
4. /social-media-vs-sns/: 23 views
5. /reputation-in-connected-world/: 19 views
```
### Example: Search Console Performance
**User prompt:**
> Get Search Console data for ourdigital.org
**Actual Results:**
```
Property: sc-domain:ourdigital.org
Period: Last 30 days
Top Pages by Clicks:
1. ourdigital.org/information-overload - 27 clicks, pos 4.2
2. ourdigital.org/google-business-profile-ownership - 18 clicks, pos 5.9
3. ourdigital.org/social-media-vs-sns - 13 clicks, pos 9.5
4. ourdigital.org/website-migration-redirect - 12 clicks, pos 17.9
5. ourdigital.org/google-brand-lift-measurement - 7 clicks, pos 5.7
```
---
## Notion Database Structure
### Finding Entry Example
**Issue:** Organization schema missing 'url' property
**Properties:**
| Field | Value |
|-------|-------|
| Category | Schema/Structured Data |
| Priority | High |
| Site | https://blog.ourdigital.org |
| URL | https://blog.ourdigital.org/posts/example/ |
| Found Date | 2024-12-14 |
| Audit ID | blog.ourdigital.org-20241214-123456 |
**Page Content:**
```markdown
## Description
The Organization schema on the blog post is missing the required
'url' property that identifies the organization's website.
## Impact
⚠️ May affect rich result eligibility and knowledge panel display
in search results. Google uses the url property to verify and
connect your organization across web properties.
## Recommendation
💡 Add 'url': 'https://ourdigital.org' to the Organization schema
markup in your site's JSON-LD structured data.
```
---
## API Configuration Reference
### Available Properties
| API | Property/Domain | ID |
|-----|-----------------|-----|
| Search Console | sc-domain:ourdigital.org | - |
| GA4 | OurDigital Lab | 218477407 |
| GA4 | OurDigital Journal | 413643875 |
| GA4 | OurDigital Blog | 489750460 |
| Custom Search | - | e5f27994f2bab4bf2 |
### Service Account
```
Email: ourdigital-seo-agent@ourdigital-insights.iam.gserviceaccount.com
File: ~/.credential/ourdigital-seo-agent.json
```

# OurDigital SEO Audit - API Reference
## Google Search Console API
### Authentication
```python
from google.oauth2 import service_account
from googleapiclient.discovery import build
SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
credentials = service_account.Credentials.from_service_account_file(
'service-account-key.json', scopes=SCOPES
)
service = build('searchconsole', 'v1', credentials=credentials)
```
### Endpoints
#### Search Analytics
```python
# Get search performance data
request = {
'startDate': '2024-01-01',
'endDate': '2024-12-31',
'dimensions': ['query', 'page', 'country', 'device'],
'rowLimit': 25000,
'dimensionFilterGroups': [{
'filters': [{
'dimension': 'country',
'expression': 'kor'
}]
}]
}
response = service.searchanalytics().query(
siteUrl='sc-domain:example.com',
body=request
).execute()
```
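Result rows come back under `response['rows']`. A sketch of typical post-processing, using mocked rows in the API's shape:

```python
# Mocked Search Analytics rows; in practice they come from response["rows"].
rows = [
    {"keys": ["example query", "/page-a"], "clicks": 120,
     "impressions": 4000, "ctr": 0.03, "position": 4.2},
    {"keys": ["other query", "/page-b"], "clicks": 30,
     "impressions": 3000, "ctr": 0.01, "position": 12.8},
]

total_clicks = sum(r["clicks"] for r in rows)
total_impressions = sum(r["impressions"] for r in rows)
overall_ctr = total_clicks / total_impressions

# High impressions with low CTR usually flags a title/description rewrite.
opportunities = [r for r in rows if r["impressions"] > 1000 and r["ctr"] < 0.02]
```

The 0.02 CTR cutoff is an illustrative heuristic, not an API value.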
#### URL Inspection
```python
request = {
'inspectionUrl': 'https://example.com/page',
'siteUrl': 'sc-domain:example.com'
}
response = service.urlInspection().index().inspect(body=request).execute()
```
#### Sitemaps
```python
# List sitemaps
sitemaps = service.sitemaps().list(siteUrl='sc-domain:example.com').execute()
# Submit sitemap
service.sitemaps().submit(
siteUrl='sc-domain:example.com',
feedpath='https://example.com/sitemap.xml'
).execute()
```
### Rate Limits
- 1,200 queries per minute per project
- 25,000 rows max per request
---
## PageSpeed Insights API
### Authentication
```python
import requests
API_KEY = 'your-api-key'
BASE_URL = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed'
```
### Request Parameters
```python
params = {
'url': 'https://example.com',
'key': API_KEY,
'strategy': 'mobile', # or 'desktop'
'category': ['performance', 'accessibility', 'best-practices', 'seo']
}
response = requests.get(BASE_URL, params=params)
```
### Response Structure
```json
{
"lighthouseResult": {
"categories": {
"performance": { "score": 0.85 },
"seo": { "score": 0.92 }
},
"audits": {
"largest-contentful-paint": {
"numericValue": 2500,
"displayValue": "2.5 s"
},
"cumulative-layout-shift": {
"numericValue": 0.05
},
"total-blocking-time": {
"numericValue": 150
}
}
},
"loadingExperience": {
"metrics": {
"LARGEST_CONTENTFUL_PAINT_MS": {
"percentile": 2500,
"category": "AVERAGE"
}
}
}
}
```
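Extracting the headline metrics from that structure is straightforward. `result` below mocks the response shown above:

```python
# Mocked PageSpeed Insights response, mirroring the JSON structure above.
result = {
    "lighthouseResult": {
        "categories": {"performance": {"score": 0.85}, "seo": {"score": 0.92}},
        "audits": {
            "largest-contentful-paint": {"numericValue": 2500},
            "cumulative-layout-shift": {"numericValue": 0.05},
            "total-blocking-time": {"numericValue": 150},
        },
    }
}

audits = result["lighthouseResult"]["audits"]
lcp_s = audits["largest-contentful-paint"]["numericValue"] / 1000  # ms -> s
cls = audits["cumulative-layout-shift"]["numericValue"]
perf_score = round(
    result["lighthouseResult"]["categories"]["performance"]["score"] * 100
)
```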
### Core Web Vitals Thresholds
| Metric | Good | Needs Improvement | Poor |
|--------|------|-------------------|------|
| LCP | ≤ 2.5s | 2.5s - 4.0s | > 4.0s |
| FID | ≤ 100ms | 100ms - 300ms | > 300ms |
| CLS | ≤ 0.1 | 0.1 - 0.25 | > 0.25 |
| INP | ≤ 200ms | 200ms - 500ms | > 500ms |
| TTFB | ≤ 800ms | 800ms - 1800ms | > 1800ms |
| FCP | ≤ 1.8s | 1.8s - 3.0s | > 3.0s |
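The thresholds above expressed as code, useful when scoring audit results (a sketch; threshold pairs are transcribed from the table):

```python
# (good_max, needs_improvement_max) per metric; above the second value is poor.
THRESHOLDS = {
    "LCP": (2.5, 4.0),     # seconds
    "FID": (100, 300),     # ms
    "CLS": (0.1, 0.25),
    "INP": (200, 500),     # ms
    "TTFB": (800, 1800),   # ms
    "FCP": (1.8, 3.0),     # seconds
}

def classify(metric: str, value: float) -> str:
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "good"
    return "needs improvement" if value <= needs_improvement else "poor"
```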
### Rate Limits
- 25,000 queries per day (free tier)
- No per-minute limit
---
## Google Analytics 4 Data API
### Authentication
```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import RunReportRequest
client = BetaAnalyticsDataClient()
property_id = '123456789'
```
### Common Reports
#### Traffic Overview
```python
request = RunReportRequest(
property=f'properties/{property_id}',
dimensions=[
{'name': 'date'},
{'name': 'sessionDefaultChannelGroup'}
],
metrics=[
{'name': 'sessions'},
{'name': 'totalUsers'},
{'name': 'screenPageViews'},
{'name': 'bounceRate'}
],
date_ranges=[{'start_date': '30daysAgo', 'end_date': 'today'}]
)
response = client.run_report(request)
```
#### Landing Pages
```python
request = RunReportRequest(
property=f'properties/{property_id}',
dimensions=[{'name': 'landingPage'}],
metrics=[
{'name': 'sessions'},
{'name': 'engagementRate'},
{'name': 'conversions'}
],
date_ranges=[{'start_date': '30daysAgo', 'end_date': 'today'}],
order_bys=[{'metric': {'metric_name': 'sessions'}, 'desc': True}],
limit=100
)
```
### Useful Dimensions
- `date`, `dateHour`
- `sessionDefaultChannelGroup`
- `landingPage`, `pagePath`
- `deviceCategory`, `operatingSystem`
- `country`, `city`
- `sessionSource`, `sessionMedium`
### Useful Metrics
- `sessions`, `totalUsers`, `newUsers`
- `screenPageViews`, `engagementRate`
- `averageSessionDuration`
- `bounceRate`, `conversions`
---
## Google Trends API (pytrends)
### Installation
```bash
pip install pytrends
```
### Usage
```python
from pytrends.request import TrendReq
pytrends = TrendReq(hl='ko-KR', tz=540)
# Interest over time
pytrends.build_payload(['keyword1', 'keyword2'], timeframe='today 12-m', geo='KR')
interest_df = pytrends.interest_over_time()
# Related queries
related = pytrends.related_queries()
# Trending searches
trending = pytrends.trending_searches(pn='south_korea')
# Suggestions
suggestions = pytrends.suggestions('seo')
```
### Rate Limits
- No official limits, but implement delays (1-2 seconds between requests)
- May trigger CAPTCHA with heavy usage
---
## Custom Search JSON API
### Authentication
```python
import requests
API_KEY = 'your-api-key'
CX = 'your-search-engine-id' # Programmable Search Engine ID
BASE_URL = 'https://www.googleapis.com/customsearch/v1'
```
### Request
```python
params = {
'key': API_KEY,
'cx': CX,
'q': 'search query',
'num': 10, # 1-10
'start': 1, # Pagination
'gl': 'kr', # Country
'hl': 'ko' # Language
}
response = requests.get(BASE_URL, params=params)
```
### Response Structure
```json
{
"searchInformation": {
"totalResults": "12345",
"searchTime": 0.5
},
"items": [
{
"title": "Page Title",
"link": "https://example.com",
"snippet": "Description...",
"pagemap": {
"metatags": [...],
"cse_image": [...]
}
}
]
}
```
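A sketch of flattening that response into ranked SERP rows for analysis:

```python
# Mocked Custom Search response in the shape shown above.
response = {
    "searchInformation": {"totalResults": "12345", "searchTime": 0.5},
    "items": [
        {"title": "Page Title", "link": "https://example.com",
         "snippet": "Description..."},
    ],
}

# Rank follows result order; with pagination, offset by the 'start' param.
results = [
    {"rank": i + 1, "title": item["title"], "url": item["link"]}
    for i, item in enumerate(response.get("items", []))
]
```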
### Rate Limits
- 100 queries per day (free)
- 10,000 queries per day ($5 per 1,000)
---
## Knowledge Graph Search API
### Request
```python
API_KEY = 'your-api-key'
BASE_URL = 'https://kgsearch.googleapis.com/v1/entities:search'
params = {
'key': API_KEY,
'query': 'entity name',
'types': 'Organization',
'languages': 'ko',
'limit': 10
}
response = requests.get(BASE_URL, params=params)
```
### Response
```json
{
"itemListElement": [
{
"result": {
"@type": "EntitySearchResult",
"name": "Entity Name",
"description": "Description...",
"@id": "kg:/m/entity_id",
"detailedDescription": {
"articleBody": "..."
}
},
"resultScore": 1234.56
}
]
}
```
---
## Schema.org Reference
### JSON-LD Format
```html
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "Company Name",
"url": "https://example.com"
}
</script>
```
### Common Schema Types
#### Organization
```json
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "Company Name",
"url": "https://example.com",
"logo": "https://example.com/logo.png",
"sameAs": [
"https://facebook.com/company",
"https://twitter.com/company"
],
"contactPoint": {
"@type": "ContactPoint",
"telephone": "+82-2-1234-5678",
"contactType": "customer service"
}
}
```
#### LocalBusiness
```json
{
"@context": "https://schema.org",
"@type": "LocalBusiness",
"name": "Business Name",
"address": {
"@type": "PostalAddress",
"streetAddress": "123 Street",
"addressLocality": "Seoul",
"addressRegion": "Seoul",
"postalCode": "12345",
"addressCountry": "KR"
},
"geo": {
"@type": "GeoCoordinates",
"latitude": 37.5665,
"longitude": 126.9780
},
"openingHoursSpecification": [{
"@type": "OpeningHoursSpecification",
"dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
"opens": "09:00",
"closes": "18:00"
}]
}
```
#### Article/BlogPosting
```json
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Article Title",
"author": {
"@type": "Person",
"name": "Author Name"
},
"datePublished": "2024-01-01",
"dateModified": "2024-01-15",
"image": "https://example.com/image.jpg",
"publisher": {
"@type": "Organization",
"name": "Publisher Name",
"logo": {
"@type": "ImageObject",
"url": "https://example.com/logo.png"
}
}
}
```
#### Product
```json
{
"@context": "https://schema.org",
"@type": "Product",
"name": "Product Name",
"image": "https://example.com/product.jpg",
"description": "Product description",
"brand": {
"@type": "Brand",
"name": "Brand Name"
},
"offers": {
"@type": "Offer",
"price": "29900",
"priceCurrency": "KRW",
"availability": "https://schema.org/InStock"
},
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "4.5",
"reviewCount": "100"
}
}
```
#### FAQPage
```json
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [{
"@type": "Question",
"name": "Question text?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Answer text."
}
}]
}
```
#### BreadcrumbList
```json
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [{
"@type": "ListItem",
"position": 1,
"name": "Home",
"item": "https://example.com/"
}, {
"@type": "ListItem",
"position": 2,
"name": "Category",
"item": "https://example.com/category/"
}]
}
```
#### WebSite (with SearchAction)
```json
{
"@context": "https://schema.org",
"@type": "WebSite",
"name": "Site Name",
"url": "https://example.com",
"potentialAction": {
"@type": "SearchAction",
"target": {
"@type": "EntryPoint",
"urlTemplate": "https://example.com/search?q={search_term_string}"
},
"query-input": "required name=search_term_string"
}
}
```
---
## XML Sitemap Specification
### Format
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/page</loc>
<lastmod>2024-01-15</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
```
### Index Sitemap
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-posts.xml</loc>
<lastmod>2024-01-15</lastmod>
</sitemap>
</sitemapindex>
```
### Limits
- 50,000 URLs max per sitemap
- 50MB uncompressed max
- Use index for larger sites
### Best Practices
- Use absolute URLs
- Include only canonical URLs
- Keep lastmod accurate
- Exclude noindex pages
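A minimal generator that respects the namespace and URL limit above can be sketched with `xml.etree` alone (`build_sitemap` is an illustrative helper, not part of the skill's scripts):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: list of dicts with 'loc' and optional lastmod/changefreq/priority."""
    ET.register_namespace("", SITEMAP_NS)  # serialize without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries[:50000]:  # spec limit: 50,000 URLs per file
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        for key in ("loc", "lastmod", "changefreq", "priority"):
            if key in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{key}").text = str(entry[key])
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml = build_sitemap([{"loc": "https://example.com/page", "lastmod": "2024-01-15"}])
print("<loc>https://example.com/page</loc>" in xml)  # True
```

Sites above the limit would shard `entries` into chunks of 50,000 and emit a `<sitemapindex>` pointing at each shard.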
---
## Robots.txt Reference
### Directives
```txt
# Comments start with #
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
User-agent: Googlebot
Disallow: /no-google/
Crawl-delay: 1
Sitemap: https://example.com/sitemap.xml
```
### Common User-agents
- `*` - All bots
- `Googlebot` - Google crawler
- `Googlebot-Image` - Google Image crawler
- `Bingbot` - Bing crawler
- `Yandex` - Yandex crawler
- `Baiduspider` - Baidu crawler
### Pattern Matching
- `*` - Wildcard (any sequence)
- `$` - End of URL
- `/path/` - Directory
- `/*.pdf$` - All PDFs
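Note that the stdlib `urllib.robotparser` shown in the next section historically implements simple prefix matching, so the `*`/`$` extensions may not be honored there. A hand-rolled matcher for these patterns can be sketched by translating them to regexes (`robots_pattern_to_regex` is illustrative; `robots_checker.py` may take a different approach):

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern (with * and $) into a compiled regex."""
    regex = re.escape(pattern).replace(r"\*", ".*")  # * matches any sequence
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # trailing $ anchors at end of URL
    return re.compile(regex)

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))      # True
print(bool(rule.match("/files/report.pdf?v=2")))  # False
```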
### Testing
```python
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
# Check if URL is allowed
can_fetch = rp.can_fetch("Googlebot", "https://example.com/page")
```
---
## Error Handling
### HTTP Status Codes
| Code | Meaning | Action |
|------|---------|--------|
| 200 | OK | Process response |
| 301/302 | Redirect | Follow or flag |
| 400 | Bad Request | Check parameters |
| 401 | Unauthorized | Check credentials |
| 403 | Forbidden | Check permissions |
| 404 | Not Found | Flag missing resource |
| 429 | Rate Limited | Implement backoff |
| 500 | Server Error | Retry with backoff |
| 503 | Service Unavailable | Retry later |
### Retry Strategy
```python
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def make_request(url):
# Request logic
pass
```
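The skill's clients use `tenacity` for this policy; if that dependency is unavailable, the same shape (fixed attempt count, exponential wait with a cap) can be approximated with the stdlib alone. `retry_async` and `flaky` are illustrative names, not part of the skill's scripts:

```python
import asyncio

async def retry_async(fetch, url, attempts=3, base=2.0, cap=10.0):
    """Retry an async callable with exponential backoff: base, 2*base, 4*base..., capped."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            await asyncio.sleep(min(cap, base * 2 ** attempt))

# Demo: a flaky fetch that fails twice, then succeeds on the third try
calls = {"n": 0}

async def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(asyncio.run(retry_async(flaky, "https://example.com", base=0.01)))  # ok
```

Production code would typically narrow the `except` clause to retryable errors (429/5xx, timeouts) per the status table above, rather than retrying everything.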

@@ -0,0 +1,207 @@
"""
Base Client - Shared async client utilities
===========================================
Purpose: Rate-limited async operations for API clients
Python: 3.10+
"""
import asyncio
import logging
import os
from asyncio import Semaphore
from datetime import datetime
from typing import Any, Callable, TypeVar
from dotenv import load_dotenv
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
)
# Load environment variables
load_dotenv()
# Logging setup
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
T = TypeVar("T")
class RateLimiter:
"""Rate limiter using token bucket algorithm."""
def __init__(self, rate: float, per: float = 1.0):
"""
Initialize rate limiter.
Args:
rate: Number of requests allowed
per: Time period in seconds (default: 1 second)
"""
self.rate = rate
self.per = per
self.tokens = rate
self.last_update = datetime.now()
self._lock = asyncio.Lock()
async def acquire(self) -> None:
"""Acquire a token, waiting if necessary."""
async with self._lock:
now = datetime.now()
elapsed = (now - self.last_update).total_seconds()
self.tokens = min(self.rate, self.tokens + elapsed * (self.rate / self.per))
self.last_update = now
if self.tokens < 1:
wait_time = (1 - self.tokens) * (self.per / self.rate)
await asyncio.sleep(wait_time)
self.tokens = 0
else:
self.tokens -= 1
class BaseAsyncClient:
"""Base class for async API clients with rate limiting."""
def __init__(
self,
max_concurrent: int = 5,
requests_per_second: float = 3.0,
logger: logging.Logger | None = None,
):
"""
Initialize base client.
Args:
max_concurrent: Maximum concurrent requests
requests_per_second: Rate limit
logger: Logger instance
"""
self.semaphore = Semaphore(max_concurrent)
self.rate_limiter = RateLimiter(requests_per_second)
self.logger = logger or logging.getLogger(self.__class__.__name__)
self.stats = {
"requests": 0,
"success": 0,
"errors": 0,
"retries": 0,
}
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=10),
retry=retry_if_exception_type(Exception),
)
async def _rate_limited_request(
self,
coro: Callable[[], Any],
) -> Any:
"""Execute a request with rate limiting and retry."""
async with self.semaphore:
await self.rate_limiter.acquire()
self.stats["requests"] += 1
try:
result = await coro()
self.stats["success"] += 1
return result
except Exception as e:
self.stats["errors"] += 1
self.logger.error(f"Request failed: {e}")
raise
async def batch_requests(
self,
requests: list[Callable[[], Any]],
desc: str = "Processing",
) -> list[Any]:
"""Execute multiple requests concurrently."""
try:
from tqdm.asyncio import tqdm
has_tqdm = True
except ImportError:
has_tqdm = False
async def execute(req: Callable) -> Any:
try:
return await self._rate_limited_request(req)
except Exception as e:
return {"error": str(e)}
tasks = [execute(req) for req in requests]
if has_tqdm:
results = []
for coro in tqdm.as_completed(tasks, total=len(tasks), desc=desc):
result = await coro
results.append(result)
return results
else:
return await asyncio.gather(*tasks, return_exceptions=True)
def print_stats(self) -> None:
"""Print request statistics."""
self.logger.info("=" * 40)
self.logger.info("Request Statistics:")
self.logger.info(f" Total Requests: {self.stats['requests']}")
self.logger.info(f" Successful: {self.stats['success']}")
self.logger.info(f" Errors: {self.stats['errors']}")
self.logger.info("=" * 40)
class ConfigManager:
"""Manage API configuration and credentials."""
def __init__(self):
load_dotenv()
@property
def google_credentials_path(self) -> str | None:
"""Get Google service account credentials path."""
# Prefer SEO-specific credentials, fallback to general credentials
seo_creds = os.path.expanduser("~/.credential/ourdigital-seo-agent.json")
if os.path.exists(seo_creds):
return seo_creds
return os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
@property
def pagespeed_api_key(self) -> str | None:
"""Get PageSpeed Insights API key."""
return os.getenv("PAGESPEED_API_KEY")
@property
def custom_search_api_key(self) -> str | None:
"""Get Custom Search API key."""
return os.getenv("CUSTOM_SEARCH_API_KEY")
@property
def custom_search_engine_id(self) -> str | None:
"""Get Custom Search Engine ID."""
return os.getenv("CUSTOM_SEARCH_ENGINE_ID")
@property
def notion_token(self) -> str | None:
"""Get Notion API token."""
return os.getenv("NOTION_TOKEN") or os.getenv("NOTION_API_KEY")
def validate_google_credentials(self) -> bool:
"""Validate Google credentials are configured."""
creds_path = self.google_credentials_path
if not creds_path:
return False
return os.path.exists(creds_path)
def get_required(self, key: str) -> str:
"""Get required environment variable or raise error."""
value = os.getenv(key)
if not value:
raise ValueError(f"Missing required environment variable: {key}")
return value
# Singleton config instance
config = ConfigManager()

@@ -0,0 +1,497 @@
"""
Full SEO Audit - Orchestration Script
=====================================
Purpose: Run comprehensive SEO audit combining all tools
Python: 3.10+
Usage:
python full_audit.py --url https://example.com --output notion --notion-page-id abc123
"""
import argparse
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
from urllib.parse import urlparse
from robots_checker import RobotsChecker
from schema_validator import SchemaValidator
from sitemap_validator import SitemapValidator
from pagespeed_client import PageSpeedClient
from notion_reporter import NotionReporter, SEOFinding
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
@dataclass
class AuditResult:
"""Complete SEO audit result."""
url: str
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
robots: dict = field(default_factory=dict)
sitemap: dict = field(default_factory=dict)
schema: dict = field(default_factory=dict)
performance: dict = field(default_factory=dict)
findings: list[SEOFinding] = field(default_factory=list)
summary: dict = field(default_factory=dict)
def to_dict(self) -> dict:
return {
"url": self.url,
"timestamp": self.timestamp,
"robots": self.robots,
"sitemap": self.sitemap,
"schema": self.schema,
"performance": self.performance,
"summary": self.summary,
"findings_count": len(self.findings),
}
class SEOAuditor:
"""Orchestrate comprehensive SEO audit."""
def __init__(self):
self.robots_checker = RobotsChecker()
self.sitemap_validator = SitemapValidator()
self.schema_validator = SchemaValidator()
self.pagespeed_client = PageSpeedClient()
def run_audit(
self,
url: str,
include_robots: bool = True,
include_sitemap: bool = True,
include_schema: bool = True,
include_performance: bool = True,
) -> AuditResult:
"""
Run comprehensive SEO audit.
Args:
url: URL to audit
include_robots: Check robots.txt
include_sitemap: Validate sitemap
include_schema: Validate schema markup
include_performance: Run PageSpeed analysis
"""
result = AuditResult(url=url)
parsed_url = urlparse(url)
base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
logger.info(f"Starting SEO audit for {url}")
# 1. Robots.txt analysis
if include_robots:
logger.info("Analyzing robots.txt...")
try:
robots_result = self.robots_checker.analyze(base_url)
result.robots = robots_result.to_dict()
self._process_robots_findings(robots_result, result)
except Exception as e:
logger.error(f"Robots.txt analysis failed: {e}")
result.robots = {"error": str(e)}
# 2. Sitemap validation
if include_sitemap:
logger.info("Validating sitemap...")
sitemap_url = f"{base_url}/sitemap.xml"
# Try to get sitemap URL from robots.txt
if result.robots.get("sitemaps"):
sitemap_url = result.robots["sitemaps"][0]
try:
sitemap_result = self.sitemap_validator.validate(sitemap_url)
result.sitemap = sitemap_result.to_dict()
self._process_sitemap_findings(sitemap_result, result)
except Exception as e:
logger.error(f"Sitemap validation failed: {e}")
result.sitemap = {"error": str(e)}
# 3. Schema validation
if include_schema:
logger.info("Validating schema markup...")
try:
schema_result = self.schema_validator.validate(url=url)
result.schema = schema_result.to_dict()
self._process_schema_findings(schema_result, result)
except Exception as e:
logger.error(f"Schema validation failed: {e}")
result.schema = {"error": str(e)}
# 4. PageSpeed analysis
if include_performance:
logger.info("Running PageSpeed analysis...")
try:
perf_result = self.pagespeed_client.analyze(url, strategy="mobile")
result.performance = perf_result.to_dict()
self._process_performance_findings(perf_result, result)
except Exception as e:
logger.error(f"PageSpeed analysis failed: {e}")
result.performance = {"error": str(e)}
# Generate summary
result.summary = self._generate_summary(result)
logger.info(f"Audit complete. Found {len(result.findings)} issues.")
return result
def _process_robots_findings(self, robots_result, audit_result: AuditResult):
"""Convert robots.txt issues to findings."""
for issue in robots_result.issues:
priority = "Medium"
if issue.severity == "error":
priority = "Critical"
elif issue.severity == "warning":
priority = "High"
audit_result.findings.append(SEOFinding(
issue=issue.message,
category="Robots.txt",
priority=priority,
description=issue.directive or "",
recommendation=issue.suggestion or "",
))
def _process_sitemap_findings(self, sitemap_result, audit_result: AuditResult):
"""Convert sitemap issues to findings."""
for issue in sitemap_result.issues:
priority = "Medium"
if issue.severity == "error":
priority = "High"
elif issue.severity == "warning":
priority = "Medium"
audit_result.findings.append(SEOFinding(
issue=issue.message,
category="Sitemap",
priority=priority,
url=issue.url,
recommendation=issue.suggestion or "",
))
def _process_schema_findings(self, schema_result, audit_result: AuditResult):
"""Convert schema issues to findings."""
for issue in schema_result.issues:
priority = "Low"
if issue.severity == "error":
priority = "High"
elif issue.severity == "warning":
priority = "Medium"
audit_result.findings.append(SEOFinding(
issue=issue.message,
category="Schema/Structured Data",
priority=priority,
description=f"Schema type: {issue.schema_type}" if issue.schema_type else "",
recommendation=issue.suggestion or "",
))
def _process_performance_findings(self, perf_result, audit_result: AuditResult):
"""Convert performance issues to findings."""
cwv = perf_result.core_web_vitals
# Check Core Web Vitals
if cwv.lcp_rating == "POOR":
audit_result.findings.append(SEOFinding(
issue=f"Poor LCP: {cwv.lcp / 1000:.2f}s (should be < 2.5s)",
category="Performance",
priority="Critical",
impact="Users experience slow page loads, affecting bounce rate and rankings",
recommendation="Optimize images, reduce server response time, use CDN",
))
elif cwv.lcp_rating == "NEEDS_IMPROVEMENT":
audit_result.findings.append(SEOFinding(
issue=f"LCP needs improvement: {cwv.lcp / 1000:.2f}s (target < 2.5s)",
category="Performance",
priority="High",
recommendation="Optimize largest content element loading",
))
if cwv.cls_rating == "POOR":
audit_result.findings.append(SEOFinding(
issue=f"Poor CLS: {cwv.cls:.3f} (should be < 0.1)",
category="Performance",
priority="High",
impact="Layout shifts frustrate users",
recommendation="Set dimensions for images/embeds, avoid inserting content above existing content",
))
if cwv.fid_rating == "POOR":
audit_result.findings.append(SEOFinding(
issue=f"Poor FID/TBT: {cwv.fid:.0f}ms (should be < 100ms)",
category="Performance",
priority="High",
impact="Slow interactivity affects user experience",
recommendation="Reduce JavaScript execution time, break up long tasks",
))
# Check performance score
        if perf_result.performance_score is not None and perf_result.performance_score < 50:
audit_result.findings.append(SEOFinding(
issue=f"Low performance score: {perf_result.performance_score:.0f}/100",
category="Performance",
priority="High",
impact="Poor performance affects user experience and SEO",
recommendation="Address top opportunities from PageSpeed Insights",
))
# Add top opportunities as findings
for opp in perf_result.opportunities[:3]:
if opp["savings_ms"] > 500: # Only significant savings
audit_result.findings.append(SEOFinding(
issue=opp["title"],
category="Performance",
priority="Medium",
description=opp.get("description", ""),
impact=f"Potential savings: {opp['savings_ms'] / 1000:.1f}s",
recommendation="See PageSpeed Insights for details",
))
def _generate_summary(self, result: AuditResult) -> dict:
"""Generate audit summary."""
findings_by_priority = {}
findings_by_category = {}
for finding in result.findings:
# Count by priority
findings_by_priority[finding.priority] = (
findings_by_priority.get(finding.priority, 0) + 1
)
# Count by category
findings_by_category[finding.category] = (
findings_by_category.get(finding.category, 0) + 1
)
return {
"total_findings": len(result.findings),
"findings_by_priority": findings_by_priority,
"findings_by_category": findings_by_category,
"robots_accessible": result.robots.get("accessible", False),
"sitemap_valid": result.sitemap.get("valid", False),
"schema_valid": result.schema.get("valid", False),
"performance_score": result.performance.get("scores", {}).get("performance"),
"quick_wins": [
f.issue for f in result.findings
if f.priority in ("Medium", "Low")
][:5],
"critical_issues": [
f.issue for f in result.findings
if f.priority == "Critical"
],
}
def export_to_notion(
self,
result: AuditResult,
parent_page_id: str | None = None,
use_default_db: bool = True,
) -> dict:
"""
Export audit results to Notion.
Args:
result: AuditResult object
parent_page_id: Parent page ID (for creating new database)
use_default_db: If True, use OurDigital SEO Audit Log database
Returns:
Dict with database_id, summary_page_id, findings_created
"""
reporter = NotionReporter()
audit_id = f"{urlparse(result.url).netloc}-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
# Add site and audit_id to all findings
for finding in result.findings:
finding.site = result.url
finding.audit_id = audit_id
if use_default_db:
# Use the default OurDigital SEO Audit Log database
page_ids = reporter.add_findings_batch(result.findings)
return {
                "database_id": getattr(reporter, "DEFAULT_DATABASE_ID", "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"),
"audit_id": audit_id,
"findings_created": len(page_ids),
}
else:
# Create new database under parent page
if not parent_page_id:
raise ValueError("parent_page_id required when not using default database")
db_title = f"SEO Audit - {urlparse(result.url).netloc} - {datetime.now().strftime('%Y-%m-%d')}"
database_id = reporter.create_findings_database(parent_page_id, db_title)
page_ids = reporter.add_findings_batch(result.findings, database_id)
# Create summary page
summary_page_id = reporter.create_audit_summary_page(
parent_page_id,
result.url,
result.summary,
)
return {
"database_id": database_id,
"summary_page_id": summary_page_id,
"audit_id": audit_id,
"findings_created": len(page_ids),
}
def generate_report(self, result: AuditResult) -> str:
"""Generate human-readable report."""
lines = [
"=" * 70,
"SEO AUDIT REPORT",
"=" * 70,
f"URL: {result.url}",
f"Date: {result.timestamp}",
"",
"-" * 70,
"SUMMARY",
"-" * 70,
f"Total Issues Found: {result.summary.get('total_findings', 0)}",
"",
]
# Priority breakdown
lines.append("Issues by Priority:")
for priority in ["Critical", "High", "Medium", "Low"]:
count = result.summary.get("findings_by_priority", {}).get(priority, 0)
if count:
lines.append(f" {priority}: {count}")
lines.append("")
# Category breakdown
lines.append("Issues by Category:")
for category, count in result.summary.get("findings_by_category", {}).items():
lines.append(f" {category}: {count}")
lines.append("")
lines.append("-" * 70)
lines.append("STATUS OVERVIEW")
lines.append("-" * 70)
# Status checks
lines.append(f"Robots.txt: {'✓ Accessible' if result.robots.get('accessible') else '✗ Not accessible'}")
lines.append(f"Sitemap: {'✓ Valid' if result.sitemap.get('valid') else '✗ Issues found'}")
lines.append(f"Schema: {'✓ Valid' if result.schema.get('valid') else '✗ Issues found'}")
perf_score = result.performance.get("scores", {}).get("performance")
        if perf_score is not None:
status = "✓ Good" if perf_score >= 90 else "⚠ Needs work" if perf_score >= 50 else "✗ Poor"
lines.append(f"Performance: {status} ({perf_score:.0f}/100)")
# Critical issues
critical = result.summary.get("critical_issues", [])
if critical:
lines.extend([
"",
"-" * 70,
"CRITICAL ISSUES (Fix Immediately)",
"-" * 70,
])
for issue in critical:
lines.append(f"{issue}")
# Quick wins
quick_wins = result.summary.get("quick_wins", [])
if quick_wins:
lines.extend([
"",
"-" * 70,
"QUICK WINS",
"-" * 70,
])
for issue in quick_wins[:5]:
lines.append(f"{issue}")
# All findings
if result.findings:
lines.extend([
"",
"-" * 70,
"ALL FINDINGS",
"-" * 70,
])
        current_category = None
        # Rank priorities explicitly; a plain string sort would put "Low" before "Medium"
        priority_rank = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}
        for finding in sorted(result.findings, key=lambda x: (x.category, priority_rank.get(x.priority, 4))):
if finding.category != current_category:
current_category = finding.category
lines.append(f"\n[{current_category}]")
lines.append(f" [{finding.priority}] {finding.issue}")
if finding.recommendation:
lines.append(f"{finding.recommendation}")
lines.extend(["", "=" * 70])
return "\n".join(lines)
def main():
"""CLI entry point."""
parser = argparse.ArgumentParser(
description="Run comprehensive SEO audit",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Run full audit and output to console
python full_audit.py --url https://example.com
# Export to Notion
python full_audit.py --url https://example.com --output notion --notion-page-id abc123
# Output as JSON
python full_audit.py --url https://example.com --json
""",
)
parser.add_argument("--url", "-u", required=True, help="URL to audit")
parser.add_argument("--output", "-o", choices=["console", "notion", "json"],
default="console", help="Output format")
parser.add_argument("--notion-page-id", help="Notion parent page ID (required for notion output)")
parser.add_argument("--json", action="store_true", help="Output as JSON")
parser.add_argument("--no-robots", action="store_true", help="Skip robots.txt check")
parser.add_argument("--no-sitemap", action="store_true", help="Skip sitemap validation")
parser.add_argument("--no-schema", action="store_true", help="Skip schema validation")
parser.add_argument("--no-performance", action="store_true", help="Skip PageSpeed analysis")
args = parser.parse_args()
auditor = SEOAuditor()
# Run audit
result = auditor.run_audit(
args.url,
include_robots=not args.no_robots,
include_sitemap=not args.no_sitemap,
include_schema=not args.no_schema,
include_performance=not args.no_performance,
)
# Output results
if args.json or args.output == "json":
print(json.dumps(result.to_dict(), indent=2, default=str))
elif args.output == "notion":
if not args.notion_page_id:
parser.error("--notion-page-id required for notion output")
notion_result = auditor.export_to_notion(result, args.notion_page_id)
        print("Exported to Notion:")
        print(f"  Database ID: {notion_result['database_id']}")
        # summary_page_id only exists when a new database was created (use_default_db=False)
        if notion_result.get("summary_page_id"):
            print(f"  Summary Page: {notion_result['summary_page_id']}")
        print(f"  Findings Created: {notion_result['findings_created']}")
else:
print(auditor.generate_report(result))
if __name__ == "__main__":
main()

@@ -0,0 +1,409 @@
"""
Google Search Console Client
============================
Purpose: Interact with Google Search Console API for SEO data
Python: 3.10+
Usage:
from gsc_client import SearchConsoleClient
client = SearchConsoleClient()
data = client.get_search_analytics("sc-domain:example.com")
"""
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any
from google.oauth2 import service_account
from googleapiclient.discovery import build
from base_client import config
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
@dataclass
class SearchAnalyticsResult:
"""Search analytics query result."""
rows: list[dict] = field(default_factory=list)
total_clicks: int = 0
total_impressions: int = 0
average_ctr: float = 0.0
average_position: float = 0.0
@dataclass
class SitemapInfo:
"""Sitemap information from Search Console."""
path: str
last_submitted: str | None = None
last_downloaded: str | None = None
is_pending: bool = False
is_sitemaps_index: bool = False
warnings: int = 0
errors: int = 0
class SearchConsoleClient:
"""Client for Google Search Console API."""
SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
def __init__(self, credentials_path: str | None = None):
"""
Initialize Search Console client.
Args:
credentials_path: Path to service account JSON key
"""
self.credentials_path = credentials_path or config.google_credentials_path
self._service = None
@property
def service(self):
"""Get or create Search Console service."""
if self._service is None:
if not self.credentials_path:
raise ValueError(
"Google credentials not configured. "
"Set GOOGLE_APPLICATION_CREDENTIALS environment variable."
)
credentials = service_account.Credentials.from_service_account_file(
self.credentials_path,
scopes=self.SCOPES,
)
self._service = build("searchconsole", "v1", credentials=credentials)
return self._service
def list_sites(self) -> list[dict]:
"""List all sites accessible to the service account."""
response = self.service.sites().list().execute()
return response.get("siteEntry", [])
def get_search_analytics(
self,
site_url: str,
start_date: str | None = None,
end_date: str | None = None,
dimensions: list[str] | None = None,
row_limit: int = 25000,
filters: list[dict] | None = None,
) -> SearchAnalyticsResult:
"""
Get search analytics data.
Args:
site_url: Site URL (e.g., "sc-domain:example.com" or "https://example.com/")
start_date: Start date (YYYY-MM-DD), defaults to 30 days ago
end_date: End date (YYYY-MM-DD), defaults to yesterday
dimensions: List of dimensions (query, page, country, device, date)
row_limit: Maximum rows to return
filters: Dimension filters
Returns:
SearchAnalyticsResult with rows and summary stats
"""
# Default date range: last 30 days
if not end_date:
end_date = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
if not start_date:
start_date = (datetime.now() - timedelta(days=30)).strftime("%Y-%m-%d")
# Default dimensions
if dimensions is None:
dimensions = ["query", "page"]
request_body = {
"startDate": start_date,
"endDate": end_date,
"dimensions": dimensions,
"rowLimit": row_limit,
}
if filters:
request_body["dimensionFilterGroups"] = [{"filters": filters}]
try:
response = self.service.searchanalytics().query(
siteUrl=site_url,
body=request_body,
).execute()
except Exception as e:
logger.error(f"Failed to query search analytics: {e}")
raise
rows = response.get("rows", [])
        # Impression-weighted totals (matches how Search Console aggregates CTR/position;
        # an unweighted mean over rows would overstate low-traffic queries)
        total_clicks = sum(row.get("clicks", 0) for row in rows)
        total_impressions = sum(row.get("impressions", 0) for row in rows)
        avg_ctr = (total_clicks / total_impressions) if total_impressions else 0.0
        weighted_position = sum(
            row.get("position", 0) * row.get("impressions", 0) for row in rows
        )
        avg_position = (weighted_position / total_impressions) if total_impressions else 0.0
return SearchAnalyticsResult(
rows=rows,
total_clicks=total_clicks,
total_impressions=total_impressions,
average_ctr=avg_ctr,
average_position=avg_position,
)
def get_top_queries(
self,
site_url: str,
limit: int = 100,
start_date: str | None = None,
end_date: str | None = None,
) -> list[dict]:
"""Get top search queries by clicks."""
result = self.get_search_analytics(
site_url=site_url,
dimensions=["query"],
row_limit=limit,
start_date=start_date,
end_date=end_date,
)
# Sort by clicks
sorted_rows = sorted(
result.rows,
key=lambda x: x.get("clicks", 0),
reverse=True,
)
return [
{
"query": row["keys"][0],
"clicks": row.get("clicks", 0),
"impressions": row.get("impressions", 0),
"ctr": row.get("ctr", 0),
"position": row.get("position", 0),
}
for row in sorted_rows[:limit]
]
def get_top_pages(
self,
site_url: str,
limit: int = 100,
start_date: str | None = None,
end_date: str | None = None,
) -> list[dict]:
"""Get top pages by clicks."""
result = self.get_search_analytics(
site_url=site_url,
dimensions=["page"],
row_limit=limit,
start_date=start_date,
end_date=end_date,
)
sorted_rows = sorted(
result.rows,
key=lambda x: x.get("clicks", 0),
reverse=True,
)
return [
{
"page": row["keys"][0],
"clicks": row.get("clicks", 0),
"impressions": row.get("impressions", 0),
"ctr": row.get("ctr", 0),
"position": row.get("position", 0),
}
for row in sorted_rows[:limit]
]
def get_sitemaps(self, site_url: str) -> list[SitemapInfo]:
"""Get list of sitemaps for a site."""
try:
response = self.service.sitemaps().list(siteUrl=site_url).execute()
except Exception as e:
logger.error(f"Failed to get sitemaps: {e}")
raise
sitemaps = []
for sm in response.get("sitemap", []):
sitemaps.append(SitemapInfo(
path=sm.get("path", ""),
last_submitted=sm.get("lastSubmitted"),
last_downloaded=sm.get("lastDownloaded"),
is_pending=sm.get("isPending", False),
is_sitemaps_index=sm.get("isSitemapsIndex", False),
warnings=sm.get("warnings", 0),
errors=sm.get("errors", 0),
))
return sitemaps
def submit_sitemap(self, site_url: str, sitemap_url: str) -> bool:
"""Submit a sitemap for indexing."""
try:
self.service.sitemaps().submit(
siteUrl=site_url,
feedpath=sitemap_url,
).execute()
logger.info(f"Submitted sitemap: {sitemap_url}")
return True
except Exception as e:
logger.error(f"Failed to submit sitemap: {e}")
return False
def inspect_url(self, site_url: str, inspection_url: str) -> dict:
"""
Inspect a URL's indexing status.
Note: This uses the URL Inspection API which may have different quotas.
"""
try:
response = self.service.urlInspection().index().inspect(
body={
"inspectionUrl": inspection_url,
"siteUrl": site_url,
}
).execute()
result = response.get("inspectionResult", {})
return {
"url": inspection_url,
"indexing_state": result.get("indexStatusResult", {}).get(
"coverageState", "Unknown"
),
"last_crawl_time": result.get("indexStatusResult", {}).get(
"lastCrawlTime"
),
"crawled_as": result.get("indexStatusResult", {}).get("crawledAs"),
"robots_txt_state": result.get("indexStatusResult", {}).get(
"robotsTxtState"
),
"mobile_usability": result.get("mobileUsabilityResult", {}).get(
"verdict", "Unknown"
),
"rich_results": result.get("richResultsResult", {}).get(
"verdict", "Unknown"
),
}
except Exception as e:
logger.error(f"Failed to inspect URL: {e}")
raise
def get_performance_summary(
self,
site_url: str,
days: int = 30,
) -> dict:
"""Get a summary of search performance."""
end_date = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
start_date = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
# Get overall stats
overall = self.get_search_analytics(
site_url=site_url,
dimensions=[],
start_date=start_date,
end_date=end_date,
)
# Get top queries
top_queries = self.get_top_queries(
site_url=site_url,
limit=10,
start_date=start_date,
end_date=end_date,
)
# Get top pages
top_pages = self.get_top_pages(
site_url=site_url,
limit=10,
start_date=start_date,
end_date=end_date,
)
# Get by device
by_device = self.get_search_analytics(
site_url=site_url,
dimensions=["device"],
start_date=start_date,
end_date=end_date,
)
device_breakdown = {}
for row in by_device.rows:
device = row["keys"][0]
device_breakdown[device] = {
"clicks": row.get("clicks", 0),
"impressions": row.get("impressions", 0),
"ctr": row.get("ctr", 0),
"position": row.get("position", 0),
}
return {
"period": f"{start_date} to {end_date}",
"total_clicks": overall.total_clicks,
"total_impressions": overall.total_impressions,
"average_ctr": overall.average_ctr,
"average_position": overall.average_position,
"top_queries": top_queries,
"top_pages": top_pages,
"by_device": device_breakdown,
}
def main():
"""Test the Search Console client."""
import argparse
parser = argparse.ArgumentParser(description="Google Search Console Client")
parser.add_argument("--site", "-s", required=True, help="Site URL")
parser.add_argument("--action", "-a", default="summary",
choices=["summary", "queries", "pages", "sitemaps", "inspect"],
help="Action to perform")
parser.add_argument("--url", help="URL to inspect")
parser.add_argument("--days", type=int, default=30, help="Days of data")
args = parser.parse_args()
client = SearchConsoleClient()
if args.action == "summary":
summary = client.get_performance_summary(args.site, args.days)
import json
print(json.dumps(summary, indent=2, default=str))
elif args.action == "queries":
queries = client.get_top_queries(args.site)
for q in queries[:20]:
print(f"{q['query']}: {q['clicks']} clicks, pos {q['position']:.1f}")
elif args.action == "pages":
pages = client.get_top_pages(args.site)
for p in pages[:20]:
print(f"{p['page']}: {p['clicks']} clicks, pos {p['position']:.1f}")
elif args.action == "sitemaps":
sitemaps = client.get_sitemaps(args.site)
for sm in sitemaps:
print(f"{sm.path}: errors={sm.errors}, warnings={sm.warnings}")
elif args.action == "inspect" and args.url:
result = client.inspect_url(args.site, args.url)
import json
print(json.dumps(result, indent=2))
if __name__ == "__main__":
main()

@@ -0,0 +1,951 @@
"""
Notion Reporter - Create SEO audit findings in Notion
=====================================================
Purpose: Output SEO audit findings to Notion databases
Python: 3.10+
Usage:
from notion_reporter import NotionReporter, SEOFinding, AuditReport
reporter = NotionReporter()
# Create audit report with checklist table
report = AuditReport(site="https://example.com")
report.add_finding(SEOFinding(...))
reporter.create_audit_report(report)
"""
import json
import logging
import os
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Any
from notion_client import Client
from base_client import config
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
# Template directory
TEMPLATE_DIR = Path(__file__).parent.parent / "templates"
# Default OurDigital SEO Audit Log database
DEFAULT_DATABASE_ID = "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"
# Default parent page for audit reports (OurDigital SEO Audit Log)
DEFAULT_AUDIT_REPORTS_PAGE_ID = "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"
@dataclass
class SEOFinding:
"""Represents an SEO audit finding."""
issue: str
category: str
priority: str
status: str = "To Fix"
url: str | None = None
description: str | None = None
impact: str | None = None
recommendation: str | None = None
site: str | None = None # The audited site URL
audit_id: str | None = None # Groups findings from same audit session
affected_urls: list[str] = field(default_factory=list) # List of all affected URLs
@dataclass
class AuditReport:
"""Represents a complete SEO audit report with checklist."""
site: str
audit_id: str = field(default_factory=lambda: datetime.now().strftime("%Y%m%d-%H%M%S"))
audit_date: datetime = field(default_factory=datetime.now)
findings: list[SEOFinding] = field(default_factory=list)
# Audit check results
robots_txt_status: str = "Not checked"
sitemap_status: str = "Not checked"
schema_status: str = "Not checked"
performance_status: str = "Not checked"
# Summary statistics
total_urls_checked: int = 0
total_issues: int = 0
def add_finding(self, finding: SEOFinding) -> None:
"""Add a finding to the report."""
finding.site = self.site
finding.audit_id = f"{self.site.replace('https://', '').replace('http://', '').split('/')[0]}-{self.audit_id}"
self.findings.append(finding)
self.total_issues = len(self.findings)
def get_findings_by_priority(self) -> dict[str, list[SEOFinding]]:
"""Group findings by priority."""
result = {"Critical": [], "High": [], "Medium": [], "Low": []}
for f in self.findings:
if f.priority in result:
result[f.priority].append(f)
return result
def get_findings_by_category(self) -> dict[str, list[SEOFinding]]:
"""Group findings by category."""
result = {}
for f in self.findings:
if f.category not in result:
result[f.category] = []
result[f.category].append(f)
return result
class NotionReporter:
"""Create and manage SEO audit findings in Notion."""
CATEGORIES = [
"Technical SEO",
"On-page SEO",
"Content",
"Local SEO",
"Performance",
"Schema/Structured Data",
"Sitemap",
"Robots.txt",
]
PRIORITIES = ["Critical", "High", "Medium", "Low"]
STATUSES = ["To Fix", "In Progress", "Fixed", "Monitoring"]
CATEGORY_COLORS = {
"Technical SEO": "blue",
"On-page SEO": "green",
"Content": "purple",
"Local SEO": "orange",
"Performance": "red",
"Schema/Structured Data": "yellow",
"Sitemap": "pink",
"Robots.txt": "gray",
}
PRIORITY_COLORS = {
"Critical": "red",
"High": "orange",
"Medium": "yellow",
"Low": "gray",
}
def __init__(self, token: str | None = None):
"""
Initialize Notion reporter.
Args:
token: Notion API token
"""
self.token = token or config.notion_token
if not self.token:
raise ValueError(
"Notion token not configured. "
"Set NOTION_TOKEN or NOTION_API_KEY environment variable."
)
self.client = Client(auth=self.token)
def create_findings_database(
self,
parent_page_id: str,
title: str = "SEO Audit Findings",
) -> str:
"""
Create a new SEO findings database.
Args:
parent_page_id: Parent page ID for the database
title: Database title
Returns:
Database ID
"""
# Build database schema
properties = {
"Issue": {"title": {}},
"Category": {
"select": {
"options": [
{"name": cat, "color": self.CATEGORY_COLORS.get(cat, "default")}
for cat in self.CATEGORIES
]
}
},
"Priority": {
"select": {
"options": [
{"name": pri, "color": self.PRIORITY_COLORS.get(pri, "default")}
for pri in self.PRIORITIES
]
}
},
"Status": {
"status": {
"options": [
{"name": "To Fix", "color": "red"},
{"name": "In Progress", "color": "yellow"},
{"name": "Fixed", "color": "green"},
{"name": "Monitoring", "color": "blue"},
],
"groups": [
{"name": "To-do", "option_ids": [], "color": "gray"},
{"name": "In progress", "option_ids": [], "color": "blue"},
{"name": "Complete", "option_ids": [], "color": "green"},
],
}
},
"URL": {"url": {}},
"Description": {"rich_text": {}},
"Impact": {"rich_text": {}},
"Recommendation": {"rich_text": {}},
"Found Date": {"date": {}},
}
try:
response = self.client.databases.create(
parent={"page_id": parent_page_id},
title=[{"type": "text", "text": {"content": title}}],
properties=properties,
)
database_id = response["id"]
logger.info(f"Created database: {database_id}")
return database_id
except Exception as e:
logger.error(f"Failed to create database: {e}")
raise
def add_finding(
self,
finding: SEOFinding,
database_id: str | None = None,
) -> str:
"""
Add a finding to the database with page content.
Args:
finding: SEOFinding object
database_id: Target database ID (defaults to OurDigital SEO Audit Log)
Returns:
Page ID of created entry
"""
db_id = database_id or DEFAULT_DATABASE_ID
# Database properties (metadata)
properties = {
"Issue": {"title": [{"text": {"content": finding.issue}}]},
"Category": {"select": {"name": finding.category}},
"Priority": {"select": {"name": finding.priority}},
"Found Date": {"date": {"start": datetime.now().strftime("%Y-%m-%d")}},
}
if finding.url:
properties["URL"] = {"url": finding.url}
if finding.site:
properties["Site"] = {"url": finding.site}
if finding.audit_id:
properties["Audit ID"] = {
"rich_text": [{"text": {"content": finding.audit_id}}]
}
# Page content blocks (Description, Impact, Recommendation)
children = []
if finding.description:
children.extend([
{
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Description"}}]
}
},
{
"object": "block",
"type": "paragraph",
"paragraph": {
"rich_text": [{"type": "text", "text": {"content": finding.description}}]
}
}
])
if finding.impact:
children.extend([
{
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Impact"}}]
}
},
{
"object": "block",
"type": "callout",
"callout": {
"rich_text": [{"type": "text", "text": {"content": finding.impact}}],
"icon": {"type": "emoji", "emoji": "⚠️"}
}
}
])
if finding.recommendation:
children.extend([
{
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Recommendation"}}]
}
},
{
"object": "block",
"type": "callout",
"callout": {
"rich_text": [{"type": "text", "text": {"content": finding.recommendation}}],
"icon": {"type": "emoji", "emoji": "💡"}
}
}
])
try:
response = self.client.pages.create(
parent={"database_id": db_id},
properties=properties,
children=children if children else None,
)
return response["id"]
except Exception as e:
logger.error(f"Failed to add finding: {e}")
raise
def add_findings_batch(
self,
findings: list[SEOFinding],
database_id: str | None = None,
) -> list[str]:
"""
Add multiple findings to the database.
Args:
findings: List of SEOFinding objects
database_id: Target database ID (defaults to OurDigital SEO Audit Log)
Returns:
List of created page IDs
"""
page_ids = []
for finding in findings:
try:
page_id = self.add_finding(finding, database_id)
page_ids.append(page_id)
except Exception as e:
logger.error(f"Failed to add finding '{finding.issue}': {e}")
return page_ids
def create_audit_summary_page(
self,
parent_page_id: str,
url: str,
summary: dict,
) -> str:
"""
Create a summary page for the audit.
Args:
parent_page_id: Parent page ID
url: Audited URL
summary: Audit summary data
Returns:
Page ID
"""
# Build page content
children = [
{
"object": "block",
"type": "heading_1",
"heading_1": {
"rich_text": [{"type": "text", "text": {"content": f"SEO Audit: {url}"}}]
},
},
{
"object": "block",
"type": "paragraph",
"paragraph": {
"rich_text": [
{
"type": "text",
"text": {"content": f"Audit Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}"},
}
]
},
},
{
"object": "block",
"type": "divider",
"divider": {},
},
{
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Summary"}}]
},
},
]
# Add summary statistics
if "stats" in summary:
stats = summary["stats"]
stats_text = "\n".join([f"{k}: {v}" for k, v in stats.items()])
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {
"rich_text": [{"type": "text", "text": {"content": stats_text}}]
},
})
# Add findings by priority
if "findings_by_priority" in summary:
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Findings by Priority"}}]
},
})
for priority, count in summary["findings_by_priority"].items():
children.append({
"object": "block",
"type": "bulleted_list_item",
"bulleted_list_item": {
"rich_text": [{"type": "text", "text": {"content": f"{priority}: {count}"}}]
},
})
try:
response = self.client.pages.create(
parent={"page_id": parent_page_id},
properties={
"title": {"title": [{"text": {"content": f"SEO Audit - {url}"}}]}
},
children=children,
)
return response["id"]
except Exception as e:
logger.error(f"Failed to create summary page: {e}")
raise
def query_findings(
self,
database_id: str,
category: str | None = None,
priority: str | None = None,
status: str | None = None,
) -> list[dict]:
"""
Query findings from database.
Args:
database_id: Database ID
category: Filter by category
priority: Filter by priority
status: Filter by status
Returns:
List of finding records
"""
filters = []
if category:
filters.append({
"property": "Category",
"select": {"equals": category},
})
if priority:
filters.append({
"property": "Priority",
"select": {"equals": priority},
})
if status:
filters.append({
"property": "Status",
"status": {"equals": status},
})
query_params = {"database_id": database_id}
if filters:
if len(filters) == 1:
query_params["filter"] = filters[0]
else:
query_params["filter"] = {"and": filters}
try:
response = self.client.databases.query(**query_params)
return response.get("results", [])
except Exception as e:
logger.error(f"Failed to query findings: {e}")
raise
def update_finding_status(
self,
page_id: str,
status: str,
) -> None:
"""Update the status of a finding."""
if status not in self.STATUSES:
raise ValueError(f"Invalid status: {status}")
try:
self.client.pages.update(
page_id=page_id,
properties={"Status": {"status": {"name": status}}},
)
logger.info(f"Updated finding {page_id} to {status}")
except Exception as e:
logger.error(f"Failed to update status: {e}")
raise
def create_audit_report(
self,
report: "AuditReport",
database_id: str | None = None,
) -> dict:
"""
Create a comprehensive audit report page with checklist table.
This creates:
1. Individual finding pages in the database
2. A summary page with all findings in table format for checklist tracking
Args:
report: AuditReport object with all findings
database_id: Target database ID (defaults to OurDigital SEO Audit Log)
Returns:
Dict with summary_page_id and finding_page_ids
"""
db_id = database_id or DEFAULT_DATABASE_ID
# Generate full audit ID
site_domain = report.site.replace('https://', '').replace('http://', '').split('/')[0]
full_audit_id = f"{site_domain}-{report.audit_id}"
result = {
"audit_id": full_audit_id,
"site": report.site,
"summary_page_id": None,
"finding_page_ids": [],
}
# 1. Create individual finding pages in database
logger.info(f"Creating {len(report.findings)} finding pages...")
for finding in report.findings:
finding.audit_id = full_audit_id
finding.site = report.site
try:
page_id = self.add_finding(finding, db_id)
result["finding_page_ids"].append(page_id)
except Exception as e:
logger.error(f"Failed to add finding '{finding.issue}': {e}")
# 2. Create summary page with checklist table
logger.info("Creating audit summary page with checklist...")
summary_page_id = self._create_audit_summary_with_table(report, full_audit_id, db_id)
result["summary_page_id"] = summary_page_id
logger.info(f"Audit report created: {full_audit_id}")
return result
def _create_audit_summary_with_table(
self,
report: "AuditReport",
audit_id: str,
database_id: str,
) -> str:
"""
Create audit summary page with checklist table format.
Args:
report: AuditReport object
audit_id: Full audit ID
database_id: Parent database ID
Returns:
Summary page ID
"""
site_domain = report.site.replace('https://', '').replace('http://', '').split('/')[0]
# Build page content blocks
children = []
# Header with audit info
children.append({
"object": "block",
"type": "callout",
"callout": {
"rich_text": [
{"type": "text", "text": {"content": f"Audit ID: {audit_id}\n"}},
{"type": "text", "text": {"content": f"Date: {report.audit_date.strftime('%Y-%m-%d %H:%M')}\n"}},
{"type": "text", "text": {"content": f"Total Issues: {report.total_issues}"}},
],
"icon": {"type": "emoji", "emoji": "📋"},
"color": "blue_background",
}
})
# Audit Status Summary
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Audit Status"}}]
}
})
# Status table
status_table = {
"object": "block",
"type": "table",
"table": {
"table_width": 2,
"has_column_header": True,
"has_row_header": False,
"children": [
{
"type": "table_row",
"table_row": {
"cells": [
[{"type": "text", "text": {"content": "Check"}}],
[{"type": "text", "text": {"content": "Status"}}],
]
}
},
{
"type": "table_row",
"table_row": {
"cells": [
[{"type": "text", "text": {"content": "Robots.txt"}}],
[{"type": "text", "text": {"content": report.robots_txt_status}}],
]
}
},
{
"type": "table_row",
"table_row": {
"cells": [
[{"type": "text", "text": {"content": "Sitemap"}}],
[{"type": "text", "text": {"content": report.sitemap_status}}],
]
}
},
{
"type": "table_row",
"table_row": {
"cells": [
[{"type": "text", "text": {"content": "Schema Markup"}}],
[{"type": "text", "text": {"content": report.schema_status}}],
]
}
},
{
"type": "table_row",
"table_row": {
"cells": [
[{"type": "text", "text": {"content": "Performance"}}],
[{"type": "text", "text": {"content": report.performance_status}}],
]
}
},
]
}
}
children.append(status_table)
# Divider
children.append({"object": "block", "type": "divider", "divider": {}})
# Findings Checklist Header
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Findings Checklist"}}]
}
})
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {
"rich_text": [{"type": "text", "text": {"content": "Use this checklist to track fixes. Check off items as you complete them."}}]
}
})
# Create findings table with checklist format
if report.findings:
# Build table rows - Header row
table_rows = [
{
"type": "table_row",
"table_row": {
"cells": [
[{"type": "text", "text": {"content": "#"}, "annotations": {"bold": True}}],
[{"type": "text", "text": {"content": "Priority"}, "annotations": {"bold": True}}],
[{"type": "text", "text": {"content": "Category"}, "annotations": {"bold": True}}],
[{"type": "text", "text": {"content": "Issue"}, "annotations": {"bold": True}}],
[{"type": "text", "text": {"content": "URL"}, "annotations": {"bold": True}}],
]
}
}
]
# Add finding rows
for idx, finding in enumerate(report.findings, 1):
# Truncate long text for table cells
issue_text = finding.issue[:50] + "..." if len(finding.issue) > 50 else finding.issue
url_text = finding.url[:40] + "..." if finding.url and len(finding.url) > 40 else (finding.url or "-")
table_rows.append({
"type": "table_row",
"table_row": {
"cells": [
[{"type": "text", "text": {"content": str(idx)}}],
[{"type": "text", "text": {"content": finding.priority}}],
[{"type": "text", "text": {"content": finding.category}}],
[{"type": "text", "text": {"content": issue_text}}],
[{"type": "text", "text": {"content": url_text}}],
]
}
})
findings_table = {
"object": "block",
"type": "table",
"table": {
"table_width": 5,
"has_column_header": True,
"has_row_header": False,
"children": table_rows
}
}
children.append(findings_table)
# Divider
children.append({"object": "block", "type": "divider", "divider": {}})
# Detailed Findings with To-Do checkboxes
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {
"rich_text": [{"type": "text", "text": {"content": "Detailed Findings & Actions"}}]
}
})
# Group findings by priority and add as to-do items
for priority in ["Critical", "High", "Medium", "Low"]:
priority_findings = [f for f in report.findings if f.priority == priority]
if not priority_findings:
continue
# Priority header with emoji
priority_emoji = {"Critical": "🔴", "High": "🟠", "Medium": "🟡", "Low": "⚪"}
children.append({
"object": "block",
"type": "heading_3",
"heading_3": {
"rich_text": [{"type": "text", "text": {"content": f"{priority_emoji.get(priority, '')} {priority} Priority ({len(priority_findings)})"}}]
}
})
# Add each finding as a to-do item with details
for finding in priority_findings:
# Main to-do item
children.append({
"object": "block",
"type": "to_do",
"to_do": {
"rich_text": [
{"type": "text", "text": {"content": f"[{finding.category}] "}, "annotations": {"bold": True}},
{"type": "text", "text": {"content": finding.issue}},
],
"checked": False,
}
})
# URL if available
if finding.url:
children.append({
"object": "block",
"type": "bulleted_list_item",
"bulleted_list_item": {
"rich_text": [
{"type": "text", "text": {"content": "URL: "}},
{"type": "text", "text": {"content": finding.url, "link": {"url": finding.url}}},
]
}
})
# Affected URLs list if available
if finding.affected_urls:
children.append({
"object": "block",
"type": "toggle",
"toggle": {
"rich_text": [{"type": "text", "text": {"content": f"Affected URLs ({len(finding.affected_urls)})"}}],
"children": [
{
"object": "block",
"type": "bulleted_list_item",
"bulleted_list_item": {
"rich_text": [{"type": "text", "text": {"content": url, "link": {"url": url} if url.startswith("http") else None}}]
}
}
for url in finding.affected_urls[:20] # Limit to 20 URLs
] + ([{
"object": "block",
"type": "paragraph",
"paragraph": {
"rich_text": [{"type": "text", "text": {"content": f"... and {len(finding.affected_urls) - 20} more URLs"}}]
}
}] if len(finding.affected_urls) > 20 else [])
}
})
# Recommendation as sub-item
if finding.recommendation:
children.append({
"object": "block",
"type": "bulleted_list_item",
"bulleted_list_item": {
"rich_text": [
{"type": "text", "text": {"content": "💡 "}, "annotations": {"bold": True}},
{"type": "text", "text": {"content": finding.recommendation}},
]
}
})
# Create the summary page
try:
response = self.client.pages.create(
parent={"database_id": database_id},
properties={
"Issue": {"title": [{"text": {"content": f"📊 Audit Report: {site_domain}"}}]},
"Category": {"select": {"name": "Technical SEO"}},
"Priority": {"select": {"name": "High"}},
"Site": {"url": report.site},
"Audit ID": {"rich_text": [{"text": {"content": audit_id}}]},
"Found Date": {"date": {"start": report.audit_date.strftime("%Y-%m-%d")}},
},
children=children,
)
logger.info(f"Created audit summary page: {response['id']}")
return response["id"]
except Exception as e:
logger.error(f"Failed to create audit summary page: {e}")
raise
def create_quick_audit_report(
self,
site: str,
findings: list[SEOFinding],
robots_status: str = "Not checked",
sitemap_status: str = "Not checked",
schema_status: str = "Not checked",
performance_status: str = "Not checked",
database_id: str | None = None,
) -> dict:
"""
Quick method to create audit report from a list of findings.
Args:
site: Site URL
findings: List of SEOFinding objects
robots_status: Robots.txt check result
sitemap_status: Sitemap check result
schema_status: Schema check result
performance_status: Performance check result
database_id: Target database ID
Returns:
Dict with audit results
"""
report = AuditReport(site=site)
report.robots_txt_status = robots_status
report.sitemap_status = sitemap_status
report.schema_status = schema_status
report.performance_status = performance_status
for finding in findings:
report.add_finding(finding)
return self.create_audit_report(report, database_id)
def main():
"""CLI entry point for testing."""
import argparse
parser = argparse.ArgumentParser(description="Notion SEO Reporter")
parser.add_argument("--action", "-a", required=True,
choices=["create-db", "add-finding", "query"],
help="Action to perform")
parser.add_argument("--parent-id", "-p", help="Parent page ID")
parser.add_argument("--database-id", "-d", help="Database ID")
parser.add_argument("--title", "-t", default="SEO Audit Findings",
help="Database title")
args = parser.parse_args()
reporter = NotionReporter()
if args.action == "create-db":
if not args.parent_id:
parser.error("--parent-id required for create-db")
db_id = reporter.create_findings_database(args.parent_id, args.title)
print(f"Created database: {db_id}")
elif args.action == "add-finding":
if not args.database_id:
parser.error("--database-id required for add-finding")
# Example finding
finding = SEOFinding(
issue="Missing meta description",
category="On-page SEO",
priority="Medium",
url="https://example.com/page",
description="Page is missing meta description tag",
impact="May affect CTR in search results",
recommendation="Add unique meta description under 160 characters",
)
page_id = reporter.add_finding(finding, args.database_id)
print(f"Created finding: {page_id}")
elif args.action == "query":
if not args.database_id:
parser.error("--database-id required for query")
findings = reporter.query_findings(args.database_id)
print(f"Found {len(findings)} findings")
for f in findings[:5]:
title = f["properties"]["Issue"]["title"]
if title:
print(f" - {title[0]['plain_text']}")
if __name__ == "__main__":
main()
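Both `AuditReport.add_finding` and `NotionReporter.create_audit_report` derive an audit ID by stripping the scheme and path from the site URL, then appending a timestamp. A standalone sketch of that derivation (the helper name is ours, not part of the module):

```python
from datetime import datetime

def make_audit_id(site: str, stamp: datetime) -> str:
    # Mirrors the inline logic in AuditReport.add_finding /
    # create_audit_report: drop the scheme, keep only the host,
    # then append a %Y%m%d-%H%M%S timestamp suffix.
    domain = site.replace("https://", "").replace("http://", "").split("/")[0]
    return f"{domain}-{stamp.strftime('%Y%m%d-%H%M%S')}"

print(make_audit_id("https://example.com/blog/post", datetime(2025, 1, 2, 3, 4, 5)))
```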


@@ -0,0 +1,569 @@
"""
Page Analyzer - Extract SEO metadata from web pages
===================================================
Purpose: Comprehensive page-level SEO data extraction
Python: 3.10+
Usage:
from page_analyzer import PageAnalyzer, PageMetadata
analyzer = PageAnalyzer()
metadata = analyzer.analyze_url("https://example.com/page")
"""
import json
import logging
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
@dataclass
class LinkData:
"""Represents a link found on a page."""
url: str
anchor_text: str
is_internal: bool
is_nofollow: bool = False
link_type: str = "body" # body, nav, footer, etc.
@dataclass
class HeadingData:
"""Represents a heading found on a page."""
level: int # 1-6
text: str
@dataclass
class SchemaData:
"""Represents schema.org structured data."""
schema_type: str
properties: dict
format: str = "json-ld" # json-ld, microdata, rdfa
@dataclass
class OpenGraphData:
"""Represents Open Graph metadata."""
og_title: str | None = None
og_description: str | None = None
og_image: str | None = None
og_url: str | None = None
og_type: str | None = None
og_site_name: str | None = None
og_locale: str | None = None
twitter_card: str | None = None
twitter_title: str | None = None
twitter_description: str | None = None
twitter_image: str | None = None
@dataclass
class PageMetadata:
"""Complete SEO metadata for a page."""
# Basic info
url: str
status_code: int = 0
content_type: str = ""
response_time_ms: float = 0
analyzed_at: datetime = field(default_factory=datetime.now)
# Meta tags
title: str | None = None
title_length: int = 0
meta_description: str | None = None
meta_description_length: int = 0
canonical_url: str | None = None
robots_meta: str | None = None
# Language
html_lang: str | None = None
hreflang_tags: list[dict] = field(default_factory=list) # [{"lang": "en", "url": "..."}]
# Headings
headings: list[HeadingData] = field(default_factory=list)
h1_count: int = 0
h1_text: str | None = None
# Open Graph & Social
open_graph: OpenGraphData = field(default_factory=OpenGraphData)
# Schema/Structured Data
schema_data: list[SchemaData] = field(default_factory=list)
schema_types_found: list[str] = field(default_factory=list)
# Links
internal_links: list[LinkData] = field(default_factory=list)
external_links: list[LinkData] = field(default_factory=list)
internal_link_count: int = 0
external_link_count: int = 0
# Images
images_total: int = 0
images_without_alt: int = 0
images_with_alt: int = 0
# Content metrics
word_count: int = 0
# Issues found
issues: list[str] = field(default_factory=list)
warnings: list[str] = field(default_factory=list)
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"url": self.url,
"status_code": self.status_code,
"content_type": self.content_type,
"response_time_ms": self.response_time_ms,
"analyzed_at": self.analyzed_at.isoformat(),
"title": self.title,
"title_length": self.title_length,
"meta_description": self.meta_description,
"meta_description_length": self.meta_description_length,
"canonical_url": self.canonical_url,
"robots_meta": self.robots_meta,
"html_lang": self.html_lang,
"hreflang_tags": self.hreflang_tags,
"h1_count": self.h1_count,
"h1_text": self.h1_text,
"headings_count": len(self.headings),
"schema_types_found": self.schema_types_found,
"internal_link_count": self.internal_link_count,
"external_link_count": self.external_link_count,
"images_total": self.images_total,
"images_without_alt": self.images_without_alt,
"word_count": self.word_count,
"issues": self.issues,
"warnings": self.warnings,
"open_graph": {
"og_title": self.open_graph.og_title,
"og_description": self.open_graph.og_description,
"og_image": self.open_graph.og_image,
"og_url": self.open_graph.og_url,
"og_type": self.open_graph.og_type,
},
}
def get_summary(self) -> str:
"""Get a brief summary of the page analysis."""
lines = [
f"URL: {self.url}",
f"Status: {self.status_code}",
f"Title: {self.title[:50] + '...' if self.title and len(self.title) > 50 else self.title}",
f"Description: {'✓' if self.meta_description else '✗ Missing'}",
f"Canonical: {'✓' if self.canonical_url else '✗ Missing'}",
f"H1: {self.h1_count} found",
f"Schema: {', '.join(self.schema_types_found) if self.schema_types_found else 'None'}",
f"Links: {self.internal_link_count} internal, {self.external_link_count} external",
f"Images: {self.images_total} total, {self.images_without_alt} without alt",
]
if self.issues:
lines.append(f"Issues: {len(self.issues)}")
return "\n".join(lines)
class PageAnalyzer:
"""Analyze web pages for SEO metadata."""
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; OurDigitalSEOBot/1.0; +https://ourdigital.org)"
def __init__(
self,
user_agent: str | None = None,
timeout: int = 30,
):
"""
Initialize page analyzer.
Args:
user_agent: Custom user agent string
timeout: Request timeout in seconds
"""
self.user_agent = user_agent or self.DEFAULT_USER_AGENT
self.timeout = timeout
self.session = requests.Session()
self.session.headers.update({
"User-Agent": self.user_agent,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9,ko;q=0.8",
})
def analyze_url(self, url: str) -> PageMetadata:
"""
Analyze a URL and extract SEO metadata.
Args:
url: URL to analyze
Returns:
PageMetadata object with all extracted data
"""
metadata = PageMetadata(url=url)
try:
# Fetch page
start_time = datetime.now()
response = self.session.get(url, timeout=self.timeout, allow_redirects=True)
metadata.response_time_ms = (datetime.now() - start_time).total_seconds() * 1000
metadata.status_code = response.status_code
metadata.content_type = response.headers.get("Content-Type", "")
if response.status_code != 200:
metadata.issues.append(f"HTTP {response.status_code} status")
if response.status_code >= 400:
return metadata
# Parse HTML
soup = BeautifulSoup(response.text, "html.parser")
base_url = url
# Extract all metadata
self._extract_basic_meta(soup, metadata)
self._extract_canonical(soup, metadata, base_url)
self._extract_robots_meta(soup, metadata)
self._extract_hreflang(soup, metadata)
self._extract_headings(soup, metadata)
self._extract_open_graph(soup, metadata)
self._extract_schema(soup, metadata)
self._extract_links(soup, metadata, base_url)
self._extract_images(soup, metadata)
self._extract_content_metrics(soup, metadata)
# Run SEO checks
self._run_seo_checks(metadata)
except requests.RequestException as e:
metadata.issues.append(f"Request failed: {str(e)}")
logger.error(f"Failed to analyze {url}: {e}")
except Exception as e:
metadata.issues.append(f"Analysis error: {str(e)}")
logger.error(f"Error analyzing {url}: {e}")
return metadata
def _extract_basic_meta(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract title and meta description."""
# Title
title_tag = soup.find("title")
if title_tag and title_tag.string:
metadata.title = title_tag.string.strip()
metadata.title_length = len(metadata.title)
# Meta description
desc_tag = soup.find("meta", attrs={"name": re.compile(r"^description$", re.I)})
if desc_tag and desc_tag.get("content"):
metadata.meta_description = desc_tag["content"].strip()
metadata.meta_description_length = len(metadata.meta_description)
# HTML lang
html_tag = soup.find("html")
if html_tag and html_tag.get("lang"):
metadata.html_lang = html_tag["lang"]
def _extract_canonical(self, soup: BeautifulSoup, metadata: PageMetadata, base_url: str) -> None:
"""Extract canonical URL."""
canonical = soup.find("link", rel="canonical")
if canonical and canonical.get("href"):
metadata.canonical_url = urljoin(base_url, canonical["href"])
def _extract_robots_meta(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract robots meta tag."""
robots = soup.find("meta", attrs={"name": re.compile(r"^robots$", re.I)})
if robots and robots.get("content"):
metadata.robots_meta = robots["content"]
# Also check for googlebot-specific
googlebot = soup.find("meta", attrs={"name": re.compile(r"^googlebot$", re.I)})
if googlebot and googlebot.get("content"):
if metadata.robots_meta:
metadata.robots_meta += f" | googlebot: {googlebot['content']}"
else:
metadata.robots_meta = f"googlebot: {googlebot['content']}"
def _extract_hreflang(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract hreflang tags."""
hreflang_tags = soup.find_all("link", rel="alternate", hreflang=True)
for tag in hreflang_tags:
if tag.get("href") and tag.get("hreflang"):
metadata.hreflang_tags.append({
"lang": tag["hreflang"],
"url": tag["href"]
})
def _extract_headings(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract all headings."""
for level in range(1, 7):
for heading in soup.find_all(f"h{level}"):
text = heading.get_text(strip=True)
if text:
metadata.headings.append(HeadingData(level=level, text=text))
# Count H1s specifically
h1_tags = soup.find_all("h1")
metadata.h1_count = len(h1_tags)
if h1_tags:
metadata.h1_text = h1_tags[0].get_text(strip=True)
def _extract_open_graph(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract Open Graph and Twitter Card data."""
og = metadata.open_graph
# Open Graph tags
og_mappings = {
"og:title": "og_title",
"og:description": "og_description",
"og:image": "og_image",
"og:url": "og_url",
"og:type": "og_type",
"og:site_name": "og_site_name",
"og:locale": "og_locale",
}
for og_prop, attr_name in og_mappings.items():
tag = soup.find("meta", property=og_prop)
if tag and tag.get("content"):
setattr(og, attr_name, tag["content"])
# Twitter Card tags
twitter_mappings = {
"twitter:card": "twitter_card",
"twitter:title": "twitter_title",
"twitter:description": "twitter_description",
"twitter:image": "twitter_image",
}
for tw_name, attr_name in twitter_mappings.items():
tag = soup.find("meta", attrs={"name": tw_name})
if tag and tag.get("content"):
setattr(og, attr_name, tag["content"])
def _extract_schema(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract schema.org structured data."""
# JSON-LD
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
if isinstance(data, list):
for item in data:
self._process_schema_item(item, metadata, "json-ld")
else:
self._process_schema_item(data, metadata, "json-ld")
except (json.JSONDecodeError, TypeError):
continue
# Microdata (basic detection)
for item in soup.find_all(itemscope=True):
itemtype = item.get("itemtype", "")
if itemtype:
schema_type = itemtype.split("/")[-1]
if schema_type not in metadata.schema_types_found:
metadata.schema_types_found.append(schema_type)
metadata.schema_data.append(SchemaData(
schema_type=schema_type,
properties={},
format="microdata"
))
def _process_schema_item(self, data: dict, metadata: PageMetadata, format_type: str) -> None:
"""Process a single schema.org item."""
if not isinstance(data, dict):
return
schema_type = data.get("@type", "Unknown")
if isinstance(schema_type, list):
schema_type = schema_type[0] if schema_type else "Unknown"
if schema_type not in metadata.schema_types_found:
metadata.schema_types_found.append(schema_type)
metadata.schema_data.append(SchemaData(
schema_type=schema_type,
properties=data,
format=format_type
))
# Process nested @graph items
if "@graph" in data:
for item in data["@graph"]:
self._process_schema_item(item, metadata, format_type)
def _extract_links(self, soup: BeautifulSoup, metadata: PageMetadata, base_url: str) -> None:
"""Extract internal and external links."""
parsed_base = urlparse(base_url)
base_domain = parsed_base.netloc.lower()
for a_tag in soup.find_all("a", href=True):
href = a_tag["href"]
# Skip non-http links
if href.startswith(("#", "javascript:", "mailto:", "tel:")):
continue
# Resolve relative URLs
full_url = urljoin(base_url, href)
parsed_url = urlparse(full_url)
# Get anchor text
anchor_text = a_tag.get_text(strip=True)[:100] # Limit length
# Check if nofollow
rel = a_tag.get("rel", [])
if isinstance(rel, str):
rel = rel.split()
is_nofollow = "nofollow" in rel
# Determine if internal or external
link_domain = parsed_url.netloc.lower()
is_internal = (
link_domain == base_domain or
link_domain.endswith(f".{base_domain}") or
base_domain.endswith(f".{link_domain}")
)
link_data = LinkData(
url=full_url,
anchor_text=anchor_text,
is_internal=is_internal,
is_nofollow=is_nofollow,
)
if is_internal:
metadata.internal_links.append(link_data)
else:
metadata.external_links.append(link_data)
metadata.internal_link_count = len(metadata.internal_links)
metadata.external_link_count = len(metadata.external_links)
def _extract_images(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract image information."""
images = soup.find_all("img")
metadata.images_total = len(images)
for img in images:
alt = img.get("alt", "").strip()
if alt:
metadata.images_with_alt += 1
else:
metadata.images_without_alt += 1
def _extract_content_metrics(self, soup: BeautifulSoup, metadata: PageMetadata) -> None:
"""Extract content metrics like word count."""
# Remove script and style elements
for element in soup(["script", "style", "noscript"]):
element.decompose()
# Get text content
text = soup.get_text(separator=" ", strip=True)
words = text.split()
metadata.word_count = len(words)
def _run_seo_checks(self, metadata: PageMetadata) -> None:
"""Run SEO checks and add issues/warnings."""
# Title checks
if not metadata.title:
metadata.issues.append("Missing title tag")
elif metadata.title_length < 30:
metadata.warnings.append(f"Title too short ({metadata.title_length} chars, recommend 50-60)")
elif metadata.title_length > 60:
metadata.warnings.append(f"Title too long ({metadata.title_length} chars, recommend 50-60)")
# Meta description checks
if not metadata.meta_description:
metadata.issues.append("Missing meta description")
elif metadata.meta_description_length < 120:
metadata.warnings.append(f"Meta description too short ({metadata.meta_description_length} chars)")
elif metadata.meta_description_length > 160:
metadata.warnings.append(f"Meta description too long ({metadata.meta_description_length} chars)")
# Canonical check
if not metadata.canonical_url:
metadata.warnings.append("Missing canonical tag")
elif metadata.canonical_url != metadata.url:
metadata.warnings.append(f"Canonical points to different URL: {metadata.canonical_url}")
# H1 checks
if metadata.h1_count == 0:
metadata.issues.append("Missing H1 tag")
elif metadata.h1_count > 1:
metadata.warnings.append(f"Multiple H1 tags ({metadata.h1_count})")
# Image alt check
if metadata.images_without_alt > 0:
metadata.warnings.append(f"{metadata.images_without_alt} images missing alt text")
# Schema check
if not metadata.schema_types_found:
metadata.warnings.append("No structured data found")
# Open Graph check
if not metadata.open_graph.og_title:
metadata.warnings.append("Missing Open Graph tags")
# Robots meta check
if metadata.robots_meta:
robots_lower = metadata.robots_meta.lower()
if "noindex" in robots_lower:
metadata.issues.append("Page is set to noindex")
if "nofollow" in robots_lower:
metadata.warnings.append("Page is set to nofollow")
def main():
"""CLI entry point for testing."""
import argparse
parser = argparse.ArgumentParser(description="Page SEO Analyzer")
parser.add_argument("url", help="URL to analyze")
parser.add_argument("--json", "-j", action="store_true", help="Output as JSON")
args = parser.parse_args()
analyzer = PageAnalyzer()
metadata = analyzer.analyze_url(args.url)
if args.json:
print(json.dumps(metadata.to_dict(), indent=2, ensure_ascii=False))
else:
print("=" * 60)
print("PAGE ANALYSIS REPORT")
print("=" * 60)
print(metadata.get_summary())
print()
if metadata.issues:
print("ISSUES:")
for issue in metadata.issues:
print(f" - {issue}")
if metadata.warnings:
print("\nWARNINGS:")
for warning in metadata.warnings:
print(f" - {warning}")
if metadata.hreflang_tags:
print(f"\nHREFLANG TAGS ({len(metadata.hreflang_tags)}):")
for tag in metadata.hreflang_tags[:5]:
print(f" {tag['lang']}: {tag['url']}")
if metadata.schema_types_found:
print(f"\nSCHEMA TYPES:")
for schema_type in metadata.schema_types_found:
print(f" - {schema_type}")
if __name__ == "__main__":
main()
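
The internal/external split in `_extract_links` comes down to a domain-suffix comparison after resolving relative URLs. A minimal standalone sketch of just that check (the function name is illustrative, not part of the module):

```python
from urllib.parse import urljoin, urlparse

def classify_link(base_url: str, href: str) -> tuple[str, bool]:
    """Resolve href against base_url and report (full_url, is_internal)."""
    full_url = urljoin(base_url, href)
    base_domain = urlparse(base_url).netloc.lower()
    link_domain = urlparse(full_url).netloc.lower()
    is_internal = (
        link_domain == base_domain
        or link_domain.endswith(f".{base_domain}")  # subdomain of the base
        or base_domain.endswith(f".{link_domain}")  # base is itself a subdomain
    )
    return full_url, is_internal
```

Under this rule `https://blog.example.com/post` counts as internal to `https://example.com/`, which matches how the analyzer buckets subdomain links.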


@@ -0,0 +1,452 @@
"""
PageSpeed Insights Client
=========================
Purpose: Get Core Web Vitals and performance data from PageSpeed Insights API
Python: 3.10+
Usage:
from pagespeed_client import PageSpeedClient
client = PageSpeedClient()
result = client.analyze("https://example.com")
"""
import argparse
import json
import logging
from dataclasses import dataclass, field
from typing import Any
import requests
from base_client import config
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
@dataclass
class CoreWebVitals:
"""Core Web Vitals metrics."""
lcp: float | None = None # Largest Contentful Paint (ms)
fid: float | None = None # First Input Delay (ms)
cls: float | None = None # Cumulative Layout Shift
inp: float | None = None # Interaction to Next Paint (ms)
ttfb: float | None = None # Time to First Byte (ms)
fcp: float | None = None # First Contentful Paint (ms)
# Assessment (GOOD, NEEDS_IMPROVEMENT, POOR)
lcp_rating: str | None = None
fid_rating: str | None = None
cls_rating: str | None = None
inp_rating: str | None = None
def to_dict(self) -> dict:
return {
"lcp": {"value": self.lcp, "rating": self.lcp_rating},
"fid": {"value": self.fid, "rating": self.fid_rating},
"cls": {"value": self.cls, "rating": self.cls_rating},
"inp": {"value": self.inp, "rating": self.inp_rating},
"ttfb": {"value": self.ttfb},
"fcp": {"value": self.fcp},
}
@dataclass
class PageSpeedResult:
"""PageSpeed analysis result."""
url: str
strategy: str # mobile or desktop
performance_score: float | None = None
seo_score: float | None = None
accessibility_score: float | None = None
best_practices_score: float | None = None
core_web_vitals: CoreWebVitals = field(default_factory=CoreWebVitals)
opportunities: list[dict] = field(default_factory=list)
diagnostics: list[dict] = field(default_factory=list)
passed_audits: list[str] = field(default_factory=list)
raw_data: dict = field(default_factory=dict)
def to_dict(self) -> dict:
return {
"url": self.url,
"strategy": self.strategy,
"scores": {
"performance": self.performance_score,
"seo": self.seo_score,
"accessibility": self.accessibility_score,
"best_practices": self.best_practices_score,
},
"core_web_vitals": self.core_web_vitals.to_dict(),
"opportunities_count": len(self.opportunities),
"opportunities": self.opportunities[:10],
"diagnostics_count": len(self.diagnostics),
"passed_audits_count": len(self.passed_audits),
}
class PageSpeedClient:
"""Client for PageSpeed Insights API."""
BASE_URL = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
# Core Web Vitals thresholds
THRESHOLDS = {
"lcp": {"good": 2500, "poor": 4000},
"fid": {"good": 100, "poor": 300},
"cls": {"good": 0.1, "poor": 0.25},
"inp": {"good": 200, "poor": 500},
"ttfb": {"good": 800, "poor": 1800},
"fcp": {"good": 1800, "poor": 3000},
}
def __init__(self, api_key: str | None = None):
"""
Initialize PageSpeed client.
Args:
api_key: PageSpeed API key (optional but recommended for higher quotas)
"""
self.api_key = api_key or config.pagespeed_api_key
self.session = requests.Session()
def _rate_metric(self, metric: str, value: float | None) -> str | None:
"""Rate a metric against thresholds."""
if value is None:
return None
thresholds = self.THRESHOLDS.get(metric)
if not thresholds:
return None
if value <= thresholds["good"]:
return "GOOD"
elif value <= thresholds["poor"]:
return "NEEDS_IMPROVEMENT"
else:
return "POOR"
def analyze(
self,
url: str,
strategy: str = "mobile",
categories: list[str] | None = None,
) -> PageSpeedResult:
"""
Analyze a URL with PageSpeed Insights.
Args:
url: URL to analyze
strategy: "mobile" or "desktop"
categories: Categories to analyze (performance, seo, accessibility, best-practices)
Returns:
PageSpeedResult with scores and metrics
"""
if categories is None:
categories = ["performance", "seo", "accessibility", "best-practices"]
params = {
"url": url,
"strategy": strategy,
"category": categories,
}
if self.api_key:
params["key"] = self.api_key
try:
response = self.session.get(self.BASE_URL, params=params, timeout=60)
response.raise_for_status()
data = response.json()
except requests.RequestException as e:
logger.error(f"PageSpeed API request failed: {e}")
raise
result = PageSpeedResult(url=url, strategy=strategy, raw_data=data)
# Extract scores
lighthouse = data.get("lighthouseResult", {})
categories_data = lighthouse.get("categories", {})
if "performance" in categories_data:
score = categories_data["performance"].get("score")
# Guard with `is not None`: a score of 0 is valid and must not collapse to None
result.performance_score = score * 100 if score is not None else None
if "seo" in categories_data:
score = categories_data["seo"].get("score")
result.seo_score = score * 100 if score is not None else None
if "accessibility" in categories_data:
score = categories_data["accessibility"].get("score")
result.accessibility_score = score * 100 if score is not None else None
if "best-practices" in categories_data:
score = categories_data["best-practices"].get("score")
result.best_practices_score = score * 100 if score is not None else None
# Extract Core Web Vitals
audits = lighthouse.get("audits", {})
# Lab data
cwv = result.core_web_vitals
if "largest-contentful-paint" in audits:
cwv.lcp = audits["largest-contentful-paint"].get("numericValue")
cwv.lcp_rating = self._rate_metric("lcp", cwv.lcp)
if "total-blocking-time" in audits:
# TBT is proxy for FID in lab data
cwv.fid = audits["total-blocking-time"].get("numericValue")
cwv.fid_rating = self._rate_metric("fid", cwv.fid)
if "cumulative-layout-shift" in audits:
cwv.cls = audits["cumulative-layout-shift"].get("numericValue")
cwv.cls_rating = self._rate_metric("cls", cwv.cls)
# Newer Lighthouse versions use "interaction-to-next-paint"; fall back to the older experimental id
inp_audit = audits.get("interaction-to-next-paint") or audits.get("experimental-interaction-to-next-paint")
if inp_audit:
cwv.inp = inp_audit.get("numericValue")
cwv.inp_rating = self._rate_metric("inp", cwv.inp)
if "server-response-time" in audits:
cwv.ttfb = audits["server-response-time"].get("numericValue")
if "first-contentful-paint" in audits:
cwv.fcp = audits["first-contentful-paint"].get("numericValue")
# Field data (real user data) if available. CrUX categories arrive as
# FAST/AVERAGE/SLOW; map them onto the lab-style ratings used elsewhere.
field_rating = {"FAST": "GOOD", "AVERAGE": "NEEDS_IMPROVEMENT", "SLOW": "POOR"}
loading_exp = data.get("loadingExperience", {})
metrics = loading_exp.get("metrics", {})
if "LARGEST_CONTENTFUL_PAINT_MS" in metrics:
cwv.lcp = metrics["LARGEST_CONTENTFUL_PAINT_MS"].get("percentile")
cat = metrics["LARGEST_CONTENTFUL_PAINT_MS"].get("category")
cwv.lcp_rating = field_rating.get(cat, cat)
if "FIRST_INPUT_DELAY_MS" in metrics:
cwv.fid = metrics["FIRST_INPUT_DELAY_MS"].get("percentile")
cat = metrics["FIRST_INPUT_DELAY_MS"].get("category")
cwv.fid_rating = field_rating.get(cat, cat)
if "CUMULATIVE_LAYOUT_SHIFT_SCORE" in metrics:
# CrUX reports CLS scaled by 100; guard against a missing percentile
cls_percentile = metrics["CUMULATIVE_LAYOUT_SHIFT_SCORE"].get("percentile")
if cls_percentile is not None:
cwv.cls = cls_percentile / 100
cat = metrics["CUMULATIVE_LAYOUT_SHIFT_SCORE"].get("category")
cwv.cls_rating = field_rating.get(cat, cat)
if "INTERACTION_TO_NEXT_PAINT" in metrics:
cwv.inp = metrics["INTERACTION_TO_NEXT_PAINT"].get("percentile")
cat = metrics["INTERACTION_TO_NEXT_PAINT"].get("category")
cwv.inp_rating = field_rating.get(cat, cat)
# Extract opportunities
for audit_id, audit in audits.items():
if audit.get("details", {}).get("type") == "opportunity":
savings = audit.get("details", {}).get("overallSavingsMs", 0)
if savings > 0:
result.opportunities.append({
"id": audit_id,
"title": audit.get("title", ""),
"description": audit.get("description", ""),
"savings_ms": savings,
"score": audit.get("score", 0),
})
# Sort opportunities by savings
result.opportunities.sort(key=lambda x: x["savings_ms"], reverse=True)
# Extract diagnostics
for audit_id, audit in audits.items():
score = audit.get("score")
if score is not None and score < 1 and audit.get("details"):
if audit.get("details", {}).get("type") not in ["opportunity", None]:
result.diagnostics.append({
"id": audit_id,
"title": audit.get("title", ""),
"description": audit.get("description", ""),
"score": score,
})
# Extract passed audits
for audit_id, audit in audits.items():
if audit.get("score") == 1:
result.passed_audits.append(audit.get("title", audit_id))
return result
def analyze_both_strategies(self, url: str) -> dict:
"""Analyze URL for both mobile and desktop."""
mobile = self.analyze(url, strategy="mobile")
desktop = self.analyze(url, strategy="desktop")
return {
"url": url,
"mobile": mobile.to_dict(),
"desktop": desktop.to_dict(),
"comparison": {
"performance_difference": (
(desktop.performance_score or 0) - (mobile.performance_score or 0)
),
"mobile_first_issues": self._identify_mobile_issues(mobile, desktop),
},
}
def _identify_mobile_issues(
self,
mobile: PageSpeedResult,
desktop: PageSpeedResult,
) -> list[str]:
"""Identify issues that affect mobile more than desktop."""
issues = []
if mobile.performance_score and desktop.performance_score:
if desktop.performance_score - mobile.performance_score > 20:
issues.append("Significant performance gap between mobile and desktop")
m_cwv = mobile.core_web_vitals
d_cwv = desktop.core_web_vitals
if m_cwv.lcp and d_cwv.lcp and m_cwv.lcp > d_cwv.lcp * 1.5:
issues.append("LCP significantly slower on mobile")
if m_cwv.cls and d_cwv.cls and m_cwv.cls > d_cwv.cls * 2:
issues.append("Layout shift issues more severe on mobile")
return issues
def get_cwv_summary(self, url: str) -> dict:
"""Get a summary focused on Core Web Vitals."""
result = self.analyze(url, strategy="mobile")
cwv = result.core_web_vitals
return {
"url": url,
"overall_cwv_status": self._overall_cwv_status(cwv),
"metrics": {
"lcp": {
"value": f"{cwv.lcp / 1000:.2f}s" if cwv.lcp else None,
"rating": cwv.lcp_rating,
"threshold": "≤ 2.5s good, > 4.0s poor",
},
"fid": {
"value": f"{cwv.fid:.0f}ms" if cwv.fid else None,
"rating": cwv.fid_rating,
"threshold": "≤ 100ms good, > 300ms poor",
},
"cls": {
"value": f"{cwv.cls:.3f}" if cwv.cls else None,
"rating": cwv.cls_rating,
"threshold": "≤ 0.1 good, > 0.25 poor",
},
"inp": {
"value": f"{cwv.inp:.0f}ms" if cwv.inp else None,
"rating": cwv.inp_rating,
"threshold": "≤ 200ms good, > 500ms poor",
},
},
"top_opportunities": result.opportunities[:5],
}
def _overall_cwv_status(self, cwv: CoreWebVitals) -> str:
"""Determine overall Core Web Vitals status."""
# INP has replaced FID as the responsiveness Core Web Vital; prefer it when present
ratings = [cwv.lcp_rating, cwv.inp_rating or cwv.fid_rating, cwv.cls_rating]
ratings = [r for r in ratings if r]
if not ratings:
return "UNKNOWN"
if any(r == "POOR" for r in ratings):
return "POOR"
if any(r == "NEEDS_IMPROVEMENT" for r in ratings):
return "NEEDS_IMPROVEMENT"
return "GOOD"
def generate_report(self, result: PageSpeedResult) -> str:
"""Generate human-readable performance report."""
lines = [
"=" * 60,
"PageSpeed Insights Report",
"=" * 60,
f"URL: {result.url}",
f"Strategy: {result.strategy}",
"",
"Scores:",
f" Performance: {result.performance_score:.0f}/100" if result.performance_score is not None else " Performance: N/A",
f" SEO: {result.seo_score:.0f}/100" if result.seo_score is not None else " SEO: N/A",
f" Accessibility: {result.accessibility_score:.0f}/100" if result.accessibility_score is not None else " Accessibility: N/A",
f" Best Practices: {result.best_practices_score:.0f}/100" if result.best_practices_score is not None else " Best Practices: N/A",
"",
"Core Web Vitals:",
]
cwv = result.core_web_vitals
def format_metric(name: str, value: Any, rating: str | None, unit: str) -> str:
if value is None:
return f" {name}: N/A"
rating_str = f" ({rating})" if rating else ""
return f" {name}: {value}{unit}{rating_str}"
lines.append(format_metric("LCP", f"{cwv.lcp / 1000:.2f}" if cwv.lcp else None, cwv.lcp_rating, "s"))
lines.append(format_metric("FID/TBT", f"{cwv.fid:.0f}" if cwv.fid else None, cwv.fid_rating, "ms"))
lines.append(format_metric("CLS", f"{cwv.cls:.3f}" if cwv.cls else None, cwv.cls_rating, ""))
lines.append(format_metric("INP", f"{cwv.inp:.0f}" if cwv.inp else None, cwv.inp_rating, "ms"))
lines.append(format_metric("TTFB", f"{cwv.ttfb:.0f}" if cwv.ttfb else None, None, "ms"))
lines.append(format_metric("FCP", f"{cwv.fcp / 1000:.2f}" if cwv.fcp else None, None, "s"))
if result.opportunities:
lines.extend([
"",
f"Top Opportunities ({len(result.opportunities)} total):",
])
for opp in result.opportunities[:5]:
savings = opp["savings_ms"]
lines.append(f" - {opp['title']}: -{savings / 1000:.1f}s potential savings")
lines.extend(["", "=" * 60])
return "\n".join(lines)
def main():
"""CLI entry point."""
parser = argparse.ArgumentParser(description="PageSpeed Insights Client")
parser.add_argument("--url", "-u", required=True, help="URL to analyze")
parser.add_argument("--strategy", "-s", default="mobile",
choices=["mobile", "desktop", "both"],
help="Analysis strategy")
parser.add_argument("--output", "-o", help="Output file for JSON")
parser.add_argument("--json", action="store_true", help="Output as JSON")
parser.add_argument("--cwv-only", action="store_true",
help="Show only Core Web Vitals summary")
args = parser.parse_args()
client = PageSpeedClient()
if args.cwv_only:
summary = client.get_cwv_summary(args.url)
print(json.dumps(summary, indent=2))
elif args.strategy == "both":
result = client.analyze_both_strategies(args.url)
output = json.dumps(result, indent=2)
if args.output:
with open(args.output, "w") as f:
f.write(output)
else:
print(output)
else:
result = client.analyze(args.url, strategy=args.strategy)
if args.json or args.output:
output = json.dumps(result.to_dict(), indent=2)
if args.output:
with open(args.output, "w") as f:
f.write(output)
else:
print(output)
else:
print(client.generate_report(result))
if __name__ == "__main__":
main()


@@ -0,0 +1,40 @@
# OurDigital SEO Audit - Python Dependencies
# Install with: pip install -r requirements.txt
# Google APIs
google-api-python-client>=2.100.0
google-auth>=2.23.0
google-auth-oauthlib>=1.1.0
google-auth-httplib2>=0.1.1
google-analytics-data>=0.18.0
# Notion API
notion-client>=2.0.0
# Web Scraping & Parsing
lxml>=5.1.0
beautifulsoup4>=4.12.0
extruct>=0.16.0
requests>=2.31.0
aiohttp>=3.9.0
# Schema Validation
jsonschema>=4.21.0
rdflib>=7.0.0
# Google Trends
pytrends>=4.9.2
# Data Processing
pandas>=2.1.0
# Async & Retry
tenacity>=8.2.0
tqdm>=4.66.0
# Environment
python-dotenv>=1.0.0
# Logging & CLI
rich>=13.7.0
typer>=0.9.0


@@ -0,0 +1,540 @@
"""
Robots.txt Checker - Analyze robots.txt configuration
=====================================================
Purpose: Parse and analyze robots.txt for SEO compliance
Python: 3.10+
Usage:
python robots_checker.py --url https://example.com/robots.txt
python robots_checker.py --url https://example.com --test-url /admin/
"""
import argparse
import json
import logging
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
import requests
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
@dataclass
class RobotsIssue:
"""Represents a robots.txt issue."""
severity: str # "error", "warning", "info"
message: str
line_number: int | None = None
directive: str | None = None
suggestion: str | None = None
@dataclass
class UserAgentRules:
"""Rules for a specific user-agent."""
user_agent: str
disallow: list[str] = field(default_factory=list)
allow: list[str] = field(default_factory=list)
crawl_delay: float | None = None
@dataclass
class RobotsResult:
"""Complete robots.txt analysis result."""
url: str
accessible: bool = True
content: str = ""
rules: list[UserAgentRules] = field(default_factory=list)
sitemaps: list[str] = field(default_factory=list)
issues: list[RobotsIssue] = field(default_factory=list)
stats: dict = field(default_factory=dict)
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
def to_dict(self) -> dict:
"""Convert to dictionary for JSON output."""
return {
"url": self.url,
"accessible": self.accessible,
"sitemaps": self.sitemaps,
"rules": [
{
"user_agent": r.user_agent,
"disallow": r.disallow,
"allow": r.allow,
"crawl_delay": r.crawl_delay,
}
for r in self.rules
],
"issues": [
{
"severity": i.severity,
"message": i.message,
"line_number": i.line_number,
"directive": i.directive,
"suggestion": i.suggestion,
}
for i in self.issues
],
"stats": self.stats,
"timestamp": self.timestamp,
}
class RobotsChecker:
"""Analyze robots.txt configuration."""
# Common user agents
USER_AGENTS = {
"*": "All bots",
"Googlebot": "Google crawler",
"Googlebot-Image": "Google Image crawler",
"Googlebot-News": "Google News crawler",
"Googlebot-Video": "Google Video crawler",
"Bingbot": "Bing crawler",
"Slurp": "Yahoo crawler",
"DuckDuckBot": "DuckDuckGo crawler",
"Baiduspider": "Baidu crawler",
"Yandex": "Yandex crawler",
"facebot": "Facebook crawler",
"Twitterbot": "Twitter crawler",
"LinkedInBot": "LinkedIn crawler",
}
# Paths that should generally not be blocked
IMPORTANT_PATHS = [
"/",
"/*.css",
"/*.js",
"/*.jpg",
"/*.jpeg",
"/*.png",
"/*.gif",
"/*.svg",
"/*.webp",
]
# Paths commonly blocked
COMMON_BLOCKED = [
"/admin",
"/wp-admin",
"/login",
"/private",
"/api",
"/cgi-bin",
"/tmp",
"/search",
]
def __init__(self):
self.session = requests.Session()
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (compatible; SEOAuditBot/1.0)"
})
def fetch_robots(self, url: str) -> str | None:
"""Fetch robots.txt content."""
# Ensure we're fetching robots.txt
parsed = urlparse(url)
if not parsed.path.endswith("robots.txt"):
robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
else:
robots_url = url
try:
response = self.session.get(robots_url, timeout=10)
if response.status_code == 200:
return response.text
elif response.status_code == 404:
return None
else:
raise RuntimeError(f"HTTP {response.status_code}")
except requests.RequestException as e:
raise RuntimeError(f"Failed to fetch robots.txt: {e}") from e
def parse_robots(self, content: str) -> tuple[list[UserAgentRules], list[str]]:
"""Parse robots.txt content."""
rules = []
sitemaps = []
current_rules = None
for line_num, line in enumerate(content.split("\n"), 1):
line = line.strip()
# Skip empty lines and comments
if not line or line.startswith("#"):
continue
# Parse directive
if ":" not in line:
continue
directive, value = line.split(":", 1)
directive = directive.strip().lower()
value = value.strip()
if directive == "user-agent":
# Save the previous user-agent's rule group before starting a new one
if current_rules:
rules.append(current_rules)
current_rules = UserAgentRules(user_agent=value)
elif directive == "disallow" and current_rules:
if value: # Empty disallow means allow all
current_rules.disallow.append(value)
elif directive == "allow" and current_rules:
if value:
current_rules.allow.append(value)
elif directive == "crawl-delay" and current_rules:
try:
current_rules.crawl_delay = float(value)
except ValueError:
pass
elif directive == "sitemap":
if value:
sitemaps.append(value)
# Don't forget last user-agent
if current_rules:
rules.append(current_rules)
return rules, sitemaps
def analyze(self, url: str) -> RobotsResult:
"""Analyze robots.txt."""
result = RobotsResult(url=url)
# Fetch robots.txt
try:
content = self.fetch_robots(url)
if content is None:
result.accessible = False
result.issues.append(RobotsIssue(
severity="info",
message="No robots.txt found (returns 404)",
suggestion="Consider creating a robots.txt file",
))
return result
except RuntimeError as e:
result.accessible = False
result.issues.append(RobotsIssue(
severity="error",
message=str(e),
))
return result
result.content = content
result.rules, result.sitemaps = self.parse_robots(content)
# Analyze content
self._analyze_syntax(result)
self._analyze_rules(result)
self._analyze_sitemaps(result)
# Calculate stats
result.stats = {
"user_agents_count": len(result.rules),
"user_agents": [r.user_agent for r in result.rules],
"total_disallow_rules": sum(len(r.disallow) for r in result.rules),
"total_allow_rules": sum(len(r.allow) for r in result.rules),
"sitemaps_count": len(result.sitemaps),
"has_crawl_delay": any(r.crawl_delay for r in result.rules),
"content_length": len(content),
}
return result
def _analyze_syntax(self, result: RobotsResult) -> None:
"""Check for syntax issues."""
lines = result.content.split("\n")
for line_num, line in enumerate(lines, 1):
line = line.strip()
# Skip empty lines and comments
if not line or line.startswith("#"):
continue
# Check for valid directive
if ":" not in line:
result.issues.append(RobotsIssue(
severity="warning",
message=f"Invalid line (missing colon): {line[:50]}",
line_number=line_num,
))
continue
directive, value = line.split(":", 1)
directive = directive.strip().lower()
valid_directives = {
"user-agent", "disallow", "allow",
"crawl-delay", "sitemap", "host",
}
if directive not in valid_directives:
result.issues.append(RobotsIssue(
severity="info",
message=f"Unknown directive: {directive}",
line_number=line_num,
directive=directive,
))
def _analyze_rules(self, result: RobotsResult) -> None:
"""Analyze blocking rules."""
# Check if there are any rules
if not result.rules:
result.issues.append(RobotsIssue(
severity="info",
message="No user-agent rules defined",
suggestion="Add User-agent: * rules to control crawling",
))
return
# Check for wildcard rule
has_wildcard = any(r.user_agent == "*" for r in result.rules)
if not has_wildcard:
result.issues.append(RobotsIssue(
severity="info",
message="No wildcard (*) user-agent defined",
suggestion="Consider adding User-agent: * as fallback",
))
# Check for blocking important resources
for rules in result.rules:
for disallow in rules.disallow:
# Check if blocking root
if disallow == "/":
result.issues.append(RobotsIssue(
severity="error",
message=f"Blocking entire site for {rules.user_agent}",
directive=f"Disallow: {disallow}",
suggestion="This will prevent indexing. Is this intentional?",
))
# Check if blocking CSS/JS
if any(ext in disallow.lower() for ext in [".css", ".js"]):
result.issues.append(RobotsIssue(
severity="warning",
message=f"Blocking CSS/JS files for {rules.user_agent}",
directive=f"Disallow: {disallow}",
suggestion="May affect rendering and SEO",
))
# Check for blocking images
if any(ext in disallow.lower() for ext in [".jpg", ".png", ".gif", ".webp"]):
result.issues.append(RobotsIssue(
severity="info",
message=f"Blocking image files for {rules.user_agent}",
directive=f"Disallow: {disallow}",
))
# Check crawl delay
if rules.crawl_delay:
if rules.crawl_delay > 10:
result.issues.append(RobotsIssue(
severity="warning",
message=f"High crawl-delay ({rules.crawl_delay}s) for {rules.user_agent}",
directive=f"Crawl-delay: {rules.crawl_delay}",
suggestion="May significantly slow indexing",
))
elif rules.crawl_delay > 0:
result.issues.append(RobotsIssue(
severity="info",
message=f"Crawl-delay set to {rules.crawl_delay}s for {rules.user_agent}",
))
def _analyze_sitemaps(self, result: RobotsResult) -> None:
"""Analyze sitemap declarations."""
if not result.sitemaps:
result.issues.append(RobotsIssue(
severity="warning",
message="No sitemap declared in robots.txt",
suggestion="Add Sitemap: directive to help crawlers find your sitemap",
))
else:
for sitemap in result.sitemaps:
if not sitemap.startswith("http"):
result.issues.append(RobotsIssue(
severity="warning",
message=f"Sitemap URL should be absolute: {sitemap}",
directive=f"Sitemap: {sitemap}",
))
def test_url(self, robots_url: str, test_path: str,
user_agent: str = "Googlebot") -> dict:
"""Test if a specific URL is allowed."""
# Use Python's built-in parser
rp = RobotFileParser()
# Ensure robots.txt URL
parsed = urlparse(robots_url)
if not parsed.path.endswith("robots.txt"):
robots_txt_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
else:
robots_txt_url = robots_url
rp.set_url(robots_txt_url)
try:
rp.read()
except Exception as e:
return {
"path": test_path,
"user_agent": user_agent,
"allowed": None,
"error": str(e),
}
# Build full URL for testing
base_url = f"{parsed.scheme}://{parsed.netloc}"
full_url = urljoin(base_url, test_path)
allowed = rp.can_fetch(user_agent, full_url)
return {
"path": test_path,
"user_agent": user_agent,
"allowed": allowed,
"full_url": full_url,
}
def generate_report(self, result: RobotsResult) -> str:
"""Generate human-readable analysis report."""
lines = [
"=" * 60,
"Robots.txt Analysis Report",
"=" * 60,
f"URL: {result.url}",
f"Accessible: {'Yes' if result.accessible else 'No'}",
f"Timestamp: {result.timestamp}",
"",
]
if result.accessible:
lines.append("Statistics:")
for key, value in result.stats.items():
if key == "user_agents":
lines.append(f" {key}: {', '.join(value) if value else 'None'}")
else:
lines.append(f" {key}: {value}")
lines.append("")
if result.sitemaps:
lines.append(f"Sitemaps ({len(result.sitemaps)}):")
for sitemap in result.sitemaps:
lines.append(f" - {sitemap}")
lines.append("")
if result.rules:
lines.append("Rules Summary:")
for rules in result.rules:
lines.append(f"\n User-agent: {rules.user_agent}")
if rules.disallow:
lines.append(f" Disallow: {len(rules.disallow)} rules")
for d in rules.disallow[:5]:
lines.append(f" - {d}")
if len(rules.disallow) > 5:
lines.append(f" ... and {len(rules.disallow) - 5} more")
if rules.allow:
lines.append(f" Allow: {len(rules.allow)} rules")
for a in rules.allow[:3]:
lines.append(f" - {a}")
if rules.crawl_delay:
lines.append(f" Crawl-delay: {rules.crawl_delay}s")
lines.append("")
if result.issues:
lines.append("Issues Found:")
errors = [i for i in result.issues if i.severity == "error"]
warnings = [i for i in result.issues if i.severity == "warning"]
infos = [i for i in result.issues if i.severity == "info"]
if errors:
lines.append(f"\n ERRORS ({len(errors)}):")
for issue in errors:
lines.append(f" - {issue.message}")
if issue.directive:
lines.append(f" Directive: {issue.directive}")
if issue.suggestion:
lines.append(f" Suggestion: {issue.suggestion}")
if warnings:
lines.append(f"\n WARNINGS ({len(warnings)}):")
for issue in warnings:
lines.append(f" - {issue.message}")
if issue.suggestion:
lines.append(f" Suggestion: {issue.suggestion}")
if infos:
lines.append(f"\n INFO ({len(infos)}):")
for issue in infos:
lines.append(f" - {issue.message}")
lines.append("")
lines.append("=" * 60)
return "\n".join(lines)
def main():
"""Main entry point for CLI usage."""
parser = argparse.ArgumentParser(
description="Analyze robots.txt configuration",
)
parser.add_argument("--url", "-u", required=True,
help="URL to robots.txt or domain")
parser.add_argument("--test-url", "-t",
help="Test if specific URL path is allowed")
parser.add_argument("--user-agent", "-a", default="Googlebot",
help="User agent for testing (default: Googlebot)")
parser.add_argument("--output", "-o", help="Output file for JSON report")
parser.add_argument("--json", action="store_true", help="Output as JSON")
args = parser.parse_args()
checker = RobotsChecker()
if args.test_url:
# Test specific URL
test_result = checker.test_url(args.url, args.test_url, args.user_agent)
if args.json:
print(json.dumps(test_result, indent=2))
else:
status = "ALLOWED" if test_result["allowed"] else "BLOCKED"
print(f"URL: {test_result['path']}")
print(f"User-Agent: {test_result['user_agent']}")
print(f"Status: {status}")
else:
# Full analysis
result = checker.analyze(args.url)
if args.json or args.output:
output = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(output)
logger.info(f"Report written to {args.output}")
else:
print(output)
else:
print(checker.generate_report(result))
if __name__ == "__main__":
main()
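
The `--test-url` path above mirrors what the stdlib `urllib.robotparser` does. As a point of comparison (not part of this repo), a minimal sketch of the same allow/block test — note the stdlib uses first-match semantics and does not support `*` wildcards inside paths, unlike Google's longest-match rules:

```python
import urllib.robotparser

# Hypothetical robots.txt content for illustration
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /private/public-page
Disallow: /private/

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The Allow rule is listed first, so it wins under first-match semantics
print(rp.can_fetch("Googlebot", "https://example.com/private/public-page"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/private/secret"))       # False
# Other agents fall through to the wildcard group
print(rp.can_fetch("SomeBot", "https://example.com/admin/panel"))            # False
```

Because first-match differs from Google's longest-match resolution, rule ordering matters here in a way it does not for Googlebot itself; that is one reason a dedicated checker like `RobotsChecker` can be preferable.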

@@ -0,0 +1,490 @@
"""
Schema Generator - Generate JSON-LD structured data markup
==========================================================
Purpose: Generate schema.org structured data in JSON-LD format
Python: 3.10+
Usage:
python schema_generator.py --type organization --name "Company Name" --url "https://example.com"
"""
import argparse
import json
import logging
from datetime import datetime
from pathlib import Path
from typing import Any
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
# Template directory relative to this script
TEMPLATE_DIR = Path(__file__).parent.parent / "templates" / "schema_templates"
class SchemaGenerator:
"""Generate JSON-LD schema markup from templates."""
SCHEMA_TYPES = {
"organization": "organization.json",
"local_business": "local_business.json",
"product": "product.json",
"article": "article.json",
"faq": "faq.json",
"breadcrumb": "breadcrumb.json",
"website": "website.json",
}
# Business type mappings for LocalBusiness
BUSINESS_TYPES = {
"restaurant": "Restaurant",
"cafe": "CafeOrCoffeeShop",
"bar": "BarOrPub",
"hotel": "Hotel",
"store": "Store",
"medical": "MedicalBusiness",
"dental": "Dentist",
"legal": "LegalService",
"real_estate": "RealEstateAgent",
"auto": "AutoRepair",
"beauty": "BeautySalon",
"gym": "HealthClub",
"spa": "DaySpa",
}
# Article type mappings
ARTICLE_TYPES = {
"article": "Article",
"blog": "BlogPosting",
"news": "NewsArticle",
"tech": "TechArticle",
"scholarly": "ScholarlyArticle",
}
def __init__(self, template_dir: Path = TEMPLATE_DIR):
self.template_dir = template_dir
def load_template(self, schema_type: str) -> dict:
"""Load a schema template file."""
if schema_type not in self.SCHEMA_TYPES:
raise ValueError(f"Unknown schema type: {schema_type}. "
f"Available: {list(self.SCHEMA_TYPES.keys())}")
template_file = self.template_dir / self.SCHEMA_TYPES[schema_type]
if not template_file.exists():
raise FileNotFoundError(f"Template not found: {template_file}")
with open(template_file, "r", encoding="utf-8") as f:
return json.load(f)
def fill_template(self, template: dict, data: dict[str, Any]) -> dict:
"""Fill template placeholders with actual data."""
template_str = json.dumps(template, ensure_ascii=False)
        # Replace placeholders {{key}} with JSON-escaped values
        for key, value in data.items():
            placeholder = f"{{{{{key}}}}}"
            if value is not None:
                # Escape quotes/backslashes so the template stays valid JSON
                escaped = json.dumps(str(value), ensure_ascii=False)[1:-1]
                template_str = template_str.replace(placeholder, escaped)
# Remove unfilled placeholders and their parent objects if empty
result = json.loads(template_str)
return self._clean_empty_values(result)
def _clean_empty_values(self, obj: Any) -> Any:
"""Remove empty values and unfilled placeholders."""
if isinstance(obj, dict):
cleaned = {}
for key, value in obj.items():
cleaned_value = self._clean_empty_values(value)
# Skip if value is empty, None, or unfilled placeholder
if cleaned_value is None:
continue
if isinstance(cleaned_value, str) and cleaned_value.startswith("{{"):
continue
if isinstance(cleaned_value, (list, dict)) and not cleaned_value:
continue
cleaned[key] = cleaned_value
return cleaned if cleaned else None
elif isinstance(obj, list):
cleaned = []
for item in obj:
cleaned_item = self._clean_empty_values(item)
if cleaned_item is not None:
if isinstance(cleaned_item, str) and cleaned_item.startswith("{{"):
continue
cleaned.append(cleaned_item)
return cleaned if cleaned else None
elif isinstance(obj, str):
if obj.startswith("{{") and obj.endswith("}}"):
return None
return obj
return obj
def generate_organization(
self,
name: str,
url: str,
logo_url: str | None = None,
description: str | None = None,
founding_date: str | None = None,
phone: str | None = None,
address: dict | None = None,
social_links: list[str] | None = None,
) -> dict:
"""Generate Organization schema."""
template = self.load_template("organization")
data = {
"name": name,
"url": url,
"logo_url": logo_url,
"description": description,
"founding_date": founding_date,
"phone": phone,
}
if address:
data.update({
"street_address": address.get("street"),
"city": address.get("city"),
"region": address.get("region"),
"postal_code": address.get("postal_code"),
"country": address.get("country", "KR"),
})
        result = self.fill_template(template, data)
        if social_links:
            # sameAs takes a list of profile URLs, so attach it after the
            # string-based template fill rather than through a placeholder
            result["sameAs"] = social_links
        return result
def generate_local_business(
self,
name: str,
business_type: str,
address: dict,
phone: str | None = None,
url: str | None = None,
description: str | None = None,
hours: dict | None = None,
geo: dict | None = None,
price_range: str | None = None,
rating: float | None = None,
review_count: int | None = None,
) -> dict:
"""Generate LocalBusiness schema."""
template = self.load_template("local_business")
schema_business_type = self.BUSINESS_TYPES.get(
business_type.lower(), "LocalBusiness"
)
data = {
"business_type": schema_business_type,
"name": name,
"url": url,
"description": description,
"phone": phone,
"price_range": price_range,
"street_address": address.get("street"),
"city": address.get("city"),
"region": address.get("region"),
"postal_code": address.get("postal_code"),
"country": address.get("country", "KR"),
}
if geo:
data["latitude"] = geo.get("lat")
data["longitude"] = geo.get("lng")
if hours:
data.update({
"weekday_opens": hours.get("weekday_opens", "09:00"),
"weekday_closes": hours.get("weekday_closes", "18:00"),
"weekend_opens": hours.get("weekend_opens"),
"weekend_closes": hours.get("weekend_closes"),
})
if rating is not None:
data["rating"] = str(rating)
data["review_count"] = str(review_count or 0)
return self.fill_template(template, data)
def generate_product(
self,
name: str,
description: str,
price: float,
currency: str = "KRW",
brand: str | None = None,
sku: str | None = None,
images: list[str] | None = None,
availability: str = "InStock",
condition: str = "NewCondition",
rating: float | None = None,
review_count: int | None = None,
url: str | None = None,
seller: str | None = None,
) -> dict:
"""Generate Product schema."""
template = self.load_template("product")
data = {
"name": name,
"description": description,
"price": str(int(price)),
"currency": currency,
"brand_name": brand,
"sku": sku,
"product_url": url,
"availability": availability,
"condition": condition,
"seller_name": seller,
}
if images:
for i, img in enumerate(images[:3], 1):
data[f"image_url_{i}"] = img
if rating is not None:
data["rating"] = str(rating)
data["review_count"] = str(review_count or 0)
return self.fill_template(template, data)
def generate_article(
self,
headline: str,
description: str,
author_name: str,
date_published: str,
publisher_name: str,
article_type: str = "article",
date_modified: str | None = None,
images: list[str] | None = None,
page_url: str | None = None,
publisher_logo: str | None = None,
author_url: str | None = None,
section: str | None = None,
word_count: int | None = None,
keywords: str | None = None,
) -> dict:
"""Generate Article schema."""
template = self.load_template("article")
schema_article_type = self.ARTICLE_TYPES.get(
article_type.lower(), "Article"
)
data = {
"article_type": schema_article_type,
"headline": headline,
"description": description,
"author_name": author_name,
"author_url": author_url,
"date_published": date_published,
"date_modified": date_modified or date_published,
"publisher_name": publisher_name,
"publisher_logo_url": publisher_logo,
"page_url": page_url,
"section": section,
"word_count": str(word_count) if word_count else None,
"keywords": keywords,
}
if images:
for i, img in enumerate(images[:2], 1):
data[f"image_url_{i}"] = img
return self.fill_template(template, data)
def generate_faq(self, questions: list[dict[str, str]]) -> dict:
"""Generate FAQPage schema."""
schema = {
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [],
}
for qa in questions:
schema["mainEntity"].append({
"@type": "Question",
"name": qa["question"],
"acceptedAnswer": {
"@type": "Answer",
"text": qa["answer"],
},
})
return schema
def generate_breadcrumb(self, items: list[dict[str, str]]) -> dict:
"""Generate BreadcrumbList schema."""
schema = {
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [],
}
for i, item in enumerate(items, 1):
schema["itemListElement"].append({
"@type": "ListItem",
"position": i,
"name": item["name"],
"item": item["url"],
})
return schema
def generate_website(
self,
name: str,
url: str,
search_url_template: str | None = None,
description: str | None = None,
language: str = "ko-KR",
publisher_name: str | None = None,
logo_url: str | None = None,
alternate_name: str | None = None,
) -> dict:
"""Generate WebSite schema."""
template = self.load_template("website")
data = {
"site_name": name,
"url": url,
"description": description,
"language": language,
"search_url_template": search_url_template,
"publisher_name": publisher_name or name,
"logo_url": logo_url,
"alternate_name": alternate_name,
}
return self.fill_template(template, data)
def to_json_ld(self, schema: dict, pretty: bool = True) -> str:
"""Convert schema dict to JSON-LD string."""
indent = 2 if pretty else None
return json.dumps(schema, ensure_ascii=False, indent=indent)
def to_html_script(self, schema: dict) -> str:
"""Wrap schema in HTML script tag."""
json_ld = self.to_json_ld(schema)
return f'<script type="application/ld+json">\n{json_ld}\n</script>'
def main():
"""Main entry point for CLI usage."""
parser = argparse.ArgumentParser(
description="Generate JSON-LD schema markup",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Generate Organization schema
python schema_generator.py --type organization --name "My Company" --url "https://example.com"
# Generate Product schema
python schema_generator.py --type product --name "Widget" --price 29900 --currency KRW
# Generate Article schema
python schema_generator.py --type article --headline "Article Title" --author "John Doe"
""",
)
parser.add_argument(
"--type", "-t",
required=True,
choices=SchemaGenerator.SCHEMA_TYPES.keys(),
help="Schema type to generate",
)
parser.add_argument("--name", help="Name/title")
parser.add_argument("--url", help="URL")
parser.add_argument("--description", help="Description")
parser.add_argument("--price", type=float, help="Price (for product)")
parser.add_argument("--currency", default="KRW", help="Currency code")
parser.add_argument("--headline", help="Headline (for article)")
parser.add_argument("--author", help="Author name")
parser.add_argument("--output", "-o", help="Output file path")
parser.add_argument("--html", action="store_true", help="Output as HTML script tag")
args = parser.parse_args()
generator = SchemaGenerator()
try:
if args.type == "organization":
schema = generator.generate_organization(
name=args.name or "Organization Name",
url=args.url or "https://example.com",
description=args.description,
)
elif args.type == "product":
schema = generator.generate_product(
name=args.name or "Product Name",
description=args.description or "Product description",
price=args.price or 0,
currency=args.currency,
)
elif args.type == "article":
schema = generator.generate_article(
headline=args.headline or args.name or "Article Title",
description=args.description or "Article description",
author_name=args.author or "Author",
date_published=datetime.now().strftime("%Y-%m-%d"),
publisher_name="Publisher",
)
elif args.type == "website":
schema = generator.generate_website(
name=args.name or "Website Name",
url=args.url or "https://example.com",
description=args.description,
)
elif args.type == "faq":
# Example FAQ
schema = generator.generate_faq([
{"question": "Question 1?", "answer": "Answer 1"},
{"question": "Question 2?", "answer": "Answer 2"},
])
elif args.type == "breadcrumb":
# Example breadcrumb
schema = generator.generate_breadcrumb([
{"name": "Home", "url": "https://example.com/"},
{"name": "Category", "url": "https://example.com/category/"},
])
elif args.type == "local_business":
schema = generator.generate_local_business(
name=args.name or "Business Name",
business_type="store",
address={"street": "123 Main St", "city": "Seoul", "country": "KR"},
url=args.url,
description=args.description,
)
else:
raise ValueError(f"Unsupported type: {args.type}")
if args.html:
output = generator.to_html_script(schema)
else:
output = generator.to_json_ld(schema)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(output)
logger.info(f"Schema written to {args.output}")
else:
print(output)
except Exception as e:
logger.error(f"Error generating schema: {e}")
raise
if __name__ == "__main__":
main()
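
The template-fill pipeline above (dump to a JSON string, substitute `{{key}}` placeholders, reload, prune whatever stayed unfilled) condenses to a standalone sketch. The `TEMPLATE` dict here is a made-up stand-in for the real files under `templates/schema_templates/`:

```python
import json
from typing import Any

# Hypothetical stand-in for templates/schema_templates/organization.json
TEMPLATE = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "{{name}}",
    "url": "{{url}}",
    "logo": "{{logo_url}}",
}

def fill(template: dict, data: dict[str, Any]) -> dict:
    """Replace {{key}} placeholders, then drop any left unfilled."""
    text = json.dumps(template, ensure_ascii=False)
    for key, value in data.items():
        if value is not None:
            # JSON-escape the value so quotes don't corrupt the document
            escaped = json.dumps(str(value), ensure_ascii=False)[1:-1]
            text = text.replace(f"{{{{{key}}}}}", escaped)
    filled = json.loads(text)
    return {k: v for k, v in filled.items()
            if not (isinstance(v, str) and v.startswith("{{"))}

schema = fill(TEMPLATE, {"name": 'ACME "Korea"', "url": "https://example.com"})
print(json.dumps(schema, ensure_ascii=False))
```

The real `fill_template` additionally recurses into nested objects and lists via `_clean_empty_values`; this sketch only prunes the top level.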

@@ -0,0 +1,498 @@
"""
Schema Validator - Validate JSON-LD structured data markup
==========================================================
Purpose: Extract and validate schema.org structured data from URLs or files
Python: 3.10+
Usage:
python schema_validator.py --url https://example.com
python schema_validator.py --file schema.json
"""
import argparse
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
import requests
from bs4 import BeautifulSoup
try:
import extruct
HAS_EXTRUCT = True
except ImportError:
HAS_EXTRUCT = False
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
@dataclass
class ValidationIssue:
"""Represents a validation issue found in schema."""
severity: str # "error", "warning", "info"
message: str
schema_type: str | None = None
property_name: str | None = None
suggestion: str | None = None
@dataclass
class ValidationResult:
"""Complete validation result for a schema."""
url: str | None = None
schemas_found: list[dict] = field(default_factory=list)
issues: list[ValidationIssue] = field(default_factory=list)
valid: bool = True
rich_results_eligible: dict = field(default_factory=dict)
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
def to_dict(self) -> dict:
"""Convert to dictionary for JSON output."""
return {
"url": self.url,
"schemas_found": len(self.schemas_found),
"schema_types": [s.get("@type", "Unknown") for s in self.schemas_found],
"valid": self.valid,
"issues": [
{
"severity": i.severity,
"message": i.message,
"schema_type": i.schema_type,
"property": i.property_name,
"suggestion": i.suggestion,
}
for i in self.issues
],
"rich_results_eligible": self.rich_results_eligible,
"timestamp": self.timestamp,
}
class SchemaValidator:
"""Validate schema.org structured data."""
# Required properties for common schema types
REQUIRED_PROPERTIES = {
"Organization": ["name", "url"],
"LocalBusiness": ["name", "address"],
"Product": ["name"],
"Offer": ["price", "priceCurrency"],
"Article": ["headline", "author", "datePublished", "publisher"],
"BlogPosting": ["headline", "author", "datePublished", "publisher"],
"NewsArticle": ["headline", "author", "datePublished", "publisher"],
"FAQPage": ["mainEntity"],
"Question": ["name", "acceptedAnswer"],
"Answer": ["text"],
"BreadcrumbList": ["itemListElement"],
"ListItem": ["position", "name"],
"WebSite": ["name", "url"],
"WebPage": ["name"],
"Person": ["name"],
"Event": ["name", "startDate", "location"],
"Review": ["reviewRating", "author"],
"AggregateRating": ["ratingValue"],
"ImageObject": ["url"],
}
# Recommended (but not required) properties
RECOMMENDED_PROPERTIES = {
"Organization": ["logo", "description", "contactPoint", "sameAs"],
"LocalBusiness": ["telephone", "openingHoursSpecification", "geo", "image"],
"Product": ["description", "image", "brand", "offers", "aggregateRating"],
"Article": ["image", "dateModified", "description"],
"FAQPage": [],
"WebSite": ["potentialAction"],
"BreadcrumbList": [],
}
# Google Rich Results eligible types
RICH_RESULTS_TYPES = {
"Article", "BlogPosting", "NewsArticle",
"Product", "Review",
"FAQPage", "HowTo",
"LocalBusiness", "Restaurant",
"Event",
"Recipe",
"JobPosting",
"Course",
"BreadcrumbList",
"Organization",
"WebSite",
"VideoObject",
}
def __init__(self):
self.session = requests.Session()
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (compatible; SEOAuditBot/1.0)"
})
def extract_from_url(self, url: str) -> list[dict]:
"""Extract all structured data from a URL."""
try:
response = self.session.get(url, timeout=30)
response.raise_for_status()
return self.extract_from_html(response.text, url)
except requests.RequestException as e:
logger.error(f"Failed to fetch URL: {e}")
return []
def extract_from_html(self, html: str, base_url: str | None = None) -> list[dict]:
"""Extract structured data from HTML content."""
schemas = []
# Method 1: Use extruct if available (handles JSON-LD, Microdata, RDFa)
if HAS_EXTRUCT:
try:
data = extruct.extract(html, base_url=base_url, uniform=True)
schemas.extend(data.get("json-ld", []))
schemas.extend(data.get("microdata", []))
schemas.extend(data.get("rdfa", []))
except Exception as e:
logger.warning(f"extruct extraction failed: {e}")
# Method 2: Manual JSON-LD extraction (fallback/additional)
soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
try:
content = script.string
if content:
data = json.loads(content)
if isinstance(data, list):
schemas.extend(data)
else:
schemas.append(data)
except json.JSONDecodeError as e:
logger.warning(f"Invalid JSON-LD: {e}")
# Deduplicate schemas
seen = set()
unique_schemas = []
for schema in schemas:
schema_str = json.dumps(schema, sort_keys=True)
if schema_str not in seen:
seen.add(schema_str)
unique_schemas.append(schema)
return unique_schemas
def validate(self, url: str | None = None, html: str | None = None,
schema: dict | None = None) -> ValidationResult:
"""Validate schema from URL, HTML, or direct schema dict."""
result = ValidationResult(url=url)
# Extract schemas
if schema:
schemas = [schema]
elif html:
schemas = self.extract_from_html(html, url)
elif url:
schemas = self.extract_from_url(url)
else:
raise ValueError("Must provide url, html, or schema")
result.schemas_found = schemas
if not schemas:
result.issues.append(ValidationIssue(
severity="warning",
message="No structured data found",
suggestion="Add JSON-LD schema markup to improve SEO",
))
result.valid = False
return result
# Validate each schema
for schema in schemas:
self._validate_schema(schema, result)
# Check for errors (warnings don't affect validity)
result.valid = not any(i.severity == "error" for i in result.issues)
return result
def _validate_schema(self, schema: dict, result: ValidationResult,
parent_type: str | None = None) -> None:
"""Validate a single schema object."""
schema_type = schema.get("@type")
if not schema_type:
result.issues.append(ValidationIssue(
severity="error",
message="Missing @type property",
schema_type=parent_type,
))
return
# Handle array of types
if isinstance(schema_type, list):
schema_type = schema_type[0]
# Check required properties
required = self.REQUIRED_PROPERTIES.get(schema_type, [])
for prop in required:
if prop not in schema:
result.issues.append(ValidationIssue(
severity="error",
message=f"Missing required property: {prop}",
schema_type=schema_type,
property_name=prop,
suggestion=f"Add '{prop}' property to {schema_type} schema",
))
# Check recommended properties
recommended = self.RECOMMENDED_PROPERTIES.get(schema_type, [])
for prop in recommended:
if prop not in schema:
result.issues.append(ValidationIssue(
severity="info",
message=f"Missing recommended property: {prop}",
schema_type=schema_type,
property_name=prop,
suggestion=f"Consider adding '{prop}' for better rich results",
))
# Check Rich Results eligibility
if schema_type in self.RICH_RESULTS_TYPES:
result.rich_results_eligible[schema_type] = self._check_rich_results(
schema, schema_type
)
# Validate nested schemas
for key, value in schema.items():
if key.startswith("@"):
continue
if isinstance(value, dict) and "@type" in value:
self._validate_schema(value, result, schema_type)
elif isinstance(value, list):
for item in value:
if isinstance(item, dict) and "@type" in item:
self._validate_schema(item, result, schema_type)
# Type-specific validations
self._validate_type_specific(schema, schema_type, result)
def _validate_type_specific(self, schema: dict, schema_type: str,
result: ValidationResult) -> None:
"""Type-specific validation rules."""
if schema_type in ("Article", "BlogPosting", "NewsArticle"):
# Check image
if "image" not in schema:
result.issues.append(ValidationIssue(
severity="warning",
message="Article without image may not show in rich results",
schema_type=schema_type,
property_name="image",
suggestion="Add at least one image to the article",
))
# Check headline length
headline = schema.get("headline", "")
if len(headline) > 110:
result.issues.append(ValidationIssue(
severity="warning",
message=f"Headline too long ({len(headline)} chars, max 110)",
schema_type=schema_type,
property_name="headline",
))
elif schema_type == "Product":
offer = schema.get("offers", {})
if isinstance(offer, dict):
# Check price
price = offer.get("price")
if price is not None:
try:
float(price)
except (ValueError, TypeError):
result.issues.append(ValidationIssue(
severity="error",
message=f"Invalid price value: {price}",
schema_type="Offer",
property_name="price",
))
# Check availability
availability = offer.get("availability", "")
valid_availabilities = [
"InStock", "OutOfStock", "PreOrder", "Discontinued",
"https://schema.org/InStock", "https://schema.org/OutOfStock",
]
if availability and not any(
a in availability for a in valid_availabilities
):
result.issues.append(ValidationIssue(
severity="warning",
message=f"Unknown availability value: {availability}",
schema_type="Offer",
property_name="availability",
))
elif schema_type == "LocalBusiness":
# Check for geo coordinates
if "geo" not in schema:
result.issues.append(ValidationIssue(
severity="info",
message="Missing geo coordinates",
schema_type=schema_type,
property_name="geo",
suggestion="Add latitude/longitude for better local search",
))
elif schema_type == "FAQPage":
main_entity = schema.get("mainEntity", [])
if not main_entity:
result.issues.append(ValidationIssue(
severity="error",
message="FAQPage must have at least one question",
schema_type=schema_type,
property_name="mainEntity",
))
elif len(main_entity) < 2:
result.issues.append(ValidationIssue(
severity="info",
message="FAQPage has only one question",
schema_type=schema_type,
suggestion="Add more questions for better rich results",
))
def _check_rich_results(self, schema: dict, schema_type: str) -> dict:
"""Check if schema is eligible for Google Rich Results."""
result = {
"eligible": True,
"missing_for_rich_results": [],
}
if schema_type in ("Article", "BlogPosting", "NewsArticle"):
required_for_rich = ["headline", "image", "datePublished", "author"]
for prop in required_for_rich:
if prop not in schema:
result["eligible"] = False
result["missing_for_rich_results"].append(prop)
elif schema_type == "Product":
if "name" not in schema:
result["eligible"] = False
result["missing_for_rich_results"].append("name")
offer = schema.get("offers")
if not offer:
result["eligible"] = False
result["missing_for_rich_results"].append("offers")
elif schema_type == "FAQPage":
if not schema.get("mainEntity"):
result["eligible"] = False
result["missing_for_rich_results"].append("mainEntity")
return result
def generate_report(self, result: ValidationResult) -> str:
"""Generate human-readable validation report."""
lines = [
"=" * 60,
"Schema Validation Report",
"=" * 60,
f"URL: {result.url or 'N/A'}",
f"Timestamp: {result.timestamp}",
f"Valid: {'Yes' if result.valid else 'No'}",
f"Schemas Found: {len(result.schemas_found)}",
"",
]
if result.schemas_found:
lines.append("Schema Types:")
for schema in result.schemas_found:
schema_type = schema.get("@type", "Unknown")
lines.append(f" - {schema_type}")
lines.append("")
if result.rich_results_eligible:
lines.append("Rich Results Eligibility:")
for schema_type, status in result.rich_results_eligible.items():
eligible = "Yes" if status["eligible"] else "No"
lines.append(f" - {schema_type}: {eligible}")
if status["missing_for_rich_results"]:
missing = ", ".join(status["missing_for_rich_results"])
lines.append(f" Missing: {missing}")
lines.append("")
if result.issues:
lines.append("Issues Found:")
errors = [i for i in result.issues if i.severity == "error"]
warnings = [i for i in result.issues if i.severity == "warning"]
infos = [i for i in result.issues if i.severity == "info"]
if errors:
lines.append(f"\n ERRORS ({len(errors)}):")
for issue in errors:
lines.append(f" - [{issue.schema_type}] {issue.message}")
if issue.suggestion:
lines.append(f" Suggestion: {issue.suggestion}")
if warnings:
lines.append(f"\n WARNINGS ({len(warnings)}):")
for issue in warnings:
lines.append(f" - [{issue.schema_type}] {issue.message}")
if issue.suggestion:
lines.append(f" Suggestion: {issue.suggestion}")
if infos:
lines.append(f"\n INFO ({len(infos)}):")
for issue in infos:
lines.append(f" - [{issue.schema_type}] {issue.message}")
if issue.suggestion:
lines.append(f" Suggestion: {issue.suggestion}")
lines.append("")
lines.append("=" * 60)
return "\n".join(lines)
def main():
"""Main entry point for CLI usage."""
parser = argparse.ArgumentParser(
description="Validate schema.org structured data",
)
parser.add_argument("--url", "-u", help="URL to validate")
parser.add_argument("--file", "-f", help="JSON-LD file to validate")
parser.add_argument("--output", "-o", help="Output file for JSON report")
parser.add_argument("--json", action="store_true", help="Output as JSON")
args = parser.parse_args()
if not args.url and not args.file:
parser.error("Must provide --url or --file")
validator = SchemaValidator()
if args.file:
with open(args.file, "r", encoding="utf-8") as f:
schema = json.load(f)
result = validator.validate(schema=schema)
else:
result = validator.validate(url=args.url)
if args.json or args.output:
output = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(output)
logger.info(f"Report written to {args.output}")
else:
print(output)
else:
print(validator.generate_report(result))
if __name__ == "__main__":
main()
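
The core of `_validate_schema` — look up the type's required properties and report the absent ones — reduces to a few lines. `REQUIRED` here is a trimmed copy of the class table, kept only as a sketch:

```python
# Trimmed copy of SchemaValidator.REQUIRED_PROPERTIES for illustration
REQUIRED = {
    "Article": ["headline", "author", "datePublished", "publisher"],
    "Product": ["name"],
    "FAQPage": ["mainEntity"],
}

def missing_required(schema: dict) -> list[str]:
    """Return required schema.org properties absent from a JSON-LD object."""
    stype = schema.get("@type")
    if isinstance(stype, list):  # JSON-LD permits an array of types
        stype = stype[0]
    return [p for p in REQUIRED.get(stype, []) if p not in schema]

article = {
    "@type": "Article",
    "headline": "Hello",
    "author": {"@type": "Person", "name": "Kim"},
}
print(missing_required(article))  # ['datePublished', 'publisher']
```

Each hit becomes an `error`-severity `ValidationIssue` in the full validator, which also recurses into nested typed objects (e.g. the `Person` author above).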

@@ -0,0 +1,969 @@
"""
Sitemap Crawler - Sequential page analysis from sitemap
=======================================================
Purpose: Crawl sitemap URLs one by one, analyze each page, save to Notion
Python: 3.10+
Usage:
from sitemap_crawler import SitemapCrawler
crawler = SitemapCrawler()
crawler.crawl_sitemap("https://example.com/sitemap.xml", delay=2.0)
"""
import json
import logging
import time
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Callable, Generator
from urllib.parse import urlparse
import requests
from notion_client import Client
from base_client import config
from page_analyzer import PageAnalyzer, PageMetadata
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
# Default database for page analysis data
DEFAULT_PAGES_DATABASE_ID = "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"
# Default limits to prevent excessive resource usage
DEFAULT_MAX_PAGES = 500
DEFAULT_DELAY_SECONDS = 2.0
# Progress tracking directory
PROGRESS_DIR = Path.home() / ".claude" / "seo-audit-progress"
PROGRESS_DIR.mkdir(parents=True, exist_ok=True)
@dataclass
class CrawlProgress:
"""Track crawl progress."""
total_urls: int = 0
processed_urls: int = 0
successful_urls: int = 0
failed_urls: int = 0
skipped_urls: int = 0
start_time: datetime = field(default_factory=datetime.now)
current_url: str = ""
audit_id: str = ""
site: str = ""
status: str = "running" # running, completed, failed
error_message: str = ""
summary_page_id: str = ""
def get_progress_percent(self) -> float:
if self.total_urls == 0:
return 0.0
return (self.processed_urls / self.total_urls) * 100
def get_elapsed_time(self) -> str:
elapsed = datetime.now() - self.start_time
minutes = int(elapsed.total_seconds() // 60)
seconds = int(elapsed.total_seconds() % 60)
return f"{minutes}m {seconds}s"
def get_eta(self) -> str:
if self.processed_urls == 0:
return "calculating..."
elapsed = (datetime.now() - self.start_time).total_seconds()
avg_time_per_url = elapsed / self.processed_urls
remaining_urls = self.total_urls - self.processed_urls
eta_seconds = remaining_urls * avg_time_per_url
minutes = int(eta_seconds // 60)
seconds = int(eta_seconds % 60)
return f"{minutes}m {seconds}s"
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"audit_id": self.audit_id,
"site": self.site,
"status": self.status,
"total_urls": self.total_urls,
"processed_urls": self.processed_urls,
"successful_urls": self.successful_urls,
"failed_urls": self.failed_urls,
"progress_percent": round(self.get_progress_percent(), 1),
"elapsed_time": self.get_elapsed_time(),
"eta": self.get_eta(),
"current_url": self.current_url,
"start_time": self.start_time.isoformat(),
"error_message": self.error_message,
"summary_page_id": self.summary_page_id,
"updated_at": datetime.now().isoformat(),
}
def save_to_file(self, filepath: Path | None = None) -> Path:
"""Save progress to JSON file."""
if filepath is None:
filepath = PROGRESS_DIR / f"{self.audit_id}.json"
with open(filepath, "w") as f:
json.dump(self.to_dict(), f, indent=2)
return filepath
@classmethod
def load_from_file(cls, filepath: Path) -> "CrawlProgress":
"""Load progress from JSON file."""
with open(filepath, "r") as f:
data = json.load(f)
progress = cls()
progress.audit_id = data.get("audit_id", "")
progress.site = data.get("site", "")
progress.status = data.get("status", "unknown")
progress.total_urls = data.get("total_urls", 0)
progress.processed_urls = data.get("processed_urls", 0)
progress.successful_urls = data.get("successful_urls", 0)
        progress.failed_urls = data.get("failed_urls", 0)
        progress.skipped_urls = data.get("skipped_urls", 0)
progress.current_url = data.get("current_url", "")
progress.error_message = data.get("error_message", "")
progress.summary_page_id = data.get("summary_page_id", "")
if data.get("start_time"):
progress.start_time = datetime.fromisoformat(data["start_time"])
return progress
def get_active_crawls() -> list[CrawlProgress]:
"""Get all active (running) crawl jobs."""
active = []
for filepath in PROGRESS_DIR.glob("*.json"):
try:
progress = CrawlProgress.load_from_file(filepath)
if progress.status == "running":
active.append(progress)
except Exception:
continue
return active
def get_all_crawls() -> list[CrawlProgress]:
"""Get all crawl jobs (active and completed)."""
crawls = []
for filepath in sorted(PROGRESS_DIR.glob("*.json"), reverse=True):
try:
progress = CrawlProgress.load_from_file(filepath)
crawls.append(progress)
except Exception:
continue
return crawls
def get_crawl_status(audit_id: str) -> CrawlProgress | None:
"""Get status of a specific crawl by audit ID."""
filepath = PROGRESS_DIR / f"{audit_id}.json"
if filepath.exists():
return CrawlProgress.load_from_file(filepath)
return None
@dataclass
class CrawlResult:
"""Result of a complete sitemap crawl."""
site: str
sitemap_url: str
audit_id: str
total_pages: int
successful_pages: int
failed_pages: int
start_time: datetime
end_time: datetime
pages_analyzed: list[PageMetadata] = field(default_factory=list)
notion_page_ids: list[str] = field(default_factory=list)
summary_page_id: str | None = None
def get_duration(self) -> str:
duration = self.end_time - self.start_time
minutes = int(duration.total_seconds() // 60)
seconds = int(duration.total_seconds() % 60)
return f"{minutes}m {seconds}s"
class SitemapCrawler:
"""Crawl sitemap URLs and analyze each page."""
def __init__(
self,
notion_token: str | None = None,
database_id: str | None = None,
):
"""
Initialize sitemap crawler.
Args:
notion_token: Notion API token
database_id: Notion database ID for storing results
"""
self.notion_token = notion_token or config.notion_token
self.database_id = database_id or DEFAULT_PAGES_DATABASE_ID
self.analyzer = PageAnalyzer()
if self.notion_token:
self.notion = Client(auth=self.notion_token)
else:
self.notion = None
logger.warning("Notion token not configured, results will not be saved")
def fetch_sitemap_urls(self, sitemap_url: str) -> list[str]:
"""
Fetch and parse URLs from a sitemap.
Args:
sitemap_url: URL of the sitemap
Returns:
List of URLs found in the sitemap
"""
try:
response = requests.get(sitemap_url, timeout=30)
response.raise_for_status()
# Parse XML
root = ET.fromstring(response.content)
# Handle namespace
namespaces = {
"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"
}
urls = []
# Check if this is a sitemap index
sitemap_tags = root.findall(".//sm:sitemap/sm:loc", namespaces)
if sitemap_tags:
# This is a sitemap index, recursively fetch child sitemaps
logger.info(f"Found sitemap index with {len(sitemap_tags)} child sitemaps")
for loc in sitemap_tags:
if loc.text:
child_urls = self.fetch_sitemap_urls(loc.text)
urls.extend(child_urls)
else:
# Regular sitemap, extract URLs
url_tags = root.findall(".//sm:url/sm:loc", namespaces)
if not url_tags:
# Try without namespace
url_tags = root.findall(".//url/loc")
for loc in url_tags:
if loc.text:
urls.append(loc.text)
# Remove duplicates while preserving order
seen = set()
unique_urls = []
for url in urls:
if url not in seen:
seen.add(url)
unique_urls.append(url)
logger.info(f"Found {len(unique_urls)} unique URLs in sitemap")
return unique_urls
except Exception as e:
logger.error(f"Failed to fetch sitemap: {e}")
raise
def crawl_sitemap(
self,
sitemap_url: str,
delay: float = DEFAULT_DELAY_SECONDS,
max_pages: int = DEFAULT_MAX_PAGES,
progress_callback: Callable[[CrawlProgress], None] | None = None,
save_to_notion: bool = True,
url_filter: Callable[[str], bool] | None = None,
) -> CrawlResult:
"""
Crawl all URLs in a sitemap sequentially.
Args:
sitemap_url: URL of the sitemap
delay: Seconds to wait between requests (default: 2.0s)
max_pages: Maximum number of pages to process (default: 500)
progress_callback: Function called with progress updates
save_to_notion: Whether to save results to Notion
url_filter: Optional function to filter URLs (return True to include)
Returns:
CrawlResult with all analyzed pages
"""
# Parse site info
parsed_sitemap = urlparse(sitemap_url)
site = f"{parsed_sitemap.scheme}://{parsed_sitemap.netloc}"
site_domain = parsed_sitemap.netloc
# Generate audit ID
audit_id = f"{site_domain}-pages-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
logger.info(f"Starting sitemap crawl: {sitemap_url}")
logger.info(f"Audit ID: {audit_id}")
logger.info(f"Delay between requests: {delay}s")
# Initialize progress tracking
progress = CrawlProgress(
audit_id=audit_id,
site=site,
status="running",
)
# Fetch URLs
urls = self.fetch_sitemap_urls(sitemap_url)
# Apply URL filter if provided
if url_filter:
urls = [url for url in urls if url_filter(url)]
logger.info(f"After filtering: {len(urls)} URLs")
# Apply max pages limit (default: 500 to prevent excessive resource usage)
if len(urls) > max_pages:
            logger.warning(f"Sitemap has {len(urls)} URLs, limiting to {max_pages} pages")
            logger.warning("Use the max_pages parameter to adjust this limit")
urls = urls[:max_pages]
logger.info(f"Processing {len(urls)} pages (max: {max_pages})")
# Update progress with total URLs
progress.total_urls = len(urls)
progress.save_to_file()
# Initialize result
result = CrawlResult(
site=site,
sitemap_url=sitemap_url,
audit_id=audit_id,
total_pages=len(urls),
successful_pages=0,
failed_pages=0,
start_time=datetime.now(),
end_time=datetime.now(),
)
# Process each URL
try:
for i, url in enumerate(urls):
progress.current_url = url
progress.processed_urls = i
progress.save_to_file() # Save progress to file
if progress_callback:
progress_callback(progress)
logger.info(f"[{i+1}/{len(urls)}] Analyzing: {url}")
try:
# Analyze page
metadata = self.analyzer.analyze_url(url)
result.pages_analyzed.append(metadata)
if metadata.status_code == 200:
progress.successful_urls += 1
result.successful_pages += 1
# Save to Notion
if save_to_notion and self.notion:
page_id = self._save_page_to_notion(metadata, audit_id, site)
if page_id:
result.notion_page_ids.append(page_id)
else:
progress.failed_urls += 1
result.failed_pages += 1
except Exception as e:
logger.error(f"Failed to analyze {url}: {e}")
progress.failed_urls += 1
result.failed_pages += 1
# Wait before next request
if i < len(urls) - 1: # Don't wait after last URL
time.sleep(delay)
# Final progress update
progress.processed_urls = len(urls)
progress.status = "completed"
if progress_callback:
progress_callback(progress)
except Exception as e:
progress.status = "failed"
progress.error_message = str(e)
progress.save_to_file()
raise
# Update result
result.end_time = datetime.now()
# Create summary page
if save_to_notion and self.notion:
summary_id = self._create_crawl_summary_page(result)
result.summary_page_id = summary_id
progress.summary_page_id = summary_id
# Save final progress
progress.save_to_file()
logger.info(f"Crawl complete: {result.successful_pages}/{result.total_pages} pages analyzed")
logger.info(f"Duration: {result.get_duration()}")
return result
def _save_page_to_notion(
self,
metadata: PageMetadata,
audit_id: str,
site: str,
) -> str | None:
"""Save page metadata to Notion database."""
try:
# Build properties
properties = {
"Issue": {"title": [{"text": {"content": f"📄 {metadata.url}"}}]},
"Category": {"select": {"name": "On-page SEO"}},
"Priority": {"select": {"name": self._determine_priority(metadata)}},
"Site": {"url": site},
"URL": {"url": metadata.url},
"Audit ID": {"rich_text": [{"text": {"content": audit_id}}]},
"Found Date": {"date": {"start": datetime.now().strftime("%Y-%m-%d")}},
}
# Build page content
children = self._build_page_content(metadata)
response = self.notion.pages.create(
parent={"database_id": self.database_id},
properties=properties,
children=children,
)
return response["id"]
except Exception as e:
logger.error(f"Failed to save to Notion: {e}")
return None
def _determine_priority(self, metadata: PageMetadata) -> str:
"""Determine priority based on issues found."""
if len(metadata.issues) >= 3:
return "High"
elif len(metadata.issues) >= 1:
return "Medium"
elif len(metadata.warnings) >= 3:
return "Medium"
else:
return "Low"
def _build_page_content(self, metadata: PageMetadata) -> list[dict]:
"""Build Notion page content blocks from metadata."""
children = []
# Status summary callout
        status_emoji = "✅" if not metadata.issues else "⚠️" if len(metadata.issues) < 3 else "❌"
children.append({
"object": "block",
"type": "callout",
"callout": {
"rich_text": [
{"type": "text", "text": {"content": f"Status: {metadata.status_code} | "}},
{"type": "text", "text": {"content": f"Response: {metadata.response_time_ms:.0f}ms | "}},
{"type": "text", "text": {"content": f"Issues: {len(metadata.issues)} | "}},
{"type": "text", "text": {"content": f"Warnings: {len(metadata.warnings)}"}},
],
"icon": {"type": "emoji", "emoji": status_emoji},
"color": "gray_background" if not metadata.issues else "yellow_background" if len(metadata.issues) < 3 else "red_background",
}
})
# Meta Tags Section
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Meta Tags"}}]}
})
# Meta tags table
meta_rows = [
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Tag"}, "annotations": {"bold": True}}],
[{"type": "text", "text": {"content": "Value"}, "annotations": {"bold": True}}],
[{"type": "text", "text": {"content": "Status"}, "annotations": {"bold": True}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Title"}}],
[{"type": "text", "text": {"content": (metadata.title or "")[:50]}}],
[{"type": "text", "text": {"content": f"{metadata.title_length} chars" if metadata.title else "✗ Missing"}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Description"}}],
[{"type": "text", "text": {"content": (metadata.meta_description or "")[:50]}}],
[{"type": "text", "text": {"content": f"{metadata.meta_description_length} chars" if metadata.meta_description else "✗ Missing"}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Canonical"}}],
[{"type": "text", "text": {"content": (metadata.canonical_url or "")[:50]}}],
                [{"type": "text", "text": {"content": "✅" if metadata.canonical_url else "✗ Missing"}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Robots"}}],
[{"type": "text", "text": {"content": metadata.robots_meta or ""}}],
                [{"type": "text", "text": {"content": "✅" if metadata.robots_meta else "⚠️"}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Lang"}}],
[{"type": "text", "text": {"content": metadata.html_lang or ""}}],
                [{"type": "text", "text": {"content": "✅" if metadata.html_lang else "⚠️"}}],
]}},
]
children.append({
"object": "block",
"type": "table",
"table": {
"table_width": 3,
"has_column_header": True,
"has_row_header": False,
"children": meta_rows
}
})
# Headings Section
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Headings"}}]}
})
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {"rich_text": [
{"type": "text", "text": {"content": f"H1: {metadata.h1_count} | "}},
{"type": "text", "text": {"content": f"Total headings: {len(metadata.headings)}"}},
]}
})
if metadata.h1_text:
children.append({
"object": "block",
"type": "quote",
"quote": {"rich_text": [{"type": "text", "text": {"content": metadata.h1_text[:200]}}]}
})
# Schema Data Section
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Structured Data"}}]}
})
if metadata.schema_types_found:
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {"rich_text": [
{"type": "text", "text": {"content": "Schema types found: "}},
{"type": "text", "text": {"content": ", ".join(metadata.schema_types_found)}, "annotations": {"code": True}},
]}
})
else:
children.append({
"object": "block",
"type": "callout",
"callout": {
"rich_text": [{"type": "text", "text": {"content": "No structured data found on this page"}}],
"icon": {"type": "emoji", "emoji": "⚠️"},
"color": "yellow_background",
}
})
# Open Graph Section
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Open Graph"}}]}
})
og = metadata.open_graph
og_status = "✓ Configured" if og.og_title else "✗ Missing"
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {"rich_text": [
{"type": "text", "text": {"content": f"Status: {og_status}\n"}},
{"type": "text", "text": {"content": f"og:title: {og.og_title or ''}\n"}},
{"type": "text", "text": {"content": f"og:type: {og.og_type or ''}"}},
]}
})
# Links Section
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Links"}}]}
})
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {"rich_text": [
{"type": "text", "text": {"content": f"Internal links: {metadata.internal_link_count}\n"}},
{"type": "text", "text": {"content": f"External links: {metadata.external_link_count}"}},
]}
})
# Images Section
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Images"}}]}
})
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {"rich_text": [
{"type": "text", "text": {"content": f"Total: {metadata.images_total} | "}},
{"type": "text", "text": {"content": f"With alt: {metadata.images_with_alt} | "}},
{"type": "text", "text": {"content": f"Without alt: {metadata.images_without_alt}"}},
]}
})
# Hreflang Section (if present)
if metadata.hreflang_tags:
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Hreflang Tags"}}]}
})
for tag in metadata.hreflang_tags[:10]:
children.append({
"object": "block",
"type": "bulleted_list_item",
"bulleted_list_item": {"rich_text": [
{"type": "text", "text": {"content": f"{tag['lang']}: "}},
{"type": "text", "text": {"content": tag['url'], "link": {"url": tag['url']}}},
]}
})
# Issues & Warnings Section
if metadata.issues or metadata.warnings:
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Issues & Warnings"}}]}
})
for issue in metadata.issues:
children.append({
"object": "block",
"type": "to_do",
"to_do": {
"rich_text": [
                            {"type": "text", "text": {"content": "❌ "}, "annotations": {"bold": True}},
{"type": "text", "text": {"content": issue}},
],
"checked": False,
}
})
for warning in metadata.warnings:
children.append({
"object": "block",
"type": "to_do",
"to_do": {
"rich_text": [
{"type": "text", "text": {"content": "⚠️ "}, "annotations": {"bold": True}},
{"type": "text", "text": {"content": warning}},
],
"checked": False,
}
})
return children
def _create_crawl_summary_page(self, result: CrawlResult) -> str | None:
"""Create a summary page for the crawl."""
try:
site_domain = urlparse(result.site).netloc
# Calculate statistics
total_issues = sum(len(p.issues) for p in result.pages_analyzed)
total_warnings = sum(len(p.warnings) for p in result.pages_analyzed)
pages_with_issues = sum(1 for p in result.pages_analyzed if p.issues)
pages_without_schema = sum(1 for p in result.pages_analyzed if not p.schema_types_found)
pages_without_description = sum(1 for p in result.pages_analyzed if not p.meta_description)
children = []
# Header callout
children.append({
"object": "block",
"type": "callout",
"callout": {
"rich_text": [
{"type": "text", "text": {"content": f"Sitemap Crawl Complete\n\n"}},
{"type": "text", "text": {"content": f"Audit ID: {result.audit_id}\n"}},
{"type": "text", "text": {"content": f"Duration: {result.get_duration()}\n"}},
{"type": "text", "text": {"content": f"Pages: {result.successful_pages}/{result.total_pages}"}},
],
"icon": {"type": "emoji", "emoji": "📊"},
"color": "blue_background",
}
})
# Statistics table
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Statistics"}}]}
})
stats_rows = [
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Metric"}, "annotations": {"bold": True}}],
[{"type": "text", "text": {"content": "Count"}, "annotations": {"bold": True}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Total Pages"}}],
[{"type": "text", "text": {"content": str(result.total_pages)}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Successfully Analyzed"}}],
[{"type": "text", "text": {"content": str(result.successful_pages)}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Pages with Issues"}}],
[{"type": "text", "text": {"content": str(pages_with_issues)}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Total Issues"}}],
[{"type": "text", "text": {"content": str(total_issues)}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Total Warnings"}}],
[{"type": "text", "text": {"content": str(total_warnings)}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Pages without Schema"}}],
[{"type": "text", "text": {"content": str(pages_without_schema)}}],
]}},
{"type": "table_row", "table_row": {"cells": [
[{"type": "text", "text": {"content": "Pages without Description"}}],
[{"type": "text", "text": {"content": str(pages_without_description)}}],
]}},
]
children.append({
"object": "block",
"type": "table",
"table": {
"table_width": 2,
"has_column_header": True,
"has_row_header": False,
"children": stats_rows
}
})
# Pages list
children.append({
"object": "block",
"type": "heading_2",
"heading_2": {"rich_text": [{"type": "text", "text": {"content": "Analyzed Pages"}}]}
})
children.append({
"object": "block",
"type": "paragraph",
"paragraph": {"rich_text": [
{"type": "text", "text": {"content": f"Filter by Audit ID in the database to see all {result.successful_pages} page entries."}}
]}
})
# Create the summary page
response = self.notion.pages.create(
parent={"database_id": self.database_id},
properties={
"Issue": {"title": [{"text": {"content": f"📊 Sitemap Crawl: {site_domain}"}}]},
"Category": {"select": {"name": "Technical SEO"}},
"Priority": {"select": {"name": "High"}},
"Site": {"url": result.site},
"Audit ID": {"rich_text": [{"text": {"content": result.audit_id}}]},
"Found Date": {"date": {"start": datetime.now().strftime("%Y-%m-%d")}},
},
children=children,
)
logger.info(f"Created crawl summary page: {response['id']}")
return response["id"]
except Exception as e:
logger.error(f"Failed to create summary page: {e}")
return None
def print_progress_status(progress: CrawlProgress) -> None:
"""Print formatted progress status."""
status_emoji = {
"running": "🔄",
        "completed": "✅",
        "failed": "❌",
}.get(progress.status, "")
print(f"""
{'=' * 60}
{status_emoji} SEO Page Analysis - {progress.status.upper()}
{'=' * 60}
Audit ID: {progress.audit_id}
Site: {progress.site}
Status: {progress.status}
Progress: {progress.processed_urls}/{progress.total_urls} pages ({progress.get_progress_percent():.1f}%)
Successful: {progress.successful_urls}
Failed: {progress.failed_urls}
Elapsed: {progress.get_elapsed_time()}
ETA: {progress.get_eta() if progress.status == 'running' else 'N/A'}
Current URL: {progress.current_url[:60] + '...' if len(progress.current_url) > 60 else progress.current_url}
""")
if progress.summary_page_id:
print(f"Summary: https://www.notion.so/{progress.summary_page_id.replace('-', '')}")
if progress.error_message:
print(f"Error: {progress.error_message}")
print("=" * 60)
def main():
"""CLI entry point."""
import argparse
parser = argparse.ArgumentParser(description="Sitemap Crawler with Background Support")
subparsers = parser.add_subparsers(dest="command", help="Commands")
# Crawl command
crawl_parser = subparsers.add_parser("crawl", help="Start crawling a sitemap")
crawl_parser.add_argument("sitemap_url", help="URL of the sitemap to crawl")
crawl_parser.add_argument("--delay", "-d", type=float, default=DEFAULT_DELAY_SECONDS,
help=f"Delay between requests in seconds (default: {DEFAULT_DELAY_SECONDS})")
crawl_parser.add_argument("--max-pages", "-m", type=int, default=DEFAULT_MAX_PAGES,
help=f"Maximum pages to process (default: {DEFAULT_MAX_PAGES})")
crawl_parser.add_argument("--no-notion", action="store_true",
help="Don't save to Notion")
crawl_parser.add_argument("--no-limit", action="store_true",
help="Remove page limit (use with caution)")
# Status command
status_parser = subparsers.add_parser("status", help="Check crawl progress")
status_parser.add_argument("audit_id", nargs="?", help="Specific audit ID to check (optional)")
status_parser.add_argument("--all", "-a", action="store_true", help="Show all crawls (not just active)")
# List command
list_parser = subparsers.add_parser("list", help="List all crawl jobs")
    # Shorthand: allow `python sitemap_crawler.py <sitemap_url>` without an
    # explicit subcommand. This must happen before parse_args(), otherwise
    # argparse rejects the URL as an invalid subcommand choice.
    import sys
    known_commands = {"crawl", "status", "list", "-h", "--help"}
    if len(sys.argv) > 1 and sys.argv[1] not in known_commands and (
        sys.argv[1].startswith("http") or sys.argv[1].endswith(".xml")
    ):
        sys.argv.insert(1, "crawl")
    args = parser.parse_args()
    if args.command is None:
        parser.print_help()
        return
if args.command == "status":
if args.audit_id:
# Show specific crawl status
progress = get_crawl_status(args.audit_id)
if progress:
print_progress_status(progress)
else:
print(f"No crawl found with audit ID: {args.audit_id}")
else:
# Show active crawls
if args.all:
crawls = get_all_crawls()
label = "All"
else:
crawls = get_active_crawls()
label = "Active"
if crawls:
print(f"\n{label} Crawl Jobs ({len(crawls)}):")
print("-" * 60)
for p in crawls:
                    status_emoji = {"running": "🔄", "completed": "✅", "failed": "❌"}.get(p.status, "")
print(f"{status_emoji} {p.audit_id}")
print(f" Site: {p.site}")
print(f" Progress: {p.processed_urls}/{p.total_urls} ({p.get_progress_percent():.1f}%)")
print()
else:
print(f"No {label.lower()} crawl jobs found.")
return
if args.command == "list":
crawls = get_all_crawls()
if crawls:
print(f"\nAll Crawl Jobs ({len(crawls)}):")
print("-" * 80)
print(f"{'Status':<10} {'Audit ID':<45} {'Progress':<15}")
print("-" * 80)
            for p in crawls[:20]:  # Show the 20 most recent
                status_emoji = {"running": "🔄", "completed": "✅", "failed": "❌"}.get(p.status, "")
progress_str = f"{p.processed_urls}/{p.total_urls}"
print(f"{status_emoji} {p.status:<7} {p.audit_id:<45} {progress_str:<15}")
if len(crawls) > 20:
print(f"... and {len(crawls) - 20} more")
else:
print("No crawl jobs found.")
return
if args.command == "crawl":
# Handle --no-limit option
max_pages = args.max_pages
if args.no_limit:
max_pages = 999999 # Effectively unlimited
print("⚠️ WARNING: Page limit disabled. This may take a very long time!")
def progress_callback(progress: CrawlProgress):
pct = progress.get_progress_percent()
print(f"\r[{pct:5.1f}%] {progress.processed_urls}/{progress.total_urls} pages | "
f"Success: {progress.successful_urls} | Failed: {progress.failed_urls} | "
f"ETA: {progress.get_eta()}", end="", flush=True)
crawler = SitemapCrawler()
result = crawler.crawl_sitemap(
args.sitemap_url,
delay=args.delay,
max_pages=max_pages,
progress_callback=progress_callback,
save_to_notion=not args.no_notion,
)
print() # New line after progress
print()
print("=" * 60)
print("CRAWL COMPLETE")
print("=" * 60)
print(f"Audit ID: {result.audit_id}")
print(f"Total Pages: {result.total_pages}")
print(f"Successful: {result.successful_pages}")
print(f"Failed: {result.failed_pages}")
print(f"Duration: {result.get_duration()}")
if result.summary_page_id:
print(f"Summary Page: https://www.notion.so/{result.summary_page_id.replace('-', '')}")
if __name__ == "__main__":
main()

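The order-preserving de-duplication used in `fetch_sitemap_urls()` above can be exercised standalone (a minimal sketch with made-up URLs; in Python 3.7+, `list(dict.fromkeys(urls))` is an equivalent one-liner):

```python
# Order-preserving de-duplication, as in fetch_sitemap_urls():
# keep the first occurrence of each URL and drop later repeats.
urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/",
    "https://example.com/contact",
    "https://example.com/about",
]

seen = set()
unique_urls = []
for url in urls:
    if url not in seen:
        seen.add(url)
        unique_urls.append(url)

print(unique_urls)
# → ['https://example.com/', 'https://example.com/about', 'https://example.com/contact']
```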

@@ -0,0 +1,467 @@
"""
Sitemap Validator - Validate XML sitemaps
==========================================
Purpose: Parse and validate XML sitemaps for SEO compliance
Python: 3.10+
Usage:
python sitemap_validator.py --url https://example.com/sitemap.xml
"""
import argparse
import asyncio
import gzip
import json
import logging
import re
from dataclasses import dataclass, field
from datetime import datetime
from io import BytesIO
from typing import Any
from urllib.parse import urljoin, urlparse
import aiohttp
import requests
from lxml import etree
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
@dataclass
class SitemapIssue:
"""Represents a sitemap validation issue."""
severity: str # "error", "warning", "info"
message: str
url: str | None = None
suggestion: str | None = None
@dataclass
class SitemapEntry:
"""Represents a single URL entry in sitemap."""
loc: str
lastmod: str | None = None
changefreq: str | None = None
priority: float | None = None
status_code: int | None = None
@dataclass
class SitemapResult:
"""Complete sitemap validation result."""
url: str
sitemap_type: str # "urlset" or "sitemapindex"
entries: list[SitemapEntry] = field(default_factory=list)
child_sitemaps: list[str] = field(default_factory=list)
issues: list[SitemapIssue] = field(default_factory=list)
valid: bool = True
stats: dict = field(default_factory=dict)
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
def to_dict(self) -> dict:
"""Convert to dictionary for JSON output."""
return {
"url": self.url,
"sitemap_type": self.sitemap_type,
"valid": self.valid,
"stats": self.stats,
"issues": [
{
"severity": i.severity,
"message": i.message,
"url": i.url,
"suggestion": i.suggestion,
}
for i in self.issues
],
"entries_count": len(self.entries),
"child_sitemaps": self.child_sitemaps,
"timestamp": self.timestamp,
}
class SitemapValidator:
"""Validate XML sitemaps."""
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000
MAX_SIZE_BYTES = 50 * 1024 * 1024 # 50MB
VALID_CHANGEFREQ = {
"always", "hourly", "daily", "weekly",
"monthly", "yearly", "never"
}
def __init__(self, check_urls: bool = False, max_concurrent: int = 10):
self.check_urls = check_urls
self.max_concurrent = max_concurrent
self.session = requests.Session()
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (compatible; SEOAuditBot/1.0)"
})
def fetch_sitemap(self, url: str) -> tuple[bytes, bool]:
"""Fetch sitemap content, handling gzip compression."""
try:
response = self.session.get(url, timeout=30)
response.raise_for_status()
content = response.content
is_gzipped = False
# Check if gzipped
if url.endswith(".gz") or response.headers.get(
"Content-Encoding"
) == "gzip":
try:
content = gzip.decompress(content)
is_gzipped = True
                except gzip.BadGzipFile:
                    # Not actually gzip-compressed (requests may have already
                    # decoded a gzip Content-Encoding); keep the raw bytes
                    pass
return content, is_gzipped
except requests.RequestException as e:
raise RuntimeError(f"Failed to fetch sitemap: {e}")
def parse_sitemap(self, content: bytes) -> tuple[str, list[dict]]:
"""Parse sitemap XML content."""
try:
root = etree.fromstring(content)
except etree.XMLSyntaxError as e:
raise ValueError(f"Invalid XML: {e}")
        # Namespace map for element lookups
        nsmap = {"sm": self.SITEMAP_NS}
# Check if it's a sitemap index or urlset
if root.tag == f"{{{self.SITEMAP_NS}}}sitemapindex":
sitemap_type = "sitemapindex"
entries = []
for sitemap in root.findall("sm:sitemap", nsmap):
entry = {}
loc = sitemap.find("sm:loc", nsmap)
if loc is not None and loc.text:
entry["loc"] = loc.text.strip()
lastmod = sitemap.find("sm:lastmod", nsmap)
if lastmod is not None and lastmod.text:
entry["lastmod"] = lastmod.text.strip()
if entry.get("loc"):
entries.append(entry)
elif root.tag == f"{{{self.SITEMAP_NS}}}urlset":
sitemap_type = "urlset"
entries = []
for url in root.findall("sm:url", nsmap):
entry = {}
loc = url.find("sm:loc", nsmap)
if loc is not None and loc.text:
entry["loc"] = loc.text.strip()
lastmod = url.find("sm:lastmod", nsmap)
if lastmod is not None and lastmod.text:
entry["lastmod"] = lastmod.text.strip()
changefreq = url.find("sm:changefreq", nsmap)
if changefreq is not None and changefreq.text:
entry["changefreq"] = changefreq.text.strip().lower()
priority = url.find("sm:priority", nsmap)
if priority is not None and priority.text:
try:
entry["priority"] = float(priority.text.strip())
except ValueError:
entry["priority"] = None
if entry.get("loc"):
entries.append(entry)
else:
raise ValueError(f"Unknown sitemap type: {root.tag}")
return sitemap_type, entries
def validate(self, url: str) -> SitemapResult:
"""Validate a sitemap URL."""
result = SitemapResult(url=url, sitemap_type="unknown")
# Fetch sitemap
try:
content, is_gzipped = self.fetch_sitemap(url)
except RuntimeError as e:
result.issues.append(SitemapIssue(
severity="error",
message=str(e),
url=url,
))
result.valid = False
return result
# Check size
if len(content) > self.MAX_SIZE_BYTES:
result.issues.append(SitemapIssue(
severity="error",
message=f"Sitemap exceeds 50MB limit ({len(content) / 1024 / 1024:.2f}MB)",
url=url,
suggestion="Split sitemap into smaller files using sitemap index",
))
# Parse XML
try:
sitemap_type, entries = self.parse_sitemap(content)
except ValueError as e:
result.issues.append(SitemapIssue(
severity="error",
message=str(e),
url=url,
))
result.valid = False
return result
result.sitemap_type = sitemap_type
# Process entries
if sitemap_type == "sitemapindex":
result.child_sitemaps = [e["loc"] for e in entries]
result.stats = {
"child_sitemaps_count": len(entries),
}
else:
# Validate URL entries
url_count = len(entries)
result.stats["url_count"] = url_count
if url_count > self.MAX_URLS:
result.issues.append(SitemapIssue(
severity="error",
message=f"Sitemap exceeds 50,000 URL limit ({url_count} URLs)",
url=url,
suggestion="Split into multiple sitemaps with sitemap index",
))
if url_count == 0:
result.issues.append(SitemapIssue(
severity="warning",
message="Sitemap is empty (no URLs)",
url=url,
))
# Validate individual entries
seen_urls = set()
invalid_lastmod = 0
invalid_changefreq = 0
invalid_priority = 0
for entry in entries:
loc = entry.get("loc", "")
# Check for duplicates
if loc in seen_urls:
result.issues.append(SitemapIssue(
severity="warning",
message="Duplicate URL in sitemap",
url=loc,
))
seen_urls.add(loc)
# Validate lastmod format
lastmod = entry.get("lastmod")
if lastmod:
if not self._validate_date(lastmod):
invalid_lastmod += 1
# Validate changefreq
changefreq = entry.get("changefreq")
if changefreq and changefreq not in self.VALID_CHANGEFREQ:
invalid_changefreq += 1
# Validate priority
priority = entry.get("priority")
if priority is not None:
if not (0.0 <= priority <= 1.0):
invalid_priority += 1
# Create entry object
result.entries.append(SitemapEntry(
loc=loc,
lastmod=lastmod,
changefreq=changefreq,
priority=priority,
))
# Add summary issues
if invalid_lastmod > 0:
result.issues.append(SitemapIssue(
severity="warning",
message=f"{invalid_lastmod} URLs with invalid lastmod format",
suggestion="Use ISO 8601 format (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+TZ)",
))
if invalid_changefreq > 0:
result.issues.append(SitemapIssue(
severity="info",
message=f"{invalid_changefreq} URLs with invalid changefreq",
suggestion="Use: always, hourly, daily, weekly, monthly, yearly, never",
))
if invalid_priority > 0:
result.issues.append(SitemapIssue(
severity="warning",
message=f"{invalid_priority} URLs with invalid priority (must be 0.0-1.0)",
))
result.stats.update({
"invalid_lastmod": invalid_lastmod,
"invalid_changefreq": invalid_changefreq,
"invalid_priority": invalid_priority,
"has_lastmod": sum(1 for e in result.entries if e.lastmod),
"has_changefreq": sum(1 for e in result.entries if e.changefreq),
"has_priority": sum(1 for e in result.entries if e.priority is not None),
})
# Check URLs if requested
if self.check_urls and result.entries:
asyncio.run(self._check_url_status(result))
# Determine validity
result.valid = not any(i.severity == "error" for i in result.issues)
return result
def _validate_date(self, date_str: str) -> bool:
"""Validate ISO 8601 date format."""
patterns = [
r"^\d{4}-\d{2}-\d{2}$",
r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}",
]
return any(re.match(p, date_str) for p in patterns)
    async def _check_url_status(self, result: SitemapResult) -> None:
        """Check HTTP status of URLs in sitemap."""
        semaphore = asyncio.Semaphore(self.max_concurrent)
        # Share one session (and its connection pool) across all requests
        # instead of opening a fresh ClientSession per URL
        async with aiohttp.ClientSession() as session:
            async def check_url(entry: SitemapEntry) -> None:
                async with semaphore:
                    try:
                        async with session.head(
                            entry.loc,
                            timeout=aiohttp.ClientTimeout(total=10),
                            allow_redirects=True,
                        ) as response:
                            entry.status_code = response.status
                    except Exception:
                        entry.status_code = 0
            # Spot-check at most the first 100 entries to keep runtime bounded
            await asyncio.gather(*[check_url(e) for e in result.entries[:100]])
# Count status codes
status_counts = {}
for entry in result.entries:
if entry.status_code:
status_counts[entry.status_code] = (
status_counts.get(entry.status_code, 0) + 1
)
result.stats["url_status_codes"] = status_counts
# Add issues for non-200 URLs
error_count = sum(
1 for e in result.entries
if e.status_code and e.status_code >= 400
)
if error_count > 0:
result.issues.append(SitemapIssue(
severity="warning",
message=f"{error_count} URLs returning error status codes (4xx/5xx)",
suggestion="Remove or fix broken URLs in sitemap",
))
def generate_report(self, result: SitemapResult) -> str:
"""Generate human-readable validation report."""
lines = [
"=" * 60,
"Sitemap Validation Report",
"=" * 60,
f"URL: {result.url}",
f"Type: {result.sitemap_type}",
f"Valid: {'Yes' if result.valid else 'No'}",
f"Timestamp: {result.timestamp}",
"",
]
lines.append("Statistics:")
for key, value in result.stats.items():
lines.append(f" {key}: {value}")
lines.append("")
if result.child_sitemaps:
lines.append(f"Child Sitemaps ({len(result.child_sitemaps)}):")
for sitemap in result.child_sitemaps[:10]:
lines.append(f" - {sitemap}")
if len(result.child_sitemaps) > 10:
lines.append(f" ... and {len(result.child_sitemaps) - 10} more")
lines.append("")
if result.issues:
lines.append("Issues Found:")
errors = [i for i in result.issues if i.severity == "error"]
warnings = [i for i in result.issues if i.severity == "warning"]
infos = [i for i in result.issues if i.severity == "info"]
if errors:
lines.append(f"\n ERRORS ({len(errors)}):")
for issue in errors:
lines.append(f" - {issue.message}")
if issue.url:
lines.append(f" URL: {issue.url}")
if issue.suggestion:
lines.append(f" Suggestion: {issue.suggestion}")
if warnings:
lines.append(f"\n WARNINGS ({len(warnings)}):")
for issue in warnings:
lines.append(f" - {issue.message}")
if issue.suggestion:
lines.append(f" Suggestion: {issue.suggestion}")
if infos:
lines.append(f"\n INFO ({len(infos)}):")
for issue in infos:
lines.append(f" - {issue.message}")
lines.append("")
lines.append("=" * 60)
return "\n".join(lines)
def main():
"""Main entry point for CLI usage."""
parser = argparse.ArgumentParser(
description="Validate XML sitemaps",
)
parser.add_argument("--url", "-u", required=True, help="Sitemap URL to validate")
parser.add_argument("--check-urls", action="store_true",
help="Check HTTP status of URLs (slower)")
parser.add_argument("--output", "-o", help="Output file for JSON report")
parser.add_argument("--json", action="store_true", help="Output as JSON")
args = parser.parse_args()
validator = SitemapValidator(check_urls=args.check_urls)
result = validator.validate(args.url)
if args.json or args.output:
output = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(output)
logger.info(f"Report written to {args.output}")
else:
print(output)
else:
print(validator.generate_report(result))
if __name__ == "__main__":
main()
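The URL check above fans out requests under an `asyncio.Semaphore` so no more than `max_concurrent` are in flight at once. A minimal, dependency-free sketch of that same bounded fan-out pattern, with `asyncio.sleep` standing in for the HTTP call (all names here are illustrative, not part of the skill's API):

```python
import asyncio

async def fetch_all(urls, max_concurrent=3):
    """Run one pseudo-request per URL, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results = {}

    async def fetch(url):
        async with semaphore:
            # Stand-in for session.head(url); real code records response.status.
            await asyncio.sleep(0.01)
            results[url] = 200

    await asyncio.gather(*[fetch(u) for u in urls])
    return results

statuses = asyncio.run(fetch_all([f"https://example.com/page{i}" for i in range(5)]))
print(len(statuses))  # 5
```

The semaphore only limits concurrency; `gather` still schedules every coroutine up front, which is why the validator additionally caps the list at 100 entries.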


@@ -0,0 +1,88 @@
{
"_comment": "Default OurDigital SEO Audit Log Database",
"database_id": "2c8581e5-8a1e-8035-880b-e38cefc2f3ef",
"url": "https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef",
"properties": {
"Issue": {
"type": "title",
"description": "Primary identifier - issue title"
},
"Site": {
"type": "url",
"description": "The audited site URL (e.g., https://blog.ourdigital.org)"
},
"Category": {
"type": "select",
"options": [
{ "name": "Technical SEO", "color": "blue" },
{ "name": "On-page SEO", "color": "green" },
{ "name": "Content", "color": "purple" },
{ "name": "Local SEO", "color": "orange" },
{ "name": "Performance", "color": "red" },
{ "name": "Schema/Structured Data", "color": "yellow" },
{ "name": "Sitemap", "color": "pink" },
{ "name": "Robots.txt", "color": "gray" }
]
},
"Priority": {
"type": "select",
"options": [
{ "name": "Critical", "color": "red" },
{ "name": "High", "color": "orange" },
{ "name": "Medium", "color": "yellow" },
{ "name": "Low", "color": "gray" }
]
},
"Status": {
"type": "status",
"description": "Managed by Notion - default options: Not started, In progress, Done"
},
"URL": {
"type": "url",
"description": "Specific page with the issue"
},
"Found Date": {
"type": "date",
"description": "When issue was discovered"
},
"Audit ID": {
"type": "rich_text",
"description": "Groups findings from same audit session (format: domain-YYYYMMDD-HHMMSS)"
}
},
"page_content_template": {
"_comment": "Each finding page contains the following content blocks",
"blocks": [
{
"type": "heading_2",
"content": "Description"
},
{
"type": "paragraph",
"content": "{{description}}"
},
{
"type": "heading_2",
"content": "Impact"
},
{
"type": "callout",
"icon": "⚠️",
"content": "{{impact}}"
},
{
"type": "heading_2",
"content": "Recommendation"
},
{
"type": "callout",
"icon": "💡",
"content": "{{recommendation}}"
}
]
},
"description": "Centralized SEO audit findings with categorized issues, priorities, and tracking status. Generated by ourdigital-seo-audit skill."
}
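The property names and types above map directly onto a Notion API page `properties` payload. A hedged sketch of building one finding's payload (the actual builder in `notion_reporter.py` may differ; the field values are illustrative):

```python
def build_finding_properties(issue, site, category, priority, url, found_date, audit_id):
    """Build a Notion API 'properties' dict matching the audit-log schema."""
    return {
        "Issue": {"title": [{"text": {"content": issue}}]},
        "Site": {"url": site},
        "Category": {"select": {"name": category}},
        "Priority": {"select": {"name": priority}},
        "URL": {"url": url},
        "Found Date": {"date": {"start": found_date}},
        "Audit ID": {"rich_text": [{"text": {"content": audit_id}}]},
    }

props = build_finding_properties(
    "Missing meta description", "https://blog.ourdigital.org",
    "On-page SEO", "High", "https://blog.ourdigital.org/post",
    "2025-12-22", "blog.ourdigital.org-20251222-020141",
)
print(props["Priority"]["select"]["name"])  # High
```

Note that `Status` is omitted: it is a Notion-managed status property, and `Category`/`Priority` values must match the select options defined in the schema exactly.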


@@ -0,0 +1,32 @@
{
"@context": "https://schema.org",
"@type": "{{article_type}}",
"headline": "{{headline}}",
"description": "{{description}}",
"image": [
"{{image_url_1}}",
"{{image_url_2}}"
],
"datePublished": "{{date_published}}",
"dateModified": "{{date_modified}}",
"author": {
"@type": "Person",
"name": "{{author_name}}",
"url": "{{author_url}}"
},
"publisher": {
"@type": "Organization",
"name": "{{publisher_name}}",
"logo": {
"@type": "ImageObject",
"url": "{{publisher_logo_url}}"
}
},
"mainEntityOfPage": {
"@type": "WebPage",
"@id": "{{page_url}}"
},
"articleSection": "{{section}}",
"wordCount": "{{word_count}}",
"keywords": "{{keywords}}"
}
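Because every `{{placeholder}}` in these templates sits inside a JSON string, the file stays parseable before and after substitution. A minimal sketch of the fill-and-verify step (`schema_generator.py` may implement this differently; the template excerpt and values are illustrative):

```python
import json

template = """{
  "@context": "https://schema.org",
  "@type": "{{article_type}}",
  "headline": "{{headline}}"
}"""

values = {"article_type": "BlogPosting", "headline": "Hello, SEO"}

# Plain string substitution: each {{key}} becomes its value.
filled = template
for key, val in values.items():
    filled = filled.replace("{{" + key + "}}", val)

data = json.loads(filled)  # must still parse as valid JSON after substitution
print(data["@type"])  # BlogPosting
```

Round-tripping through `json.loads` is a cheap sanity check that a substituted value did not break the markup (for example, by containing an unescaped quote).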


@@ -0,0 +1,24 @@
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"name": "{{level_1_name}}",
"item": "{{level_1_url}}"
},
{
"@type": "ListItem",
"position": 2,
"name": "{{level_2_name}}",
"item": "{{level_2_url}}"
},
{
"@type": "ListItem",
"position": 3,
"name": "{{level_3_name}}",
"item": "{{level_3_url}}"
}
]
}


@@ -0,0 +1,30 @@
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "{{question_1}}",
"acceptedAnswer": {
"@type": "Answer",
"text": "{{answer_1}}"
}
},
{
"@type": "Question",
"name": "{{question_2}}",
"acceptedAnswer": {
"@type": "Answer",
"text": "{{answer_2}}"
}
},
{
"@type": "Question",
"name": "{{question_3}}",
"acceptedAnswer": {
"@type": "Answer",
"text": "{{answer_3}}"
}
}
]
}


@@ -0,0 +1,47 @@
{
"@context": "https://schema.org",
"@type": "{{business_type}}",
"name": "{{name}}",
"description": "{{description}}",
"url": "{{url}}",
"telephone": "{{phone}}",
"email": "{{email}}",
"image": "{{image_url}}",
"priceRange": "{{price_range}}",
"address": {
"@type": "PostalAddress",
"streetAddress": "{{street_address}}",
"addressLocality": "{{city}}",
"addressRegion": "{{region}}",
"postalCode": "{{postal_code}}",
"addressCountry": "{{country}}"
},
"geo": {
"@type": "GeoCoordinates",
"latitude": "{{latitude}}",
"longitude": "{{longitude}}"
},
"openingHoursSpecification": [
{
"@type": "OpeningHoursSpecification",
"dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
"opens": "{{weekday_opens}}",
"closes": "{{weekday_closes}}"
},
{
"@type": "OpeningHoursSpecification",
"dayOfWeek": ["Saturday", "Sunday"],
"opens": "{{weekend_opens}}",
"closes": "{{weekend_closes}}"
}
],
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "{{rating}}",
"reviewCount": "{{review_count}}"
},
"sameAs": [
"{{facebook_url}}",
"{{instagram_url}}"
]
}


@@ -0,0 +1,37 @@
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "{{name}}",
"url": "{{url}}",
"logo": "{{logo_url}}",
"description": "{{description}}",
"foundingDate": "{{founding_date}}",
"founders": [
{
"@type": "Person",
"name": "{{founder_name}}"
}
],
"address": {
"@type": "PostalAddress",
"streetAddress": "{{street_address}}",
"addressLocality": "{{city}}",
"addressRegion": "{{region}}",
"postalCode": "{{postal_code}}",
"addressCountry": "{{country}}"
},
"contactPoint": [
{
"@type": "ContactPoint",
"telephone": "{{phone}}",
"contactType": "customer service",
"availableLanguage": ["Korean", "English"]
}
],
"sameAs": [
"{{facebook_url}}",
"{{twitter_url}}",
"{{linkedin_url}}",
"{{instagram_url}}"
]
}


@@ -0,0 +1,76 @@
{
"@context": "https://schema.org",
"@type": "Product",
"name": "{{name}}",
"description": "{{description}}",
"image": [
"{{image_url_1}}",
"{{image_url_2}}",
"{{image_url_3}}"
],
"sku": "{{sku}}",
"mpn": "{{mpn}}",
"gtin13": "{{gtin13}}",
"brand": {
"@type": "Brand",
"name": "{{brand_name}}"
},
"offers": {
"@type": "Offer",
"url": "{{product_url}}",
"price": "{{price}}",
"priceCurrency": "{{currency}}",
"priceValidUntil": "{{price_valid_until}}",
"availability": "https://schema.org/{{availability}}",
"itemCondition": "https://schema.org/{{condition}}",
"seller": {
"@type": "Organization",
"name": "{{seller_name}}"
},
"shippingDetails": {
"@type": "OfferShippingDetails",
"shippingRate": {
"@type": "MonetaryAmount",
"value": "{{shipping_cost}}",
"currency": "{{currency}}"
},
"deliveryTime": {
"@type": "ShippingDeliveryTime",
"handlingTime": {
"@type": "QuantitativeValue",
"minValue": "{{handling_min_days}}",
"maxValue": "{{handling_max_days}}",
"unitCode": "DAY"
},
"transitTime": {
"@type": "QuantitativeValue",
"minValue": "{{transit_min_days}}",
"maxValue": "{{transit_max_days}}",
"unitCode": "DAY"
}
}
}
},
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "{{rating}}",
"reviewCount": "{{review_count}}",
"bestRating": "5",
"worstRating": "1"
},
"review": [
{
"@type": "Review",
"reviewRating": {
"@type": "Rating",
"ratingValue": "{{review_rating}}",
"bestRating": "5"
},
"author": {
"@type": "Person",
"name": "{{reviewer_name}}"
},
"reviewBody": "{{review_text}}"
}
]
}


@@ -0,0 +1,25 @@
{
"@context": "https://schema.org",
"@type": "WebSite",
"name": "{{site_name}}",
"alternateName": "{{alternate_name}}",
"url": "{{url}}",
"description": "{{description}}",
"inLanguage": "{{language}}",
"potentialAction": {
"@type": "SearchAction",
"target": {
"@type": "EntryPoint",
"urlTemplate": "{{search_url_template}}"
},
"query-input": "required name=search_term_string"
},
"publisher": {
"@type": "Organization",
"name": "{{publisher_name}}",
"logo": {
"@type": "ImageObject",
"url": "{{logo_url}}"
}
}
}
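All templates in this directory share the same `{{placeholder}}` convention, so the set of fields a caller must supply can be discovered mechanically rather than maintained by hand. A small sketch (illustrative helper, not part of `schema_generator.py`):

```python
import re

def required_fields(template_text):
    """Return the unique {{placeholder}} names in a template, in order of first use."""
    seen = []
    for name in re.findall(r"\{\{(\w+)\}\}", template_text):
        if name not in seen:
            seen.append(name)
    return seen

snippet = '{"name": "{{site_name}}", "url": "{{url}}", "logo": "{{logo_url}}", "alt": "{{url}}"}'
print(required_fields(snippet))  # ['site_name', 'url', 'logo_url']
```

The double-brace pattern deliberately skips single-brace tokens such as `{search_term_string}` in the WebSite template's `query-input`, which Google expects to remain literal.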