Add SEO skills 19-28, 31-32 with full Python implementations
12 new skills: Keyword Strategy, SERP Analysis, Position Tracking, Link Building, Content Strategy, E-Commerce SEO, KPI Framework, International SEO, AI Visibility, Knowledge Graph, Competitor Intel, and Crawl Budget. ~20K lines of Python across 25 domain scripts.

Updated skill 11 pipeline table and repo CLAUDE.md. Enhanced skill 18 local SEO workflow from the jamie.clinic audit.

Note: skill 26 hreflang_validator.py pending (content filter block).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
139
custom-skills/28-seo-knowledge-graph/code/CLAUDE.md
Normal file
@@ -0,0 +1,139 @@
# CLAUDE.md

## Overview

Knowledge Graph and Entity SEO tool for analyzing brand entity presence in the Google Knowledge Graph, Knowledge Panels, People Also Ask (PAA), and FAQ rich results. Checks entity attribute completeness, Wikipedia/Wikidata presence, and Korean equivalents (Naver knowledge iN, Naver encyclopedia). Uses WebSearch and WebFetch for data collection, and Ahrefs `serp-overview` for SERP feature detection.

## Quick Start

```bash
pip install -r scripts/requirements.txt

# Knowledge Graph analysis
python scripts/knowledge_graph_analyzer.py --entity "Samsung Electronics" --json

# Entity SEO audit
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --json
```

## Scripts

| Script | Purpose | Key Output |
|--------|---------|------------|
| `knowledge_graph_analyzer.py` | Analyze Knowledge Panel and entity presence | KP detection, entity attributes, Wikipedia/Wikidata status |
| `entity_auditor.py` | Audit entity SEO signals and PAA/FAQ presence | PAA monitoring, FAQ schema tracking, entity completeness |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
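
The `RateLimiter` in `base_client.py` is a token bucket: tokens refill continuously at `rate` per `per` seconds and each request consumes one, so short bursts are allowed while the sustained rate stays capped. A self-contained sketch of the same pattern (simplified, with an injectable clock so the refill math can be tested without sleeping; this is illustrative, not the module's exact code):

```python
import asyncio
import time


class TokenBucket:
    """Simplified token bucket: `rate` tokens replenished every `per` seconds."""

    def __init__(self, rate: float, per: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.per = per
        self.tokens = rate  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        # Tokens accrue proportionally to elapsed time, capped at bucket size.
        now = self.clock()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate / self.per)
        self.last = now

    async def acquire(self) -> None:
        """Take one token, sleeping until one would be available."""
        self._refill()
        if self.tokens < 1:
            await asyncio.sleep((1 - self.tokens) * self.per / self.rate)
            self.tokens = 0
        else:
            self.tokens -= 1
```

With `rate=3.0` this caps sustained throughput at roughly three requests per second, which is what the default `BaseAsyncClient(requests_per_second=3.0)` configures.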

## Knowledge Graph Analyzer

```bash
# Analyze entity in Knowledge Graph
python scripts/knowledge_graph_analyzer.py --entity "Samsung Electronics" --json

# Check with Korean name
python scripts/knowledge_graph_analyzer.py --entity "삼성전자" --language ko --json

# Include Wikipedia/Wikidata check
python scripts/knowledge_graph_analyzer.py --entity "Samsung" --wiki --json
```

**Capabilities**:
- Knowledge Panel detection via Google search
- Entity attribute extraction (name, description, logo, type, social profiles, website)
- Entity attribute completeness scoring
- Wikipedia article presence check
- Wikidata entity presence check (QID lookup)
- Naver encyclopedia (네이버 백과사전) presence
- Naver knowledge iN (지식iN) presence
- Knowledge Panel comparison with competitors

## Entity Auditor

```bash
# Full entity SEO audit
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --json

# PAA monitoring for brand keywords
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --paa --json

# FAQ rich result tracking
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --faq --json
```

**Capabilities**:
- People Also Ask (PAA) monitoring for brand-related queries
- FAQ schema presence tracking (FAQPage schema -> SERP appearance)
- Entity markup audit (Organization, Person, LocalBusiness schema on website)
- Social profile linking validation (`sameAs` in schema)
- Brand SERP analysis (what appears when you search the brand name)
- Entity consistency across web properties
- Korean entity optimization (Korean Knowledge Panel, Naver profiles)
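
Completeness scoring follows the same idea as for Knowledge Panel attributes: count which expected properties a schema block actually carries. A reduced sketch (the property list here is trimmed for illustration; the script checks a longer per-type set):

```python
import json

# Trimmed property list for illustration; the auditor checks more per type.
EXPECTED = ["name", "url", "logo", "description", "sameAs"]


def completeness(schema: dict) -> float:
    """Percent of expected properties present and non-empty."""
    present = [k for k in EXPECTED if schema.get(k)]
    return round(100 * len(present) / len(EXPECTED), 1)


org = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Brand Name",
  "url": "https://example.com",
  "logo": "https://example.com/logo.png",
  "sameAs": ["https://twitter.com/brand", "https://www.linkedin.com/company/brand"]
}
""")

print(completeness(org))  # 4 of 5 expected properties present -> 80.0
```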

## Data Sources

| Source | Purpose |
|--------|---------|
| WebSearch | Search for entity/brand to detect Knowledge Panel |
| WebFetch | Fetch Wikipedia, Wikidata, Naver pages |
| Ahrefs `serp-overview` | SERP feature detection for entity keywords |

## Output Format

```json
{
  "entity": "Samsung Electronics",
  "knowledge_panel": {
    "detected": true,
    "attributes": {
      "name": "Samsung Electronics",
      "type": "Corporation",
      "description": "...",
      "logo": true,
      "website": true,
      "social_profiles": ["twitter", "facebook", "linkedin"]
    },
    "completeness_score": 85
  },
  "wikipedia": {"present": true, "url": "..."},
  "wikidata": {"present": true, "qid": "Q20710"},
  "naver_encyclopedia": {"present": true, "url": "..."},
  "naver_knowledge_in": {"present": true, "entries": 15},
  "paa_questions": [...],
  "faq_rich_results": [...],
  "entity_schema_on_site": {
    "organization": true,
    "same_as_links": 5,
    "completeness": 78
  },
  "score": 75,
  "timestamp": "2025-01-01T00:00:00"
}
```
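
The JSON is designed to be machine-readable, so downstream tooling can flag weak entity signals directly. A small sketch of that consumption (key names follow the sample above; the 70% threshold is an arbitrary example, not a value from the scripts):

```python
def entity_gaps(report: dict, threshold: int = 70) -> list[str]:
    """Return human-readable gaps found in an entity audit report."""
    gaps = []
    kp = report.get("knowledge_panel", {})
    if not kp.get("detected"):
        gaps.append("no Knowledge Panel detected")
    elif kp.get("completeness_score", 0) < threshold:
        gaps.append(f"Knowledge Panel only {kp['completeness_score']}% complete")
    if not report.get("wikidata", {}).get("present"):
        gaps.append("no Wikidata entity (QID)")
    site = report.get("entity_schema_on_site", {})
    if site.get("completeness", 0) < threshold:
        gaps.append(f"on-site entity schema only {site.get('completeness', 0)}% complete")
    return gaps


report = {
    "knowledge_panel": {"detected": True, "completeness_score": 85},
    "wikidata": {"present": False},
    "entity_schema_on_site": {"organization": True, "completeness": 60},
}
print(entity_gaps(report))
```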

## Notion Output (Required)

**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.

### Database Configuration

| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |

### Required Properties

| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Entity website URL |
| Category | Select | Knowledge Graph & Entity SEO |
| Priority | Select | Based on entity completeness |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: KG-YYYYMMDD-NNN |
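
Saving to this database can be scripted; a hedged sketch assuming the official `notion-client` package and a `NOTION_TOKEN` environment variable (matching `ConfigManager.notion_token`). The property payload shapes follow the Notion API; `audit_page_properties` and `save_report` are illustrative names, not functions from this repo:

```python
import os

DATABASE_ID = "2c8581e5-8a1e-8035-880b-e38cefc2f3ef"


def audit_page_properties(title: str, site: str, priority: str,
                          found_date: str, audit_id: str) -> dict:
    """Build the property payload matching the audit log database schema."""
    return {
        "Issue": {"title": [{"text": {"content": title}}]},
        "Site": {"url": site},
        "Category": {"select": {"name": "Knowledge Graph & Entity SEO"}},
        "Priority": {"select": {"name": priority}},
        "Found Date": {"date": {"start": found_date}},
        "Audit ID": {"rich_text": [{"text": {"content": audit_id}}]},
    }


def save_report(title: str, site: str, priority: str,
                found_date: str, audit_id: str) -> None:
    from notion_client import Client  # pip install notion-client
    client = Client(auth=os.environ["NOTION_TOKEN"])
    client.pages.create(
        parent={"database_id": DATABASE_ID},
        properties=audit_page_properties(title, site, priority, found_date, audit_id),
    )
```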

### Language Guidelines

- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Knowledge Panel, Knowledge Graph, PAA)
- URLs and code remain unchanged
207
custom-skills/28-seo-knowledge-graph/code/scripts/base_client.py
Normal file
@@ -0,0 +1,207 @@
"""
Base Client - Shared async client utilities
===========================================
Purpose: Rate-limited async operations for API clients
Python: 3.10+
"""

import asyncio
import logging
import os
from asyncio import Semaphore
from datetime import datetime
from typing import Any, Callable, TypeVar

from dotenv import load_dotenv
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

# Load environment variables
load_dotenv()

# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

T = TypeVar("T")


class RateLimiter:
    """Rate limiter using token bucket algorithm."""

    def __init__(self, rate: float, per: float = 1.0):
        """
        Initialize rate limiter.

        Args:
            rate: Number of requests allowed
            per: Time period in seconds (default: 1 second)
        """
        self.rate = rate
        self.per = per
        self.tokens = rate
        self.last_update = datetime.now()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Acquire a token, waiting if necessary."""
        async with self._lock:
            now = datetime.now()
            elapsed = (now - self.last_update).total_seconds()
            self.tokens = min(self.rate, self.tokens + elapsed * (self.rate / self.per))
            self.last_update = now

            if self.tokens < 1:
                wait_time = (1 - self.tokens) * (self.per / self.rate)
                await asyncio.sleep(wait_time)
                self.tokens = 0
            else:
                self.tokens -= 1


class BaseAsyncClient:
    """Base class for async API clients with rate limiting."""

    def __init__(
        self,
        max_concurrent: int = 5,
        requests_per_second: float = 3.0,
        logger: logging.Logger | None = None,
    ):
        """
        Initialize base client.

        Args:
            max_concurrent: Maximum concurrent requests
            requests_per_second: Rate limit
            logger: Logger instance
        """
        self.semaphore = Semaphore(max_concurrent)
        self.rate_limiter = RateLimiter(requests_per_second)
        self.logger = logger or logging.getLogger(self.__class__.__name__)
        self.stats = {
            "requests": 0,
            "success": 0,
            "errors": 0,
            "retries": 0,
        }

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type(Exception),
    )
    async def _rate_limited_request(
        self,
        coro: Callable[[], Any],
    ) -> Any:
        """Execute a request with rate limiting and retry."""
        async with self.semaphore:
            await self.rate_limiter.acquire()
            self.stats["requests"] += 1
            try:
                result = await coro()
                self.stats["success"] += 1
                return result
            except Exception as e:
                self.stats["errors"] += 1
                self.logger.error(f"Request failed: {e}")
                raise

    async def batch_requests(
        self,
        requests: list[Callable[[], Any]],
        desc: str = "Processing",
    ) -> list[Any]:
        """Execute multiple requests concurrently."""
        try:
            from tqdm.asyncio import tqdm
            has_tqdm = True
        except ImportError:
            has_tqdm = False

        async def execute(req: Callable) -> Any:
            try:
                return await self._rate_limited_request(req)
            except Exception as e:
                return {"error": str(e)}

        tasks = [execute(req) for req in requests]

        if has_tqdm:
            results = []
            for coro in tqdm.as_completed(tasks, total=len(tasks), desc=desc):
                result = await coro
                results.append(result)
            return results
        else:
            return await asyncio.gather(*tasks, return_exceptions=True)

    def print_stats(self) -> None:
        """Print request statistics."""
        self.logger.info("=" * 40)
        self.logger.info("Request Statistics:")
        self.logger.info(f"  Total Requests: {self.stats['requests']}")
        self.logger.info(f"  Successful: {self.stats['success']}")
        self.logger.info(f"  Errors: {self.stats['errors']}")
        self.logger.info("=" * 40)


class ConfigManager:
    """Manage API configuration and credentials."""

    def __init__(self):
        load_dotenv()

    @property
    def google_credentials_path(self) -> str | None:
        """Get Google service account credentials path."""
        # Prefer SEO-specific credentials, fall back to general credentials
        seo_creds = os.path.expanduser("~/.credential/ourdigital-seo-agent.json")
        if os.path.exists(seo_creds):
            return seo_creds
        return os.getenv("GOOGLE_APPLICATION_CREDENTIALS")

    @property
    def pagespeed_api_key(self) -> str | None:
        """Get PageSpeed Insights API key."""
        return os.getenv("PAGESPEED_API_KEY")

    @property
    def custom_search_api_key(self) -> str | None:
        """Get Custom Search API key."""
        return os.getenv("CUSTOM_SEARCH_API_KEY")

    @property
    def custom_search_engine_id(self) -> str | None:
        """Get Custom Search Engine ID."""
        return os.getenv("CUSTOM_SEARCH_ENGINE_ID")

    @property
    def notion_token(self) -> str | None:
        """Get Notion API token."""
        return os.getenv("NOTION_TOKEN") or os.getenv("NOTION_API_KEY")

    def validate_google_credentials(self) -> bool:
        """Validate Google credentials are configured."""
        creds_path = self.google_credentials_path
        if not creds_path:
            return False
        return os.path.exists(creds_path)

    def get_required(self, key: str) -> str:
        """Get required environment variable or raise error."""
        value = os.getenv(key)
        if not value:
            raise ValueError(f"Missing required environment variable: {key}")
        return value


# Singleton config instance
config = ConfigManager()
@@ -0,0 +1,902 @@
|
||||
"""
|
||||
Entity Auditor
|
||||
===============
|
||||
Purpose: Audit entity SEO signals including PAA monitoring, FAQ schema tracking,
|
||||
entity markup validation, and brand SERP analysis.
|
||||
Python: 3.10+
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import asyncio
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
import sys
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from datetime import datetime
|
||||
from typing import Any
|
||||
from urllib.parse import quote, urljoin, urlparse
|
||||
|
||||
import aiohttp
|
||||
from bs4 import BeautifulSoup
|
||||
from rich.console import Console
|
||||
from rich.table import Table
|
||||
|
||||
from base_client import BaseAsyncClient, ConfigManager, config
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
console = Console()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Data classes
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@dataclass
|
||||
class PaaQuestion:
|
||||
"""A People Also Ask question found in SERP."""
|
||||
question: str = ""
|
||||
keyword: str = ""
|
||||
position: int = 0
|
||||
source_url: str | None = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class FaqRichResult:
|
||||
"""FAQ rich result tracking entry."""
|
||||
url: str = ""
|
||||
question_count: int = 0
|
||||
appearing_in_serp: bool = False
|
||||
questions: list[str] = field(default_factory=list)
|
||||
schema_valid: bool = False
|
||||
|
||||
|
||||
@dataclass
|
||||
class EntitySchema:
|
||||
"""Entity structured data found on a website."""
|
||||
type: str = "" # Organization, Person, LocalBusiness, etc.
|
||||
properties: dict[str, Any] = field(default_factory=dict)
|
||||
same_as_links: list[str] = field(default_factory=list)
|
||||
completeness: float = 0.0
|
||||
issues: list[str] = field(default_factory=list)
|
||||
|
||||
|
||||
@dataclass
|
||||
class BrandSerpResult:
|
||||
"""What appears when searching for the brand name."""
|
||||
query: str = ""
|
||||
features: list[str] = field(default_factory=list)
|
||||
paa_count: int = 0
|
||||
faq_count: int = 0
|
||||
knowledge_panel: bool = False
|
||||
sitelinks: bool = False
|
||||
social_profiles: list[str] = field(default_factory=list)
|
||||
top_results: list[dict[str, str]] = field(default_factory=list)
|
||||
|
||||
|
||||
@dataclass
|
||||
class EntityAuditResult:
|
||||
"""Full entity SEO audit result."""
|
||||
url: str = ""
|
||||
entity_name: str = ""
|
||||
paa_questions: list[PaaQuestion] = field(default_factory=list)
|
||||
faq_rich_results: list[FaqRichResult] = field(default_factory=list)
|
||||
entity_schemas: list[EntitySchema] = field(default_factory=list)
|
||||
brand_serp: BrandSerpResult = field(default_factory=BrandSerpResult)
|
||||
social_profile_status: dict[str, bool] = field(default_factory=dict)
|
||||
overall_score: float = 0.0
|
||||
recommendations: list[str] = field(default_factory=list)
|
||||
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
|
||||
|
||||
def to_dict(self) -> dict[str, Any]:
|
||||
return asdict(self)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Entity Auditor
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class EntityAuditor(BaseAsyncClient):
|
||||
"""Audit entity SEO signals and rich result presence."""
|
||||
|
||||
GOOGLE_SEARCH_URL = "https://www.google.com/search"
|
||||
|
||||
HEADERS = {
|
||||
"User-Agent": (
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
|
||||
"AppleWebKit/537.36 (KHTML, like Gecko) "
|
||||
"Chrome/120.0.0.0 Safari/537.36"
|
||||
),
|
||||
"Accept-Language": "en-US,en;q=0.9",
|
||||
}
|
||||
|
||||
PAA_KEYWORD_TEMPLATES = [
|
||||
"{entity}",
|
||||
"{entity} reviews",
|
||||
"{entity} vs",
|
||||
"what is {entity}",
|
||||
"{entity} pricing",
|
||||
"{entity} alternatives",
|
||||
"is {entity} good",
|
||||
"{entity} benefits",
|
||||
"how to use {entity}",
|
||||
"{entity} complaints",
|
||||
]
|
||||
|
||||
EXPECTED_SCHEMA_PROPERTIES = {
|
||||
"Organization": [
|
||||
"name", "url", "logo", "description", "sameAs",
|
||||
"contactPoint", "address", "foundingDate", "founder",
|
||||
"numberOfEmployees", "email", "telephone",
|
||||
],
|
||||
"Person": [
|
||||
"name", "url", "image", "description", "sameAs",
|
||||
"jobTitle", "worksFor", "alumniOf", "birthDate",
|
||||
],
|
||||
"LocalBusiness": [
|
||||
"name", "url", "image", "description", "sameAs",
|
||||
"address", "telephone", "openingHours", "geo",
|
||||
"priceRange", "aggregateRating",
|
||||
],
|
||||
}
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.config = config
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# PAA monitoring
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
async def monitor_paa(
|
||||
self,
|
||||
entity_name: str,
|
||||
keywords: list[str] | None = None,
|
||||
session: aiohttp.ClientSession | None = None,
|
||||
) -> list[PaaQuestion]:
|
||||
"""Search brand keywords and extract People Also Ask questions."""
|
||||
if keywords is None:
|
||||
keywords = [t.format(entity=entity_name) for t in self.PAA_KEYWORD_TEMPLATES]
|
||||
|
||||
paa_questions: list[PaaQuestion] = []
|
||||
|
||||
own_session = session is None
|
||||
if own_session:
|
||||
session = aiohttp.ClientSession()
|
||||
|
||||
try:
|
||||
for keyword in keywords:
|
||||
params = {"q": keyword, "hl": "en", "gl": "us"}
|
||||
try:
|
||||
async with session.get(
|
||||
self.GOOGLE_SEARCH_URL, params=params, headers=self.HEADERS,
|
||||
timeout=aiohttp.ClientTimeout(total=20),
|
||||
) as resp:
|
||||
if resp.status != 200:
|
||||
logger.warning("Search for '%s' returned status %d", keyword, resp.status)
|
||||
continue
|
||||
|
||||
html = await resp.text()
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
|
||||
# PAA box selectors
|
||||
paa_selectors = [
|
||||
"div[data-sgrd] div[data-q]",
|
||||
"div.related-question-pair",
|
||||
"div[jsname] div[data-q]",
|
||||
"div.wQiwMc",
|
||||
]
|
||||
|
||||
position = 0
|
||||
for selector in paa_selectors:
|
||||
elements = soup.select(selector)
|
||||
for el in elements:
|
||||
question_text = el.get("data-q", "") or el.get_text(strip=True)
|
||||
if question_text and len(question_text) > 5:
|
||||
position += 1
|
||||
paa_questions.append(PaaQuestion(
|
||||
question=question_text,
|
||||
keyword=keyword,
|
||||
position=position,
|
||||
))
|
||||
|
||||
# Fallback: regex for PAA-like questions
|
||||
if not paa_questions:
|
||||
text = soup.get_text(separator="\n")
|
||||
q_patterns = re.findall(
|
||||
r"((?:What|How|Why|When|Where|Who|Is|Can|Does|Do|Which)\s+[^?\n]{10,80}\??)",
|
||||
text,
|
||||
)
|
||||
for i, q in enumerate(q_patterns[:8]):
|
||||
paa_questions.append(PaaQuestion(
|
||||
question=q.strip(),
|
||||
keyword=keyword,
|
||||
position=i + 1,
|
||||
))
|
||||
|
||||
except Exception as exc:
|
||||
logger.error("PAA search failed for '%s': %s", keyword, exc)
|
||||
continue
|
||||
|
||||
# Rate limit between searches
|
||||
await asyncio.sleep(1.5)
|
||||
finally:
|
||||
if own_session:
|
||||
await session.close()
|
||||
|
||||
# Deduplicate questions
|
||||
seen = set()
|
||||
unique = []
|
||||
for q in paa_questions:
|
||||
key = q.question.lower().strip()
|
||||
if key not in seen:
|
||||
seen.add(key)
|
||||
unique.append(q)
|
||||
|
||||
logger.info("Found %d unique PAA questions for '%s'", len(unique), entity_name)
|
||||
return unique
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# FAQ rich result tracking
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
async def track_faq_rich_results(
|
||||
self,
|
||||
url: str,
|
||||
session: aiohttp.ClientSession | None = None,
|
||||
) -> list[FaqRichResult]:
|
||||
"""Check pages for FAQPage schema and SERP appearance."""
|
||||
faq_results: list[FaqRichResult] = []
|
||||
domain = urlparse(url).netloc
|
||||
|
||||
own_session = session is None
|
||||
if own_session:
|
||||
session = aiohttp.ClientSession()
|
||||
|
||||
try:
|
||||
# Fetch the page and look for FAQ schema
|
||||
async with session.get(
|
||||
url, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=20),
|
||||
) as resp:
|
||||
if resp.status != 200:
|
||||
logger.warning("Page %s returned status %d", url, resp.status)
|
||||
return faq_results
|
||||
|
||||
html = await resp.text()
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
|
||||
# Find JSON-LD scripts with FAQPage
|
||||
scripts = soup.find_all("script", type="application/ld+json")
|
||||
for script in scripts:
|
||||
try:
|
||||
data = json.loads(script.string or "{}")
|
||||
items = data if isinstance(data, list) else [data]
|
||||
|
||||
for item in items:
|
||||
schema_type = item.get("@type", "")
|
||||
if schema_type == "FAQPage" or (
|
||||
isinstance(schema_type, list) and "FAQPage" in schema_type
|
||||
):
|
||||
questions = item.get("mainEntity", [])
|
||||
faq = FaqRichResult(
|
||||
url=url,
|
||||
question_count=len(questions),
|
||||
questions=[
|
||||
q.get("name", "") for q in questions if isinstance(q, dict)
|
||||
],
|
||||
schema_valid=True,
|
||||
)
|
||||
faq_results.append(faq)
|
||||
|
||||
# Check for nested @graph
|
||||
graph = item.get("@graph", [])
|
||||
for g_item in graph:
|
||||
if g_item.get("@type") == "FAQPage":
|
||||
questions = g_item.get("mainEntity", [])
|
||||
faq = FaqRichResult(
|
||||
url=url,
|
||||
question_count=len(questions),
|
||||
questions=[
|
||||
q.get("name", "") for q in questions if isinstance(q, dict)
|
||||
],
|
||||
schema_valid=True,
|
||||
)
|
||||
faq_results.append(faq)
|
||||
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
|
||||
# Also check for microdata FAQ markup
|
||||
faq_items = soup.select("[itemtype*='FAQPage'] [itemprop='mainEntity']")
|
||||
if faq_items and not faq_results:
|
||||
questions = []
|
||||
for item in faq_items:
|
||||
q_el = item.select_one("[itemprop='name']")
|
||||
if q_el:
|
||||
questions.append(q_el.get_text(strip=True))
|
||||
faq_results.append(FaqRichResult(
|
||||
url=url,
|
||||
question_count=len(questions),
|
||||
questions=questions,
|
||||
schema_valid=True,
|
||||
))
|
||||
|
||||
except Exception as exc:
|
||||
logger.error("FAQ tracking failed for %s: %s", url, exc)
|
||||
finally:
|
||||
if own_session:
|
||||
await session.close()
|
||||
|
||||
logger.info("Found %d FAQ schemas on %s", len(faq_results), url)
|
||||
return faq_results
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Entity schema audit
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
async def audit_entity_schema(
|
||||
self,
|
||||
url: str,
|
||||
session: aiohttp.ClientSession | None = None,
|
||||
) -> list[EntitySchema]:
|
||||
"""Check Organization/Person/LocalBusiness schema on website."""
|
||||
schemas: list[EntitySchema] = []
|
||||
target_types = {"Organization", "Person", "LocalBusiness", "Corporation", "MedicalBusiness"}
|
||||
|
||||
own_session = session is None
|
||||
if own_session:
|
||||
session = aiohttp.ClientSession()
|
||||
|
||||
try:
|
||||
async with session.get(
|
||||
url, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=20),
|
||||
) as resp:
|
||||
if resp.status != 200:
|
||||
logger.warning("Page %s returned status %d", url, resp.status)
|
||||
return schemas
|
||||
|
||||
html = await resp.text()
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
|
||||
scripts = soup.find_all("script", type="application/ld+json")
|
||||
for script in scripts:
|
||||
try:
|
||||
data = json.loads(script.string or "{}")
|
||||
items = data if isinstance(data, list) else [data]
|
||||
|
||||
# Include @graph nested items
|
||||
expanded = []
|
||||
for item in items:
|
||||
expanded.append(item)
|
||||
if "@graph" in item:
|
||||
expanded.extend(item["@graph"])
|
||||
|
||||
for item in expanded:
|
||||
item_type = item.get("@type", "")
|
||||
if isinstance(item_type, list):
|
||||
matching = [t for t in item_type if t in target_types]
|
||||
if not matching:
|
||||
continue
|
||||
item_type = matching[0]
|
||||
elif item_type not in target_types:
|
||||
continue
|
||||
|
||||
same_as = item.get("sameAs", [])
|
||||
if isinstance(same_as, str):
|
||||
same_as = [same_as]
|
||||
|
||||
# Calculate completeness
|
||||
base_type = item_type
|
||||
if base_type == "Corporation":
|
||||
base_type = "Organization"
|
||||
elif base_type == "MedicalBusiness":
|
||||
base_type = "LocalBusiness"
|
||||
|
||||
expected = self.EXPECTED_SCHEMA_PROPERTIES.get(base_type, [])
|
||||
present = [k for k in expected if k in item and item[k]]
|
||||
completeness = round((len(present) / len(expected)) * 100, 1) if expected else 0
|
||||
|
||||
# Check for issues
|
||||
issues = []
|
||||
if "name" not in item:
|
||||
issues.append("Missing 'name' property")
|
||||
if "url" not in item:
|
||||
issues.append("Missing 'url' property")
|
||||
if not same_as:
|
||||
issues.append("No 'sameAs' links (social profiles)")
|
||||
if "logo" not in item and base_type == "Organization":
|
||||
issues.append("Missing 'logo' property")
|
||||
if "description" not in item:
|
||||
issues.append("Missing 'description' property")
|
||||
|
||||
schema = EntitySchema(
|
||||
type=item_type,
|
||||
properties={k: (str(v)[:100] if not isinstance(v, (list, dict)) else v) for k, v in item.items() if k != "@context"},
|
||||
same_as_links=same_as,
|
||||
completeness=completeness,
|
||||
issues=issues,
|
||||
)
|
||||
schemas.append(schema)
|
||||
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
|
||||
except Exception as exc:
|
||||
logger.error("Entity schema audit failed for %s: %s", url, exc)
|
||||
finally:
|
||||
if own_session:
|
||||
await session.close()
|
||||
|
||||
logger.info("Found %d entity schemas on %s", len(schemas), url)
|
||||
return schemas
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Brand SERP analysis
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
async def analyze_brand_serp(
|
||||
self,
|
||||
entity_name: str,
|
||||
session: aiohttp.ClientSession | None = None,
|
||||
) -> BrandSerpResult:
|
||||
"""Analyze what appears in SERP for the brand name search."""
|
||||
result = BrandSerpResult(query=entity_name)
|
||||
|
||||
own_session = session is None
|
||||
if own_session:
|
||||
session = aiohttp.ClientSession()
|
||||
|
||||
try:
|
||||
params = {"q": entity_name, "hl": "en", "gl": "us"}
|
||||
async with session.get(
|
||||
self.GOOGLE_SEARCH_URL, params=params, headers=self.HEADERS,
|
||||
timeout=aiohttp.ClientTimeout(total=20),
|
||||
) as resp:
|
||||
if resp.status != 200:
|
||||
return result
|
||||
|
||||
html = await resp.text()
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
text = soup.get_text(separator=" ", strip=True).lower()
|
||||
|
||||
# Detect SERP features
|
||||
feature_indicators = {
|
||||
"knowledge_panel": ["kp-wholepage", "knowledge-panel", "kno-"],
|
||||
"sitelinks": ["sitelinks", "site-links"],
|
||||
"people_also_ask": ["related-question-pair", "data-q"],
|
||||
"faq_rich_result": ["faqpage", "frequently asked"],
|
||||
"featured_snippet": ["featured-snippet", "data-tts"],
|
||||
"image_pack": ["image-result", "img-brk"],
|
||||
"video_carousel": ["video-result", "vid-"],
|
||||
"twitter_carousel": ["twitter-timeline", "g-scrolling-carousel"],
|
||||
"reviews": ["star-rating", "aggregate-rating"],
|
||||
"local_pack": ["local-pack", "local_pack"],
|
||||
}
|
||||
|
||||
for feature, indicators in feature_indicators.items():
|
||||
for ind in indicators:
|
||||
if ind in str(soup).lower():
|
||||
result.features.append(feature)
|
||||
break
|
||||
|
||||
result.knowledge_panel = "knowledge_panel" in result.features
|
||||
result.sitelinks = "sitelinks" in result.features
|
||||
|
||||
# Count PAA questions
|
||||
paa_elements = soup.select("div[data-q], div.related-question-pair")
|
||||
result.paa_count = len(paa_elements)
|
||||
if result.paa_count > 0 and "people_also_ask" not in result.features:
|
||||
result.features.append("people_also_ask")
|
||||
|
||||
# Detect social profiles in results
|
||||
social_domains = {
|
||||
"twitter.com": "twitter", "x.com": "twitter",
|
||||
"facebook.com": "facebook", "linkedin.com": "linkedin",
|
||||
"youtube.com": "youtube", "instagram.com": "instagram",
|
||||
"github.com": "github", "pinterest.com": "pinterest",
|
||||
}
|
||||
links = soup.find_all("a", href=True)
|
||||
for link in links:
|
||||
href = link["href"]
|
||||
for domain, name in social_domains.items():
|
||||
if domain in href and name not in result.social_profiles:
|
||||
result.social_profiles.append(name)
|
||||
|
||||
# Extract top organic results
|
||||
result_divs = soup.select("div.g, div[data-sokoban-container]")[:10]
|
||||
for div in result_divs:
|
||||
title_el = div.select_one("h3")
|
||||
link_el = div.select_one("a[href]")
|
||||
if title_el and link_el:
|
||||
result.top_results.append({
|
||||
"title": title_el.get_text(strip=True),
|
||||
"url": link_el.get("href", ""),
|
||||
})
|
||||
|
||||
except Exception as exc:
|
||||
logger.error("Brand SERP analysis failed for '%s': %s", entity_name, exc)
|
||||
finally:
|
||||
if own_session:
|
||||
await session.close()
|
||||
|
||||
return result
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Social profile link validation
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
async def check_social_profile_links(
|
||||
self,
|
||||
same_as_links: list[str],
|
||||
session: aiohttp.ClientSession | None = None,
|
||||
) -> dict[str, bool]:
|
||||
"""Validate sameAs URLs are accessible."""
|
||||
status: dict[str, bool] = {}
|
||||
if not same_as_links:
|
||||
return status
|
||||
|
||||
own_session = session is None
|
||||
if own_session:
|
||||
session = aiohttp.ClientSession()
|
||||
|
||||
try:
|
||||
for link in same_as_links:
|
||||
try:
|
||||
async with session.head(
|
||||
link, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=10),
|
||||
allow_redirects=True,
|
||||
) as resp:
|
||||
status[link] = resp.status < 400
|
||||
except Exception:
|
||||
status[link] = False
|
||||
|
||||
await asyncio.sleep(0.5)
|
||||
finally:
|
||||
if own_session:
|
||||
await session.close()
|
||||
|
||||
accessible = sum(1 for v in status.values() if v)
|
||||
logger.info("Social profile links: %d/%d accessible", accessible, len(status))
|
||||
return status
|
||||
|
||||
    # ------------------------------------------------------------------
    # Recommendations
    # ------------------------------------------------------------------

    def generate_recommendations(self, result: EntityAuditResult) -> list[str]:
        """Generate actionable entity SEO improvement recommendations."""
        recs: list[str] = []

        # PAA recommendations
        if not result.paa_questions:
            recs.append(
                "브랜드 관련 People Also Ask(PAA) 질문이 감지되지 않았습니다. "
                "FAQ 콘텐츠를 작성하여 PAA 노출 기회를 확보하세요."
            )
        elif len(result.paa_questions) < 5:
            recs.append(
                f"PAA 질문이 {len(result.paa_questions)}개만 감지되었습니다. "
                "더 다양한 키워드에 대한 Q&A 콘텐츠를 강화하세요."
            )

        # FAQ schema recommendations
        if not result.faq_rich_results:
            recs.append(
                "FAQPage schema가 감지되지 않았습니다. "
                "FAQ 페이지에 FAQPage JSON-LD를 추가하여 Rich Result를 확보하세요."
            )
        else:
            invalid = [f for f in result.faq_rich_results if not f.schema_valid]
            if invalid:
                recs.append(
                    f"{len(invalid)}개의 FAQ schema에 유효성 문제가 있습니다. "
                    "Google Rich Results Test로 검증하세요."
                )

        # Entity schema recommendations
        if not result.entity_schemas:
            recs.append(
                "Organization/Person/LocalBusiness schema가 없습니다. "
                "홈페이지에 Organization schema JSON-LD를 추가하세요."
            )
        else:
            for schema in result.entity_schemas:
                if schema.completeness < 50:
                    recs.append(
                        f"{schema.type} schema 완성도가 {schema.completeness}%입니다. "
                        f"누락 항목: {', '.join(schema.issues[:3])}"
                    )
                if not schema.same_as_links:
                    recs.append(
                        f"{schema.type} schema에 sameAs 속성이 없습니다. "
                        "소셜 미디어 프로필 URL을 sameAs에 추가하세요."
                    )

        # Brand SERP recommendations
        serp = result.brand_serp
        if not serp.knowledge_panel:
            recs.append(
                "브랜드 검색 시 Knowledge Panel이 표시되지 않습니다. "
                "Wikipedia, Wikidata, 구조화된 데이터를 통해 엔티티 인식을 강화하세요."
            )
        if not serp.sitelinks:
            recs.append(
                "Sitelinks가 표시되지 않습니다. "
                "사이트 구조와 내부 링크를 개선하세요."
            )
        if len(serp.social_profiles) < 3:
            recs.append(
                f"SERP에 소셜 프로필이 {len(serp.social_profiles)}개만 표시됩니다. "
                "주요 소셜 미디어 프로필을 활성화하고 schema sameAs에 연결하세요."
            )

        # Social profile accessibility
        broken = [url for url, ok in result.social_profile_status.items() if not ok]
        if broken:
            recs.append(
                f"접근 불가한 소셜 프로필 링크 {len(broken)}개: "
                f"{', '.join(broken[:3])}. sameAs URL을 업데이트하세요."
            )

        if not recs:
            recs.append("Entity SEO 상태가 양호합니다. 현재 수준을 유지하세요.")

        return recs

    # ------------------------------------------------------------------
    # Scoring
    # ------------------------------------------------------------------

    def compute_score(self, result: EntityAuditResult) -> float:
        """Compute overall entity SEO score (0-100)."""
        score = 0.0

        # PAA presence (15 points)
        paa_count = len(result.paa_questions)
        if paa_count >= 10:
            score += 15
        elif paa_count >= 5:
            score += 10
        elif paa_count > 0:
            score += 5

        # FAQ schema (15 points)
        if result.faq_rich_results:
            valid_count = sum(1 for f in result.faq_rich_results if f.schema_valid)
            score += min(15, valid_count * 5)

        # Entity schema (25 points)
        if result.entity_schemas:
            best_completeness = max(s.completeness for s in result.entity_schemas)
            score += best_completeness * 0.25

        # Brand SERP features (25 points)
        serp = result.brand_serp
        if serp.knowledge_panel:
            score += 10
        if serp.sitelinks:
            score += 5
        score += min(10, len(serp.features) * 2)

        # Social profiles (10 points)
        if result.social_profile_status:
            accessible = sum(1 for v in result.social_profile_status.values() if v)
            total = len(result.social_profile_status)
            score += (accessible / total) * 10 if total > 0 else 0

        # sameAs links (10 points)
        total_same_as = sum(len(s.same_as_links) for s in result.entity_schemas)
        score += min(10, total_same_as * 2)

        return round(min(100, score), 1)

    # ------------------------------------------------------------------
    # Main orchestrator
    # ------------------------------------------------------------------

    async def audit(
        self,
        url: str,
        entity_name: str,
        include_paa: bool = True,
        include_faq: bool = True,
    ) -> EntityAuditResult:
        """Orchestrate full entity SEO audit."""
        result = EntityAuditResult(url=url, entity_name=entity_name)
        logger.info("Starting entity audit for '%s' at %s", entity_name, url)

        async with aiohttp.ClientSession() as session:
            # Parallel tasks: entity schema, brand SERP, FAQ
            tasks = [
                self.audit_entity_schema(url, session),
                self.analyze_brand_serp(entity_name, session),
            ]

            if include_faq:
                tasks.append(self.track_faq_rich_results(url, session))

            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Unpack results
            if not isinstance(results[0], Exception):
                result.entity_schemas = results[0]
            else:
                logger.error("Entity schema audit failed: %s", results[0])

            if not isinstance(results[1], Exception):
                result.brand_serp = results[1]
            else:
                logger.error("Brand SERP analysis failed: %s", results[1])

            if include_faq and len(results) > 2 and not isinstance(results[2], Exception):
                result.faq_rich_results = results[2]

            # PAA monitoring (sequential due to rate limits)
            if include_paa:
                result.paa_questions = await self.monitor_paa(entity_name, session=session)

            # Validate social profile links from schema
            all_same_as = []
            for schema in result.entity_schemas:
                all_same_as.extend(schema.same_as_links)
            if all_same_as:
                result.social_profile_status = await self.check_social_profile_links(
                    list(set(all_same_as)), session
                )

            # Compute score and recommendations
            result.overall_score = self.compute_score(result)
            result.recommendations = self.generate_recommendations(result)

            logger.info("Entity audit complete. Score: %.1f", result.overall_score)
            return result


# ---------------------------------------------------------------------------
# CLI display helpers
# ---------------------------------------------------------------------------


def display_result(result: EntityAuditResult) -> None:
    """Display audit result in rich tables."""
    console.print()
    console.print(f"[bold cyan]Entity SEO Audit: {result.entity_name}[/bold cyan]")
    console.print(f"URL: {result.url} | Score: {result.overall_score}/100")
    console.print()

    # Entity Schema table
    if result.entity_schemas:
        table = Table(title="Entity Schema Markup", show_header=True)
        table.add_column("Type", style="bold")
        table.add_column("Completeness")
        table.add_column("sameAs Links")
        table.add_column("Issues")

        for schema in result.entity_schemas:
            issues_text = "; ".join(schema.issues[:3]) if schema.issues else "None"
            table.add_row(
                schema.type,
                f"{schema.completeness}%",
                str(len(schema.same_as_links)),
                issues_text,
            )
        console.print(table)
    else:
        console.print("[red]No entity schema markup found on website![/red]")
    console.print()

    # Brand SERP table
    serp = result.brand_serp
    serp_table = Table(title="Brand SERP Analysis", show_header=True)
    serp_table.add_column("Feature", style="bold")
    serp_table.add_column("Status")

    serp_table.add_row("Knowledge Panel", "[green]Yes[/]" if serp.knowledge_panel else "[red]No[/]")
    serp_table.add_row("Sitelinks", "[green]Yes[/]" if serp.sitelinks else "[red]No[/]")
    serp_table.add_row("PAA Count", str(serp.paa_count))
    serp_table.add_row("SERP Features", ", ".join(serp.features) if serp.features else "None")
    serp_table.add_row("Social Profiles", ", ".join(serp.social_profiles) if serp.social_profiles else "None")

    console.print(serp_table)
    console.print()

    # PAA Questions
    if result.paa_questions:
        paa_table = Table(title=f"People Also Ask ({len(result.paa_questions)} questions)", show_header=True)
        paa_table.add_column("#", style="dim")
        paa_table.add_column("Question")
        paa_table.add_column("Keyword")

        for i, q in enumerate(result.paa_questions[:15], 1):
            paa_table.add_row(str(i), q.question, q.keyword)
        console.print(paa_table)
        console.print()

    # FAQ Rich Results
    if result.faq_rich_results:
        faq_table = Table(title="FAQ Rich Results", show_header=True)
        faq_table.add_column("URL")
        faq_table.add_column("Questions")
        faq_table.add_column("Valid")

        for faq in result.faq_rich_results:
            faq_table.add_row(
                faq.url[:60],
                str(faq.question_count),
                "[green]Yes[/]" if faq.schema_valid else "[red]No[/]",
            )
        console.print(faq_table)
        console.print()

    # Social Profile Status
    if result.social_profile_status:
        sp_table = Table(title="Social Profile Link Status", show_header=True)
        sp_table.add_column("URL")
        sp_table.add_column("Accessible")

        for link, accessible in result.social_profile_status.items():
            sp_table.add_row(
                link[:70],
                "[green]Yes[/]" if accessible else "[red]No[/]",
            )
        console.print(sp_table)
        console.print()

    # Recommendations
    console.print("[bold yellow]Recommendations:[/bold yellow]")
    for i, rec in enumerate(result.recommendations, 1):
        console.print(f"  {i}. {rec}")
    console.print()


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Entity SEO Auditor",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("--url", required=True, help="Website URL to audit")
    parser.add_argument("--entity", required=True, help="Entity/brand name")
    parser.add_argument("--no-paa", action="store_true", help="Skip PAA monitoring (on by default)")
    parser.add_argument("--no-faq", action="store_true", help="Skip FAQ tracking (on by default)")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    parser.add_argument("--output", type=str, help="Output file path")
    return parser.parse_args()


async def main() -> None:
    args = parse_args()

    auditor = EntityAuditor()
    result = await auditor.audit(
        url=args.url,
        entity_name=args.entity,
        include_paa=not args.no_paa,
        include_faq=not args.no_faq,
    )

    if args.json:
        output = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(output)
            console.print(f"[green]Output saved to {args.output}[/green]")
        else:
            print(output)
    else:
        display_result(result)
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                json.dump(result.to_dict(), f, ensure_ascii=False, indent=2)
            console.print(f"[green]Output saved to {args.output}[/green]")


if __name__ == "__main__":
    asyncio.run(main())
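For reference, the CLI's skip flags invert into the `include_*` keyword arguments passed to `audit()`. This standalone sketch (the `build_flags` helper is hypothetical, not part of the script) mirrors that mapping in isolation:

```python
import argparse


def build_flags(argv: list[str]) -> dict[str, bool]:
    """Mirror entity_auditor's CLI: --no-* skip flags invert into include_* kwargs."""
    parser = argparse.ArgumentParser(description="Entity SEO Auditor (flag sketch)")
    parser.add_argument("--no-paa", action="store_true", help="Skip PAA monitoring")
    parser.add_argument("--no-faq", action="store_true", help="Skip FAQ tracking")
    args = parser.parse_args(argv)
    # argparse exposes --no-paa as args.no_paa; both features default to enabled
    return {"include_paa": not args.no_paa, "include_faq": not args.no_faq}


# With no flags both checks run; --no-paa disables only PAA monitoring.
flags = build_flags(["--no-paa"])
```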
@@ -0,0 +1,782 @@
"""
Knowledge Graph Analyzer
========================
Purpose: Analyze entity presence in Google Knowledge Graph, Knowledge Panels,
Wikipedia, Wikidata, and Korean equivalents (Naver encyclopedia, 지식iN).
Python: 3.10+
"""

import argparse
import asyncio
import json
import logging
import re
import sys
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any
from urllib.parse import quote, urljoin

import aiohttp
from bs4 import BeautifulSoup
from rich.console import Console
from rich.table import Table

from base_client import BaseAsyncClient, ConfigManager, config

logger = logging.getLogger(__name__)
console = Console()

# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------

EXPECTED_ATTRIBUTES = [
    "name",
    "type",
    "description",
    "logo",
    "website",
    "founded",
    "ceo",
    "headquarters",
    "parent_organization",
    "subsidiaries",
    "social_twitter",
    "social_facebook",
    "social_linkedin",
    "social_youtube",
    "social_instagram",
    "stock_ticker",
    "industry",
    "employees",
    "revenue",
]


@dataclass
class KnowledgePanelAttribute:
    """Single attribute extracted from a Knowledge Panel."""
    name: str
    value: str | None = None
    present: bool = False


@dataclass
class KnowledgePanel:
    """Represents a detected Knowledge Panel."""
    detected: bool = False
    entity_type: str | None = None
    attributes: list[KnowledgePanelAttribute] = field(default_factory=list)
    completeness_score: float = 0.0
    raw_snippet: str | None = None


@dataclass
class WikiPresence:
    """Wikipedia or Wikidata presence record."""
    platform: str = ""  # "wikipedia" or "wikidata"
    present: bool = False
    url: str | None = None
    qid: str | None = None  # Wikidata QID (e.g. Q20710)
    language: str = "en"


@dataclass
class NaverPresence:
    """Naver encyclopedia and 지식iN presence."""
    encyclopedia_present: bool = False
    encyclopedia_url: str | None = None
    knowledge_in_present: bool = False
    knowledge_in_count: int = 0
    knowledge_in_url: str | None = None


@dataclass
class KnowledgeGraphResult:
    """Full Knowledge Graph analysis result."""
    entity: str = ""
    language: str = "en"
    knowledge_panel: KnowledgePanel = field(default_factory=KnowledgePanel)
    wikipedia: WikiPresence = field(default_factory=lambda: WikiPresence(platform="wikipedia"))
    wikidata: WikiPresence = field(default_factory=lambda: WikiPresence(platform="wikidata"))
    naver: NaverPresence = field(default_factory=NaverPresence)
    competitors: list[dict[str, Any]] = field(default_factory=list)
    overall_score: float = 0.0
    recommendations: list[str] = field(default_factory=list)
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)


# ---------------------------------------------------------------------------
# Knowledge Graph Analyzer
# ---------------------------------------------------------------------------


class KnowledgeGraphAnalyzer(BaseAsyncClient):
    """Analyze entity presence in Knowledge Graph and related platforms."""

    GOOGLE_SEARCH_URL = "https://www.google.com/search"
    WIKIPEDIA_API_URL = "https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    WIKIDATA_API_URL = "https://www.wikidata.org/w/api.php"
    NAVER_SEARCH_URL = "https://search.naver.com/search.naver"
    NAVER_ENCYCLOPEDIA_URL = "https://terms.naver.com/search.naver"
    NAVER_KIN_URL = "https://kin.naver.com/search/list.naver"

    HEADERS = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.config = config

    # ------------------------------------------------------------------
    # Google entity search
    # ------------------------------------------------------------------

    async def search_entity(
        self,
        entity_name: str,
        language: str = "en",
        session: aiohttp.ClientSession | None = None,
    ) -> dict[str, Any]:
        """Search Google for entity to detect Knowledge Panel signals."""
        params = {"q": entity_name, "hl": language, "gl": "us" if language == "en" else "kr"}
        headers = {**self.HEADERS}
        if language == "ko":
            headers["Accept-Language"] = "ko-KR,ko;q=0.9"
            params["gl"] = "kr"

        own_session = session is None
        if own_session:
            session = aiohttp.ClientSession()

        try:
            async with session.get(
                self.GOOGLE_SEARCH_URL, params=params, headers=headers, timeout=aiohttp.ClientTimeout(total=20)
            ) as resp:
                if resp.status != 200:
                    logger.warning("Google search returned status %d", resp.status)
                    return {"html": "", "status": resp.status}
                html = await resp.text()
                return {"html": html, "status": resp.status}
        except Exception as exc:
            logger.error("Google search failed: %s", exc)
            return {"html": "", "status": 0, "error": str(exc)}
        finally:
            if own_session:
                await session.close()

    # ------------------------------------------------------------------
    # Knowledge Panel detection
    # ------------------------------------------------------------------

    def detect_knowledge_panel(self, search_data: dict[str, Any]) -> KnowledgePanel:
        """Parse search results HTML for Knowledge Panel indicators."""
        html = search_data.get("html", "")
        if not html:
            return KnowledgePanel(detected=False)

        soup = BeautifulSoup(html, "lxml")
        kp = KnowledgePanel()

        # Knowledge Panel is typically in a div with class 'kp-wholepage' or 'knowledge-panel'
        kp_selectors = [
            "div.kp-wholepage",
            "div.knowledge-panel",
            "div[data-attrid='title']",
            "div.kp-header",
            "div[class*='kno-']",
            "div.osrp-blk",
        ]

        kp_element = None
        for selector in kp_selectors:
            kp_element = soup.select_one(selector)
            if kp_element:
                break

        if kp_element:
            kp.detected = True
            kp.raw_snippet = kp_element.get_text(separator=" ", strip=True)[:500]
        else:
            # Fallback: check for common KP text patterns
            text = soup.get_text(separator=" ", strip=True).lower()
            kp_indicators = [
                "wikipedia", "description", "founded", "ceo",
                "headquarters", "subsidiaries", "parent organization",
            ]
            matches = sum(1 for ind in kp_indicators if ind in text)
            if matches >= 3:
                kp.detected = True
                kp.raw_snippet = text[:500]

        return kp

    # ------------------------------------------------------------------
    # Attribute extraction
    # ------------------------------------------------------------------

    def extract_attributes(self, kp: KnowledgePanel, html: str = "") -> list[KnowledgePanelAttribute]:
        """Extract entity attributes from Knowledge Panel data."""
        attributes: list[KnowledgePanelAttribute] = []

        # Parse HTML for structured attribute data
        soup = BeautifulSoup(html, "lxml") if html else None

        attribute_patterns = {
            "name": r"^(.+?)(?:\s+is\s+|\s*[-|]\s*)",
            "type": r"(?:is\s+(?:a|an)\s+)(\w[\w\s]+?)(?:\.|,|\s+based)",
            "description": r"(?:is\s+)(.{20,200}?)(?:\.\s)",
            "founded": r"(?:founded|established|incorporated)\s*(?:in|:)?\s*(\d{4})",
            "ceo": r"(?:ceo|chief executive|chairman)\s*(?::|is)?\s*([A-Z][\w\s.]+?)(?:,|\.|;|\s{2})",
            "headquarters": r"(?:headquarters?|hq|based in)\s*(?::|is|in)?\s*([A-Z][\w\s,]+?)(?:\.|;|\s{2})",
            "stock_ticker": r"(?:stock|ticker|symbol)\s*(?::|is)?\s*([A-Z]{1,5}(?:\s*:\s*[A-Z]{1,5})?)",
            "employees": r"(?:employees?|staff|workforce)\s*(?::|is)?\s*([\d,]+)",
            "revenue": r"(?:revenue|sales)\s*(?::|is)?\s*([\$\d,.]+\s*(?:billion|million|B|M)?)",
            "industry": r"(?:industry|sector)\s*(?::|is)?\s*([\w\s&]+?)(?:\.|,|;)",
        }

        social_patterns = {
            "social_twitter": r"(?:twitter\.com|x\.com)/(\w+)",
            "social_facebook": r"facebook\.com/([\w.]+)",
            "social_linkedin": r"linkedin\.com/(?:company|in)/([\w-]+)",
            "social_youtube": r"youtube\.com/(?:@|channel/|user/)([\w-]+)",
            "social_instagram": r"instagram\.com/([\w.]+)",
        }

        full_text = kp.raw_snippet or ""
        html_text = ""
        if soup:
            html_text = soup.get_text(separator=" ", strip=True)

        combined = f"{full_text} {html_text}"

        for attr_name, pattern in attribute_patterns.items():
            match = re.search(pattern, combined, re.IGNORECASE)
            present = match is not None
            value = match.group(1).strip() if match else None
            attributes.append(KnowledgePanelAttribute(name=attr_name, value=value, present=present))

        # Social profiles
        for attr_name, pattern in social_patterns.items():
            match = re.search(pattern, combined, re.IGNORECASE)
            present = match is not None
            value = match.group(1).strip() if match else None
            attributes.append(KnowledgePanelAttribute(name=attr_name, value=value, present=present))

        # Logo detection from HTML
        logo_present = False
        if soup:
            logo_img = soup.select_one("img[data-atf], g-img img, img.kno-fb-img, img[alt*='logo']")
            if logo_img:
                logo_present = True
        attributes.append(KnowledgePanelAttribute(name="logo", value=None, present=logo_present))

        # Website detection
        website_present = False
        value = None
        if soup:
            site_link = soup.select_one("a[data-attrid*='website'], a.ab_button[href*='http']")
            if site_link:
                website_present = True
                value = site_link.get("href", "")
        attributes.append(KnowledgePanelAttribute(name="website", value=value if website_present else None, present=website_present))

        return attributes

    # ------------------------------------------------------------------
    # Completeness scoring
    # ------------------------------------------------------------------

    def score_completeness(self, attributes: list[KnowledgePanelAttribute]) -> float:
        """Score attribute completeness (0-100) based on filled vs expected."""
        if not attributes:
            return 0.0

        weights = {
            "name": 10, "type": 8, "description": 10, "logo": 8, "website": 10,
            "founded": 5, "ceo": 5, "headquarters": 5, "parent_organization": 3,
            "subsidiaries": 3, "social_twitter": 4, "social_facebook": 4,
            "social_linkedin": 4, "social_youtube": 3, "social_instagram": 3,
            "stock_ticker": 3, "industry": 5, "employees": 3, "revenue": 4,
        }

        total_weight = sum(weights.values())
        earned = 0.0

        attr_map = {a.name: a for a in attributes}
        for attr_name, weight in weights.items():
            attr = attr_map.get(attr_name)
            if attr and attr.present:
                earned += weight

        return round((earned / total_weight) * 100, 1) if total_weight > 0 else 0.0

    # ------------------------------------------------------------------
    # Wikipedia check
    # ------------------------------------------------------------------

    async def check_wikipedia(
        self,
        entity_name: str,
        language: str = "en",
        session: aiohttp.ClientSession | None = None,
    ) -> WikiPresence:
        """Check Wikipedia article existence for entity."""
        wiki = WikiPresence(platform="wikipedia", language=language)
        title = entity_name.replace(" ", "_")
        url = self.WIKIPEDIA_API_URL.format(lang=language, title=quote(title))

        own_session = session is None
        if own_session:
            session = aiohttp.ClientSession()

        try:
            async with session.get(url, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=15)) as resp:
                if resp.status == 200:
                    data = await resp.json()
                    wiki.present = data.get("type") != "disambiguation"
                    wiki.url = data.get("content_urls", {}).get("desktop", {}).get("page", "")
                    if not wiki.url:
                        wiki.url = f"https://{language}.wikipedia.org/wiki/{quote(title)}"
                    logger.info("Wikipedia article found for '%s' (%s)", entity_name, language)
                elif resp.status == 404:
                    wiki.present = False
                    logger.info("No Wikipedia article for '%s' (%s)", entity_name, language)
                else:
                    logger.warning("Wikipedia API returned status %d", resp.status)
        except Exception as exc:
            logger.error("Wikipedia check failed: %s", exc)
        finally:
            if own_session:
                await session.close()

        return wiki

    # ------------------------------------------------------------------
    # Wikidata check
    # ------------------------------------------------------------------

    async def check_wikidata(
        self,
        entity_name: str,
        session: aiohttp.ClientSession | None = None,
    ) -> WikiPresence:
        """Check Wikidata QID existence for entity."""
        wiki = WikiPresence(platform="wikidata")
        params = {
            "action": "wbsearchentities",
            "search": entity_name,
            "language": "en",
            "format": "json",
            "limit": 5,
        }

        own_session = session is None
        if own_session:
            session = aiohttp.ClientSession()

        try:
            async with session.get(
                self.WIKIDATA_API_URL, params=params, headers=self.HEADERS,
                timeout=aiohttp.ClientTimeout(total=15),
            ) as resp:
                if resp.status == 200:
                    data = await resp.json()
                    results = data.get("search", [])
                    if results:
                        top = results[0]
                        wiki.present = True
                        wiki.qid = top.get("id", "")
                        wiki.url = top.get("concepturi", f"https://www.wikidata.org/wiki/{wiki.qid}")
                        logger.info("Wikidata entity found: %s (%s)", wiki.qid, entity_name)
                    else:
                        wiki.present = False
                        logger.info("No Wikidata entity for '%s'", entity_name)
                else:
                    logger.warning("Wikidata API returned status %d", resp.status)
        except Exception as exc:
            logger.error("Wikidata check failed: %s", exc)
        finally:
            if own_session:
                await session.close()

        return wiki

    # ------------------------------------------------------------------
    # Naver encyclopedia
    # ------------------------------------------------------------------

    async def check_naver_encyclopedia(
        self,
        entity_name: str,
        session: aiohttp.ClientSession | None = None,
    ) -> dict[str, Any]:
        """Check Naver encyclopedia (네이버 백과사전) presence."""
        result = {"present": False, "url": None}
        params = {"query": entity_name, "searchType": 0}
        headers = {
            **self.HEADERS,
            "Accept-Language": "ko-KR,ko;q=0.9",
        }

        own_session = session is None
        if own_session:
            session = aiohttp.ClientSession()

        try:
            async with session.get(
                self.NAVER_ENCYCLOPEDIA_URL, params=params, headers=headers,
                timeout=aiohttp.ClientTimeout(total=15),
            ) as resp:
                if resp.status == 200:
                    html = await resp.text()
                    soup = BeautifulSoup(html, "lxml")
                    # Look for search result entries
                    entries = soup.select("ul.content_list li, div.search_result a, a.title")
                    if entries:
                        result["present"] = True
                        # The selector may match <a> tags directly, so don't blindly descend
                        first = entries[0]
                        first_link = first if first.name == "a" else first.find("a")
                        if first_link and first_link.get("href"):
                            href = first_link["href"]
                            if not href.startswith("http"):
                                href = urljoin("https://terms.naver.com", href)
                            result["url"] = href
                        else:
                            result["url"] = f"https://terms.naver.com/search.naver?query={quote(entity_name)}"
                        logger.info("Naver encyclopedia entry found for '%s'", entity_name)
                    else:
                        # Fallback: check page text for result indicators
                        text = soup.get_text()
                        if entity_name in text and "검색결과가 없습니다" not in text:
                            result["present"] = True
                            result["url"] = f"https://terms.naver.com/search.naver?query={quote(entity_name)}"
                else:
                    logger.warning("Naver encyclopedia returned status %d", resp.status)
        except Exception as exc:
            logger.error("Naver encyclopedia check failed: %s", exc)
        finally:
            if own_session:
                await session.close()

        return result

    # ------------------------------------------------------------------
    # Naver knowledge iN
    # ------------------------------------------------------------------

    async def check_naver_knowledge_in(
        self,
        entity_name: str,
        session: aiohttp.ClientSession | None = None,
    ) -> dict[str, Any]:
        """Check Naver knowledge iN (지식iN) entries."""
        result = {"present": False, "count": 0, "url": None}
        params = {"query": entity_name}
        headers = {
            **self.HEADERS,
            "Accept-Language": "ko-KR,ko;q=0.9",
        }

        own_session = session is None
        if own_session:
            session = aiohttp.ClientSession()

        try:
            async with session.get(
                self.NAVER_KIN_URL, params=params, headers=headers,
                timeout=aiohttp.ClientTimeout(total=15),
            ) as resp:
                if resp.status == 200:
                    html = await resp.text()
                    soup = BeautifulSoup(html, "lxml")

                    # Extract total result count
                    count_el = soup.select_one("span.number, em.total_count, span.result_count")
                    count = 0
                    if count_el:
                        count_text = count_el.get_text(strip=True).replace(",", "")
                        count_match = re.search(r"(\d+)", count_text)
                        if count_match:
                            count = int(count_match.group(1))

                    # Also check for list items
                    entries = soup.select("ul.basic1 li, ul._list li, div.search_list li")
                    if count > 0 or entries:
                        result["present"] = True
                        result["count"] = count if count > 0 else len(entries)
                        result["url"] = f"https://kin.naver.com/search/list.naver?query={quote(entity_name)}"
                        logger.info("Naver 지식iN: %d entries for '%s'", result["count"], entity_name)
                    else:
                        logger.info("No Naver 지식iN entries for '%s'", entity_name)
                else:
                    logger.warning("Naver 지식iN returned status %d", resp.status)
        except Exception as exc:
            logger.error("Naver 지식iN check failed: %s", exc)
        finally:
            if own_session:
                await session.close()

        return result

# ------------------------------------------------------------------
|
||||
# Recommendations
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
    def generate_recommendations(self, result: KnowledgeGraphResult) -> list[str]:
        """Generate actionable recommendations based on analysis."""
        recs: list[str] = []

        kp = result.knowledge_panel
        if not kp.detected:
            recs.append(
                "No Knowledge Panel detected. To get the entity registered with Google, "
                "create a Wikipedia page, add a Wikidata item, and implement "
                "structured data (Organization schema)."
            )
        elif kp.completeness_score < 50:
            recs.append(
                f"Knowledge Panel completeness is low at {kp.completeness_score}%. "
                "Fill in the missing attributes (social profiles, description, logo, etc.)."
            )

        if not result.wikipedia.present:
            recs.append(
                "No Wikipedia article exists. Once notable, reliable sources are secured, "
                "consider creating a Wikipedia article."
            )

        if not result.wikidata.present:
            recs.append(
                "No Wikidata item exists. Register the entity on Wikidata to "
                "strengthen Knowledge Graph recognition."
            )

        if not result.naver.encyclopedia_present:
            recs.append(
                "Not listed in the Naver encyclopedia. Consider a Naver encyclopedia "
                "listing for Korean-market SEO."
            )

        if result.naver.knowledge_in_count < 5:
            recs.append(
                "Related content on Naver 지식iN is sparse. Build brand entity "
                "awareness through Q&A content."
            )

        # Check social profile completeness
        attr_map = {a.name: a for a in kp.attributes}
        missing_social = []
        for soc in ["social_twitter", "social_facebook", "social_linkedin", "social_youtube"]:
            attr = attr_map.get(soc)
            if not attr or not attr.present:
                missing_social.append(soc.replace("social_", "").title())
        if missing_social:
            recs.append(
                f"Missing social profile links: {', '.join(missing_social)}. "
                "Add the social profiles to the sameAs property of the site's schema."
            )

        if not recs:
            recs.append("Knowledge Graph entity status is healthy. Maintain the current level.")

        return recs
    # ------------------------------------------------------------------
    # Main orchestrator
    # ------------------------------------------------------------------

    async def analyze(
        self,
        entity_name: str,
        language: str = "en",
        include_wiki: bool = True,
        include_naver: bool = True,
    ) -> KnowledgeGraphResult:
        """Orchestrate full Knowledge Graph analysis."""
        result = KnowledgeGraphResult(entity=entity_name, language=language)
        logger.info("Starting Knowledge Graph analysis for '%s' (lang=%s)", entity_name, language)

        async with aiohttp.ClientSession() as session:
            # Step 1: Search entity on Google
            search_data = await self.search_entity(entity_name, language, session)

            # Step 2: Detect Knowledge Panel
            kp = self.detect_knowledge_panel(search_data)

            # Step 3: Extract attributes
            if kp.detected:
                kp.attributes = self.extract_attributes(kp, search_data.get("html", ""))
                kp.completeness_score = self.score_completeness(kp.attributes)

                # Detect entity type from attributes
                for attr in kp.attributes:
                    if attr.name == "type" and attr.present:
                        kp.entity_type = attr.value
                        break

            result.knowledge_panel = kp

            # Step 4: Wikipedia and Wikidata checks (parallel)
            if include_wiki:
                wiki_task = self.check_wikipedia(entity_name, language, session)
                wikidata_task = self.check_wikidata(entity_name, session)
                result.wikipedia, result.wikidata = await asyncio.gather(wiki_task, wikidata_task)

            # Step 5: Naver checks (parallel)
            if include_naver:
                enc_task = self.check_naver_encyclopedia(entity_name, session)
                kin_task = self.check_naver_knowledge_in(entity_name, session)
                enc_result, kin_result = await asyncio.gather(enc_task, kin_task)

                result.naver = NaverPresence(
                    encyclopedia_present=enc_result.get("present", False),
                    encyclopedia_url=enc_result.get("url"),
                    knowledge_in_present=kin_result.get("present", False),
                    knowledge_in_count=kin_result.get("count", 0),
                    knowledge_in_url=kin_result.get("url"),
                )

            # Step 6: Compute overall score (weights sum to 100)
            scores = []
            if kp.detected:
                scores.append(kp.completeness_score * 0.35)
            else:
                scores.append(0)
            scores.append(20.0 if result.wikipedia.present else 0)
            scores.append(15.0 if result.wikidata.present else 0)
            scores.append(15.0 if result.naver.encyclopedia_present else 0)
            scores.append(15.0 if result.naver.knowledge_in_present else 0)
            result.overall_score = round(sum(scores), 1)

            # Step 7: Recommendations
            result.recommendations = self.generate_recommendations(result)

        logger.info("Analysis complete. Overall score: %.1f", result.overall_score)
        return result
# ---------------------------------------------------------------------------
# CLI display helpers
# ---------------------------------------------------------------------------


def display_result(result: KnowledgeGraphResult) -> None:
    """Display analysis result in a rich table."""
    console.print()
    console.print(f"[bold cyan]Knowledge Graph Analysis: {result.entity}[/bold cyan]")
    console.print(f"Language: {result.language} | Score: {result.overall_score}/100")
    console.print()

    # Knowledge Panel table
    kp = result.knowledge_panel
    table = Table(title="Knowledge Panel", show_header=True)
    table.add_column("Property", style="bold")
    table.add_column("Value")
    table.add_column("Status")

    table.add_row("Detected", str(kp.detected), "[green]OK[/]" if kp.detected else "[red]Missing[/]")
    table.add_row("Entity Type", kp.entity_type or "-", "[green]OK[/]" if kp.entity_type else "[yellow]Unknown[/]")
    table.add_row("Completeness", f"{kp.completeness_score}%", "[green]OK[/]" if kp.completeness_score >= 50 else "[red]Low[/]")

    for attr in kp.attributes:
        status = "[green]Present[/]" if attr.present else "[red]Missing[/]"
        table.add_row(f"  {attr.name}", attr.value or "-", status)

    console.print(table)
    console.print()

    # Platform presence table
    plat_table = Table(title="Platform Presence", show_header=True)
    plat_table.add_column("Platform", style="bold")
    plat_table.add_column("Present")
    plat_table.add_column("Details")

    plat_table.add_row(
        "Wikipedia",
        "[green]Yes[/]" if result.wikipedia.present else "[red]No[/]",
        result.wikipedia.url or "-",
    )
    plat_table.add_row(
        "Wikidata",
        "[green]Yes[/]" if result.wikidata.present else "[red]No[/]",
        result.wikidata.qid or "-",
    )
    plat_table.add_row(
        "Naver Encyclopedia",
        "[green]Yes[/]" if result.naver.encyclopedia_present else "[red]No[/]",
        result.naver.encyclopedia_url or "-",
    )
    plat_table.add_row(
        "Naver 지식iN",
        "[green]Yes[/]" if result.naver.knowledge_in_present else "[red]No[/]",
        f"{result.naver.knowledge_in_count} entries" if result.naver.knowledge_in_present else "-",
    )

    console.print(plat_table)
    console.print()

    # Recommendations
    console.print("[bold yellow]Recommendations:[/bold yellow]")
    for i, rec in enumerate(result.recommendations, 1):
        console.print(f"  {i}. {rec}")
    console.print()
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Knowledge Graph & Entity Analyzer",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("--entity", required=True, help="Entity name to analyze")
    parser.add_argument("--language", default="en", choices=["en", "ko", "ja", "zh"], help="Language (default: en)")
    parser.add_argument("--no-wiki", action="store_true", help="Skip Wikipedia/Wikidata check (included by default)")
    parser.add_argument("--no-naver", action="store_true", help="Skip Naver checks (included by default)")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    parser.add_argument("--output", type=str, help="Output file path")
    return parser.parse_args()


async def main() -> None:
    args = parse_args()

    analyzer = KnowledgeGraphAnalyzer()
    result = await analyzer.analyze(
        entity_name=args.entity,
        language=args.language,
        include_wiki=not args.no_wiki,
        include_naver=not args.no_naver,
    )

    if args.json:
        output = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(output)
            console.print(f"[green]Output saved to {args.output}[/green]")
        else:
            print(output)
    else:
        display_result(result)
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                json.dump(result.to_dict(), f, ensure_ascii=False, indent=2)
            console.print(f"[green]Output saved to {args.output}[/green]")


if __name__ == "__main__":
    asyncio.run(main())
@@ -0,0 +1,9 @@
# 28-seo-knowledge-graph dependencies
requests>=2.31.0
aiohttp>=3.9.0
beautifulsoup4>=4.12.0
lxml>=5.1.0
tenacity>=8.2.0
tqdm>=4.66.0
python-dotenv>=1.0.0
rich>=13.7.0