Add SEO skills 19-28, 31-32 with full Python implementations

12 new skills: Keyword Strategy, SERP Analysis, Position Tracking,
Link Building, Content Strategy, E-Commerce SEO, KPI Framework,
International SEO, AI Visibility, Knowledge Graph, Competitor Intel,
and Crawl Budget. ~20K lines of Python across 25 domain scripts.
Updated skill 11 pipeline table and repo CLAUDE.md.
Enhanced skill 18's local SEO workflow with findings from the jamie.clinic audit.

Note: Skill 26 hreflang_validator.py pending (content filter block).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
commit a3ff965b87 (parent 159f7ec3f7)
Date: 2026-02-13 12:05:59 +09:00
125 changed files with 25948 additions and 173 deletions


@@ -0,0 +1,139 @@
# CLAUDE.md
## Overview
Knowledge Graph and Entity SEO tool for analyzing brand entity presence in the Google Knowledge Graph, Knowledge Panels, People Also Ask (PAA), and FAQ rich results. It checks entity attribute completeness, Wikipedia/Wikidata presence, and the Korean equivalents (Naver Knowledge iN and Naver Encyclopedia). Data collection uses WebSearch and WebFetch; SERP feature detection uses Ahrefs `serp-overview`.
## Quick Start
```bash
pip install -r scripts/requirements.txt
# Knowledge Graph analysis
python scripts/knowledge_graph_analyzer.py --entity "Samsung Electronics" --json
# Entity SEO audit
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --json
```
## Scripts
| Script | Purpose | Key Output |
|--------|---------|------------|
| `knowledge_graph_analyzer.py` | Analyze Knowledge Panel and entity presence | KP detection, entity attributes, Wikipedia/Wikidata status |
| `entity_auditor.py` | Audit entity SEO signals and PAA/FAQ presence | PAA monitoring, FAQ schema tracking, entity completeness |
| `base_client.py` | Shared utilities | RateLimiter, ConfigManager, BaseAsyncClient |
## Knowledge Graph Analyzer
```bash
# Analyze entity in Knowledge Graph
python scripts/knowledge_graph_analyzer.py --entity "Samsung Electronics" --json
# Check with Korean name
python scripts/knowledge_graph_analyzer.py --entity "삼성전자" --language ko --json
# Include Wikipedia/Wikidata check
python scripts/knowledge_graph_analyzer.py --entity "Samsung" --wiki --json
```
**Capabilities**:
- Knowledge Panel detection via Google search
- Entity attribute extraction (name, description, logo, type, social profiles, website)
- Entity attribute completeness scoring
- Wikipedia article presence check
- Wikidata entity presence check (QID lookup)
- Naver encyclopedia (네이버 백과사전) presence
- Naver knowledge iN (지식iN) presence
- Knowledge Panel comparison with competitors
## Entity Auditor
```bash
# Full entity SEO audit
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --json
# PAA monitoring for brand keywords
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --paa --json
# FAQ rich result tracking
python scripts/entity_auditor.py --url https://example.com --entity "Brand Name" --faq --json
```
**Capabilities**:
- People Also Ask (PAA) monitoring for brand-related queries
- FAQ schema presence tracking (FAQPage schema -> SERP appearance)
- Entity markup audit (Organization, Person, LocalBusiness schema on website)
- Social profile linking validation (sameAs in schema)
- Brand SERP analysis (what appears when you search the brand name)
- Entity consistency across web properties
- Korean entity optimization (Korean Knowledge Panel, Naver profiles)
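Entity completeness is scored by counting which expected properties a schema block actually populates, as in `entity_auditor.py`. A minimal sketch (the property list matches the auditor's Organization set; the function name and sample data are illustrative):

```python
EXPECTED_ORG_PROPERTIES = [
    "name", "url", "logo", "description", "sameAs",
    "contactPoint", "address", "foundingDate", "founder",
    "numberOfEmployees", "email", "telephone",
]

def completeness(schema: dict, expected: list[str] = EXPECTED_ORG_PROPERTIES) -> float:
    """Percentage of expected properties present and non-empty."""
    present = [k for k in expected if schema.get(k)]
    return round(len(present) / len(expected) * 100, 1)

# A hypothetical Organization JSON-LD block with 6 of the 12 expected properties:
org = {
    "@type": "Organization",
    "name": "Brand Name",
    "url": "https://example.com",
    "logo": "https://example.com/logo.png",
    "description": "Example brand",
    "sameAs": ["https://twitter.com/brand", "https://www.linkedin.com/company/brand"],
    "address": {"@type": "PostalAddress", "addressCountry": "KR"},
}
print(completeness(org))  # 6 of 12 expected properties -> 50.0
```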
## Data Sources
| Source | Purpose |
|--------|---------|
| WebSearch | Search for entity/brand to detect Knowledge Panel |
| WebFetch | Fetch Wikipedia, Wikidata, Naver pages |
| Ahrefs `serp-overview` | SERP feature detection for entity keywords |
## Output Format
```json
{
"entity": "Samsung Electronics",
"knowledge_panel": {
"detected": true,
"attributes": {
"name": "Samsung Electronics",
"type": "Corporation",
"description": "...",
"logo": true,
"website": true,
"social_profiles": ["twitter", "facebook", "linkedin"]
},
"completeness_score": 85
},
"wikipedia": {"present": true, "url": "..."},
"wikidata": {"present": true, "qid": "Q20710"},
"naver_encyclopedia": {"present": true, "url": "..."},
"naver_knowledge_in": {"present": true, "entries": 15},
"paa_questions": [...],
"faq_rich_results": [...],
"entity_schema_on_site": {
"organization": true,
"same_as_links": 5,
"completeness": 78
},
"score": 75,
"timestamp": "2025-01-01T00:00:00"
}
```
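The `--json` output can be consumed downstream, for example to flag which presence checks failed. A small summary pass over the fields above (the report snippet is abbreviated; the gap-reporting logic is illustrative, not part of the tools):

```python
import json

report = json.loads("""{
  "entity": "Samsung Electronics",
  "knowledge_panel": {"detected": true, "completeness_score": 85},
  "wikipedia": {"present": true},
  "wikidata": {"present": true, "qid": "Q20710"},
  "naver_encyclopedia": {"present": false},
  "score": 75
}""")

# Collect the data sources where the entity is not yet present
gaps = [
    source for source in ("wikipedia", "wikidata", "naver_encyclopedia")
    if not report[source]["present"]
]
print(f"{report['entity']}: score {report['score']}, missing: {gaps or 'none'}")
```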
## Notion Output (Required)
**IMPORTANT**: All audit reports MUST be saved to the OurDigital SEO Audit Log database.
### Database Configuration
| Field | Value |
|-------|-------|
| Database ID | `2c8581e5-8a1e-8035-880b-e38cefc2f3ef` |
| URL | https://www.notion.so/dintelligence/2c8581e58a1e8035880be38cefc2f3ef |
### Required Properties
| Property | Type | Description |
|----------|------|-------------|
| Issue | Title | Report title (Korean + date) |
| Site | URL | Entity website URL |
| Category | Select | Knowledge Graph & Entity SEO |
| Priority | Select | Based on entity completeness |
| Found Date | Date | Audit date (YYYY-MM-DD) |
| Audit ID | Rich Text | Format: KG-YYYYMMDD-NNN |
### Language Guidelines
- Report content in Korean (한국어)
- Keep technical English terms as-is (e.g., Knowledge Panel, Knowledge Graph, PAA)
- URLs and code remain unchanged


@@ -0,0 +1,207 @@
"""
Base Client - Shared async client utilities
===========================================
Purpose: Rate-limited async operations for API clients
Python: 3.10+
"""
import asyncio
import logging
import os
from asyncio import Semaphore
from datetime import datetime
from typing import Any, Callable, TypeVar
from dotenv import load_dotenv
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
)
# Load environment variables
load_dotenv()
# Logging setup
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
T = TypeVar("T")
class RateLimiter:
"""Rate limiter using token bucket algorithm."""
def __init__(self, rate: float, per: float = 1.0):
"""
Initialize rate limiter.
Args:
rate: Number of requests allowed
per: Time period in seconds (default: 1 second)
"""
self.rate = rate
self.per = per
self.tokens = rate
self.last_update = datetime.now()
self._lock = asyncio.Lock()
async def acquire(self) -> None:
"""Acquire a token, waiting if necessary."""
async with self._lock:
now = datetime.now()
elapsed = (now - self.last_update).total_seconds()
self.tokens = min(self.rate, self.tokens + elapsed * (self.rate / self.per))
self.last_update = now
if self.tokens < 1:
wait_time = (1 - self.tokens) * (self.per / self.rate)
await asyncio.sleep(wait_time)
self.tokens = 0
else:
self.tokens -= 1
class BaseAsyncClient:
"""Base class for async API clients with rate limiting."""
def __init__(
self,
max_concurrent: int = 5,
requests_per_second: float = 3.0,
logger: logging.Logger | None = None,
):
"""
Initialize base client.
Args:
max_concurrent: Maximum concurrent requests
requests_per_second: Rate limit
logger: Logger instance
"""
self.semaphore = Semaphore(max_concurrent)
self.rate_limiter = RateLimiter(requests_per_second)
self.logger = logger or logging.getLogger(self.__class__.__name__)
self.stats = {
"requests": 0,
"success": 0,
"errors": 0,
"retries": 0,
}
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=10),
retry=retry_if_exception_type(Exception),
)
async def _rate_limited_request(
self,
coro: Callable[[], Any],
) -> Any:
"""Execute a request with rate limiting and retry."""
async with self.semaphore:
await self.rate_limiter.acquire()
self.stats["requests"] += 1
try:
result = await coro()
self.stats["success"] += 1
return result
except Exception as e:
self.stats["errors"] += 1
self.logger.error(f"Request failed: {e}")
raise
async def batch_requests(
self,
requests: list[Callable[[], Any]],
desc: str = "Processing",
) -> list[Any]:
"""Execute multiple requests concurrently."""
try:
from tqdm.asyncio import tqdm
has_tqdm = True
except ImportError:
has_tqdm = False
async def execute(req: Callable) -> Any:
try:
return await self._rate_limited_request(req)
except Exception as e:
return {"error": str(e)}
tasks = [execute(req) for req in requests]
if has_tqdm:
results = []
for coro in tqdm.as_completed(tasks, total=len(tasks), desc=desc):
result = await coro
results.append(result)
return results
else:
return await asyncio.gather(*tasks, return_exceptions=True)
def print_stats(self) -> None:
"""Print request statistics."""
self.logger.info("=" * 40)
self.logger.info("Request Statistics:")
self.logger.info(f" Total Requests: {self.stats['requests']}")
self.logger.info(f" Successful: {self.stats['success']}")
self.logger.info(f" Errors: {self.stats['errors']}")
self.logger.info("=" * 40)
class ConfigManager:
"""Manage API configuration and credentials."""
def __init__(self):
load_dotenv()
@property
def google_credentials_path(self) -> str | None:
"""Get Google service account credentials path."""
# Prefer SEO-specific credentials, fallback to general credentials
seo_creds = os.path.expanduser("~/.credential/ourdigital-seo-agent.json")
if os.path.exists(seo_creds):
return seo_creds
return os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
@property
def pagespeed_api_key(self) -> str | None:
"""Get PageSpeed Insights API key."""
return os.getenv("PAGESPEED_API_KEY")
@property
def custom_search_api_key(self) -> str | None:
"""Get Custom Search API key."""
return os.getenv("CUSTOM_SEARCH_API_KEY")
@property
def custom_search_engine_id(self) -> str | None:
"""Get Custom Search Engine ID."""
return os.getenv("CUSTOM_SEARCH_ENGINE_ID")
@property
def notion_token(self) -> str | None:
"""Get Notion API token."""
return os.getenv("NOTION_TOKEN") or os.getenv("NOTION_API_KEY")
def validate_google_credentials(self) -> bool:
"""Validate Google credentials are configured."""
creds_path = self.google_credentials_path
if not creds_path:
return False
return os.path.exists(creds_path)
def get_required(self, key: str) -> str:
"""Get required environment variable or raise error."""
value = os.getenv(key)
if not value:
raise ValueError(f"Missing required environment variable: {key}")
return value
# Singleton config instance
config = ConfigManager()


@@ -0,0 +1,902 @@
"""
Entity Auditor
===============
Purpose: Audit entity SEO signals including PAA monitoring, FAQ schema tracking,
entity markup validation, and brand SERP analysis.
Python: 3.10+
"""
import argparse
import asyncio
import json
import logging
import re
import sys
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any
from urllib.parse import quote, urljoin, urlparse
import aiohttp
from bs4 import BeautifulSoup
from rich.console import Console
from rich.table import Table
from base_client import BaseAsyncClient, ConfigManager, config
logger = logging.getLogger(__name__)
console = Console()
# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------
@dataclass
class PaaQuestion:
"""A People Also Ask question found in SERP."""
question: str = ""
keyword: str = ""
position: int = 0
source_url: str | None = None
@dataclass
class FaqRichResult:
"""FAQ rich result tracking entry."""
url: str = ""
question_count: int = 0
appearing_in_serp: bool = False
questions: list[str] = field(default_factory=list)
schema_valid: bool = False
@dataclass
class EntitySchema:
"""Entity structured data found on a website."""
type: str = "" # Organization, Person, LocalBusiness, etc.
properties: dict[str, Any] = field(default_factory=dict)
same_as_links: list[str] = field(default_factory=list)
completeness: float = 0.0
issues: list[str] = field(default_factory=list)
@dataclass
class BrandSerpResult:
"""What appears when searching for the brand name."""
query: str = ""
features: list[str] = field(default_factory=list)
paa_count: int = 0
faq_count: int = 0
knowledge_panel: bool = False
sitelinks: bool = False
social_profiles: list[str] = field(default_factory=list)
top_results: list[dict[str, str]] = field(default_factory=list)
@dataclass
class EntityAuditResult:
"""Full entity SEO audit result."""
url: str = ""
entity_name: str = ""
paa_questions: list[PaaQuestion] = field(default_factory=list)
faq_rich_results: list[FaqRichResult] = field(default_factory=list)
entity_schemas: list[EntitySchema] = field(default_factory=list)
brand_serp: BrandSerpResult = field(default_factory=BrandSerpResult)
social_profile_status: dict[str, bool] = field(default_factory=dict)
overall_score: float = 0.0
recommendations: list[str] = field(default_factory=list)
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
def to_dict(self) -> dict[str, Any]:
return asdict(self)
# ---------------------------------------------------------------------------
# Entity Auditor
# ---------------------------------------------------------------------------
class EntityAuditor(BaseAsyncClient):
"""Audit entity SEO signals and rich result presence."""
GOOGLE_SEARCH_URL = "https://www.google.com/search"
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
PAA_KEYWORD_TEMPLATES = [
"{entity}",
"{entity} reviews",
"{entity} vs",
"what is {entity}",
"{entity} pricing",
"{entity} alternatives",
"is {entity} good",
"{entity} benefits",
"how to use {entity}",
"{entity} complaints",
]
EXPECTED_SCHEMA_PROPERTIES = {
"Organization": [
"name", "url", "logo", "description", "sameAs",
"contactPoint", "address", "foundingDate", "founder",
"numberOfEmployees", "email", "telephone",
],
"Person": [
"name", "url", "image", "description", "sameAs",
"jobTitle", "worksFor", "alumniOf", "birthDate",
],
"LocalBusiness": [
"name", "url", "image", "description", "sameAs",
"address", "telephone", "openingHours", "geo",
"priceRange", "aggregateRating",
],
}
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.config = config
# ------------------------------------------------------------------
# PAA monitoring
# ------------------------------------------------------------------
async def monitor_paa(
self,
entity_name: str,
keywords: list[str] | None = None,
session: aiohttp.ClientSession | None = None,
) -> list[PaaQuestion]:
"""Search brand keywords and extract People Also Ask questions."""
if keywords is None:
keywords = [t.format(entity=entity_name) for t in self.PAA_KEYWORD_TEMPLATES]
paa_questions: list[PaaQuestion] = []
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
for keyword in keywords:
params = {"q": keyword, "hl": "en", "gl": "us"}
try:
async with session.get(
self.GOOGLE_SEARCH_URL, params=params, headers=self.HEADERS,
timeout=aiohttp.ClientTimeout(total=20),
) as resp:
if resp.status != 200:
logger.warning("Search for '%s' returned status %d", keyword, resp.status)
continue
html = await resp.text()
soup = BeautifulSoup(html, "lxml")
# PAA box selectors
paa_selectors = [
"div[data-sgrd] div[data-q]",
"div.related-question-pair",
"div[jsname] div[data-q]",
"div.wQiwMc",
]
position = 0
for selector in paa_selectors:
elements = soup.select(selector)
for el in elements:
question_text = el.get("data-q", "") or el.get_text(strip=True)
if question_text and len(question_text) > 5:
position += 1
paa_questions.append(PaaQuestion(
question=question_text,
keyword=keyword,
position=position,
))
                        # Fallback: regex for PAA-like questions, used only when
                        # this keyword's SERP yielded no structured PAA elements
                        if position == 0:
text = soup.get_text(separator="\n")
q_patterns = re.findall(
r"((?:What|How|Why|When|Where|Who|Is|Can|Does|Do|Which)\s+[^?\n]{10,80}\??)",
text,
)
for i, q in enumerate(q_patterns[:8]):
paa_questions.append(PaaQuestion(
question=q.strip(),
keyword=keyword,
position=i + 1,
))
except Exception as exc:
logger.error("PAA search failed for '%s': %s", keyword, exc)
continue
# Rate limit between searches
await asyncio.sleep(1.5)
finally:
if own_session:
await session.close()
# Deduplicate questions
seen = set()
unique = []
for q in paa_questions:
key = q.question.lower().strip()
if key not in seen:
seen.add(key)
unique.append(q)
logger.info("Found %d unique PAA questions for '%s'", len(unique), entity_name)
return unique
# ------------------------------------------------------------------
# FAQ rich result tracking
# ------------------------------------------------------------------
async def track_faq_rich_results(
self,
url: str,
session: aiohttp.ClientSession | None = None,
) -> list[FaqRichResult]:
"""Check pages for FAQPage schema and SERP appearance."""
faq_results: list[FaqRichResult] = []
domain = urlparse(url).netloc
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
# Fetch the page and look for FAQ schema
async with session.get(
url, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=20),
) as resp:
if resp.status != 200:
logger.warning("Page %s returned status %d", url, resp.status)
return faq_results
html = await resp.text()
soup = BeautifulSoup(html, "lxml")
# Find JSON-LD scripts with FAQPage
scripts = soup.find_all("script", type="application/ld+json")
for script in scripts:
try:
data = json.loads(script.string or "{}")
items = data if isinstance(data, list) else [data]
for item in items:
schema_type = item.get("@type", "")
if schema_type == "FAQPage" or (
isinstance(schema_type, list) and "FAQPage" in schema_type
):
questions = item.get("mainEntity", [])
faq = FaqRichResult(
url=url,
question_count=len(questions),
questions=[
q.get("name", "") for q in questions if isinstance(q, dict)
],
schema_valid=True,
)
faq_results.append(faq)
# Check for nested @graph
graph = item.get("@graph", [])
for g_item in graph:
if g_item.get("@type") == "FAQPage":
questions = g_item.get("mainEntity", [])
faq = FaqRichResult(
url=url,
question_count=len(questions),
questions=[
q.get("name", "") for q in questions if isinstance(q, dict)
],
schema_valid=True,
)
faq_results.append(faq)
except json.JSONDecodeError:
continue
# Also check for microdata FAQ markup
faq_items = soup.select("[itemtype*='FAQPage'] [itemprop='mainEntity']")
if faq_items and not faq_results:
questions = []
for item in faq_items:
q_el = item.select_one("[itemprop='name']")
if q_el:
questions.append(q_el.get_text(strip=True))
faq_results.append(FaqRichResult(
url=url,
question_count=len(questions),
questions=questions,
schema_valid=True,
))
except Exception as exc:
logger.error("FAQ tracking failed for %s: %s", url, exc)
finally:
if own_session:
await session.close()
logger.info("Found %d FAQ schemas on %s", len(faq_results), url)
return faq_results
# ------------------------------------------------------------------
# Entity schema audit
# ------------------------------------------------------------------
async def audit_entity_schema(
self,
url: str,
session: aiohttp.ClientSession | None = None,
) -> list[EntitySchema]:
"""Check Organization/Person/LocalBusiness schema on website."""
schemas: list[EntitySchema] = []
target_types = {"Organization", "Person", "LocalBusiness", "Corporation", "MedicalBusiness"}
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
async with session.get(
url, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=20),
) as resp:
if resp.status != 200:
logger.warning("Page %s returned status %d", url, resp.status)
return schemas
html = await resp.text()
soup = BeautifulSoup(html, "lxml")
scripts = soup.find_all("script", type="application/ld+json")
for script in scripts:
try:
data = json.loads(script.string or "{}")
items = data if isinstance(data, list) else [data]
# Include @graph nested items
expanded = []
for item in items:
expanded.append(item)
if "@graph" in item:
expanded.extend(item["@graph"])
for item in expanded:
item_type = item.get("@type", "")
if isinstance(item_type, list):
matching = [t for t in item_type if t in target_types]
if not matching:
continue
item_type = matching[0]
elif item_type not in target_types:
continue
same_as = item.get("sameAs", [])
if isinstance(same_as, str):
same_as = [same_as]
# Calculate completeness
base_type = item_type
if base_type == "Corporation":
base_type = "Organization"
elif base_type == "MedicalBusiness":
base_type = "LocalBusiness"
expected = self.EXPECTED_SCHEMA_PROPERTIES.get(base_type, [])
present = [k for k in expected if k in item and item[k]]
completeness = round((len(present) / len(expected)) * 100, 1) if expected else 0
# Check for issues
issues = []
if "name" not in item:
issues.append("Missing 'name' property")
if "url" not in item:
issues.append("Missing 'url' property")
if not same_as:
issues.append("No 'sameAs' links (social profiles)")
if "logo" not in item and base_type == "Organization":
issues.append("Missing 'logo' property")
if "description" not in item:
issues.append("Missing 'description' property")
schema = EntitySchema(
type=item_type,
properties={k: (str(v)[:100] if not isinstance(v, (list, dict)) else v) for k, v in item.items() if k != "@context"},
same_as_links=same_as,
completeness=completeness,
issues=issues,
)
schemas.append(schema)
except json.JSONDecodeError:
continue
except Exception as exc:
logger.error("Entity schema audit failed for %s: %s", url, exc)
finally:
if own_session:
await session.close()
logger.info("Found %d entity schemas on %s", len(schemas), url)
return schemas
# ------------------------------------------------------------------
# Brand SERP analysis
# ------------------------------------------------------------------
async def analyze_brand_serp(
self,
entity_name: str,
session: aiohttp.ClientSession | None = None,
) -> BrandSerpResult:
"""Analyze what appears in SERP for the brand name search."""
result = BrandSerpResult(query=entity_name)
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
params = {"q": entity_name, "hl": "en", "gl": "us"}
async with session.get(
self.GOOGLE_SEARCH_URL, params=params, headers=self.HEADERS,
timeout=aiohttp.ClientTimeout(total=20),
) as resp:
if resp.status != 200:
return result
html = await resp.text()
soup = BeautifulSoup(html, "lxml")
text = soup.get_text(separator=" ", strip=True).lower()
# Detect SERP features
feature_indicators = {
"knowledge_panel": ["kp-wholepage", "knowledge-panel", "kno-"],
"sitelinks": ["sitelinks", "site-links"],
"people_also_ask": ["related-question-pair", "data-q"],
"faq_rich_result": ["faqpage", "frequently asked"],
"featured_snippet": ["featured-snippet", "data-tts"],
"image_pack": ["image-result", "img-brk"],
"video_carousel": ["video-result", "vid-"],
"twitter_carousel": ["twitter-timeline", "g-scrolling-carousel"],
"reviews": ["star-rating", "aggregate-rating"],
"local_pack": ["local-pack", "local_pack"],
}
for feature, indicators in feature_indicators.items():
for ind in indicators:
if ind in str(soup).lower():
result.features.append(feature)
break
result.knowledge_panel = "knowledge_panel" in result.features
result.sitelinks = "sitelinks" in result.features
# Count PAA questions
paa_elements = soup.select("div[data-q], div.related-question-pair")
result.paa_count = len(paa_elements)
if result.paa_count > 0 and "people_also_ask" not in result.features:
result.features.append("people_also_ask")
# Detect social profiles in results
social_domains = {
"twitter.com": "twitter", "x.com": "twitter",
"facebook.com": "facebook", "linkedin.com": "linkedin",
"youtube.com": "youtube", "instagram.com": "instagram",
"github.com": "github", "pinterest.com": "pinterest",
}
links = soup.find_all("a", href=True)
for link in links:
href = link["href"]
for domain, name in social_domains.items():
if domain in href and name not in result.social_profiles:
result.social_profiles.append(name)
# Extract top organic results
result_divs = soup.select("div.g, div[data-sokoban-container]")[:10]
for div in result_divs:
title_el = div.select_one("h3")
link_el = div.select_one("a[href]")
if title_el and link_el:
result.top_results.append({
"title": title_el.get_text(strip=True),
"url": link_el.get("href", ""),
})
except Exception as exc:
logger.error("Brand SERP analysis failed for '%s': %s", entity_name, exc)
finally:
if own_session:
await session.close()
return result
# ------------------------------------------------------------------
# Social profile link validation
# ------------------------------------------------------------------
async def check_social_profile_links(
self,
same_as_links: list[str],
session: aiohttp.ClientSession | None = None,
) -> dict[str, bool]:
"""Validate sameAs URLs are accessible."""
status: dict[str, bool] = {}
if not same_as_links:
return status
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
for link in same_as_links:
try:
async with session.head(
link, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=10),
allow_redirects=True,
) as resp:
status[link] = resp.status < 400
except Exception:
status[link] = False
await asyncio.sleep(0.5)
finally:
if own_session:
await session.close()
accessible = sum(1 for v in status.values() if v)
logger.info("Social profile links: %d/%d accessible", accessible, len(status))
return status
# ------------------------------------------------------------------
# Recommendations
# ------------------------------------------------------------------
def generate_recommendations(self, result: EntityAuditResult) -> list[str]:
"""Generate actionable entity SEO improvement recommendations."""
recs: list[str] = []
# PAA recommendations
if not result.paa_questions:
recs.append(
"브랜드 관련 People Also Ask(PAA) 질문이 감지되지 않았습니다. "
"FAQ 콘텐츠를 작성하여 PAA 노출 기회를 확보하세요."
)
elif len(result.paa_questions) < 5:
recs.append(
f"PAA 질문이 {len(result.paa_questions)}개만 감지되었습니다. "
"더 다양한 키워드에 대한 Q&A 콘텐츠를 강화하세요."
)
# FAQ schema recommendations
if not result.faq_rich_results:
recs.append(
"FAQPage schema가 감지되지 않았습니다. "
"FAQ 페이지에 FAQPage JSON-LD를 추가하여 Rich Result를 확보하세요."
)
else:
invalid = [f for f in result.faq_rich_results if not f.schema_valid]
if invalid:
recs.append(
f"{len(invalid)}개의 FAQ schema에 유효성 문제가 있습니다. "
"Google Rich Results Test로 검증하세요."
)
# Entity schema recommendations
if not result.entity_schemas:
recs.append(
"Organization/Person/LocalBusiness schema가 없습니다. "
"홈페이지에 Organization schema JSON-LD를 추가하세요."
)
else:
for schema in result.entity_schemas:
if schema.completeness < 50:
recs.append(
f"{schema.type} schema 완성도가 {schema.completeness}%입니다. "
f"누락 항목: {', '.join(schema.issues[:3])}"
)
if not schema.same_as_links:
recs.append(
f"{schema.type} schema에 sameAs 속성이 없습니다. "
"소셜 미디어 프로필 URL을 sameAs에 추가하세요."
)
# Brand SERP recommendations
serp = result.brand_serp
if not serp.knowledge_panel:
recs.append(
"브랜드 검색 시 Knowledge Panel이 표시되지 않습니다. "
"Wikipedia, Wikidata, 구조화된 데이터를 통해 엔티티 인식을 강화하세요."
)
if not serp.sitelinks:
recs.append(
"Sitelinks가 표시되지 않습니다. "
"사이트 구조와 내부 링크를 개선하세요."
)
if len(serp.social_profiles) < 3:
recs.append(
f"SERP에 소셜 프로필이 {len(serp.social_profiles)}개만 표시됩니다. "
"주요 소셜 미디어 프로필을 활성화하고 schema sameAs에 연결하세요."
)
# Social profile accessibility
broken = [url for url, ok in result.social_profile_status.items() if not ok]
if broken:
recs.append(
f"접근 불가한 소셜 프로필 링크 {len(broken)}개: "
f"{', '.join(broken[:3])}. sameAs URL을 업데이트하세요."
)
if not recs:
recs.append("Entity SEO 상태가 양호합니다. 현재 수준을 유지하세요.")
return recs
# ------------------------------------------------------------------
# Scoring
# ------------------------------------------------------------------
def compute_score(self, result: EntityAuditResult) -> float:
"""Compute overall entity SEO score (0-100)."""
score = 0.0
# PAA presence (15 points)
paa_count = len(result.paa_questions)
if paa_count >= 10:
score += 15
elif paa_count >= 5:
score += 10
elif paa_count > 0:
score += 5
# FAQ schema (15 points)
if result.faq_rich_results:
valid_count = sum(1 for f in result.faq_rich_results if f.schema_valid)
score += min(15, valid_count * 5)
# Entity schema (25 points)
if result.entity_schemas:
best_completeness = max(s.completeness for s in result.entity_schemas)
score += best_completeness * 0.25
# Brand SERP features (25 points)
serp = result.brand_serp
if serp.knowledge_panel:
score += 10
if serp.sitelinks:
score += 5
score += min(10, len(serp.features) * 2)
# Social profiles (10 points)
if result.social_profile_status:
accessible = sum(1 for v in result.social_profile_status.values() if v)
total = len(result.social_profile_status)
score += (accessible / total) * 10 if total > 0 else 0
# sameAs links (10 points)
total_same_as = sum(len(s.same_as_links) for s in result.entity_schemas)
score += min(10, total_same_as * 2)
return round(min(100, score), 1)
# ------------------------------------------------------------------
# Main orchestrator
# ------------------------------------------------------------------
async def audit(
self,
url: str,
entity_name: str,
include_paa: bool = True,
include_faq: bool = True,
) -> EntityAuditResult:
"""Orchestrate full entity SEO audit."""
result = EntityAuditResult(url=url, entity_name=entity_name)
logger.info("Starting entity audit for '%s' at %s", entity_name, url)
async with aiohttp.ClientSession() as session:
# Parallel tasks: entity schema, brand SERP, FAQ
tasks = [
self.audit_entity_schema(url, session),
self.analyze_brand_serp(entity_name, session),
]
if include_faq:
tasks.append(self.track_faq_rich_results(url, session))
results = await asyncio.gather(*tasks, return_exceptions=True)
# Unpack results
if not isinstance(results[0], Exception):
result.entity_schemas = results[0]
else:
logger.error("Entity schema audit failed: %s", results[0])
if not isinstance(results[1], Exception):
result.brand_serp = results[1]
else:
logger.error("Brand SERP analysis failed: %s", results[1])
if include_faq and len(results) > 2 and not isinstance(results[2], Exception):
result.faq_rich_results = results[2]
# PAA monitoring (sequential due to rate limits)
if include_paa:
result.paa_questions = await self.monitor_paa(entity_name, session=session)
# Validate social profile links from schema
all_same_as = []
for schema in result.entity_schemas:
all_same_as.extend(schema.same_as_links)
if all_same_as:
result.social_profile_status = await self.check_social_profile_links(
list(set(all_same_as)), session
)
# Compute score and recommendations
result.overall_score = self.compute_score(result)
result.recommendations = self.generate_recommendations(result)
logger.info("Entity audit complete. Score: %.1f", result.overall_score)
return result
# ---------------------------------------------------------------------------
# CLI display helpers
# ---------------------------------------------------------------------------
def display_result(result: EntityAuditResult) -> None:
"""Display audit result in rich tables."""
console.print()
console.print(f"[bold cyan]Entity SEO Audit: {result.entity_name}[/bold cyan]")
console.print(f"URL: {result.url} | Score: {result.overall_score}/100")
console.print()
# Entity Schema table
if result.entity_schemas:
table = Table(title="Entity Schema Markup", show_header=True)
table.add_column("Type", style="bold")
table.add_column("Completeness")
table.add_column("sameAs Links")
table.add_column("Issues")
for schema in result.entity_schemas:
issues_text = "; ".join(schema.issues[:3]) if schema.issues else "None"
table.add_row(
schema.type,
f"{schema.completeness}%",
str(len(schema.same_as_links)),
issues_text,
)
console.print(table)
else:
console.print("[red]No entity schema markup found on website![/red]")
console.print()
# Brand SERP table
serp = result.brand_serp
serp_table = Table(title="Brand SERP Analysis", show_header=True)
serp_table.add_column("Feature", style="bold")
serp_table.add_column("Status")
serp_table.add_row("Knowledge Panel", "[green]Yes[/]" if serp.knowledge_panel else "[red]No[/]")
serp_table.add_row("Sitelinks", "[green]Yes[/]" if serp.sitelinks else "[red]No[/]")
serp_table.add_row("PAA Count", str(serp.paa_count))
serp_table.add_row("SERP Features", ", ".join(serp.features) if serp.features else "None")
serp_table.add_row("Social Profiles", ", ".join(serp.social_profiles) if serp.social_profiles else "None")
console.print(serp_table)
console.print()
# PAA Questions
if result.paa_questions:
paa_table = Table(title=f"People Also Ask ({len(result.paa_questions)} questions)", show_header=True)
paa_table.add_column("#", style="dim")
paa_table.add_column("Question")
paa_table.add_column("Keyword")
for i, q in enumerate(result.paa_questions[:15], 1):
paa_table.add_row(str(i), q.question, q.keyword)
console.print(paa_table)
console.print()
# FAQ Rich Results
if result.faq_rich_results:
faq_table = Table(title="FAQ Rich Results", show_header=True)
faq_table.add_column("URL")
faq_table.add_column("Questions")
faq_table.add_column("Valid")
for faq in result.faq_rich_results:
faq_table.add_row(
faq.url[:60],
str(faq.question_count),
"[green]Yes[/]" if faq.schema_valid else "[red]No[/]",
)
console.print(faq_table)
console.print()
# Social Profile Status
if result.social_profile_status:
sp_table = Table(title="Social Profile Link Status", show_header=True)
sp_table.add_column("URL")
sp_table.add_column("Accessible")
for link, accessible in result.social_profile_status.items():
sp_table.add_row(
link[:70],
"[green]Yes[/]" if accessible else "[red]No[/]",
)
console.print(sp_table)
console.print()
# Recommendations
console.print("[bold yellow]Recommendations:[/bold yellow]")
for i, rec in enumerate(result.recommendations, 1):
console.print(f" {i}. {rec}")
console.print()
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Entity SEO Auditor",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--url", required=True, help="Website URL to audit")
parser.add_argument("--entity", required=True, help="Entity/brand name")
parser.add_argument("--paa", action="store_true", default=True, help="Include PAA monitoring (default: True)")
parser.add_argument("--no-paa", action="store_true", help="Skip PAA monitoring")
parser.add_argument("--faq", action="store_true", default=True, help="Include FAQ tracking (default: True)")
parser.add_argument("--no-faq", action="store_true", help="Skip FAQ tracking")
parser.add_argument("--json", action="store_true", help="Output as JSON")
parser.add_argument("--output", type=str, help="Output file path")
return parser.parse_args()
async def main() -> None:
args = parse_args()
auditor = EntityAuditor()
result = await auditor.audit(
url=args.url,
entity_name=args.entity,
include_paa=not args.no_paa,
include_faq=not args.no_faq,
)
if args.json:
output = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(output)
console.print(f"[green]Output saved to {args.output}[/green]")
else:
print(output)
else:
display_result(result)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
json.dump(result.to_dict(), f, ensure_ascii=False, indent=2)
console.print(f"[green]Output saved to {args.output}[/green]")
if __name__ == "__main__":
asyncio.run(main())
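
The CLI above enables each optional check by default and exposes only negative flags; `main()` derives the include switches by negating `args.no_paa` and `args.no_faq`. A minimal standalone sketch of that resolution (hypothetical `build_parser`, not part of the script):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the negative-flag pattern used by entity_auditor.py's parse_args().
    p = argparse.ArgumentParser()
    p.add_argument("--url", required=True)
    p.add_argument("--entity", required=True)
    p.add_argument("--no-paa", action="store_true", help="Skip PAA monitoring")
    p.add_argument("--no-faq", action="store_true", help="Skip FAQ tracking")
    return p

args = build_parser().parse_args(
    ["--url", "https://example.com", "--entity", "Acme", "--no-paa"]
)
# Checks run unless explicitly disabled.
include_paa = not args.no_paa
include_faq = not args.no_faq
print(include_paa, include_faq)  # False True
```

Keeping only the `--no-*` flags avoids the redundant `--paa`/`--faq` options, which would be no-ops when the default is already `True`.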

@@ -0,0 +1,782 @@
"""
Knowledge Graph Analyzer
=========================
Purpose: Analyze entity presence in Google Knowledge Graph, Knowledge Panels,
Wikipedia, Wikidata, and Korean equivalents (Naver encyclopedia, 지식iN).
Python: 3.10+
"""
import argparse
import asyncio
import json
import logging
import re
import sys
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any
from urllib.parse import quote, urljoin
import aiohttp
from bs4 import BeautifulSoup
from rich.console import Console
from rich.table import Table
from base_client import BaseAsyncClient, ConfigManager, config
logger = logging.getLogger(__name__)
console = Console()
# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------
EXPECTED_ATTRIBUTES = [
"name",
"type",
"description",
"logo",
"website",
"founded",
"ceo",
"headquarters",
"parent_organization",
"subsidiaries",
"social_twitter",
"social_facebook",
"social_linkedin",
"social_youtube",
"social_instagram",
"stock_ticker",
"industry",
"employees",
"revenue",
]
@dataclass
class KnowledgePanelAttribute:
"""Single attribute extracted from a Knowledge Panel."""
name: str
value: str | None = None
present: bool = False
@dataclass
class KnowledgePanel:
"""Represents a detected Knowledge Panel."""
detected: bool = False
entity_type: str | None = None
attributes: list[KnowledgePanelAttribute] = field(default_factory=list)
completeness_score: float = 0.0
raw_snippet: str | None = None
@dataclass
class WikiPresence:
"""Wikipedia or Wikidata presence record."""
platform: str = "" # "wikipedia" or "wikidata"
present: bool = False
url: str | None = None
qid: str | None = None # Wikidata QID (e.g. Q20710)
language: str = "en"
@dataclass
class NaverPresence:
"""Naver encyclopedia and 지식iN presence."""
encyclopedia_present: bool = False
encyclopedia_url: str | None = None
knowledge_in_present: bool = False
knowledge_in_count: int = 0
knowledge_in_url: str | None = None
@dataclass
class KnowledgeGraphResult:
"""Full Knowledge Graph analysis result."""
entity: str = ""
language: str = "en"
knowledge_panel: KnowledgePanel = field(default_factory=KnowledgePanel)
wikipedia: WikiPresence = field(default_factory=lambda: WikiPresence(platform="wikipedia"))
wikidata: WikiPresence = field(default_factory=lambda: WikiPresence(platform="wikidata"))
naver: NaverPresence = field(default_factory=NaverPresence)
competitors: list[dict[str, Any]] = field(default_factory=list)
overall_score: float = 0.0
recommendations: list[str] = field(default_factory=list)
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
def to_dict(self) -> dict[str, Any]:
return asdict(self)
# ---------------------------------------------------------------------------
# Knowledge Graph Analyzer
# ---------------------------------------------------------------------------
class KnowledgeGraphAnalyzer(BaseAsyncClient):
"""Analyze entity presence in Knowledge Graph and related platforms."""
GOOGLE_SEARCH_URL = "https://www.google.com/search"
WIKIPEDIA_API_URL = "https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
WIKIDATA_API_URL = "https://www.wikidata.org/w/api.php"
NAVER_SEARCH_URL = "https://search.naver.com/search.naver"
NAVER_ENCYCLOPEDIA_URL = "https://terms.naver.com/search.naver"
NAVER_KIN_URL = "https://kin.naver.com/search/list.naver"
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.config = config
# ------------------------------------------------------------------
# Google entity search
# ------------------------------------------------------------------
async def search_entity(
self,
entity_name: str,
language: str = "en",
session: aiohttp.ClientSession | None = None,
) -> dict[str, Any]:
"""Search Google for entity to detect Knowledge Panel signals."""
params = {"q": entity_name, "hl": language, "gl": "us" if language == "en" else "kr"}
headers = {**self.HEADERS}
if language == "ko":
headers["Accept-Language"] = "ko-KR,ko;q=0.9"
params["gl"] = "kr"
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
async with session.get(
self.GOOGLE_SEARCH_URL, params=params, headers=headers, timeout=aiohttp.ClientTimeout(total=20)
) as resp:
if resp.status != 200:
logger.warning("Google search returned status %d", resp.status)
return {"html": "", "status": resp.status}
html = await resp.text()
return {"html": html, "status": resp.status}
except Exception as exc:
logger.error("Google search failed: %s", exc)
return {"html": "", "status": 0, "error": str(exc)}
finally:
if own_session:
await session.close()
# ------------------------------------------------------------------
# Knowledge Panel detection
# ------------------------------------------------------------------
def detect_knowledge_panel(self, search_data: dict[str, Any]) -> KnowledgePanel:
"""Parse search results HTML for Knowledge Panel indicators."""
html = search_data.get("html", "")
if not html:
return KnowledgePanel(detected=False)
soup = BeautifulSoup(html, "lxml")
kp = KnowledgePanel()
# Knowledge Panel is typically in a div with class 'kp-wholepage' or 'knowledge-panel'
kp_selectors = [
"div.kp-wholepage",
"div.knowledge-panel",
"div[data-attrid='title']",
"div.kp-header",
"div[class*='kno-']",
"div.osrp-blk",
]
kp_element = None
for selector in kp_selectors:
kp_element = soup.select_one(selector)
if kp_element:
break
if kp_element:
kp.detected = True
kp.raw_snippet = kp_element.get_text(separator=" ", strip=True)[:500]
else:
# Fallback: check for common KP text patterns
text = soup.get_text(separator=" ", strip=True).lower()
kp_indicators = [
"wikipedia", "description", "founded", "ceo",
"headquarters", "subsidiaries", "parent organization",
]
matches = sum(1 for ind in kp_indicators if ind in text)
if matches >= 3:
kp.detected = True
kp.raw_snippet = text[:500]
return kp
# ------------------------------------------------------------------
# Attribute extraction
# ------------------------------------------------------------------
def extract_attributes(self, kp: KnowledgePanel, html: str = "") -> list[KnowledgePanelAttribute]:
"""Extract entity attributes from Knowledge Panel data."""
attributes: list[KnowledgePanelAttribute] = []
text = (kp.raw_snippet or "").lower()
# Parse HTML for structured attribute data
soup = BeautifulSoup(html, "lxml") if html else None
attribute_patterns = {
"name": r"^(.+?)(?:\s+is\s+|\s*[-|]\s*)",
"type": r"(?:is\s+(?:a|an)\s+)(\w[\w\s]+?)(?:\.|,|\s+based)",
"description": r"(?:is\s+)(.{20,200}?)(?:\.\s)",
"founded": r"(?:founded|established|incorporated)\s*(?:in|:)?\s*(\d{4})",
"ceo": r"(?:ceo|chief executive|chairman)\s*(?::|is)?\s*([A-Z][\w\s.]+?)(?:,|\.|;|\s{2})",
"headquarters": r"(?:headquarters?|hq|based in)\s*(?::|is|in)?\s*([A-Z][\w\s,]+?)(?:\.|;|\s{2})",
"stock_ticker": r"(?:stock|ticker|symbol)\s*(?::|is)?\s*([A-Z]{1,5}(?:\s*:\s*[A-Z]{1,5})?)",
"employees": r"(?:employees?|staff|workforce)\s*(?::|is)?\s*([\d,]+)",
"revenue": r"(?:revenue|sales)\s*(?::|is)?\s*([\$\d,.]+\s*(?:billion|million|B|M)?)",
"industry": r"(?:industry|sector)\s*(?::|is)?\s*([\w\s&]+?)(?:\.|,|;)",
}
social_patterns = {
"social_twitter": r"(?:twitter\.com|x\.com)/(\w+)",
"social_facebook": r"facebook\.com/([\w.]+)",
"social_linkedin": r"linkedin\.com/(?:company|in)/([\w-]+)",
"social_youtube": r"youtube\.com/(?:@|channel/|user/)([\w-]+)",
"social_instagram": r"instagram\.com/([\w.]+)",
}
full_text = kp.raw_snippet or ""
html_text = ""
if soup:
html_text = soup.get_text(separator=" ", strip=True)
combined = f"{full_text} {html_text}"
for attr_name, pattern in attribute_patterns.items():
match = re.search(pattern, combined, re.IGNORECASE)
present = match is not None
value = match.group(1).strip() if match else None
attributes.append(KnowledgePanelAttribute(name=attr_name, value=value, present=present))
# Social profiles
for attr_name, pattern in social_patterns.items():
match = re.search(pattern, combined, re.IGNORECASE)
present = match is not None
value = match.group(1).strip() if match else None
attributes.append(KnowledgePanelAttribute(name=attr_name, value=value, present=present))
# Logo detection from HTML
logo_present = False
if soup:
logo_img = soup.select_one("img[data-atf], g-img img, img.kno-fb-img, img[alt*='logo']")
if logo_img:
logo_present = True
attributes.append(KnowledgePanelAttribute(name="logo", value=None, present=logo_present))
# Website detection
website_present = False
if soup:
site_link = soup.select_one("a[data-attrid*='website'], a.ab_button[href*='http']")
if site_link:
website_present = True
value = site_link.get("href", "")
attributes.append(KnowledgePanelAttribute(name="website", value=value if website_present else None, present=website_present))
return attributes
# ------------------------------------------------------------------
# Completeness scoring
# ------------------------------------------------------------------
def score_completeness(self, attributes: list[KnowledgePanelAttribute]) -> float:
"""Score attribute completeness (0-100) based on filled vs expected."""
if not attributes:
return 0.0
weights = {
"name": 10, "type": 8, "description": 10, "logo": 8, "website": 10,
"founded": 5, "ceo": 5, "headquarters": 5, "parent_organization": 3,
"subsidiaries": 3, "social_twitter": 4, "social_facebook": 4,
"social_linkedin": 4, "social_youtube": 3, "social_instagram": 3,
"stock_ticker": 3, "industry": 5, "employees": 3, "revenue": 4,
}
total_weight = sum(weights.values())
earned = 0.0
attr_map = {a.name: a for a in attributes}
for attr_name, weight in weights.items():
attr = attr_map.get(attr_name)
if attr and attr.present:
earned += weight
return round((earned / total_weight) * 100, 1) if total_weight > 0 else 0.0
# ------------------------------------------------------------------
# Wikipedia check
# ------------------------------------------------------------------
async def check_wikipedia(
self,
entity_name: str,
language: str = "en",
session: aiohttp.ClientSession | None = None,
) -> WikiPresence:
"""Check Wikipedia article existence for entity."""
wiki = WikiPresence(platform="wikipedia", language=language)
title = entity_name.replace(" ", "_")
url = self.WIKIPEDIA_API_URL.format(lang=language, title=quote(title))
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
async with session.get(url, headers=self.HEADERS, timeout=aiohttp.ClientTimeout(total=15)) as resp:
if resp.status == 200:
data = await resp.json()
wiki.present = data.get("type") != "disambiguation"
wiki.url = data.get("content_urls", {}).get("desktop", {}).get("page", "")
if not wiki.url:
wiki.url = f"https://{language}.wikipedia.org/wiki/{quote(title)}"
logger.info("Wikipedia article found for '%s' (%s)", entity_name, language)
elif resp.status == 404:
wiki.present = False
logger.info("No Wikipedia article for '%s' (%s)", entity_name, language)
else:
logger.warning("Wikipedia API returned status %d", resp.status)
except Exception as exc:
logger.error("Wikipedia check failed: %s", exc)
finally:
if own_session:
await session.close()
return wiki
# ------------------------------------------------------------------
# Wikidata check
# ------------------------------------------------------------------
async def check_wikidata(
self,
entity_name: str,
session: aiohttp.ClientSession | None = None,
) -> WikiPresence:
"""Check Wikidata QID existence for entity."""
wiki = WikiPresence(platform="wikidata")
params = {
"action": "wbsearchentities",
"search": entity_name,
"language": "en",
"format": "json",
"limit": 5,
}
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
async with session.get(
self.WIKIDATA_API_URL, params=params, headers=self.HEADERS,
timeout=aiohttp.ClientTimeout(total=15),
) as resp:
if resp.status == 200:
data = await resp.json()
results = data.get("search", [])
if results:
top = results[0]
wiki.present = True
wiki.qid = top.get("id", "")
wiki.url = top.get("concepturi", f"https://www.wikidata.org/wiki/{wiki.qid}")
logger.info("Wikidata entity found: %s (%s)", wiki.qid, entity_name)
else:
wiki.present = False
logger.info("No Wikidata entity for '%s'", entity_name)
else:
logger.warning("Wikidata API returned status %d", resp.status)
except Exception as exc:
logger.error("Wikidata check failed: %s", exc)
finally:
if own_session:
await session.close()
return wiki
# ------------------------------------------------------------------
# Naver encyclopedia
# ------------------------------------------------------------------
async def check_naver_encyclopedia(
self,
entity_name: str,
session: aiohttp.ClientSession | None = None,
) -> dict[str, Any]:
"""Check Naver encyclopedia (네이버 백과사전) presence."""
result = {"present": False, "url": None}
params = {"query": entity_name, "searchType": 0}
headers = {
**self.HEADERS,
"Accept-Language": "ko-KR,ko;q=0.9",
}
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
async with session.get(
self.NAVER_ENCYCLOPEDIA_URL, params=params, headers=headers,
timeout=aiohttp.ClientTimeout(total=15),
) as resp:
if resp.status == 200:
html = await resp.text()
soup = BeautifulSoup(html, "lxml")
# Look for search result entries
entries = soup.select("ul.content_list li, div.search_result a, a.title")
if entries:
result["present"] = True
first_link = entries[0].find("a")
if first_link and first_link.get("href"):
href = first_link["href"]
if not href.startswith("http"):
href = urljoin("https://terms.naver.com", href)
result["url"] = href
else:
result["url"] = f"https://terms.naver.com/search.naver?query={quote(entity_name)}"
logger.info("Naver encyclopedia entry found for '%s'", entity_name)
else:
# Fallback: check page text for result indicators
text = soup.get_text()
if entity_name in text and "검색결과가 없습니다" not in text:
result["present"] = True
result["url"] = f"https://terms.naver.com/search.naver?query={quote(entity_name)}"
else:
logger.warning("Naver encyclopedia returned status %d", resp.status)
except Exception as exc:
logger.error("Naver encyclopedia check failed: %s", exc)
finally:
if own_session:
await session.close()
return result
# ------------------------------------------------------------------
# Naver knowledge iN
# ------------------------------------------------------------------
async def check_naver_knowledge_in(
self,
entity_name: str,
session: aiohttp.ClientSession | None = None,
) -> dict[str, Any]:
"""Check Naver knowledge iN (지식iN) entries."""
result = {"present": False, "count": 0, "url": None}
params = {"query": entity_name}
headers = {
**self.HEADERS,
"Accept-Language": "ko-KR,ko;q=0.9",
}
own_session = session is None
if own_session:
session = aiohttp.ClientSession()
try:
async with session.get(
self.NAVER_KIN_URL, params=params, headers=headers,
timeout=aiohttp.ClientTimeout(total=15),
) as resp:
if resp.status == 200:
html = await resp.text()
soup = BeautifulSoup(html, "lxml")
# Extract total result count
count_el = soup.select_one("span.number, em.total_count, span.result_count")
count = 0
if count_el:
count_text = count_el.get_text(strip=True).replace(",", "")
count_match = re.search(r"(\d+)", count_text)
if count_match:
count = int(count_match.group(1))
# Also check for list items
entries = soup.select("ul.basic1 li, ul._list li, div.search_list li")
if count > 0 or entries:
result["present"] = True
result["count"] = count if count > 0 else len(entries)
result["url"] = f"https://kin.naver.com/search/list.naver?query={quote(entity_name)}"
logger.info("Naver 지식iN: %d entries for '%s'", result["count"], entity_name)
else:
logger.info("No Naver 지식iN entries for '%s'", entity_name)
else:
logger.warning("Naver 지식iN returned status %d", resp.status)
except Exception as exc:
logger.error("Naver 지식iN check failed: %s", exc)
finally:
if own_session:
await session.close()
return result
# ------------------------------------------------------------------
# Recommendations
# ------------------------------------------------------------------
def generate_recommendations(self, result: KnowledgeGraphResult) -> list[str]:
"""Generate actionable recommendations based on analysis."""
recs: list[str] = []
kp = result.knowledge_panel
if not kp.detected:
recs.append(
"No Knowledge Panel detected. To get the entity registered with Google, "
"consider creating a Wikipedia article, adding a Wikidata item, and implementing structured data (Organization schema)."
)
elif kp.completeness_score < 50:
recs.append(
f"Knowledge Panel completeness is low at {kp.completeness_score}%. "
"Fill in the missing attributes (social profiles, description, logo, etc.)."
)
if not result.wikipedia.present:
recs.append(
"No Wikipedia article found. Consider creating one after securing "
"coverage in reliable sources."
)
if not result.wikidata.present:
recs.append(
"No Wikidata item found. Register the entity on Wikidata to "
"strengthen Knowledge Graph recognition."
)
if not result.naver.encyclopedia_present:
recs.append(
"Not listed in the Naver encyclopedia. Consider pursuing a listing "
"for Korean-market SEO."
)
if result.naver.knowledge_in_count < 5:
recs.append(
"Few related entries on Naver 지식iN. Build brand entity awareness "
"through Q&A content."
)
# Check social profile completeness
attr_map = {a.name: a for a in kp.attributes}
missing_social = []
for soc in ["social_twitter", "social_facebook", "social_linkedin", "social_youtube"]:
attr = attr_map.get(soc)
if not attr or not attr.present:
missing_social.append(soc.replace("social_", "").title())
if missing_social:
recs.append(
f"Missing social profile links: {', '.join(missing_social)}. "
"Add the profiles to the sameAs property of the website's schema markup."
)
if not recs:
recs.append("Knowledge Graph entity status looks healthy. Maintain the current level.")
return recs
# ------------------------------------------------------------------
# Main orchestrator
# ------------------------------------------------------------------
async def analyze(
self,
entity_name: str,
language: str = "en",
include_wiki: bool = True,
include_naver: bool = True,
) -> KnowledgeGraphResult:
"""Orchestrate full Knowledge Graph analysis."""
result = KnowledgeGraphResult(entity=entity_name, language=language)
logger.info("Starting Knowledge Graph analysis for '%s' (lang=%s)", entity_name, language)
async with aiohttp.ClientSession() as session:
# Step 1: Search entity on Google
search_data = await self.search_entity(entity_name, language, session)
# Step 2: Detect Knowledge Panel
kp = self.detect_knowledge_panel(search_data)
# Step 3: Extract attributes
if kp.detected:
kp.attributes = self.extract_attributes(kp, search_data.get("html", ""))
kp.completeness_score = self.score_completeness(kp.attributes)
# Detect entity type from attributes
for attr in kp.attributes:
if attr.name == "type" and attr.present:
kp.entity_type = attr.value
break
result.knowledge_panel = kp
# Step 4: Wikipedia and Wikidata checks (parallel)
if include_wiki:
wiki_task = self.check_wikipedia(entity_name, language, session)
wikidata_task = self.check_wikidata(entity_name, session)
result.wikipedia, result.wikidata = await asyncio.gather(wiki_task, wikidata_task)
# Step 5: Naver checks (parallel)
if include_naver:
enc_task = self.check_naver_encyclopedia(entity_name, session)
kin_task = self.check_naver_knowledge_in(entity_name, session)
enc_result, kin_result = await asyncio.gather(enc_task, kin_task)
result.naver = NaverPresence(
encyclopedia_present=enc_result.get("present", False),
encyclopedia_url=enc_result.get("url"),
knowledge_in_present=kin_result.get("present", False),
knowledge_in_count=kin_result.get("count", 0),
knowledge_in_url=kin_result.get("url"),
)
# Step 6: Compute overall score
scores = []
if kp.detected:
scores.append(kp.completeness_score * 0.35)
else:
scores.append(0)
scores.append(20.0 if result.wikipedia.present else 0)
scores.append(15.0 if result.wikidata.present else 0)
scores.append(15.0 if result.naver.encyclopedia_present else 0)
scores.append(15.0 if result.naver.knowledge_in_present else 0)
result.overall_score = round(sum(scores), 1)
# Step 7: Recommendations
result.recommendations = self.generate_recommendations(result)
logger.info("Analysis complete. Overall score: %.1f", result.overall_score)
return result
# ---------------------------------------------------------------------------
# CLI display helpers
# ---------------------------------------------------------------------------
def display_result(result: KnowledgeGraphResult) -> None:
"""Display analysis result in a rich table."""
console.print()
console.print(f"[bold cyan]Knowledge Graph Analysis: {result.entity}[/bold cyan]")
console.print(f"Language: {result.language} | Score: {result.overall_score}/100")
console.print()
# Knowledge Panel table
kp = result.knowledge_panel
table = Table(title="Knowledge Panel", show_header=True)
table.add_column("Property", style="bold")
table.add_column("Value")
table.add_column("Status")
table.add_row("Detected", str(kp.detected), "[green]OK[/]" if kp.detected else "[red]Missing[/]")
table.add_row("Entity Type", kp.entity_type or "-", "[green]OK[/]" if kp.entity_type else "[yellow]Unknown[/]")
table.add_row("Completeness", f"{kp.completeness_score}%", "[green]OK[/]" if kp.completeness_score >= 50 else "[red]Low[/]")
for attr in kp.attributes:
status = "[green]Present[/]" if attr.present else "[red]Missing[/]"
table.add_row(f" {attr.name}", attr.value or "-", status)
console.print(table)
console.print()
# Platform presence table
plat_table = Table(title="Platform Presence", show_header=True)
plat_table.add_column("Platform", style="bold")
plat_table.add_column("Present")
plat_table.add_column("Details")
plat_table.add_row(
"Wikipedia",
"[green]Yes[/]" if result.wikipedia.present else "[red]No[/]",
result.wikipedia.url or "-",
)
plat_table.add_row(
"Wikidata",
"[green]Yes[/]" if result.wikidata.present else "[red]No[/]",
result.wikidata.qid or "-",
)
plat_table.add_row(
"Naver Encyclopedia",
"[green]Yes[/]" if result.naver.encyclopedia_present else "[red]No[/]",
result.naver.encyclopedia_url or "-",
)
plat_table.add_row(
"Naver 지식iN",
"[green]Yes[/]" if result.naver.knowledge_in_present else "[red]No[/]",
f"{result.naver.knowledge_in_count} entries" if result.naver.knowledge_in_present else "-",
)
console.print(plat_table)
console.print()
# Recommendations
console.print("[bold yellow]Recommendations:[/bold yellow]")
for i, rec in enumerate(result.recommendations, 1):
console.print(f" {i}. {rec}")
console.print()
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Knowledge Graph & Entity Analyzer",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--entity", required=True, help="Entity name to analyze")
parser.add_argument("--language", default="en", choices=["en", "ko", "ja", "zh"], help="Language (default: en)")
parser.add_argument("--wiki", action="store_true", default=True, help="Include Wikipedia/Wikidata check (default: True)")
parser.add_argument("--no-wiki", action="store_true", help="Skip Wikipedia/Wikidata check")
parser.add_argument("--no-naver", action="store_true", help="Skip Naver checks")
parser.add_argument("--json", action="store_true", help="Output as JSON")
parser.add_argument("--output", type=str, help="Output file path")
return parser.parse_args()
async def main() -> None:
args = parse_args()
analyzer = KnowledgeGraphAnalyzer()
result = await analyzer.analyze(
entity_name=args.entity,
language=args.language,
include_wiki=not args.no_wiki,
include_naver=not args.no_naver,
)
if args.json:
output = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(output)
console.print(f"[green]Output saved to {args.output}[/green]")
else:
print(output)
else:
display_result(result)
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
json.dump(result.to_dict(), f, ensure_ascii=False, indent=2)
console.print(f"[green]Output saved to {args.output}[/green]")
if __name__ == "__main__":
asyncio.run(main())
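
The weights in `score_completeness` above sum to exactly 100, so the score is simply the total weight of the attributes detected as present. A standalone sketch of the same arithmetic (illustrative only, mirroring the weight table in the script):

```python
# Attribute weights copied from score_completeness; they sum to 100.
WEIGHTS = {
    "name": 10, "type": 8, "description": 10, "logo": 8, "website": 10,
    "founded": 5, "ceo": 5, "headquarters": 5, "parent_organization": 3,
    "subsidiaries": 3, "social_twitter": 4, "social_facebook": 4,
    "social_linkedin": 4, "social_youtube": 3, "social_instagram": 3,
    "stock_ticker": 3, "industry": 5, "employees": 3, "revenue": 4,
}

def score(present: set[str]) -> float:
    # Earned weight over total weight, expressed as a percentage.
    total = sum(WEIGHTS.values())
    earned = sum(w for name, w in WEIGHTS.items() if name in present)
    return round(earned / total * 100, 1)

assert sum(WEIGHTS.values()) == 100
print(score({"name", "type", "description", "logo", "website"}))  # 46.0
```

Note that `parent_organization` and `subsidiaries` carry weight but have no extraction pattern in `extract_attributes`, so those 6 points are currently unreachable.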

@@ -0,0 +1,9 @@
# 28-seo-knowledge-graph dependencies
requests>=2.31.0
aiohttp>=3.9.0
beautifulsoup4>=4.12.0
lxml>=5.1.0
tenacity>=8.2.0
tqdm>=4.66.0
python-dotenv>=1.0.0
rich>=13.7.0
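
For reference, the overall score computed in `analyze()` blends the Knowledge Panel completeness (weighted at 35%) with fixed bonuses for platform presence: Wikipedia 20, Wikidata 15, Naver encyclopedia 15, Naver 지식iN 15. A standalone sketch of that aggregation (illustrative, not imported from the script):

```python
def overall_score(
    kp_detected: bool,
    kp_completeness: float,
    wikipedia: bool,
    wikidata: bool,
    naver_encyclopedia: bool,
    naver_kin: bool,
) -> float:
    # Knowledge Panel contributes up to 35 points, scaled by completeness (0-100).
    score = kp_completeness * 0.35 if kp_detected else 0.0
    # Fixed platform bonuses: 20 + 15 + 15 + 15 = 65 points.
    score += 20.0 if wikipedia else 0.0
    score += 15.0 if wikidata else 0.0
    score += 15.0 if naver_encyclopedia else 0.0
    score += 15.0 if naver_kin else 0.0
    return round(score, 1)

print(overall_score(True, 46.0, True, True, False, False))  # 51.1
```

An entity with no detected Knowledge Panel can therefore score at most 65, which keeps the recommendation thresholds meaningful even for brands with strong platform presence.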