Every analysis runs the same deterministic 19-step pipeline — from URL validation to AI narrative. No black boxes.
Strips the URL to the homepage root (removes deep paths, UTM params, query strings), then runs a live DNS lookup to confirm the domain resolves before any work begins. Returns HTTP 422 immediately if the domain doesn't exist.
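A minimal sketch of the normalisation and DNS check described above, using only the Python standard library. The function names `to_homepage_root` and `domain_resolves` are hypothetical, and defaulting bare domains to HTTPS and dropping the port are assumptions; the real stage may differ.

```python
import socket
from urllib.parse import urlsplit


def to_homepage_root(url: str) -> str:
    """Reduce any URL to scheme + host, dropping path, query string and fragment."""
    if "://" not in url:
        url = "https://" + url  # assumption: bare domains default to HTTPS
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"no host in URL: {url!r}")
    # .hostname is already lower-cased and excludes any port (assumption: port is dropped)
    return f"{parts.scheme}://{parts.hostname}/"


def domain_resolves(host: str) -> bool:
    """Live DNS lookup. False means the caller should answer with HTTP 422."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

A caller would run `to_homepage_root` first, then reject the request with HTTP 422 when `domain_resolves` returns False for the extracted host.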
Fetches raw HTML via the rendering service and hands it to BeautifulSoup. Two parsers are available: lxml (fast, used by default) and html.parser (used for Next.js and Framer, which produce attribute-heavy HTML that lxml can mangle). If the rendered page has fewer than 30 meaningful words, the entire run is flagged as a render failure and downstream stages are skipped.
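The parser choice and the 30-word guard could look roughly like this. The framework identifiers and the definition of a "meaningful" word (an alphabetic token of three or more letters) are assumptions made for illustration; only the parser names and the threshold come from the description above.

```python
import re

# Frameworks routed to html.parser, per the description above.
# The identifier strings themselves are hypothetical.
HTML_PARSER_FRAMEWORKS = {"nextjs", "framer"}
MIN_MEANINGFUL_WORDS = 30


def pick_parser(framework_name: str) -> str:
    """Return the BeautifulSoup parser name to use for a detected framework."""
    return "html.parser" if framework_name in HTML_PARSER_FRAMEWORKS else "lxml"


def is_render_failure(visible_text: str) -> bool:
    """Flag the run when the rendered page has fewer than 30 meaningful words."""
    # Assumption: "meaningful" means an alphabetic token of 3+ letters.
    words = re.findall(r"[A-Za-z]{3,}", visible_text)
    return len(words) < MIN_MEANINGFUL_WORDS
```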
Single-pass fingerprinting that reads raw HTML for known framework markers. Detects Next.js (self.__next_f), Framer (data-framer-hydrate-v2), Webflow (data-wf-site), Shopify (cdn.shopify.com), React SPA (empty #root), Angular (ng-version), Nuxt (__nuxt), Wix, Squarespace, and Ghost. Returns a structured dict: name, category (ssr_framework / static_builder / csr_spa / ecommerce / cms / none), csr_risk level, and behaviour hints for downstream stages. This is the single source of truth — nothing downstream re-detects the framework.
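As an illustration of single-pass marker matching, the sketch below checks the substring markers named above. The category and csr_risk values assigned to each framework are assumptions, and the React SPA, Wix, Squarespace, and Ghost checks are omitted because their exact markers are not given here.

```python
# (marker, name, category, csr_risk)
# Markers come from the description; category and csr_risk values are illustrative guesses.
MARKERS = [
    ("self.__next_f", "nextjs", "ssr_framework", "low"),
    ("data-framer-hydrate-v2", "framer", "static_builder", "low"),
    ("data-wf-site", "webflow", "static_builder", "low"),
    ("cdn.shopify.com", "shopify", "ecommerce", "low"),
    ("ng-version", "angular", "csr_spa", "high"),
    ("__nuxt", "nuxt", "ssr_framework", "low"),
]


def detect_framework(raw_html: str) -> dict:
    """Return the first matching framework as a structured dict."""
    for marker, name, category, csr_risk in MARKERS:
        if marker in raw_html:
            return {"name": name, "category": category, "csr_risk": csr_risk}
    return {"name": "none", "category": "none", "csr_risk": "none"}
```

Downstream stages would receive this dict rather than re-inspecting the HTML, which is what keeps detection a single source of truth.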
Only runs when Next.js is detected. Parses self.__next_f.push flight data payloads from the raw HTML, extracts quoted strings of 20–500 chars that contain spaces and real letters (skipping className, href, SVG data, and __next internals), then injects the extracted text into a hidden <div> before the closing </body>. This makes SSR content readable to BeautifulSoup even when the <main> tag is nearly empty.
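A rough sketch of the flight-data extraction, assuming each payload is the escaped string argument of a `self.__next_f.push([n,"..."])` call. The regular expressions, the skip list, and the use of the `hidden` attribute are simplifications chosen for illustration, not the actual implementation.

```python
import html
import re

# Captures the escaped string argument of each push call.
PUSH_RE = re.compile(r'self\.__next_f\.push\(\[\d+,"((?:[^"\\]|\\.)*)"\]\)')
# Quoted strings of 20-500 characters with no embedded quotes or backslashes.
QUOTED_RE = re.compile(r'"([^"\\]{20,500})"')
SKIP_MARKERS = ("className", "href", "__next", "<svg")  # assumed skip list


def extract_flight_text(raw_html: str) -> list[str]:
    """Collect human-readable strings from Next.js flight payloads."""
    found = []
    for payload in PUSH_RE.findall(raw_html):
        decoded = payload.replace('\\"', '"')  # unescape inner quotes only
        for candidate in QUOTED_RE.findall(decoded):
            has_space = " " in candidate
            has_letters = re.search(r"[A-Za-z]{3}", candidate) is not None
            if has_space and has_letters and not any(m in candidate for m in SKIP_MARKERS):
                found.append(candidate)
    return found


def inject_hidden_div(raw_html: str, strings: list[str]) -> str:
    """Insert the extracted text in a hidden div just before the closing body tag."""
    block = "<div hidden>" + " ".join(html.escape(s) for s in strings) + "</div>"
    idx = raw_html.lower().rfind("</body>")
    if idx == -1:
        return raw_html + block
    return raw_html[:idx] + block + raw_html[idx:]
```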
Extracts exactly what an AI crawler sees. Runs _clean_dom first (removes scripts, styles, cookie banners, nav, footers, hidden elements) — skipped for Next.js (skip_clean=True) to preserve injected flight content. Extracts: title (separator-split logic + H1 fallback), meta description, OG tags (with og_confidence 0–1), H1 (filtered by _is_valid_h1 to reject nav chrome like 'Copy' or 'Menu'), hero section (named selector → H1 context → BFS candidates), headings, paragraphs (p + li + div, deduplicated), full_text assembly, and structured data (FAQPage, Organization, WebPage, SoftwareApplication schemas from JSON-LD). Framer sites use a completely separate extraction path via _extract_framer_content, reading RichTextContainer components from the desktop SSR variant only. Outputs extraction_quality: high / medium / low / failed.
Runs on a clean re-parse of raw_html (separate from the ai_view soup, which was mutated by _clean_dom). Computes all 15 crawl signals used in the crawl score: schema location check (head vs body vs JS-injected), CSR severity detection (full / partial / none per framework), semantic tag ratio, H1 count, meta description length, canonical URL, hreflang count, and content quality signals (stat_count via regex, faq_question_count from H2/H3 headings + schema + <details> elements, has_author_signal, has_freshness_signal via <time> tag, table_count, alt text coverage). Also injects _stat_count, _faq_question_count, _has_author_signal, _has_freshness_signal, and _table_count into ai_view with underscore prefix so the AI scorer can read them directly.
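Three of these signals (semantic tag ratio, H1 count, and the `<time>` freshness check) can be sketched with the standard-library HTML parser. The set of tags counted as semantic is an assumption, and the real stage works on a BeautifulSoup re-parse rather than `HTMLParser`.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "semantic" is not specified in the description.
SEMANTIC_TAGS = {"main", "article", "section", "nav", "header", "footer", "aside", "figure", "time"}


class TagCounter(HTMLParser):
    """Counts opening tags in one pass over the raw HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.total = 0
        self.semantic = 0
        self.h1_count = 0
        self.has_time_tag = False

    def handle_starttag(self, tag, attrs):
        self.total += 1
        if tag in SEMANTIC_TAGS:
            self.semantic += 1
        if tag == "h1":
            self.h1_count += 1
        if tag == "time":
            self.has_time_tag = True


def structural_signals(raw_html: str) -> dict:
    counter = TagCounter()
    counter.feed(raw_html)
    counter.close()
    ratio = counter.semantic / counter.total if counter.total else 0.0
    return {
        "semantic_tag_ratio": ratio,
        "h1_count": counter.h1_count,
        "has_freshness_signal": counter.has_time_tag,
    }
```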
Async HTTP checks run in parallel with brand presence (asyncio.gather). Hits five endpoints: /llms.txt (checks H1 + blockquote for spec validity), /llms-full.txt, /robots.txt (parsed for 30 AI crawler directives across training/search/commoncrawl categories, covering GPTBot, ClaudeBot, PerplexityBot, etc.), /sitemap.xml (falling back to the Sitemap: directive in robots.txt, with sitemap index detection), and the Common Crawl index API (fetches the latest CC-MAIN index dynamically, cached per process). Results are merged back into geo_signals via patch_crawl_score_with_discoverability, which recomputes the crawl score with real network data and rebuilds the issues/passing lists.
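The robots.txt and llms.txt checks can be approximated offline once the files are fetched. This sketch uses the standard library's `urllib.robotparser`; the three user agents are a small subset of the 30 mentioned above, and `llms_txt_valid` is a hypothetical helper that only mirrors the H1 + blockquote check as described.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]  # subset, for illustration


def crawler_access(robots_txt: str, agents: list[str] = AI_CRAWLERS) -> dict[str, bool]:
    """Report whether each AI crawler may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/") for agent in agents}


def llms_txt_valid(text: str) -> bool:
    """Assumed validity rule: first non-empty line is an H1, followed by a blockquote."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    has_h1 = bool(lines) and lines[0].startswith("# ")
    has_blockquote = any(line.startswith(">") for line in lines[1:])
    return has_h1 and has_blockquote
```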
Primary path: Groq (llama-3.3-70b-versatile, temperature=0, json_object mode, 5s timeout) classifies the site into one of 21 intent categories from title + H1 + meta description + hero text. Returns intent, confidence (0–1), and a one-sentence reasoning string. Fallback: if GROQ_API_KEY is missing, Groq times out, returns an unknown intent, or fails JSON parsing, the regex-based infer_site_intent() runs instead and tags the result with detection_method=regex_fallback. The intent label drives category mapping in the next stage.
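The primary/fallback flow can be expressed without any LLM client by passing both classifiers in as callables. Everything here is a sketch: `classify_intent`, its parameters, and the `"llm"` label for the primary path are hypothetical; only `regex_fallback` is named in the description.

```python
from typing import Callable


def classify_intent(
    signals: dict,
    llm_classify: Callable[[dict], dict],
    regex_classify: Callable[[dict], str],
    known_intents: set[str],
) -> dict:
    """Try the LLM first; fall back to regex on any failure or unknown intent."""
    try:
        # May raise on a missing API key, a timeout, or unparseable JSON.
        result = llm_classify(signals)
        if result.get("intent") in known_intents:
            return {**result, "detection_method": "llm"}  # label is an assumption
    except Exception:
        pass  # fall through to the regex path
    return {
        "intent": regex_classify(signals),
        "confidence": None,
        "reasoning": None,
        "detection_method": "regex_fallback",
    }
```

Treating an unknown intent the same as an exception keeps the fallback in one place, so every failure mode listed above produces the same tagged result.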
Async function that builds what AI 'believes' about the site. It first enriches ai_view by prepending structured_data descriptions to meta_description and hero_section, so the site's own schema feeds extraction. Category: detected via CATEGORY_PATTERNS regex, falling back to INTENT_TO_CATEGORY if confidence < 0.60 or the impossible-combo guard fires. Generic categories are then qualified with domain keywords from structured_data (e.g. 'online education platform' + 'cricket' → 'online cricket coaching platform'). Audience: extracted via AUDIENCE_PATTERNS and validated against CATEGORY_AUDIENCE_ALLOWLIST, which removes implausible matches and falls back to CATEGORY_DEFAULT_AUDIENCE. Capabilities: a Groq call returns 2–6 specific capability strings (5s timeout), with regex CAPABILITY_PATTERNS as the fallback if Groq fails; the result is filtered by _filter_capabilities, which removes city names, the brand name itself, nav noise, and single short words. Outputs confidence: high / medium / low with a scored evidence checklist.
Measures how fast AI clarity builds as content is read. With bfs_result=None (the current path), derives everything from ai_view structural signals directly: title word count, meta char count, hero word count, H1 word count, H2 count, paragraph count, FAQ count (from faq_items array + headings). Computes three depth scores — immediate (title + meta + hero + VP composite), after_scroll (adds heading density bonus), after_exploration (adds volume score + FAQ boost up to 15pts). Classifies curve_shape as fast_clear / slow_clear / partial / thin based on immediate and exploration levels. Slope = exploration minus immediate score.
analyze_clarity is now a thin structural extractor only — all word-set matching, TF-IDF, and meaning-inference were removed in v4. It pulls raw counts from ai_view: word_count, paragraph_word_count, title_word_count, meta_desc_char_count, hero_word_count, h2_count (from structured headings), heading_count_ratio (h2 / paragraph count), h1_text, meta_description, og_confidence. compute_metrics computes semantic_density and high_info_ratio from full_text. value_prop_confidence is computed inline in analyze.py as a clamped ratio (word_count / 500, boosted by hero and headings presence) and injected into signals for insights compatibility.
Runs in parallel with fetch_llm_discoverability via asyncio.gather. Extracts brand name via priority chain: og:site_name → domain base (with _strip_domain_prefix for 'use'/'get'/'try' prefixes, e.g. 'usevelo' → 'Velo') → og:title separator → page title separator → first capitalised word. Checks DuckDuckGo HTML API for domain presence and assigns presence_tier: high (top results, 3+ hits) / medium (lower position) / low (not found) / unknown (timeout). In v8, presence_tier no longer affects the AI score — coherence and brand floors were removed. It feeds generate_verdict, analyze_ai_failure_modes, and the training_data_signal response field only.
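The first two links of the brand-name priority chain might look like the sketch below. `strip_prefix` is a hypothetical stand-in for `_strip_domain_prefix`, and the rule that at least three characters must remain after stripping is an assumption added to avoid mangling short names.

```python
from typing import Optional
from urllib.parse import urlsplit

PREFIXES = ("use", "get", "try")


def strip_prefix(base: str) -> str:
    """Drop a leading 'use'/'get'/'try' when enough of the name remains."""
    for prefix in PREFIXES:
        # Assumption: keep at least 3 characters so names like "getty" survive.
        if base.startswith(prefix) and len(base) - len(prefix) >= 3:
            return base[len(prefix):]
    return base


def brand_name(og_site_name: Optional[str], url: str) -> str:
    """og:site_name wins; otherwise derive the brand from the domain base."""
    if og_site_name and og_site_name.strip():
        return og_site_name.strip()
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    base = host.split(".")[0]
    return strip_prefix(base).capitalize()
```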
Pure weighted sum of 12 content signals (weights sum to 100). Signals and weights: title_words (10) — full credit at 6+ words; meta_desc_words (10) — full credit at 12+ words; hero_word_count (12) — full credit at 40+ words; paragraph_density (12) — para_wc / total_wc, full credit at 0.60+ ratio; heading_count_ratio (8) — H2s per 100 words, sweet spot 1–4, reads from h2_count directly from signals; h1_meta_alignment (10) — character trigram cosine similarity between H1 and meta description; stat_count_score (8) — quantified claims (number + unit regex), full credit at 5+; faq_heading_count (8) — question-format headings + schema FAQ items, full credit at 3+; author_present (4) — binary E-E-A-T signal; freshness_present (4) — binary <time> or article date signal; table_count_score (4) — structured tables, full credit at 2+; og_completeness (10) — og_confidence passthrough. Only post-sum step: extraction quality hard cap (high=100, medium=78, low=55, failed=25, unknown=88). No page-type multipliers. No brand floors. No cross-signal penalties.
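The scoring rule above reduces to a few lines. In this sketch the weights and caps are copied from the description, while the function names, the assumption that each signal arrives already normalised to 0–1, and the whitespace handling in the trigram helper are illustrative choices.

```python
import math
from collections import Counter

WEIGHTS = {
    "title_words": 10, "meta_desc_words": 10, "hero_word_count": 12,
    "paragraph_density": 12, "heading_count_ratio": 8, "h1_meta_alignment": 10,
    "stat_count_score": 8, "faq_heading_count": 8, "author_present": 4,
    "freshness_present": 4, "table_count_score": 4, "og_completeness": 10,
}
QUALITY_CAP = {"high": 100, "medium": 78, "low": 55, "failed": 25, "unknown": 88}


def content_score(signals: dict[str, float], extraction_quality: str) -> int:
    """Weighted sum of signals (each assumed to be 0-1), then the quality cap."""
    raw = sum(w * min(1.0, max(0.0, signals.get(name, 0.0))) for name, w in WEIGHTS.items())
    cap = QUALITY_CAP.get(extraction_quality, QUALITY_CAP["unknown"])
    return min(round(raw), cap)


def trigram_cosine(a: str, b: str) -> float:
    """Cosine similarity between character-trigram count vectors."""
    def grams(text: str) -> Counter:
        text = " ".join(text.lower().split())  # assumption: normalise case and spacing
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    ga, gb = grams(a), grams(b)
    dot = sum(ga[g] * gb[g] for g in ga.keys() & gb.keys())
    norm = math.sqrt(sum(v * v for v in ga.values())) * math.sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0
```

Under these assumptions a page with every signal at full credit but medium extraction quality scores 78, because the cap is applied after the sum.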
Separate from the AI content score. Also a pure weighted sum of 15 structural/technical signals (weights sum to 100): schema_in_head (12, 0.3 partial credit for body-only), has_faq_schema (8), has_org_schema (5), csr_score (12, none=1.0/partial=0.5/full=0.0), semantic_tag_ratio (8, full at ≥8%), h1_present (5, 0.5 for multiple H1s), meta_desc_length (6, full at 80+ chars), crawler_access (12, from robots.txt parsing), sitemap_present (6), llms_txt_present (8, 0.5 if present but fails spec validation), common_crawl_indexed (5), hreflang_present (3), freshness_signal (4), canonical_present (4), main_word_count_score (2, full at 100+ words). Initially computed without network data, then recomputed after fetch_llm_discoverability resolves via patch_crawl_score_with_discoverability. This is the left-panel score shown in the dashboard.
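The partial-credit rules for four of these signals can be written as small lookup functions. This is a sketch: the helper names are hypothetical, and giving JS-injected schema zero credit is an assumption, since the description only states the head and body-only values.

```python
CSR_CREDIT = {"none": 1.0, "partial": 0.5, "full": 0.0}


def schema_credit(location: str) -> float:
    """Full credit in <head>, 0.3 for body-only. Anything else gets 0 (assumption)."""
    return {"head": 1.0, "body": 0.3}.get(location, 0.0)


def h1_credit(h1_count: int) -> float:
    """Exactly one H1 earns full credit; multiple H1s earn half."""
    if h1_count == 1:
        return 1.0
    return 0.5 if h1_count > 1 else 0.0


def llms_txt_credit(present: bool, passes_spec: bool) -> float:
    """Half credit when the file exists but fails spec validation."""
    if not present:
        return 0.0
    return 1.0 if passes_spec else 0.5


def partial_crawl_points(schema_location: str, csr_severity: str, h1_count: int,
                         llms_present: bool, llms_valid: bool) -> float:
    """Points contributed by these four signals only (weights 12, 12, 5, 8)."""
    return (12 * schema_credit(schema_location)
            + 12 * CSR_CREDIT[csr_severity]
            + 5 * h1_credit(h1_count)
            + 8 * llms_txt_credit(llms_present, llms_valid))
```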
Projects the full feature vector into four named dimensions using fixed linear combinations (no trained PCA model): understanding_depth (word count + high_info_ratio + curve scores), signal_imbalance (title + meta + hero word counts against targets), density_vs_description (semantic_density + paragraph density + og_confidence), value_prop_speed (hero word count + heading ratio + immediate curve score). The minimum-value dimension is flagged as dominant_weakness and used by pca_based_recommendations to surface the highest-leverage fix.
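A sketch of the projection and the weakest-dimension pick. The feature groupings follow the description, but the feature key names and the equal weighting within each dimension are assumptions; the actual coefficients are not given here.

```python
from statistics import fmean

# Feature keys are hypothetical names for the inputs listed in the description.
# Equal weighting within each dimension is an assumption.
DIMENSIONS = {
    "understanding_depth": ["word_count_score", "high_info_ratio", "curve_score"],
    "signal_imbalance": ["title_score", "meta_score", "hero_score"],
    "density_vs_description": ["semantic_density", "paragraph_density", "og_confidence"],
    "value_prop_speed": ["hero_score", "heading_ratio_score", "immediate_score"],
}


def project(features: dict[str, float]) -> dict[str, float]:
    """Fixed linear combination per dimension (here: a plain mean)."""
    return {
        dim: fmean([features.get(key, 0.0) for key in keys])
        for dim, keys in DIMENSIONS.items()
    }


def dominant_weakness(projection: dict[str, float]) -> str:
    """The lowest-scoring dimension is the one recommendations target first."""
    return min(projection, key=projection.get)
```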
Compares against a static corpus of 91 analysed sites, segmented by category and page_type. If the category is too niche for a meaningful static comparison (available=False), falls back to generate_dynamic_benchmarks: a Groq call that returns a peer set with real scores and a positioning statement. Geo signals and presence_tier influence which benchmarks are most relevant.
Generates targeted fixes from heuristic rules keyed to score_breakdown components, then overlays pca_based_recommendations for the dominant_weakness dimension. A single authoritative schema recommendation is built by _build_schema_rec (reads geo_signals.schema for ground truth — not ai_view guesses) and appended after stripping all other schema rec IDs from the base set to prevent duplicates. suppress_recommendations then filters by page_type and belief context. reword_recs_for_context adjusts copy for CSR sites. Every recommendation shown is specific to the actual signals found.
Two-endpoint flow. POST /suggest: Groq (llama-3.3-70b, json_object, ~1s) finds 3 direct competitors matching category + intent + capabilities — returns name, domain, relevance_reason only. POST /find-competitors: calls /suggest first, then runs each competitor URL through /analyze concurrently (45s timeout each). Extracts scores, signal_breakdown strengths, and capabilities. Runs /analyze on your own site in parallel for gap analysis. Generates category_insights comparing your score against the competitor average. Available as a separate endpoint for ongoing monitoring.
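Running the competitor analyses concurrently with a per-URL timeout could be sketched with `asyncio.gather` and `asyncio.wait_for`. The `analyze` coroutine is passed in as a parameter because the real /analyze call is not shown here; returning None for a timed-out competitor is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def analyze_all(
    urls: list[str],
    analyze: Callable[[str], Awaitable[dict]],
    timeout: float = 45.0,
) -> dict[str, Optional[dict]]:
    """Analyse every URL concurrently; a slow URL yields None instead of failing the batch."""

    async def one(url: str) -> Optional[dict]:
        try:
            return await asyncio.wait_for(analyze(url), timeout)
        except asyncio.TimeoutError:
            return None  # assumption: timed-out competitors are simply skipped

    results = await asyncio.gather(*(one(url) for url in urls))
    return dict(zip(urls, results))
```

Because each URL has its own timeout, one slow competitor cannot delay the others beyond 45 seconds.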
The final step — runs after the full response dict is assembled. Groq receives the complete result (score, crawl_score, geo_signals, presence_tier, page_type, understanding_curve, recommendations, competitor context) and writes a plain-English summary: what's working, what isn't, and why — personalised to the site's actual signals. Wrapped in try/except so a Groq failure returns null narrative without breaking the response.
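The failure isolation described above amounts to a small wrapper. In this sketch `attach_narrative` and the `generate` callable are hypothetical names; the only behaviour taken from the description is that any failure yields a null narrative while the rest of the response is returned intact.

```python
from typing import Callable


def attach_narrative(response: dict, generate: Callable[[dict], str]) -> dict:
    """Add the narrative last; never let a generation failure break the response."""
    try:
        response["narrative"] = generate(response)
    except Exception:
        response["narrative"] = None  # serialises to null in the JSON response
    return response
```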
Pipeline totals
19 total stages, 5 LLM calls, 3 async ops.
Two separate scores
Content score — weighted sum of 12 signals measuring what AI crawlers actually read.
Technical score — 15 structural signals including schema, CSR, robots.txt, and discoverability.
Why deterministic?
Same input → same output. No model drift, no prompt fragility. Every score is a reproducible weighted sum — auditable down to the individual signal.