Under the hood · Pipeline

How Brandioz scores your site

Every analysis runs the same deterministic 19-step pipeline — from URL validation to AI narrative. No black boxes.

Stages and step counts: Input (4) · Extraction (3) · Analysis (5) · Scoring (3) · Output (4)
01
Input

Strips the URL to the homepage root (removes deep paths, UTM params, query strings), then runs a live DNS lookup to confirm the domain resolves before any work begins. Returns HTTP 422 immediately if the domain doesn't exist.

input · DNS + normalisation
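The normalisation and DNS check described in this step can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual code: the function names `normalise_to_root` and `domain_resolves` are hypothetical, and defaulting bare domains to https is an assumption.

```python
import socket
from urllib.parse import urlsplit


def normalise_to_root(url: str) -> str:
    """Reduce any URL to its homepage root (scheme + host only)."""
    url = url.strip()
    if "://" not in url:
        url = "https://" + url  # assumption: bare domains default to https
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host:
        raise ValueError(f"no hostname in {url!r}")
    # Path, query string (including UTM params), and fragment are dropped.
    return f"{parts.scheme}://{host}/"


def domain_resolves(host: str) -> bool:
    """Live DNS lookup. False means the name does not resolve."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

A caller would map a `False` result from `domain_resolves` to the HTTP 422 response before any fetching starts.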

Fetches raw HTML via the rendering service and hands it to BeautifulSoup. Two parsers are available: lxml (fast, used by default) and html.parser (used for Next.js and Framer, which produce attribute-heavy HTML that lxml can mangle). If the rendered page has fewer than 30 meaningful words, the entire run is flagged as a render failure and downstream stages are skipped.

input · get_rendered_html · BeautifulSoup
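The parser choice and the 30-word render check could look roughly like the sketch below. The framework keys and the definition of a "meaningful" word (an alphabetic token of three or more letters) are assumptions made for illustration; the source only states the threshold and which frameworks use html.parser.

```python
import re

# Frameworks the text says need html.parser; the key spellings are hypothetical.
HTML_PARSER_FRAMEWORKS = {"nextjs", "framer"}
MIN_MEANINGFUL_WORDS = 30  # threshold stated in the text


def choose_parser(framework_name: str) -> str:
    """Return the BeautifulSoup parser name for a detected framework."""
    if framework_name.lower() in HTML_PARSER_FRAMEWORKS:
        return "html.parser"
    return "lxml"


def meaningful_word_count(text: str) -> int:
    # Assumption: a meaningful word is an alphabetic token of 3+ letters.
    return len(re.findall(r"[A-Za-z]{3,}", text))


def is_render_failure(rendered_text: str) -> bool:
    """True when the rendered page is too thin to analyse."""
    return meaningful_word_count(rendered_text) < MIN_MEANINGFUL_WORDS
```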

Single-pass fingerprinting that reads raw HTML for known framework markers. Detects Next.js (self.__next_f), Framer (data-framer-hydrate-v2), Webflow (data-wf-site), Shopify (cdn.shopify.com), React SPA (empty #root), Angular (ng-version), Nuxt (__nuxt), Wix, Squarespace, and Ghost. Returns a structured dict: name, category (ssr_framework / static_builder / csr_spa / ecommerce / cms / none), csr_risk level, and behaviour hints for downstream stages. This is the single source of truth — nothing downstream re-detects the framework.

input · _detect_framework · ingestion.py v7
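A single-pass fingerprint of this kind can be sketched as a marker table. The markers come from the description above; the category and csr_risk values assigned to each framework are guesses for illustration, and Wix, Squarespace, and Ghost are omitted because their markers are not stated.

```python
import re

# (marker, name, category, csr_risk). Category and risk values are assumptions.
FRAMEWORK_MARKERS = [
    ("self.__next_f", "nextjs", "ssr_framework", "low"),
    ("data-framer-hydrate-v2", "framer", "static_builder", "low"),
    ("data-wf-site", "webflow", "static_builder", "low"),
    ("cdn.shopify.com", "shopify", "ecommerce", "low"),
    ("ng-version", "angular", "csr_spa", "high"),
    ("__nuxt", "nuxt", "ssr_framework", "low"),
]

# React SPA heuristic: a root container with nothing rendered inside it.
EMPTY_ROOT_RE = re.compile(r'<div\s+id=["\']root["\']\s*>\s*</div>')


def detect_framework(raw_html: str) -> dict:
    """Return one structured result that downstream stages can reuse."""
    for marker, name, category, csr_risk in FRAMEWORK_MARKERS:
        if marker in raw_html:
            return {"name": name, "category": category, "csr_risk": csr_risk}
    if EMPTY_ROOT_RE.search(raw_html):
        return {"name": "react_spa", "category": "csr_spa", "csr_risk": "high"}
    return {"name": "none", "category": "none", "csr_risk": "none"}
```

Returning one dict and passing it along is what lets later stages avoid re-detecting the framework.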

Only runs when Next.js is detected. Parses self.__next_f.push flight data payloads from the raw HTML, extracts quoted strings of 20–500 chars that contain spaces and real letters (skipping className, href, SVG data, and __next internals), then injects the extracted text into a hidden <div> before the closing </body>. This makes SSR content readable to BeautifulSoup even when the <main> tag is nearly empty.

input · _extract_nextjs_text · _inject_nextjs_content
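The flight-data extraction can be approximated as follows. This is a simplified sketch under stated assumptions: it only handles `push([1,"..."])` payloads, the skip list is a plain substring check, and the injected wrapper markup (`data-injected`) is hypothetical.

```python
import html
import json
import re

# Matches self.__next_f.push([1,"<escaped payload>"]) and captures the JSON string.
PUSH_RE = re.compile(r'self\.__next_f\.push\(\[\s*1\s*,\s*("(?:[^"\\]|\\.)*")\s*\]\)')
# Quoted strings of 20 to 500 characters inside the decoded payload.
QUOTED_RE = re.compile(r'"((?:[^"\\]|\\.){20,500})"')
SKIP_MARKERS = ("className", "href", "<svg", "__next")


def extract_nextjs_text(raw_html: str) -> list:
    texts = []
    for match in PUSH_RE.finditer(raw_html):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        for candidate in QUOTED_RE.findall(payload):
            # Keep only strings with spaces and real letters.
            if " " not in candidate or not re.search(r"[A-Za-z]{3,}", candidate):
                continue
            if any(marker in candidate for marker in SKIP_MARKERS):
                continue
            if candidate not in texts:
                texts.append(candidate)
    return texts


def inject_nextjs_content(raw_html: str, texts: list) -> str:
    if not texts:
        return raw_html
    paragraphs = "".join(f"<p>{html.escape(t)}</p>" for t in texts)
    block = f'<div hidden data-injected="nextjs-flight">{paragraphs}</div>'
    idx = raw_html.lower().rfind("</body>")
    if idx == -1:
        return raw_html + block
    return raw_html[:idx] + block + raw_html[idx:]


SAMPLE = (
    r'<html><body><main></main><script>self.__next_f.push([1,"4:[\"$\",\"p\",null,'
    r'{\"className\":\"text-lg\",\"children\":\"Brandioz measures how clearly AI models read your homepage\"}]"])'
    r'</script></body></html>'
)
```

Running `inject_nextjs_content(SAMPLE, extract_nextjs_text(SAMPLE))` places the recovered sentence in a hidden block just before `</body>`, where a normal HTML parser can read it.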
02
Extraction

Extracts exactly what an AI crawler sees. Runs _clean_dom first (removes scripts, styles, cookie banners, nav, footers, hidden elements) — skipped for Next.js (skip_clean=True) to preserve injected flight content. Extracts: title (separator-split logic + H1 fallback), meta description, OG tags (with og_confidence 0–1), H1 (filtered by _is_valid_h1 to reject nav chrome like 'Copy' or 'Menu'), hero section (named selector → H1 context → BFS candidates), headings, paragraphs (p + li + div, deduplicated), full_text assembly, and structured data (FAQPage, Organization, WebPage, SoftwareApplication schemas from JSON-LD). Framer sites use a completely separate extraction path via _extract_framer_content, reading RichTextContainer components from the desktop SSR variant only. Outputs extraction_quality: high / medium / low / failed.

extraction · extract_ai_view · ai_view.py v6
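The H1 filter mentioned above can be illustrated with a small sketch. Only 'Copy' and 'Menu' come from the description; the other entries in the chrome list, and the rule that an H1 must contain at least one letter, are assumptions.

```python
# 'copy' and 'menu' are named in the text; the remaining entries are assumed.
NAV_CHROME = {"copy", "menu", "close", "search", "skip to content"}


def is_valid_h1(text: str) -> bool:
    """Reject empty strings, UI chrome, and strings with no letters."""
    cleaned = " ".join(text.split())
    if not cleaned or cleaned.lower() in NAV_CHROME:
        return False
    return any(ch.isalpha() for ch in cleaned)


def pick_h1(candidates: list):
    """Return the first candidate that survives the filter, else None."""
    for candidate in candidates:
        if is_valid_h1(candidate):
            return " ".join(candidate.split())
    return None
```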

Runs on a clean re-parse of raw_html (separate from the ai_view soup, which was mutated by _clean_dom). Computes all 15 crawl signals used in the crawl score: schema location check (head vs body vs JS-injected), CSR severity detection (full / partial / none per framework), semantic tag ratio, H1 count, meta description length, canonical URL, hreflang count, and content quality signals (stat_count via regex, faq_question_count from H2/H3 headings + schema + <details> elements, has_author_signal, has_freshness_signal via <time> tag, table_count, alt text coverage). Also injects _stat_count, _faq_question_count, _has_author_signal, _has_freshness_signal, and _table_count into ai_view with underscore prefix so the AI scorer can read them directly.

extraction · extract_geo_signals · geo_signals.py v8
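Two of the content quality signals, stat_count and faq_question_count, lend themselves to a short sketch. The unit list in the regex, the question-word prefixes, and the plain addition of heading, schema, and `<details>` counts are all assumptions; the real implementation may use different units or deduplicate across sources.

```python
import re

# A number followed by a unit. The unit list is illustrative only.
STAT_RE = re.compile(
    r"\b\d+(?:[.,]\d+)*\s?(?:%|x\b|k\b|ms\b|hours?\b|days?\b|users\b|customers\b)",
    re.IGNORECASE,
)
QUESTION_STARTS = ("how ", "what ", "why ", "when ", "who ", "can ", "does ", "is ")


def stat_count(text: str) -> int:
    """Count quantified claims such as '99.9%' or '12,000 customers'."""
    return len(STAT_RE.findall(text))


def faq_question_count(headings, schema_questions=0, details_count=0) -> int:
    """Question-style H2/H3 headings plus schema and <details> counts."""
    from_headings = sum(
        1
        for h in headings
        if h.strip().endswith("?") or h.strip().lower().startswith(QUESTION_STARTS)
    )
    return from_headings + schema_questions + details_count
```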

Async HTTP checks run in parallel with the brand presence check (asyncio.gather). Hits five endpoints: /llms.txt (checks for an H1 and a blockquote for spec validity), /llms-full.txt, /robots.txt (parsed for 30 AI crawler directives across training/search/commoncrawl categories, including GPTBot, ClaudeBot, and PerplexityBot), /sitemap.xml (with a fallback to the Sitemap: directive in robots.txt, plus sitemap index detection), and the Common Crawl index API (fetches the latest CC-MAIN index dynamically, cached per process). Results are merged back into geo_signals via patch_crawl_score_with_discoverability, which recomputes the crawl score with real network data and rebuilds the issues/passing lists.

extraction · async · fetch_llm_discoverability · async network
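Once the files are fetched, the llms.txt validity check and the robots.txt crawler check can be done offline. The sketch below uses the standard library's `urllib.robotparser`; the helper names are hypothetical, and only three of the 30 crawlers are listed.

```python
from urllib.robotparser import RobotFileParser

# Three of the AI crawlers named in the text; the full list has 30.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def llms_txt_is_valid(body: str) -> bool:
    """Minimal spec check: the file needs an H1 line and a blockquote line."""
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    has_h1 = any(ln.startswith("# ") for ln in lines)
    has_blockquote = any(ln.startswith(">") for ln in lines)
    return has_h1 and has_blockquote


def crawler_access(robots_txt: str, crawlers=AI_CRAWLERS) -> dict:
    """Map each crawler name to whether it may fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, "/") for bot in crawlers}
```

For example, a robots.txt that disallows only GPTBot yields `False` for GPTBot and `True` for the other two crawlers.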
03
Analysis

Primary path: Groq (llama-3.3-70b-versatile, temperature=0, json_object mode, 5s timeout) classifies the site into one of 21 intent categories from title + H1 + meta description + hero text. Returns intent, confidence (0–1), and a one-sentence reasoning string. Fallback: if GROQ_API_KEY is missing, Groq times out, returns an unknown intent, or fails JSON parsing, the regex-based infer_site_intent() runs instead and tags the result with detection_method=regex_fallback. The intent label drives category mapping in the next stage.

analysis · LLM call · infer_site_intent_llm · Groq llama-3.3-70b
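The primary-then-fallback flow can be sketched without a live Groq call by injecting the LLM as a callable. Everything here is illustrative: `classify_intent`, the three-item intent set (standing in for the 21 categories), the regex rules, and the fixed fallback confidence are assumptions.

```python
import json
import re

KNOWN_INTENTS = {"saas", "ecommerce", "blog"}  # stand-in for the 21 categories


def infer_site_intent_regex(text: str) -> str:
    """Toy regex fallback; the real rules are not described in the text."""
    if re.search(r"\b(cart|checkout|shop)\b", text, re.IGNORECASE):
        return "ecommerce"
    if re.search(r"\b(blog|article)\b", text, re.IGNORECASE):
        return "blog"
    return "unknown"


def classify_intent(text: str, llm_call=None) -> dict:
    """Try the LLM first; fall back on error, bad JSON, or unknown intent."""
    if llm_call is not None:
        try:
            data = json.loads(llm_call(text))
            if data.get("intent") in KNOWN_INTENTS:
                return {
                    "intent": data["intent"],
                    "confidence": float(data.get("confidence", 0.0)),
                    "reasoning": data.get("reasoning", ""),
                    "detection_method": "llm",
                }
        except (TimeoutError, ValueError, TypeError, AttributeError):
            pass  # fall through to the regex path
    return {
        "intent": infer_site_intent_regex(text),
        "confidence": 0.5,  # assumed placeholder value
        "reasoning": "",
        "detection_method": "regex_fallback",
    }
```

Passing `llm_call=None` models the missing GROQ_API_KEY case, and a callable that raises `TimeoutError` models the 5s timeout.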

Async function that builds what AI 'believes' about the site. It first enriches ai_view by prepending structured_data descriptions to meta_description and hero_section, so the site's own schema feeds extraction. It then runs five steps in order. (1) Detects the category via CATEGORY_PATTERNS regex, falling back to INTENT_TO_CATEGORY if confidence < 0.60 or the impossible-combo guard fires. (2) Qualifies generic categories with domain keywords from structured_data (e.g. 'online education platform' + 'cricket' → 'online cricket coaching platform'). (3) Extracts the audience via AUDIENCE_PATTERNS and validates it against CATEGORY_AUDIENCE_ALLOWLIST, removing implausible matches and falling back to CATEGORY_DEFAULT_AUDIENCE. (4) Calls Groq for capabilities (2–6 specific capability strings, 5s timeout), falling back to regex CAPABILITY_PATTERNS if Groq fails. (5) Filters capabilities via _filter_capabilities, which removes city names, the brand name itself, nav noise, and single short words. Outputs confidence: high / medium / low with a scored evidence checklist.

analysis · LLM call · build_heuristic_belief · belief.py v5.1
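The capability filter at the end of this stage can be sketched as a simple rule chain. The city and nav-noise lists and the "shorter than six characters" rule for single words are assumptions chosen for the example.

```python
# Both lists are illustrative; the real ones are not given in the text.
CITY_NAMES = {"london", "mumbai", "new york"}
NAV_NOISE = {"home", "pricing", "login", "contact", "about us"}


def filter_capabilities(capabilities: list, brand_name: str) -> list:
    """Drop city names, the brand itself, nav noise, and single short words."""
    kept = []
    seen = set()
    for cap in capabilities:
        cleaned = " ".join(cap.split())
        low = cleaned.lower()
        if low in CITY_NAMES or low in NAV_NOISE:
            continue
        if low == brand_name.lower():
            continue
        if " " not in cleaned and len(cleaned) < 6:  # assumed length cutoff
            continue
        if low not in seen:
            seen.add(low)
            kept.append(cleaned)
    return kept
```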

Measures how fast AI clarity builds as content is read. With bfs_result=None (the current path), derives everything from ai_view structural signals directly: title word count, meta char count, hero word count, H1 word count, H2 count, paragraph count, FAQ count (from faq_items array + headings). Computes three depth scores — immediate (title + meta + hero + VP composite), after_scroll (adds heading density bonus), after_exploration (adds volume score + FAQ boost up to 15pts). Classifies curve_shape as fast_clear / slow_clear / partial / thin based on immediate and exploration levels. Slope = exploration minus immediate score.

analysis · compute_understanding_curve · ai_view path
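The three depth scores and the curve classification can be illustrated as below. Only the 15-point FAQ cap and the slope definition come from the description; the per-item FAQ boost of 5 and the 70/40 classification thresholds are invented for the example.

```python
def clamp(value, lo=0, hi=100):
    return max(lo, min(hi, value))


def understanding_curve(immediate, heading_bonus, volume_score, faq_count):
    """Derive the three depth scores, the curve shape, and the slope."""
    after_scroll = clamp(immediate + heading_bonus)
    faq_boost = min(15, 5 * faq_count)  # cap from the text; 5 per item is assumed
    after_exploration = clamp(after_scroll + volume_score + faq_boost)

    # Thresholds below are hypothetical.
    if immediate >= 70:
        shape = "fast_clear"
    elif after_exploration >= 70:
        shape = "slow_clear"
    elif after_exploration >= 40:
        shape = "partial"
    else:
        shape = "thin"

    return {
        "immediate": immediate,
        "after_scroll": after_scroll,
        "after_exploration": after_exploration,
        "curve_shape": shape,
        "slope": after_exploration - immediate,
    }
```

A page scoring 40 immediately that reaches 80 after exploration would be classed as slow_clear with a slope of 40 under these assumed thresholds.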

analyze_clarity is now a thin structural extractor only — all word-set matching, TF-IDF, and meaning-inference were removed in v4. It pulls raw counts from ai_view: word_count, paragraph_word_count, title_word_count, meta_desc_char_count, hero_word_count, h2_count (from structured headings), heading_count_ratio (h2 / paragraph count), h1_text, meta_description, og_confidence. compute_metrics computes semantic_density and high_info_ratio from full_text. value_prop_confidence is computed inline in analyze.py as a clamped ratio (word_count / 500, boosted by hero and headings presence) and injected into signals for insights compatibility.

analysis · compute_metrics · analyze_clarity · clarity.py v4
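The value_prop_confidence calculation is described as a clamped ratio with boosts, which might look like the following. The boost sizes (0.1 each) are assumptions; only the word_count / 500 ratio and the clamping are stated.

```python
def value_prop_confidence(word_count: int, has_hero: bool, has_headings: bool) -> float:
    """Clamped word_count / 500 ratio, boosted by hero and heading presence."""
    base = word_count / 500
    # Assumed boost sizes: 0.1 for a hero section, 0.1 for headings.
    boost = (0.1 if has_hero else 0.0) + (0.1 if has_headings else 0.0)
    return max(0.0, min(1.0, base + boost))
```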

Runs in parallel with fetch_llm_discoverability via asyncio.gather. Extracts brand name via priority chain: og:site_name → domain base (with _strip_domain_prefix for 'use'/'get'/'try' prefixes, e.g. 'usevelo' → 'Velo') → og:title separator → page title separator → first capitalised word. Checks DuckDuckGo HTML API for domain presence and assigns presence_tier: high (top results, 3+ hits) / medium (lower position) / low (not found) / unknown (timeout). In v8, presence_tier no longer affects the AI score — coherence and brand floors were removed. It feeds generate_verdict, analyze_ai_failure_modes, and the training_data_signal response field only.

analysis · async · compute_brand_presence · DuckDuckGo · async
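The first two links of the brand-name priority chain (og:site_name, then the domain base with prefix stripping) can be sketched as follows. The guard that keeps at least three characters after stripping is an assumption added so that short names are not mangled; the later fallbacks in the chain are omitted.

```python
DOMAIN_PREFIXES = ("use", "get", "try")  # prefixes named in the text


def strip_domain_prefix(domain_base: str) -> str:
    """'usevelo' -> 'Velo'. Leaves the name alone if too little would remain."""
    low = domain_base.lower()
    for prefix in DOMAIN_PREFIXES:
        # Assumed guard: keep at least 3 characters after the prefix.
        if low.startswith(prefix) and len(low) - len(prefix) >= 3:
            low = low[len(prefix):]
            break
    return low.capitalize()


def extract_brand_name(og_site_name, domain: str) -> str:
    """og:site_name wins; otherwise derive the name from the domain base."""
    if og_site_name and og_site_name.strip():
        return og_site_name.strip()
    host = domain.lower()
    if host.startswith("www."):
        host = host[4:]
    return strip_domain_prefix(host.split(".")[0])
```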
04
Scoring

Pure weighted sum of 12 content signals (weights sum to 100). Signals and weights:
title_words (10): full credit at 6+ words
meta_desc_words (10): full credit at 12+ words
hero_word_count (12): full credit at 40+ words
paragraph_density (12): para_wc / total_wc, full credit at a ratio of 0.60+
heading_count_ratio (8): H2s per 100 words, sweet spot 1–4, read directly from h2_count in signals
h1_meta_alignment (10): character trigram cosine similarity between H1 and meta description
stat_count_score (8): quantified claims (number + unit regex), full credit at 5+
faq_heading_count (8): question-format headings + schema FAQ items, full credit at 3+
author_present (4): binary E-E-A-T signal
freshness_present (4): binary <time> or article date signal
table_count_score (4): structured tables, full credit at 2+
og_completeness (10): og_confidence passthrough
The only post-sum step is the extraction quality hard cap (high=100, medium=78, low=55, failed=25, unknown=88). No page-type multipliers. No brand floors. No cross-signal penalties.

scoring · calculate_score_by_mode · dispatch.py v9
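The weighted sum and hard cap can be sketched directly from the weights listed above. The function names, the linear `ramp` helper for "full credit at N" signals, and the treatment of unrecognised quality labels as `unknown` are assumptions; the weights and cap values are taken from the description.

```python
import math
from collections import Counter

WEIGHTS = {
    "title_words": 10, "meta_desc_words": 10, "hero_word_count": 12,
    "paragraph_density": 12, "heading_count_ratio": 8, "h1_meta_alignment": 10,
    "stat_count_score": 8, "faq_heading_count": 8, "author_present": 4,
    "freshness_present": 4, "table_count_score": 4, "og_completeness": 10,
}
QUALITY_CAP = {"high": 100, "medium": 78, "low": 55, "failed": 25, "unknown": 88}


def ramp(value: float, full_at: float) -> float:
    """Assumed linear credit: 0 at zero, 1.0 once value reaches full_at."""
    return max(0.0, min(1.0, value / full_at))


def trigram_cosine(a: str, b: str) -> float:
    """Cosine similarity between character-trigram count vectors."""
    def grams(s: str) -> Counter:
        s = " ".join(s.lower().split())
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    ga, gb = grams(a), grams(b)
    dot = sum(count * gb[g] for g, count in ga.items())
    norm = math.sqrt(sum(v * v for v in ga.values())) * math.sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0


def ai_score(credits: dict, extraction_quality: str) -> float:
    """credits maps each signal to a 0..1 value; missing signals count as 0."""
    raw = sum(weight * credits.get(name, 0.0) for name, weight in WEIGHTS.items())
    cap = QUALITY_CAP.get(extraction_quality, QUALITY_CAP["unknown"])
    return min(raw, cap)
```

With every signal at full credit, a page whose extraction quality is medium still tops out at 78.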

Separate from the AI content score. Also a pure weighted sum, here of 15 structural/technical signals (weights sum to 100):
schema_in_head (12): 0.3 partial credit for body-only schema
has_faq_schema (8)
has_org_schema (5)
csr_score (12): none=1.0 / partial=0.5 / full=0.0
semantic_tag_ratio (8): full credit at ≥8%
h1_present (5): 0.5 for multiple H1s
meta_desc_length (6): full credit at 80+ chars
crawler_access (12): from robots.txt parsing
sitemap_present (6)
llms_txt_present (8): 0.5 if present but fails spec validation
common_crawl_indexed (5)
hreflang_present (3)
freshness_signal (4)
canonical_present (4)
main_word_count_score (2): full credit at 100+ words
Initially computed without network data, then recomputed after fetch_llm_discoverability resolves via patch_crawl_score_with_discoverability. This is the left-panel score shown in the dashboard.

scoring · patch_crawl_score_with_discoverability · 15 signals
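The two-phase computation (an initial score from on-page signals, then a recompute once network checks resolve) can be sketched as below. The weights are from the description; which four signals count as network-derived, the function names, and the rounding to one decimal are assumptions.

```python
CRAWL_WEIGHTS = {
    "schema_in_head": 12, "has_faq_schema": 8, "has_org_schema": 5,
    "csr_score": 12, "semantic_tag_ratio": 8, "h1_present": 5,
    "meta_desc_length": 6, "crawler_access": 12, "sitemap_present": 6,
    "llms_txt_present": 8, "common_crawl_indexed": 5, "hreflang_present": 3,
    "freshness_signal": 4, "canonical_present": 4, "main_word_count_score": 2,
}
# Assumed split: these four depend on the async network checks.
NETWORK_SIGNALS = {"crawler_access", "sitemap_present", "llms_txt_present", "common_crawl_indexed"}


def crawl_score(credits: dict) -> float:
    """Weighted sum of 0..1 credits; missing signals count as 0."""
    total = sum(weight * credits.get(name, 0.0) for name, weight in CRAWL_WEIGHTS.items())
    return round(total, 1)


def patch_with_discoverability(credits: dict, network: dict):
    """Merge network-derived credits and return (merged credits, new score)."""
    merged = dict(credits)
    merged.update({k: v for k, v in network.items() if k in NETWORK_SIGNALS})
    return merged, crawl_score(merged)
```

The initial score treats the network signals as absent, so the patched score can only stay the same or rise.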

Projects the full feature vector into four named dimensions using fixed linear combinations (no trained PCA model): understanding_depth (word count + high_info_ratio + curve scores), signal_imbalance (title + meta + hero word counts against targets), density_vs_description (semantic_density + paragraph density + og_confidence), value_prop_speed (hero word count + heading ratio + immediate curve score). The minimum-value dimension is flagged as dominant_weakness and used by pca_based_recommendations to surface the highest-leverage fix.

scoring · _compute_latent_dimensions · PCA-style
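The projection into four dimensions and the selection of the weakest one can be illustrated with equal-weight averages. The real coefficients are not given, so the equal weighting, the input key names, and the assumption that every input is already normalised to 0..1 are all hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def compute_latent_dimensions(f: dict) -> dict:
    """Fixed linear combinations of normalised features (equal weights assumed)."""
    dims = {
        "understanding_depth": _mean([f["word_count_score"], f["high_info_ratio"], f["curve_score"]]),
        "signal_imbalance": _mean([f["title_score"], f["meta_score"], f["hero_score"]]),
        "density_vs_description": _mean([f["semantic_density"], f["paragraph_density"], f["og_confidence"]]),
        "value_prop_speed": _mean([f["hero_score"], f["heading_ratio_score"], f["immediate_score"]]),
    }
    # The lowest-valued dimension is the one recommendations should target.
    dominant_weakness = min(dims, key=dims.get)
    return {"dimensions": dims, "dominant_weakness": dominant_weakness}
```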
05
Output

Compares against a static corpus of 91 analysed sites, segmented by category and page_type. If the category is too niche for a meaningful static comparison (available=False), falls back to generate_dynamic_benchmarks: a Groq call that returns a peer set with real scores and a positioning statement. Geo signals and presence_tier influence which benchmarks are most relevant.

output · LLM call · generate_benchmarks · dynamic fallback · Groq

Generates targeted fixes from heuristic rules keyed to score_breakdown components, then overlays pca_based_recommendations for the dominant_weakness dimension. A single authoritative schema recommendation is built by _build_schema_rec (reads geo_signals.schema for ground truth — not ai_view guesses) and appended after stripping all other schema rec IDs from the base set to prevent duplicates. suppress_recommendations then filters by page_type and belief context. reword_recs_for_context adjusts copy for CSR sites. Every recommendation shown is specific to the actual signals found.

output · generate_recommendations · suppress · pca_overrides
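The deduplication of schema recommendations can be sketched in a few lines. The `schema_` ID prefix and the dict shape of a recommendation are assumptions; the description only says that other schema rec IDs are stripped before the authoritative one is appended.

```python
def merge_schema_rec(base_recs: list, schema_rec: dict) -> list:
    """Strip existing schema recs, then append the single authoritative one."""
    # Assumed convention: schema recommendation IDs start with 'schema_'.
    kept = [rec for rec in base_recs if not rec["id"].startswith("schema_")]
    return kept + [schema_rec]
```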

Two-endpoint flow. POST /suggest: Groq (llama-3.3-70b, json_object, ~1s) finds 3 direct competitors matching category + intent + capabilities — returns name, domain, relevance_reason only. POST /find-competitors: calls /suggest first, then runs each competitor URL through /analyze concurrently (45s timeout each). Extracts scores, signal_breakdown strengths, and capabilities. Runs /analyze on your own site in parallel for gap analysis. Generates category_insights comparing your score against the competitor average. Available as a separate endpoint for ongoing monitoring.

output · LLM call · async · /competitors/suggest · /find-competitors · Groq
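The concurrent analysis with a per-site timeout can be sketched with `asyncio.gather` and `asyncio.wait_for`. The `analyze` coroutine is injected so the example runs without a network; its return shape, the error dict, and the averaging logic are assumptions.

```python
import asyncio

ANALYZE_TIMEOUT_S = 45  # per-site timeout stated in the text


async def analyze_with_timeout(analyze, url, timeout=ANALYZE_TIMEOUT_S):
    """Run one analysis; a timeout yields an error record instead of raising."""
    try:
        return await asyncio.wait_for(analyze(url), timeout)
    except asyncio.TimeoutError:
        return {"url": url, "error": "timeout"}


async def find_competitors(own_url, competitor_urls, analyze, timeout=ANALYZE_TIMEOUT_S):
    """Analyse your own site and all competitors concurrently."""
    urls = [own_url, *competitor_urls]
    results = await asyncio.gather(*(analyze_with_timeout(analyze, u, timeout) for u in urls))
    own, competitors = results[0], list(results[1:])
    scores = [c["score"] for c in competitors if "score" in c]
    average = sum(scores) / len(scores) if scores else None
    return {"own": own, "competitors": competitors, "competitor_average": average}
```

Because each task handles its own timeout, one slow competitor does not discard the results for the others.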

The final step — runs after the full response dict is assembled. Groq receives the complete result (score, crawl_score, geo_signals, presence_tier, page_type, understanding_curve, recommendations, competitor context) and writes a plain-English summary: what's working, what isn't, and why — personalised to the site's actual signals. Wrapped in try/except so a Groq failure returns null narrative without breaking the response.

output · LLM call · generate_ai_narrative · Groq
