diff --git a/gigl/analytics/README.md b/gigl/analytics/README.md
new file mode 100644
index 000000000..f4fb7062f
--- /dev/null
+++ b/gigl/analytics/README.md
@@ -0,0 +1,225 @@
+# GiGL Analytics
+
+Pre-training graph data validation and analysis tooling. Use this module before committing to a GNN training run to
+catch data quality and structural issues that silently degrade model quality.
+
+Two subpackages:
+
+- [`data_analyzer/`](data_analyzer/) — end-to-end `DataAnalyzer` that runs BigQuery checks and produces a single
+  self-contained HTML report. **Start here.**
+- [`graph_validation/`](graph_validation/) — lightweight standalone validators (currently: `BQGraphValidator` for
+  dangling-edge checks). Use when you only need one check and not the full report.
+
+## Quickstart
+
+**Prerequisites.** Follow the [GiGL installation guide](../../docs/user_guide/getting_started/installation.md) so that
+`uv` and GiGL's Python dependencies are available. Then authenticate to BigQuery:
+
+```bash
+gcloud auth application-default login
+```
+
+**1. Write a YAML config.** Save as `my_analyzer_config.yaml`:
+
+```yaml
+node_tables:
+  - bq_table: "your-project.your_dataset.user_nodes"
+    node_type: "user"
+    id_column: "user_id"
+    feature_columns: ["age", "country"]  # optional; omit to auto-infer all non-ID, TFDV-compatible columns from the BQ schema
+    # label_column: "label"              # optional; enables Tier 3 label checks
+
+edge_tables:
+  - bq_table: "your-project.your_dataset.user_edges"
+    edge_type: "follows"
+    src_id_column: "src_user_id"
+    dst_id_column: "dst_user_id"
+
+# Where to write the HTML report. Local path for quick iteration, or a gs:// URI.
+output_gcs_path: "/tmp/my_analysis/"
+
+# Optional: sizing for the neighbor-explosion estimate (fan-out per GNN layer).
+fan_out: [15, 10, 5]
+```
+
+**2. Run the analyzer.**
+
+```bash
+uv run python -m gigl.analytics.data_analyzer \
+    --analyzer_config_uri my_analyzer_config.yaml --job_name_prefix my-analysis
+```
+
+**3. Open the report.** When the run completes:
+
+```
+[INFO] Report written to /tmp/my_analysis/report.html
+```
+
+Open the file in any browser. No server, no external dependencies, fully offline.
+
+## What it checks
+
+The analyzer organizes checks into four tiers. Tiers 1 and 2 always run; Tier 3 auto-enables when your config supports
+it; Tier 4 is opt-in.
+
+| Tier                         | When                                                                                 | What it checks                                                                                                                                                                                                                                                                         |
+| ---------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **1. Hard fails**            | Always                                                                               | Dangling edges (NULL src/dst), referential integrity (edges pointing to nodes not in the node table), duplicate nodes. Raises `DataQualityError` — the report still renders to show partial results.                                                                                   |
+| **2. Core metrics**          | Always                                                                               | Node/edge counts, degree distribution (in/out) with percentiles, degree buckets, top-K hubs, super-hub int16 clamp count, cold-start node count, self-loops, duplicate edges, NULL rates per column, feature memory budget estimate, neighbor-explosion estimate (requires `fan_out`). |
+| **3. Label + heterogeneous** | Auto when `label_column` is set on any node table, or when multiple edge types exist | Class imbalance, label coverage, edge type distribution, per-edge-type node coverage.                                                                                                                                                                                                  |
+| **4. Advanced**              | Opt-in via config flags                                                              | Power-law exponent (implemented as a degree-stats approximation). Reciprocity, homophily, connected components, clustering coefficient are **not yet implemented** — the flags are accepted but currently no-op.                                                                       |
+
+The thresholds in the "Interpreting the report" table below come from a review of production GNN papers (PinSage,
+BLADE, LiGNN, TwHIN, AliGraph, GraphSMOTE, Beyond Homophily, Feature Propagation, and others). The inline citations in
+that table note what each paper contributes.
+
+## Feature profiling
+
+In addition to the structural checks above, the analyzer runs
+[TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) on every node and edge table and embeds the
+resulting Facets HTML report in the final output.
+
+- **Auto-inference.** By default, the profiler reads the BQ table schema and profiles every non-ID column whose type is
+  TFDV-compatible — scalars `STRING`, `INT64`, `FLOAT64`, `NUMERIC`, `BIGNUMERIC`, `BOOL`. Temporal types (`DATE`,
+  `DATETIME`, `TIMESTAMP`, `TIME`) and complex types (`RECORD`, `GEOGRAPHY`, `JSON`, `BYTES`) are not supported by TFDV
+  and are skipped with an info log.
+- **Embedding columns.** `REPEATED` `FLOAT64` / `FLOAT` / `NUMERIC` / `BIGNUMERIC` columns are treated as embedding
+  vectors. Each expands in the Beam `SELECT` into four scalar hygiene companions — `<col>_len`, `<col>_has_nan`,
+  `<col>_has_inf`, `<col>_is_all_zero` — which are profiled by TFDV like any other scalar. Other REPEATED types
+  (`STRING` / `INT64` arrays, etc.) are skipped.
+- **Embedding diagnostics.** After the TFDV pipelines finish, one BigQuery aggregate per embedding column computes
+  `total`, `unique_count`, `unique_ratio`, and top-K most-frequent hash clusters (via
+  `FARM_FINGERPRINT(TO_JSON_STRING(<col>))`). Results land in `FeatureProfileResult.embedding_diagnostics` and render as
+  a dedicated "Embedding Diagnostics" section in the report.
+- **Explicit override.** Setting `feature_columns` in the YAML narrows the projection to those columns (still honoring
+  embedding expansion for REPEATED FLOAT families). Use this to scope down to a handful of columns, or to exclude PII /
+  expensive fields.
+- **Join keys are excluded.** `id_column` on nodes and `src_id_column` / `dst_id_column` on edges are always dropped
+  from the auto-inferred list. `label_column` and `timestamp_column` are kept (profiling class balance / temporal
+  sparsity is useful).
+- **Cost.** One Dataflow job is launched per table, so a config with many tables translates to many concurrent Dataflow
+  runs. During iteration, pass `--only structure` to skip the profiler entirely. Run `--only feature` (or the default
+  `both`) once the config is stable.
+
+## Machine-readable outputs
+
+Alongside `report.html`, each analyzer run writes versioned Pydantic JSON sidecars under `output_gcs_path/`:
+
+- `graph_structure.json` — the `GraphAnalysisResult` payload from `GraphStructureAnalyzer`. Written on success and also
+  on a Tier 1 `DataQualityError` (partial result) so failures are still recoverable.
+- `feature_profile.json` — the `FeatureProfileResult` payload (facets URIs, TFDV stats URIs, embedding diagnostics).
+
+Each sidecar wraps its payload in an envelope: `{schema_version, component, generated_at, data}`. Load one with
+`gigl.analytics.data_analyzer.types.load_artifact(path, expected_component=...)`. Schema changes are additive-only at
+`schema_version="1"`; breaking changes bump the version.
+
+## Interpreting the report
+
+The report color-codes every numeric finding. Summary of the most important thresholds:
+
+| Metric                                                   | Green | Yellow     | Red     | What to do when yellow/red                                                                                                                                    |
+| -------------------------------------------------------- | ----- | ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Dangling edges / referential integrity / duplicate nodes | 0     | —          | any > 0 | Fix the input tables. Training will fail or silently corrupt otherwise.                                                                                       |
+| Feature missing rate                                     | < 10% | 10–90%     | > 90%   | Plan an imputation strategy; above ~95% the Feature Propagation phase transition (Rossi et al., LoG 2022) hits and GNNs stop recovering signal reliably.      |
+| Isolated node fraction                                   | < 1%  | 1–5%       | > 5%    | Filter isolated nodes or densify (LiGNN, KDD 2024) for cold-start cohorts.                                                                                    |
+| Cold-start fraction (degree 0–1)                         | < 5%  | 5–10%      | > 10%   | Candidates for graph densification; also flag for special handling at serving time.                                                                           |
+| Super-hub int16 clamp (degree > 32,767)                  | 0     | —          | any > 0 | GiGL silently truncates super-hub degrees in `gigl/distributed/utils/degree.py`. Either cap the hub's edges upstream or plan to address the clamp.            |
+| Degree p99 / median                                      | < 50  | 50–100     | > 100   | Use importance sampling (PinSage, KDD 2018) or degree-adaptive neighborhoods (BLADE, WSDM 2023) — degree skew is the single biggest lever in production GNNs. |
+| Class imbalance ratio                                    | < 1:5 | 1:5 – 1:10 | > 1:10  | Message passing amplifies label imbalance 2–3× in representation space (GraphSMOTE, WSDM 2021). Consider resampling or GraphSMOTE-style synthetic nodes.      |
+| Edge homophily (Tier 4, future)                          | > 0.7 | 0.3 – 0.7  | < 0.3   | Standard GCN/GAT fail at low h (Zhu et al., NeurIPS 2020). Consider H2GCN-style architectures; below h ≈ 0.2 a plain MLP often wins.                          |
+
+## Advanced config
+
+Optional YAML keys beyond the minimal quickstart:
+
+```yaml
+# Enable Tier 3 class-imbalance + label-coverage checks for a node type:
+node_tables:
+  - bq_table: ...
+    label_column: "label"
+
+# Neighbor explosion estimation — the fan-out per GNN layer you plan to train with:
+fan_out: [15, 10, 5]
+
+# Tier 4 opt-in flags. Default false.
+# NOTE: Only `compute_reciprocity` is wired into the analyzer today and it logs a
+# warning rather than computing a result. The other three flags are placeholders
+# for future work (see "Scope and limitations" below).
+compute_reciprocity: true
+compute_homophily: true
+compute_connected_components: true
+compute_clustering: true
+
+# Per-edge-type timestamp hint. NOTE: accepted by the config schema but not yet
+# consumed by any Tier 4 query (temporal freshness check is planned).
+edge_tables:
+  - bq_table: ...
+    timestamp_column: "created_at"
+```
+
+## Python API
+
+The CLI wraps a regular class. Call it from your own code when you want to trigger a run programmatically:
+
+```python
+from gigl.analytics.data_analyzer import DataAnalyzer
+from gigl.analytics.data_analyzer.config import load_analyzer_config
+from gigl.env.pipelines_config import get_resource_config
+
+config = load_analyzer_config("my_analyzer_config.yaml")
+# run() returns the path to report.html (local path or gs:// URI); the three extra arguments are required.
+report_path = DataAnalyzer().run(config, get_resource_config(), job_name_prefix="my-analysis", run_timestamp="20250101-0000")
+```
+
+The underlying `GraphStructureAnalyzer` is also callable directly if you want the raw result dataclass and no HTML:
+
+```python
+from gigl.analytics.data_analyzer.graph_structure_analyzer import GraphStructureAnalyzer
+
+result = GraphStructureAnalyzer().analyze(config)
+print(result.degree_stats)
+```
+
+See a rendered report example at
+[`tests/test_assets/analytics/golden_report.html`](../../tests/test_assets/analytics/golden_report.html) to preview the
+output format before authenticating to BQ.
+
+## graph_validation
+
+One-off validators for the subset of cases where the full analyzer is overkill. Today the only check is dangling-edge
+detection:
+
+```python
+from gigl.analytics.graph_validation import BQGraphValidator
+
+has_dangling = BQGraphValidator.does_edge_table_have_dangling_edges(
+    edge_table="your-project.your_dataset.user_edges",
+    src_node_column_name="src_user_id",
+    dst_node_column_name="dst_user_id",
+)
+```
+
+The `DataAnalyzer` runs this check (and many more) as part of Tier 1, so prefer the full analyzer unless you
+specifically need a one-line gate (e.g., inside an Airflow task or a preprocessing job). This subpackage is the intended
+home for additional standalone validators in the future.
+
+## Scope and limitations
+
+Current implementation status:
+
+- **Tier 4 checks are partial.** Power-law exponent is computed as a degree-stats approximation. Reciprocity, homophily,
+  connected components, and clustering coefficient config flags are accepted but currently no-op. The `timestamp_column`
+  edge field is accepted but no temporal-freshness query runs yet.
+- **Heterogeneous graphs: referential integrity caveat.** For each edge table, the referential-integrity check joins
+  against `config.node_tables[0]`. On heterogeneous graphs where different edges reference different node types, the
+  current implementation will under-report integrity violations — fix is tracked for a follow-up.
+- **GCS upload** works via `GcsUtils.upload_from_string` when `output_gcs_path` is a `gs://` URI, and falls back to
+  local filesystem write otherwise.
+
+## Related documents
+
+Within this module:
+
+- [`data_analyzer/report/PRD.md`](data_analyzer/report/PRD.md) — product intent for the HTML report (AI-owned)
+- [`data_analyzer/report/SPEC.md`](data_analyzer/report/SPEC.md) — technical contract for the AI-owned HTML/JS/CSS
+  assets
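Note on the `fan_out` setting documented in the README above: this diff does not show how the analyzer computes its neighbor-explosion estimate, so the following is only a sketch of the standard worst-case bound, in which each hop multiplies the sampled frontier by that layer's fan-out. The function name `worst_case_frontier_sizes` is hypothetical and not part of GiGL.

```python
from itertools import accumulate
from operator import mul


def worst_case_frontier_sizes(fan_out: list[int]) -> list[int]:
    """Upper bound on the number of nodes sampled at each hop for one seed node.

    Hop k can reach at most fan_out[0] * fan_out[1] * ... * fan_out[k] nodes,
    assuming every node has at least that many neighbors and no overlap occurs.
    """
    return list(accumulate(fan_out, mul))


# The fan-out used in the quickstart config.
per_hop = worst_case_frontier_sizes([15, 10, 5])

# Total nodes touched per seed, counting the seed itself.
total_per_seed = 1 + sum(per_hop)

print(per_hop)         # [15, 150, 750]
print(total_per_seed)  # 916
```

For the quickstart's `fan_out: [15, 10, 5]`, the bound is 15, 150, and 750 sampled nodes at hops 1 to 3, or 916 nodes per seed including the seed. Real neighborhoods are smaller whenever a node has fewer neighbors than the fan-out or sampled neighborhoods overlap.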
diff --git a/gigl/analytics/data_analyzer/__init__.py b/gigl/analytics/data_analyzer/__init__.py
new file mode 100644
index 000000000..45304dacc
--- /dev/null
+++ b/gigl/analytics/data_analyzer/__init__.py
@@ -0,0 +1,10 @@
+"""
+BQ Data Analyzer for pre-training graph data analysis.
+
+Produces a single HTML report covering data quality, feature distributions,
+and graph structure metrics from BigQuery node/edge tables.
+"""
+
+from gigl.analytics.data_analyzer.data_analyzer import DataAnalyzer
+
+__all__ = ["DataAnalyzer"]
diff --git a/gigl/analytics/data_analyzer/__main__.py b/gigl/analytics/data_analyzer/__main__.py
new file mode 100644
index 000000000..693551d33
--- /dev/null
+++ b/gigl/analytics/data_analyzer/__main__.py
@@ -0,0 +1,6 @@
+"""Entry point for running the BQ Data Analyzer as a module: python -m gigl.analytics.data_analyzer."""
+
+from gigl.analytics.data_analyzer.data_analyzer import main
+
+if __name__ == "__main__":
+    main()
diff --git a/gigl/analytics/data_analyzer/config.py b/gigl/analytics/data_analyzer/config.py
new file mode 100644
index 000000000..bb3fdcdc7
--- /dev/null
+++ b/gigl/analytics/data_analyzer/config.py
@@ -0,0 +1,283 @@
+import re
+from dataclasses import dataclass, field
+from typing import Optional
+
+from omegaconf import MISSING, OmegaConf
+
+from gigl.common.logger import Logger
+
+logger = Logger()
+
+# BigQuery identifier regexes used to reject configs that would be interpolated
+# directly into SQL. See https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical
+# for the allowed grammar. Tables are of the form project.dataset.table;
+# columns are simple unquoted identifiers.
+_BQ_TABLE_REGEX = re.compile(r"^[A-Za-z0-9_.\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_$\-]+$")
+_BQ_COLUMN_REGEX = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
+
+
+def _validate_bq_table(name: str, field_label: str) -> None:
+    if not _BQ_TABLE_REGEX.fullmatch(name):
+        raise ValueError(
+            f"{field_label}={name!r} is not a valid BigQuery table reference. "
+            f"Expected project.dataset.table with no backticks, whitespace, or quotes."
+        )
+
+
+def _validate_bq_column(name: str, field_label: str) -> None:
+    if not _BQ_COLUMN_REGEX.fullmatch(name):
+        raise ValueError(
+            f"{field_label}={name!r} is not a valid BigQuery column identifier. "
+            f"Expected [A-Za-z_][A-Za-z0-9_]* with no backticks, whitespace, or quotes."
+        )
+
+
+@dataclass
+class NodeTableSpec:
+    """Specification for a node table in BigQuery.
+
+    Node-classification supervision is activated when ``label_column`` is
+    set. ``label_sentinel_values`` lets users distinguish "missing" labels
+    encoded as ``-1`` / ``"unknown"`` from SQL NULL — both are excluded
+    from the valid-label denominator used by class-imbalance and
+    homophily computations, but are reported separately so the upstream
+    bug can be traced. ``split_column`` enables split-validation checks
+    (cross-split node-id leakage as a Tier 1 hard fail, plus per-split
+    TFDV slicing for distribution drift).
+    """
+
+    bq_table: str = MISSING
+    node_type: str = MISSING
+    id_column: str = MISSING
+    feature_columns: list[str] = field(default_factory=list)
+    label_column: Optional[str] = None
+    label_sentinel_values: list[str] = field(default_factory=list)
+    split_column: Optional[str] = None
+
+
+EDGE_ROLE_MESSAGE_PASSING = "message_passing"
+EDGE_ROLE_SUPERVISION_POS = "supervision_pos"
+EDGE_ROLE_SUPERVISION_NEG = "supervision_neg"
+_VALID_EDGE_ROLES = frozenset(
+    {EDGE_ROLE_MESSAGE_PASSING, EDGE_ROLE_SUPERVISION_POS, EDGE_ROLE_SUPERVISION_NEG}
+)
+
+
+@dataclass
+class EdgeTableSpec:
+    """Specification for an edge table in BigQuery.
+
+    For heterogeneous graphs (more than one node table), src_node_type and
+    dst_node_type must be set to the node_type of the matching node table.
+    For homogeneous graphs (single node table) they default to that node_type.
+
+    ``role`` marks the table's purpose for cross-table supervision analysis.
+    Defaults to ``"message_passing"`` when omitted. ``node_anchor`` names the
+    node type (``src_node_type`` or ``dst_node_type``) whose side anchors the
+    per-anchor cross-table stats. It is required on ``supervision_pos``
+    tables and, when set, is validated for every role.
+    """
+
+    bq_table: str = MISSING
+    edge_type: str = MISSING
+    src_id_column: str = MISSING
+    dst_id_column: str = MISSING
+    src_node_type: Optional[str] = None
+    dst_node_type: Optional[str] = None
+    feature_columns: list[str] = field(default_factory=list)
+    timestamp_column: Optional[str] = None
+    role: Optional[str] = None
+    node_anchor: Optional[str] = None
+
+
+@dataclass
+class DataAnalyzerConfig:
+    """Configuration for the BQ Data Analyzer.
+
+    Parsed from YAML via OmegaConf.
+
+    Example:
+        >>> config = load_analyzer_config("my_analyzer_config.yaml")
+        >>> config.node_tables[0].bq_table
+        'project.dataset.user_nodes'
+    """
+
+    node_tables: list[NodeTableSpec] = MISSING
+    edge_tables: list[EdgeTableSpec] = MISSING
+    output_gcs_path: str = MISSING
+    fan_out: Optional[list[int]] = None
+    compute_reciprocity: bool = False
+    compute_homophily: bool = False
+    compute_connected_components: bool = False
+    compute_clustering: bool = False
+
+    # Node-classification supervision tier flags. Activate any time a
+    # NodeTableSpec.label_column is set.
+    #
+    # ``compute_per_class_feature_stats`` controls TFDV slicing on the
+    # label column inside the feature profiler — default on because it's
+    # the highest-value NC-specific feature signal and costs one extra
+    # column on the existing BQ projection.
+    #
+    # ``compute_label_informativeness`` is the expensive (full-graph
+    # mutual-information) homophily measure from Platonov et al. 2023.
+    # Default off; the cheaper sampled adjusted-homophily always runs.
+    #
+    # ``label_homophily_edge_sample_cap`` caps the message-passing edge
+    # sample used to compute adjusted homophily. ``0`` means full-graph.
+    compute_per_class_feature_stats: bool = True
+    compute_label_informativeness: bool = False
+    label_homophily_edge_sample_cap: int = 50_000_000
+
+    # Per-chunk feature cap for TFDV profiling. Wide projections explode
+    # Beam 2.56's CombinePerKey state and trip
+    # "Instruction id ... was not registered" failures on Runner v2;
+    # chunking keeps every Dataflow job within the runner's
+    # state-iteration budget. 350 was validated end-to-end on a 706-col /
+    # ~950 M-row user table.
+    max_features_per_chunk: int = 350
+
+    # Per-config Dataflow job name prefix. Combined with a per-run
+    # timestamp at the entry point to keep concurrent / repeated runs
+    # from colliding on the fixed Dataflow job name. The CLI flag
+    # ``--job_name_prefix`` overrides this when set; if neither is set
+    # the entry point fails fast.
+    job_name_prefix: Optional[str] = None
+
+
+def _validate_and_backfill(config: DataAnalyzerConfig) -> None:
+    """Run identifier validation and backfill default node-type references.
+
+    - Every bq_table must match project.dataset.table.
+    - Every id_column / src_id_column / dst_id_column / feature_column /
+      label_column / split_column / timestamp_column must be a bare BQ identifier.
+    - For homogeneous configs, an edge table with no src_node_type /
+      dst_node_type inherits the single node table's node_type.
+    - For heterogeneous configs, every edge table must declare src_node_type
+      and dst_node_type, and both must resolve to a known node_type.
+    - role defaults to message_passing; supervision_pos requires node_anchor.
+    """
+    known_node_types = {nt.node_type for nt in config.node_tables}
+    single_node_type: Optional[str] = (
+        next(iter(known_node_types)) if len(config.node_tables) == 1 else None
+    )
+
+    for node_table in config.node_tables:
+        _validate_bq_table(node_table.bq_table, "node_tables.bq_table")
+        _validate_bq_column(node_table.id_column, "node_tables.id_column")
+        for col in node_table.feature_columns:
+            _validate_bq_column(col, "node_tables.feature_columns")
+        if node_table.label_column is not None:
+            _validate_bq_column(node_table.label_column, "node_tables.label_column")
+        if node_table.split_column is not None:
+            _validate_bq_column(node_table.split_column, "node_tables.split_column")
+        # Sentinel values are not SQL identifiers (they're literal label
+        # values), but they're still embedded into SQL via parameterized
+        # IN clauses elsewhere. Reject empty strings to fail fast on
+        # likely-misconfigured YAML where a value got stripped.
+        for sentinel in node_table.label_sentinel_values:
+            if sentinel == "":
+                raise ValueError(
+                    f"node_tables.label_sentinel_values contains an empty string "
+                    f"for node_type={node_table.node_type!r}; declare each "
+                    "sentinel value explicitly (e.g. '-1', 'unknown')."
+                )
+        if node_table.label_sentinel_values and node_table.label_column is None:
+            raise ValueError(
+                f"node_type={node_table.node_type!r}: label_sentinel_values "
+                "are declared but label_column is not set; sentinels apply "
+                "to the label_column only."
+            )
+
+    for edge_table in config.edge_tables:
+        _validate_bq_table(edge_table.bq_table, "edge_tables.bq_table")
+        _validate_bq_column(edge_table.src_id_column, "edge_tables.src_id_column")
+        _validate_bq_column(edge_table.dst_id_column, "edge_tables.dst_id_column")
+        for col in edge_table.feature_columns:
+            _validate_bq_column(col, "edge_tables.feature_columns")
+        if edge_table.timestamp_column is not None:
+            _validate_bq_column(
+                edge_table.timestamp_column, "edge_tables.timestamp_column"
+            )
+
+        if edge_table.src_node_type is None:
+            if single_node_type is not None:
+                edge_table.src_node_type = single_node_type
+            else:
+                raise ValueError(
+                    f"edge_type={edge_table.edge_type}: src_node_type is required "
+                    f"when there are multiple node tables"
+                )
+        if edge_table.dst_node_type is None:
+            if single_node_type is not None:
+                edge_table.dst_node_type = single_node_type
+            else:
+                raise ValueError(
+                    f"edge_type={edge_table.edge_type}: dst_node_type is required "
+                    f"when there are multiple node tables"
+                )
+        if edge_table.src_node_type not in known_node_types:
+            raise ValueError(
+                f"edge_type={edge_table.edge_type}: src_node_type="
+                f"{edge_table.src_node_type!r} is not a declared node_type. "
+                f"Known: {sorted(known_node_types)}"
+            )
+        if edge_table.dst_node_type not in known_node_types:
+            raise ValueError(
+                f"edge_type={edge_table.edge_type}: dst_node_type="
+                f"{edge_table.dst_node_type!r} is not a declared node_type. "
+                f"Known: {sorted(known_node_types)}"
+            )
+
+        if edge_table.role is None:
+            edge_table.role = EDGE_ROLE_MESSAGE_PASSING
+        elif edge_table.role not in _VALID_EDGE_ROLES:
+            raise ValueError(
+                f"edge_type={edge_table.edge_type}: role={edge_table.role!r} "
+                f"is not valid. Expected one of {sorted(_VALID_EDGE_ROLES)}."
+            )
+
+        if edge_table.node_anchor is not None:
+            if edge_table.node_anchor not in (
+                edge_table.src_node_type,
+                edge_table.dst_node_type,
+            ):
+                raise ValueError(
+                    f"edge_type={edge_table.edge_type}: node_anchor="
+                    f"{edge_table.node_anchor!r} must equal src_node_type="
+                    f"{edge_table.src_node_type!r} or dst_node_type="
+                    f"{edge_table.dst_node_type!r}."
+                )
+        elif edge_table.role == EDGE_ROLE_SUPERVISION_POS:
+            raise ValueError(
+                f"edge_type={edge_table.edge_type}: node_anchor is required "
+                f"when role={EDGE_ROLE_SUPERVISION_POS!r}."
+            )
+
+
+def load_analyzer_config(config_path: str) -> DataAnalyzerConfig:
+    """Load and validate a DataAnalyzerConfig from a YAML file.
+
+    Args:
+        config_path: Local path to the YAML config (``OmegaConf.load`` cannot open ``gs://`` URIs).
+
+    Returns:
+        Validated DataAnalyzerConfig instance with node-type references
+        backfilled on edge tables.
+
+    Raises:
+        omegaconf.errors.MissingMandatoryValue: If required fields are missing.
+        ValueError: If any bq_table or column name is not a valid BigQuery
+            identifier, or if a heterogeneous config is missing a required
+            src_node_type / dst_node_type.
+    """
+    logger.info(f"Loading analyzer config from {config_path}")
+    raw = OmegaConf.load(config_path)
+    merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+    config: DataAnalyzerConfig = OmegaConf.to_object(merged)  # type: ignore
+    _validate_and_backfill(config)
+    logger.info(
+        f"Loaded analyzer config with {len(config.node_tables)} node tables "
+        f"and {len(config.edge_tables)} edge tables"
+    )
+    return config
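Note on the identifier validation in `config.py` above: the two regexes exist because table and column names are interpolated directly into SQL. The sketch below copies both patterns verbatim from the diff and runs a few sample names through them; the helper names `is_valid_table` and `is_valid_column` and the sample strings are illustrative only and do not exist in GiGL.

```python
import re

# Patterns copied verbatim from gigl/analytics/data_analyzer/config.py in this diff.
BQ_TABLE_REGEX = re.compile(r"^[A-Za-z0-9_.\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_$\-]+$")
BQ_COLUMN_REGEX = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_valid_table(name: str) -> bool:
    """True when the name looks like project.dataset.table with no quoting."""
    return BQ_TABLE_REGEX.fullmatch(name) is not None


def is_valid_column(name: str) -> bool:
    """True when the name is a bare, unquoted column identifier."""
    return BQ_COLUMN_REGEX.fullmatch(name) is not None


table_results = {
    name: is_valid_table(name)
    for name in [
        "your-project.your_dataset.user_edges",  # well-formed
        "your_dataset.user_edges",               # missing the project segment
        "`proj.ds.tbl`",                         # backticks are rejected
        "proj.ds.tbl; DROP TABLE x",             # whitespace and ';' are rejected
    ]
}

column_results = {
    name: is_valid_column(name)
    for name in ["user_id", "1st_col", "user id", "label`"]
}

print(table_results)
print(column_results)
```

Only the first table name and `user_id` pass. Everything that could terminate or escape an identifier (backticks, whitespace, semicolons, a leading digit in a column name) fails before any query is built.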
diff --git a/gigl/analytics/data_analyzer/data_analyzer.py b/gigl/analytics/data_analyzer/data_analyzer.py
new file mode 100644
index 000000000..0e0187c9f
--- /dev/null
+++ b/gigl/analytics/data_analyzer/data_analyzer.py
@@ -0,0 +1,260 @@
+"""Main orchestrator and CLI entry point for the BQ Data Analyzer."""
+import argparse
+import re
+from concurrent.futures import ThreadPoolExecutor
+from pathlib import Path
+from typing import Literal, Optional
+
+from gigl.analytics.data_analyzer.config import DataAnalyzerConfig, load_analyzer_config
+from gigl.analytics.data_analyzer.feature_profiler import FeatureProfiler
+from gigl.analytics.data_analyzer.graph_structure_analyzer import (
+    DataQualityError,
+    GraphStructureAnalyzer,
+)
+from gigl.analytics.data_analyzer.report.report_generator import generate_report
+from gigl.analytics.data_analyzer.types import FeatureProfileResult, GraphAnalysisResult
+from gigl.common import GcsUri, UriFactory
+from gigl.common.logger import Logger
+from gigl.common.utils.gcs import GcsUtils
+from gigl.env.pipelines_config import GiglResourceConfigWrapper, get_resource_config
+from gigl.src.common.utils.time import current_formatted_datetime
+
+logger = Logger()
+
+# Lowercase, hyphen-safe, ≤20 chars. Composes cleanly with
+# ``get_sanitized_dataflow_job_name`` and keeps the final Dataflow job
+# name (``gigl-analyzer-{prefix}-{ts}-profile-{kind}-{type}``) inside
+# Dataflow's ~63-char budget for typical type_name lengths.
+_JOB_NAME_PREFIX_REGEX = re.compile(r"^[a-z][a-z0-9-]{0,19}$")
+_RUN_TIMESTAMP_FORMAT = "%Y%m%d-%H%M"
+
+
+def _write_report(html: str, output_gcs_path: str) -> str:
+    """Write the HTML report to a GCS URI or local path.
+
+    Args:
+        html: Rendered HTML string.
+        output_gcs_path: Output directory. If it starts with ``gs://`` the
+            report is uploaded via ``GcsUtils``. Otherwise it is written to
+            the local filesystem (the directory is created if missing).
+
+    Returns:
+        The full path to the written ``report.html`` file.
+    """
+    trimmed = output_gcs_path.rstrip("/")
+    report_path = f"{trimmed}/report.html"
+    if trimmed.startswith("gs://"):
+        GcsUtils().upload_from_string(GcsUri(report_path), html)
+    else:
+        local_path = Path(report_path).expanduser().resolve()
+        local_path.parent.mkdir(parents=True, exist_ok=True)
+        local_path.write_text(html, encoding="utf-8")
+        report_path = str(local_path)
+    return report_path
+
+
+class DataAnalyzer:
+    """Orchestrates graph structure analysis, feature profiling, and report generation.
+
+    Example:
+        >>> from gigl.analytics.data_analyzer.config import load_analyzer_config
+        >>> config = load_analyzer_config("my_analyzer_config.yaml")
+        >>> resource_config = get_resource_config()
+        >>> report_path = DataAnalyzer().run(config, resource_config, "my-analysis", "20250101-0000")
+    """
+
+    def run(
+        self,
+        config: DataAnalyzerConfig,
+        resource_config: GiglResourceConfigWrapper,
+        job_name_prefix: str,
+        run_timestamp: str,
+        components: Literal["structure", "feature", "both"] = "both",
+        custom_worker_image_uri: Optional[str] = None,
+    ) -> str:
+        """Run the analysis pipeline and write an HTML report.
+
+        The report is written to ``{config.output_gcs_path}/report.html`` via
+        ``GcsUtils`` when the output path is a ``gs://`` URI, or to the local
+        filesystem otherwise (the parent directory is created if missing).
+
+        Args:
+            config: Analyzer configuration.
+            resource_config: Resource config for Dataflow sizing.
+            job_name_prefix: Prefix mixed into every per-table Dataflow job
+                name (resolved at the entry point from CLI flag or YAML).
+            run_timestamp: Per-run timestamp shared by every per-table job in
+                this invocation (computed once at the entry point).
+            components: Which components to run. ``"both"`` (default) runs the
+                structure analyzer and feature profiler concurrently.
+                ``"structure"`` runs only the graph structure analyzer.
+                ``"feature"`` runs only the feature profiler. The skipped
+                component is represented in the report by an empty result.
+            custom_worker_image_uri: Optional Docker image URI for the Dataflow
+                worker harness used by the feature profiler. When ``None``, the
+                profiler falls back to ``DEFAULT_GIGL_RELEASE_SRC_IMAGE_DATAFLOW_CPU``.
+
+        Returns:
+            The path to the written ``report.html`` (GCS URI or local path).
+        """
+        analysis_result: GraphAnalysisResult
+        profile_result: FeatureProfileResult
+
+        if components == "both":
+            with ThreadPoolExecutor(max_workers=2) as executor:
+                structure_future = executor.submit(
+                    GraphStructureAnalyzer().analyze, config
+                )
+                profile_future = executor.submit(
+                    FeatureProfiler().profile,
+                    config,
+                    resource_config,
+                    job_name_prefix,
+                    run_timestamp,
+                    custom_worker_image_uri,
+                )
+
+                try:
+                    analysis_result = structure_future.result()
+                except DataQualityError as e:
+                    logger.error(f"Tier 1 data quality failure: {e}")
+                    analysis_result = e.partial_result
+
+                try:
+                    profile_result = profile_future.result()
+                except Exception as e:
+                    logger.exception(f"Feature profiler failed: {e}")
+                    profile_result = FeatureProfileResult()
+        elif components == "structure":
+            try:
+                analysis_result = GraphStructureAnalyzer().analyze(config)
+            except DataQualityError as e:
+                logger.error(f"Tier 1 data quality failure: {e}")
+                analysis_result = e.partial_result
+            profile_result = FeatureProfileResult()
+        elif components == "feature":
+            analysis_result = GraphAnalysisResult()
+            profile_result = FeatureProfiler().profile(
+                config,
+                resource_config,
+                job_name_prefix=job_name_prefix,
+                run_timestamp=run_timestamp,
+                custom_worker_image_uri=custom_worker_image_uri,
+            )
+        else:
+            raise ValueError(
+                f"components={components!r} must be one of 'structure', 'feature', 'both'"
+            )
+
+        html = generate_report(
+            analysis_result=analysis_result,
+            profile_result=profile_result,
+        )
+
+        report_path = _write_report(html, config.output_gcs_path)
+        logger.info(f"Report written to {report_path}")
+        return report_path
+
+
+def _resolve_job_name_prefix(
+    cli_value: Optional[str], yaml_value: Optional[str]
+) -> str:
+    """Pick the effective ``job_name_prefix`` from CLI flag or YAML field.
+
+    CLI takes precedence; if both are set and differ the override is logged.
+    Raises ``ValueError`` if neither source supplies a value, or if the
+    chosen value doesn't match the lowercase / hyphen / ≤20-char shape we
+    require to keep the final Dataflow job name within Dataflow's ~63-char
+    cap.
+    """
+    if cli_value and yaml_value and cli_value != yaml_value:
+        logger.info(
+            f"--job_name_prefix={cli_value!r} overrides YAML "
+            f"job_name_prefix={yaml_value!r}."
+        )
+    effective = cli_value or yaml_value
+    if not effective:
+        raise ValueError(
+            "job_name_prefix is required: pass --job_name_prefix on the CLI "
+            "or set job_name_prefix in the analyzer YAML."
+        )
+    if not _JOB_NAME_PREFIX_REGEX.fullmatch(effective):
+        raise ValueError(
+            f"job_name_prefix={effective!r} is invalid. Expected lowercase "
+            "letters, digits, and hyphens, starting with a letter, ≤20 chars."
+        )
+    return effective
+
+
+def main() -> None:
+    """CLI entry point for the BQ Data Analyzer."""
+    parser = argparse.ArgumentParser(
+        description="BQ Data Analyzer: analyze graph data in BigQuery before GNN training"
+    )
+    parser.add_argument(
+        "--analyzer_config_uri",
+        required=True,
+        help="Path or GCS URI to the analyzer YAML config",
+    )
+    parser.add_argument(
+        "--resource_config_uri",
+        required=False,
+        help="Path or GCS URI to the resource config for Dataflow sizing",
+    )
+    parser.add_argument(
+        "--only",
+        choices=["structure", "feature", "both"],
+        default="both",
+        help=(
+            "Run only the graph structure analyzer, only the feature profiler, "
+            "or both (default: both)."
+        ),
+    )
+    parser.add_argument(
+        "--custom_worker_image_uri",
+        type=str,
+        required=False,
+        help=(
+            "Docker image URI to use for the Dataflow worker harness in the "
+            "feature profiler. When omitted, falls back to "
+            "DEFAULT_GIGL_RELEASE_SRC_IMAGE_DATAFLOW_CPU."
+        ),
+    )
+    parser.add_argument(
+        "--job_name_prefix",
+        type=str,
+        required=False,
+        help=(
+            "Prefix mixed into every per-table Dataflow job name to "
+            "disambiguate concurrent / repeat runs. Required, but may be "
+            "set in the analyzer YAML instead. CLI overrides YAML. Lowercase "
+            "letters, digits, and hyphens, starting with a letter, ≤20 chars."
+        ),
+    )
+    args = parser.parse_args()
+    resource_config = get_resource_config(
+        UriFactory.create_uri(args.resource_config_uri)
+        if args.resource_config_uri
+        else None
+    )
+    config = load_analyzer_config(args.analyzer_config_uri)
+    job_name_prefix = _resolve_job_name_prefix(
+        cli_value=args.job_name_prefix, yaml_value=config.job_name_prefix
+    )
+    run_timestamp = current_formatted_datetime(_RUN_TIMESTAMP_FORMAT)
+    logger.info(
+        f"Using job_name_prefix={job_name_prefix!r}, run_timestamp={run_timestamp!r}."
+    )
+
+    analyzer = DataAnalyzer()
+    report_path = analyzer.run(
+        config=config,
+        resource_config=resource_config,
+        job_name_prefix=job_name_prefix,
+        run_timestamp=run_timestamp,
+        components=args.only,
+        custom_worker_image_uri=args.custom_worker_image_uri,
+    )
+    logger.info(f"Report generated at: {report_path}")
+
+
+if __name__ == "__main__":
+    main()
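
The precedence and validation rules in `_resolve_job_name_prefix` can be sketched in isolation. This is a minimal illustration, not the module's code: the real `_JOB_NAME_PREFIX_REGEX` is not shown in this diff, so `ASSUMED_PREFIX_REGEX` below is an assumption derived from the documented shape (lowercase letters, digits, and hyphens, starting with a letter, at most 20 characters), and `resolve_prefix` is a hypothetical stand-in name.

```python
import re
from typing import Optional

# Assumption: the real _JOB_NAME_PREFIX_REGEX is not visible in this diff.
# This pattern encodes the documented shape only.
ASSUMED_PREFIX_REGEX = re.compile(r"[a-z][a-z0-9-]{0,19}")


def resolve_prefix(cli_value: Optional[str], yaml_value: Optional[str]) -> str:
    """Pick the CLI value when set, else the YAML value, then validate it."""
    # An empty string counts as "not set", so the YAML value is used instead.
    effective = cli_value or yaml_value
    if not effective:
        raise ValueError("job_name_prefix is required (CLI flag or YAML field).")
    # fullmatch rejects anything with leftover characters, e.g. a 21st char.
    if not ASSUMED_PREFIX_REGEX.fullmatch(effective):
        raise ValueError(f"job_name_prefix={effective!r} is invalid.")
    return effective


if __name__ == "__main__":
    print(resolve_prefix("nightly-run", "from-yaml"))  # CLI value takes precedence
    print(resolve_prefix(None, "from-yaml"))  # falls back to the YAML value
```

Validating after the CLI/YAML merge, rather than per source, means a bad YAML value is only reported when it is actually the one in effect.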
diff --git a/gigl/analytics/data_analyzer/embedding_diagnostics.py b/gigl/analytics/data_analyzer/embedding_diagnostics.py
new file mode 100644
index 000000000..7fa5d36eb
--- /dev/null
+++ b/gigl/analytics/data_analyzer/embedding_diagnostics.py
@@ -0,0 +1,174 @@
+"""Structural-sanity diagnostics for REPEATED FLOAT (embedding) columns.
+
+Runs one BigQuery aggregate per (table, embedding column) to compute
+``total`` rows, ``unique_count`` of distinct vectors, ``unique_ratio``,
+and the top-K most-frequent hash clusters. Uses
+``FARM_FINGERPRINT(TO_JSON_STRING(<col>))`` as the deduplication key —
+cheap, deterministic, and exact for equality (not similarity).
+
+A low ``unique_ratio`` or a heavily-weighted top entry indicates upstream
+degeneracy (many rows emitting the same embedding — often a zero-padded
+placeholder for missing data).
+
+The component is best-effort: a failure on one column is logged and that
+column is skipped; callers find it missing from the result mapping rather
+than receiving an exception.
+"""
+
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass
+from typing import Optional
+
+from gigl.analytics.data_analyzer.types import EmbeddingDiagnosticsResult, TopKEntry
+from gigl.common.logger import Logger
+from gigl.src.common.utils.bq import BqUtils
+
+logger = Logger()
+
+_PARALLEL_DIAGNOSTICS_QUERIES = 8
+_DEFAULT_TOP_K = 20
+
+
+@dataclass(frozen=True)
+class EmbeddingDiagnosticsRequest:
+    """One (table, embedding columns, result key) triple to analyze.
+
+    ``result_key`` is the per-table analyzer key (``"node:{type}"`` or
+    ``"edge:{type}"``) used to organize outputs into the
+    ``FeatureProfileResult.embedding_diagnostics`` two-level dict.
+    """
+
+    result_key: str
+    bq_table: str
+    embedding_columns: list[str]
+
+
+class EmbeddingDiagnostics:
+    """Compute structural diagnostics for embedding columns via BigQuery."""
+
+    def __init__(
+        self,
+        bq_utils: BqUtils,
+        top_k: int = _DEFAULT_TOP_K,
+        max_workers: int = _PARALLEL_DIAGNOSTICS_QUERIES,
+    ) -> None:
+        self._bq_utils = bq_utils
+        self._top_k = top_k
+        self._max_workers = max_workers
+
+    def analyze(
+        self, requests: list[EmbeddingDiagnosticsRequest]
+    ) -> dict[str, dict[str, EmbeddingDiagnosticsResult]]:
+        """Run one aggregate query per (table, column) and collect results.
+
+        Per-column failures are logged and skipped; one bad column does not
+        sink other columns in the same request or other requests. A request
+        whose every column failed never gets an inner dict, so its
+        ``result_key`` is absent from the output.
+
+        Args:
+            requests: One entry per table with at least one embedding column.
+
+        Returns:
+            ``{result_key: {column_name: EmbeddingDiagnosticsResult}}``.
+            Missing keys indicate the column's query failed.
+        """
+        jobs: list[tuple[str, str, str]] = []
+        for request in requests:
+            for column in request.embedding_columns:
+                jobs.append((request.result_key, request.bq_table, column))
+        if not jobs:
+            return {}
+
+        logger.info(
+            f"Running {len(jobs)} embedding diagnostic query(ies) across "
+            f"{len(requests)} table(s)."
+        )
+        out: dict[str, dict[str, EmbeddingDiagnosticsResult]] = {}
+        with ThreadPoolExecutor(max_workers=self._max_workers) as executor:
+            future_to_key = {
+                executor.submit(
+                    self._analyze_column, bq_table=bq_table, column=column
+                ): (result_key, column)
+                for result_key, bq_table, column in jobs
+            }
+            for future in as_completed(future_to_key):
+                result_key, column = future_to_key[future]
+                try:
+                    diagnostics = future.result()
+                except Exception as exc:
+                    logger.exception(
+                        f"Embedding diagnostics failed for "
+                        f"{result_key}:{column}: {exc}"
+                    )
+                    continue
+                if diagnostics is None:
+                    continue
+                out.setdefault(result_key, {})[column] = diagnostics
+        return out
+
+    def _analyze_column(
+        self, bq_table: str, column: str
+    ) -> Optional[EmbeddingDiagnosticsResult]:
+        """Run the dedup aggregate for one column; return its result."""
+        query = _build_dedup_query(bq_table=bq_table, column=column, top_k=self._top_k)
+        rows = list(self._bq_utils.run_query(query=query, labels={}))
+        if len(rows) != 1:
+            raise RuntimeError(
+                f"Embedding diagnostics query expected exactly 1 row for "
+                f"{bq_table}.{column}; got {len(rows)}."
+            )
+        row = rows[0]
+        total = int(row["total"] or 0)
+        unique_count = int(row["unique_count"] or 0)
+        unique_ratio = float(row["unique_ratio"] or 0.0)
+        top_k_rows = row["top_k"] or []
+        top_k = [
+            TopKEntry(
+                hash=int(entry["hash_value"]),
+                count=int(entry["count_value"]),
+                fraction=float(entry["fraction"] or 0.0),
+            )
+            for entry in top_k_rows
+        ]
+        return EmbeddingDiagnosticsResult(
+            total=total,
+            unique_count=unique_count,
+            unique_ratio=unique_ratio,
+            top_k=top_k,
+        )
+
+
+def _build_dedup_query(bq_table: str, column: str, top_k: int) -> str:
+    """Render the per-column dedup aggregate.
+
+    ``FARM_FINGERPRINT(TO_JSON_STRING(<col>))`` is deterministic and
+    collision-resistant enough for this purpose — we're looking for
+    unusually clumped clusters, not cryptographic uniqueness.
+    """
+    return f"""
+WITH hashes AS (
+  SELECT FARM_FINGERPRINT(TO_JSON_STRING(`{column}`)) AS h
+  FROM `{bq_table}`
+),
+counts AS (
+  SELECT h, COUNT(*) AS n FROM hashes GROUP BY h
+),
+agg AS (
+  SELECT SUM(n) AS total, COUNT(*) AS unique_count FROM counts
+)
+SELECT
+  agg.total,
+  agg.unique_count,
+  SAFE_DIVIDE(agg.unique_count, agg.total) AS unique_ratio,
+  ARRAY(
+    SELECT AS STRUCT
+      h AS hash_value,
+      n AS count_value,
+      SAFE_DIVIDE(n, agg.total) AS fraction
+    FROM counts
+    ORDER BY n DESC
+    LIMIT {top_k}
+  ) AS top_k
+FROM agg
+""".strip()
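
To make the aggregate in `_build_dedup_query` easier to reason about, here is a rough local analogue in plain Python. It is a sketch under stated assumptions, not the module's implementation: `dedup_stats` is a hypothetical helper, and it groups on the JSON string itself instead of `FARM_FINGERPRINT(TO_JSON_STRING(...))`, which yields the same exact-equality grouping without a 64-bit hash.

```python
import json
from collections import Counter


def dedup_stats(vectors, top_k=20):
    """Return total, unique_count, unique_ratio, and the top-K clusters."""
    # Swapped-in key: the serialized vector, not a FARM_FINGERPRINT hash.
    counts = Counter(json.dumps(v) for v in vectors)
    total = sum(counts.values())
    unique_count = len(counts)
    return {
        "total": total,
        "unique_count": unique_count,
        # Mirrors SAFE_DIVIDE: no ratio is produced for an empty input.
        "unique_ratio": unique_count / total if total else None,
        "top_k": [
            {"key": key, "count": n, "fraction": n / total}
            for key, n in counts.most_common(top_k)
        ],
    }


if __name__ == "__main__":
    # Three of four rows share the same zero vector, a typical placeholder.
    sample = [[0.0, 0.0], [0.0, 0.0], [1.0, 2.0], [0.0, 0.0]]
    print(dedup_stats(sample, top_k=1))
```

A top entry holding most of the rows, as in the sample, is the "heavily-weighted top entry" the module docstring flags as a sign of upstream degeneracy.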
diff --git a/gigl/analytics/data_analyzer/embedding_projection.py b/gigl/analytics/data_analyzer/embedding_projection.py
new file mode 100644
index 000000000..fec7d440d
--- /dev/null
+++ b/gigl/analytics/data_analyzer/embedding_projection.py
@@ -0,0 +1,179 @@
+"""Schema-aware BQ projection builder for the feature profiler.
+
+Translates a BigQuery table schema into a ``SELECT`` projection that
+TFDV can profile. Scalar profileable columns pass through unchanged;
+REPEATED ``FLOAT`` / ``FLOAT64`` / ``NUMERIC`` / ``BIGNUMERIC`` columns
+(embeddings) expand into four scalar hygiene companions:
+
+* ``<col>_len`` — array length
+* ``<col>_has_nan`` — any NaN element
+* ``<col>_has_inf`` — any Inf element
+* ``<col>_is_all_zero`` — every element equals 0
+
+Structural-sanity (dedup / unique-ratio / top-K) lives in
+:mod:`gigl.analytics.data_analyzer.embedding_diagnostics`, which runs its
+own aggregate query over ``FARM_FINGERPRINT(TO_JSON_STRING(<col>))``. The
+hash is deliberately excluded from this projection so TFDV doesn't render
+noisy stats on a 64-bit hash column.
+"""
+
+from dataclasses import dataclass
+
+from google.cloud.bigquery import SchemaField
+
+from gigl.common.logger import Logger
+
+logger = Logger()
+
+# BigQuery scalar types TFDV can profile once wrapped as ``list<scalar>`` by
+# ``BqTableToRecordBatch``. Matches ``_PROFILEABLE_FIELD_TYPES`` in
+# ``feature_profiler.py`` — kept in sync via a single import site.
+_SCALAR_PROFILEABLE_TYPES: frozenset[str] = frozenset(
+    {
+        "STRING",
+        "INTEGER",
+        "INT64",
+        "FLOAT",
+        "FLOAT64",
+        "NUMERIC",
+        "BIGNUMERIC",
+        "BOOLEAN",
+        "BOOL",
+    }
+)
+
+# REPEATED types that represent embedding vectors. STRING / INT arrays are
+# intentionally excluded — they need different diagnostics (e.g. vocab stats)
+# and are out of scope for this pass.
+_EMBEDDING_FLOAT_TYPES: frozenset[str] = frozenset(
+    {"FLOAT", "FLOAT64", "NUMERIC", "BIGNUMERIC"}
+)
+
+
+@dataclass(frozen=True)
+class ProjectionResult:
+    """Output of :func:`build_projection`.
+
+    ``projection`` is a list of ``(column_name, sql_expression)`` pairs
+    suitable for feeding directly into a
+    :class:`~gigl.common.beam.tfdv_transforms.BqTableToRecordBatch`. Each
+    entry renders as ``{sql_expression} AS \\`{column_name}\\``` in the
+    resulting ``SELECT``.
+
+    ``embedding_columns`` lists the original REPEATED FLOAT column names
+    (pre-expansion) in schema order; the dedup pass uses them to build its
+    per-column aggregate queries (no ``<col>_hash`` companion is projected).
+    """
+
+    projection: list[tuple[str, str]]
+    embedding_columns: list[str]
+
+
+def is_embedding_column(field: SchemaField) -> bool:
+    """Return ``True`` for REPEATED FLOAT-family columns (embedding vectors)."""
+    return (
+        field.mode == "REPEATED" and field.field_type.upper() in _EMBEDDING_FLOAT_TYPES
+    )
+
+
+def detect_embedding_columns(
+    schema: dict[str, SchemaField], excluded: set[str]
+) -> list[str]:
+    """List REPEATED FLOAT-family columns in the schema, in declaration order.
+
+    Excluded columns (typically structural join keys) are dropped.
+    """
+    return [
+        name
+        for name, field in schema.items()
+        if name not in excluded and is_embedding_column(field)
+    ]
+
+
+def build_projection(
+    schema: dict[str, SchemaField], excluded: set[str]
+) -> ProjectionResult:
+    """Build a TFDV-compatible projection from a BigQuery schema.
+
+    Scalar profileable columns (see :data:`_SCALAR_PROFILEABLE_TYPES`) are
+    passed through verbatim, *except* BOOL / BOOLEAN columns are cast to
+    INT64. ``BqTableToRecordBatch`` wraps each value in a single-element
+    list before emitting an Arrow ``RecordBatch``; TFDV's
+    ``get_feature_type_from_arrow_type`` does not accept ``list<bool>``
+    (only int / float / string / bytes lists), so a raw BOOL column would
+    crash the Dataflow job in ``BasicStatsGenerator.add_input``. Casting
+    to INT64 in SQL keeps the BOOL semantics (0/1) profileable as an
+    int feature.
+
+    REPEATED FLOAT-family columns are expanded into four scalar hygiene
+    companions (see module docstring). The three boolean companions
+    (``_has_nan``, ``_has_inf``, ``_is_all_zero``) are likewise cast to
+    INT64 for the same reason. REPEATED non-FLOAT columns and
+    non-profileable scalar types are skipped with an ``INFO`` log.
+
+    Args:
+        schema: Column name → ``SchemaField`` map (as returned by
+            ``BqUtils.fetch_bq_table_schema``).
+        excluded: Column names to drop entirely (typically structural join
+            keys: node ``id_column``; edge ``src_id_column`` +
+            ``dst_id_column``).
+
+    Returns:
+        :class:`ProjectionResult`. ``projection`` preserves schema order
+        with each embedding's hygiene companions appearing in a contiguous
+        block.
+    """
+    projection: list[tuple[str, str]] = []
+    embedding_columns: list[str] = []
+    for name, field in schema.items():
+        if name in excluded:
+            continue
+        if is_embedding_column(field):
+            projection.extend(_embedding_hygiene_projection(name))
+            embedding_columns.append(name)
+            continue
+        if field.mode == "REPEATED":
+            logger.info(
+                f"skipping REPEATED column {name!r} of type {field.field_type} "
+                "(hygiene companions only cover REPEATED FLOAT families)."
+            )
+            continue
+        type_upper = field.field_type.upper()
+        if type_upper not in _SCALAR_PROFILEABLE_TYPES:
+            logger.info(
+                f"skipping column {name!r} of type {field.field_type} "
+                "(not TFDV-profileable)."
+            )
+            continue
+        if type_upper in ("BOOL", "BOOLEAN"):
+            projection.append((name, f"CAST(`{name}` AS INT64)"))
+        else:
+            projection.append((name, f"`{name}`"))
+    return ProjectionResult(projection=projection, embedding_columns=embedding_columns)
+
+
+def _embedding_hygiene_projection(column: str) -> list[tuple[str, str]]:
+    """Return the four hygiene ``(name, expr)`` entries for one embedding column.
+
+    The three boolean companions are wrapped in ``CAST(... AS INT64)`` so
+    the resulting Arrow column is ``list<int64>`` rather than ``list<bool>``;
+    see :func:`build_projection` for the TFDV compatibility rationale.
+    """
+    return [
+        (f"{column}_len", f"ARRAY_LENGTH(`{column}`)"),
+        (
+            f"{column}_has_nan",
+            f"CAST(IFNULL((SELECT LOGICAL_OR(IS_NAN(v)) FROM UNNEST(`{column}`) v), "
+            "FALSE) AS INT64)",
+        ),
+        (
+            f"{column}_has_inf",
+            f"CAST(IFNULL((SELECT LOGICAL_OR(IS_INF(v)) FROM UNNEST(`{column}`) v), "
+            "FALSE) AS INT64)",
+        ),
+        (
+            f"{column}_is_all_zero",
+            f"CAST(IFNULL((SELECT LOGICAL_AND(v = 0) FROM UNNEST(`{column}`) v), "
+            "FALSE) AS INT64)",
+        ),
+    ]
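
The semantics of the four hygiene companions can be checked locally with a small Python analogue of the SQL expressions in `_embedding_hygiene_projection`. This is an illustrative sketch only: `hygiene_companions` is a hypothetical helper, and it assumes the embedding value arrives as a plain list of floats.

```python
import math


def hygiene_companions(column, vector):
    """Compute the four companions for one embedding value as 0/1 integers."""
    non_empty = len(vector) > 0
    return {
        f"{column}_len": len(vector),
        f"{column}_has_nan": int(any(math.isnan(v) for v in vector)),
        f"{column}_has_inf": int(any(math.isinf(v) for v in vector)),
        # In the SQL, LOGICAL_AND over an empty array is NULL and IFNULL maps
        # it to FALSE, so an empty vector is not all-zero. Python's all([])
        # would return True, hence the explicit non_empty guard.
        f"{column}_is_all_zero": int(non_empty and all(v == 0 for v in vector)),
    }


if __name__ == "__main__":
    print(hygiene_companions("emb", [0.0, 0.0]))
    print(hygiene_companions("emb", []))
    print(hygiene_companions("emb", [1.0, float("nan")]))
```

Returning integers rather than booleans follows the same reasoning as the `CAST(... AS INT64)` wrappers: the values stay profileable as an int feature.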
diff --git a/gigl/analytics/data_analyzer/feature_profiler.py b/gigl/analytics/data_analyzer/feature_profiler.py
new file mode 100644
index 000000000..451806883
--- /dev/null
+++ b/gigl/analytics/data_analyzer/feature_profiler.py
@@ -0,0 +1,749 @@
+"""TFDV feature profiling via Beam/Dataflow.
+
+Launches one Dataflow pipeline per node and edge table in the analyzer
+config. For each table, the BQ projection is built from the table schema
+via :func:`~gigl.analytics.data_analyzer.embedding_projection.build_projection`:
+scalar profileable columns pass through, REPEATED FLOAT-family columns
+(embeddings) expand into four hygiene companions
+(``<col>_len``/``_has_nan``/``_has_inf``/``_is_all_zero``). Each pipeline
+reads the resulting columns from BigQuery, emits ``pa.RecordBatch``
+batches, and runs ``tfdv.GenerateStatistics`` to write a Facets HTML
+visualization plus a TFDV stats TFRecord to GCS.
+
+After all Dataflow pipelines finish, one aggregate BigQuery query per
+embedding column runs via
+:class:`~gigl.analytics.data_analyzer.embedding_diagnostics.EmbeddingDiagnostics`
+to compute structural sanity (unique ratio + top-K most-frequent hashes).
+The final :class:`FeatureProfileResult` is serialized to
+``{output_gcs_path}/feature_profile.json`` via :func:`write_artifact` so
+external consumers can parse it without scraping HTML.
+
+Tables whose final projection is empty (e.g. only ID columns, or a schema
+fetch failed) are skipped with a warning. Per-table Beam failures, the
+diagnostics pass, and the sidecar write are all best-effort: the TFDV
+artifacts remain valuable even if one downstream step fails.
+"""
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass, field
+from typing import Any, Optional
+
+import apache_beam as beam
+import tensorflow_data_validation as tfdv
+from apache_beam.options.pipeline_options import GoogleCloudOptions
+from tensorflow_data_validation.utils import slicing_util
+
+from gigl.analytics.data_analyzer.config import DataAnalyzerConfig
+from gigl.analytics.data_analyzer.embedding_diagnostics import (
+    EmbeddingDiagnostics,
+    EmbeddingDiagnosticsRequest,
+)
+from gigl.analytics.data_analyzer.embedding_projection import (
+    ProjectionResult,
+    build_projection,
+)
+from gigl.analytics.data_analyzer.types import (
+    FeatureProfileError,
+    FeatureProfileResult,
+    write_artifact,
+)
+from gigl.common import UriFactory
+from gigl.common.beam.sharded_read import BigQueryShardedReadConfig
+from gigl.common.beam.tfdv_transforms import (
+    BqTableToRecordBatch,
+    GenerateAndVisualizeStats,
+)
+from gigl.common.logger import Logger
+from gigl.env.pipelines_config import GiglResourceConfigWrapper
+from gigl.src.common.constants.components import GiGLComponents
+from gigl.src.common.types import AppliedTaskIdentifier
+from gigl.src.common.utils.bq import BqUtils
+from gigl.src.common.utils.dataflow import init_beam_pipeline_options
+
+logger = Logger()
+
+_PARALLEL_DATAFLOW_WORKERS = 10
+# Kept short to leave room for the per-run prefix and timestamp inside
+# the Dataflow job-name budget (~63 chars).
+_APPLIED_TASK_IDENTIFIER = AppliedTaskIdentifier("analyzer")
+
+
+def _safe_dataflow_job_id(result: Any) -> Optional[str]:
+    """Return ``result.job_id()`` if present, else ``None``.
+
+    The DataflowRunner returns a ``DataflowPipelineResult`` whose
+    ``job_id()`` method exposes the submitted job's ID. Other runners
+    (DirectRunner, etc.) don't have this attribute; we degrade silently
+    instead of raising so callers can keep an unrelated failure path
+    clean.
+    """
+    job_id_attr = getattr(result, "job_id", None)
+    if job_id_attr is None:
+        return None
+    try:
+        if callable(job_id_attr):
+            value = job_id_attr()
+        else:
+            value = job_id_attr
+    except Exception:
+        return None
+    return str(value) if value else None
+
+
+def _build_dataflow_console_url(
+    project: Optional[str], region: Optional[str], job_id: Optional[str]
+) -> Optional[str]:
+    """Compose the Cloud Console URL for a Dataflow job.
+
+    Returns ``None`` if any of project / region / job_id is missing,
+    rather than producing a malformed URL.
+    """
+    if not project or not region or not job_id:
+        return None
+    return (
+        f"https://console.cloud.google.com/dataflow/jobs/{region}/{job_id}"
+        f"?project={project}"
+    )
+
+
+def _resolve_projection(
+    bq_table: str,
+    explicit: list[str],
+    excluded: set[str],
+    bq_utils: BqUtils,
+    extra_columns: Optional[list[str]] = None,
+) -> tuple[ProjectionResult, Optional[str]]:
+    """Build the projection for one table, honoring an explicit override.
+
+    If ``explicit`` is non-empty, the schema is still fetched but only
+    those columns are considered (minus ``excluded``). Explicit names not
+    present in the schema are logged and dropped rather than raising.
+    Otherwise every non-excluded column is routed through
+    :func:`build_projection`.
+
+    ``extra_columns`` are appended to the resulting projection unconditionally
+    if they exist in the schema (e.g. label / split columns the analyzer
+    needs available for TFDV slicing even when the user's explicit
+    ``feature_columns`` doesn't list them). Extras already present in the
+    base projection are skipped to avoid duplicate SELECT entries; extras
+    missing from the schema are warned about and dropped.
+
+    Returns ``(projection_result, error_message_or_none)``. A non-None
+    second element means the schema fetch failed; the caller should
+    surface that as a structured error instead of just silently skipping
+    the table.
+    """
+    try:
+        schema = bq_utils.fetch_bq_table_schema(bq_table)
+    except Exception as exc:
+        message = f"Schema fetch failed for {bq_table}: {exc}"
+        logger.warning(message)
+        return ProjectionResult(projection=[], embedding_columns=[]), message
+
+    if explicit:
+        unknown = [c for c in explicit if c not in schema]
+        if unknown:
+            logger.warning(
+                f"{bq_table}: explicit feature_columns {unknown} not in "
+                f"schema; ignoring."
+            )
+        filtered_schema = {
+            name: field
+            for name, field in schema.items()
+            if name in explicit and name not in excluded
+        }
+        base = build_projection(filtered_schema, excluded=set())
+    else:
+        base = build_projection(schema, excluded=excluded)
+
+    if extra_columns:
+        existing_names = {name for name, _ in base.projection}
+        extras_schema = {}
+        for column in extra_columns:
+            if column in existing_names:
+                continue
+            if column not in schema:
+                logger.warning(
+                    f"{bq_table}: extra projection column {column!r} not in "
+                    f"schema; ignoring."
+                )
+                continue
+            extras_schema[column] = schema[column]
+        if extras_schema:
+            extras_projection = build_projection(extras_schema, excluded=set())
+            base = ProjectionResult(
+                projection=list(base.projection) + list(extras_projection.projection),
+                embedding_columns=list(base.embedding_columns),
+            )
+
+    return base, None
+
+
+@dataclass(frozen=True)
+class _ProfileTask:
+    """One profiling unit: all columns of a single node or edge table.
+
+    ``kind`` is ``"node"`` or ``"edge"`` (singular) and is used to build
+    the GCS output path and the result key (``"node:user"``, etc.).
+
+    ``shard_key`` is the column the BQ read fans out on (hash-mod-N) to
+    avoid the single-giant-export pattern that hangs ``SplitWithSizing``
+    on very large tables. Sourced from ``NodeTableSpec.id_column`` for
+    node tables and ``EdgeTableSpec.src_id_column`` for edge tables —
+    both are guaranteed present and uniformly distributed enough for a
+    FARM_FINGERPRINT-based mod split.
+
+    ``slice_columns`` lists columns whose distinct values should each
+    produce a slice of the TFDV stats. The values come from
+    ``NodeTableSpec.label_column`` / ``NodeTableSpec.split_column`` —
+    when set, the profiler routes them through ``slicing_util`` so the
+    resulting TFDV stats include per-slice ``DatasetFeatureStatistics``
+    entries (per-class label histograms, per-class feature null-rate,
+    per-split distributions). Empty for edge tables and for node tables
+    that don't activate NC supervision.
+    """
+
+    kind: str
+    type_name: str
+    bq_table: str
+    projection: list[tuple[str, str]]
+    embedding_columns: list[str]
+    shard_key: str
+    slice_columns: list[str] = field(default_factory=list)
+    chunk_index: int = 0
+    total_chunks: int = 1
+
+    @property
+    def result_key(self) -> str:
+        return f"{self.kind}:{self.type_name}"
+
+    @property
+    def artifact_subdir(self) -> str:
+        """Empty for single-chunk tables; ``chunk_NN/`` for multi-chunk tables.
+
+        Multi-chunk tables write each chunk's Facets HTML + stats TFRecord
+        under their own ``chunk_NN/`` subdir to avoid collisions; single-chunk
+        tables keep the historical flat layout for backward-compatible URLs.
+        """
+        if self.total_chunks <= 1:
+            return ""
+        return f"chunk_{self.chunk_index:02d}/"
+
+
+class FeatureProfiler:
+    """Runs TFDV feature profiling + embedding diagnostics on BQ tables via Dataflow.
+
+    Example:
+        >>> profiler = FeatureProfiler()
+        >>> result = profiler.profile(config, resource_config, "my-run", run_timestamp)
+        >>> result.facets_html_paths["node:user"]
+        ['gs://bucket/analyzer/feature_profiler/nodes/user/facets.html']
+    """
+
+    def profile(
+        self,
+        config: DataAnalyzerConfig,
+        resource_config: GiglResourceConfigWrapper,
+        job_name_prefix: str,
+        run_timestamp: str,
+        custom_worker_image_uri: Optional[str] = None,
+    ) -> FeatureProfileResult:
+        """Run TFDV profiling + embedding diagnostics for every table in the config.
+
+        For each table, the BQ projection is built via
+        :func:`_resolve_projection` (explicit ``feature_columns`` narrow the
+        schema; otherwise every non-excluded column is considered).
+        Embedding columns (REPEATED FLOAT families) expand into hygiene
+        companions in the projection and trigger a post-Dataflow structural
+        diagnostics pass.
+
+        Tables whose final projection is empty are skipped with a warning.
+        Per-table Dataflow failures are logged and omitted. The embedding
+        diagnostics pass and JSON sidecar write are best-effort.
+
+        Args:
+            config: Analyzer configuration with node and edge table specs.
+            resource_config: Resource config; its ``.project`` is used for
+                BigQuery schema lookups and diagnostics queries.
+            job_name_prefix: User-supplied prefix mixed into every per-table
+                Dataflow job name to disambiguate concurrent / repeat runs.
+            run_timestamp: Per-run timestamp string mixed into every per-table
+                Dataflow job name. Computed once at the entry point so all
+                jobs from one analyzer invocation share the same value.
+            custom_worker_image_uri: Optional Docker image URI for the
+                Dataflow worker harness. When ``None``, falls back to
+                ``DEFAULT_GIGL_RELEASE_SRC_IMAGE_DATAFLOW_CPU``.
+
+        Returns:
+            :class:`FeatureProfileResult` with lists of GCS paths (one entry
+            per chunk) keyed by ``"node:{type}"`` / ``"edge:{type}"``, plus
+            any embedding diagnostics that succeeded. A key missing from the
+            facets / stats paths indicates a skipped or failed table; see
+            ``errors`` for the cause.
+        """
+        bq_utils = BqUtils(project=resource_config.project)
+        tasks, collection_errors = _collect_profile_tasks(config, bq_utils)
+        result = FeatureProfileResult()
+        result.errors.extend(collection_errors)
+        if not tasks:
+            logger.info("No tables have profileable columns; returning empty result.")
+            self._maybe_write_sidecar(result, config.output_gcs_path)
+            return result
+
+        logger.info(f"Launching {len(tasks)} Dataflow feature-profile job(s).")
+        with ThreadPoolExecutor(max_workers=_PARALLEL_DATAFLOW_WORKERS) as executor:
+            future_to_task = {
+                executor.submit(
+                    self._run_single_pipeline,
+                    task,
+                    config.output_gcs_path,
+                    resource_config,
+                    job_name_prefix,
+                    run_timestamp,
+                    custom_worker_image_uri,
+                ): task
+                for task in tasks
+            }
+            for future in as_completed(future_to_task):
+                task = future_to_task[future]
+                try:
+                    facets_uri, stats_uri = future.result()
+                    # ``setdefault`` keeps multi-chunk per-table aggregation safe
+                    # under the unordered ``as_completed`` iteration: each chunk
+                    # lands as a list entry under the table-level result_key.
+                    result.facets_html_paths.setdefault(task.result_key, []).append(
+                        facets_uri
+                    )
+                    result.stats_paths.setdefault(task.result_key, []).append(stats_uri)
+                    if task.slice_columns:
+                        result.slice_columns_by_result_key[task.result_key] = list(
+                            task.slice_columns
+                        )
+                except Exception as exc:
+                    logger.exception(
+                        f"Feature profiling failed for {task.result_key} "
+                        f"(table={task.bq_table}): {exc}"
+                    )
+                    result.errors.append(
+                        FeatureProfileError(
+                            result_key=task.result_key,
+                            bq_table=task.bq_table,
+                            stage="dataflow",
+                            message=f"{type(exc).__name__}: {exc}",
+                            job_id=getattr(exc, "_gigl_job_id", None),
+                            job_name=getattr(exc, "_gigl_job_name", None),
+                            console_url=getattr(exc, "_gigl_console_url", None),
+                        )
+                    )
+
+        self._run_embedding_diagnostics(tasks, bq_utils, result)
+        self._maybe_write_sidecar(result, config.output_gcs_path)
+        return result
+
+    def _run_single_pipeline(
+        self,
+        task: _ProfileTask,
+        output_gcs_path: str,
+        resource_config: GiglResourceConfigWrapper,
+        job_name_prefix: str,
+        run_timestamp: str,
+        custom_worker_image_uri: Optional[str] = None,
+    ) -> tuple[str, str]:
+        """Build, run, and block on a single table's Dataflow pipeline.
+
+        Returns the ``(facets_uri, stats_uri)`` strings on success.
+
+        Worker sizing (machine_type / num_workers / max_num_workers /
+        disk_size_gb / timeout) is read from
+        ``resource_config.preprocessor_config.node_preprocessor_config`` for
+        node tasks and ``.edge_preprocessor_config`` for edge tasks. The
+        analyzer reuses the preprocessor's Dataflow sizing on the same
+        kind of table rather than declaring its own block, mirroring the
+        pattern in
+        :func:`gigl.src.data_preprocessor.lib.transform.utils.transform_features`.
+
+        Captures the Dataflow ``job_id`` / ``job_name`` / console URL on the
+        raised exception (as ``_gigl_*`` attributes) when the pipeline fails
+        on a Dataflow runner. The caller reads those off the exception and
+        promotes them into a :class:`FeatureProfileError` so the HTML report
+        can deep-link to the failed job's logs. Best-effort: a non-Dataflow
+        runner (e.g. DirectRunner in tests) yields ``None`` for job_id.
+        """
+        # Single-chunk tables keep the historical flat layout
+        # (``.../{type}/facets.html``); multi-chunk tables write each chunk
+        # under its own ``chunk_NN/`` subdir so the stats / Facets per chunk
+        # don't collide.
+        base = (
+            f"{output_gcs_path.rstrip('/')}/feature_profiler/"
+            f"{task.kind}s/{task.type_name}/{task.artifact_subdir}"
+        ).rstrip("/")
+        facets_uri = UriFactory.create_uri(f"{base}/facets.html")
+        stats_uri = UriFactory.create_uri(f"{base}/stats.tfrecord")
+
+        if task.kind == "node":
+            dataflow_config = (
+                resource_config.preprocessor_config.node_preprocessor_config
+            )
+        elif task.kind == "edge":
+            dataflow_config = (
+                resource_config.preprocessor_config.edge_preprocessor_config
+            )
+        else:
+            raise ValueError(
+                f"Unexpected task.kind={task.kind!r}; expected 'node' or 'edge'."
+            )
+
+        # Append a chunk suffix to the Dataflow job-name only when the table
+        # is actually being chunked, to keep single-chunk job names stable
+        # and within Dataflow's 63-char job-name budget for the common case.
+        chunk_suffix = (
+            f"-chunk-{task.chunk_index:02d}-of-{task.total_chunks:02d}"
+            if task.total_chunks > 1
+            else ""
+        )
+        options = init_beam_pipeline_options(
+            applied_task_identifier=_APPLIED_TASK_IDENTIFIER,
+            job_name_suffix=(
+                f"{job_name_prefix}-{run_timestamp}-profile-"
+                f"{task.kind}-{task.type_name}{chunk_suffix}"
+            ),
+            component=GiGLComponents.DataAnalyzer,
+            custom_worker_image_uri=custom_worker_image_uri,
+            timeout_seconds=dataflow_config.timeout or None,
+            num_workers=dataflow_config.num_workers,
+            max_num_workers=dataflow_config.max_num_workers,
+            machine_type=dataflow_config.machine_type,
+            disk_size_gb=dataflow_config.disk_size_gb,
+        )
+        gcp_opts = options.view_as(GoogleCloudOptions)
+        job_name = gcp_opts.job_name
+        project = gcp_opts.project
+        region = gcp_opts.region
+
+        stats_options = _build_slice_stats_options(task.slice_columns)
+
+        # Shard the BQ read on the natural per-table key (id_column for nodes,
+        # src_id_column for edges). Mirrors the data_preprocessor's
+        # ShardedExportRead pattern; without it, a single giant ReadFromBigQuery
+        # on a large user/edge table hangs Dataflow's SplitWithSizing on
+        # oversized GCS Avro reads. ``num_shards`` defaults to 20 inside the
+        # config dataclass (matches the preprocessor default).
+        sharded_read_config = BigQueryShardedReadConfig(
+            shard_key=task.shard_key,
+            project_id=resource_config.project,
+            temp_dataset_name=resource_config.temp_assets_bq_dataset_name,
+        )
+
+        pipeline = beam.Pipeline(options=options)
+        _ = (
+            pipeline
+            | f"Read {task.result_key} from BQ"
+            >> BqTableToRecordBatch(
+                bq_table=task.bq_table,
+                projection=task.projection,
+                sharded_read_config=sharded_read_config,
+            )
+            | f"Generate TFDV stats for {task.result_key}"
+            >> GenerateAndVisualizeStats(
+                facets_report_uri=facets_uri,
+                stats_output_uri=stats_uri,
+                stats_options=stats_options,
+            )
+        )
+        result = pipeline.run()
+        try:
+            result.wait_until_finish()
+        except Exception as exc:
+            job_id = _safe_dataflow_job_id(result)
+            console_url = _build_dataflow_console_url(
+                project=project, region=region, job_id=job_id
+            )
+            exc._gigl_job_id = job_id  # type: ignore[attr-defined]
+            exc._gigl_job_name = job_name  # type: ignore[attr-defined]
+            exc._gigl_console_url = console_url  # type: ignore[attr-defined]
+            raise
+        logger.info(f"Finished feature profiling for {task.result_key}.")
+        return facets_uri.uri, stats_uri.uri
+
+    def _run_embedding_diagnostics(
+        self,
+        tasks: list[_ProfileTask],
+        bq_utils: BqUtils,
+        result: FeatureProfileResult,
+    ) -> None:
+        """Run structural diagnostics for every task with embedding columns.
+
+        Best-effort: any exception is caught and recorded on ``result.errors``
+        so the sidecar write still runs and the already-produced TFDV
+        artifacts stay usable.
+
+        Multi-chunk tables emit multiple ``_ProfileTask``s with the same
+        ``result_key`` and ``embedding_columns`` (table-level). We dedupe
+        per ``result_key`` so the embedding-diagnostics BQ aggregate runs
+        once per table, not once per chunk.
+        """
+        deduped: dict[str, EmbeddingDiagnosticsRequest] = {}
+        for task in tasks:
+            if not task.embedding_columns:
+                continue
+            existing = deduped.get(task.result_key)
+            if existing is None:
+                deduped[task.result_key] = EmbeddingDiagnosticsRequest(
+                    result_key=task.result_key,
+                    bq_table=task.bq_table,
+                    embedding_columns=list(task.embedding_columns),
+                )
+                continue
+            # Same result_key seen on a previous chunk: union the embedding
+            # columns in case a chunk carries a narrower embedding subset.
+            # Chunks share table-level embedding_columns today, so this is
+            # purely defensive.
+            seen = set(existing.embedding_columns)
+            extra = [c for c in task.embedding_columns if c not in seen]
+            if extra:
+                deduped[task.result_key] = EmbeddingDiagnosticsRequest(
+                    result_key=existing.result_key,
+                    bq_table=existing.bq_table,
+                    embedding_columns=existing.embedding_columns + extra,
+                )
+        requests = list(deduped.values())
+        if not requests:
+            return
+        try:
+            diagnostics = EmbeddingDiagnostics(bq_utils=bq_utils).analyze(requests)
+        except Exception as exc:
+            logger.exception(f"Embedding diagnostics pass failed: {exc}")
+            message = f"{type(exc).__name__}: {exc}"
+            for request in requests:
+                result.errors.append(
+                    FeatureProfileError(
+                        result_key=request.result_key,
+                        bq_table=request.bq_table,
+                        stage="embedding_diagnostics",
+                        message=message,
+                    )
+                )
+            return
+        for result_key, per_column in diagnostics.items():
+            result.embedding_diagnostics[result_key] = per_column
+
+    def _maybe_write_sidecar(
+        self, result: FeatureProfileResult, output_gcs_path: str
+    ) -> None:
+        """Best-effort write of the Pydantic JSON sidecar."""
+        try:
+            write_artifact(
+                result=result,
+                component="feature_profile",
+                output_gcs_path=output_gcs_path,
+            )
+        except Exception as exc:
+            logger.exception(f"Failed to write feature_profile.json sidecar: {exc}")
+
+
+def _build_slice_stats_options(
+    slice_columns: list[str],
+) -> Optional[tfdv.StatsOptions]:
+    """Build a ``tfdv.StatsOptions`` configured to slice on the given columns.
+
+    Returns ``None`` when no slice columns are requested so callers can
+    cheaply pass through to TFDV's defaults. Each entry produces a
+    standard "feature value slicer" that emits one slice per distinct
+    value of the column. The unsliced ("Overall") stats are always
+    emitted by TFDV in addition to the per-slice stats, so existing
+    consumers continue to see the same top-level stats they did before
+    slicing was enabled.
+    """
+    if not slice_columns:
+        return None
+    slice_functions = [
+        slicing_util.get_feature_value_slicer({column: None})
+        for column in slice_columns
+    ]
+    return tfdv.StatsOptions(slice_functions=slice_functions)
+
+
+def _chunk_projection(
+    projection: list[tuple[str, str]],
+    max_features: int,
+    forced_columns: set[str],
+) -> list[list[tuple[str, str]]]:
+    """Split a projection into chunks of at most ``max_features`` columns.
+
+    Beam 2.56's runner-v2 cannot reliably iterate the per-key state that
+    TFDV's ``CombinePerKey(PreCombineFn)`` accumulates over very wide
+    projections (work items time out on
+    ``Instruction id ... was not registered``). Splitting the projection
+    across multiple Dataflow pipelines keeps every per-key partition small
+    enough for the runner to iterate.
+
+    ``forced_columns`` (typically slice columns: ``label_column`` /
+    ``split_column``) are present in **every** chunk so TFDV slicing
+    applies uniformly across chunks. Each chunk's non-forced budget is
+    ``max(1, max_features - len(forced_pairs))``, so the number of chunks
+    is ``ceil(len(non_forced) / budget)``.
+
+    Args:
+        projection: ``(column_name, sql_expression)`` pairs from
+            :func:`_resolve_projection`. Slice columns are already in here
+            (via that function's ``extra_columns``).
+        max_features: Target per-chunk column cap, counting forced
+            columns. A chunk exceeds it only when the forced columns alone
+            reach the cap, in which case each chunk still carries one
+            non-forced column.
+        forced_columns: Names that must appear in every chunk. They are
+            placed first in each chunk, in projection order.
+
+    Returns:
+        List of chunks. Empty when the projection is empty; a single
+        forced-only chunk when every column is forced.
+    """
+    forced_pairs = [(n, e) for n, e in projection if n in forced_columns]
+    rest = [(n, e) for n, e in projection if n not in forced_columns]
+    if not rest:
+        return [list(forced_pairs)] if forced_pairs else []
+    budget_per_chunk = max(1, max_features - len(forced_pairs))
+    chunks: list[list[tuple[str, str]]] = []
+    for start in range(0, len(rest), budget_per_chunk):
+        chunks.append(list(forced_pairs) + rest[start : start + budget_per_chunk])
+    return chunks
+
+
+def _collect_profile_tasks(
+    config: DataAnalyzerConfig, bq_utils: BqUtils
+) -> tuple[list[_ProfileTask], list[FeatureProfileError]]:
+    """Flatten the analyzer config into ``_ProfileTask``s, one per table chunk.
+
+    Resolves the projection for each node/edge spec by either restricting
+    to explicit ``feature_columns`` or auto-inferring from the BQ table
+    schema (excluding structural join keys), then splits it via
+    :func:`_chunk_projection`. Tables whose schema fetch failed or whose
+    resolved projection is empty (e.g. only ID columns) are recorded as a
+    structured ``FeatureProfileError`` so the HTML report can surface
+    them, and skipped.
+    """
+    tasks: list[_ProfileTask] = []
+    errors: list[FeatureProfileError] = []
+    for node_table in config.node_tables:
+        result_key = f"node:{node_table.node_type}"
+        # Slice columns must be in the projection so TFDV can read them.
+        # Disabling ``compute_per_class_feature_stats`` opts out of the
+        # label slice without forcing the user to drop ``label_column``
+        # itself (the graph_structure_analyzer NC tier still needs the
+        # column there).
+        slice_columns: list[str] = []
+        if (
+            node_table.label_column is not None
+            and config.compute_per_class_feature_stats
+        ):
+            slice_columns.append(node_table.label_column)
+        if node_table.split_column is not None:
+            slice_columns.append(node_table.split_column)
+
+        projection, schema_error = _resolve_projection(
+            bq_table=node_table.bq_table,
+            explicit=node_table.feature_columns,
+            excluded={node_table.id_column},
+            bq_utils=bq_utils,
+            extra_columns=slice_columns,
+        )
+        if schema_error is not None:
+            errors.append(
+                FeatureProfileError(
+                    result_key=result_key,
+                    bq_table=node_table.bq_table,
+                    stage="schema_fetch",
+                    message=schema_error,
+                )
+            )
+            continue
+        if not projection.projection:
+            message = (
+                f"No profileable columns after projection "
+                f"(id_column={node_table.id_column!r}, "
+                f"explicit feature_columns={node_table.feature_columns})."
+            )
+            logger.warning(f"Skipping {result_key}: {message}")
+            errors.append(
+                FeatureProfileError(
+                    result_key=result_key,
+                    bq_table=node_table.bq_table,
+                    stage="empty_projection",
+                    message=message,
+                )
+            )
+            continue
+        # Slice columns that didn't make it into the projection (missing
+        # from schema) are dropped; ``_resolve_projection`` already logged.
+        projected_names = {name for name, _ in projection.projection}
+        active_slice_columns = [
+            column for column in slice_columns if column in projected_names
+        ]
+        chunks = _chunk_projection(
+            projection.projection,
+            max_features=config.max_features_per_chunk,
+            forced_columns=set(active_slice_columns),
+        )
+        for chunk_index, chunk_projection in enumerate(chunks):
+            tasks.append(
+                _ProfileTask(
+                    kind="node",
+                    type_name=node_table.node_type,
+                    bq_table=node_table.bq_table,
+                    projection=chunk_projection,
+                    embedding_columns=projection.embedding_columns,
+                    shard_key=node_table.id_column,
+                    slice_columns=active_slice_columns,
+                    chunk_index=chunk_index,
+                    total_chunks=len(chunks),
+                )
+            )
+    for edge_table in config.edge_tables:
+        result_key = f"edge:{edge_table.edge_type}"
+        projection, schema_error = _resolve_projection(
+            bq_table=edge_table.bq_table,
+            explicit=edge_table.feature_columns,
+            excluded={
+                edge_table.src_id_column,
+                edge_table.dst_id_column,
+            },
+            bq_utils=bq_utils,
+        )
+        if schema_error is not None:
+            errors.append(
+                FeatureProfileError(
+                    result_key=result_key,
+                    bq_table=edge_table.bq_table,
+                    stage="schema_fetch",
+                    message=schema_error,
+                )
+            )
+            continue
+        if not projection.projection:
+            message = (
+                f"No profileable columns after projection "
+                f"(src_id_column={edge_table.src_id_column!r}, "
+                f"dst_id_column={edge_table.dst_id_column!r}, "
+                f"explicit feature_columns={edge_table.feature_columns})."
+            )
+            logger.warning(f"Skipping {result_key}: {message}")
+            errors.append(
+                FeatureProfileError(
+                    result_key=result_key,
+                    bq_table=edge_table.bq_table,
+                    stage="empty_projection",
+                    message=message,
+                )
+            )
+            continue
+        chunks = _chunk_projection(
+            projection.projection,
+            max_features=config.max_features_per_chunk,
+            forced_columns=set(),
+        )
+        for chunk_index, chunk_projection in enumerate(chunks):
+            tasks.append(
+                _ProfileTask(
+                    kind="edge",
+                    type_name=edge_table.edge_type,
+                    bq_table=edge_table.bq_table,
+                    projection=chunk_projection,
+                    embedding_columns=projection.embedding_columns,
+                    shard_key=edge_table.src_id_column,
+                    chunk_index=chunk_index,
+                    total_chunks=len(chunks),
+                )
+            )
+    return tasks, errors
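Review note on the chunking rule in `_chunk_projection`: the sketch below restates the same logic as a standalone function so the arithmetic is easy to check by hand. The name `chunk_projection`, the `Pair` alias, and the sample columns (`f0`..`f6`, `label`) are hypothetical and exist only for this illustration; it is not a replacement for the function in the diff.

```python
Pair = tuple[str, str]  # (column_name, sql_expression)


def chunk_projection(
    projection: list[Pair], max_features: int, forced: set[str]
) -> list[list[Pair]]:
    """Standalone restatement of the chunking rule, for illustration only."""
    forced_pairs = [p for p in projection if p[0] in forced]
    rest = [p for p in projection if p[0] not in forced]
    if not rest:
        # Nothing but forced columns: one chunk, or none for empty input.
        return [forced_pairs] if forced_pairs else []
    # Forced columns consume part of the cap; always leave room for one column.
    budget = max(1, max_features - len(forced_pairs))
    return [forced_pairs + rest[i : i + budget] for i in range(0, len(rest), budget)]


# Hypothetical table: 7 feature columns plus a label column used for slicing.
projection = [(f"f{i}", f"f{i}") for i in range(7)] + [("label", "label")]
chunks = chunk_projection(projection, max_features=4, forced={"label"})

# budget = 4 - 1 = 3 non-forced columns per chunk, so ceil(7 / 3) = 3 chunks.
chunk_names = [[name for name, _ in chunk] for chunk in chunks]
print(chunk_names)
```

With a cap of 4 and one forced column, every chunk starts with `label` and carries at most three feature columns, so no chunk exceeds the cap. The cap is only exceeded when the forced columns alone reach `max_features`, because the budget is clamped to 1.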
diff --git a/gigl/analytics/data_analyzer/graph_structure_analyzer.py b/gigl/analytics/data_analyzer/graph_structure_analyzer.py
new file mode 100644
index 000000000..48933f14f
--- /dev/null
+++ b/gigl/analytics/data_analyzer/graph_structure_analyzer.py
@@ -0,0 +1,1169 @@
+"""GraphStructureAnalyzer: 4-tier BigQuery-based graph data quality checks.
+
+Tier 1 (hard fails)
+    dangling edges, referential integrity, duplicate nodes. Any violation
+    raises DataQualityError with a partially populated GraphAnalysisResult.
+
+Tier 2 (core metrics)
+    node/edge counts, degree distribution, top-K hubs, INT16 clamp hazards,
+    isolated/cold-start nodes, duplicate edges, self-loops, NULL rates, and
+    two Python-side computations (feature memory budget, neighbor explosion).
+
+Tier 3 (label and heterogeneous)
+    class imbalance and label coverage (auto-enabled when node_tables have a
+    label_column); edge-type distribution and per-edge-type node coverage
+    (auto-enabled when more than one edge table is declared).
+
+Tier 4 (opt-in)
+    reciprocity, power-law exponent estimate. Gated by config flags.
+"""
+
+import math
+from concurrent.futures import ThreadPoolExecutor
+from typing import Optional
+
+from gigl.analytics.data_analyzer.config import (
+    EDGE_ROLE_MESSAGE_PASSING,
+    EDGE_ROLE_SUPERVISION_NEG,
+    EDGE_ROLE_SUPERVISION_POS,
+    DataAnalyzerConfig,
+    EdgeTableSpec,
+    NodeTableSpec,
+)
+from gigl.analytics.data_analyzer.queries import (
+    CLASS_IMBALANCE_QUERY,
+    COLD_START_NODE_COUNT_QUERY,
+    CROSS_SPLIT_OVERLAP_QUERY,
+    DANGLING_EDGES_QUERY,
+    DEGREE_BUCKET_QUERY,
+    DEGREE_DISTRIBUTION_QUERY,
+    DUPLICATE_EDGE_COUNT_QUERY,
+    DUPLICATE_NODE_COUNT_QUERY,
+    EDGE_COUNT_QUERY,
+    EDGE_REFERENTIAL_INTEGRITY_QUERY,
+    EDGE_TYPE_DISTRIBUTION_QUERY,
+    EDGE_TYPE_NODE_COVERAGE_QUERY,
+    ISOLATED_NODE_COUNT_QUERY,
+    LABEL_COVERAGE_QUERY,
+    NODE_COUNT_QUERY,
+    SELF_LOOP_COUNT_QUERY,
+    SPLIT_VALUE_COUNTS_QUERY,
+    SUPER_HUB_INT16_CLAMP_QUERY,
+    SUPERVISION_CROSS_TABLE_QUERY,
+    TOP_K_HUBS_QUERY,
+    build_adjusted_homophily_query,
+    build_label_sentinel_query,
+    build_null_rates_query,
+    build_per_class_degree_query,
+)
+from gigl.analytics.data_analyzer.types import (
+    CrossSplitOverlap,
+    DegreeStats,
+    GraphAnalysisResult,
+    HomophilyStats,
+    LabelSentinelStats,
+    NodeClassificationSupervisionStats,
+    PerClassDegreeStats,
+    SupervisionCrossTableStats,
+    write_artifact,
+)
+from gigl.common.logger import Logger
+from gigl.src.common.utils.bq import BqUtils
+
+logger = Logger()
+
+# Default assumption for feature memory budget: float64 per feature column.
+_BYTES_PER_FEATURE = 8
+_TOP_K_HUBS = 20
+_PARALLEL_BQ_WORKERS = 10
+
+
+class DataQualityError(Exception):
+    """Raised when Tier 1 hard-fail checks detect data quality violations.
+
+    Carries a partially populated GraphAnalysisResult so callers can inspect
+    which specific checks failed without re-running the analyzer.
+    """
+
+    def __init__(self, message: str, partial_result: GraphAnalysisResult) -> None:
+        super().__init__(message)
+        self.partial_result = partial_result
+
+
+class GraphStructureAnalyzer:
+    """Runs BigQuery SQL checks across 4 tiers against the tables declared in a config.
+
+    Example:
+        >>> config = load_analyzer_config("gs://bucket/config.yaml")
+        >>> analyzer = GraphStructureAnalyzer()
+        >>> result = analyzer.analyze(config)
+        >>> result.node_counts["user"]
+        1000000
+
+    Tier 1 is blocking: a violation raises DataQualityError before Tiers 2-4 run.
+    Tiers 2-4 are aggregated best-effort into a single GraphAnalysisResult.
+    """
+
+    def __init__(self, bq_project: Optional[str] = None) -> None:
+        self._bq_utils = BqUtils(project=bq_project)
+        self._query_log: dict[str, list[str]] = {}
+
+    def analyze(self, config: DataAnalyzerConfig) -> GraphAnalysisResult:
+        """Run all applicable tiers and return aggregated results.
+
+        Attempts a best-effort write of a versioned JSON sidecar to
+        ``{config.output_gcs_path}/graph_structure.json`` before returning
+        or re-raising a ``DataQualityError``, so partial Tier 1 failures are
+        recoverable by downstream consumers without rerunning the analyzer.
+        Other exceptions propagate without a sidecar write.
+
+        Args:
+            config: Data analyzer configuration declaring node and edge tables
+                plus any opt-in expensive checks (reciprocity, etc.).
+
+        Returns:
+            GraphAnalysisResult with tier 1-4 fields populated per config.
+
+        Raises:
+            DataQualityError: If tier 1 checks find any violations. The
+                exception carries a partial result with the specific counts;
+                that same partial result is persisted to the sidecar.
+        """
+        self._query_log = {}
+        result = GraphAnalysisResult()
+        try:
+            logger.info("Starting graph structure analysis (Tier 1: hard fails)")
+            self._run_tier1(config, result)
+
+            logger.info("Tier 1 passed. Running Tier 2 (core metrics)")
+            self._run_tier2(config, result)
+
+            logger.info("Running Tier 3 (label / heterogeneous)")
+            self._run_tier3(config, result)
+
+            logger.info("Running node-classification supervision tier")
+            self._run_node_classification_supervision(config, result)
+
+            logger.info("Running supervision cross-table analysis")
+            self._run_supervision_cross_table(config, result)
+
+            logger.info("Running Tier 4 (opt-in)")
+            self._run_tier4(config, result)
+        except DataQualityError as err:
+            err.partial_result.queries = dict(self._query_log)
+            self._maybe_write_sidecar(err.partial_result, config.output_gcs_path)
+            raise
+        result.queries = dict(self._query_log)
+        self._maybe_write_sidecar(result, config.output_gcs_path)
+        return result
+
+    def _maybe_write_sidecar(
+        self, result: GraphAnalysisResult, output_gcs_path: str
+    ) -> None:
+        """Best-effort write of the Pydantic JSON sidecar.
+
+        Never raises: the sidecar is a convenience artifact, not a
+        correctness contract. Failures are logged and swallowed so Tier 1
+        errors (which also trigger a sidecar write) propagate intact.
+        """
+        try:
+            write_artifact(
+                result=result,
+                component="graph_structure",
+                output_gcs_path=output_gcs_path,
+            )
+        except Exception as exc:
+            logger.exception(f"Failed to write graph_structure.json sidecar: {exc}")
+
+    # ------------------------------------------------------------------ #
+    # Tier 1: hard fails                                                  #
+    # ------------------------------------------------------------------ #
+
+    def _run_tier1(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
+        """Run all tier 1 checks; raise DataQualityError on any violation."""
+        violations: list[str] = []
+        node_tables_by_type = {nt.node_type: nt for nt in config.node_tables}
+
+        # Duplicate nodes (per node table).
+        for node_table in config.node_tables:
+            query = DUPLICATE_NODE_COUNT_QUERY.format(
+                table=node_table.bq_table, id_column=node_table.id_column
+            )
+            count = self._query_scalar(
+                query,
+                "duplicate_count",
+                block_id=f"data_quality:duplicate_nodes:{node_table.node_type}",
+            )
+            result.duplicate_node_counts[node_table.node_type] = count
+            if count > 0:
+                violations.append(
+                    f"node_type={node_table.node_type} has {count} duplicate IDs"
+                )
+
+        # Dangling edges and referential integrity (per edge table).
+        for edge_table in config.edge_tables:
+            dangling_query = DANGLING_EDGES_QUERY.format(
+                table=edge_table.bq_table,
+                src_id_column=edge_table.src_id_column,
+                dst_id_column=edge_table.dst_id_column,
+            )
+            dangling = self._query_scalar(
+                dangling_query,
+                "dangling_count",
+                block_id=f"data_quality:dangling_edges:{edge_table.edge_type}",
+            )
+            result.dangling_edge_counts[edge_table.edge_type] = dangling
+            if dangling > 0:
+                violations.append(
+                    f"edge_type={edge_table.edge_type} has {dangling} dangling edges"
+                )
+
+            # Referential integrity: src and dst can resolve to different node
+            # tables on heterogeneous graphs. `load_analyzer_config` guarantees
+            # src_node_type / dst_node_type are populated and known.
+            if not config.node_tables:
+                continue
+            assert edge_table.src_node_type is not None, (
+                f"edge_type={edge_table.edge_type} has no src_node_type; "
+                "load the config via load_analyzer_config to backfill it."
+            )
+            assert edge_table.dst_node_type is not None, (
+                f"edge_type={edge_table.edge_type} has no dst_node_type; "
+                "load the config via load_analyzer_config to backfill it."
+            )
+            src_node_table = node_tables_by_type[edge_table.src_node_type]
+            dst_node_table = node_tables_by_type[edge_table.dst_node_type]
+            ref_query = EDGE_REFERENTIAL_INTEGRITY_QUERY.format(
+                edge_table=edge_table.bq_table,
+                src_node_table=src_node_table.bq_table,
+                dst_node_table=dst_node_table.bq_table,
+                src_id_column=edge_table.src_id_column,
+                dst_id_column=edge_table.dst_id_column,
+                src_node_id_column=src_node_table.id_column,
+                dst_node_id_column=dst_node_table.id_column,
+            )
+            self._record_query(
+                f"data_quality:referential_integrity:{edge_table.edge_type}",
+                ref_query,
+            )
+            rows = list(self._bq_utils.run_query(query=ref_query, labels={}))
+            if len(rows) != 1:
+                raise RuntimeError(
+                    f"Referential integrity query expected exactly 1 row; "
+                    f"got {len(rows)}. Query: {ref_query.strip()[:200]}"
+                )
+            missing_src = int(rows[0]["missing_src_count"] or 0)
+            missing_dst = int(rows[0]["missing_dst_count"] or 0)
+            total_missing = missing_src + missing_dst
+            result.referential_integrity_violations[
+                edge_table.edge_type
+            ] = total_missing
+            if total_missing > 0:
+                violations.append(
+                    f"edge_type={edge_table.edge_type} has {total_missing} "
+                    "referential integrity violations"
+                )
+
+        if violations:
+            msg = "Tier 1 data quality violations detected:\n  - " + "\n  - ".join(
+                violations
+            )
+            logger.error(msg)
+            raise DataQualityError(msg, partial_result=result)
+
+    # ------------------------------------------------------------------ #
+    # Tier 2: core metrics                                                #
+    # ------------------------------------------------------------------ #
+
+    def _run_tier2(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
+        """Collect core structural metrics, fanning out BQ jobs in parallel.
+
+        Edge-level metrics are computed from the src-side perspective:
+        isolated/cold-start joins pair each edge with its src_node_type's
+        table. Hetero dst-perspective coverage is exposed separately via
+        Tier 3 edge_type_node_coverage.
+
+        BQ jobs are I/O-bound, so a ThreadPoolExecutor is used. Each worker
+        writes only to keys derived from its own node_type / edge_type in
+        the dict fields of the shared `result`, so no lock is required
+        under CPython's GIL.
+        """
+        node_tables_by_type = {nt.node_type: nt for nt in config.node_tables}
+
+        with ThreadPoolExecutor(max_workers=_PARALLEL_BQ_WORKERS) as executor:
+            futures = []
+            for node_table in config.node_tables:
+                futures.append(
+                    executor.submit(self._tier2_node_metrics, node_table, result)
+                )
+            for edge_table in config.edge_tables:
+                src_node_table = node_tables_by_type.get(edge_table.src_node_type or "")
+                futures.append(
+                    executor.submit(
+                        self._tier2_edge_metrics, edge_table, src_node_table, result
+                    )
+                )
+            for future in futures:
+                future.result()  # re-raise any exception
+
+        # Python-side computations run after all BQ data is collected.
+        self._compute_feature_memory_budget(config, result)
+        self._compute_neighbor_explosion_estimate(config, result)
+
+    def _tier2_node_metrics(
+        self, node_table: NodeTableSpec, result: GraphAnalysisResult
+    ) -> None:
+        """Record the node count and per-column null rates for one node table."""
+        node_count_query = NODE_COUNT_QUERY.format(table=node_table.bq_table)
+        node_count = self._query_scalar(
+            node_count_query,
+            "node_count",
+            block_id=f"graph_structure:node_count:{node_table.node_type}",
+        )
+        result.node_counts[node_table.node_type] = node_count
+
+        columns_to_check: list[str] = [node_table.id_column]
+        columns_to_check.extend(node_table.feature_columns)
+        if node_table.label_column:
+            columns_to_check.append(node_table.label_column)
+
+        null_query = build_null_rates_query(
+            table=node_table.bq_table, columns=columns_to_check
+        )
+        self._record_query(
+            f"data_quality:null_rates:node:{node_table.node_type}", null_query
+        )
+        rows = list(self._bq_utils.run_query(query=null_query, labels={}))
+        if rows:
+            row = rows[0]
+            rates: dict[str, float] = {}
+            for col in columns_to_check:
+                key = f"{col}_null_rate"
+                rate = row[key]
+                rates[col] = float(rate) if rate is not None else 0.0
+            result.null_rates[node_table.node_type] = rates
+
+    def _tier2_edge_metrics(
+        self,
+        edge_table: EdgeTableSpec,
+        node_table: Optional[NodeTableSpec],
+        result: GraphAnalysisResult,
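+    ) -> None:
+        """Record counts, hubs, and degree statistics for one edge table.
+
+        ``node_table`` is the src-side node table; when it is ``None`` the
+        isolated and cold-start node counts are skipped.
+        """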
+        edge_type = edge_table.edge_type
+
+        # Scalar counts.
+        edge_count_query = EDGE_COUNT_QUERY.format(table=edge_table.bq_table)
+        result.edge_counts[edge_type] = self._query_scalar(
+            edge_count_query,
+            "edge_count",
+            block_id=f"graph_structure:edge_count:{edge_type}",
+        )
+        duplicate_edges_query = DUPLICATE_EDGE_COUNT_QUERY.format(
+            table=edge_table.bq_table,
+            src_id_column=edge_table.src_id_column,
+            dst_id_column=edge_table.dst_id_column,
+        )
+        result.duplicate_edge_counts[edge_type] = self._query_scalar(
+            duplicate_edges_query,
+            "duplicate_count",
+            block_id=f"data_quality:duplicate_edges:{edge_type}",
+        )
+        self_loop_query = SELF_LOOP_COUNT_QUERY.format(
+            table=edge_table.bq_table,
+            src_id_column=edge_table.src_id_column,
+            dst_id_column=edge_table.dst_id_column,
+        )
+        result.self_loop_counts[edge_type] = self._query_scalar(
+            self_loop_query,
+            "self_loop_count",
+            block_id=f"graph_structure:self_loops:{edge_type}",
+        )
+
+        # Super-hub INT16 clamp check (indexed by src).
+        super_hub_query = SUPER_HUB_INT16_CLAMP_QUERY.format(
+            table=edge_table.bq_table, id_column=edge_table.src_id_column
+        )
+        result.super_hub_int16_clamp_count[edge_type] = self._query_scalar(
+            super_hub_query,
+            "super_hub_count",
+            block_id=f"graph_structure:super_hub_clamp:{edge_type}",
+        )
+
+        # Isolated and cold-start require a node table join.
+        if node_table is not None:
+            isolated_query = ISOLATED_NODE_COUNT_QUERY.format(
+                node_table=node_table.bq_table,
+                edge_table=edge_table.bq_table,
+                node_id_column=node_table.id_column,
+                src_id_column=edge_table.src_id_column,
+                dst_id_column=edge_table.dst_id_column,
+            )
+            result.isolated_node_counts[edge_type] = self._query_scalar(
+                isolated_query,
+                "isolated_count",
+                block_id=f"graph_structure:isolated_nodes:{edge_type}",
+            )
+            cold_start_query = COLD_START_NODE_COUNT_QUERY.format(
+                node_table=node_table.bq_table,
+                edge_table=edge_table.bq_table,
+                node_id_column=node_table.id_column,
+                src_id_column=edge_table.src_id_column,
+                dst_id_column=edge_table.dst_id_column,
+            )
+            result.cold_start_node_counts[edge_type] = self._query_scalar(
+                cold_start_query,
+                "cold_start_count",
+                block_id=f"graph_structure:cold_start_nodes:{edge_type}",
+            )
+
+        # Top-K hubs (by src).
+        top_hubs_query = TOP_K_HUBS_QUERY.format(
+            table=edge_table.bq_table,
+            id_column=edge_table.src_id_column,
+            k=_TOP_K_HUBS,
+        )
+        self._record_query(f"graph_structure:top_hubs:{edge_type}", top_hubs_query)
+        top_hub_rows = list(self._bq_utils.run_query(query=top_hubs_query, labels={}))
+        result.top_hubs[edge_type] = [
+            (str(row["node_id"]), int(row["degree"])) for row in top_hub_rows
+        ]
+
+        # Degree statistics: distribution + buckets, in + out directions.
+        for direction, id_column in (
+            ("out", edge_table.src_id_column),
+            ("in", edge_table.dst_id_column),
+        ):
+            degree_key = f"{edge_type}_{direction}"
+            result.degree_stats[degree_key] = self._build_degree_stats(
+                table=edge_table.bq_table,
+                id_column=id_column,
+                block_id=f"graph_structure:degree:{degree_key}",
+            )
+
+    def _build_degree_stats(
+        self, table: str, id_column: str, *, block_id: Optional[str] = None
+    ) -> DegreeStats:
+        """Run degree distribution + bucket queries and pack into DegreeStats.
+
+        When ``block_id`` is provided both rendered SQL strings are recorded
+        under that key (in distribution-then-bucket order) so the report can
+        show the full pair behind the histogram + summary line.
+        """
+        dist_query = DEGREE_DISTRIBUTION_QUERY.format(table=table, id_column=id_column)
+        bucket_query = DEGREE_BUCKET_QUERY.format(table=table, id_column=id_column)
+        if block_id is not None:
+            self._record_query(block_id, dist_query)
+            self._record_query(block_id, bucket_query)
+        dist_rows = list(self._bq_utils.run_query(query=dist_query, labels={}))
+        bucket_rows = list(self._bq_utils.run_query(query=bucket_query, labels={}))
+        dist_row = dist_rows[0]
+        bucket_row = bucket_rows[0]
+
+        # APPROX_QUANTILES can be NULL (e.g. for an empty edge table).
+        percentiles_raw = list(dist_row["percentiles"] or [])
+        percentiles = [int(p) if p is not None else 0 for p in percentiles_raw]
+        # APPROX_QUANTILES(degree, 100) returns 101 values: index 0..100.
+        fallback = percentiles[-1] if percentiles else 0
+        median = percentiles[50] if len(percentiles) > 50 else 0
+        p90 = percentiles[90] if len(percentiles) > 90 else fallback
+        p99 = percentiles[99] if len(percentiles) > 99 else fallback
+        # Quantiles are only percent-granular, so p999 falls back to p99.
+        p999 = p99
+
+        # Bucket keys must match BUCKET_ORDER in report/charts.ai.js for the
+        # histogram to render correctly; keep uppercase K.
+        buckets: dict[str, int] = {
+            "0-1": int(bucket_row["bucket_0_1"]),
+            "2-10": int(bucket_row["bucket_2_10"]),
+            "11-100": int(bucket_row["bucket_11_100"]),
+            "101-1K": int(bucket_row["bucket_101_1k"]),
+            "1K-10K": int(bucket_row["bucket_1k_10k"]),
+            "10K+": int(bucket_row["bucket_10k_plus"]),
+        }
+
+        return DegreeStats(
+            min=int(dist_row["min_degree"] or 0),
+            max=int(dist_row["max_degree"] or 0),
+            mean=float(dist_row["avg_degree"] or 0.0),
+            median=median,
+            p90=p90,
+            p99=p99,
+            p999=p999,
+            percentiles=percentiles,
+            buckets=buckets,
+        )
+
+    # ------------------------------------------------------------------ #
+    # Tier 3: label and heterogeneous                                     #
+    # ------------------------------------------------------------------ #
+
+    def _run_tier3(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
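+        """Collect label checks and heterogeneous edge-type metrics.
+
+        Class imbalance and label coverage run for every node table with a
+        ``label_column``; edge-type distribution and node coverage run only
+        when the config has more than one edge table.
+        """
+        # Label-related checks per node table with a label column.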
+        for node_table in config.node_tables:
+            if not node_table.label_column:
+                continue
+            class_imbalance_query = CLASS_IMBALANCE_QUERY.format(
+                table=node_table.bq_table,
+                label_column=node_table.label_column,
+            )
+            self._record_query(
+                f"advanced:class_imbalance:{node_table.node_type}",
+                class_imbalance_query,
+            )
+            class_rows = list(
+                self._bq_utils.run_query(query=class_imbalance_query, labels={})
+            )
+            result.class_imbalance[node_table.node_type] = {
+                str(row["label"]): int(row["count"]) for row in class_rows
+            }
+
+            label_coverage_query = LABEL_COVERAGE_QUERY.format(
+                table=node_table.bq_table,
+                label_column=node_table.label_column,
+            )
+            self._record_query(
+                f"advanced:label_coverage:{node_table.node_type}",
+                label_coverage_query,
+            )
+            coverage_rows = list(
+                self._bq_utils.run_query(query=label_coverage_query, labels={})
+            )
+            if coverage_rows:
+                coverage = coverage_rows[0]["coverage"]
+                result.label_coverage[node_table.node_type] = (
+                    float(coverage) if coverage is not None else 0.0
+                )
+
+        # Heterogeneous distribution only if more than one edge type.
+        if len(config.edge_tables) > 1:
+            for edge_table in config.edge_tables:
+                edge_type = edge_table.edge_type
+                # Edge-type distribution is effectively the edge count; reuse.
+                if edge_type in result.edge_counts:
+                    result.edge_type_distribution[edge_type] = result.edge_counts[
+                        edge_type
+                    ]
+                else:
+                    edge_type_dist_query = EDGE_TYPE_DISTRIBUTION_QUERY.format(
+                        table=edge_table.bq_table
+                    )
+                    result.edge_type_distribution[edge_type] = self._query_scalar(
+                        edge_type_dist_query,
+                        "edge_count",
+                        block_id=f"advanced:edge_type_distribution:{edge_type}",
+                    )
+                coverage_query = EDGE_TYPE_NODE_COVERAGE_QUERY.format(
+                    table=edge_table.bq_table,
+                    src_id_column=edge_table.src_id_column,
+                    dst_id_column=edge_table.dst_id_column,
+                )
+                self._record_query(
+                    f"advanced:edge_type_node_coverage:{edge_type}", coverage_query
+                )
+                coverage_rows = list(
+                    self._bq_utils.run_query(query=coverage_query, labels={})
+                )
+                if coverage_rows:
+                    row = coverage_rows[0]
+                    result.edge_type_node_coverage[edge_type] = {
+                        "distinct_src_count": int(row["distinct_src_count"] or 0),
+                        "distinct_dst_count": int(row["distinct_dst_count"] or 0),
+                    }
+
+    # ------------------------------------------------------------------ #
+    # Node-classification supervision tier                                #
+    # ------------------------------------------------------------------ #
+
+    def _run_node_classification_supervision(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
+        """Run NC-supervision-tier checks for every labeled node table.
+
+        Activates whenever a ``NodeTableSpec.label_column`` is set.
+        Computes the BQ-side metrics that aren't covered by the TFDV
+        slicing in the feature profiler:
+
+        1. Sentinel-vs-NULL accounting on the label column.
+        2. Per-class degree distribution (joining labels to a
+           message-passing edge table).
+        3. Adjusted homophily on a sampled message-passing edge set
+           (raw + class-prior-adjusted, per Platonov et al. 2023).
+        4. Optional label informativeness when
+           ``config.compute_label_informativeness`` is True.
+        5. Cross-split node-id leakage (hard fail) when
+           ``NodeTableSpec.split_column`` is set.
+
+        Hard fails (cross-split id overlap) raise
+        :class:`DataQualityError` with a partially populated result, just
+        like Tier 1.
+        """
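+        # Treat unset role as message_passing, matching
+        # _run_supervision_cross_table and the config backfill default.
+        message_passing_tables = [
+            edge
+            for edge in config.edge_tables
+            if edge.role is None or edge.role == EDGE_ROLE_MESSAGE_PASSING
+        ]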
+        violations: list[str] = []
+
+        for node_table in config.node_tables:
+            if node_table.label_column is None:
+                continue
+
+            sentinel_stats = self._compute_label_sentinel_stats(node_table)
+            per_class_degree, sentinel_degree_stats = self._compute_per_class_degree(
+                node_table, message_passing_tables
+            )
+            homophily = self._compute_homophily_for_node_type(
+                node_table, message_passing_tables, config
+            )
+            cross_split_overlap = self._compute_cross_split_overlap(node_table)
+
+            stats = NodeClassificationSupervisionStats(
+                node_type=node_table.node_type,
+                label_column=node_table.label_column,
+                sentinel_stats=sentinel_stats,
+                per_class_degree=per_class_degree,
+                sentinel_degree_stats=sentinel_degree_stats,
+                homophily=homophily,
+                cross_split_overlap=cross_split_overlap,
+            )
+            result.node_classification_supervision_stats.append(stats)
+
+            if (
+                cross_split_overlap is not None
+                and cross_split_overlap.overlap_node_count > 0
+            ):
+                violations.append(
+                    f"node_type={node_table.node_type}: "
+                    f"{cross_split_overlap.overlap_node_count} node_ids appear "
+                    f"in more than one split (column "
+                    f"{node_table.split_column!r})"
+                )
+
+        if violations:
+            msg = (
+                "Node-classification supervision violations detected:\n  - "
+                + "\n  - ".join(violations)
+            )
+            logger.error(msg)
+            raise DataQualityError(msg, partial_result=result)
+
+    def _compute_label_sentinel_stats(
+        self, node_table: NodeTableSpec
+    ) -> LabelSentinelStats:
+        """Single-pass query splitting label cells into NULL / sentinel / valid."""
+        assert (
+            node_table.label_column is not None
+        ), "_compute_label_sentinel_stats requires NodeTableSpec.label_column"
+        query = build_label_sentinel_query(
+            table=node_table.bq_table,
+            label_column=node_table.label_column,
+            sentinel_values=node_table.label_sentinel_values,
+        )
+        self._record_query(
+            f"nc_supervision:label_sentinel:{node_table.node_type}", query
+        )
+        rows = list(self._bq_utils.run_query(query=query, labels={}))
+        if len(rows) != 1:
+            raise RuntimeError(
+                f"Label sentinel query expected exactly 1 row; got {len(rows)}. "
+                f"node_type={node_table.node_type}"
+            )
+        row = rows[0]
+        total_rows = int(row["total_rows"] or 0)
+        null_count = int(row["null_count"] or 0)
+        valid_count = int(row["valid_count"] or 0)
+        sentinel_counts: dict[str, int] = {}
+        for index, sentinel in enumerate(node_table.label_sentinel_values):
+            sentinel_counts[sentinel] = int(row[f"sentinel_{index}"] or 0)
+        coverage = (valid_count / total_rows) if total_rows > 0 else 0.0
+        return LabelSentinelStats(
+            total_rows=total_rows,
+            null_count=null_count,
+            sentinel_counts=sentinel_counts,
+            valid_label_count=valid_count,
+            valid_label_coverage=coverage,
+        )
+
+    def _compute_per_class_degree(
+        self,
+        node_table: NodeTableSpec,
+        message_passing_tables: list[EdgeTableSpec],
+    ) -> tuple[list[PerClassDegreeStats], list[PerClassDegreeStats]]:
+        """Per-label-value degree distribution against a message-passing edge table.
+
+        Only edge tables whose src or dst node_type matches the labeled
+        node_type are candidates. The edge type is not recorded on the
+        result because per-class degree is defined over total degree
+        (in + out). When multiple message-passing edge tables match, only
+        the first is used, which keeps the output flat; multi-edge-type
+        per-class degree is left for a future iteration.
+
+        Returns a 2-tuple ``(per_class, sentinel)``: rows whose
+        ``class_value`` matches a declared sentinel in
+        ``node_table.label_sentinel_values`` are routed to ``sentinel``;
+        all other non-NULL label rows go to ``per_class``.
+        """
+        matching = [
+            edge_table
+            for edge_table in message_passing_tables
+            if node_table.node_type
+            in (edge_table.src_node_type, edge_table.dst_node_type)
+        ]
+        if not matching:
+            return [], []
+        edge_table = matching[0]
+        if len(matching) > 1:
+            logger.info(
+                f"Per-class degree for node_type={node_table.node_type!r}: "
+                f"using first matching message-passing edge table "
+                f"{edge_table.edge_type!r} of {[m.edge_type for m in matching]}."
+            )
+
+        assert (
+            node_table.label_column is not None
+        ), "_compute_per_class_degree requires NodeTableSpec.label_column"
+        query = build_per_class_degree_query(
+            node_table=node_table.bq_table,
+            node_id_column=node_table.id_column,
+            label_column=node_table.label_column,
+            edge_table=edge_table.bq_table,
+            edge_src_column=edge_table.src_id_column,
+            edge_dst_column=edge_table.dst_id_column,
+        )
+        self._record_query(
+            f"nc_supervision:per_class_degree:{node_table.node_type}", query
+        )
+        rows = list(self._bq_utils.run_query(query=query, labels={}))
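+        # class_value is stringified below, so compare against string sentinels.
+        sentinel_value_set = {
+            str(value) for value in node_table.label_sentinel_values
+        }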
+        per_class: list[PerClassDegreeStats] = []
+        sentinel: list[PerClassDegreeStats] = []
+        for row in rows:
+            percentiles_raw = list(row["percentiles"]) if row["percentiles"] else []
+            percentiles = [int(p) if p is not None else 0 for p in percentiles_raw]
+            median = percentiles[50] if len(percentiles) > 50 else 0
+            p90 = (
+                percentiles[90]
+                if len(percentiles) > 90
+                else (percentiles[-1] if percentiles else 0)
+            )
+            p99 = (
+                percentiles[99]
+                if len(percentiles) > 99
+                else (percentiles[-1] if percentiles else 0)
+            )
+            # Bucket keys must match BUCKET_ORDER in report/charts.ai.js so the
+            # sparkline histogram lines up with the overall degree chart.
+            buckets: dict[str, int] = {
+                "0-1": int(row["bucket_0_1"] or 0),
+                "2-10": int(row["bucket_2_10"] or 0),
+                "11-100": int(row["bucket_11_100"] or 0),
+                "101-1K": int(row["bucket_101_1k"] or 0),
+                "1K-10K": int(row["bucket_1k_10k"] or 0),
+                "10K+": int(row["bucket_10k_plus"] or 0),
+            }
+            class_value = str(row["class_value"])
+            stats = PerClassDegreeStats(
+                class_value=class_value,
+                count=int(row["class_count"] or 0),
+                cold_start_count=int(row["cold_start_count"] or 0),
+                mean_degree=float(row["mean_degree"] or 0.0),
+                median_degree=median,
+                p90_degree=p90,
+                p99_degree=p99,
+                max_degree=int(row["max_degree"] or 0),
+                buckets=buckets,
+            )
+            if class_value in sentinel_value_set:
+                sentinel.append(stats)
+            else:
+                per_class.append(stats)
+        return per_class, sentinel
+
+    def _compute_homophily_for_node_type(
+        self,
+        node_table: NodeTableSpec,
+        message_passing_tables: list[EdgeTableSpec],
+        config: DataAnalyzerConfig,
+    ) -> list[HomophilyStats]:
+        """Sampled adjusted homophily per (labeled node type, edge type).
+
+        Edges are sampled down toward
+        ``config.label_homophily_edge_sample_cap`` via deterministic
+        ``MOD(FARM_FINGERPRINT(...))`` filtering. The modulus is
+        ``edge_count // cap``, so roughly one in every ``modulus`` edges
+        is kept and the sampled set is approximately the cap. Small
+        graphs (count <= cap) and a non-positive cap skip sampling
+        entirely.
+        """
+        out: list[HomophilyStats] = []
+        for edge_table in message_passing_tables:
+            if node_table.node_type not in (
+                edge_table.src_node_type,
+                edge_table.dst_node_type,
+            ):
+                continue
+            # This edge count only gates the sampling decision below; it is
+            # separate from the Tier 2 edge_counts entry, so it is run without
+            # being recorded for the report.
+            edge_count = self._query_scalar(
+                EDGE_COUNT_QUERY.format(table=edge_table.bq_table), "edge_count"
+            )
+            cap = config.label_homophily_edge_sample_cap
+            if cap > 0 and edge_count > cap:
+                modulus = max(1, edge_count // cap)
+                sample_cap = cap
+            else:
+                modulus = 1
+                sample_cap = 0  # signal "no sampling"
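+            # Worked example (illustrative numbers, assuming the query keeps rows
+            # whose fingerprint modulo `modulus` is 0): edge_count = 10_000_000
+            # and cap = 1_000_000 give modulus = 10, so about 1 in 10 edges
+            # (~1_000_000) is kept. With edge_count = 2_500_000 the modulus is 2
+            # and about 1_250_000 edges are kept, slightly above the cap.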
+            assert (
+                node_table.label_column is not None
+            ), "_compute_homophily_for_node_type requires NodeTableSpec.label_column"
+            template = build_adjusted_homophily_query(
+                node_table=node_table.bq_table,
+                node_id_column=node_table.id_column,
+                label_column=node_table.label_column,
+                sentinel_values=node_table.label_sentinel_values,
+                edge_table=edge_table.bq_table,
+                edge_src_column=edge_table.src_id_column,
+                edge_dst_column=edge_table.dst_id_column,
+                sample_cap=sample_cap,
+            )
+            query = template.replace("{modulus_placeholder}", str(modulus))
+            self._record_query(
+                f"nc_supervision:homophily:{node_table.node_type}:"
+                f"{edge_table.edge_type}",
+                query,
+            )
+            rows = list(self._bq_utils.run_query(query=query, labels={}))
+            if len(rows) != 1:
+                raise RuntimeError(
+                    f"Adjusted-homophily query expected exactly 1 row; got "
+                    f"{len(rows)}. node_type={node_table.node_type}, "
+                    f"edge_type={edge_table.edge_type}"
+                )
+            row = rows[0]
+            edge_homophily_value = row["edge_homophily"]
+            expected_value = row["expected_homophily"]
+            edge_homophily = (
+                float(edge_homophily_value) if edge_homophily_value is not None else 0.0
+            )
+            expected = float(expected_value) if expected_value is not None else 0.0
+            if expected < 1.0:
+                adjusted = (edge_homophily - expected) / (1.0 - expected)
+            else:
+                adjusted = 0.0
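+            # Worked example (illustrative numbers): edge_homophily = 0.80 and
+            # expected = 0.50 give adjusted = (0.80 - 0.50) / (1.0 - 0.50) = 0.60.
+            # A value near 0 means same-label edges occur about as often as the
+            # expected (chance-level) rate.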
+            out.append(
+                HomophilyStats(
+                    edge_type=edge_table.edge_type,
+                    edge_homophily=edge_homophily,
+                    adjusted_homophily=adjusted,
+                    edge_sample_count=int(row["edge_sample_count"] or 0),
+                    # Label informativeness is not computed in this method.
+                    label_informativeness=None,
+                )
+            )
+        return out
+
+    def _compute_cross_split_overlap(
+        self, node_table: NodeTableSpec
+    ) -> Optional[CrossSplitOverlap]:
+        """Cross-split id leakage + per-split row counts.
+
+        Returns ``None`` if no ``split_column`` is set.
+        """
+        if node_table.split_column is None:
+            return None
+        block_id = f"nc_supervision:cross_split:{node_table.node_type}"
+        cross_split_query = CROSS_SPLIT_OVERLAP_QUERY.format(
+            table=node_table.bq_table,
+            id_column=node_table.id_column,
+            split_column=node_table.split_column,
+        )
+        overlap_count = self._query_scalar(
+            cross_split_query, "overlap_node_count", block_id=block_id
+        )
+        split_value_query = SPLIT_VALUE_COUNTS_QUERY.format(
+            table=node_table.bq_table,
+            split_column=node_table.split_column,
+        )
+        self._record_query(block_id, split_value_query)
+        split_rows = list(self._bq_utils.run_query(query=split_value_query, labels={}))
+        split_value_counts: dict[str, int] = {
+            str(row["split_value"]): int(row["row_count"] or 0) for row in split_rows
+        }
+        return CrossSplitOverlap(
+            overlap_node_count=overlap_count,
+            split_value_counts=split_value_counts,
+        )
+
+    # ------------------------------------------------------------------ #
+    # Supervision cross-table analysis                                    #
+    # ------------------------------------------------------------------ #
+
+    def _run_supervision_cross_table(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
+        """Run cross-table per-anchor stats for supervision edge tables.
+
+        For every ``supervision_pos`` table we pair it with each
+        ``supervision_neg`` and ``message_passing`` table that shares its
+        ``(src_node_type, dst_node_type)``, then compute per-anchor edge
+        counts and label-leakage overlap. Each ``supervision_neg`` table
+        also drives a pass against matching ``message_passing`` tables so
+        the report can flag (negative-edge ∩ message-passing) leaks. Jobs
+        run in parallel via ``ThreadPoolExecutor`` (BQ is I/O-bound).
+        """
+        pos_tables = [
+            e for e in config.edge_tables if e.role == EDGE_ROLE_SUPERVISION_POS
+        ]
+        neg_tables = [
+            e for e in config.edge_tables if e.role == EDGE_ROLE_SUPERVISION_NEG
+        ]
+        # Treat unset role as message_passing (default), matching backfill behavior.
+        mp_tables = [
+            e
+            for e in config.edge_tables
+            if e.role is None or e.role == EDGE_ROLE_MESSAGE_PASSING
+        ]
+
+        jobs: list[tuple[EdgeTableSpec, EdgeTableSpec, str]] = []
+
+        # Driver = positive: pair with every neg / mp sharing (src_type, dst_type).
+        for pos in pos_tables:
+            assert pos.node_anchor is not None, (
+                f"edge_type={pos.edge_type}: supervision_pos must have node_anchor; "
+                "load the config via load_analyzer_config to enforce this."
+            )
+            for other in neg_tables + mp_tables:
+                if (pos.src_node_type, pos.dst_node_type) == (
+                    other.src_node_type,
+                    other.dst_node_type,
+                ):
+                    jobs.append((pos, other, pos.node_anchor))
+
+        # Driver = negative: pair with mp sharing (src_type, dst_type). Anchor
+        # is the negative's own node_anchor when set, else inherited from a
+        # matching positive table to keep configs concise.
+        for neg in neg_tables:
+            anchor = neg.node_anchor or self._inherit_anchor_from_pos(neg, pos_tables)
+            if anchor is None:
+                continue
+            for mp in mp_tables:
+                if (neg.src_node_type, neg.dst_node_type) == (
+                    mp.src_node_type,
+                    mp.dst_node_type,
+                ):
+                    jobs.append((neg, mp, anchor))
+
+        if not jobs:
+            return
+
+        with ThreadPoolExecutor(max_workers=_PARALLEL_BQ_WORKERS) as executor:
+            futures = [
+                executor.submit(self._supervision_pair_stats, driver, other, anchor)
+                for driver, other, anchor in jobs
+            ]
+            for future in futures:
+                stats = future.result()
+                if stats is not None:
+                    result.supervision_cross_table_stats.append(stats)
+
+    @staticmethod
+    def _inherit_anchor_from_pos(
+        neg: EdgeTableSpec, pos_tables: list[EdgeTableSpec]
+    ) -> Optional[str]:
+        """Return the node_anchor of the first positive table sharing neg's node types.
+
+        Lets users declare ``node_anchor`` once on the positive table and
+        skip duplicating it on the matching negative.
+        """
+        for pos in pos_tables:
+            if (pos.src_node_type, pos.dst_node_type) == (
+                neg.src_node_type,
+                neg.dst_node_type,
+            ):
+                return pos.node_anchor
+        return None
+
+    @staticmethod
+    def _resolve_anchor_columns(
+        edge_table: EdgeTableSpec, node_anchor: str
+    ) -> Optional[tuple[str, str]]:
+        """Return (anchor_column, other_column) for the given anchor node_type.
+
+        If ``node_anchor`` matches both src and dst (a homogeneous edge
+        type), prefer the src side. Returns ``None`` if it matches neither.
+        """
+        if node_anchor == edge_table.src_node_type:
+            return edge_table.src_id_column, edge_table.dst_id_column
+        if node_anchor == edge_table.dst_node_type:
+            return edge_table.dst_id_column, edge_table.src_id_column
+        return None
+
+    def _supervision_pair_stats(
+        self,
+        driver: EdgeTableSpec,
+        other: EdgeTableSpec,
+        node_anchor: str,
+    ) -> Optional[SupervisionCrossTableStats]:
+        """Run the cross-table query for one (driver, other) pair.
+
+        Returns ``None`` (and logs a warning) when the anchor cannot be
+        resolved on one of the two tables — happens only on misconfigured
+        heterogeneous pairs and should not abort the whole run.
+        """
+        driver_cols = self._resolve_anchor_columns(driver, node_anchor)
+        other_cols = self._resolve_anchor_columns(other, node_anchor)
+        if driver_cols is None or other_cols is None:
+            logger.warning(
+                f"Skipping supervision pair driver={driver.edge_type!r} "
+                f"other={other.edge_type!r}: node_anchor={node_anchor!r} not "
+                "present on both tables."
+            )
+            return None
+
+        driver_anchor_column, driver_other_column = driver_cols
+        other_anchor_column, other_other_column = other_cols
+
+        query = SUPERVISION_CROSS_TABLE_QUERY.format(
+            driver_table=driver.bq_table,
+            other_table=other.bq_table,
+            driver_anchor_column=driver_anchor_column,
+            driver_other_column=driver_other_column,
+            other_anchor_column=other_anchor_column,
+            other_other_column=other_other_column,
+        )
+        self._record_query(
+            f"supervision_overlap:{driver.edge_type}:{other.edge_type}:"
+            f"{driver_anchor_column}:{other_anchor_column}",
+            query,
+        )
+        rows = list(self._bq_utils.run_query(query=query, labels={}))
+        if len(rows) != 1:
+            raise RuntimeError(
+                f"Supervision cross-table query expected exactly 1 row; "
+                f"got {len(rows)}. driver={driver.edge_type} other={other.edge_type}"
+            )
+        row = rows[0]
+        avg_value = row["avg_other_per_driver_anchor"]
+        return SupervisionCrossTableStats(
+            driver_edge_type=driver.edge_type,
+            driver_role=driver.role or EDGE_ROLE_MESSAGE_PASSING,
+            other_edge_type=other.edge_type,
+            other_role=other.role or EDGE_ROLE_MESSAGE_PASSING,
+            node_anchor=node_anchor,
+            driver_anchor_count=int(row["driver_anchor_count"] or 0),
+            driver_pair_count=int(row["driver_pair_count"] or 0),
+            other_pair_count=int(row["other_pair_count"] or 0),
+            overlap_pair_count=int(row["overlap_pair_count"] or 0),
+            driver_anchors_with_zero_other=int(
+                row["driver_anchors_with_zero_other"] or 0
+            ),
+            avg_other_per_driver_anchor=float(avg_value)
+            if avg_value is not None
+            else 0.0,
+            p50_other_per_driver_anchor=int(row["p50_other_per_driver_anchor"] or 0),
+            p90_other_per_driver_anchor=int(row["p90_other_per_driver_anchor"] or 0),
+            p99_other_per_driver_anchor=int(row["p99_other_per_driver_anchor"] or 0),
+            max_other_per_driver_anchor=int(row["max_other_per_driver_anchor"] or 0),
+        )
+
+    # ------------------------------------------------------------------ #
+    # Tier 4: opt-in                                                      #
+    # ------------------------------------------------------------------ #
+
+    def _run_tier4(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
+        """Populate opt-in metrics gated by config flags.
+
+        Power-law exponent is always cheap (derived from existing degree stats)
+        and is computed whenever degree stats are available. Reciprocity,
+        homophily, connected components and clustering require dedicated
+        queries not yet defined; they remain empty unless the corresponding
+        flag is enabled AND a query is implemented.
+        """
+        # Power-law exponent: approximate from degree stats using a simple
+        # heuristic: alpha ~= 1 + log(max) / log(median) for median > 1.
+        for degree_key, stats in result.degree_stats.items():
+            if stats.median > 1 and stats.max > stats.median:
+                exponent = 1.0 + math.log(stats.max) / math.log(stats.median)
+                result.power_law_exponent[degree_key] = exponent
+
+        if config.compute_reciprocity:
+            # Query not yet defined; log and skip.
+            logger.warning(
+                "compute_reciprocity=True but reciprocity query is not implemented; "
+                "skipping Tier 4 reciprocity."
+            )
+
+    # ------------------------------------------------------------------ #
+    # Python-only computations                                            #
+    # ------------------------------------------------------------------ #
+
+    def _compute_feature_memory_budget(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
+        """Estimate per-node-type memory footprint of features (float64 assumed)."""
+        for node_table in config.node_tables:
+            node_count = result.node_counts.get(node_table.node_type, 0)
+            num_features = len(node_table.feature_columns)
+            result.feature_memory_bytes[node_table.node_type] = (
+                node_count * num_features * _BYTES_PER_FEATURE
+            )
+
+    def _compute_neighbor_explosion_estimate(
+        self, config: DataAnalyzerConfig, result: GraphAnalysisResult
+    ) -> None:
+        """Scale the fan-out product by mean out-degree (min 1.0) per edge type."""
+        if not config.fan_out:
+            return
+        fan_out_product = 1
+        for hop in config.fan_out:
+            fan_out_product *= int(hop)
+        for edge_table in config.edge_tables:
+            out_stats = result.degree_stats.get(f"{edge_table.edge_type}_out")
+            if out_stats is None:
+                continue
+            estimate = int(fan_out_product * max(out_stats.mean, 1.0))
+            result.neighbor_explosion_estimate[edge_table.edge_type] = estimate
+
+    # ------------------------------------------------------------------ #
+    # Helpers                                                             #
+    # ------------------------------------------------------------------ #
+
+    def _query_scalar(
+        self, query: str, column: str, *, block_id: Optional[str] = None
+    ) -> int:
+        """Run a single-row, single-column query and return the scalar as int.
+
+        Scalar queries (COUNT, COUNTIF) must return exactly one row with a
+        non-NULL value for the requested column. Any deviation indicates a
+        driver, auth, or schema mismatch rather than legitimate data — raise
+        loudly instead of silently coercing to 0, which would let a broken run
+        pass through as a green-light result.
+
+        When ``block_id`` is provided the rendered SQL is recorded under
+        that key in ``self._query_log`` so the report can surface it.
+        """
+        if block_id is not None:
+            self._record_query(block_id, query)
+        rows = list(self._bq_utils.run_query(query=query, labels={}))
+        if len(rows) != 1:
+            raise RuntimeError(
+                f"Scalar query expected exactly 1 row; got {len(rows)}. "
+                f"Query: {query.strip()[:200]}"
+            )
+        value = rows[0][column]
+        if value is None:
+            raise RuntimeError(
+                f"Scalar query returned NULL for column '{column}'. "
+                f"Query: {query.strip()[:200]}"
+            )
+        return int(value)
+
+    def _record_query(self, block_id: str, query: str) -> None:
+        """Append ``query`` under ``block_id`` in the per-block SQL log.
+
+        The report JS does dict lookups against ``GraphAnalysisResult.queries``
+        keyed by the same ``block_id`` strings. CPython's GIL makes
+        ``dict.setdefault`` and ``list.append`` atomic, so concurrent writes
+        from the analyzer's thread pools are safe without an explicit lock.
+        """
+        self._query_log.setdefault(block_id, []).append(query)
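Note on `_compute_neighbor_explosion_estimate` in the hunk above: the estimate is the product of the configured fan-outs, scaled by the mean out-degree with a floor of 1.0. A minimal standalone sketch of that arithmetic follows. The function name `estimate_neighbor_explosion` and the sample numbers are illustrative assumptions, not part of the patch.

```python
from math import prod


def estimate_neighbor_explosion(fan_out: list[int], mean_out_degree: float) -> int:
    """Hypothetical helper mirroring the arithmetic in the patch."""
    # Product of per-hop fan-outs, e.g. 15 * 10 * 5 = 750.
    fan_out_product = prod(int(hop) for hop in fan_out)
    # The mean out-degree is floored at 1.0 so sparse edge types do not
    # shrink the estimate below the raw fan-out product.
    return int(fan_out_product * max(mean_out_degree, 1.0))


print(estimate_neighbor_explosion([15, 10, 5], 2.5))
print(estimate_neighbor_explosion([15, 10, 5], 0.4))
```

The second call shows the floor in effect: a mean out-degree below 1.0 leaves the estimate equal to the fan-out product.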
diff --git a/gigl/analytics/data_analyzer/queries.py b/gigl/analytics/data_analyzer/queries.py
new file mode 100644
index 000000000..fedb57b3c
--- /dev/null
+++ b/gigl/analytics/data_analyzer/queries.py
@@ -0,0 +1,485 @@
+"""SQL query templates for graph structure analysis.
+
+Each constant is a format-string template parameterized with table names
+and column names. Pattern matches gigl/src/data_preprocessor/lib/enumerate/queries.py.
+"""
+
+import torch
+
+INT16_MAX = int(torch.iinfo(torch.int16).max)  # 32767
+
+# --- Tier 1: Hard fails ---
+
+DANGLING_EDGES_QUERY = """
+SELECT COUNT(*) AS dangling_count
+FROM `{table}`
+WHERE {src_id_column} IS NULL OR {dst_id_column} IS NULL
+"""
+
+EDGE_REFERENTIAL_INTEGRITY_QUERY = """
+SELECT
+    COUNTIF(src_node.{src_node_id_column} IS NULL) AS missing_src_count,
+    COUNTIF(dst_node.{dst_node_id_column} IS NULL) AS missing_dst_count
+FROM `{edge_table}` AS e
+LEFT JOIN `{src_node_table}` AS src_node
+    ON e.{src_id_column} = src_node.{src_node_id_column}
+LEFT JOIN `{dst_node_table}` AS dst_node
+    ON e.{dst_id_column} = dst_node.{dst_node_id_column}
+"""
+
+DUPLICATE_NODE_COUNT_QUERY = """
+SELECT COUNT(*) AS duplicate_count FROM (
+    SELECT {id_column}
+    FROM `{table}`
+    GROUP BY {id_column}
+    HAVING COUNT(*) > 1
+)
+"""
+
+# --- Tier 2: Core metrics ---
+
+NODE_COUNT_QUERY = """
+SELECT COUNT(*) AS node_count FROM `{table}`
+"""
+
+EDGE_COUNT_QUERY = """
+SELECT COUNT(*) AS edge_count FROM `{table}`
+"""
+
+DUPLICATE_EDGE_COUNT_QUERY = """
+SELECT COUNT(*) AS duplicate_count FROM (
+    SELECT {src_id_column}, {dst_id_column}
+    FROM `{table}`
+    GROUP BY {src_id_column}, {dst_id_column}
+    HAVING COUNT(*) > 1
+)
+"""
+
+SELF_LOOP_COUNT_QUERY = """
+SELECT COUNT(*) AS self_loop_count
+FROM `{table}`
+WHERE {src_id_column} = {dst_id_column}
+"""
+
+ISOLATED_NODE_COUNT_QUERY = """
+SELECT COUNT(*) AS isolated_count FROM (
+    SELECT n.{node_id_column}
+    FROM `{node_table}` AS n
+    LEFT JOIN `{edge_table}` AS e_src
+        ON n.{node_id_column} = e_src.{src_id_column}
+    LEFT JOIN `{edge_table}` AS e_dst
+        ON n.{node_id_column} = e_dst.{dst_id_column}
+    WHERE e_src.{src_id_column} IS NULL
+        AND e_dst.{dst_id_column} IS NULL
+)
+"""
+
+DEGREE_DISTRIBUTION_QUERY = """
+SELECT
+    MIN(degree) AS min_degree,
+    MAX(degree) AS max_degree,
+    AVG(degree) AS avg_degree,
+    APPROX_QUANTILES(degree, 100) AS percentiles
+FROM (
+    SELECT {id_column}, COUNT(*) AS degree
+    FROM `{table}`
+    GROUP BY {id_column}
+)
+"""
+
+DEGREE_BUCKET_QUERY = """
+SELECT
+    COUNTIF(degree BETWEEN 0 AND 1) AS bucket_0_1,
+    COUNTIF(degree BETWEEN 2 AND 10) AS bucket_2_10,
+    COUNTIF(degree BETWEEN 11 AND 100) AS bucket_11_100,
+    COUNTIF(degree BETWEEN 101 AND 1000) AS bucket_101_1k,
+    COUNTIF(degree BETWEEN 1001 AND 10000) AS bucket_1k_10k,
+    COUNTIF(degree > 10000) AS bucket_10k_plus
+FROM (
+    SELECT {id_column}, COUNT(*) AS degree
+    FROM `{table}`
+    GROUP BY {id_column}
+)
+"""
+
+TOP_K_HUBS_QUERY = """
+SELECT {id_column} AS node_id, COUNT(*) AS degree
+FROM `{table}`
+GROUP BY {id_column}
+ORDER BY degree DESC
+LIMIT {k}
+"""
+
+SUPER_HUB_INT16_CLAMP_QUERY = f"""
+SELECT COUNT(*) AS super_hub_count FROM (
+    SELECT {{id_column}}, COUNT(*) AS degree
+    FROM `{{table}}`
+    GROUP BY {{id_column}}
+    HAVING COUNT(*) > {INT16_MAX}
+)
+"""
+
+COLD_START_NODE_COUNT_QUERY = """
+SELECT COUNT(*) AS cold_start_count FROM (
+    SELECT n.{node_id_column}, COALESCE(e.degree, 0) AS degree
+    FROM `{node_table}` AS n
+    LEFT JOIN (
+        SELECT nid, COUNT(*) AS degree FROM (
+            SELECT {src_id_column} AS nid FROM `{edge_table}`
+            UNION ALL
+            SELECT {dst_id_column} AS nid FROM `{edge_table}`
+        )
+        GROUP BY nid
+    ) AS e ON n.{node_id_column} = e.nid
+    WHERE COALESCE(e.degree, 0) <= 1
+)
+"""
+
+# --- Tier 3: Label and heterogeneous ---
+
+CLASS_IMBALANCE_QUERY = """
+SELECT {label_column} AS label, COUNT(*) AS count
+FROM `{table}`
+WHERE {label_column} IS NOT NULL
+GROUP BY {label_column}
+ORDER BY count DESC
+"""
+
+LABEL_COVERAGE_QUERY = """
+SELECT
+    COUNT(*) AS total,
+    COUNTIF({label_column} IS NOT NULL) AS labeled,
+    SAFE_DIVIDE(COUNTIF({label_column} IS NOT NULL), COUNT(*)) AS coverage
+FROM `{table}`
+"""
+
+EDGE_TYPE_DISTRIBUTION_QUERY = """
+SELECT COUNT(*) AS edge_count FROM `{table}`
+"""
+
+EDGE_TYPE_NODE_COVERAGE_QUERY = """
+SELECT
+    APPROX_COUNT_DISTINCT({src_id_column}) AS distinct_src_count,
+    APPROX_COUNT_DISTINCT({dst_id_column}) AS distinct_dst_count
+FROM `{table}`
+"""
+
+
+# --- Supervision cross-table analysis ---
+
+SUPERVISION_CROSS_TABLE_QUERY = """
+WITH driver_pairs AS (
+    SELECT DISTINCT
+        {driver_anchor_column} AS anchor,
+        {driver_other_column}  AS neighbor
+    FROM `{driver_table}`
+    WHERE {driver_anchor_column} IS NOT NULL
+      AND {driver_other_column}  IS NOT NULL
+),
+other_pairs AS (
+    SELECT DISTINCT
+        {other_anchor_column} AS anchor,
+        {other_other_column}  AS neighbor
+    FROM `{other_table}`
+    WHERE {other_anchor_column} IS NOT NULL
+      AND {other_other_column}  IS NOT NULL
+),
+driver_anchors AS (
+    SELECT DISTINCT anchor FROM driver_pairs
+),
+other_per_driver_anchor AS (
+    SELECT driver_anchors.anchor,
+           COALESCE(other_counts.cnt, 0) AS cnt
+    FROM driver_anchors
+    LEFT JOIN (
+        SELECT anchor, COUNT(*) AS cnt FROM other_pairs GROUP BY anchor
+    ) AS other_counts USING (anchor)
+)
+SELECT
+    (SELECT COUNT(*) FROM driver_anchors) AS driver_anchor_count,
+    (SELECT COUNT(*) FROM driver_pairs)   AS driver_pair_count,
+    (SELECT COUNT(*) FROM other_pairs)    AS other_pair_count,
+    (
+        SELECT COUNT(*)
+        FROM driver_pairs
+        INNER JOIN other_pairs USING (anchor, neighbor)
+    ) AS overlap_pair_count,
+    (SELECT COUNTIF(cnt = 0) FROM other_per_driver_anchor)
+        AS driver_anchors_with_zero_other,
+    (SELECT AVG(cnt) FROM other_per_driver_anchor)
+        AS avg_other_per_driver_anchor,
+    (SELECT APPROX_QUANTILES(cnt, 100)[OFFSET(50)] FROM other_per_driver_anchor)
+        AS p50_other_per_driver_anchor,
+    (SELECT APPROX_QUANTILES(cnt, 100)[OFFSET(90)] FROM other_per_driver_anchor)
+        AS p90_other_per_driver_anchor,
+    (SELECT APPROX_QUANTILES(cnt, 100)[OFFSET(99)] FROM other_per_driver_anchor)
+        AS p99_other_per_driver_anchor,
+    (SELECT MAX(cnt) FROM other_per_driver_anchor)
+        AS max_other_per_driver_anchor
+"""
+
+
+# --- Node-classification supervision tier ---
+
+
+def build_label_sentinel_query(
+    table: str, label_column: str, sentinel_values: list[str]
+) -> str:
+    """Build a single-pass query that splits label cells into NULL / sentinel / valid.
+
+    Sentinel values are interpolated as quoted string literals; callers
+    must ensure values come from a trusted config (the analyzer config
+    is loaded by ``load_analyzer_config`` which already validates the
+    structure of the YAML it reads). The label column is cast to STRING
+    in the comparison so integer and string sentinels both work.
+
+    Args:
+        table: Fully qualified BQ table name.
+        label_column: Column whose cells we're bucketing.
+        sentinel_values: Strings that should be classified as sentinels
+            distinct from SQL NULL.
+
+    Returns:
+        SQL query string returning one row with columns ``total_rows``,
+        ``null_count``, ``valid_count``, and one ``sentinel_<idx>`` count
+        per sentinel value (in declaration order).
+    """
+    sentinel_clauses = ",\n    ".join(
+        f"COUNTIF(CAST({label_column} AS STRING) = "
+        f"{_sql_string_literal(sentinel)}) AS sentinel_{idx}"
+        for idx, sentinel in enumerate(sentinel_values)
+    )
+    sentinel_in_list = (
+        ", ".join(_sql_string_literal(s) for s in sentinel_values)
+        if sentinel_values
+        else None
+    )
+    valid_clause = (
+        f"COUNTIF({label_column} IS NOT NULL "
+        f"AND CAST({label_column} AS STRING) NOT IN ({sentinel_in_list})) AS valid_count"
+        if sentinel_in_list is not None
+        else f"COUNTIF({label_column} IS NOT NULL) AS valid_count"
+    )
+    extra = f",\n    {sentinel_clauses}" if sentinel_clauses else ""
+    return f"""
+SELECT
+    COUNT(*) AS total_rows,
+    COUNTIF({label_column} IS NULL) AS null_count,
+    {valid_clause}{extra}
+FROM `{table}`
+"""
+
+
+def _sql_string_literal(value: str) -> str:
+    """Quote a string for safe inline use in BQ SQL.
+
+    Escapes single quotes and backslashes; no other characters are
+    transformed. Sentinel values flow into ``IN`` lists so we control
+    the surrounding context. Anything more invasive (parameterized
+    queries) would require restructuring how every other query in this
+    module is built.
+    """
+    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
+    return f"'{escaped}'"
+
+
+def build_per_class_degree_query(
+    node_table: str,
+    node_id_column: str,
+    label_column: str,
+    edge_table: str,
+    edge_src_column: str,
+    edge_dst_column: str,
+) -> str:
+    """Per-label-value degree distribution joining labeled nodes to a message-passing edge table.
+
+    Computes for each distinct non-NULL label value: count of class
+    members, count with total degree <= 1 (cold-start), and degree
+    distribution (mean / median / p90 / p99 / max). NULL labels are
+    excluded — they are accounted for separately in
+    :class:`LabelSentinelStats`. Sentinel-declared values (e.g. ``-1``)
+    are *not* filtered out and surface as their own rows; the caller is
+    responsible for partitioning the result into "valid class" vs
+    "sentinel" using its own ``label_sentinel_values``.
+
+    Returns one row per distinct non-NULL label value.
+    """
+    return f"""
+WITH node_degrees AS (
+    SELECT nid, COUNT(*) AS degree FROM (
+        SELECT {edge_src_column} AS nid FROM `{edge_table}`
+        UNION ALL
+        SELECT {edge_dst_column} AS nid FROM `{edge_table}`
+    )
+    GROUP BY nid
+),
+labeled AS (
+    SELECT
+        CAST(n.{label_column} AS STRING) AS class_value,
+        COALESCE(d.degree, 0) AS degree
+    FROM `{node_table}` AS n
+    LEFT JOIN node_degrees AS d
+        ON n.{node_id_column} = d.nid
+    WHERE n.{label_column} IS NOT NULL
+)
+SELECT
+    class_value,
+    COUNT(*) AS class_count,
+    COUNTIF(degree <= 1) AS cold_start_count,
+    AVG(degree) AS mean_degree,
+    APPROX_QUANTILES(degree, 100) AS percentiles,
+    MAX(degree) AS max_degree,
+    COUNTIF(degree BETWEEN 0 AND 1) AS bucket_0_1,
+    COUNTIF(degree BETWEEN 2 AND 10) AS bucket_2_10,
+    COUNTIF(degree BETWEEN 11 AND 100) AS bucket_11_100,
+    COUNTIF(degree BETWEEN 101 AND 1000) AS bucket_101_1k,
+    COUNTIF(degree BETWEEN 1001 AND 10000) AS bucket_1k_10k,
+    COUNTIF(degree > 10000) AS bucket_10k_plus
+FROM labeled
+GROUP BY class_value
+ORDER BY class_count DESC
+"""
+
+
+def build_adjusted_homophily_query(
+    node_table: str,
+    node_id_column: str,
+    label_column: str,
+    sentinel_values: list[str],
+    edge_table: str,
+    edge_src_column: str,
+    edge_dst_column: str,
+    sample_cap: int,
+) -> str:
+    """Edge homophily and class-prior-adjusted homophily on a sampled edge set.
+
+    Adjusted homophily is computed per Platonov et al., NeurIPS 2023:
+
+        adjusted = (h_edge - sum_c (D_c / 2|E|)^2)
+                   / (1 - sum_c (D_c / 2|E|)^2)
+
+    where ``D_c`` is the sum of degrees of nodes in class ``c`` over the
+    sampled edge set. Values near 0 mean "no signal beyond class
+    priors"; positive is homophilic, negative heterophilic.
+
+    Edges are sampled by ``MOD(ABS(FARM_FINGERPRINT(...)), modulus) = 0``
+    so sampling is deterministic across reruns; the caller fills in the
+    modulus. ``sample_cap <= 0`` means full-graph (no sampling).
+
+    Returns one row with: ``edge_homophily``, ``expected_homophily``
+    (the class-prior baseline), and ``edge_sample_count``. The caller
+    derives ``adjusted_homophily`` in Python from the first two columns.
+    """
+    sentinel_filter_src = ""
+    sentinel_filter_dst = ""
+    if sentinel_values:
+        sentinel_in_list = ", ".join(_sql_string_literal(s) for s in sentinel_values)
+        sentinel_filter_src = (
+            f"AND CAST(s.{label_column} AS STRING) NOT IN ({sentinel_in_list})"
+        )
+        sentinel_filter_dst = (
+            f"AND CAST(d.{label_column} AS STRING) NOT IN ({sentinel_in_list})"
+        )
+
+    sample_filter = (
+        ""
+        if sample_cap <= 0
+        else (
+            f"WHERE MOD(ABS(FARM_FINGERPRINT(CONCAT("
+            f"CAST({edge_src_column} AS STRING), '|', "
+            f"CAST({edge_dst_column} AS STRING)))), {{modulus_placeholder}}) = 0"
+        )
+    )
+    # We pass {modulus_placeholder} verbatim and let the caller fill it
+    # in based on the cardinality of the edge table, so the same SQL
+    # template is used for any sample size.
+    return f"""
+WITH sampled_edges AS (
+    SELECT {edge_src_column} AS src_id, {edge_dst_column} AS dst_id
+    FROM `{edge_table}`
+    {sample_filter}
+),
+labeled_pairs AS (
+    SELECT
+        CAST(s.{label_column} AS STRING) AS src_label,
+        CAST(d.{label_column} AS STRING) AS dst_label
+    FROM sampled_edges AS e
+    JOIN `{node_table}` AS s
+        ON e.src_id = s.{node_id_column}
+    JOIN `{node_table}` AS d
+        ON e.dst_id = d.{node_id_column}
+    WHERE s.{label_column} IS NOT NULL
+      AND d.{label_column} IS NOT NULL
+      {sentinel_filter_src}
+      {sentinel_filter_dst}
+),
+endpoint_classes AS (
+    SELECT label, COUNT(*) AS endpoint_count FROM (
+        SELECT src_label AS label FROM labeled_pairs
+        UNION ALL
+        SELECT dst_label AS label FROM labeled_pairs
+    )
+    GROUP BY label
+),
+totals AS (
+    SELECT SUM(endpoint_count) AS total_endpoints FROM endpoint_classes
+)
+SELECT
+    SAFE_DIVIDE(COUNTIF(src_label = dst_label), COUNT(*)) AS edge_homophily,
+    (
+        SELECT SUM(POW(SAFE_DIVIDE(endpoint_count, total_endpoints), 2))
+        FROM endpoint_classes, totals
+    ) AS expected_homophily,
+    COUNT(*) AS edge_sample_count
+FROM labeled_pairs
+"""
+
+
+CROSS_SPLIT_OVERLAP_QUERY = """
+SELECT
+    (
+        SELECT COUNT(*) FROM (
+            SELECT {id_column}
+            FROM `{table}`
+            WHERE {id_column} IS NOT NULL
+              AND {split_column} IS NOT NULL
+            GROUP BY {id_column}
+            HAVING COUNT(DISTINCT {split_column}) > 1
+        )
+    ) AS overlap_node_count
+"""
+
+
+SPLIT_VALUE_COUNTS_QUERY = """
+SELECT
+    CAST({split_column} AS STRING) AS split_value,
+    COUNT(*) AS row_count
+FROM `{table}`
+WHERE {split_column} IS NOT NULL
+GROUP BY split_value
+ORDER BY row_count DESC
+"""
+
+
+def build_null_rates_query(table: str, columns: list[str]) -> str:
+    """Build a batched NULL rates query for multiple columns.
+
+    One query, one table scan, one COUNTIF per column.
+
+    Args:
+        table: Fully qualified BQ table name.
+        columns: List of column names to check.
+
+    Returns:
+        SQL query string.
+    """
+    countif_clauses = ",\n    ".join(
+        f"SAFE_DIVIDE(COUNTIF({col} IS NULL), COUNT(*)) AS {col}_null_rate"
+        for col in columns
+    )
+    return f"""
+SELECT
+    COUNT(*) AS total_rows,
+    {countif_clauses}
+FROM `{table}`
+"""
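Note on `build_adjusted_homophily_query` in the hunk above: the docstring says the adjusted value is computed in Python from `edge_homophily` and `expected_homophily`. A minimal sketch of that caller-side step, following the formula quoted in the docstring, might look like the block below. The function name `adjusted_homophily` and its handling of NULL or degenerate inputs are assumptions, not code from this patch.

```python
from typing import Optional


def adjusted_homophily(
    edge_homophily: Optional[float], expected_homophily: Optional[float]
) -> Optional[float]:
    """Hypothetical helper: (h_edge - expected) / (1 - expected)."""
    # SAFE_DIVIDE yields NULL when no labeled edge pairs were sampled.
    if edge_homophily is None or expected_homophily is None:
        return None
    denominator = 1.0 - expected_homophily
    if denominator <= 0.0:
        # All labeled endpoints fall in one class; the adjustment is undefined.
        return None
    return (edge_homophily - expected_homophily) / denominator


print(adjusted_homophily(0.75, 0.5))
```

Returning `None` instead of raising keeps a single-class or empty sample from aborting the run; whether the analyzer prefers that over a hard failure is a design choice this sketch does not settle.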
diff --git a/gigl/analytics/data_analyzer/report/PRD.md b/gigl/analytics/data_analyzer/report/PRD.md
new file mode 100644
index 000000000..9888e676c
--- /dev/null
+++ b/gigl/analytics/data_analyzer/report/PRD.md
@@ -0,0 +1,166 @@
+# PRD: BQ Data Analyzer HTML Report
+
+## Status
+
+**AI-owned.** An AI agent reads this PRD together with the sibling `SPEC.md` and regenerates `report.ai.html`,
+`charts.ai.js`, and `styles.ai.css` when the product intent or technical contract changes. This PRD describes *why* and
+*what*; `SPEC.md` describes *how*.
+
+## Problem
+
+Before training a GNN on graph data in BigQuery, engineers need a fast way to see whether the data is healthy enough to
+train on. Today they find out only after a Dataflow job crashes or a trainer produces a poor model, which costs days and
+thousands of dollars per iteration.
+
+A review of 18 production GNN papers ([reference doc](../../../docs/plans/20260415-bq-data-analyzer-references.md))
+found that graph-specific data properties drive 30-230% model quality differences. None of these are caught by standard
+tabular data quality tools. We need a report that surfaces these graph-specific issues in a form engineers can act on in
+minutes, not days.
+
+## Users
+
+| Persona                                  | Primary need                                                              | Frequency                  |
+| ---------------------------------------- | ------------------------------------------------------------------------- | -------------------------- |
+| **GNN engineer running an applied task** | Decide whether a new BQ dataset is trainable, and if not, what to fix     | Per new dataset or refresh |
+| **Applied task reviewer / tech lead**    | Sanity-check a teammate's dataset choices before approving a training run | Per PR                     |
+| **On-call engineer**                     | Triage why a training run degraded vs last week                           | Per incident               |
+
+Out of scope: data scientists doing generic exploratory data analysis, product managers, non-technical stakeholders.
+
+## User Stories
+
+1. **As a GNN engineer**, I point the analyzer at a new BQ node/edge table pair and open the resulting HTML report.
+   Within 30 seconds of scrolling I know whether the dataset has any training-blocking issues (dangling edges,
+   referential integrity, duplicates).
+2. **As a GNN engineer**, I inspect the degree distribution histogram for each edge type and decide whether my planned
+   fan-out is realistic or will cause neighbor explosion.
+3. **As a reviewer**, I share the GCS link to the report in a PR comment. My teammate opens it in a browser without
+   installing anything.
+4. **As an on-call engineer**, I run the analyzer on today's data and last week's data and diff the two reports to see
+   what changed.
+5. **As any of the above**, I collapse the sections I do not care about so the overview stays scannable.
+
+## Goals
+
+1. **Zero-setup viewing.** The report opens in any modern browser with no server, no CDN, no authentication beyond the
+   GCS link. Works offline once downloaded.
+2. **Action-oriented.** Every numeric finding is color-coded against a literature-derived threshold (green/yellow/red)
+   so the reader knows what to do about it.
+3. **Traceable.** Every color-coded threshold and every check cites the paper or codebase location that justifies it, so
+   readers can verify claims.
+4. **Portable.** A single `.html` file that can be shared in chat, stored indefinitely in GCS, and archived alongside
+   the training run it describes.
+5. **Graph-native.** Surfaces metrics that matter for GNNs specifically (degree distribution, super-hub int16 clamp,
+   cold-start fraction, homophily, neighbor explosion), not just generic tabular stats.
+6. **AI-regenerable.** The three `.ai.*` assets can be regenerated deterministically from this PRD plus `SPEC.md`
+   without human intervention on the HTML/JS/CSS.
+
+## Non-Goals
+
+- **Not a real-time monitoring dashboard.** Aegis covers that
+  ([Phase 2](../../../docs/plans/20260415-bq-data-analyzer.md#aegis-integration-phase-2)). This report is a
+  point-in-time snapshot.
+- **Not a BI tool.** No filtering, drill-down, or ad-hoc querying. The report is a rendered artifact, not an interactive
+  app.
+- **Not cross-dataset comparison.** Diffing reports is a user workflow (open two tabs), not a report feature.
+- **Not a model evaluation report.** This is about training data, not trained model performance.
+- **Not accessible (WCAG AA) in v1.** We document this gap and will address it once the report has users who need
+  it.
+
+## Functional Requirements
+
+Each requirement maps to a section of `SPEC.md` where the implementation contract lives.
+
+**FR-1: Overview at a glance.** The first screen (above the fold) shows total nodes, total edges, node/edge type counts,
+and a single green/yellow/red status light summarizing the worst issue found. Rationale: engineers decide "do I need to
+look deeper" in the first 5 seconds.
+
+**FR-2: Hard-fail visibility.** Dangling edges, referential integrity violations, and duplicate nodes render red
+regardless of magnitude. These block training entirely. The report shows them prominently even if the count is exactly
+one. Rationale: [GiGL](../../../docs/plans/20260415-bq-data-analyzer-references.md#6-gigl),
+[AliGraph (7.1)](../../../docs/plans/20260415-bq-data-analyzer-references.md#7-aligraph) — silent NaN propagation from
+referential integrity violations is a production-documented failure mode.
+
+**FR-3: Degree distribution per edge type.** Inline SVG histogram using the six literature-aligned buckets: `0-1`,
+`2-10`, `11-100`, `101-1K`, `1K-10K`, `10K+`. Separate in-degree and out-degree. Rationale:
+[BLADE](../../../docs/plans/20260415-bq-data-analyzer-references.md#3-blade) showed 230% embedding improvement from
+degree-adaptive neighborhoods; the reader needs to see which buckets dominate.
+
+**FR-4: Super-hub warning.** A red call-out appears when any node exceeds the GiGL int16 degree clamp (32,767). Include
+the count and the affected edge type. Rationale:
+[GiGL (6.2)](../../../docs/plans/20260415-bq-data-analyzer-references.md#6-gigl) — the clamp is silent in production and
+corrupts PPR sampling probabilities. Users have no other way to discover this.
+
+**FR-5: Cold-start visibility.** Show the count and fraction of nodes with degree 0 or 1 per node type. Color-code the
+fraction against the 5% / 10% thresholds. Rationale:
+[LiGNN (4.1)](../../../docs/plans/20260415-bq-data-analyzer-references.md#4-lignn) — +0.28% AUC from cold-start
+densification; the reader decides whether densification is worth investigating.
+
+**FR-6: Optional Tier 3 visibility.** Class imbalance, label coverage, edge type distribution, and per-edge-type node
+coverage are shown only when the input data supports them. Rationale: a report full of "not applicable" sections is
+noise.
+
+**FR-7: Embedded FACETS.** When feature profiling is available, the FACETS HTML output is embedded inline via
+`<iframe srcdoc="...">` so that the TFDV-generated styles do not leak into the main report. Rationale: FACETS is an
+industry-standard visualization; engineers already know how to read it.
+
+**FR-8: Collapsible sections.** Every section below the overview is independently collapsible via native
+`<details>`/`<summary>` with sensible defaults (hard fails always open; advanced sections closed by default). Rationale:
+the report is comprehensive by design, but any one reader needs only the sections relevant to their question.
+
+**FR-9: Raw artifact links.** The footer lists GCS paths to the raw outputs (TFDV stats `.tfrecord`, FACETS `.html` per
+table, schema `.pbtxt`) so the reader can dig deeper with other tools.
+
+## Non-Functional Requirements
+
+| Requirement                                      | Target                                                                                                                   |
+| ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------ |
+| **Load time** (opening the HTML from local disk) | Under 3 seconds for a report with up to 20 tables                                                                        |
+| **File size**                                    | Under 1 MB baseline; up to ~10 MB when FACETS iframes are embedded                                                       |
+| **Browser support**                              | Latest Chrome, Firefox, Safari, Edge. No IE.                                                                             |
+| **Dependencies**                                 | Zero external — no CDN, no Google Fonts, no JS framework. All CSS/JS inlined.                                            |
+| **Portability**                                  | Viewing the report over a GCS `gs://` link works without re-download. Saving to disk works.                              |
+| **Determinism**                                  | Same input data + same analyzer version produces byte-identical HTML (enables snapshot testing).                         |
+| **Security**                                     | All data injected via `textContent`, never `innerHTML`. FACETS embeds are isolated in iframes. No remote resource loads. |
+| **Accessibility**                                | Best-effort only in v1: semantic HTML, reasonable color contrast. Full WCAG AA is a non-goal.                            |
+
+## Success Metrics
+
+How we know this PRD was successfully implemented:
+
+1. **Snapshot test stays green.** The golden file at `tests/test_assets/analytics/golden_report.html` matches the
+   generated output for a known input. Any intentional change to the report requires a reviewed update to the golden
+   file.
+2. **Report opens standalone.** Downloading the HTML file and opening it offline produces the same rendering as opening
+   it from GCS.
+3. **All threshold values match the design doc.** A reviewer can open `SPEC.md`, the `20260415-bq-data-analyzer.md`
+   design doc, and the rendered report and confirm all three agree on green/yellow/red cutoffs.
+4. **Regeneration works end-to-end.** An AI agent, given only this PRD and `SPEC.md`, regenerates `report.ai.html`,
+   `charts.ai.js`, and `styles.ai.css` such that the snapshot test still passes.
+
+## Open Questions
+
+1. **Should the report surface the power-law exponent estimate by default?** We compute it from degree stats (cheap),
+   but
+   [Demystifying (17.1)](../../../docs/plans/20260415-bq-data-analyzer-references.md#17-demystifying-common-beliefs-in-graph-ml)
+   cautions against relying on derived metrics that summarize away the full distribution. Current answer: show it only
+   in the Advanced section with a caveat.
+2. **Should FACETS embeds be lazy-loaded?** A 20-table report with FACETS per table can be ~10 MB. Lazy loading (iframe
+   `loading="lazy"`) would speed first paint but complicates the "single self-contained HTML" goal. Current answer:
+   eager load; revisit if reports routinely exceed 10 MB.
+3. **Should we support dark mode?** Not in v1. The color-coded thresholds (red/yellow/green) assume a light background;
+   a dark theme would need separate color values.
+
+## References
+
+- **Technical spec:** [`SPEC.md`](SPEC.md) in this directory — the contract for regenerating the `.ai.*` files.
+- **Design doc:** [`docs/plans/20260415-bq-data-analyzer.md`](../../../docs/plans/20260415-bq-data-analyzer.md) —
+  architecture, 4-tier validation, cost control, tradeoff analysis.
+- **Literature review:**
+  [`docs/plans/20260415-bq-data-analyzer-references.md`](../../../docs/plans/20260415-bq-data-analyzer-references.md) —
+  18 papers, 100+ findings with source citations, consolidated threshold table.
+- **1-pager:** [`docs/plans/20260416-data-analyzer-1-pager.md`](../../../docs/plans/20260416-data-analyzer-1-pager.md) —
+  executive summary for peer engineers.
+- **Engineering spec:**
+  [`docs/plans/20260416-data-analyzer-engineering-spec.md`](../../../docs/plans/20260416-data-analyzer-engineering-spec.md)
+  — per-layer implementation plan.
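
Reviewer sketch (not part of the patch): FR-3 and the SPEC both name six degree buckets but leave the boundaries implicit. Below is a minimal Python illustration of one possible mapping, assuming each label's upper bound is inclusive (degree 1,000 lands in `101-1K`, degree 10,000 in `1K-10K`). `degree_bucket` and `degree_histogram` are hypothetical names; the real buckets come from `GraphStructureAnalyzer`, whose boundary handling is not shown in this diff.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, Iterable

# Labels copied from FR-3 / BUCKET_ORDER in charts.ai.js.
BUCKET_ORDER = ["0-1", "2-10", "11-100", "101-1K", "1K-10K", "10K+"]
# ASSUMPTION: inclusive upper bound of every bucket except the open-ended last one.
_UPPER_BOUNDS = [1, 10, 100, 1_000, 10_000]


def degree_bucket(degree: int) -> str:
    """Map a non-negative degree to its bucket label (hypothetical helper)."""
    if degree < 0:
        raise ValueError("degree must be non-negative")
    # bisect_left returns the index of the first bound >= degree.
    return BUCKET_ORDER[bisect_left(_UPPER_BOUNDS, degree)]


def degree_histogram(degrees: Iterable[int]) -> Dict[str, int]:
    """Count degrees per bucket, keeping empty buckets so the chart has six bars."""
    counts = Counter(degree_bucket(d) for d in degrees)
    return {label: counts.get(label, 0) for label in BUCKET_ORDER}


histogram = degree_histogram([0, 1, 2, 10, 11, 1_000, 1_001, 40_000])
```

Emitting all six keys, including zero counts, keeps the histogram shape stable across edge types, which is what a byte-identical snapshot test would need.
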
diff --git a/gigl/analytics/data_analyzer/report/SPEC.md b/gigl/analytics/data_analyzer/report/SPEC.md
new file mode 100644
index 000000000..2079b4f5e
--- /dev/null
+++ b/gigl/analytics/data_analyzer/report/SPEC.md
@@ -0,0 +1,176 @@
+# Report Generator SPEC
+
+## Purpose
+
+This SPEC defines the single self-contained HTML report that the BQ Data Analyzer produces for a graph dataset. The
+three `.ai.{html,js,css}` files in this directory implement the SPEC and should be regenerated from it whenever the SPEC
+changes. The Python `report_generator.py` module is the only non-AI-owned component in this directory; it loads the AI
+assets via `importlib.resources`, injects data from a `GraphAnalysisResult` dataclass, and writes a single HTML file to
+disk.
+
+## Constraints
+
+- Single self-contained HTML file. No external CDN, no external JS/CSS/font dependencies, no network requests at view
+  time.
+- Opens in any modern browser (Chrome, Firefox, Safari, Edge) without a server.
+- Max-width 1200px, centered horizontally.
+- Light background (`#f8f9fa`).
+- Monospace font (`ui-monospace`, `SFMono-Regular`, `Menlo`, `monospace`) for all numeric data values; sans-serif
+  (`system-ui`, `-apple-system`, `"Segoe UI"`, `Roboto`, sans-serif) for labels and headings.
+- Collapsible sections use `<details>` / `<summary>` (no JS required to expand/collapse).
+- Color coding for status uses these exact values:
+  - Green: `#28a745` (OK)
+  - Yellow: `#ffc107` (warning)
+  - Red: `#dc3545` (critical)
+- Total report HTML should be reasonable in size (a single dataset's report with embedded FACETS iframes may be
+  multi-MB; that is acceptable).
+
+## Sections (in display order)
+
+1. **Header** (`<header id="report-header">`) — "GiGL Data Analysis Report" title, generation timestamp, and a short
+   config summary listing the analyzed node tables and edge tables.
+2. **Overview Dashboard** (`<section id="overview">`) — Card grid showing total nodes, total edges, number of node
+   types, number of edge types, and an overall traffic-light status indicator (green/yellow/red). The status is the
+   worst severity across all detected issues.
+3. **Data Quality** (`<section id="data-quality">`) — Per-table NULL rates table sorted highest-first with rows
+   color-coded (yellow above 50% NULL, red above 90%). Duplicate node counts, duplicate edge counts, dangling edge
+   counts, and referential integrity violations. Any nonzero count in these four is rendered red.
+4. **Feature Statistics** (`<section id="feature-statistics">`) — Optional. One `<details>` block per table. Each block
+   embeds the corresponding FACETS HTML via `<iframe src="...">` using a **relative path** of the form
+   `feature_profiler/{kind}s/{type_name}/facets.html` (derived from the result_key like `node:user`). Above the iframe,
+   an "Open full-screen ↗" anchor opens the same relative path in a new tab. Relative paths mean the embed and
+   full-screen link both work as long as the report folder retains the layout produced by `FeatureProfiler` (i.e.
+   `report.html` and the `feature_profiler/` subdirectory live in the same directory). The absolute GCS URI from
+   `facets_html_paths` is shown as a label for traceability. When `profile.errors` is non-empty, a red warning box plus
+   a per-error table (table key, stage, BQ table, **Dataflow job**, message) is rendered at the top of the section so
+   users can diagnose schema-fetch failures, empty projections, Dataflow crashes, and embedding-diagnostics failures
+   without reading logs. For `stage == "dataflow"` errors the Dataflow job cell links to the Cloud Console URL when
+   `job_id` / project / region are all known; otherwise the cell shows the job name or `—`. Section is hidden only when
+   both `profile.facets_html_paths` and `profile.errors` are empty.
+5. **Graph Structure** (`<section id="graph-structure">`) — Node and edge count table. Per-edge-type degree distribution
+   rendered as inline SVG histogram using the `buckets` dict from `DegreeStats` (buckets `0-1`, `2-10`, `11-100`,
+   `101-1K`, `1K-10K`, `10K+`). Top-20 hub table per edge type. Super-hub int16 clamp warning box (red) shown if any
+   edge type reports a clamp count > 0. Each per-edge-type subsection header (`Degree distribution`, `Top-20 hubs`)
+   carries a `<details class="query-disclosure">Show SQL` button rendered next to the heading; expanding it shows the
+   rendered BigQuery SQL strings recorded under the matching `analysis.queries` block ID.
+6. **Supervision Overlap** (`<section id="supervision-overlap">`) — One card per `SupervisionCrossTableStats` entry
+   showing `{driver_edge_type} → {other_edge_type} ({other_role})`, anchored on `node_anchor`. Row labels reference the
+   actual `driver_edge_type` and `other_edge_type` names directly (e.g. "Distinct anchors in `viewed_pos`", "Avg edges
+   in `viewed_neg` per anchor in `viewed_pos`") rather than generic "driver" / "other" placeholders. Each card lists
+   distinct anchor / pair counts on each side, per-anchor count distribution (avg / p50 / p90 / p99 / max), the count of
+   anchors with zero edges on the other side, and the overlap pair count (label-leakage signal). Each card title is
+   accompanied by a `Show SQL` disclosure exposing the underlying cross-table query. Section is hidden when
+   `analysis.supervision_cross_table_stats` is empty.
+6a. **Node Classification Supervision** (`<section id="node-classification-supervision">`) — One card per labeled node
+    type. Subsections are: **Label hygiene** (sentinel / NULL / valid counts), **Per-class degree** (one row per class
+    with count, cold-start fraction, mean / median / p90 / p99 / max degree, and an inline SVG sparkline histogram in a
+    `Distribution` column rendered from the per-class `buckets` dict), **Homophily** (per-edge-type edge / adjusted
+    homophily and sample size), and **Train / val / test split** (cross-split id leakage and per-split row counts). Each
+    subsection's `<h4>` has a `Show SQL` disclosure next to it; for Homophily the disclosure aggregates queries across
+    all matching edge types.
+7. **Advanced** (`<section id="advanced">`) — Optional Tier 3 / Tier 4 data. Shown only if the relevant fields are
+   populated. Each subsection's `<h3>` carries a `Show SQL` disclosure when corresponding queries were recorded:
+   - Class imbalance (bar chart and per-class counts)
+   - Label coverage (percentage per node type)
+   - Edge type distribution (bar chart)
+   - Reciprocity per edge type
+   - Power-law exponent per edge type
+8. **Footer** (`<footer id="report-footer">`) — GiGL version / commit and a list of raw artifact GCS paths.
+
+## Key Thresholds
+
+Thresholds used to color-code metrics. These must match the design doc (`docs/plans/20260415-bq-data-analyzer.md`)
+exactly.
+
+| Metric                           | Green         | Yellow     | Red        |
+| -------------------------------- | ------------- | ---------- | ---------- |
+| Edge homophily                   | > 0.7         | 0.3 - 0.7  | < 0.3      |
+| Class imbalance ratio            | < 1:5         | 1:5 - 1:10 | > 1:10     |
+| Feature missing rate             | < 10%         | 10 - 50%   | > 90%      |
+| Isolated node fraction           | < 1%          | 1 - 5%     | > 5%       |
+| Degree p99/median                | < 50          | 50 - 100   | > 100      |
+| Node degree (int16 clamp)        | ≤ 32,767      | n/a        | > 32,767   |
+| Cold-start fraction (degree 0-1) | < 5%          | 5 - 10%    | > 10%      |
+| Edge type dominance              | No type > 80% | Any > 90%  | Any < 0.1% |
+| Overlap pair fraction            | 0             | (0, 1%)    | ≥ 1%       |
+| Driver anchors with zero `other` | < 5%          | 5 - 50%    | > 50%      |
+
+## Data Injection Contract
+
+`report_generator.py` produces a final HTML file by performing four exact string replacements on `report.ai.html`:
+
+| Placeholder                  | Replaced with                                                |
+| ---------------------------- | ------------------------------------------------------------ |
+| `/* INJECT_STYLES */`        | Raw contents of `styles.ai.css`                              |
+| `/* INJECT_SCRIPTS */`       | Raw contents of `charts.ai.js`                               |
+| `/* INJECT_ANALYSIS_DATA */` | JSON-serialized `GraphAnalysisResult` (`dataclasses.asdict`) |
+| `/* INJECT_PROFILE_DATA */`  | JSON-serialized `FeatureProfileResult` (or `{}` if absent)   |
+
+The JS reads these injected JSON strings from hidden script tags:
+
+```html
+<script id="analysis-data" type="application/json">/* INJECT_ANALYSIS_DATA */</script>
+<script id="profile-data"  type="application/json">/* INJECT_PROFILE_DATA */</script>
+```
+
+On page load the JS:
+
+1. Parses both JSON blobs.
+2. Populates each section by generating DOM nodes (never `innerHTML` with untrusted strings; always `textContent`).
+3. Renders the degree distribution as an inline SVG bar chart.
+4. Applies color coding (`status-green`, `status-yellow`, `status-red`) based on the thresholds above.
+5. Hides `#feature-statistics` if the profile data is empty / `{}`.
+6. Hides `#advanced` if no Tier 3 or Tier 4 data is present.
+7. Renders per-block `Show SQL` disclosures from the `analysis.queries` map (see contract below). The disclosure is
+   omitted when no queries were recorded for the matching block ID.
+
+### `analysis.queries` Contract
+
+`GraphAnalysisResult.queries` is a flat `dict[str, list[str]]` populated at execution time by the analyzer. Keys are
+block IDs that the report renderer uses to locate which `<details class="query-disclosure">` to attach near a header.
+Multiple SQL strings under one key are rendered as separate `<pre class="sql">` blocks. Block ID conventions:
+
+| Section              | Pattern                                          |
+| -------------------- | ------------------------------------------------ |
+| Data quality         | `data_quality:<metric>:<scope>`                  |
+| Graph structure      | `graph_structure:<metric>:<edge_or_node_type>`   |
+| NC supervision       | `nc_supervision:<metric>:<node_type>[:<edge>]`   |
+| Supervision overlap  | `supervision_overlap:<driver>:<other>:<roles>`   |
+| Advanced             | `advanced:<metric>:<scope>`                      |
+
+Renderer behavior:
+
+- **Per-block headers** (Degree distribution per edge type, Top hubs per edge type, NC supervision sub-blocks,
+  Supervision overlap card title, Advanced sub-blocks) render one `Show SQL` disclosure per block ID using
+  `renderBlockHeader` / `renderQueryDisclosure`.
+- **Aggregated section headers** (NULL rates, Integrity checks, Counts, Homophily within an NC supervision card)
+  render one disclosure that pulls every block ID matching a prefix using `renderQueryDisclosureByPrefix`.
+- Missing keys (or an empty `analysis.queries`) cause the disclosure to be skipped silently — old artifacts that
+  predate this field still render correctly without buttons.
+
+### JS Element Contract
+
+The JS queries these DOM IDs. The HTML template must provide them:
+
+- `#report-header`
+- `#overview`
+- `#data-quality`
+- `#feature-statistics`
+- `#graph-structure`
+- `#supervision-overlap`
+- `#advanced`
+- `#report-footer`
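+- `#analysis-data` (hidden JSON script tag)
+- `#profile-data` (hidden JSON script tag)
+- Inner containers that `charts.ai.js` looks up directly: `#report-meta`, `#report-config-summary`, `#overview-cards`,
+  `#null-rates-container`, `#integrity-container`, `#feature-statistics-container`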
+
+## Regeneration Instructions
+
+To regenerate `report.ai.html`, `charts.ai.js`, and `styles.ai.css`:
+
+1. Read this `SPEC.md` in full.
+2. Implement the sections, element IDs, thresholds, and data injection contract exactly as specified.
+3. Keep the HTML template minimal (all content is rendered by JS).
+4. Keep the JS as a single IIFE with no external dependencies; use DOM helpers, not templating libraries.
+5. Use the exact color hex values specified in "Constraints".
+6. Update the snapshot test golden file at `tests/test_assets/analytics/golden_report.html` after regenerating.
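
Reviewer sketch (not part of the patch): the Data Injection Contract in `SPEC.md` above describes four exact string replacements. The Python below is one way that step could look, not the actual `report_generator.py`. `ToyAnalysisResult`, `to_script_json`, and `render_report` are hypothetical names. Sorting JSON keys (for the determinism target) and escaping `</` (so a string value cannot close the surrounding `<script>` tag) are assumptions the SPEC does not state.

```python
import dataclasses
import json
from typing import Any, Optional


@dataclasses.dataclass
class ToyAnalysisResult:
    """Hypothetical stand-in for GraphAnalysisResult with only two fields."""
    node_counts: dict
    edge_counts: dict


def to_script_json(payload: Any) -> str:
    # ASSUMPTION: sort_keys for stable output; "<\/" is a valid JSON escape for "</".
    return json.dumps(payload, sort_keys=True).replace("</", "<\\/")


def render_report(template: str, styles: str, scripts: str,
                  analysis: Any, profile: Optional[dict] = None) -> str:
    replacements = {
        "/* INJECT_STYLES */": styles,
        "/* INJECT_SCRIPTS */": scripts,
        "/* INJECT_ANALYSIS_DATA */": to_script_json(dataclasses.asdict(analysis)),
        "/* INJECT_PROFILE_DATA */": to_script_json(profile or {}),
    }
    # Fail loudly if the template drifted from the contract.
    for placeholder in replacements:
        if template.count(placeholder) != 1:
            raise ValueError(f"expected exactly one {placeholder!r} in template")
    html = template
    for placeholder, value in replacements.items():
        html = html.replace(placeholder, value)
    return html


TEMPLATE = (
    "<style>/* INJECT_STYLES */</style>"
    '<script id="analysis-data" type="application/json">/* INJECT_ANALYSIS_DATA */</script>'
    '<script id="profile-data" type="application/json">/* INJECT_PROFILE_DATA */</script>'
    "<script>/* INJECT_SCRIPTS */</script>"
)
report = render_report(TEMPLATE, "body{}", "(function(){})();",
                       ToyAnalysisResult({"user": 3}, {"follows": 5}))
```

Checking that each placeholder occurs exactly once before replacing turns a silently half-rendered report into an immediate error when the template is regenerated incorrectly.
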
diff --git a/gigl/analytics/data_analyzer/report/__init__.py b/gigl/analytics/data_analyzer/report/__init__.py
new file mode 100644
index 000000000..8cde20291
--- /dev/null
+++ b/gigl/analytics/data_analyzer/report/__init__.py
@@ -0,0 +1,6 @@
+"""
+HTML report generation for the BQ Data Analyzer.
+
+AI-owned assets (*.ai.html, *.ai.js, *.ai.css) are defined by SPEC.md
+in this directory and can be regenerated from that spec.
+"""
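
Reviewer sketch (not part of the patch): `SPEC.md` says `report_generator.py` loads the three AI-owned assets via `importlib.resources`. The Python below shows that loading pattern against a throwaway package built in a temp directory, so it runs without GiGL installed. `load_assets` and `toy_report` are hypothetical; the real module may structure this differently.

```python
import importlib.resources
import sys
import tempfile
from pathlib import Path

ASSET_NAMES = ("report.ai.html", "charts.ai.js", "styles.ai.css")


def load_assets(package: str) -> dict:
    """Read each asset as UTF-8 text from the given importable package."""
    root = importlib.resources.files(package)  # available on Python 3.9+
    return {name: root.joinpath(name).read_text(encoding="utf-8")
            for name in ASSET_NAMES}


# Demo: build a tiny package on disk, make it importable, and load from it.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "toy_report"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    for name in ASSET_NAMES:
        (pkg_dir / name).write_text(f"/* {name} */", encoding="utf-8")
    sys.path.insert(0, tmp)
    try:
        assets = load_assets("toy_report")
    finally:
        sys.path.remove(tmp)
```

Resolving assets through the package rather than through `__file__`-relative paths keeps the loader working when the package is installed from a wheel or zip.
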
diff --git a/gigl/analytics/data_analyzer/report/charts.ai.js b/gigl/analytics/data_analyzer/report/charts.ai.js
new file mode 100644
index 000000000..6653015b9
--- /dev/null
+++ b/gigl/analytics/data_analyzer/report/charts.ai.js
@@ -0,0 +1,1275 @@
+(function () {
+    "use strict";
+
+    // Bucket order for degree histograms; must match GraphStructureAnalyzer output.
+    const BUCKET_ORDER = ["0-1", "2-10", "11-100", "101-1K", "1K-10K", "10K+"];
+
+    function parseJSONTag(id) {
+        const node = document.getElementById(id);
+        if (!node) return {};
+        const raw = (node.textContent || "").trim();
+        if (!raw) return {};
+        try {
+            return JSON.parse(raw);
+        } catch (e) {
+            console.error("Failed to parse JSON tag #" + id, e);
+            return {};
+        }
+    }
+
+    function createElement(tag, attrs, ...children) {
+        const el = document.createElement(tag);
+        if (attrs) {
+            for (const key of Object.keys(attrs)) {
+                const val = attrs[key];
+                if (val === null || val === undefined || val === false) continue;
+                if (key === "className") el.className = val;
+                else if (key === "text") el.textContent = val;
+                else if (key === "hidden") el.hidden = Boolean(val);
+                else el.setAttribute(key, val);
+            }
+        }
+        for (const child of children) {
+            if (child === null || child === undefined) continue;
+            if (typeof child === "string" || typeof child === "number") {
+                el.appendChild(document.createTextNode(String(child)));
+            } else {
+                el.appendChild(child);
+            }
+        }
+        return el;
+    }
+
+    function formatNumber(n) {
+        if (n === null || n === undefined) return "-";
+        if (typeof n !== "number") return String(n);
+        return n.toLocaleString("en-US");
+    }
+
+    function formatPercent(fraction) {
+        if (fraction === null || fraction === undefined) return "-";
+        return (fraction * 100).toFixed(2) + "%";
+    }
+
+    function classForThreshold(value, green, yellow) {
+        // value <= green -> green, value <= yellow -> yellow, else red.
+        if (value <= green) return "status-green";
+        if (value <= yellow) return "status-yellow";
+        return "status-red";
+    }
+
+    function classForNullRate(rate) {
+        if (rate > 0.9) return "status-red";
+        if (rate > 0.5) return "status-yellow";
+        return "status-green";
+    }
+
+    function sumValues(obj) {
+        if (!obj) return 0;
+        let total = 0;
+        for (const key of Object.keys(obj)) {
+            const v = obj[key];
+            if (typeof v === "number") total += v;
+        }
+        return total;
+    }
+
+    function hasAnyPositive(obj) {
+        if (!obj) return false;
+        for (const key of Object.keys(obj)) {
+            if (obj[key] > 0) return true;
+        }
+        return false;
+    }
+
+    // ---- Rendering ----
+
+    function renderHeader(analysis) {
+        const metaEl = document.getElementById("report-meta");
+        const cfgEl = document.getElementById("report-config-summary");
+        const now = new Date().toISOString();
+        metaEl.textContent = "Generated at " + now;
+
+        const nodeTypes = Object.keys(analysis.node_counts || {});
+        const edgeTypes = Object.keys(analysis.edge_counts || {});
+        cfgEl.textContent =
+            "Node tables: " + (nodeTypes.length ? nodeTypes.join(", ") : "(none)") +
+            " | Edge tables: " + (edgeTypes.length ? edgeTypes.join(", ") : "(none)");
+    }
+
+    function overallStatus(analysis) {
+        // Hard fails -> red.
+        if (hasAnyPositive(analysis.duplicate_node_counts) ||
+            hasAnyPositive(analysis.dangling_edge_counts) ||
+            hasAnyPositive(analysis.referential_integrity_violations) ||
+            hasAnyPositive(analysis.super_hub_int16_clamp_count)) {
+            return "status-red";
+        }
+        // Thresholded metrics. Track the worst severity instead of returning
+        // yellow early, so a yellow fraction cannot mask a red NULL rate below.
+        let worst = "status-green";
+        const totalNodes = sumValues(analysis.node_counts);
+        if (totalNodes > 0) {
+            const isolatedFrac = sumValues(analysis.isolated_node_counts) / totalNodes;
+            const coldFrac = sumValues(analysis.cold_start_node_counts) / totalNodes;
+            if (isolatedFrac > 0.05 || coldFrac > 0.10) return "status-red";
+            if (isolatedFrac > 0.01 || coldFrac > 0.05) worst = "status-yellow";
+        }
+        // NULL rates: any column above 90% NULL is red.
+        const nullRates = analysis.null_rates || {};
+        for (const table of Object.keys(nullRates)) {
+            for (const col of Object.keys(nullRates[table])) {
+                const r = nullRates[table][col];
+                if (r > 0.9) return "status-red";
+            }
+        }
+        return worst;
+    }
+
+    function renderOverview(analysis) {
+        const container = document.getElementById("overview-cards");
+        const totalNodes = sumValues(analysis.node_counts);
+        const totalEdges = sumValues(analysis.edge_counts);
+        const nodeTypes = Object.keys(analysis.node_counts || {}).length;
+        const edgeTypes = Object.keys(analysis.edge_counts || {}).length;
+        const status = overallStatus(analysis);
+
+        const cards = [
+            ["Total nodes", formatNumber(totalNodes)],
+            ["Total edges", formatNumber(totalEdges)],
+            ["Node types", formatNumber(nodeTypes)],
+            ["Edge types", formatNumber(edgeTypes)],
+        ];
+        for (const [label, value] of cards) {
+            container.appendChild(createElement("div", { className: "card" },
+                createElement("div", { className: "card-label", text: label }),
+                createElement("div", { className: "card-value data-value", text: value })
+            ));
+        }
+        const statusLabel = status === "status-green" ? "OK" :
+                            status === "status-yellow" ? "WARNING" : "CRITICAL";
+        container.appendChild(createElement("div", { className: "card" },
+            createElement("div", { className: "card-label", text: "Overall status" }),
+            createElement("div", { className: "card-value" },
+                createElement("span", { className: status, text: statusLabel }))
+        ));
+    }
+
+    function renderNullRates(analysis, queriesMap) {
+        const container = document.getElementById("null-rates-container");
+        const rates = analysis.null_rates || {};
+        const rows = [];
+        for (const table of Object.keys(rates)) {
+            for (const col of Object.keys(rates[table])) {
+                rows.push({ table: table, column: col, rate: rates[table][col] });
+            }
+        }
+        if (rows.length === 0) {
+            container.appendChild(createElement("p", { text: "No NULL rate data available." }));
+            return;
+        }
+        const disc = renderQueryDisclosureByPrefix(
+            queriesMap, "data_quality:null_rates:"
+        );
+        if (disc) container.appendChild(disc);
+        rows.sort((a, b) => b.rate - a.rate);
+        const thead = createElement("thead", null,
+            createElement("tr", null,
+                createElement("th", { text: "Table" }),
+                createElement("th", { text: "Column" }),
+                createElement("th", { text: "NULL rate" })));
+        const tbody = createElement("tbody");
+        for (const r of rows) {
+            const cls = classForNullRate(r.rate);
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: r.table }),
+                createElement("td", { text: r.column }),
+                createElement("td", { className: "numeric" },
+                    createElement("span", { className: cls, text: formatPercent(r.rate) }))));
+        }
+        container.appendChild(createElement("table", null, thead, tbody));
+    }
+
+    function renderIntegrity(analysis, queriesMap) {
+        const container = document.getElementById("integrity-container");
+        const integrityPrefixes = [
+            "data_quality:duplicate_nodes:",
+            "data_quality:duplicate_edges:",
+            "data_quality:dangling_edges:",
+            "data_quality:referential_integrity:",
+            "graph_structure:self_loops:",
+            "graph_structure:isolated_nodes:",
+            "graph_structure:cold_start_nodes:",
+        ];
+        const aggregate = (queriesMap && Object.keys(queriesMap).length)
+            ? createElement("details", { className: "query-disclosure" })
+            : null;
+        if (aggregate) {
+            aggregate.appendChild(createElement("summary", { text: "Show SQL" }));
+            let any = false;
+            for (const prefix of integrityPrefixes) {
+                for (const key of Object.keys(queriesMap)) {
+                    if (key.indexOf(prefix) !== 0) continue;
+                    for (const sql of (queriesMap[key] || [])) {
+                        aggregate.appendChild(createElement("p", {
+                            className: "sql-key",
+                            text: key,
+                        }));
+                        aggregate.appendChild(createElement("pre", {
+                            className: "sql",
+                            text: sql,
+                        }));
+                        any = true;
+                    }
+                }
+            }
+            if (any) container.appendChild(aggregate);
+        }
+        const rows = [
+            ["Duplicate nodes", analysis.duplicate_node_counts],
+            ["Duplicate edges", analysis.duplicate_edge_counts],
+            ["Dangling edges", analysis.dangling_edge_counts],
+            ["Referential integrity violations", analysis.referential_integrity_violations],
+            ["Self loops", analysis.self_loop_counts],
+            ["Isolated nodes", analysis.isolated_node_counts],
+            ["Cold-start nodes (degree 0-1)", analysis.cold_start_node_counts],
+        ];
+        const thead = createElement("thead", null,
+            createElement("tr", null,
+                createElement("th", { text: "Check" }),
+                createElement("th", { text: "Per-type counts" }),
+                createElement("th", { text: "Total" })));
+        const tbody = createElement("tbody");
+        for (const [label, obj] of rows) {
+            const total = sumValues(obj);
+            const isHardFail = (label === "Duplicate nodes" ||
+                                label === "Dangling edges" ||
+                                label === "Referential integrity violations");
+            const cls = isHardFail
+                ? (total > 0 ? "status-red" : "status-green")
+                : (total > 0 ? "status-yellow" : "status-green");
+            const detail = obj && Object.keys(obj).length
+                ? Object.keys(obj).map(k => k + ": " + formatNumber(obj[k])).join(", ")
+                : "(none)";
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: label }),
+                createElement("td", { className: "data-value", text: detail }),
+                createElement("td", { className: "numeric" },
+                    createElement("span", { className: cls, text: formatNumber(total) }))));
+        }
+        container.appendChild(createElement("table", null, thead, tbody));
+    }
+
+    function relativeFacetsPath(resultKey, chunkIndex, totalChunks) {
+        // result_key is "node:user" / "edge:engagement"; the FeatureProfiler
+        // writes facets.html to {output_gcs_path}/feature_profiler/{kind}s/{type}/facets.html
+        // for single-chunk tables, or to .../{type}/chunk_NN/facets.html when
+        // the projection was split across multiple Dataflow pipelines.
+        // Using a relative src means the embed and "full-screen" link both work
+        // when the report folder is downloaded from GCS as-is.
+        const parts = resultKey.split(":");
+        if (parts.length !== 2) return null;
+        const kind = parts[0];
+        const typeName = parts[1];
+        if (!kind || !typeName) return null;
+        const total = typeof totalChunks === "number" ? totalChunks : 1;
+        const idx = typeof chunkIndex === "number" ? chunkIndex : 0;
+        const subdir = total > 1 ? "chunk_" + String(idx).padStart(2, "0") + "/" : "";
+        return "feature_profiler/" + kind + "s/" + typeName + "/" + subdir + "facets.html";
+    }
+
+    function renderFeatureProfileErrors(container, errors) {
+        if (!errors || !errors.length) return;
+        const card = createElement("div", { className: "warning-box" });
+        card.appendChild(createElement("strong", {
+            text: "Feature profiling errors (" + errors.length + ")",
+        }));
+        const tbody = createElement("tbody");
+        for (const err of errors) {
+            const jobCell = createElement("td", { className: "data-value" });
+            if (err.console_url) {
+                const link = createElement("a", {
+                    href: err.console_url,
+                    target: "_blank",
+                    rel: "noopener noreferrer",
+                    text: err.job_name || err.job_id || "Open Dataflow job ↗",
+                });
+                jobCell.appendChild(link);
+                if (err.job_id) {
+                    jobCell.appendChild(createElement("br"));
+                    jobCell.appendChild(createElement("span",
+                        { className: "data-value", text: err.job_id }));
+                }
+            } else if (err.job_name || err.job_id) {
+                jobCell.textContent = err.job_name || err.job_id;
+            } else {
+                jobCell.textContent = "—";
+            }
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: err.result_key || "" }),
+                createElement("td", { text: err.stage || "" }),
+                createElement("td", { className: "data-value", text: err.bq_table || "" }),
+                jobCell,
+                createElement("td", { className: "data-value", text: err.message || "" })));
+        }
+        const table = createElement("table", null,
+            createElement("thead", null, createElement("tr", null,
+                createElement("th", { text: "Table key" }),
+                createElement("th", { text: "Stage" }),
+                createElement("th", { text: "BQ table" }),
+                createElement("th", { text: "Dataflow job" }),
+                createElement("th", { text: "Message" }))),
+            tbody);
+        const details = createElement("details", { open: "" },
+            createElement("summary", { text: "Errors and skipped tables" }),
+            table);
+        container.appendChild(card);
+        container.appendChild(details);
+    }
+
+    function renderFeatureStatistics(profile) {
+        const section = document.getElementById("feature-statistics");
+        const container = document.getElementById("feature-statistics-container");
+        const facets = (profile && profile.facets_html_paths) || {};
+        const errors = (profile && profile.errors) || [];
+        const keys = Object.keys(facets);
+        if (keys.length === 0 && errors.length === 0) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        renderFeatureProfileErrors(container, errors);
+        for (const resultKey of keys) {
+            // The sidecar value is either a list of GCS URIs (one per chunk, for
+            // wide tables that were split across multiple Dataflow pipelines)
+            // or a bare string (legacy single-Facets shape). Normalize to an array.
+            const value = facets[resultKey];
+            const paths = Array.isArray(value) ? value : (value ? [value] : []);
+            const totalChunks = paths.length;
+            const summary = createElement("summary", { text: "FACETS: " + resultKey });
+            const details = createElement("details", { open: "" }, summary);
+
+            for (let i = 0; i < paths.length; i++) {
+                const relPath = relativeFacetsPath(resultKey, i, totalChunks);
+                const absPath = paths[i] || "";
+                if (totalChunks > 1) {
+                    details.appendChild(createElement("div", {
+                        className: "facets-chunk-caption",
+                        text: "Chunk " + (i + 1) + " / " + totalChunks,
+                    }));
+                }
+                if (relPath) {
+                    details.appendChild(createElement("p", { className: "data-value" },
+                        createElement("a", {
+                            href: relPath,
+                            target: "_blank",
+                            rel: "noopener noreferrer",
+                            text: "Open full-screen ↗",
+                        }),
+                        createElement("span", { text: "  (" + absPath + ")" }),
+                    ));
+                    details.appendChild(createElement("iframe", {
+                        className: "facets-embed",
+                        src: relPath,
+                        sandbox: "allow-scripts allow-same-origin",
+                    }));
+                } else {
+                    // Fall back to absolute path when the result_key is malformed.
+                    details.appendChild(createElement("p", { className: "data-value" },
+                        createElement("a", {
+                            href: absPath,
+                            target: "_blank",
+                            rel: "noopener noreferrer",
+                            text: absPath,
+                        }),
+                    ));
+                }
+            }
+            container.appendChild(details);
+        }
+    }
+
+    function renderEmbeddingDiagnostics(profile) {
+        const section = document.getElementById("embedding-diagnostics");
+        const container = document.getElementById("embedding-diagnostics-container");
+        const diagnostics = (profile && profile.embedding_diagnostics) || {};
+        const tableKeys = Object.keys(diagnostics);
+        if (tableKeys.length === 0) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        const TOP_K_IN_REPORT = 5;
+        for (const tableKey of tableKeys) {
+            const perColumn = diagnostics[tableKey] || {};
+            const columnKeys = Object.keys(perColumn);
+            if (columnKeys.length === 0) continue;
+            const tableNode = createElement("details", { open: "" },
+                createElement("summary", { text: tableKey }));
+            for (const colName of columnKeys) {
+                const d = perColumn[colName] || {};
+                const summary = createElement("p", { className: "embedding-summary" },
+                    createElement("strong", { text: colName + ":" }),
+                    " total=" + formatNumber(d.total),
+                    ", unique=" + formatNumber(d.unique_count),
+                    ", unique_ratio=" + formatPercent(d.unique_ratio));
+                tableNode.appendChild(summary);
+
+                const topK = Array.isArray(d.top_k) ? d.top_k.slice(0, TOP_K_IN_REPORT) : [];
+                if (topK.length > 0) {
+                    const thead = createElement("thead", null,
+                        createElement("tr", null,
+                            createElement("th", { text: "Hash" }),
+                            createElement("th", { text: "Count" }),
+                            createElement("th", { text: "Fraction" })));
+                    const tbody = createElement("tbody");
+                    for (const entry of topK) {
+                        tbody.appendChild(createElement("tr", null,
+                            createElement("td", { className: "numeric", text: String(entry.hash) }),
+                            createElement("td", { className: "numeric data-value", text: formatNumber(entry.count) }),
+                            createElement("td", { className: "numeric", text: formatPercent(entry.fraction) })));
+                    }
+                    tableNode.appendChild(createElement("table", null, thead, tbody));
+                }
+            }
+            container.appendChild(tableNode);
+        }
+    }
+
+    function renderCounts(analysis, queriesMap) {
+        const container = document.getElementById("counts-container");
+        const disc = renderQueryDisclosureByPrefix(
+            queriesMap, "graph_structure:node_count:"
+        );
+        if (disc) container.appendChild(disc);
+        const edgeDisc = renderQueryDisclosureByPrefix(
+            queriesMap, "graph_structure:edge_count:"
+        );
+        if (edgeDisc) container.appendChild(edgeDisc);
+        const thead = createElement("thead", null,
+            createElement("tr", null,
+                createElement("th", { text: "Type" }),
+                createElement("th", { text: "Kind" }),
+                createElement("th", { text: "Count" })));
+        const tbody = createElement("tbody");
+        for (const [name, count] of Object.entries(analysis.node_counts || {})) {
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: name }),
+                createElement("td", { text: "node" }),
+                createElement("td", { className: "numeric data-value", text: formatNumber(count) })));
+        }
+        for (const [name, count] of Object.entries(analysis.edge_counts || {})) {
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: name }),
+                createElement("td", { text: "edge" }),
+                createElement("td", { className: "numeric data-value", text: formatNumber(count) })));
+        }
+        container.appendChild(createElement("table", null, thead, tbody));
+    }
+
+    function renderDegreeHistogram(buckets, opts) {
+        // Returns an SVG element for the given bucket counts.
+        // opts (all optional):
+        //   width, height: outer dimensions; default 720x220 for the
+        //     full per-edge-type chart, override for sparkline mode.
+        //   showLabels: when false, skips axis padding, value labels,
+        //     bucket name labels and y-axis max — sparkline-style.
+        //   sparkline: shorthand for sparkline styling (adds a CSS class
+        //     so styles can override the regular histogram look).
+        const o = opts || {};
+        const width = o.width || 720;
+        const height = o.height || 220;
+        const showLabels = o.showLabels !== false;
+        const padLeft = showLabels ? 50 : 2;
+        const padRight = showLabels ? 10 : 2;
+        const padTop = showLabels ? 16 : 2;
+        const padBottom = showLabels ? 40 : 2;
+        const innerW = width - padLeft - padRight;
+        const innerH = height - padTop - padBottom;
+
+        const svg = document.createElementNS("http://www.w3.org/2000/svg", "svg");
+        svg.setAttribute("class", o.sparkline ? "histogram sparkline" : "histogram");
+        svg.setAttribute("viewBox", "0 0 " + width + " " + height);
+
+        const counts = BUCKET_ORDER.map(k => (buckets && buckets[k]) || 0);
+        const maxCount = Math.max(1, ...counts);
+        const barWidth = innerW / BUCKET_ORDER.length;
+        const gap = showLabels ? 8 : 1;
+
+        if (showLabels) {
+            const axis = document.createElementNS("http://www.w3.org/2000/svg", "line");
+            axis.setAttribute("class", "axis");
+            axis.setAttribute("x1", padLeft);
+            axis.setAttribute("y1", padTop + innerH);
+            axis.setAttribute("x2", padLeft + innerW);
+            axis.setAttribute("y2", padTop + innerH);
+            svg.appendChild(axis);
+        }
+
+        for (let i = 0; i < BUCKET_ORDER.length; i++) {
+            const c = counts[i];
+            const h = (c / maxCount) * innerH;
+            const x = padLeft + i * barWidth + gap / 2;
+            const y = padTop + innerH - h;
+            const rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
+            rect.setAttribute("class", "bar");
+            rect.setAttribute("x", x);
+            rect.setAttribute("y", y);
+            rect.setAttribute("width", Math.max(1, barWidth - gap));
+            rect.setAttribute("height", h);
+            svg.appendChild(rect);
+
+            if (!showLabels) continue;
+
+            const valueLabel = document.createElementNS("http://www.w3.org/2000/svg", "text");
+            valueLabel.setAttribute("class", "value");
+            valueLabel.setAttribute("x", x + (barWidth - gap) / 2);
+            valueLabel.setAttribute("y", y - 4);
+            valueLabel.setAttribute("text-anchor", "middle");
+            valueLabel.textContent = formatNumber(c);
+            svg.appendChild(valueLabel);
+
+            const xLabel = document.createElementNS("http://www.w3.org/2000/svg", "text");
+            xLabel.setAttribute("class", "label");
+            xLabel.setAttribute("x", x + (barWidth - gap) / 2);
+            xLabel.setAttribute("y", padTop + innerH + 16);
+            xLabel.setAttribute("text-anchor", "middle");
+            xLabel.textContent = BUCKET_ORDER[i];
+            svg.appendChild(xLabel);
+        }
+
+        if (showLabels) {
+            // Y-axis max label.
+            const maxLabel = document.createElementNS("http://www.w3.org/2000/svg", "text");
+            maxLabel.setAttribute("class", "label");
+            maxLabel.setAttribute("x", padLeft - 6);
+            maxLabel.setAttribute("y", padTop + 10);
+            maxLabel.setAttribute("text-anchor", "end");
+            maxLabel.textContent = formatNumber(maxCount);
+            svg.appendChild(maxLabel);
+        }
+
+        return svg;
+    }
+
+    function renderQueryDisclosure(queriesMap, blockId) {
+        // Returns a <details> with the SQL strings recorded under blockId,
+        // or null when no queries were captured for that block.
+        const queries = (queriesMap || {})[blockId] || [];
+        if (!queries.length) return null;
+        const det = createElement("details", { className: "query-disclosure" });
+        det.appendChild(createElement("summary", { text: "Show SQL" }));
+        for (const q of queries) {
+            det.appendChild(createElement("pre", { className: "sql", text: q }));
+        }
+        return det;
+    }
+
+    function renderQueryDisclosureByPrefix(queriesMap, prefix) {
+        // Aggregate disclosure — collects every block_id starting with
+        // `prefix` into one expander. Used at section level when one
+        // header summarizes data from many block_ids (e.g. NULL rates,
+        // integrity counts).
+        const matches = [];
+        const map = queriesMap || {};
+        for (const key of Object.keys(map)) {
+            if (!key.startsWith(prefix)) continue;
+            const list = map[key] || [];
+            for (const sql of list) matches.push({ key: key, sql: sql });
+        }
+        if (!matches.length) return null;
+        const det = createElement("details", { className: "query-disclosure" });
+        det.appendChild(createElement("summary", { text: "Show SQL" }));
+        for (const entry of matches) {
+            det.appendChild(createElement("p", {
+                className: "sql-key",
+                text: entry.key,
+            }));
+            det.appendChild(createElement("pre", {
+                className: "sql",
+                text: entry.sql,
+            }));
+        }
+        return det;
+    }
+
+    function renderBlockHeader(level, title, queriesMap, blockId) {
+        // <div class="block-header"><{level}>title</{level}>[<details>...]</div>, where level
+        // is a tag name such as "h3". The disclosure is omitted when no queries are recorded for blockId.
+        const wrap = createElement("div", { className: "block-header" });
+        wrap.appendChild(createElement(level, { text: title }));
+        const disc = renderQueryDisclosure(queriesMap, blockId);
+        if (disc) wrap.appendChild(disc);
+        return wrap;
+    }
+
+    function renderDegree(analysis, queriesMap) {
+        const container = document.getElementById("degree-container");
+        const degrees = analysis.degree_stats || {};
+        const keys = Object.keys(degrees);
+        if (keys.length === 0) {
+            container.appendChild(createElement("p", { text: "No degree stats available." }));
+            return;
+        }
+        for (const edgeType of keys) {
+            const stats = degrees[edgeType];
+            const median = stats.median || 1;
+            const ratio = stats.p99 / Math.max(1, median);
+            const ratioClass = classForThreshold(ratio, 50, 100);
+
+            const statsLine = createElement("p", { className: "data-value" },
+                "min=" + formatNumber(stats.min) +
+                ", mean=" + (typeof stats.mean === "number" ? stats.mean.toFixed(2) : "-") +
+                ", median=" + formatNumber(stats.median) +
+                ", p90=" + formatNumber(stats.p90) +
+                ", p99=" + formatNumber(stats.p99) +
+                ", p99.9=" + formatNumber(stats.p999) +
+                ", max=" + formatNumber(stats.max) +
+                " | p99/median=",
+                createElement("span", { className: ratioClass, text: ratio.toFixed(1) }));
+
+            container.appendChild(renderBlockHeader(
+                "h3", edgeType, queriesMap, "graph_structure:degree:" + edgeType
+            ));
+            container.appendChild(statsLine);
+            container.appendChild(renderDegreeHistogram(stats.buckets || {}));
+        }
+    }
+
+    function renderHubs(analysis, queriesMap) {
+        const container = document.getElementById("hubs-container");
+        const hubs = analysis.top_hubs || {};
+        const keys = Object.keys(hubs);
+        if (keys.length === 0) {
+            container.appendChild(createElement("p", { text: "No hub data available." }));
+            return;
+        }
+        for (const edgeType of keys) {
+            container.appendChild(renderBlockHeader(
+                "h3", edgeType, queriesMap, "graph_structure:top_hubs:" + edgeType
+            ));
+            const thead = createElement("thead", null,
+                createElement("tr", null,
+                    createElement("th", { text: "Rank" }),
+                    createElement("th", { text: "Node ID" }),
+                    createElement("th", { text: "Degree" })));
+            const tbody = createElement("tbody");
+            const rows = (hubs[edgeType] || []).slice(0, 20);
+            rows.forEach((entry, i) => {
+                const nodeId = Array.isArray(entry) ? entry[0] : entry.node_id;
+                const degree = Array.isArray(entry) ? entry[1] : entry.degree;
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: String(i + 1) }),
+                    createElement("td", { className: "data-value", text: String(nodeId) }),
+                    createElement("td", { className: "numeric data-value", text: formatNumber(degree) })));
+            });
+            container.appendChild(createElement("table", null, thead, tbody));
+        }
+    }
+
+    function renderSuperHubWarning(analysis) {
+        const box = document.getElementById("super-hub-warning");
+        const clamps = analysis.super_hub_int16_clamp_count || {};
+        const totalClamps = sumValues(clamps);
+        if (totalClamps <= 0) {
+            box.hidden = true;
+            return;
+        }
+        box.hidden = false;
+        box.className = "warning-box";
+        const detail = Object.keys(clamps)
+            .map(k => k + ": " + formatNumber(clamps[k]))
+            .join(", ");
+        box.appendChild(createElement("strong", { text: "Super-hub int16 clamp warning. " }));
+        box.appendChild(document.createTextNode(
+            formatNumber(totalClamps) + " node(s) exceed the int16 degree limit (32,767); " +
+            "GiGL will silently clamp their degrees. Per-type: " + detail
+        ));
+    }
+
+    function renderSupervisionOverlap(analysis, queriesMap) {
+        const section = document.getElementById("supervision-overlap");
+        const container = document.getElementById("supervision-overlap-container");
+        const stats = (analysis && analysis.supervision_cross_table_stats) || [];
+        if (!stats.length) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        while (container.firstChild) container.removeChild(container.firstChild);
+
+        for (const entry of stats) {
+            const card = createElement("div", { className: "card" });
+            const title = entry.driver_edge_type + " → " + entry.other_edge_type +
+                " (" + entry.other_role + ")";
+            // Card-level disclosure aggregating any block_id starting with
+            // "supervision_overlap:<driver>:<other>:" — covers homogeneous
+            // and heterogeneous anchor-column suffixes alike.
+            const cardHeader = createElement("div", { className: "block-header" });
+            cardHeader.appendChild(createElement("h3", { text: title }));
+            const cardDisc = renderQueryDisclosureByPrefix(
+                queriesMap,
+                "supervision_overlap:" + entry.driver_edge_type +
+                    ":" + entry.other_edge_type + ":"
+            );
+            if (cardDisc) cardHeader.appendChild(cardDisc);
+            card.appendChild(cardHeader);
+            card.appendChild(createElement("p", { className: "data-value" },
+                "Anchor node type: ", entry.node_anchor,
+                " (driver role: ", entry.driver_role, ")"));
+
+            const driverPairs = entry.driver_pair_count || 0;
+            const overlap = entry.overlap_pair_count || 0;
+            const overlapFrac = driverPairs > 0 ? overlap / driverPairs : 0;
+            const overlapClass = overlap === 0
+                ? "status-green"
+                : (overlapFrac >= 0.01 ? "status-red" : "status-yellow");
+
+            const driverAnchors = entry.driver_anchor_count || 0;
+            const zeroOther = entry.driver_anchors_with_zero_other || 0;
+            const zeroFrac = driverAnchors > 0 ? zeroOther / driverAnchors : 0;
+            const zeroClass = zeroFrac > 0.5
+                ? "status-red"
+                : (zeroFrac >= 0.05 ? "status-yellow" : "status-green");
+
+            const driverName = entry.driver_edge_type;
+            const otherName = entry.other_edge_type;
+            const tbody = createElement("tbody");
+            const rows = [
+                [
+                    "Distinct anchors in " + driverName,
+                    formatNumber(driverAnchors),
+                    null,
+                ],
+                [
+                    "Distinct (anchor, neighbor) pairs in " + driverName,
+                    formatNumber(driverPairs),
+                    null,
+                ],
+                [
+                    "Distinct (anchor, neighbor) pairs in " + otherName,
+                    formatNumber(entry.other_pair_count || 0),
+                    null,
+                ],
+                [
+                    "Overlap pair count (" + driverName + " ∩ " + otherName + ")",
+                    formatNumber(overlap) + "  (" + formatPercent(overlapFrac) + ")",
+                    overlapClass,
+                ],
+                [
+                    "Anchors in " + driverName + " with zero edges in " + otherName,
+                    formatNumber(zeroOther) + "  (" + formatPercent(zeroFrac) + ")",
+                    zeroClass,
+                ],
+                [
+                    "Avg edges in " + otherName + " per anchor in " + driverName,
+                    (entry.avg_other_per_driver_anchor || 0).toFixed(2),
+                    null,
+                ],
+                [
+                    "p50 / p90 / p99 / max edges in " + otherName +
+                        " per anchor in " + driverName,
+                    formatNumber(entry.p50_other_per_driver_anchor || 0) + " / " +
+                    formatNumber(entry.p90_other_per_driver_anchor || 0) + " / " +
+                    formatNumber(entry.p99_other_per_driver_anchor || 0) + " / " +
+                    formatNumber(entry.max_other_per_driver_anchor || 0),
+                    null,
+                ],
+            ];
+            for (const [label, value, cls] of rows) {
+                const valueCell = createElement("td", { className: "numeric data-value" });
+                if (cls) {
+                    valueCell.appendChild(createElement("span", { className: cls, text: value }));
+                } else {
+                    valueCell.textContent = value;
+                }
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: label }),
+                    valueCell));
+            }
+            card.appendChild(createElement("table", null, tbody));
+            container.appendChild(card);
+        }
+    }
+
+    function renderNodeClassificationSupervision(analysis, queriesMap) {
+        const section = document.getElementById("node-classification-supervision");
+        const container = document.getElementById(
+            "node-classification-supervision-container"
+        );
+        const stats = (analysis && analysis.node_classification_supervision_stats) || [];
+        if (!stats.length) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        while (container.firstChild) container.removeChild(container.firstChild);
+
+        for (const entry of stats) {
+            const card = createElement("div", { className: "card" });
+            card.appendChild(createElement(
+                "h3",
+                { text: "Node type: " + entry.node_type +
+                       "   (label column: " + entry.label_column + ")" }
+            ));
+            const nt = entry.node_type;
+
+            const sentinel = entry.sentinel_stats || {};
+            const totalRows = sentinel.total_rows || 0;
+            const nullCount = sentinel.null_count || 0;
+            const validCount = sentinel.valid_label_count || 0;
+            const validCoverage = sentinel.valid_label_coverage || 0;
+            const sentinelCounts = sentinel.sentinel_counts || {};
+            const sentinelTotal = Object.values(sentinelCounts)
+                .reduce((acc, value) => acc + (value || 0), 0);
+
+            const sentinelTbody = createElement("tbody");
+            const sentinelRows = [
+                ["Total rows", formatNumber(totalRows), null],
+                [
+                    "Valid labels (non-null AND non-sentinel)",
+                    formatNumber(validCount) + "  (" + formatPercent(validCoverage) + ")",
+                    validCoverage > 0 ? "status-green" : "status-red",
+                ],
+                [
+                    "NULL labels",
+                    formatNumber(nullCount),
+                    nullCount > 0 ? "status-yellow" : null,
+                ],
+                [
+                    "Sentinel labels (treated as missing)",
+                    formatNumber(sentinelTotal),
+                    sentinelTotal > 0 ? "status-yellow" : null,
+                ],
+            ];
+            for (const [label, value, cls] of sentinelRows) {
+                const valueCell = createElement("td", { className: "numeric data-value" });
+                if (cls) {
+                    valueCell.appendChild(createElement("span", { className: cls, text: value }));
+                } else {
+                    valueCell.textContent = value;
+                }
+                sentinelTbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: label }),
+                    valueCell));
+            }
+            card.appendChild(renderBlockHeader(
+                "h4", "Label hygiene", queriesMap,
+                "nc_supervision:label_sentinel:" + nt
+            ));
+            card.appendChild(createElement("table", null, sentinelTbody));
+
+            if (Object.keys(sentinelCounts).length) {
+                const ulSentinel = createElement("ul");
+                for (const [val, count] of Object.entries(sentinelCounts)) {
+                    ulSentinel.appendChild(createElement("li", {
+                        text: "sentinel " + JSON.stringify(val) + ": " +
+                              formatNumber(count || 0),
+                    }));
+                }
+                card.appendChild(ulSentinel);
+            }
+
+            const perClass = entry.per_class_degree || [];
+            if (perClass.length) {
+                card.appendChild(renderBlockHeader(
+                    "h4", "Per-class degree", queriesMap,
+                    "nc_supervision:per_class_degree:" + nt
+                ));
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("th", { text: "Class" }),
+                    createElement("th", { text: "Count" }),
+                    createElement("th", { text: "Cold-start (deg ≤ 1)" }),
+                    createElement("th", { text: "Mean" }),
+                    createElement("th", { text: "Median" }),
+                    createElement("th", { text: "p90" }),
+                    createElement("th", { text: "p99" }),
+                    createElement("th", { text: "Max" }),
+                    createElement("th", { text: "Distribution" })));
+                for (const cls of perClass) {
+                    const coldFrac = cls.count > 0
+                        ? (cls.cold_start_count || 0) / cls.count
+                        : 0;
+                    const coldClass = coldFrac >= 0.5
+                        ? "status-red"
+                        : (coldFrac >= 0.1 ? "status-yellow" : "status-green");
+                    const coldCell = createElement("td", { className: "numeric data-value" });
+                    coldCell.appendChild(createElement("span", {
+                        className: coldClass,
+                        text: formatNumber(cls.cold_start_count || 0) +
+                              "  (" + formatPercent(coldFrac) + ")",
+                    }));
+                    const distCell = createElement("td", { className: "sparkline-cell" });
+                    distCell.appendChild(renderDegreeHistogram(cls.buckets || {}, {
+                        width: 140,
+                        height: 32,
+                        showLabels: false,
+                        sparkline: true,
+                    }));
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: String(cls.class_value) }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.count || 0),
+                        }),
+                        coldCell,
+                        createElement("td", {
+                            className: "numeric",
+                            text: (cls.mean_degree || 0).toFixed(2),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.median_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p90_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p99_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.max_degree || 0),
+                        }),
+                        distCell));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            const sentinelDegree = entry.sentinel_degree_stats || [];
+            if (sentinelDegree.length) {
+                card.appendChild(renderBlockHeader(
+                    "h4", "Sentinel-label degree distribution", queriesMap,
+                    "nc_supervision:per_class_degree:" + nt
+                ));
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("th", { text: "Sentinel" }),
+                    createElement("th", { text: "Count" }),
+                    createElement("th", { text: "Cold-start (deg ≤ 1)" }),
+                    createElement("th", { text: "Mean" }),
+                    createElement("th", { text: "Median" }),
+                    createElement("th", { text: "p90" }),
+                    createElement("th", { text: "p99" }),
+                    createElement("th", { text: "Max" }),
+                    createElement("th", { text: "Distribution" })));
+                for (const cls of sentinelDegree) {
+                    const coldFrac = cls.count > 0
+                        ? (cls.cold_start_count || 0) / cls.count
+                        : 0;
+                    const coldClass = coldFrac >= 0.5
+                        ? "status-red"
+                        : (coldFrac >= 0.1 ? "status-yellow" : "status-green");
+                    const coldCell = createElement("td", { className: "numeric data-value" });
+                    coldCell.appendChild(createElement("span", {
+                        className: coldClass,
+                        text: formatNumber(cls.cold_start_count || 0) +
+                              "  (" + formatPercent(coldFrac) + ")",
+                    }));
+                    const distCell = createElement("td", { className: "sparkline-cell" });
+                    distCell.appendChild(renderDegreeHistogram(cls.buckets || {}, {
+                        width: 140,
+                        height: 32,
+                        showLabels: false,
+                        sparkline: true,
+                    }));
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: String(cls.class_value) }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.count || 0),
+                        }),
+                        coldCell,
+                        createElement("td", {
+                            className: "numeric",
+                            text: (cls.mean_degree || 0).toFixed(2),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.median_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p90_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p99_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.max_degree || 0),
+                        }),
+                        distCell));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            const homophily = entry.homophily || [];
+            if (homophily.length) {
+                // One query was recorded per (node_type, edge_type) so the
+                // disclosure aggregates across all edge types in this card.
+                const homHeader = createElement("div", { className: "block-header" });
+                homHeader.appendChild(createElement("h4", { text: "Homophily" }));
+                const homDisc = renderQueryDisclosureByPrefix(
+                    queriesMap, "nc_supervision:homophily:" + nt + ":"
+                );
+                if (homDisc) homHeader.appendChild(homDisc);
+                card.appendChild(homHeader);
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Edge homophily" }),
+                    createElement("th", { text: "Adjusted homophily" }),
+                    createElement("th", { text: "Sample size" })));
+                for (const h of homophily) {
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: h.edge_type }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: (h.edge_homophily || 0).toFixed(4),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: (h.adjusted_homophily || 0).toFixed(4),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(h.edge_sample_count || 0),
+                        })));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            const split = entry.cross_split_overlap;
+            if (split) {
+                card.appendChild(renderBlockHeader(
+                    "h4", "Train / val / test split", queriesMap,
+                    "nc_supervision:cross_split:" + nt
+                ));
+                const overlap = split.overlap_node_count || 0;
+                const overlapClass = overlap === 0 ? "status-green" : "status-red";
+                const overlapCell = createElement("td", { className: "numeric data-value" });
+                overlapCell.appendChild(createElement("span", {
+                    className: overlapClass,
+                    text: formatNumber(overlap),
+                }));
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: "Cross-split node-id overlap (must be 0)" }),
+                    overlapCell));
+                for (const [splitValue, count] of Object.entries(split.split_value_counts || {})) {
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: "Rows in split " + JSON.stringify(splitValue) }),
+                        createElement("td", {
+                            className: "numeric data-value",
+                            text: formatNumber(count || 0),
+                        })));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            container.appendChild(card);
+        }
+    }
+
+    function renderAdvanced(analysis, queriesMap) {
+        const section = document.getElementById("advanced");
+        const container = document.getElementById("advanced-container");
+
+        const classImb = analysis.class_imbalance || {};
+        const labelCov = analysis.label_coverage || {};
+        const edgeDist = analysis.edge_type_distribution || {};
+        const reciprocity = analysis.reciprocity || {};
+        const powerLaw = analysis.power_law_exponent || {};
+
+        const hasTier3 = Object.keys(classImb).length ||
+                         Object.keys(labelCov).length ||
+                         Object.keys(edgeDist).length;
+        const hasTier4 = Object.keys(reciprocity).length ||
+                         Object.keys(powerLaw).length;
+
+        if (!hasTier3 && !hasTier4) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+
+        if (Object.keys(classImb).length) {
+            const classImbHeader = createElement("div", { className: "block-header" });
+            classImbHeader.appendChild(createElement("h3", { text: "Class imbalance" }));
+            const classImbDisc = renderQueryDisclosureByPrefix(
+                queriesMap, "advanced:class_imbalance:"
+            );
+            if (classImbDisc) classImbHeader.appendChild(classImbDisc);
+            container.appendChild(classImbHeader);
+            for (const nodeType of Object.keys(classImb)) {
+                const counts = classImb[nodeType];
+                const values = Object.values(counts);
+                const maxC = Math.max(...values);
+                const minC = Math.max(1, Math.min(...values));
+                const ratio = maxC / minC;
+                const cls = ratio > 10 ? "status-red" : ratio > 5 ? "status-yellow" : "status-green";
+                container.appendChild(createElement("p", { className: "data-value" },
+                    nodeType + " max/min ratio = ",
+                    createElement("span", { className: cls, text: ratio.toFixed(1) + ":1" })));
+                const tbody = createElement("tbody");
+                for (const [label, count] of Object.entries(counts)) {
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: String(label) }),
+                        createElement("td", { className: "numeric data-value", text: formatNumber(count) })));
+                }
+                container.appendChild(createElement("table", null,
+                    createElement("thead", null, createElement("tr", null,
+                        createElement("th", { text: "Class" }),
+                        createElement("th", { text: "Count" }))),
+                    tbody));
+            }
+        }
+
+        if (Object.keys(labelCov).length) {
+            const labelCovHeader = createElement("div", { className: "block-header" });
+            labelCovHeader.appendChild(createElement("h3", { text: "Label coverage" }));
+            const labelCovDisc = renderQueryDisclosureByPrefix(
+                queriesMap, "advanced:label_coverage:"
+            );
+            if (labelCovDisc) labelCovHeader.appendChild(labelCovDisc);
+            container.appendChild(labelCovHeader);
+            const tbody = createElement("tbody");
+            for (const [nodeType, frac] of Object.entries(labelCov)) {
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: nodeType }),
+                    createElement("td", { className: "numeric data-value", text: formatPercent(frac) })));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Node type" }),
+                    createElement("th", { text: "Coverage" }))),
+                tbody));
+        }
+
+        if (Object.keys(edgeDist).length) {
+            const edgeDistHeader = createElement("div", { className: "block-header" });
+            edgeDistHeader.appendChild(createElement(
+                "h3", { text: "Edge type distribution" }
+            ));
+            const edgeDistDisc = renderQueryDisclosureByPrefix(
+                queriesMap, "advanced:edge_type_distribution:"
+            );
+            if (edgeDistDisc) edgeDistHeader.appendChild(edgeDistDisc);
+            container.appendChild(edgeDistHeader);
+            const total = sumValues(edgeDist);
+            const tbody = createElement("tbody");
+            for (const [edgeType, count] of Object.entries(edgeDist)) {
+                const frac = total > 0 ? count / total : 0;
+                let cls = "status-green";
+                if (frac < 0.001) cls = "status-red";
+                else if (frac > 0.9) cls = "status-red";
+                else if (frac > 0.8) cls = "status-yellow";
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: edgeType }),
+                    createElement("td", { className: "numeric data-value", text: formatNumber(count) }),
+                    createElement("td", { className: "numeric" },
+                        createElement("span", { className: cls, text: formatPercent(frac) }))));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Count" }),
+                    createElement("th", { text: "Share" }))),
+                tbody));
+        }
+
+        if (Object.keys(reciprocity).length) {
+            container.appendChild(createElement("h3", { text: "Reciprocity" }));
+            const tbody = createElement("tbody");
+            for (const [edgeType, val] of Object.entries(reciprocity)) {
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: edgeType }),
+                    createElement("td", { className: "numeric data-value", text: formatPercent(val) })));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Reciprocity" }))),
+                tbody));
+        }
+
+        if (Object.keys(powerLaw).length) {
+            container.appendChild(createElement("h3", { text: "Power-law exponent" }));
+            const tbody = createElement("tbody");
+            for (const [edgeType, alpha] of Object.entries(powerLaw)) {
+                const cls = alpha < 2 ? "status-red" : alpha < 2.5 ? "status-yellow" : "status-green";
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: edgeType }),
+                    createElement("td", { className: "numeric" },
+                        createElement("span", { className: cls, text: alpha.toFixed(2) }))));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Alpha" }))),
+                tbody));
+        }
+    }
+
+    function renderFooter(analysis, profile) {
+        const container = document.getElementById("footer-container");
+
+        const artifacts = [];
+        // facets_html_paths / stats_paths are list-valued so wide tables can
+        // contribute one entry per chunk. Flatten with a "(chunk i/N)" suffix
+        // when a table has more than one entry; preserve the legacy unsuffixed
+        // form for single-chunk tables (the common case).
+        function pushFlattened(label, dict) {
+            for (const [k, v] of Object.entries(dict)) {
+                const list = Array.isArray(v) ? v : (v ? [v] : []);
+                list.forEach((p, i) => {
+                    const suffix = list.length > 1
+                        ? " (chunk " + (i + 1) + "/" + list.length + ")"
+                        : "";
+                    artifacts.push(label + " " + k + suffix + ": " + p);
+                });
+            }
+        }
+
+        pushFlattened("FACETS", (profile && profile.facets_html_paths) || {});
+        pushFlattened("Stats", (profile && profile.stats_paths) || {});
+
+        if (artifacts.length) {
+            container.appendChild(createElement("h3", { text: "Raw artifacts" }));
+            const ul = createElement("ul", { className: "footer-list" });
+            for (const a of artifacts) {
+                ul.appendChild(createElement("li", null, createElement("code", { text: a })));
+            }
+            container.appendChild(ul);
+        }
+    }
+
+    function main() {
+        const analysis = parseJSONTag("analysis-data");
+        const profile = parseJSONTag("profile-data");
+        const queriesMap = (analysis && analysis.queries) || {};
+        renderHeader(analysis);
+        renderOverview(analysis);
+        renderNullRates(analysis, queriesMap);
+        renderIntegrity(analysis, queriesMap);
+        renderFeatureStatistics(profile);
+        renderEmbeddingDiagnostics(profile);
+        renderCounts(analysis, queriesMap);
+        renderDegree(analysis, queriesMap);
+        renderHubs(analysis, queriesMap);
+        renderSuperHubWarning(analysis);
+        renderNodeClassificationSupervision(analysis, queriesMap);
+        renderSupervisionOverlap(analysis, queriesMap);
+        renderAdvanced(analysis, queriesMap);
+        renderFooter(analysis, profile);
+    }
+
+    if (document.readyState === "loading") {
+        document.addEventListener("DOMContentLoaded", main);
+    } else {
+        main();
+    }
+})();
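Review note (charts.ai.js): the red / yellow / green cut-offs above exist only in the rendering code. A downstream quality gate that reads the JSON artifact would have to repeat them. The following is a minimal Python sketch that mirrors the thresholds from `renderAdvanced` and the sentinel-degree table; the function names are hypothetical and not part of this PR, and treating an empty class-count map as green is an assumption.

```python
def class_imbalance_status(counts: dict[str, int]) -> str:
    """Mirror the max/min class-count ratio thresholds (>10 red, >5 yellow)."""
    values = list(counts.values())
    if not values:
        return "status-green"  # assumption: nothing to flag without counts
    ratio = max(values) / max(1, min(values))  # clamp to avoid division by zero
    if ratio > 10:
        return "status-red"
    return "status-yellow" if ratio > 5 else "status-green"


def edge_share_status(count: int, total: int) -> str:
    """Mirror the edge-type share thresholds (<0.1% or >90% red, >80% yellow)."""
    frac = count / total if total > 0 else 0.0
    if frac < 0.001 or frac > 0.9:
        return "status-red"
    return "status-yellow" if frac > 0.8 else "status-green"


def power_law_status(alpha: float) -> str:
    """Mirror the power-law exponent thresholds (<2 red, <2.5 yellow)."""
    if alpha < 2:
        return "status-red"
    return "status-yellow" if alpha < 2.5 else "status-green"


def cold_start_status(cold_start_count: int, count: int) -> str:
    """Mirror the cold-start fraction thresholds (>=50% red, >=10% yellow)."""
    frac = cold_start_count / count if count > 0 else 0.0
    if frac >= 0.5:
        return "status-red"
    return "status-yellow" if frac >= 0.1 else "status-green"
```

Keeping such a mirror next to the SPEC would let the gate and the report agree on what counts as red.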
diff --git a/gigl/analytics/data_analyzer/report/report.ai.html b/gigl/analytics/data_analyzer/report/report.ai.html
new file mode 100644
index 000000000..b56d20b58
--- /dev/null
+++ b/gigl/analytics/data_analyzer/report/report.ai.html
@@ -0,0 +1,84 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="utf-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <title>GiGL Data Analysis Report</title>
+    <style>/* INJECT_STYLES */</style>
+</head>
+<body>
+    <header id="report-header">
+        <h1>GiGL Data Analysis Report</h1>
+        <p class="meta" id="report-meta"></p>
+        <p class="config-summary" id="report-config-summary"></p>
+    </header>
+
+    <section id="overview">
+        <h2>Overview</h2>
+        <div class="card-grid" id="overview-cards"></div>
+    </section>
+
+    <section id="data-quality">
+        <h2>Data Quality</h2>
+        <details open>
+            <summary>NULL rates per column</summary>
+            <div id="null-rates-container"></div>
+        </details>
+        <details open>
+            <summary>Integrity checks</summary>
+            <div id="integrity-container"></div>
+        </details>
+    </section>
+
+    <section id="feature-statistics" hidden>
+        <h2>Feature Statistics</h2>
+        <div id="feature-statistics-container"></div>
+    </section>
+
+    <section id="embedding-diagnostics" hidden>
+        <h2>Embedding Diagnostics</h2>
+        <div id="embedding-diagnostics-container"></div>
+    </section>
+
+    <section id="graph-structure">
+        <h2>Graph Structure</h2>
+        <details open>
+            <summary>Node and edge counts</summary>
+            <div id="counts-container"></div>
+        </details>
+        <details open>
+            <summary>Degree distribution</summary>
+            <div id="degree-container"></div>
+        </details>
+        <details open>
+            <summary>Top-20 hubs</summary>
+            <div id="hubs-container"></div>
+        </details>
+        <div id="super-hub-warning" hidden></div>
+    </section>
+
+    <section id="node-classification-supervision" hidden>
+        <h2>Node Classification Supervision</h2>
+        <div id="node-classification-supervision-container"></div>
+    </section>
+
+    <section id="supervision-overlap" hidden>
+        <h2>Supervision Overlap</h2>
+        <div id="supervision-overlap-container"></div>
+    </section>
+
+    <section id="advanced" hidden>
+        <h2>Advanced Metrics</h2>
+        <div id="advanced-container"></div>
+    </section>
+
+    <footer id="report-footer">
+        <h2>Artifacts</h2>
+        <div id="footer-container"></div>
+    </footer>
+
+    <script id="analysis-data" type="application/json">/* INJECT_ANALYSIS_DATA */</script>
+    <script id="profile-data" type="application/json">/* INJECT_PROFILE_DATA */</script>
+    <script>/* INJECT_SCRIPTS */</script>
+</body>
+</html>
diff --git a/gigl/analytics/data_analyzer/report/report_generator.py b/gigl/analytics/data_analyzer/report/report_generator.py
new file mode 100644
index 000000000..848a03f51
--- /dev/null
+++ b/gigl/analytics/data_analyzer/report/report_generator.py
@@ -0,0 +1,61 @@
+"""Generates a single self-contained HTML report from analysis results.
+
+Loads the AI-owned template (report.ai.html), styles (styles.ai.css),
+and chart logic (charts.ai.js), then injects serialized analysis data.
+
+The template, styles, and chart logic are defined by SPEC.md in this
+directory. AI-owned files (*.ai.html, *.ai.js, *.ai.css) can be
+regenerated from the SPEC.
+"""
+import json
+from importlib import resources
+from typing import Optional
+
+from gigl.analytics.data_analyzer.types import FeatureProfileResult, GraphAnalysisResult
+from gigl.common.logger import Logger
+
+logger = Logger()
+
+
+def generate_report(
+    analysis_result: Optional[GraphAnalysisResult] = None,
+    profile_result: Optional[FeatureProfileResult] = None,
+) -> str:
+    """Generate a self-contained HTML report from analysis results.
+
+    Args:
+        analysis_result: Graph structure analysis results (optional).
+        profile_result: TFDV feature profiling results (optional).
+
+    Returns:
+        Complete HTML string that opens standalone in any browser. An
+        argument left as ``None`` is injected as an empty JSON object.
+
+    Example:
+        >>> html = generate_report(
+        ...     analysis_result=result,
+        ...     profile_result=None,
+        ... )
+        >>> # The report is returned as a string; the caller writes it
+        >>> # to GCS or to a local file.
+    """
+    template_dir = resources.files("gigl.analytics.data_analyzer.report")
+    html_template = template_dir.joinpath("report.ai.html").read_text()
+    css = template_dir.joinpath("styles.ai.css").read_text()
+    js = template_dir.joinpath("charts.ai.js").read_text()
+
+    analysis_json = json.dumps(
+        analysis_result.model_dump(mode="json") if analysis_result else {}
+    )
+    profile_json = json.dumps(
+        profile_result.model_dump(mode="json") if profile_result else {}
+    )
+
+    html = html_template
+    html = html.replace("/* INJECT_STYLES */", css)
+    html = html.replace("/* INJECT_SCRIPTS */", js)
+    html = html.replace("/* INJECT_ANALYSIS_DATA */", analysis_json)
+    html = html.replace("/* INJECT_PROFILE_DATA */", profile_json)
+
+    logger.info(f"Generated HTML report ({len(html)} bytes)")
+    return html
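Review note (report_generator.py): `generate_report` fills the placeholders with chained `str.replace` calls and raw `json.dumps` output. Two edge cases may be worth guarding against. A string value containing `</script>` (for example inside a recorded SQL query) would close the JSON `<script>` tag early. Content injected by an earlier replace that happens to contain a later placeholder would be substituted again. The sketch below shows one possible hardening; `embed_json_for_script_tag` and `inject` are hypothetical helpers, not functions from this PR.

```python
import json
import re


def embed_json_for_script_tag(payload: dict) -> str:
    """Serialize payload so it cannot terminate an enclosing <script> tag."""
    text = json.dumps(payload)
    # "<", ">" and "&" can only occur inside JSON string values, so swapping
    # them for \uXXXX escapes keeps the text valid JSON with the same meaning.
    return (
        text.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


def inject(template: str, replacements: dict[str, str]) -> str:
    """Fill every placeholder in one pass so injected text is never rescanned."""
    if not replacements:
        return template
    pattern = re.compile("|".join(re.escape(key) for key in replacements))
    # A function replacement avoids backslash processing in the injected text.
    return pattern.sub(lambda match: replacements[match.group(0)], template)
```

The browser-side `JSON.parse` decodes the `\uXXXX` escapes back to the original characters, so the rendering code would not need to change under this approach.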
diff --git a/gigl/analytics/data_analyzer/report/styles.ai.css b/gigl/analytics/data_analyzer/report/styles.ai.css
new file mode 100644
index 000000000..430115a2d
--- /dev/null
+++ b/gigl/analytics/data_analyzer/report/styles.ai.css
@@ -0,0 +1,237 @@
+:root {
+    --color-ok: #28a745;
+    --color-warn: #ffc107;
+    --color-crit: #dc3545;
+    --color-bg: #f8f9fa;
+    --color-card-bg: #ffffff;
+    --color-border: #dee2e6;
+    --color-text: #212529;
+    --color-text-muted: #6c757d;
+    --font-sans: system-ui, -apple-system, "Segoe UI", Roboto, sans-serif;
+    --font-mono: ui-monospace, SFMono-Regular, Menlo, monospace;
+}
+
+* { box-sizing: border-box; }
+
+body {
+    max-width: 1200px;
+    margin: 0 auto;
+    padding: 24px;
+    font-family: var(--font-sans);
+    background: var(--color-bg);
+    color: var(--color-text);
+    line-height: 1.5;
+}
+
+h1, h2, h3, h4 {
+    font-family: var(--font-sans);
+    margin-top: 1.2em;
+    margin-bottom: 0.5em;
+}
+
+h1 { font-size: 1.8rem; }
+h2 { font-size: 1.4rem; border-bottom: 1px solid var(--color-border); padding-bottom: 4px; }
+h3 { font-size: 1.15rem; }
+
+.meta, .config-summary {
+    color: var(--color-text-muted);
+    font-size: 0.9rem;
+    margin: 4px 0;
+}
+
+.data-value {
+    font-family: var(--font-mono);
+    color: #111;
+}
+
+.status-green  { background: var(--color-ok);   color: #ffffff; padding: 2px 6px; border-radius: 3px; }
+.status-yellow { background: var(--color-warn); color: #212529; padding: 2px 6px; border-radius: 3px; }
+.status-red    { background: var(--color-crit); color: #ffffff; padding: 2px 6px; border-radius: 3px; }
+
+.status-dot {
+    display: inline-block;
+    width: 12px;
+    height: 12px;
+    border-radius: 50%;
+    vertical-align: middle;
+}
+.status-dot.status-green  { background: var(--color-ok); }
+.status-dot.status-yellow { background: var(--color-warn); }
+.status-dot.status-red    { background: var(--color-crit); }
+
+details {
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 6px;
+    padding: 8px 12px;
+    margin: 8px 0;
+}
+
+summary {
+    cursor: pointer;
+    font-weight: 600;
+    padding: 4px 0;
+    user-select: none;
+}
+
+.card-grid {
+    display: grid;
+    grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
+    gap: 12px;
+    margin-top: 8px;
+}
+
+.card {
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 6px;
+    padding: 16px;
+    text-align: center;
+}
+
+.card .card-label {
+    font-size: 0.85rem;
+    color: var(--color-text-muted);
+    margin-bottom: 6px;
+}
+
+.card .card-value {
+    font-family: var(--font-mono);
+    font-size: 1.4rem;
+    font-weight: 600;
+}
+
+table {
+    width: 100%;
+    border-collapse: collapse;
+    margin: 8px 0;
+    background: var(--color-card-bg);
+    font-size: 0.92rem;
+}
+
+th, td {
+    padding: 6px 10px;
+    border-bottom: 1px solid var(--color-border);
+    text-align: left;
+}
+
+th {
+    background: #f1f3f5;
+    font-weight: 600;
+}
+
+tbody tr:nth-child(even) {
+    background: #fafbfc;
+}
+
+td.numeric {
+    font-family: var(--font-mono);
+    text-align: right;
+}
+
+svg.histogram {
+    width: 100%;
+    max-width: 720px;
+    height: 220px;
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 6px;
+}
+
+svg.histogram .bar     { fill: #4c78a8; }
+svg.histogram .axis    { stroke: var(--color-border); stroke-width: 1; }
+svg.histogram .label   { font-family: var(--font-sans); font-size: 11px; fill: var(--color-text); }
+svg.histogram .value   { font-family: var(--font-mono); font-size: 11px; fill: var(--color-text); }
+
+iframe.facets-embed {
+    width: 100%;
+    min-height: 600px;
+    border: 0;
+}
+
+.facets-chunk-caption {
+    font-family: var(--font-sans);
+    font-size: 0.85em;
+    color: var(--color-text-muted);
+    margin: 12px 0 4px 0;
+    font-weight: 600;
+}
+
+.warning-box {
+    background: var(--color-crit);
+    color: #ffffff;
+    padding: 12px 16px;
+    border-radius: 6px;
+    margin: 12px 0;
+    font-weight: 600;
+}
+
+.footer-list {
+    font-size: 0.85rem;
+    color: var(--color-text-muted);
+}
+
+.footer-list code { font-family: var(--font-mono); }
+
+.block-header {
+    display: flex;
+    align-items: center;
+    gap: 12px;
+    flex-wrap: wrap;
+    margin-top: 1.2em;
+    margin-bottom: 0.5em;
+}
+.block-header h3, .block-header h4 {
+    margin: 0;
+}
+
+details.query-disclosure {
+    padding: 2px 8px;
+    margin: 0;
+    font-size: 0.85rem;
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 4px;
+}
+details.query-disclosure summary {
+    font-weight: 500;
+    color: var(--color-text-muted);
+    padding: 0;
+}
+details.query-disclosure p.sql-key {
+    font-family: var(--font-mono);
+    font-size: 0.78rem;
+    color: var(--color-text-muted);
+    margin: 8px 0 2px 0;
+}
+details.query-disclosure pre.sql {
+    font-family: var(--font-mono);
+    font-size: 0.85rem;
+    background: #f1f3f5;
+    padding: 8px;
+    border-radius: 4px;
+    overflow-x: auto;
+    white-space: pre;
+    margin: 6px 0;
+}
+
+td.sparkline-cell {
+    padding: 2px 6px;
+}
+svg.histogram.sparkline {
+    width: 140px;
+    max-width: 140px;
+    height: 32px;
+    border: none;
+    background: transparent;
+    border-radius: 0;
+    display: block;
+}
+
+@media print {
+    body { background: #ffffff; padding: 0; max-width: none; }
+    details { break-inside: avoid; border: 1px solid #ccc; }
+    iframe.facets-embed { min-height: 400px; }
+    .card, table, svg.histogram { break-inside: avoid; }
+    details.query-disclosure { display: none; }
+}
diff --git a/gigl/analytics/data_analyzer/types.py b/gigl/analytics/data_analyzer/types.py
new file mode 100644
index 000000000..48c45cf3d
--- /dev/null
+++ b/gigl/analytics/data_analyzer/types.py
@@ -0,0 +1,503 @@
+"""Pydantic result types and JSON artifact IO for the BQ Data Analyzer.
+
+Each analyzer component (:class:`GraphStructureAnalyzer`, :class:`FeatureProfiler`)
+returns a versioned Pydantic model. Components persist their results as
+JSON sidecars at ``{output_gcs_path}/{component}.json`` using
+:func:`write_artifact`; consumers (report generator, downstream quality
+gates) rehydrate them via :func:`load_artifact`.
+
+Envelope shape (see :class:`GraphStructureArtifact`,
+:class:`FeatureProfileArtifact`)::
+
+    {
+      "schema_version": "1",
+      "component": "feature_profile",
+      "generated_at": "2026-04-23T20:00:00+00:00",
+      "data": { ... }
+    }
+
+Adding a field does not bump the version (JSON readers tolerate extra keys);
+renaming or removing one bumps :data:`SCHEMA_VERSION`, which consumers must handle.
+"""
+
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Final, Literal, Optional, Union
+
+from pydantic import BaseModel, ConfigDict, Field
+
+from gigl.common import GcsUri
+from gigl.common.logger import Logger
+from gigl.common.utils.gcs import GcsUtils
+
+logger = Logger()
+
+SCHEMA_VERSION: Final[Literal["1"]] = "1"
+
+_Component = Literal["graph_structure", "feature_profile"]
+
+
+class TopKEntry(BaseModel):
+    """One row of a top-K most-frequent-hash listing."""
+
+    model_config = ConfigDict(extra="forbid")
+
+    hash: int
+    count: int
+    fraction: float
+
+
+class EmbeddingDiagnosticsResult(BaseModel):
+    """Structural sanity counts for a single REPEATED FLOAT column.
+
+    ``unique_ratio`` close to 1.0 means embeddings are well-differentiated;
+    low values indicate upstream degeneracy (many rows sharing the same
+    vector). ``top_k`` surfaces the most frequent exact-duplicate clusters
+    via ``FARM_FINGERPRINT(TO_JSON_STRING(<col>))``.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    total: int
+    unique_count: int
+    unique_ratio: float
+    top_k: list[TopKEntry] = Field(default_factory=list)
+
+
+class DegreeStats(BaseModel):
+    """Degree distribution statistics for one edge type and direction.
+
+    Computed from ``APPROX_QUANTILES(degree, 100)`` in BigQuery.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    min: int
+    max: int
+    mean: float
+    median: int
+    p90: int
+    p99: int
+    p999: int
+    percentiles: list[int]
+    buckets: dict[str, int]
+
+
+class PerClassDegreeStats(BaseModel):
+    """Per-class degree-distribution summary for one labeled node type.
+
+    Surfaces a silent failure mode of node classification (NC) at scale:
+    labeled positive-class nodes skew toward high degree, so the model
+    "just learns degree" at inference. Computed once per (node_type, class)
+    by joining the labeled node table to the message-passing edge table.
+
+    The companion ``cold_start_count`` counts class members with degree
+    <= 1 — these will fail at inductive serving regardless of training
+    quality.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    class_value: str
+    count: int
+    cold_start_count: int
+    mean_degree: float
+    median_degree: int
+    p90_degree: int
+    p99_degree: int
+    max_degree: int
+    buckets: dict[str, int] = Field(default_factory=dict)
+
+
+class HomophilyStats(BaseModel):
+    """Homophily measures for one (labeled node type, edge type) pair.
+
+    ``edge_homophily`` is the raw fraction of message-passing edges whose
+    endpoints share a label. It's the standard textbook measure but is
+    not comparable across datasets with different class priors.
+
+    ``adjusted_homophily`` corrects for class priors per Platonov et al.,
+    *Characterizing Graph Datasets for Node Classification*, NeurIPS
+    2023. Range is approximately [-1, 1]; values near 0 indicate
+    "no signal beyond class priors", positive means homophilic, negative
+    heterophilic.
+
+    ``edge_sample_count`` records how many edges were sampled to compute
+    the measures so consumers can assess statistical reliability.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    edge_type: str
+    edge_homophily: float
+    adjusted_homophily: float
+    edge_sample_count: int
+    label_informativeness: Optional[float] = None
+
+
+class LabelSentinelStats(BaseModel):
+    """Sentinel-vs-NULL accounting for one labeled node type.
+
+    Surfaces the *upstream-bug* case where labels are present but encode
+    "missing/unknown" via sentinel values like ``-1`` or ``"unknown"``
+    rather than SQL NULL. Treating those as real classes silently
+    poisons training. Reported as a separate count from NULL so the
+    upstream owner can be paged on the right thing.
+
+    ``valid_label_coverage`` is the fraction of rows with a real label
+    (non-NULL AND non-sentinel) and is what downstream class-imbalance /
+    homophily computations use as the denominator.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    total_rows: int
+    null_count: int
+    sentinel_counts: dict[str, int] = Field(default_factory=dict)
+    valid_label_count: int
+    valid_label_coverage: float
+
+
+class CrossSplitOverlap(BaseModel):
+    """Cross-split node-id leakage stats for one labeled node type.
+
+    A node-id appearing in more than one split is unconditionally a
+    bug — train/val/test contamination silently inflates eval metrics.
+    The analyzer treats any non-zero ``overlap_node_count`` as a
+    Tier 1-style hard fail and raises :class:`DataQualityError`.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    overlap_node_count: int
+    split_value_counts: dict[str, int] = Field(default_factory=dict)
+
+
+class NodeClassificationSupervisionStats(BaseModel):
+    """Aggregated NC supervision-tier results for one labeled node type.
+
+    Holds the BQ-side metrics that aren't covered by the TFDV slicing
+    pass (per-class degree, homophily, split-leakage). Per-class label
+    histograms and per-class feature null-rates are produced by the
+    feature profiler via TFDV ``slice_functions`` and surface there.
+
+    ``sentinel_degree_stats`` carries the same shape as ``per_class_degree``
+    but for rows whose label matches a value declared in
+    ``NodeTableSpec.label_sentinel_values`` (e.g. ``-1``). Surfacing the
+    sentinel pool's degree distribution exposes whether "no ground-truth
+    label" rows are mostly cold-start (cheap to keep as message-passing
+    context) or mostly hubs (will dominate aggregation and bias the model).
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    node_type: str
+    label_column: str
+    sentinel_stats: LabelSentinelStats
+    per_class_degree: list[PerClassDegreeStats] = Field(default_factory=list)
+    sentinel_degree_stats: list[PerClassDegreeStats] = Field(default_factory=list)
+    homophily: list[HomophilyStats] = Field(default_factory=list)
+    cross_split_overlap: Optional[CrossSplitOverlap] = None
+
+
+class SupervisionCrossTableStats(BaseModel):
+    """Per-anchor cross-table statistics for one (driver, other) edge-table pair.
+
+    Computed from a positive (or negative) supervision edge table — the
+    *driver* — against another edge table — the *other* — that shares the
+    same ``(src_node_type, dst_node_type)``. The driver defines the anchor
+    population: distinct anchor IDs that appear in the driver. For each such
+    anchor we count how many edges it has in ``other`` and report the
+    distribution. ``overlap_pair_count`` flags ``(anchor, neighbor)`` pairs
+    that appear in both tables — typically a label-leakage signal.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    driver_edge_type: str
+    driver_role: str
+    other_edge_type: str
+    other_role: str
+    node_anchor: str
+
+    driver_anchor_count: int
+    driver_pair_count: int
+    other_pair_count: int
+    overlap_pair_count: int
+    driver_anchors_with_zero_other: int
+
+    avg_other_per_driver_anchor: float
+    p50_other_per_driver_anchor: int
+    p90_other_per_driver_anchor: int
+    p99_other_per_driver_anchor: int
+    max_other_per_driver_anchor: int
+
+
+class GraphAnalysisResult(BaseModel):
+    """Complete result of graph structure analysis across all tiers.
+
+    Tier 1 fields are always populated. Tier 3/4 fields may be empty
+    dicts if the corresponding checks were not applicable or not enabled.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    # Tier 1: hard fails
+    duplicate_node_counts: dict[str, int] = Field(default_factory=dict)
+    dangling_edge_counts: dict[str, int] = Field(default_factory=dict)
+    referential_integrity_violations: dict[str, int] = Field(default_factory=dict)
+
+    # Tier 2: core metrics
+    node_counts: dict[str, int] = Field(default_factory=dict)
+    edge_counts: dict[str, int] = Field(default_factory=dict)
+    null_rates: dict[str, dict[str, float]] = Field(default_factory=dict)
+    duplicate_edge_counts: dict[str, int] = Field(default_factory=dict)
+    self_loop_counts: dict[str, int] = Field(default_factory=dict)
+    isolated_node_counts: dict[str, int] = Field(default_factory=dict)
+    degree_stats: dict[str, DegreeStats] = Field(default_factory=dict)
+    top_hubs: dict[str, list[tuple[str, int]]] = Field(default_factory=dict)
+    super_hub_int16_clamp_count: dict[str, int] = Field(default_factory=dict)
+    cold_start_node_counts: dict[str, int] = Field(default_factory=dict)
+    feature_memory_bytes: dict[str, int] = Field(default_factory=dict)
+    neighbor_explosion_estimate: dict[str, int] = Field(default_factory=dict)
+
+    # Tier 3: label and heterogeneous
+    class_imbalance: dict[str, dict[str, int]] = Field(default_factory=dict)
+    label_coverage: dict[str, float] = Field(default_factory=dict)
+    edge_type_distribution: dict[str, int] = Field(default_factory=dict)
+    edge_type_node_coverage: dict[str, dict[str, int]] = Field(default_factory=dict)
+
+    # Tier 4: opt-in
+    reciprocity: dict[str, float] = Field(default_factory=dict)
+    power_law_exponent: dict[str, float] = Field(default_factory=dict)
+
+    # Supervision cross-table analysis (per (driver, other) edge-table pair).
+    supervision_cross_table_stats: list[SupervisionCrossTableStats] = Field(
+        default_factory=list
+    )
+
+    # Node-classification supervision tier (per labeled node type).
+    node_classification_supervision_stats: list[
+        NodeClassificationSupervisionStats
+    ] = Field(default_factory=list)
+
+    # Per-block rendered BQ SQL captured at execution time, keyed by a
+    # block identifier the report JS uses to locate the corresponding
+    # section header. The flat shape (a single ``dict[str, list[str]]``)
+    # is intentional — the JS does dict lookups, not parsing.
+    queries: dict[str, list[str]] = Field(default_factory=dict)
+
+
+class FeatureProfileError(BaseModel):
+    """One per-table failure or skip captured during feature profiling.
+
+    Surfaces to the HTML report so users can see *why* a table did not
+    produce a FACETS embed instead of silently missing from the result.
+
+    ``stage`` is one of:
+        * ``"schema_fetch"`` — BigQuery schema lookup raised
+        * ``"empty_projection"`` — no profileable columns after projection
+        * ``"dataflow"`` — the per-table Dataflow pipeline raised
+        * ``"embedding_diagnostics"`` — the post-Dataflow diagnostics query raised
+
+    For ``stage == "dataflow"`` we additionally try to capture the Dataflow
+    job identifiers so the report can deep-link to the failed job's logs:
+
+        * ``job_id`` — the Dataflow job UUID, or ``None`` if the runner
+          isn't Dataflow / the result didn't expose one.
+        * ``job_name`` — the human-readable job name (e.g.
+          ``gigl-analyzer-svij-test-20260506-1430-profile-node-user``).
+        * ``console_url`` — a link to the Dataflow console for the job, or
+          ``None`` when ``job_id`` / region / project couldn't be resolved.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    result_key: str
+    bq_table: str
+    stage: str
+    message: str
+    job_id: Optional[str] = None
+    job_name: Optional[str] = None
+    console_url: Optional[str] = None
+
+
+class FeatureProfileResult(BaseModel):
+    """Result of TFDV feature profiling across all tables.
+
+    ``facets_html_paths`` / ``stats_paths`` point at per-table GCS
+    artifacts produced by the Dataflow pipelines. When a node table
+    declares ``label_column`` or ``split_column`` the profiler enables
+    TFDV ``slice_functions`` on those columns; the resulting per-slice
+    stats live alongside the unsliced stats in the same TFRecord and the
+    per-slice listing is surfaced via ``slice_columns_by_result_key`` so
+    consumers know which slices to expect.
+
+    ``embedding_diagnostics`` is keyed by result_key
+    (``"node:{type}"`` / ``"edge:{type}"``) and then by embedding column
+    name. ``errors`` collects per-table failures or skips so consumers
+    (and the HTML report) can show why a particular table did not produce
+    facets.
+    """
+
+    model_config = ConfigDict(extra="forbid")
+
+    # List-valued so a wide table can be split into multiple per-chunk
+    # Dataflow pipelines (one Facets HTML + stats TFRecord per chunk).
+    # Single-chunk tables produce a list of length 1.
+    facets_html_paths: dict[str, list[str]] = Field(default_factory=dict)
+    stats_paths: dict[str, list[str]] = Field(default_factory=dict)
+    schema_paths: dict[str, list[str]] = Field(default_factory=dict)
+    anomalies: dict[str, list[str]] = Field(default_factory=dict)
+    embedding_diagnostics: dict[str, dict[str, EmbeddingDiagnosticsResult]] = Field(
+        default_factory=dict
+    )
+    errors: list[FeatureProfileError] = Field(default_factory=list)
+
+    # Per-result-key list of column names that were used as TFDV slice
+    # functions. Consumers (the HTML report, downstream gates) read this
+    # to know which slice listings to render from the TFDV stats.
+    slice_columns_by_result_key: dict[str, list[str]] = Field(default_factory=dict)
+
+
+class GraphStructureArtifact(BaseModel):
+    """Versioned envelope for a :class:`GraphAnalysisResult`."""
+
+    model_config = ConfigDict(extra="forbid")
+
+    schema_version: Literal["1"] = SCHEMA_VERSION
+    component: Literal["graph_structure"] = "graph_structure"
+    generated_at: datetime
+    data: GraphAnalysisResult
+
+
+class FeatureProfileArtifact(BaseModel):
+    """Versioned envelope for a :class:`FeatureProfileResult`."""
+
+    model_config = ConfigDict(extra="forbid")
+
+    schema_version: Literal["1"] = SCHEMA_VERSION
+    component: Literal["feature_profile"] = "feature_profile"
+    generated_at: datetime
+    data: FeatureProfileResult
+
+
+def write_artifact(
+    result: Union[GraphAnalysisResult, FeatureProfileResult],
+    component: _Component,
+    output_gcs_path: str,
+) -> str:
+    """Serialize ``result`` into a versioned envelope and persist it.
+
+    Writes ``{output_gcs_path}/{component}.json``. If ``output_gcs_path``
+    starts with ``gs://`` the payload is uploaded via ``GcsUtils``; otherwise
+    the parent directory is created and the file is written locally.
+
+    Args:
+        result: The component's in-memory result model.
+        component: Which component is writing (``"graph_structure"`` or
+            ``"feature_profile"``). Must match the ``result`` type.
+        output_gcs_path: Directory URI or local path. Trailing slashes are
+            stripped.
+
+    Returns:
+        The full path (GCS URI or absolute local path) that was written.
+
+    Raises:
+        TypeError: If ``result`` does not match the declared ``component``.
+        ValueError: If ``component`` is not one of the known literals.
+    """
+    now = datetime.now(timezone.utc)
+    if component == "graph_structure":
+        if not isinstance(result, GraphAnalysisResult):
+            raise TypeError(
+                f"component='graph_structure' expects GraphAnalysisResult, "
+                f"got {type(result).__name__}"
+            )
+        artifact: BaseModel = GraphStructureArtifact(generated_at=now, data=result)
+    elif component == "feature_profile":
+        if not isinstance(result, FeatureProfileResult):
+            raise TypeError(
+                f"component='feature_profile' expects FeatureProfileResult, "
+                f"got {type(result).__name__}"
+            )
+        artifact = FeatureProfileArtifact(generated_at=now, data=result)
+    else:
+        raise ValueError(
+            f"component={component!r} must be 'graph_structure' or 'feature_profile'"
+        )
+
+    payload = artifact.model_dump_json(indent=2)
+    trimmed = output_gcs_path.rstrip("/")
+    path = f"{trimmed}/{component}.json"
+    if trimmed.startswith("gs://"):
+        GcsUtils().upload_from_string(GcsUri(path), payload)
+    else:
+        local_path = Path(path).expanduser().resolve()
+        local_path.parent.mkdir(parents=True, exist_ok=True)
+        local_path.write_text(payload)
+        path = str(local_path)
+    logger.info(f"Wrote {component} artifact to {path}")
+    return path
+
+
+def load_artifact(
+    path: str, expected_component: _Component
+) -> Union[GraphAnalysisResult, FeatureProfileResult]:
+    """Load and validate a JSON sidecar, returning its ``.data`` payload.
+
+    Args:
+        path: GCS URI (``gs://...``) or local filesystem path to the JSON.
+        expected_component: Which component's artifact is expected. A
+            mismatch raises ``ValueError`` rather than silently returning
+            the wrong type.
+
+    Returns:
+        The component's result model (``GraphAnalysisResult`` or
+        ``FeatureProfileResult``).
+
+    Raises:
+        ValueError: If the loaded envelope's ``component`` does not match
+            ``expected_component``, or its ``schema_version`` is unknown.
+    """
+    if path.startswith("gs://"):
+        text = GcsUtils().read_from_gcs(GcsUri(path))
+    else:
+        text = Path(path).expanduser().resolve().read_text()
+
+    if expected_component == "graph_structure":
+        artifact_gs: GraphStructureArtifact = (
+            GraphStructureArtifact.model_validate_json(text)
+        )
+        return artifact_gs.data
+    elif expected_component == "feature_profile":
+        artifact_fp: FeatureProfileArtifact = (
+            FeatureProfileArtifact.model_validate_json(text)
+        )
+        return artifact_fp.data
+    else:
+        raise ValueError(
+            f"expected_component={expected_component!r} must be "
+            f"'graph_structure' or 'feature_profile'"
+        )
+
+
+__all__ = [
+    "SCHEMA_VERSION",
+    "TopKEntry",
+    "EmbeddingDiagnosticsResult",
+    "DegreeStats",
+    "PerClassDegreeStats",
+    "HomophilyStats",
+    "LabelSentinelStats",
+    "CrossSplitOverlap",
+    "NodeClassificationSupervisionStats",
+    "SupervisionCrossTableStats",
+    "GraphAnalysisResult",
+    "FeatureProfileError",
+    "FeatureProfileResult",
+    "GraphStructureArtifact",
+    "FeatureProfileArtifact",
+    "write_artifact",
+    "load_artifact",
+]
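
Review note on `HomophilyStats`: the docstring cites Platonov et al. for `adjusted_homophily` but does not state the formula. Below is a minimal pure-Python sketch of both measures, assuming the degree-weighted class-prior correction from that paper: `h_adj = (h_edge - sum_k p_k^2) / (1 - sum_k p_k^2)`, where `p_k` is the total degree of class-k nodes divided by `2|E|`. The function names and the toy graphs are hypothetical; the analyzer computes these in BigQuery over sampled edges and may differ in detail (for example in how directed edges or sentinel labels are handled).

```python
from collections import Counter
from typing import Hashable, Sequence

Edge = tuple[Hashable, Hashable]


def edge_homophily(edges: Sequence[Edge], labels: dict) -> float:
    """Fraction of edges whose two endpoints carry the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def adjusted_homophily(edges: Sequence[Edge], labels: dict) -> float:
    """Edge homophily corrected for degree-weighted class priors.

    Assumes the formula from the paper cited in the docstring; this is an
    illustration, not the analyzer's BigQuery implementation.
    """
    h_edge = edge_homophily(edges, labels)
    # Each edge contributes one unit of degree to the class of each endpoint.
    degree_mass: Counter = Counter()
    for u, v in edges:
        degree_mass[labels[u]] += 1
        degree_mass[labels[v]] += 1
    total = 2 * len(edges)
    prior = sum((mass / total) ** 2 for mass in degree_mass.values())
    if prior >= 1.0:
        # Single-class graph: the measure is undefined; 0.0 is an assumption here.
        return 0.0
    return (h_edge - prior) / (1.0 - prior)


labels = {0: "a", 1: "a", 2: "b", 3: "b"}
mixed = [(0, 1), (2, 3), (0, 2), (1, 3)]   # half the edges are same-class
pure = [(0, 1), (2, 3)]                    # every edge is same-class
bipartite = [(0, 2), (1, 3)]               # no edge is same-class

print(edge_homophily(mixed, labels), adjusted_homophily(mixed, labels))
print(adjusted_homophily(pure, labels), adjusted_homophily(bipartite, labels))
```

On the `mixed` graph the raw edge homophily is 0.5 but the adjusted value is 0, which matches the docstring's reading of "no signal beyond class priors" for a balanced two-class graph.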
diff --git a/gigl/common/beam/tfdv_transforms.py b/gigl/common/beam/tfdv_transforms.py
new file mode 100644
index 000000000..63045eb7d
--- /dev/null
+++ b/gigl/common/beam/tfdv_transforms.py
@@ -0,0 +1,258 @@
+"""Shared TFDV / Beam PTransforms usable by the data preprocessor and analytics.
+
+Exposes:
+  * ``GenerateAndVisualizeStats`` - Runs ``tfdv.GenerateStatistics`` over a
+    ``PCollection[pa.RecordBatch]`` and writes both a Facets HTML
+    visualization and a TFDV stats TFRecord.
+  * ``BqTableToRecordBatch`` - Reads the given columns from a BigQuery table
+    and emits ``PCollection[pa.RecordBatch]`` suitable for TFDV. Schema is
+    inferred from row values; no pre-declared TFDV schema is required.
+"""
+
+from typing import Iterable, Optional
+
+import apache_beam as beam
+import pyarrow as pa
+import tensorflow_data_validation as tfdv
+from apache_beam.io.gcp.bigquery import BigQueryQueryPriority
+from apache_beam.io.gcp.internal.clients.bigquery import DatasetReference
+from apache_beam.pvalue import PBegin, PCollection
+from apache_beam.transforms.window import GlobalWindow
+from apache_beam.utils.windowed_value import WindowedValue
+from tensorflow_metadata.proto.v0 import statistics_pb2
+
+from gigl.common import Uri
+from gigl.common.beam.sharded_read import BigQueryShardedReadConfig
+
+_DEFAULT_BQ_READ_BATCH_SIZE = 1000
+
+# Frozen at module load so unit tests that patch ``beam.io.ReadFromBigQuery``
+# wholesale don't accidentally mask the ``Method`` enum value.
+_BQ_READ_METHOD_EXPORT = beam.io.ReadFromBigQuery.Method.EXPORT
+
+
+class GenerateAndVisualizeStats(beam.PTransform):
+    """Generate TFDV statistics and a Facets HTML visualization from a record
+    batch ``PCollection``.
+
+    Writes two side-effect outputs:
+      * A single-shard Facets HTML file at ``facets_report_uri``.
+      * A TFRecord of ``DatasetFeatureStatisticsList`` at ``stats_output_uri``.
+
+    Args:
+        facets_report_uri: URI for the Facets HTML visualization (typically
+            a ``GcsUri``; a ``LocalUri`` is also accepted for tests).
+        stats_output_uri: URI (file prefix) for the TFDV stats TFRecord.
+        stats_options: Optional ``tfdv.StatsOptions`` to configure
+            slicing, schema-based hints, etc. When ``None``, TFDV uses
+            its defaults (no slicing). Callers that need per-class /
+            per-split TFDV stats wire ``slice_functions`` here.
+    """
+
+    def __init__(
+        self,
+        facets_report_uri: Uri,
+        stats_output_uri: Uri,
+        stats_options: Optional[tfdv.StatsOptions] = None,
+    ):
+        self.facets_report_uri = facets_report_uri
+        self.stats_output_uri = stats_output_uri
+        self.stats_options = stats_options
+
+    def expand(
+        self, features: PCollection[pa.RecordBatch]
+    ) -> PCollection[statistics_pb2.DatasetFeatureStatisticsList]:
+        if self.stats_options is not None:
+            stats = features | "Generate TFDV statistics" >> tfdv.GenerateStatistics(
+                options=self.stats_options
+            )
+        else:
+            stats = features | "Generate TFDV statistics" >> tfdv.GenerateStatistics()
+
+        _ = (
+            stats
+            | "Generate stats visualization"
+            >> beam.Map(tfdv.utils.display_util.get_statistics_html)
+            | "Write stats Facets report HTML"
+            >> beam.io.WriteToText(
+                self.facets_report_uri.uri, num_shards=1, shard_name_template=""
+            )
+        )
+
+        _ = (
+            stats
+            | "Write TFDV stats output TFRecord"
+            >> tfdv.WriteStatisticsToTFRecord(self.stats_output_uri.uri)
+        )
+
+        return stats
+
+
+class _RowsToRecordBatchDoFn(beam.DoFn):
+    """Buffer incoming row dicts and emit ``pa.RecordBatch`` batches.
+
+    Each output column is encoded as an Arrow list-typed column
+    (``list<T>``) with NULLs mapped to Arrow nulls, matching TFDV's
+    expectation that each feature column be a ``(Large)List<primitive|struct>``
+    (or null). See ``tfdv.utils.stats_util.get_feature_type_from_arrow_type``.
+    """
+
+    def __init__(self, batch_size: int, feature_columns: list[str]):
+        self._batch_size = batch_size
+        self._feature_columns = feature_columns
+        self._buffer: list[dict] = []
+
+    def start_bundle(self) -> None:
+        self._buffer = []
+
+    def process(self, element: dict) -> Iterable[pa.RecordBatch]:
+        self._buffer.append(element)
+        if len(self._buffer) >= self._batch_size:
+            yield self._drain()
+
+    def finish_bundle(self) -> Iterable[WindowedValue]:
+        if self._buffer:
+            yield WindowedValue(
+                value=self._drain(),
+                timestamp=0,
+                windows=(GlobalWindow(),),
+            )
+
+    def _drain(self) -> pa.RecordBatch:
+        buffered = self._buffer
+        self._buffer = []
+        column_values: dict[str, list] = {col: [] for col in self._feature_columns}
+        for row in buffered:
+            for col in self._feature_columns:
+                value = row[col]
+                column_values[col].append(None if value is None else [value])
+        return pa.RecordBatch.from_pydict(
+            {col: pa.array(values) for col, values in column_values.items()}
+        )
+
+
+class BqTableToRecordBatch(beam.PTransform):
+    """Read selected columns from a BigQuery table and emit Arrow record batches.
+
+    The output is a ``PCollection[pa.RecordBatch]`` whose columns are Arrow
+    list-typed (``list<T>``), which is the shape TFDV expects. Schema is
+    inferred from row values; rows with NULL values are represented as Arrow
+    nulls (missing features).
+
+    ``projection`` is a list of ``(column_name, sql_expression)`` pairs. Each
+    pair renders as ``{sql_expression} AS \\`{column_name}\\``` in the
+    SELECT. For plain scalar columns the pair is ``("age", "\\`age\\`")``;
+    for derived columns (e.g. array hygiene companions) the expression is a
+    full SQL fragment. The ``column_name`` is the identifier used downstream
+    in the record batch and in TFDV stats.
+
+    When ``sharded_read_config`` is provided, the read is split into
+    ``num_shards`` parallel BQ queries with
+    ``WHERE ABS(MOD(FARM_FINGERPRINT(CAST(shard_key AS STRING)), N)) = i``
+    and ``Flatten``-ed back together. This mirrors
+    :class:`gigl.common.beam.sharded_read.ShardedExportRead` and avoids the
+    "single giant export" pattern that hangs Dataflow's ``SplitWithSizing``
+    on very large tables (oversized status update payloads, slow GCS Avro
+    reads). Without it, the read goes through a single
+    ``beam.io.ReadFromBigQuery``.
+
+    Args:
+        bq_table: Fully qualified ``project.dataset.table`` reference.
+        projection: ``(column_name, sql_expression)`` pairs to SELECT.
+        batch_size: Rows per emitted ``RecordBatch``. Defaults to 1000.
+        bq_project: Optional GCP project to bill the read against. Defaults to
+            the project inferred by ``beam.io.ReadFromBigQuery``.
+        sharded_read_config: Optional sharded read config. When set, fans the
+            read into ``num_shards`` parallel ``EXPORT``-method reads keyed
+            on ``shard_key``.
+    """
+
+    def __init__(
+        self,
+        bq_table: str,
+        projection: list[tuple[str, str]],
+        batch_size: int = _DEFAULT_BQ_READ_BATCH_SIZE,
+        bq_project: Optional[str] = None,
+        sharded_read_config: Optional[BigQueryShardedReadConfig] = None,
+    ):
+        if not projection:
+            raise ValueError(
+                f"BqTableToRecordBatch requires at least one projected column "
+                f"for table {bq_table!r}"
+            )
+        if sharded_read_config is not None and sharded_read_config.num_shards <= 0:
+            raise ValueError(
+                f"sharded_read_config.num_shards must be > 0, got "
+                f"{sharded_read_config.num_shards}"
+            )
+        self.bq_table = bq_table
+        self.projection = projection
+        self.batch_size = batch_size
+        self.bq_project = bq_project
+        self.sharded_read_config = sharded_read_config
+
+    def expand(self, pbegin: PBegin) -> PCollection[pa.RecordBatch]:
+        if not isinstance(pbegin, PBegin):
+            raise TypeError(
+                f"Input to {BqTableToRecordBatch.__name__} transform must be "
+                f"a PBegin but found {pbegin}"
+            )
+        column_list = ", ".join(f"{expr} AS `{name}`" for name, expr in self.projection)
+        column_names = [name for name, _ in self.projection]
+
+        if self.sharded_read_config is not None:
+            rows = self._sharded_read(pbegin, column_list)
+        else:
+            rows = self._single_read(pbegin, column_list)
+
+        return rows | "Buffer rows and emit record batches" >> beam.ParDo(
+            _RowsToRecordBatchDoFn(
+                batch_size=self.batch_size,
+                feature_columns=column_names,
+            )
+        )
+
+    def _single_read(self, pbegin: PBegin, column_list: str) -> PCollection[dict]:
+        query = f"SELECT {column_list} FROM `{self.bq_table}`"
+        read_kwargs: dict = {
+            "query": query,
+            "use_standard_sql": True,
+        }
+        if self.bq_project is not None:
+            read_kwargs["project"] = self.bq_project
+        return pbegin | "Read feature rows from BQ" >> beam.io.ReadFromBigQuery(
+            **read_kwargs
+        )
+
+    def _sharded_read(self, pbegin: PBegin, column_list: str) -> PCollection[dict]:
+        # ABS(MOD(FARM_FINGERPRINT(...), N)) = i, mirroring ShardedExportRead.
+        # MOD is taken before ABS because ABS errors on the largest negative
+        # INT64; doing it in this order keeps every shard index in
+        # [0, num_shards-1].
+        config = self.sharded_read_config
+        assert config is not None  # for mypy; guarded by the caller branch.
+        temp_dataset = DatasetReference(
+            projectId=config.project_id, datasetId=config.temp_dataset_name
+        )
+        per_shard: list[PCollection[dict]] = []
+        for i in range(config.num_shards):
+            query = (
+                f"SELECT {column_list} FROM `{self.bq_table}` "
+                f"WHERE ABS(MOD(FARM_FINGERPRINT(CAST({config.shard_key} AS STRING)), "
+                f"{config.num_shards})) = {i}"
+            )
+            read_kwargs: dict = {
+                "query": query,
+                "use_standard_sql": True,
+                "method": _BQ_READ_METHOD_EXPORT,
+                "query_priority": BigQueryQueryPriority.INTERACTIVE,
+                "temp_dataset": temp_dataset,
+            }
+            if self.bq_project is not None:
+                read_kwargs["project"] = self.bq_project
+            per_shard.append(
+                pbegin
+                | f"Read feature rows from BQ shard {i}/{config.num_shards}"
+                >> beam.io.ReadFromBigQuery(**read_kwargs)
+            )
+        return per_shard | "Flatten BQ shards" >> beam.Flatten()
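
Review note on `_sharded_read`: the comment about taking `MOD` before `ABS` can be checked outside BigQuery. The sketch below emulates the two functions in plain Python, under the assumption that BigQuery's `MOD` returns a remainder with the sign of the dividend and that `ABS` fails on the most negative INT64. `bq_mod`, `shard_index`, and `shard_predicates` are hypothetical helper names, and arbitrary signed 64-bit integers stand in for `FARM_FINGERPRINT` output.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def bq_mod(x: int, n: int) -> int:
    """Emulate a MOD whose result takes the sign of the dividend.

    Python's ``%`` follows the sign of the divisor, so it is not used directly.
    """
    r = abs(x) % n
    return -r if x < 0 else r


def shard_index(fingerprint: int, num_shards: int) -> int:
    """ABS(MOD(fingerprint, num_shards)): always in [0, num_shards - 1]."""
    return abs(bq_mod(fingerprint, num_shards))


def shard_predicates(shard_key: str, num_shards: int) -> list[str]:
    """Render one WHERE predicate per shard, in the shape used by the diff."""
    return [
        f"ABS(MOD(FARM_FINGERPRINT(CAST({shard_key} AS STRING)), {num_shards})) = {i}"
        for i in range(num_shards)
    ]


# ABS applied first would need to represent 2**63, which does not fit in INT64.
print(abs(INT64_MIN) > INT64_MAX)

# MOD first shrinks the magnitude below num_shards, so ABS is always safe.
for fp in (INT64_MIN, -7, 0, 7, INT64_MAX):
    print(fp, shard_index(fp, 7))

for predicate in shard_predicates("user_id", 2):
    print(predicate)
```

Because every fingerprint maps to exactly one index in the range, the per-shard queries partition the table: no row is read twice and none is dropped when the shards are flattened.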
diff --git a/gigl/src/common/constants/components.py b/gigl/src/common/constants/components.py
index 29e9e4091..ae52a5cbb 100644
--- a/gigl/src/common/constants/components.py
+++ b/gigl/src/common/constants/components.py
@@ -10,6 +10,7 @@ class GiGLComponents(Enum):
     Trainer = "trainer"
     Inferencer = "inferencer"
     PostProcessor = "post_processor"
+    DataAnalyzer = "data_analyzer"
 
     @property
     def kebab_case_value(self):
diff --git a/gigl/src/data_preprocessor/lib/transform/utils.py b/gigl/src/data_preprocessor/lib/transform/utils.py
index f2b990abf..9694005cc 100644
--- a/gigl/src/data_preprocessor/lib/transform/utils.py
+++ b/gigl/src/data_preprocessor/lib/transform/utils.py
@@ -2,11 +2,10 @@
 
 import apache_beam as beam
 import pyarrow as pa
-import tensorflow_data_validation as tfdv
 import tensorflow_transform
 import tfx_bsl
 from apache_beam.pvalue import PBegin, PCollection, PDone
-from tensorflow_metadata.proto.v0 import schema_pb2, statistics_pb2
+from tensorflow_metadata.proto.v0 import schema_pb2
 from tensorflow_transform import beam as tft_beam
 from tensorflow_transform.tf_metadata import schema_utils
 from tfx_bsl.tfxio.record_based_tfxio import RecordBasedTFXIO
@@ -117,35 +116,6 @@ def expand(self, pbegin: PBegin) -> PCollection[pa.RecordBatch]:
         )
 
 
-class GenerateAndVisualizeStats(beam.PTransform):
-    def __init__(self, facets_report_uri: GcsUri, stats_output_uri: GcsUri):
-        self.facets_report_uri = facets_report_uri
-        self.stats_output_uri = stats_output_uri
-
-    def expand(
-        self, features: PCollection[pa.RecordBatch]
-    ) -> PCollection[statistics_pb2.DatasetFeatureStatisticsList]:
-        stats = features | "Generate TFDV statistics" >> tfdv.GenerateStatistics()
-
-        _ = (
-            stats
-            | "Generate stats visualization"
-            >> beam.Map(tfdv.utils.display_util.get_statistics_html)
-            | "Write stats Facets report HTML"
-            >> beam.io.WriteToText(
-                self.facets_report_uri.uri, num_shards=1, shard_name_template=""
-            )
-        )
-
-        _ = (
-            stats
-            | "Write TFDV stats output TFRecord"
-            >> tfdv.WriteStatisticsToTFRecord(self.stats_output_uri.uri)
-        )
-
-        return stats
-
-
 class ReadExistingTFTransformFn(beam.PTransform):
     def __init__(self, tf_transform_directory: Uri):
         assert isinstance(tf_transform_directory, (GcsUri, LocalUri)), (
diff --git a/mypy.ini b/mypy.ini
index d488c2a83..cb42eaa75 100644
--- a/mypy.ini
+++ b/mypy.ini
@@ -19,6 +19,9 @@ ignore_missing_imports = True
 [mypy-tensorflow_data_validation]
 ignore_missing_imports = True
 
+[mypy-tensorflow_data_validation.*]
+ignore_missing_imports = True
+
 [mypy-tensorflow_metadata.*]
 ignore_missing_imports = True
 
diff --git a/pyproject.toml b/pyproject.toml
index d83e5587b..15c0495cf 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -246,6 +246,7 @@ gigl-post-install = "gigl.scripts.post_install:main"
 # Include dep_vars.env from the root directory
 "gigl" = ["dep_vars.env", "**/*.yaml"]
 "gigl.scripts" = ["*.sh"]
+"gigl.analytics.data_analyzer.report" = ["*.ai.html", "*.ai.js", "*.ai.css"]
 
 
 [tool.black]
diff --git a/tests/test_assets/analytics/__init__.py b/tests/test_assets/analytics/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/test_assets/analytics/golden_report.html b/tests/test_assets/analytics/golden_report.html
new file mode 100644
index 000000000..2c1db60d2
--- /dev/null
+++ b/tests/test_assets/analytics/golden_report.html
@@ -0,0 +1,1596 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="utf-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <title>GiGL Data Analysis Report</title>
+    <style>:root {
+    --color-ok: #28a745;
+    --color-warn: #ffc107;
+    --color-crit: #dc3545;
+    --color-bg: #f8f9fa;
+    --color-card-bg: #ffffff;
+    --color-border: #dee2e6;
+    --color-text: #212529;
+    --color-text-muted: #6c757d;
+    --font-sans: system-ui, -apple-system, "Segoe UI", Roboto, sans-serif;
+    --font-mono: ui-monospace, SFMono-Regular, Menlo, monospace;
+}
+
+* { box-sizing: border-box; }
+
+body {
+    max-width: 1200px;
+    margin: 0 auto;
+    padding: 24px;
+    font-family: var(--font-sans);
+    background: var(--color-bg);
+    color: var(--color-text);
+    line-height: 1.5;
+}
+
+h1, h2, h3, h4 {
+    font-family: var(--font-sans);
+    margin-top: 1.2em;
+    margin-bottom: 0.5em;
+}
+
+h1 { font-size: 1.8rem; }
+h2 { font-size: 1.4rem; border-bottom: 1px solid var(--color-border); padding-bottom: 4px; }
+h3 { font-size: 1.15rem; }
+
+.meta, .config-summary {
+    color: var(--color-text-muted);
+    font-size: 0.9rem;
+    margin: 4px 0;
+}
+
+.data-value {
+    font-family: var(--font-mono);
+    color: #111;
+}
+
+.status-green  { background: var(--color-ok);   color: #ffffff; padding: 2px 6px; border-radius: 3px; }
+.status-yellow { background: var(--color-warn); color: #212529; padding: 2px 6px; border-radius: 3px; }
+.status-red    { background: var(--color-crit); color: #ffffff; padding: 2px 6px; border-radius: 3px; }
+
+.status-dot {
+    display: inline-block;
+    width: 12px;
+    height: 12px;
+    border-radius: 50%;
+    vertical-align: middle;
+}
+.status-dot.status-green  { background: var(--color-ok); }
+.status-dot.status-yellow { background: var(--color-warn); }
+.status-dot.status-red    { background: var(--color-crit); }
+
+details {
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 6px;
+    padding: 8px 12px;
+    margin: 8px 0;
+}
+
+summary {
+    cursor: pointer;
+    font-weight: 600;
+    padding: 4px 0;
+    user-select: none;
+}
+
+.card-grid {
+    display: grid;
+    grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
+    gap: 12px;
+    margin-top: 8px;
+}
+
+.card {
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 6px;
+    padding: 16px;
+    text-align: center;
+}
+
+.card .card-label {
+    font-size: 0.85rem;
+    color: var(--color-text-muted);
+    margin-bottom: 6px;
+}
+
+.card .card-value {
+    font-family: var(--font-mono);
+    font-size: 1.4rem;
+    font-weight: 600;
+}
+
+table {
+    width: 100%;
+    border-collapse: collapse;
+    margin: 8px 0;
+    background: var(--color-card-bg);
+    font-size: 0.92rem;
+}
+
+th, td {
+    padding: 6px 10px;
+    border-bottom: 1px solid var(--color-border);
+    text-align: left;
+}
+
+th {
+    background: #f1f3f5;
+    font-weight: 600;
+}
+
+tbody tr:nth-child(even) {
+    background: #fafbfc;
+}
+
+td.numeric {
+    font-family: var(--font-mono);
+    text-align: right;
+}
+
+svg.histogram {
+    width: 100%;
+    max-width: 720px;
+    height: 220px;
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 6px;
+}
+
+svg.histogram .bar     { fill: #4c78a8; }
+svg.histogram .axis    { stroke: var(--color-border); stroke-width: 1; }
+svg.histogram .label   { font-family: var(--font-sans); font-size: 11px; fill: var(--color-text); }
+svg.histogram .value   { font-family: var(--font-mono); font-size: 11px; fill: var(--color-text); }
+
+iframe.facets-embed {
+    width: 100%;
+    min-height: 600px;
+    border: 0;
+}
+
+.facets-chunk-caption {
+    font-family: var(--font-sans);
+    font-size: 0.85em;
+    color: var(--color-text-muted, #666);
+    margin: 12px 0 4px 0;
+    font-weight: 600;
+}
+
+.warning-box {
+    background: var(--color-crit);
+    color: #ffffff;
+    padding: 12px 16px;
+    border-radius: 6px;
+    margin: 12px 0;
+    font-weight: 600;
+}
+
+.footer-list {
+    font-size: 0.85rem;
+    color: var(--color-text-muted);
+}
+
+.footer-list code { font-family: var(--font-mono); }
+
+.block-header {
+    display: flex;
+    align-items: center;
+    gap: 12px;
+    flex-wrap: wrap;
+    margin-top: 1.2em;
+    margin-bottom: 0.5em;
+}
+.block-header h3, .block-header h4 {
+    margin: 0;
+}
+
+details.query-disclosure {
+    padding: 2px 8px;
+    margin: 0;
+    font-size: 0.85rem;
+    background: var(--color-card-bg);
+    border: 1px solid var(--color-border);
+    border-radius: 4px;
+}
+details.query-disclosure summary {
+    font-weight: 500;
+    color: var(--color-text-muted);
+    padding: 0;
+}
+details.query-disclosure p.sql-key {
+    font-family: var(--font-mono);
+    font-size: 0.78rem;
+    color: var(--color-text-muted);
+    margin: 8px 0 2px 0;
+}
+details.query-disclosure pre.sql {
+    font-family: var(--font-mono);
+    font-size: 0.85rem;
+    background: #f1f3f5;
+    padding: 8px;
+    border-radius: 4px;
+    overflow-x: auto;
+    white-space: pre;
+    margin: 6px 0;
+}
+
+td.sparkline-cell {
+    padding: 2px 6px;
+}
+svg.histogram.sparkline {
+    width: 140px;
+    max-width: 140px;
+    height: 32px;
+    border: none;
+    background: transparent;
+    border-radius: 0;
+    display: block;
+}
+
+@media print {
+    body { background: #ffffff; padding: 0; max-width: none; }
+    details { break-inside: avoid; border: 1px solid #ccc; }
+    iframe.facets-embed { min-height: 400px; }
+    .card, table, svg.histogram { break-inside: avoid; }
+    details.query-disclosure { display: none; }
+}
+</style>
+</head>
+<body>
+    <header id="report-header">
+        <h1>GiGL Data Analysis Report</h1>
+        <p class="meta" id="report-meta"></p>
+        <p class="config-summary" id="report-config-summary"></p>
+    </header>
+
+    <section id="overview">
+        <h2>Overview</h2>
+        <div class="card-grid" id="overview-cards"></div>
+    </section>
+
+    <section id="data-quality">
+        <h2>Data Quality</h2>
+        <details open>
+            <summary>NULL rates per column</summary>
+            <div id="null-rates-container"></div>
+        </details>
+        <details open>
+            <summary>Integrity checks</summary>
+            <div id="integrity-container"></div>
+        </details>
+    </section>
+
+    <section id="feature-statistics" hidden>
+        <h2>Feature Statistics</h2>
+        <div id="feature-statistics-container"></div>
+    </section>
+
+    <section id="embedding-diagnostics" hidden>
+        <h2>Embedding Diagnostics</h2>
+        <div id="embedding-diagnostics-container"></div>
+    </section>
+
+    <section id="graph-structure">
+        <h2>Graph Structure</h2>
+        <details open>
+            <summary>Node and edge counts</summary>
+            <div id="counts-container"></div>
+        </details>
+        <details open>
+            <summary>Degree distribution</summary>
+            <div id="degree-container"></div>
+        </details>
+        <details open>
+            <summary>Top-20 hubs</summary>
+            <div id="hubs-container"></div>
+        </details>
+        <div id="super-hub-warning" hidden></div>
+    </section>
+
+    <section id="node-classification-supervision" hidden>
+        <h2>Node Classification Supervision</h2>
+        <div id="node-classification-supervision-container"></div>
+    </section>
+
+    <section id="supervision-overlap" hidden>
+        <h2>Supervision Overlap</h2>
+        <div id="supervision-overlap-container"></div>
+    </section>
+
+    <section id="advanced" hidden>
+        <h2>Advanced Metrics</h2>
+        <div id="advanced-container"></div>
+    </section>
+
+    <footer id="report-footer">
+        <h2>Artifacts</h2>
+        <div id="footer-container"></div>
+    </footer>
+
+    <script id="analysis-data" type="application/json">{"duplicate_node_counts": {"user": 0}, "dangling_edge_counts": {"follows": 0}, "referential_integrity_violations": {"follows": 0}, "node_counts": {"user": 1000000}, "edge_counts": {"follows": 5000000}, "null_rates": {"p.d.nodes": {"age": 0.05, "country": 0.12}}, "duplicate_edge_counts": {"follows": 150}, "self_loop_counts": {"follows": 0}, "isolated_node_counts": {"user": 8000}, "degree_stats": {"follows_out": {"min": 0, "max": 50000, "mean": 10.0, "median": 5, "p90": 25, "p99": 200, "p999": 5000, "percentiles": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100], "buckets": {"0-1": 100000, "2-10": 600000, "11-100": 250000, "101-1K": 45000, "1K-10K": 4500, "10K+": 500}}}, "top_hubs": {"follows_out": [["hub_1", 50000], ["hub_2", 35000]]}, "super_hub_int16_clamp_count": {"follows_out": 2}, "cold_start_node_counts": {"user": 100000}, "feature_memory_bytes": {"user": 8000000000}, "neighbor_explosion_estimate": {"follows": 75000}, "class_imbalance": {}, "label_coverage": {}, "edge_type_distribution": {}, "edge_type_node_coverage": {}, "reciprocity": {}, "power_law_exponent": {}, "supervision_cross_table_stats": [], "node_classification_supervision_stats": [], "queries": {}}</script>
+    <script id="profile-data" type="application/json">{}</script>
+    <script>(function () {
+    "use strict";
+
+    // Bucket order for degree histograms; must match GraphStructureAnalyzer output.
+    const BUCKET_ORDER = ["0-1", "2-10", "11-100", "101-1K", "1K-10K", "10K+"];
+
+    function parseJSONTag(id) {
+        const node = document.getElementById(id);
+        if (!node) return {};
+        const raw = (node.textContent || "").trim();
+        if (!raw) return {};
+        try {
+            return JSON.parse(raw);
+        } catch (e) {
+            console.error("Failed to parse JSON tag #" + id, e);
+            return {};
+        }
+    }
+
+    function createElement(tag, attrs, ...children) {
+        const el = document.createElement(tag);
+        if (attrs) {
+            for (const key of Object.keys(attrs)) {
+                const val = attrs[key];
+                if (val === null || val === undefined || val === false) continue;
+                if (key === "className") el.className = val;
+                else if (key === "text") el.textContent = val;
+                else if (key === "hidden") el.hidden = Boolean(val);
+                else el.setAttribute(key, val);
+            }
+        }
+        for (const child of children) {
+            if (child === null || child === undefined) continue;
+            if (typeof child === "string" || typeof child === "number") {
+                el.appendChild(document.createTextNode(String(child)));
+            } else {
+                el.appendChild(child);
+            }
+        }
+        return el;
+    }
+
+    function formatNumber(n) {
+        if (n === null || n === undefined) return "-";
+        if (typeof n !== "number") return String(n);
+        return n.toLocaleString("en-US");
+    }
+
+    function formatPercent(fraction) {
+        if (fraction === null || fraction === undefined) return "-";
+        return (fraction * 100).toFixed(2) + "%";
+    }
+
+    function classForThreshold(value, green, yellow) {
+        // value <= green -> green, value <= yellow -> yellow, else red.
+        if (value <= green) return "status-green";
+        if (value <= yellow) return "status-yellow";
+        return "status-red";
+    }
+
+    function classForNullRate(rate) {
+        if (rate > 0.9) return "status-red";
+        if (rate > 0.5) return "status-yellow";
+        return "status-green";
+    }
+
+    function sumValues(obj) {
+        if (!obj) return 0;
+        let total = 0;
+        for (const key of Object.keys(obj)) {
+            const v = obj[key];
+            if (typeof v === "number") total += v;
+        }
+        return total;
+    }
+
+    function hasAnyPositive(obj) {
+        if (!obj) return false;
+        for (const key of Object.keys(obj)) {
+            if (obj[key] > 0) return true;
+        }
+        return false;
+    }
+
+    // ---- Rendering ----
+
+    function renderHeader(analysis) {
+        const metaEl = document.getElementById("report-meta");
+        const cfgEl = document.getElementById("report-config-summary");
+        // Stamped when the page is opened in the browser, not when the analysis ran.
+        const now = new Date().toISOString();
+        metaEl.textContent = "Generated at " + now;
+
+        const nodeTypes = Object.keys(analysis.node_counts || {});
+        const edgeTypes = Object.keys(analysis.edge_counts || {});
+        cfgEl.textContent =
+            "Node tables: " + (nodeTypes.length ? nodeTypes.join(", ") : "(none)") +
+            " | Edge tables: " + (edgeTypes.length ? edgeTypes.join(", ") : "(none)");
+    }
+
+    function overallStatus(analysis) {
+        // Hard fails -> red.
+        if (hasAnyPositive(analysis.duplicate_node_counts) ||
+            hasAnyPositive(analysis.dangling_edge_counts) ||
+            hasAnyPositive(analysis.referential_integrity_violations) ||
+            hasAnyPositive(analysis.super_hub_int16_clamp_count)) {
+            return "status-red";
+        }
+        // Thresholded metrics: red above the upper bound, yellow above the lower one.
+        let warn = false;
+        const totalNodes = sumValues(analysis.node_counts);
+        if (totalNodes > 0) {
+            const isolatedFrac = sumValues(analysis.isolated_node_counts) / totalNodes;
+            const coldFrac = sumValues(analysis.cold_start_node_counts) / totalNodes;
+            if (isolatedFrac > 0.05 || coldFrac > 0.10) return "status-red";
+            if (isolatedFrac > 0.01 || coldFrac > 0.05) warn = true;
+        }
+        // NULL rates: checked before returning yellow so a red column is not masked.
+        const nullRates = analysis.null_rates || {};
+        for (const table of Object.keys(nullRates)) {
+            for (const col of Object.keys(nullRates[table])) {
+                const r = nullRates[table][col];
+                if (r > 0.9) return "status-red";
+            }
+        }
+        return warn ? "status-yellow" : "status-green";
+    }
+
+    function renderOverview(analysis) {
+        const container = document.getElementById("overview-cards");
+        const totalNodes = sumValues(analysis.node_counts);
+        const totalEdges = sumValues(analysis.edge_counts);
+        const nodeTypes = Object.keys(analysis.node_counts || {}).length;
+        const edgeTypes = Object.keys(analysis.edge_counts || {}).length;
+        const status = overallStatus(analysis);
+
+        const cards = [
+            ["Total nodes", formatNumber(totalNodes)],
+            ["Total edges", formatNumber(totalEdges)],
+            ["Node types", formatNumber(nodeTypes)],
+            ["Edge types", formatNumber(edgeTypes)],
+        ];
+        for (const [label, value] of cards) {
+            container.appendChild(createElement("div", { className: "card" },
+                createElement("div", { className: "card-label", text: label }),
+                createElement("div", { className: "card-value data-value", text: value })
+            ));
+        }
+        const statusLabel = status === "status-green" ? "OK" :
+                            status === "status-yellow" ? "WARNING" : "CRITICAL";
+        container.appendChild(createElement("div", { className: "card" },
+            createElement("div", { className: "card-label", text: "Overall status" }),
+            createElement("div", { className: "card-value" },
+                createElement("span", { className: status, text: statusLabel }))
+        ));
+    }
+
+    function renderNullRates(analysis, queriesMap) {
+        const container = document.getElementById("null-rates-container");
+        const rates = analysis.null_rates || {};
+        const rows = [];
+        for (const table of Object.keys(rates)) {
+            for (const col of Object.keys(rates[table])) {
+                rows.push({ table: table, column: col, rate: rates[table][col] });
+            }
+        }
+        if (rows.length === 0) {
+            container.appendChild(createElement("p", { text: "No NULL rate data available." }));
+            return;
+        }
+        const disc = renderQueryDisclosureByPrefix(
+            queriesMap, "data_quality:null_rates:"
+        );
+        if (disc) container.appendChild(disc);
+        rows.sort((a, b) => b.rate - a.rate);
+        const thead = createElement("thead", null,
+            createElement("tr", null,
+                createElement("th", { text: "Table" }),
+                createElement("th", { text: "Column" }),
+                createElement("th", { text: "NULL rate" })));
+        const tbody = createElement("tbody");
+        for (const r of rows) {
+            const cls = classForNullRate(r.rate);
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: r.table }),
+                createElement("td", { text: r.column }),
+                createElement("td", { className: "numeric" },
+                    createElement("span", { className: cls, text: formatPercent(r.rate) }))));
+        }
+        container.appendChild(createElement("table", null, thead, tbody));
+    }
+
+    function renderIntegrity(analysis, queriesMap) {
+        const container = document.getElementById("integrity-container");
+        const integrityPrefixes = [
+            "data_quality:duplicate_nodes:",
+            "data_quality:duplicate_edges:",
+            "data_quality:dangling_edges:",
+            "data_quality:referential_integrity:",
+            "graph_structure:self_loops:",
+            "graph_structure:isolated_nodes:",
+            "graph_structure:cold_start_nodes:",
+        ];
+        const aggregate = (queriesMap && Object.keys(queriesMap).length)
+            ? createElement("details", { className: "query-disclosure" })
+            : null;
+        if (aggregate) {
+            aggregate.appendChild(createElement("summary", { text: "Show SQL" }));
+            let any = false;
+            for (const prefix of integrityPrefixes) {
+                for (const key of Object.keys(queriesMap)) {
+                    if (key.indexOf(prefix) !== 0) continue;
+                    for (const sql of (queriesMap[key] || [])) {
+                        aggregate.appendChild(createElement("p", {
+                            className: "sql-key",
+                            text: key,
+                        }));
+                        aggregate.appendChild(createElement("pre", {
+                            className: "sql",
+                            text: sql,
+                        }));
+                        any = true;
+                    }
+                }
+            }
+            if (any) container.appendChild(aggregate);
+        }
+        const rows = [
+            ["Duplicate nodes", analysis.duplicate_node_counts],
+            ["Duplicate edges", analysis.duplicate_edge_counts],
+            ["Dangling edges", analysis.dangling_edge_counts],
+            ["Referential integrity violations", analysis.referential_integrity_violations],
+            ["Self loops", analysis.self_loop_counts],
+            ["Isolated nodes", analysis.isolated_node_counts],
+            ["Cold-start nodes (degree 0-1)", analysis.cold_start_node_counts],
+        ];
+        const thead = createElement("thead", null,
+            createElement("tr", null,
+                createElement("th", { text: "Check" }),
+                createElement("th", { text: "Per-type counts" }),
+                createElement("th", { text: "Total" })));
+        const tbody = createElement("tbody");
+        for (const [label, obj] of rows) {
+            const total = sumValues(obj);
+            const isHardFail = (label === "Duplicate nodes" ||
+                                label === "Dangling edges" ||
+                                label === "Referential integrity violations");
+            const cls = isHardFail
+                ? (total > 0 ? "status-red" : "status-green")
+                : (total > 0 ? "status-yellow" : "status-green");
+            const detail = obj && Object.keys(obj).length
+                ? Object.keys(obj).map(k => k + ": " + formatNumber(obj[k])).join(", ")
+                : "(none)";
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: label }),
+                createElement("td", { className: "data-value", text: detail }),
+                createElement("td", { className: "numeric" },
+                    createElement("span", { className: cls, text: formatNumber(total) }))));
+        }
+        container.appendChild(createElement("table", null, thead, tbody));
+    }
+
+    function relativeFacetsPath(resultKey, chunkIndex, totalChunks) {
+        // result_key is "node:user" / "edge:engagement"; the FeatureProfiler
+        // writes facets.html to {output_gcs_path}/feature_profiler/{kind}s/{type}/facets.html
+        // for single-chunk tables, or to .../{type}/chunk_NN/facets.html when
+        // the projection was split across multiple Dataflow pipelines.
+        // Using a relative src means the embed and "full-screen" link both work
+        // when the report folder is downloaded from GCS as-is.
+        const parts = resultKey.split(":");
+        if (parts.length !== 2) return null;
+        const kind = parts[0];
+        const typeName = parts[1];
+        if (!kind || !typeName) return null;
+        const total = typeof totalChunks === "number" ? totalChunks : 1;
+        const idx = typeof chunkIndex === "number" ? chunkIndex : 0;
+        const subdir = total > 1 ? "chunk_" + String(idx).padStart(2, "0") + "/" : "";
+        return "feature_profiler/" + kind + "s/" + typeName + "/" + subdir + "facets.html";
+    }
+
+    function renderFeatureProfileErrors(container, errors) {
+        if (!errors || !errors.length) return;
+        const card = createElement("div", { className: "warning-box" });
+        card.appendChild(createElement("strong", {
+            text: "Feature profiling errors (" + errors.length + ")",
+        }));
+        const tbody = createElement("tbody");
+        for (const err of errors) {
+            const jobCell = createElement("td", { className: "data-value" });
+            if (err.console_url) {
+                const link = createElement("a", {
+                    href: err.console_url,
+                    target: "_blank",
+                    rel: "noopener noreferrer",
+                    text: err.job_name || err.job_id || "Open Dataflow job ↗",
+                });
+                jobCell.appendChild(link);
+                if (err.job_id) {
+                    jobCell.appendChild(createElement("br"));
+                    jobCell.appendChild(createElement("span",
+                        { className: "data-value", text: err.job_id }));
+                }
+            } else if (err.job_name || err.job_id) {
+                jobCell.textContent = err.job_name || err.job_id;
+            } else {
+                jobCell.textContent = "—";
+            }
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: err.result_key || "" }),
+                createElement("td", { text: err.stage || "" }),
+                createElement("td", { className: "data-value", text: err.bq_table || "" }),
+                jobCell,
+                createElement("td", { className: "data-value", text: err.message || "" })));
+        }
+        const table = createElement("table", null,
+            createElement("thead", null, createElement("tr", null,
+                createElement("th", { text: "Table key" }),
+                createElement("th", { text: "Stage" }),
+                createElement("th", { text: "BQ table" }),
+                createElement("th", { text: "Dataflow job" }),
+                createElement("th", { text: "Message" }))),
+            tbody);
+        const details = createElement("details", { open: "" },
+            createElement("summary", { text: "Errors and skipped tables" }),
+            table);
+        container.appendChild(card);
+        container.appendChild(details);
+    }
+
+    function renderFeatureStatistics(profile) {
+        const section = document.getElementById("feature-statistics");
+        const container = document.getElementById("feature-statistics-container");
+        const facets = (profile && profile.facets_html_paths) || {};
+        const errors = (profile && profile.errors) || [];
+        const keys = Object.keys(facets);
+        if (keys.length === 0 && errors.length === 0) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        renderFeatureProfileErrors(container, errors);
+        for (const resultKey of keys) {
+            // Sidecar may be either a list of GCS URIs (one per chunk for
+            // wide tables that were split across multiple Dataflow pipelines)
+            // or a bare string (legacy single-Facets shape). Normalize.
+            const value = facets[resultKey];
+            const paths = Array.isArray(value) ? value : (value ? [value] : []);
+            const totalChunks = paths.length;
+            const summary = createElement("summary", { text: "FACETS: " + resultKey });
+            const details = createElement("details", { open: "" }, summary);
+
+            for (let i = 0; i < paths.length; i++) {
+                const relPath = relativeFacetsPath(resultKey, i, totalChunks);
+                const absPath = paths[i] || "";
+                if (totalChunks > 1) {
+                    details.appendChild(createElement("div", {
+                        className: "facets-chunk-caption",
+                        text: "Chunk " + (i + 1) + " / " + totalChunks,
+                    }));
+                }
+                if (relPath) {
+                    details.appendChild(createElement("p", { className: "data-value" },
+                        createElement("a", {
+                            href: relPath,
+                            target: "_blank",
+                            rel: "noopener noreferrer",
+                            text: "Open full-screen ↗",
+                        }),
+                        createElement("span", { text: "  (" + absPath + ")" }),
+                    ));
+                    details.appendChild(createElement("iframe", {
+                        className: "facets-embed",
+                        src: relPath,
+                        sandbox: "allow-scripts allow-same-origin",
+                    }));
+                } else {
+                    // Fall back to absolute path when the result_key is malformed.
+                    details.appendChild(createElement("p", { className: "data-value" },
+                        createElement("a", {
+                            href: absPath,
+                            target: "_blank",
+                            rel: "noopener noreferrer",
+                            text: absPath,
+                        }),
+                    ));
+                }
+            }
+            container.appendChild(details);
+        }
+    }
+
+    function renderEmbeddingDiagnostics(profile) {
+        const section = document.getElementById("embedding-diagnostics");
+        const container = document.getElementById("embedding-diagnostics-container");
+        const diagnostics = (profile && profile.embedding_diagnostics) || {};
+        const tableKeys = Object.keys(diagnostics);
+        if (tableKeys.length === 0) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        const TOP_K_IN_REPORT = 5;
+        for (const tableKey of tableKeys) {
+            const perColumn = diagnostics[tableKey] || {};
+            const columnKeys = Object.keys(perColumn);
+            if (columnKeys.length === 0) continue;
+            const tableNode = createElement("details", { open: "" },
+                createElement("summary", { text: tableKey }));
+            for (const colName of columnKeys) {
+                const d = perColumn[colName] || {};
+                const summary = createElement("p", { className: "embedding-summary" },
+                    createElement("strong", { text: colName + ":" }),
+                    " total=" + formatNumber(d.total),
+                    ", unique=" + formatNumber(d.unique_count),
+                    ", unique_ratio=" + formatPercent(d.unique_ratio));
+                tableNode.appendChild(summary);
+
+                const topK = Array.isArray(d.top_k) ? d.top_k.slice(0, TOP_K_IN_REPORT) : [];
+                if (topK.length > 0) {
+                    const thead = createElement("thead", null,
+                        createElement("tr", null,
+                            createElement("th", { text: "Hash" }),
+                            createElement("th", { text: "Count" }),
+                            createElement("th", { text: "Fraction" })));
+                    const tbody = createElement("tbody");
+                    for (const entry of topK) {
+                        tbody.appendChild(createElement("tr", null,
+                            createElement("td", { className: "numeric", text: String(entry.hash) }),
+                            createElement("td", { className: "numeric data-value", text: formatNumber(entry.count) }),
+                            createElement("td", { className: "numeric", text: formatPercent(entry.fraction) })));
+                    }
+                    tableNode.appendChild(createElement("table", null, thead, tbody));
+                }
+            }
+            container.appendChild(tableNode);
+        }
+    }
+
+    function renderCounts(analysis, queriesMap) {
+        const container = document.getElementById("counts-container");
+        const disc = renderQueryDisclosureByPrefix(
+            queriesMap, "graph_structure:node_count:"
+        );
+        if (disc) container.appendChild(disc);
+        const edgeDisc = renderQueryDisclosureByPrefix(
+            queriesMap, "graph_structure:edge_count:"
+        );
+        if (edgeDisc) container.appendChild(edgeDisc);
+        const thead = createElement("thead", null,
+            createElement("tr", null,
+                createElement("th", { text: "Type" }),
+                createElement("th", { text: "Kind" }),
+                createElement("th", { text: "Count" })));
+        const tbody = createElement("tbody");
+        for (const [name, count] of Object.entries(analysis.node_counts || {})) {
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: name }),
+                createElement("td", { text: "node" }),
+                createElement("td", { className: "numeric data-value", text: formatNumber(count) })));
+        }
+        for (const [name, count] of Object.entries(analysis.edge_counts || {})) {
+            tbody.appendChild(createElement("tr", null,
+                createElement("td", { text: name }),
+                createElement("td", { text: "edge" }),
+                createElement("td", { className: "numeric data-value", text: formatNumber(count) })));
+        }
+        container.appendChild(createElement("table", null, thead, tbody));
+    }
+
+    function renderDegreeHistogram(buckets, opts) {
+        // Returns an SVG element for the given bucket counts.
+        // opts (all optional):
+        //   width, height: outer dimensions; default 720x220 for the
+        //     full per-edge-type chart, override for sparkline mode.
+        //   showLabels: when false, skips axis padding, value labels,
+        //     bucket name labels and y-axis max — sparkline-style.
+        //   sparkline: shorthand for sparkline styling (adds a CSS class
+        //     so styles can override the regular histogram look).
+        const o = opts || {};
+        const width = o.width || 720;
+        const height = o.height || 220;
+        const showLabels = o.showLabels !== false;
+        const padLeft = showLabels ? 50 : 2;
+        const padRight = showLabels ? 10 : 2;
+        const padTop = showLabels ? 16 : 2;
+        const padBottom = showLabels ? 40 : 2;
+        const innerW = width - padLeft - padRight;
+        const innerH = height - padTop - padBottom;
+
+        const svg = document.createElementNS("http://www.w3.org/2000/svg", "svg");
+        svg.setAttribute("class", o.sparkline ? "histogram sparkline" : "histogram");
+        svg.setAttribute("viewBox", "0 0 " + width + " " + height);
+
+        const counts = BUCKET_ORDER.map(k => (buckets && buckets[k]) || 0);
+        const maxCount = Math.max(1, ...counts);
+        const barWidth = innerW / BUCKET_ORDER.length;
+        const gap = showLabels ? 8 : 1;
+
+        if (showLabels) {
+            const axis = document.createElementNS("http://www.w3.org/2000/svg", "line");
+            axis.setAttribute("class", "axis");
+            axis.setAttribute("x1", padLeft);
+            axis.setAttribute("y1", padTop + innerH);
+            axis.setAttribute("x2", padLeft + innerW);
+            axis.setAttribute("y2", padTop + innerH);
+            svg.appendChild(axis);
+        }
+
+        for (let i = 0; i < BUCKET_ORDER.length; i++) {
+            const c = counts[i];
+            const h = (c / maxCount) * innerH;
+            const x = padLeft + i * barWidth + gap / 2;
+            const y = padTop + innerH - h;
+            const rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
+            rect.setAttribute("class", "bar");
+            rect.setAttribute("x", x);
+            rect.setAttribute("y", y);
+            rect.setAttribute("width", Math.max(1, barWidth - gap));
+            rect.setAttribute("height", h);
+            svg.appendChild(rect);
+
+            if (!showLabels) continue;
+
+            const valueLabel = document.createElementNS("http://www.w3.org/2000/svg", "text");
+            valueLabel.setAttribute("class", "value");
+            valueLabel.setAttribute("x", x + (barWidth - gap) / 2);
+            valueLabel.setAttribute("y", y - 4);
+            valueLabel.setAttribute("text-anchor", "middle");
+            valueLabel.textContent = formatNumber(c);
+            svg.appendChild(valueLabel);
+
+            const xLabel = document.createElementNS("http://www.w3.org/2000/svg", "text");
+            xLabel.setAttribute("class", "label");
+            xLabel.setAttribute("x", x + (barWidth - gap) / 2);
+            xLabel.setAttribute("y", padTop + innerH + 16);
+            xLabel.setAttribute("text-anchor", "middle");
+            xLabel.textContent = BUCKET_ORDER[i];
+            svg.appendChild(xLabel);
+        }
+
+        if (showLabels) {
+            // Y-axis max label.
+            const maxLabel = document.createElementNS("http://www.w3.org/2000/svg", "text");
+            maxLabel.setAttribute("class", "label");
+            maxLabel.setAttribute("x", padLeft - 6);
+            maxLabel.setAttribute("y", padTop + 10);
+            maxLabel.setAttribute("text-anchor", "end");
+            maxLabel.textContent = formatNumber(maxCount);
+            svg.appendChild(maxLabel);
+        }
+
+        return svg;
+    }
+
+    function renderQueryDisclosure(queriesMap, blockId) {
+        // Returns a <details> with the SQL strings recorded under blockId,
+        // or null when no queries were captured for that block.
+        const queries = (queriesMap || {})[blockId] || [];
+        if (!queries.length) return null;
+        const det = createElement("details", { className: "query-disclosure" });
+        det.appendChild(createElement("summary", { text: "Show SQL" }));
+        for (const q of queries) {
+            det.appendChild(createElement("pre", { className: "sql", text: q }));
+        }
+        return det;
+    }
+
+    function renderQueryDisclosureByPrefix(queriesMap, prefix) {
+        // Aggregate disclosure — collects every block_id starting with
+        // `prefix` into one expander. Used at section level when one
+        // header summarizes data from many block_ids (e.g. NULL rates,
+        // integrity counts).
+        const matches = [];
+        const map = queriesMap || {};
+        for (const key of Object.keys(map)) {
+            if (!key.startsWith(prefix)) continue;
+            const list = map[key] || [];
+            for (const sql of list) matches.push({ key, sql });
+        }
+        if (!matches.length) return null;
+        const det = createElement("details", { className: "query-disclosure" });
+        det.appendChild(createElement("summary", { text: "Show SQL" }));
+        for (const entry of matches) {
+            det.appendChild(createElement("p", {
+                className: "sql-key",
+                text: entry.key,
+            }));
+            det.appendChild(createElement("pre", {
+                className: "sql",
+                text: entry.sql,
+            }));
+        }
+        return det;
+    }
+
+    function renderBlockHeader(level, title, queriesMap, blockId) {
+        // <div class="block-header"><h{level}>title</h{level}>[<details>...]</div>
+        // The disclosure is omitted when no queries are recorded for blockId.
+        const wrap = createElement("div", { className: "block-header" });
+        wrap.appendChild(createElement(level, { text: title }));
+        const disc = renderQueryDisclosure(queriesMap, blockId);
+        if (disc) wrap.appendChild(disc);
+        return wrap;
+    }
+
+    function renderDegree(analysis, queriesMap) {
+        const container = document.getElementById("degree-container");
+        const degrees = analysis.degree_stats || {};
+        const keys = Object.keys(degrees);
+        if (keys.length === 0) {
+            container.appendChild(createElement("p", { text: "No degree stats available." }));
+            return;
+        }
+        for (const edgeType of keys) {
+            const stats = degrees[edgeType];
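+            // Clamp the denominator to 1 so a zero or missing median cannot divide by zero;
+            // a missing p99 falls back to 0 instead of rendering "NaN".
+            const median = Math.max(1, stats.median || 0);
+            const ratio = (stats.p99 || 0) / median;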
+            const ratioClass = classForThreshold(ratio, 50, 100);
+
+            const statsLine = createElement("p", { className: "data-value" },
+                "min=" + formatNumber(stats.min) +
+                ", mean=" + (typeof stats.mean === "number" ? stats.mean.toFixed(2) : "-") +
+                ", median=" + formatNumber(stats.median) +
+                ", p90=" + formatNumber(stats.p90) +
+                ", p99=" + formatNumber(stats.p99) +
+                ", p99.9=" + formatNumber(stats.p999) +
+                ", max=" + formatNumber(stats.max) +
+                " | p99/median=",
+                createElement("span", { className: ratioClass, text: ratio.toFixed(1) }));
+
+            container.appendChild(renderBlockHeader(
+                "h3", edgeType, queriesMap, "graph_structure:degree:" + edgeType
+            ));
+            container.appendChild(statsLine);
+            container.appendChild(renderDegreeHistogram(stats.buckets || {}));
+        }
+    }
+
+    function renderHubs(analysis, queriesMap) {
+        const container = document.getElementById("hubs-container");
+        const hubs = analysis.top_hubs || {};
+        const keys = Object.keys(hubs);
+        if (keys.length === 0) {
+            container.appendChild(createElement("p", { text: "No hub data available." }));
+            return;
+        }
+        for (const edgeType of keys) {
+            container.appendChild(renderBlockHeader(
+                "h3", edgeType, queriesMap, "graph_structure:top_hubs:" + edgeType
+            ));
+            const thead = createElement("thead", null,
+                createElement("tr", null,
+                    createElement("th", { text: "Rank" }),
+                    createElement("th", { text: "Node ID" }),
+                    createElement("th", { text: "Degree" })));
+            const tbody = createElement("tbody");
+            const rows = (hubs[edgeType] || []).slice(0, 20);
+            rows.forEach((entry, i) => {
+                const nodeId = Array.isArray(entry) ? entry[0] : entry.node_id;
+                const degree = Array.isArray(entry) ? entry[1] : entry.degree;
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: String(i + 1) }),
+                    createElement("td", { className: "data-value", text: String(nodeId) }),
+                    createElement("td", { className: "numeric data-value", text: formatNumber(degree) })));
+            });
+            container.appendChild(createElement("table", null, thead, tbody));
+        }
+    }
+
+    function renderSuperHubWarning(analysis) {
+        const box = document.getElementById("super-hub-warning");
+        const clamps = analysis.super_hub_int16_clamp_count || {};
+        const totalClamps = sumValues(clamps);
+        if (totalClamps <= 0) {
+            box.hidden = true;
+            return;
+        }
+        box.hidden = false;
+        box.className = "warning-box";
+        const detail = Object.keys(clamps)
+            .map(k => k + ": " + formatNumber(clamps[k]))
+            .join(", ");
+        box.appendChild(createElement("strong", { text: "Super-hub int16 clamp warning. " }));
+        box.appendChild(document.createTextNode(
+            formatNumber(totalClamps) + " node(s) exceed the int16 degree limit (32,767) and " +
+            "will be silently clamped by GiGL. Per-type: " + detail
+        ));
+    }
+
+    function renderSupervisionOverlap(analysis, queriesMap) {
+        const section = document.getElementById("supervision-overlap");
+        const container = document.getElementById("supervision-overlap-container");
+        const stats = (analysis && analysis.supervision_cross_table_stats) || [];
+        if (!stats.length) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        while (container.firstChild) container.removeChild(container.firstChild);
+
+        for (const entry of stats) {
+            const card = createElement("div", { className: "card" });
+            const title = entry.driver_edge_type + " → " + entry.other_edge_type +
+                " (" + entry.other_role + ")";
+            // Card-level disclosure aggregating any block_id starting with
+            // "supervision_overlap:<driver>:<other>:" — covers homogeneous
+            // and heterogeneous anchor-column suffixes alike.
+            const cardHeader = createElement("div", { className: "block-header" });
+            cardHeader.appendChild(createElement("h3", { text: title }));
+            const cardDisc = renderQueryDisclosureByPrefix(
+                queriesMap,
+                "supervision_overlap:" + entry.driver_edge_type +
+                    ":" + entry.other_edge_type + ":"
+            );
+            if (cardDisc) cardHeader.appendChild(cardDisc);
+            card.appendChild(cardHeader);
+            card.appendChild(createElement("p", { className: "data-value" },
+                "Anchor node type: ", entry.node_anchor,
+                " (driver role: ", entry.driver_role, ")"));
+
+            const driverPairs = entry.driver_pair_count || 0;
+            const overlap = entry.overlap_pair_count || 0;
+            const overlapFrac = driverPairs > 0 ? overlap / driverPairs : 0;
+            const overlapClass = overlap === 0
+                ? "status-green"
+                : (overlapFrac >= 0.01 ? "status-red" : "status-yellow");
+
+            const driverAnchors = entry.driver_anchor_count || 0;
+            const zeroOther = entry.driver_anchors_with_zero_other || 0;
+            const zeroFrac = driverAnchors > 0 ? zeroOther / driverAnchors : 0;
+            const zeroClass = zeroFrac > 0.5
+                ? "status-red"
+                : (zeroFrac >= 0.05 ? "status-yellow" : "status-green");
+
+            const driverName = entry.driver_edge_type;
+            const otherName = entry.other_edge_type;
+            const tbody = createElement("tbody");
+            const rows = [
+                [
+                    "Distinct anchors in " + driverName,
+                    formatNumber(driverAnchors),
+                    null,
+                ],
+                [
+                    "Distinct (anchor, neighbor) pairs in " + driverName,
+                    formatNumber(driverPairs),
+                    null,
+                ],
+                [
+                    "Distinct (anchor, neighbor) pairs in " + otherName,
+                    formatNumber(entry.other_pair_count || 0),
+                    null,
+                ],
+                [
+                    "Overlap pair count (" + driverName + " ∩ " + otherName + ")",
+                    formatNumber(overlap) + "  (" + formatPercent(overlapFrac) + ")",
+                    overlapClass,
+                ],
+                [
+                    "Anchors in " + driverName + " with zero edges in " + otherName,
+                    formatNumber(zeroOther) + "  (" + formatPercent(zeroFrac) + ")",
+                    zeroClass,
+                ],
+                [
+                    "Avg edges in " + otherName + " per anchor in " + driverName,
+                    (entry.avg_other_per_driver_anchor || 0).toFixed(2),
+                    null,
+                ],
+                [
+                    "p50 / p90 / p99 / max edges in " + otherName +
+                        " per anchor in " + driverName,
+                    formatNumber(entry.p50_other_per_driver_anchor || 0) + " / " +
+                    formatNumber(entry.p90_other_per_driver_anchor || 0) + " / " +
+                    formatNumber(entry.p99_other_per_driver_anchor || 0) + " / " +
+                    formatNumber(entry.max_other_per_driver_anchor || 0),
+                    null,
+                ],
+            ];
+            for (const [label, value, cls] of rows) {
+                const valueCell = createElement("td", { className: "numeric data-value" });
+                if (cls) {
+                    valueCell.appendChild(createElement("span", { className: cls, text: value }));
+                } else {
+                    valueCell.textContent = value;
+                }
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: label }),
+                    valueCell));
+            }
+            card.appendChild(createElement("table", null, tbody));
+            container.appendChild(card);
+        }
+    }
+
+    function renderNodeClassificationSupervision(analysis, queriesMap) {
+        const section = document.getElementById("node-classification-supervision");
+        const container = document.getElementById(
+            "node-classification-supervision-container"
+        );
+        const stats = (analysis && analysis.node_classification_supervision_stats) || [];
+        if (!stats.length) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+        while (container.firstChild) container.removeChild(container.firstChild);
+
+        for (const entry of stats) {
+            const card = createElement("div", { className: "card" });
+            card.appendChild(createElement(
+                "h3",
+                { text: "Node type: " + entry.node_type +
+                       "   (label column: " + entry.label_column + ")" }
+            ));
+            const nt = entry.node_type;
+
+            const sentinel = entry.sentinel_stats || {};
+            const totalRows = sentinel.total_rows || 0;
+            const nullCount = sentinel.null_count || 0;
+            const validCount = sentinel.valid_label_count || 0;
+            const validCoverage = sentinel.valid_label_coverage || 0;
+            const sentinelCounts = sentinel.sentinel_counts || {};
+            const sentinelTotal = Object.values(sentinelCounts)
+                .reduce((acc, value) => acc + (value || 0), 0);
+
+            const sentinelTbody = createElement("tbody");
+            const sentinelRows = [
+                ["Total rows", formatNumber(totalRows), null],
+                [
+                    "Valid labels (non-null AND non-sentinel)",
+                    formatNumber(validCount) + "  (" + formatPercent(validCoverage) + ")",
+                    validCoverage > 0 ? "status-green" : "status-red",
+                ],
+                [
+                    "NULL labels",
+                    formatNumber(nullCount),
+                    nullCount > 0 ? "status-yellow" : null,
+                ],
+                [
+                    "Sentinel labels (treated as missing)",
+                    formatNumber(sentinelTotal),
+                    sentinelTotal > 0 ? "status-yellow" : null,
+                ],
+            ];
+            for (const [label, value, cls] of sentinelRows) {
+                const valueCell = createElement("td", { className: "numeric data-value" });
+                if (cls) {
+                    valueCell.appendChild(createElement("span", { className: cls, text: value }));
+                } else {
+                    valueCell.textContent = value;
+                }
+                sentinelTbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: label }),
+                    valueCell));
+            }
+            card.appendChild(renderBlockHeader(
+                "h4", "Label hygiene", queriesMap,
+                "nc_supervision:label_sentinel:" + nt
+            ));
+            card.appendChild(createElement("table", null, sentinelTbody));
+
+            if (Object.keys(sentinelCounts).length) {
+                const ulSentinel = createElement("ul");
+                for (const [val, count] of Object.entries(sentinelCounts)) {
+                    ulSentinel.appendChild(createElement("li", {
+                        text: "sentinel " + JSON.stringify(val) + ": " +
+                              formatNumber(count || 0),
+                    }));
+                }
+                card.appendChild(ulSentinel);
+            }
+
+            const perClass = entry.per_class_degree || [];
+            if (perClass.length) {
+                card.appendChild(renderBlockHeader(
+                    "h4", "Per-class degree", queriesMap,
+                    "nc_supervision:per_class_degree:" + nt
+                ));
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("th", { text: "Class" }),
+                    createElement("th", { text: "Count" }),
+                    createElement("th", { text: "Cold-start (deg ≤ 1)" }),
+                    createElement("th", { text: "Mean" }),
+                    createElement("th", { text: "Median" }),
+                    createElement("th", { text: "p90" }),
+                    createElement("th", { text: "p99" }),
+                    createElement("th", { text: "Max" }),
+                    createElement("th", { text: "Distribution" })));
+                for (const cls of perClass) {
+                    const coldFrac = cls.count > 0
+                        ? (cls.cold_start_count || 0) / cls.count
+                        : 0;
+                    const coldClass = coldFrac >= 0.5
+                        ? "status-red"
+                        : (coldFrac >= 0.1 ? "status-yellow" : "status-green");
+                    const coldCell = createElement("td", { className: "numeric data-value" });
+                    coldCell.appendChild(createElement("span", {
+                        className: coldClass,
+                        text: formatNumber(cls.cold_start_count || 0) +
+                              "  (" + formatPercent(coldFrac) + ")",
+                    }));
+                    const distCell = createElement("td", { className: "sparkline-cell" });
+                    distCell.appendChild(renderDegreeHistogram(cls.buckets || {}, {
+                        width: 140,
+                        height: 32,
+                        showLabels: false,
+                        sparkline: true,
+                    }));
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: String(cls.class_value) }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.count || 0),
+                        }),
+                        coldCell,
+                        createElement("td", {
+                            className: "numeric",
+                            text: (cls.mean_degree || 0).toFixed(2),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.median_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p90_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p99_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.max_degree || 0),
+                        }),
+                        distCell));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            const sentinelDegree = entry.sentinel_degree_stats || [];
+            if (sentinelDegree.length) {
+                card.appendChild(renderBlockHeader(
+                    "h4", "Sentinel-label degree distribution", queriesMap,
+                    "nc_supervision:per_class_degree:" + nt
+                ));
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("th", { text: "Sentinel" }),
+                    createElement("th", { text: "Count" }),
+                    createElement("th", { text: "Cold-start (deg ≤ 1)" }),
+                    createElement("th", { text: "Mean" }),
+                    createElement("th", { text: "Median" }),
+                    createElement("th", { text: "p90" }),
+                    createElement("th", { text: "p99" }),
+                    createElement("th", { text: "Max" }),
+                    createElement("th", { text: "Distribution" })));
+                for (const cls of sentinelDegree) {
+                    const coldFrac = cls.count > 0
+                        ? (cls.cold_start_count || 0) / cls.count
+                        : 0;
+                    const coldClass = coldFrac >= 0.5
+                        ? "status-red"
+                        : (coldFrac >= 0.1 ? "status-yellow" : "status-green");
+                    const coldCell = createElement("td", { className: "numeric data-value" });
+                    coldCell.appendChild(createElement("span", {
+                        className: coldClass,
+                        text: formatNumber(cls.cold_start_count || 0) +
+                              "  (" + formatPercent(coldFrac) + ")",
+                    }));
+                    const distCell = createElement("td", { className: "sparkline-cell" });
+                    distCell.appendChild(renderDegreeHistogram(cls.buckets || {}, {
+                        width: 140,
+                        height: 32,
+                        showLabels: false,
+                        sparkline: true,
+                    }));
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: String(cls.class_value) }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.count || 0),
+                        }),
+                        coldCell,
+                        createElement("td", {
+                            className: "numeric",
+                            text: (cls.mean_degree || 0).toFixed(2),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.median_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p90_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.p99_degree || 0),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(cls.max_degree || 0),
+                        }),
+                        distCell));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            const homophily = entry.homophily || [];
+            if (homophily.length) {
+                // One query was recorded per (node_type, edge_type) so the
+                // disclosure aggregates across all edge types in this card.
+                const homHeader = createElement("div", { className: "block-header" });
+                homHeader.appendChild(createElement("h4", { text: "Homophily" }));
+                const homDisc = renderQueryDisclosureByPrefix(
+                    queriesMap, "nc_supervision:homophily:" + nt + ":"
+                );
+                if (homDisc) homHeader.appendChild(homDisc);
+                card.appendChild(homHeader);
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Edge homophily" }),
+                    createElement("th", { text: "Adjusted homophily" }),
+                    createElement("th", { text: "Sample size" })));
+                for (const h of homophily) {
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: h.edge_type }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: (h.edge_homophily || 0).toFixed(4),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: (h.adjusted_homophily || 0).toFixed(4),
+                        }),
+                        createElement("td", {
+                            className: "numeric",
+                            text: formatNumber(h.edge_sample_count || 0),
+                        })));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            const split = entry.cross_split_overlap;
+            if (split) {
+                card.appendChild(renderBlockHeader(
+                    "h4", "Train / val / test split", queriesMap,
+                    "nc_supervision:cross_split:" + nt
+                ));
+                const overlap = split.overlap_node_count || 0;
+                const overlapClass = overlap === 0 ? "status-green" : "status-red";
+                const overlapCell = createElement("td", { className: "numeric data-value" });
+                overlapCell.appendChild(createElement("span", {
+                    className: overlapClass,
+                    text: formatNumber(overlap),
+                }));
+                const tbody = createElement("tbody");
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: "Cross-split node-id overlap (must be 0)" }),
+                    overlapCell));
+                for (const [splitValue, count] of Object.entries(split.split_value_counts || {})) {
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: "Rows in split " + JSON.stringify(splitValue) }),
+                        createElement("td", {
+                            className: "numeric data-value",
+                            text: formatNumber(count || 0),
+                        })));
+                }
+                card.appendChild(createElement("table", null, tbody));
+            }
+
+            container.appendChild(card);
+        }
+    }
+
+    function renderAdvanced(analysis, queriesMap) {
+        const section = document.getElementById("advanced");
+        const container = document.getElementById("advanced-container");
+
+        const classImb = analysis.class_imbalance || {};
+        const labelCov = analysis.label_coverage || {};
+        const edgeDist = analysis.edge_type_distribution || {};
+        const reciprocity = analysis.reciprocity || {};
+        const powerLaw = analysis.power_law_exponent || {};
+
+        const hasTier3 = Object.keys(classImb).length ||
+                         Object.keys(labelCov).length ||
+                         Object.keys(edgeDist).length;
+        const hasTier4 = Object.keys(reciprocity).length ||
+                         Object.keys(powerLaw).length;
+
+        if (!hasTier3 && !hasTier4) {
+            section.hidden = true;
+            return;
+        }
+        section.hidden = false;
+
+        if (Object.keys(classImb).length) {
+            const classImbHeader = createElement("div", { className: "block-header" });
+            classImbHeader.appendChild(createElement("h3", { text: "Class imbalance" }));
+            const classImbDisc = renderQueryDisclosureByPrefix(
+                queriesMap, "advanced:class_imbalance:"
+            );
+            if (classImbDisc) classImbHeader.appendChild(classImbDisc);
+            container.appendChild(classImbHeader);
+            for (const nodeType of Object.keys(classImb)) {
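+                const counts = classImb[nodeType] || {};
+                const values = Object.values(counts);
+                // Math.max/Math.min over an empty list return -Infinity/Infinity,
+                // which would make the ratio NaN, so fall back to neutral values.
+                const maxC = values.length ? Math.max(...values) : 0;
+                const minC = values.length ? Math.max(1, Math.min(...values)) : 1;
+                const ratio = maxC / minC;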
+                const cls = ratio > 10 ? "status-red" : ratio > 5 ? "status-yellow" : "status-green";
+                container.appendChild(createElement("p", { className: "data-value" },
+                    nodeType + " max/min ratio = ",
+                    createElement("span", { className: cls, text: ratio.toFixed(1) + ":1" })));
+                const tbody = createElement("tbody");
+                for (const [label, count] of Object.entries(counts)) {
+                    tbody.appendChild(createElement("tr", null,
+                        createElement("td", { text: String(label) }),
+                        createElement("td", { className: "numeric data-value", text: formatNumber(count) })));
+                }
+                container.appendChild(createElement("table", null,
+                    createElement("thead", null, createElement("tr", null,
+                        createElement("th", { text: "Class" }),
+                        createElement("th", { text: "Count" }))),
+                    tbody));
+            }
+        }
+
+        if (Object.keys(labelCov).length) {
+            const labelCovHeader = createElement("div", { className: "block-header" });
+            labelCovHeader.appendChild(createElement("h3", { text: "Label coverage" }));
+            const labelCovDisc = renderQueryDisclosureByPrefix(
+                queriesMap, "advanced:label_coverage:"
+            );
+            if (labelCovDisc) labelCovHeader.appendChild(labelCovDisc);
+            container.appendChild(labelCovHeader);
+            const tbody = createElement("tbody");
+            for (const [nodeType, frac] of Object.entries(labelCov)) {
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: nodeType }),
+                    createElement("td", { className: "numeric data-value", text: formatPercent(frac) })));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Node type" }),
+                    createElement("th", { text: "Coverage" }))),
+                tbody));
+        }
+
+        if (Object.keys(edgeDist).length) {
+            const edgeDistHeader = createElement("div", { className: "block-header" });
+            edgeDistHeader.appendChild(createElement(
+                "h3", { text: "Edge type distribution" }
+            ));
+            const edgeDistDisc = renderQueryDisclosureByPrefix(
+                queriesMap, "advanced:edge_type_distribution:"
+            );
+            if (edgeDistDisc) edgeDistHeader.appendChild(edgeDistDisc);
+            container.appendChild(edgeDistHeader);
+            const total = sumValues(edgeDist);
+            const tbody = createElement("tbody");
+            for (const [edgeType, count] of Object.entries(edgeDist)) {
+                const frac = total > 0 ? count / total : 0;
+                let cls = "status-green";
+                if (frac < 0.001) cls = "status-red";
+                else if (frac > 0.9) cls = "status-red";
+                else if (frac > 0.8) cls = "status-yellow";
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: edgeType }),
+                    createElement("td", { className: "numeric data-value", text: formatNumber(count) }),
+                    createElement("td", { className: "numeric" },
+                        createElement("span", { className: cls, text: formatPercent(frac) }))));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Count" }),
+                    createElement("th", { text: "Share" }))),
+                tbody));
+        }
+
+        if (Object.keys(reciprocity).length) {
+            container.appendChild(createElement("h3", { text: "Reciprocity" }));
+            const tbody = createElement("tbody");
+            for (const [edgeType, val] of Object.entries(reciprocity)) {
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: edgeType }),
+                    createElement("td", { className: "numeric data-value", text: formatPercent(val) })));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Reciprocity" }))),
+                tbody));
+        }
+
+        if (Object.keys(powerLaw).length) {
+            container.appendChild(createElement("h3", { text: "Power-law exponent" }));
+            const tbody = createElement("tbody");
+            for (const [edgeType, alpha] of Object.entries(powerLaw)) {
+                const cls = alpha < 2 ? "status-red" : alpha < 2.5 ? "status-yellow" : "status-green";
+                tbody.appendChild(createElement("tr", null,
+                    createElement("td", { text: edgeType }),
+                    createElement("td", { className: "numeric" },
+                        createElement("span", { className: cls, text: alpha.toFixed(2) }))));
+            }
+            container.appendChild(createElement("table", null,
+                createElement("thead", null, createElement("tr", null,
+                    createElement("th", { text: "Edge type" }),
+                    createElement("th", { text: "Alpha" }))),
+                tbody));
+        }
+    }
+
+    function renderFooter(analysis, profile) {
+        const container = document.getElementById("footer-container");
+
+        const artifacts = [];
+
+        // facets_html_paths / stats_paths are list-valued so wide tables can
+        // contribute one entry per chunk. Flatten with a "(chunk i/N)" suffix
+        // when a table has more than one entry; preserve the legacy unsuffixed
+        // form for single-chunk tables (the common case).
+        function pushFlattened(label, dict) {
+            for (const [k, v] of Object.entries(dict)) {
+                const list = Array.isArray(v) ? v : (v ? [v] : []);
+                list.forEach((p, i) => {
+                    const suffix = list.length > 1
+                        ? " (chunk " + (i + 1) + "/" + list.length + ")"
+                        : "";
+                    artifacts.push(label + " " + k + suffix + ": " + p);
+                });
+            }
+        }
+        pushFlattened("FACETS", (profile && profile.facets_html_paths) || {});
+        pushFlattened("Stats", (profile && profile.stats_paths) || {});
+
+        if (artifacts.length) {
+            container.appendChild(createElement("h3", { text: "Raw artifacts" }));
+            const ul = createElement("ul", { className: "footer-list" });
+            for (const a of artifacts) {
+                ul.appendChild(createElement("li", null, createElement("code", { text: a })));
+            }
+            container.appendChild(ul);
+        }
+    }
+
+    function main() {
+        const analysis = parseJSONTag("analysis-data");
+        const profile = parseJSONTag("profile-data");
+        const queriesMap = (analysis && analysis.queries) || {};
+        renderHeader(analysis);
+        renderOverview(analysis);
+        renderNullRates(analysis, queriesMap);
+        renderIntegrity(analysis, queriesMap);
+        renderFeatureStatistics(profile);
+        renderEmbeddingDiagnostics(profile);
+        renderCounts(analysis, queriesMap);
+        renderDegree(analysis, queriesMap);
+        renderHubs(analysis, queriesMap);
+        renderSuperHubWarning(analysis);
+        renderNodeClassificationSupervision(analysis, queriesMap);
+        renderSupervisionOverlap(analysis, queriesMap);
+        renderAdvanced(analysis, queriesMap);
+        renderFooter(analysis, profile);
+    }
+
+    if (document.readyState === "loading") {
+        document.addEventListener("DOMContentLoaded", main);
+    } else {
+        main();
+    }
+})();
+</script>
+</body>
+</html>
diff --git a/tests/test_assets/analytics/sample_analyzer_config.yaml b/tests/test_assets/analytics/sample_analyzer_config.yaml
new file mode 100644
index 000000000..acd3fb7b9
--- /dev/null
+++ b/tests/test_assets/analytics/sample_analyzer_config.yaml
@@ -0,0 +1,16 @@
+node_tables:
+  - bq_table: "test_project.test_dataset.user_nodes"
+    node_type: "user"
+    id_column: "user_id"
+    feature_columns: ["age", "country"]
+    label_column: "label"
+
+edge_tables:
+  - bq_table: "test_project.test_dataset.user_edges"
+    edge_type: "follows"
+    src_id_column: "src_user_id"
+    dst_id_column: "dst_user_id"
+    feature_columns: ["weight"]
+
+output_gcs_path: "gs://test-bucket/analysis_output/"
+fan_out: [15, 10, 5]
diff --git a/tests/unit/analytics/__init__.py b/tests/unit/analytics/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/unit/analytics/data_analyzer/__init__.py b/tests/unit/analytics/data_analyzer/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/unit/analytics/data_analyzer/config_label_sentinel_test.py b/tests/unit/analytics/data_analyzer/config_label_sentinel_test.py
new file mode 100644
index 000000000..017437901
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/config_label_sentinel_test.py
@@ -0,0 +1,194 @@
+"""Schema validation tests for the NC supervision additions to NodeTableSpec.
+
+Exercises ``label_sentinel_values``, ``split_column``, and the three
+``DataAnalyzerConfig`` options (``compute_per_class_feature_stats``,
+``compute_label_informativeness``, ``label_homophily_edge_sample_cap``).
+"""
+
+import tempfile
+from pathlib import Path
+from typing import cast
+
+from omegaconf import OmegaConf
+
+from gigl.analytics.data_analyzer.config import DataAnalyzerConfig, load_analyzer_config
+from tests.test_assets.test_case import TestCase
+
+
+def _write_yaml(yaml_str: str) -> str:
+    """Write a YAML string to a temp file and return its absolute path."""
+    tmp = tempfile.NamedTemporaryFile(
+        mode="w", suffix=".yaml", delete=False, encoding="utf-8"
+    )
+    tmp.write(yaml_str)
+    tmp.flush()
+    tmp.close()
+    return tmp.name
+
+
+class LabelSentinelConfigTest(TestCase):
+    def test_default_label_sentinel_values_is_empty_list(self) -> None:
+        yaml_str = """
+        node_tables:
+          - bq_table: "p.d.t"
+            node_type: "user"
+            id_column: "uid"
+            label_column: "label"
+        edge_tables:
+          - bq_table: "p.d.e"
+            edge_type: "follows"
+            src_id_column: "src"
+            dst_id_column: "dst"
+        output_gcs_path: "gs://bucket/out/"
+        """
+        raw = OmegaConf.create(yaml_str)
+        merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+        config = cast(DataAnalyzerConfig, OmegaConf.to_object(merged))
+        self.assertEqual(config.node_tables[0].label_sentinel_values, [])
+        self.assertIsNone(config.node_tables[0].split_column)
+
+    def test_loads_label_sentinel_values_and_split_column(self) -> None:
+        path = _write_yaml(
+            """
+            node_tables:
+              - bq_table: "p.d.t"
+                node_type: "user"
+                id_column: "uid"
+                label_column: "node_label"
+                label_sentinel_values:
+                  - "-1"
+                  - "unknown"
+                split_column: "split"
+            edge_tables:
+              - bq_table: "p.d.e"
+                edge_type: "follows"
+                src_id_column: "src"
+                dst_id_column: "dst"
+            output_gcs_path: "gs://bucket/out/"
+            """
+        )
+        try:
+            config = load_analyzer_config(path)
+        finally:
+            Path(path).unlink(missing_ok=True)
+        self.assertEqual(config.node_tables[0].label_sentinel_values, ["-1", "unknown"])
+        self.assertEqual(config.node_tables[0].split_column, "split")
+
+    def test_empty_sentinel_string_rejected(self) -> None:
+        path = _write_yaml(
+            """
+            node_tables:
+              - bq_table: "p.d.t"
+                node_type: "user"
+                id_column: "uid"
+                label_column: "node_label"
+                label_sentinel_values:
+                  - ""
+            edge_tables:
+              - bq_table: "p.d.e"
+                edge_type: "follows"
+                src_id_column: "src"
+                dst_id_column: "dst"
+            output_gcs_path: "gs://bucket/out/"
+            """
+        )
+        try:
+            with self.assertRaises(ValueError):
+                load_analyzer_config(path)
+        finally:
+            Path(path).unlink(missing_ok=True)
+
+    def test_sentinel_without_label_column_rejected(self) -> None:
+        """Sentinels apply to label_column only — declaring them without it is a bug."""
+        path = _write_yaml(
+            """
+            node_tables:
+              - bq_table: "p.d.t"
+                node_type: "user"
+                id_column: "uid"
+                label_sentinel_values:
+                  - "-1"
+            edge_tables:
+              - bq_table: "p.d.e"
+                edge_type: "follows"
+                src_id_column: "src"
+                dst_id_column: "dst"
+            output_gcs_path: "gs://bucket/out/"
+            """
+        )
+        try:
+            with self.assertRaises(ValueError):
+                load_analyzer_config(path)
+        finally:
+            Path(path).unlink(missing_ok=True)
+
+    def test_invalid_split_column_identifier_rejected(self) -> None:
+        path = _write_yaml(
+            """
+            node_tables:
+              - bq_table: "p.d.t"
+                node_type: "user"
+                id_column: "uid"
+                label_column: "label"
+                split_column: "bad column"
+            edge_tables:
+              - bq_table: "p.d.e"
+                edge_type: "follows"
+                src_id_column: "src"
+                dst_id_column: "dst"
+            output_gcs_path: "gs://bucket/out/"
+            """
+        )
+        try:
+            with self.assertRaises(ValueError):
+                load_analyzer_config(path)
+        finally:
+            Path(path).unlink(missing_ok=True)
+
+
+class DataAnalyzerConfigFlagDefaultsTest(TestCase):
+    def test_nc_flag_defaults(self) -> None:
+        yaml_str = """
+        node_tables:
+          - bq_table: "p.d.t"
+            node_type: "user"
+            id_column: "uid"
+        edge_tables:
+          - bq_table: "p.d.e"
+            edge_type: "follows"
+            src_id_column: "src"
+            dst_id_column: "dst"
+        output_gcs_path: "gs://bucket/out/"
+        """
+        raw = OmegaConf.create(yaml_str)
+        merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+        config = cast(DataAnalyzerConfig, OmegaConf.to_object(merged))
+        # Per-class feature stats default on (cheap; highest-value NC signal).
+        self.assertTrue(config.compute_per_class_feature_stats)
+        # Label informativeness default off (expensive full-graph join).
+        self.assertFalse(config.compute_label_informativeness)
+        # Default sample cap is 50M edges.
+        self.assertEqual(config.label_homophily_edge_sample_cap, 50_000_000)
+
+    def test_overriding_homophily_sample_cap(self) -> None:
+        yaml_str = """
+        node_tables:
+          - bq_table: "p.d.t"
+            node_type: "user"
+            id_column: "uid"
+        edge_tables:
+          - bq_table: "p.d.e"
+            edge_type: "follows"
+            src_id_column: "src"
+            dst_id_column: "dst"
+        output_gcs_path: "gs://bucket/out/"
+        label_homophily_edge_sample_cap: 0
+        compute_label_informativeness: true
+        compute_per_class_feature_stats: false
+        """
+        raw = OmegaConf.create(yaml_str)
+        merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+        config = cast(DataAnalyzerConfig, OmegaConf.to_object(merged))
+        self.assertEqual(config.label_homophily_edge_sample_cap, 0)
+        self.assertTrue(config.compute_label_informativeness)
+        self.assertFalse(config.compute_per_class_feature_stats)
diff --git a/tests/unit/analytics/data_analyzer/config_test.py b/tests/unit/analytics/data_analyzer/config_test.py
new file mode 100644
index 000000000..2edfa5432
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/config_test.py
@@ -0,0 +1,290 @@
+from pathlib import Path
+from typing import cast
+
+from omegaconf import OmegaConf
+from omegaconf.errors import MissingMandatoryValue
+
+from gigl.analytics.data_analyzer.config import DataAnalyzerConfig, load_analyzer_config
+from tests.test_assets.test_case import TestCase
+
+SAMPLE_CONFIG_PATH = (
+    Path(__file__).parents[3]
+    / "test_assets"
+    / "analytics"
+    / "sample_analyzer_config.yaml"
+)
+
+
+class DataAnalyzerConfigTest(TestCase):
+    def test_load_valid_config(self) -> None:
+        config = load_analyzer_config(str(SAMPLE_CONFIG_PATH))
+        self.assertIsInstance(config, DataAnalyzerConfig)
+        self.assertEqual(len(config.node_tables), 1)
+        self.assertEqual(len(config.edge_tables), 1)
+        self.assertEqual(config.node_tables[0].node_type, "user")
+        self.assertEqual(config.node_tables[0].label_column, "label")
+        self.assertEqual(config.edge_tables[0].edge_type, "follows")
+        self.assertEqual(config.output_gcs_path, "gs://test-bucket/analysis_output/")
+        self.assertEqual(config.fan_out, [15, 10, 5])
+
+    def test_optional_fields_default_to_none_or_false(self) -> None:
+        yaml_str = """
+        node_tables:
+          - bq_table: "p.d.t"
+            node_type: "user"
+            id_column: "uid"
+            feature_columns: ["f1"]
+        edge_tables:
+          - bq_table: "p.d.e"
+            edge_type: "follows"
+            src_id_column: "src"
+            dst_id_column: "dst"
+        output_gcs_path: "gs://bucket/out/"
+        """
+        raw = OmegaConf.create(yaml_str)
+        merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+        config = cast(DataAnalyzerConfig, OmegaConf.to_object(merged))
+        self.assertIsNone(config.node_tables[0].label_column)
+        self.assertIsNone(config.edge_tables[0].timestamp_column)
+        self.assertIsNone(config.fan_out)
+        self.assertFalse(config.compute_reciprocity)
+        self.assertFalse(config.compute_homophily)
+        self.assertIsNone(config.job_name_prefix)
+
+    def test_job_name_prefix_round_trips(self) -> None:
+        """``job_name_prefix`` parses through OmegaConf when set."""
+        yaml_str = """
+        node_tables:
+          - bq_table: "p.d.t"
+            node_type: "user"
+            id_column: "uid"
+        edge_tables:
+          - bq_table: "p.d.e"
+            edge_type: "follows"
+            src_id_column: "src"
+            dst_id_column: "dst"
+        output_gcs_path: "gs://bucket/out/"
+        job_name_prefix: "cd-content"
+        """
+        raw = OmegaConf.create(yaml_str)
+        merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+        config = cast(DataAnalyzerConfig, OmegaConf.to_object(merged))
+        self.assertEqual(config.job_name_prefix, "cd-content")
+
+    def test_missing_required_field_raises(self) -> None:
+        yaml_str = """
+        node_tables:
+          - bq_table: "p.d.t"
+            node_type: "user"
+        edge_tables: []
+        output_gcs_path: "gs://bucket/out/"
+        """
+        raw = OmegaConf.create(yaml_str)
+        with self.assertRaises(MissingMandatoryValue):
+            merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+            OmegaConf.to_object(merged)
+
+    def test_node_table_without_feature_columns(self) -> None:
+        """Nodes with no features are legal; feature_columns defaults to []."""
+        yaml_str = """
+        node_tables:
+          - bq_table: "p.d.t"
+            node_type: "user"
+            id_column: "uid"
+        edge_tables:
+          - bq_table: "p.d.e"
+            edge_type: "follows"
+            src_id_column: "src"
+            dst_id_column: "dst"
+        output_gcs_path: "gs://bucket/out/"
+        """
+        raw = OmegaConf.create(yaml_str)
+        merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
+        config = cast(DataAnalyzerConfig, OmegaConf.to_object(merged))
+        self.assertEqual(config.node_tables[0].feature_columns, [])
+
+    def test_homogeneous_edge_backfills_src_and_dst_node_type(self) -> None:
+        """Single-node-table configs auto-populate src/dst node types."""
+        config = load_analyzer_config(str(SAMPLE_CONFIG_PATH))
+        self.assertEqual(config.edge_tables[0].src_node_type, "user")
+        self.assertEqual(config.edge_tables[0].dst_node_type, "user")
+
+
+SAMPLE_HETERO_YAML = """
+node_tables:
+  - bq_table: "p.d.users"
+    node_type: "user"
+    id_column: "uid"
+  - bq_table: "p.d.content"
+    node_type: "content"
+    id_column: "cid"
+edge_tables:
+  - bq_table: "p.d.viewed"
+    edge_type: "viewed"
+    src_id_column: "user_id"
+    dst_id_column: "content_id"
+    src_node_type: "user"
+    dst_node_type: "content"
+output_gcs_path: "gs://bucket/out/"
+"""
+
+
+class DataAnalyzerConfigHeterogeneousTest(TestCase):
+    """Tests for heterogeneous graph support (I3) and identifier validation (I1)."""
+
+    def _write_yaml(self, yaml_str: str) -> str:
+        import tempfile
+
+        with tempfile.NamedTemporaryFile(
+            mode="w", suffix=".yaml", delete=False, encoding="utf-8"
+        ) as fd:
+            fd.write(yaml_str)
+        self.addCleanup(Path(fd.name).unlink, missing_ok=True)
+        return fd.name
+
+    def test_heterogeneous_config_with_node_types_loads(self) -> None:
+        path = self._write_yaml(SAMPLE_HETERO_YAML)
+        config = load_analyzer_config(path)
+        self.assertEqual(len(config.node_tables), 2)
+        self.assertEqual(config.edge_tables[0].src_node_type, "user")
+        self.assertEqual(config.edge_tables[0].dst_node_type, "content")
+
+    def test_heterogeneous_missing_src_node_type_raises(self) -> None:
+        """Regression test for I3: multi-node-table configs must declare both sides."""
+        yaml_str = SAMPLE_HETERO_YAML.replace('    src_node_type: "user"\n', "")
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("src_node_type is required", str(ctx.exception))
+
+    def test_heterogeneous_unknown_node_type_raises(self) -> None:
+        yaml_str = SAMPLE_HETERO_YAML.replace(
+            '    dst_node_type: "content"', '    dst_node_type: "movie"'
+        )
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("is not a declared node_type", str(ctx.exception))
+
+    def test_invalid_bq_table_reference_raises(self) -> None:
+        """Regression test for I1: reject malformed table identifiers."""
+        yaml_str = SAMPLE_HETERO_YAML.replace(
+            'bq_table: "p.d.users"', 'bq_table: "p.d.users; DROP TABLE x"'
+        )
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("not a valid BigQuery table reference", str(ctx.exception))
+
+    def test_invalid_column_identifier_raises(self) -> None:
+        """Regression test for I1: reject column names with backticks/quotes."""
+        yaml_str = SAMPLE_HETERO_YAML.replace(
+            'src_id_column: "user_id"', 'src_id_column: "user`id"'
+        )
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("not a valid BigQuery column identifier", str(ctx.exception))
+
+    def test_column_with_whitespace_raises(self) -> None:
+        """Regression test for I1: reject column names containing whitespace."""
+        yaml_str = SAMPLE_HETERO_YAML.replace(
+            'dst_id_column: "content_id"', 'dst_id_column: "content id"'
+        )
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("not a valid BigQuery column identifier", str(ctx.exception))
+
+
+SUPERVISION_HETERO_YAML = """
+node_tables:
+  - bq_table: "p.d.users"
+    node_type: "user"
+    id_column: "uid"
+  - bq_table: "p.d.content"
+    node_type: "content"
+    id_column: "cid"
+edge_tables:
+  - bq_table: "p.d.viewed"
+    edge_type: "viewed"
+    role: "message_passing"
+    src_id_column: "user_id"
+    dst_id_column: "content_id"
+    src_node_type: "user"
+    dst_node_type: "content"
+  - bq_table: "p.d.viewed_pos"
+    edge_type: "viewed_pos"
+    role: "supervision_pos"
+    node_anchor: "user"
+    src_id_column: "user_id"
+    dst_id_column: "content_id"
+    src_node_type: "user"
+    dst_node_type: "content"
+  - bq_table: "p.d.viewed_neg"
+    edge_type: "viewed_neg"
+    role: "supervision_neg"
+    src_id_column: "user_id"
+    dst_id_column: "content_id"
+    src_node_type: "user"
+    dst_node_type: "content"
+output_gcs_path: "gs://bucket/out/"
+"""
+
+
+class SupervisionRoleConfigTest(TestCase):
+    """Validation for the role / node_anchor fields on EdgeTableSpec."""
+
+    def _write_yaml(self, yaml_str: str) -> str:
+        import tempfile
+
+        with tempfile.NamedTemporaryFile(
+            mode="w", suffix=".yaml", delete=False, encoding="utf-8"
+        ) as fd:
+            fd.write(yaml_str)
+        self.addCleanup(Path(fd.name).unlink, missing_ok=True)
+        return fd.name
+
+    def test_supervision_config_loads(self) -> None:
+        path = self._write_yaml(SUPERVISION_HETERO_YAML)
+        config = load_analyzer_config(path)
+        roles = {e.edge_type: e.role for e in config.edge_tables}
+        self.assertEqual(roles["viewed"], "message_passing")
+        self.assertEqual(roles["viewed_pos"], "supervision_pos")
+        self.assertEqual(roles["viewed_neg"], "supervision_neg")
+        anchors = {e.edge_type: e.node_anchor for e in config.edge_tables}
+        self.assertEqual(anchors["viewed_pos"], "user")
+        # Negatives without explicit anchor stay None — analyzer auto-inherits at runtime.
+        self.assertIsNone(anchors["viewed_neg"])
+
+    def test_role_defaults_to_message_passing(self) -> None:
+        """Edge tables without a role field default to message_passing."""
+        path = self._write_yaml(SAMPLE_HETERO_YAML)
+        config = load_analyzer_config(path)
+        self.assertEqual(config.edge_tables[0].role, "message_passing")
+
+    def test_unknown_role_raises(self) -> None:
+        yaml_str = SUPERVISION_HETERO_YAML.replace(
+            'role: "supervision_pos"', 'role: "bogus_role"'
+        )
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("role=", str(ctx.exception))
+        self.assertIn("bogus_role", str(ctx.exception))
+
+    def test_supervision_pos_without_node_anchor_raises(self) -> None:
+        yaml_str = SUPERVISION_HETERO_YAML.replace('    node_anchor: "user"\n', "")
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("node_anchor is required", str(ctx.exception))
+
+    def test_node_anchor_not_matching_src_or_dst_raises(self) -> None:
+        yaml_str = SUPERVISION_HETERO_YAML.replace(
+            'node_anchor: "user"', 'node_anchor: "movie"'
+        )
+        path = self._write_yaml(yaml_str)
+        with self.assertRaises(ValueError) as ctx:
+            load_analyzer_config(path)
+        self.assertIn("node_anchor=", str(ctx.exception))
diff --git a/tests/unit/analytics/data_analyzer/data_analyzer_test.py b/tests/unit/analytics/data_analyzer/data_analyzer_test.py
new file mode 100644
index 000000000..d44303e18
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/data_analyzer_test.py
@@ -0,0 +1,330 @@
+import tempfile
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+from gigl.analytics.data_analyzer.config import (
+    DataAnalyzerConfig,
+    EdgeTableSpec,
+    NodeTableSpec,
+)
+from gigl.analytics.data_analyzer.data_analyzer import (
+    DataAnalyzer,
+    _resolve_job_name_prefix,
+    _write_report,
+)
+from gigl.analytics.data_analyzer.graph_structure_analyzer import DataQualityError
+from gigl.analytics.data_analyzer.types import FeatureProfileResult, GraphAnalysisResult
+from tests.test_assets.test_case import TestCase
+
+HTML = "<html><body>report</body></html>"
+_TEST_JOB_NAME_PREFIX = "tp"
+_TEST_RUN_TIMESTAMP = "20260101-0000"
+
+
+def _run(analyzer: DataAnalyzer, **kwargs) -> str:
+    """Invoke ``DataAnalyzer.run`` with the test prefix and timestamp.
+
+    Supplies the required ``job_name_prefix`` / ``run_timestamp`` kwargs in one place.
+    """
+    return analyzer.run(
+        job_name_prefix=_TEST_JOB_NAME_PREFIX,
+        run_timestamp=_TEST_RUN_TIMESTAMP,
+        **kwargs,
+    )
+
+
+def _make_config(output_gcs_path: str) -> DataAnalyzerConfig:
+    return DataAnalyzerConfig(
+        node_tables=[
+            NodeTableSpec(
+                bq_table="p.d.users",
+                node_type="user",
+                id_column="uid",
+                feature_columns=["age"],
+            )
+        ],
+        edge_tables=[
+            EdgeTableSpec(
+                bq_table="p.d.follows",
+                edge_type="follows",
+                src_id_column="src",
+                dst_id_column="dst",
+                src_node_type="user",
+                dst_node_type="user",
+            )
+        ],
+        output_gcs_path=output_gcs_path,
+    )
+
+
+class WriteReportLocalTest(TestCase):
+    def test_writes_to_local_directory(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            path = _write_report(HTML, tmpdir)
+            report = Path(path)
+            self.assertTrue(report.exists())
+            self.assertEqual(report.read_text(), HTML)
+            self.assertEqual(report.name, "report.html")
+
+    def test_creates_missing_parent_dirs(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            nested = Path(tmpdir) / "nested" / "path"
+            path = _write_report(HTML, str(nested))
+            self.assertTrue(Path(path).exists())
+
+
+@patch("gigl.analytics.data_analyzer.data_analyzer.GcsUtils")
+class WriteReportGcsTest(TestCase):
+    def test_uploads_to_gcs(self, mock_gcs_cls: MagicMock) -> None:
+        path = _write_report(HTML, "gs://my-bucket/output/")
+        self.assertEqual(path, "gs://my-bucket/output/report.html")
+        mock_gcs_cls.return_value.upload_from_string.assert_called_once()
+
+    def test_handles_trailing_slash(self, mock_gcs_cls: MagicMock) -> None:
+        path_with = _write_report(HTML, "gs://my-bucket/output/")
+        path_without = _write_report(HTML, "gs://my-bucket/output")
+        self.assertEqual(path_with, path_without)
+
+
+class DataAnalyzerRunTest(TestCase):
+    """Orchestrator tests: structure analyzer and feature profiler run
+    concurrently, their results both reach ``generate_report``, and failures
+    in either are handled independently without blocking the other.
+    """
+
+    def setUp(self) -> None:
+        super().setUp()
+        self._generate_report = patch(
+            "gigl.analytics.data_analyzer.data_analyzer.generate_report",
+            return_value=HTML,
+        ).start()
+        self._analyze = patch(
+            "gigl.analytics.data_analyzer.data_analyzer.GraphStructureAnalyzer.analyze",
+        ).start()
+        self._profile = patch(
+            "gigl.analytics.data_analyzer.data_analyzer.FeatureProfiler.profile",
+        ).start()
+        self.addCleanup(patch.stopall)
+
+    def test_invokes_both_analyzer_and_profiler(self) -> None:
+        analysis = GraphAnalysisResult()
+        profile = FeatureProfileResult(
+            facets_html_paths={"node:user": ["gs://b/facets.html"]}
+        )
+        self._analyze.return_value = analysis
+        self._profile.return_value = profile
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+            )
+
+        self.assertEqual(self._analyze.call_count, 1)
+        self.assertEqual(self._profile.call_count, 1)
+        _, call_kwargs = self._generate_report.call_args
+        self.assertIs(call_kwargs["analysis_result"], analysis)
+        self.assertIs(call_kwargs["profile_result"], profile)
+
+    def test_profiler_failure_does_not_block_report(self) -> None:
+        self._analyze.return_value = GraphAnalysisResult()
+        self._profile.side_effect = RuntimeError("Dataflow went boom")
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            path = _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+            )
+            _, call_kwargs = self._generate_report.call_args
+            self.assertIsInstance(call_kwargs["profile_result"], FeatureProfileResult)
+            self.assertEqual(call_kwargs["profile_result"].facets_html_paths, {})
+            self.assertTrue(Path(path).exists())
+
+    def test_data_quality_error_uses_partial_result_and_still_runs_profiler(
+        self,
+    ) -> None:
+        partial = GraphAnalysisResult(dangling_edge_counts={"follows": 1})
+        self._analyze.side_effect = DataQualityError(
+            "Tier 1 failure", partial_result=partial
+        )
+        profile = FeatureProfileResult(
+            facets_html_paths={"node:user": ["gs://b/facets.html"]}
+        )
+        self._profile.return_value = profile
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+            )
+
+        _, call_kwargs = self._generate_report.call_args
+        self.assertIs(call_kwargs["analysis_result"], partial)
+        self.assertIs(call_kwargs["profile_result"], profile)
+
+    def test_passes_resource_config_to_profiler(self) -> None:
+        self._analyze.return_value = GraphAnalysisResult()
+        self._profile.return_value = FeatureProfileResult()
+        resource_config = MagicMock()
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=resource_config,
+            )
+
+        self.assertIs(self._profile.call_args.args[1], resource_config)
+
+    def test_components_structure_only_skips_profiler(self) -> None:
+        analysis = GraphAnalysisResult(node_counts={"user": 100})
+        self._analyze.return_value = analysis
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+                components="structure",
+            )
+
+        self.assertEqual(self._analyze.call_count, 1)
+        self.assertEqual(self._profile.call_count, 0)
+        _, call_kwargs = self._generate_report.call_args
+        self.assertIs(call_kwargs["analysis_result"], analysis)
+        self.assertIsInstance(call_kwargs["profile_result"], FeatureProfileResult)
+        self.assertEqual(call_kwargs["profile_result"].facets_html_paths, {})
+
+    def test_components_feature_only_skips_analyzer(self) -> None:
+        profile = FeatureProfileResult(
+            facets_html_paths={"node:user": ["gs://b/facets.html"]}
+        )
+        self._profile.return_value = profile
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+                components="feature",
+            )
+
+        self.assertEqual(self._analyze.call_count, 0)
+        self.assertEqual(self._profile.call_count, 1)
+        _, call_kwargs = self._generate_report.call_args
+        self.assertIs(call_kwargs["profile_result"], profile)
+        self.assertIsInstance(call_kwargs["analysis_result"], GraphAnalysisResult)
+        self.assertEqual(call_kwargs["analysis_result"].node_counts, {})
+
+    def test_custom_worker_image_uri_passed_through_feature_only(self) -> None:
+        self._profile.return_value = FeatureProfileResult()
+        image_uri = "gcr.io/proj/gbml_dataflow_runtime:20260422-000000"
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+                components="feature",
+                custom_worker_image_uri=image_uri,
+            )
+
+        self.assertEqual(
+            self._profile.call_args.kwargs["custom_worker_image_uri"], image_uri
+        )
+
+    def test_custom_worker_image_uri_passed_through_both(self) -> None:
+        self._analyze.return_value = GraphAnalysisResult()
+        self._profile.return_value = FeatureProfileResult()
+        image_uri = "gcr.io/proj/gbml_dataflow_runtime:20260422-000000"
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+                components="both",
+                custom_worker_image_uri=image_uri,
+            )
+
+        # "both" path submits FeatureProfiler.profile via ThreadPoolExecutor
+        # with (config, resource_config, job_name_prefix, run_timestamp,
+        # custom_worker_image_uri) as positional args.
+        positional = self._profile.call_args.args
+        self.assertEqual(positional[2], _TEST_JOB_NAME_PREFIX)
+        self.assertEqual(positional[3], _TEST_RUN_TIMESTAMP)
+        self.assertEqual(positional[4], image_uri)
+
+    def test_custom_worker_image_uri_defaults_to_none(self) -> None:
+        self._profile.return_value = FeatureProfileResult()
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            _run(
+                DataAnalyzer(),
+                config=_make_config(tmpdir),
+                resource_config=MagicMock(),
+                components="feature",
+            )
+
+        self.assertIsNone(self._profile.call_args.kwargs["custom_worker_image_uri"])
+
+
+class ResolveJobNamePrefixTest(TestCase):
+    """The resolver requires a prefix (CLI or YAML) and validates its shape."""
+
+    def test_uses_yaml_when_cli_unset(self) -> None:
+        self.assertEqual(
+            _resolve_job_name_prefix(cli_value=None, yaml_value="cd-content"),
+            "cd-content",
+        )
+
+    def test_uses_cli_when_yaml_unset(self) -> None:
+        self.assertEqual(
+            _resolve_job_name_prefix(cli_value="svij-test", yaml_value=None),
+            "svij-test",
+        )
+
+    def test_cli_overrides_yaml_when_both_set(self) -> None:
+        with self.assertLogs(level="INFO") as cap:
+            result = _resolve_job_name_prefix(
+                cli_value="svij-test", yaml_value="cd-content"
+            )
+        self.assertEqual(result, "svij-test")
+        self.assertTrue(
+            any("overrides YAML" in msg for msg in cap.output),
+            f"expected override log, got {cap.output}",
+        )
+
+    def test_raises_when_neither_set(self) -> None:
+        with self.assertRaises(ValueError):
+            _resolve_job_name_prefix(cli_value=None, yaml_value=None)
+
+    def test_raises_when_yaml_is_empty_string(self) -> None:
+        # argparse with required=False yields None for absent flags, but
+        # an empty YAML value would slip through without this check.
+        with self.assertRaises(ValueError):
+            _resolve_job_name_prefix(cli_value=None, yaml_value="")
+
+    def test_rejects_uppercase(self) -> None:
+        with self.assertRaises(ValueError):
+            _resolve_job_name_prefix(cli_value="SvijTest", yaml_value=None)
+
+    def test_rejects_underscore(self) -> None:
+        with self.assertRaises(ValueError):
+            _resolve_job_name_prefix(cli_value="svij_test", yaml_value=None)
+
+    def test_rejects_too_long(self) -> None:
+        with self.assertRaises(ValueError):
+            _resolve_job_name_prefix(
+                cli_value="a" * 21,  # 21 chars exceeds the 20-char cap
+                yaml_value=None,
+            )
+
+    def test_accepts_at_length_cap(self) -> None:
+        prefix = "a" + "b" * 19  # exactly 20 chars
+        self.assertEqual(
+            _resolve_job_name_prefix(cli_value=prefix, yaml_value=None), prefix
+        )
diff --git a/tests/unit/analytics/data_analyzer/embedding_diagnostics_test.py b/tests/unit/analytics/data_analyzer/embedding_diagnostics_test.py
new file mode 100644
index 000000000..67b60be51
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/embedding_diagnostics_test.py
@@ -0,0 +1,168 @@
+"""Unit tests for embedding_diagnostics.
+
+BQ calls are mocked by injecting a ``MagicMock`` ``bq_utils`` and stubbing ``run_query``.
+"""
+from typing import Any
+from unittest.mock import MagicMock
+
+from gigl.analytics.data_analyzer.embedding_diagnostics import (
+    EmbeddingDiagnostics,
+    EmbeddingDiagnosticsRequest,
+)
+from tests.test_assets.test_case import TestCase
+
+
+def _mock_row(data: dict[str, Any]) -> MagicMock:
+    row = MagicMock()
+    row.__getitem__ = lambda self, key: data[key]
+    return row
+
+
+def _mock_rows(rows: list[dict[str, Any]]) -> MagicMock:
+    iterator = MagicMock()
+    iterator.__iter__ = lambda self: iter([_mock_row(r) for r in rows])
+    return iterator
+
+
+def _success_row(
+    total: int = 100,
+    unique_count: int = 90,
+    unique_ratio: float = 0.9,
+    top_k: list[dict[str, Any]] | None = None,
+) -> dict[str, Any]:
+    return {
+        "total": total,
+        "unique_count": unique_count,
+        "unique_ratio": unique_ratio,
+        "top_k": top_k if top_k is not None else [],
+    }
+
+
+class EmbeddingDiagnosticsAnalyzeTest(TestCase):
+    def test_empty_requests_returns_empty_mapping(self) -> None:
+        bq_utils = MagicMock()
+        diagnostics = EmbeddingDiagnostics(bq_utils=bq_utils)
+        self.assertEqual(diagnostics.analyze([]), {})
+        bq_utils.run_query.assert_not_called()
+
+    def test_one_query_per_embedding_column(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.run_query.side_effect = lambda query, labels=None: _mock_rows(
+            [_success_row()]
+        )
+        requests = [
+            EmbeddingDiagnosticsRequest(
+                result_key="node:user",
+                bq_table="p.d.users",
+                embedding_columns=["emb_a", "emb_b"],
+            ),
+            EmbeddingDiagnosticsRequest(
+                result_key="node:content",
+                bq_table="p.d.content",
+                embedding_columns=["emb_c"],
+            ),
+        ]
+        EmbeddingDiagnostics(bq_utils=bq_utils).analyze(requests)
+        self.assertEqual(bq_utils.run_query.call_count, 3)
+
+    def test_result_is_populated_from_row(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.run_query.return_value = _mock_rows(
+            [
+                _success_row(
+                    total=1000,
+                    unique_count=980,
+                    unique_ratio=0.98,
+                    top_k=[
+                        {"hash_value": 1, "count_value": 10, "fraction": 0.01},
+                        {"hash_value": 2, "count_value": 5, "fraction": 0.005},
+                    ],
+                )
+            ]
+        )
+        requests = [
+            EmbeddingDiagnosticsRequest(
+                result_key="node:user",
+                bq_table="p.d.users",
+                embedding_columns=["emb"],
+            )
+        ]
+        out = EmbeddingDiagnostics(bq_utils=bq_utils).analyze(requests)
+        result = out["node:user"]["emb"]
+        self.assertEqual(result.total, 1000)
+        self.assertEqual(result.unique_count, 980)
+        self.assertAlmostEqual(result.unique_ratio, 0.98)
+        self.assertEqual(len(result.top_k), 2)
+        self.assertEqual(result.top_k[0].hash, 1)
+        self.assertEqual(result.top_k[0].count, 10)
+        self.assertAlmostEqual(result.top_k[0].fraction, 0.01)
+
+    def test_per_column_failure_is_logged_and_skipped(self) -> None:
+        bq_utils = MagicMock()
+
+        def _side_effect(query: str, labels: Any = None) -> MagicMock:
+            if "emb_bad" in query:
+                raise RuntimeError("BQ permission denied")
+            return _mock_rows([_success_row()])
+
+        bq_utils.run_query.side_effect = _side_effect
+        requests = [
+            EmbeddingDiagnosticsRequest(
+                result_key="node:user",
+                bq_table="p.d.users",
+                embedding_columns=["emb_good", "emb_bad"],
+            )
+        ]
+        with self.assertLogs(level="ERROR") as cap:
+            out = EmbeddingDiagnostics(bq_utils=bq_utils).analyze(requests)
+        self.assertIn("emb_good", out["node:user"])
+        self.assertNotIn("emb_bad", out["node:user"])
+        self.assertTrue(
+            any("emb_bad" in msg for msg in cap.output),
+            f"expected error mentioning emb_bad, got {cap.output}",
+        )
+
+    def test_query_uses_farm_fingerprint_and_table(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.run_query.return_value = _mock_rows([_success_row()])
+        requests = [
+            EmbeddingDiagnosticsRequest(
+                result_key="node:user",
+                bq_table="proj.ds.users",
+                embedding_columns=["emb"],
+            )
+        ]
+        EmbeddingDiagnostics(bq_utils=bq_utils).analyze(requests)
+        query = bq_utils.run_query.call_args_list[0].kwargs["query"]
+        self.assertIn("FARM_FINGERPRINT(TO_JSON_STRING(`emb`))", query)
+        self.assertIn("`proj.ds.users`", query)
+        self.assertIn("LIMIT 20", query)
+
+    def test_top_k_limit_is_configurable(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.run_query.return_value = _mock_rows([_success_row()])
+        requests = [
+            EmbeddingDiagnosticsRequest(
+                result_key="node:user",
+                bq_table="p.d.users",
+                embedding_columns=["emb"],
+            )
+        ]
+        EmbeddingDiagnostics(bq_utils=bq_utils, top_k=5).analyze(requests)
+        query = bq_utils.run_query.call_args_list[0].kwargs["query"]
+        self.assertIn("LIMIT 5", query)
+
+    def test_empty_row_result_is_logged_and_omitted(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.run_query.return_value = _mock_rows([])
+        requests = [
+            EmbeddingDiagnosticsRequest(
+                result_key="node:user",
+                bq_table="p.d.users",
+                embedding_columns=["emb"],
+            )
+        ]
+        with self.assertLogs(level="ERROR"):
+            out = EmbeddingDiagnostics(bq_utils=bq_utils).analyze(requests)
+        # Failure is caught, result_key is omitted.
+        self.assertNotIn("node:user", out)
diff --git a/tests/unit/analytics/data_analyzer/embedding_projection_test.py b/tests/unit/analytics/data_analyzer/embedding_projection_test.py
new file mode 100644
index 000000000..157351bd2
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/embedding_projection_test.py
@@ -0,0 +1,182 @@
+"""Unit tests for embedding_projection."""
+from google.cloud.bigquery import SchemaField
+
+from gigl.analytics.data_analyzer.embedding_projection import (
+    build_projection,
+    detect_embedding_columns,
+    is_embedding_column,
+)
+from tests.test_assets.test_case import TestCase
+
+
+def _schema(
+    fields: list[tuple[str, str, str]],
+) -> dict[str, SchemaField]:
+    return {
+        name: SchemaField(name=name, field_type=field_type, mode=mode)
+        for name, field_type, mode in fields
+    }
+
+
+class IsEmbeddingColumnTest(TestCase):
+    def test_repeated_float64_is_embedding(self) -> None:
+        field = SchemaField(name="emb", field_type="FLOAT64", mode="REPEATED")
+        self.assertTrue(is_embedding_column(field))
+
+    def test_repeated_float_is_embedding(self) -> None:
+        field = SchemaField(name="emb", field_type="FLOAT", mode="REPEATED")
+        self.assertTrue(is_embedding_column(field))
+
+    def test_repeated_numeric_is_embedding(self) -> None:
+        field = SchemaField(name="weights", field_type="NUMERIC", mode="REPEATED")
+        self.assertTrue(is_embedding_column(field))
+
+    def test_repeated_string_is_not_embedding(self) -> None:
+        field = SchemaField(name="tags", field_type="STRING", mode="REPEATED")
+        self.assertFalse(is_embedding_column(field))
+
+    def test_scalar_float_is_not_embedding(self) -> None:
+        field = SchemaField(name="weight", field_type="FLOAT64", mode="NULLABLE")
+        self.assertFalse(is_embedding_column(field))
+
+
+class DetectEmbeddingColumnsTest(TestCase):
+    def test_returns_repeated_float_family_in_schema_order(self) -> None:
+        schema = _schema(
+            [
+                ("age", "INT64", "NULLABLE"),
+                ("emb_a", "FLOAT64", "REPEATED"),
+                ("country", "STRING", "NULLABLE"),
+                ("emb_b", "NUMERIC", "REPEATED"),
+                ("tags", "STRING", "REPEATED"),
+            ]
+        )
+        self.assertEqual(
+            detect_embedding_columns(schema, excluded=set()),
+            ["emb_a", "emb_b"],
+        )
+
+    def test_excluded_columns_dropped(self) -> None:
+        schema = _schema(
+            [
+                ("emb_a", "FLOAT64", "REPEATED"),
+                ("emb_b", "FLOAT64", "REPEATED"),
+            ]
+        )
+        self.assertEqual(
+            detect_embedding_columns(schema, excluded={"emb_a"}),
+            ["emb_b"],
+        )
+
+
+class BuildProjectionTest(TestCase):
+    def test_scalar_columns_pass_through_backtick_quoted(self) -> None:
+        schema = _schema(
+            [
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+            ]
+        )
+        result = build_projection(schema, excluded=set())
+        self.assertEqual(
+            result.projection,
+            [("age", "`age`"), ("country", "`country`")],
+        )
+        self.assertEqual(result.embedding_columns, [])
+
+    def test_excluded_columns_dropped(self) -> None:
+        schema = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+            ]
+        )
+        result = build_projection(schema, excluded={"uid"})
+        self.assertEqual(result.projection, [("age", "`age`")])
+
+    def test_embedding_column_expands_to_four_hygiene_entries(self) -> None:
+        schema = _schema([("emb", "FLOAT64", "REPEATED")])
+        result = build_projection(schema, excluded=set())
+        self.assertEqual(
+            [name for name, _ in result.projection],
+            ["emb_len", "emb_has_nan", "emb_has_inf", "emb_is_all_zero"],
+        )
+        self.assertEqual(result.embedding_columns, ["emb"])
+
+    def test_embedding_hygiene_expressions_match_expected_sql(self) -> None:
+        schema = _schema([("emb", "FLOAT64", "REPEATED")])
+        result = build_projection(schema, excluded=set())
+        by_name = dict(result.projection)
+        self.assertEqual(by_name["emb_len"], "ARRAY_LENGTH(`emb`)")
+        self.assertEqual(
+            by_name["emb_has_nan"],
+            "IFNULL((SELECT LOGICAL_OR(IS_NAN(v)) FROM UNNEST(`emb`) v), FALSE)",
+        )
+        self.assertEqual(
+            by_name["emb_has_inf"],
+            "IFNULL((SELECT LOGICAL_OR(IS_INF(v)) FROM UNNEST(`emb`) v), FALSE)",
+        )
+        self.assertEqual(
+            by_name["emb_is_all_zero"],
+            "IFNULL((SELECT LOGICAL_AND(v = 0) FROM UNNEST(`emb`) v), FALSE)",
+        )
+
+    def test_hash_column_is_not_in_projection(self) -> None:
+        # Hash column lives in the diagnostics pass, not the TFDV projection.
+        schema = _schema([("emb", "FLOAT64", "REPEATED")])
+        result = build_projection(schema, excluded=set())
+        names = {name for name, _ in result.projection}
+        self.assertNotIn("emb_hash", names)
+
+    def test_mixed_scalar_and_embedding_preserves_schema_order(self) -> None:
+        schema = _schema(
+            [
+                ("age", "INT64", "NULLABLE"),
+                ("emb", "FLOAT64", "REPEATED"),
+                ("country", "STRING", "NULLABLE"),
+            ]
+        )
+        result = build_projection(schema, excluded=set())
+        self.assertEqual(
+            [name for name, _ in result.projection],
+            [
+                "age",
+                "emb_len",
+                "emb_has_nan",
+                "emb_has_inf",
+                "emb_is_all_zero",
+                "country",
+            ],
+        )
+
+    def test_repeated_non_float_columns_are_skipped(self) -> None:
+        schema = _schema(
+            [
+                ("age", "INT64", "NULLABLE"),
+                ("tags", "STRING", "REPEATED"),
+                ("ids", "INT64", "REPEATED"),
+            ]
+        )
+        with self.assertLogs(level="INFO") as cap:
+            result = build_projection(schema, excluded=set())
+        self.assertEqual([name for name, _ in result.projection], ["age"])
+        skip_log = " ".join(cap.output)
+        self.assertIn("tags", skip_log)
+        self.assertIn("ids", skip_log)
+
+    def test_non_profileable_scalar_types_are_skipped(self) -> None:
+        schema = _schema(
+            [
+                ("age", "INT64", "NULLABLE"),
+                ("extras", "RECORD", "NULLABLE"),
+                ("location", "GEOGRAPHY", "NULLABLE"),
+                ("event_time", "TIMESTAMP", "NULLABLE"),
+            ]
+        )
+        with self.assertLogs(level="INFO") as cap:
+            result = build_projection(schema, excluded=set())
+        self.assertEqual(result.projection, [("age", "`age`")])
+        skip_log = " ".join(cap.output)
+        self.assertIn("RECORD", skip_log)
+        self.assertIn("GEOGRAPHY", skip_log)
+        self.assertIn("TIMESTAMP", skip_log)
diff --git a/tests/unit/analytics/data_analyzer/feature_profiler_test.py b/tests/unit/analytics/data_analyzer/feature_profiler_test.py
new file mode 100644
index 000000000..850a7eb77
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/feature_profiler_test.py
@@ -0,0 +1,1189 @@
+"""Unit tests for the FeatureProfiler.
+
+Dataflow job execution is mocked: ``beam.Pipeline`` is replaced with a
+dummy that records construction, and ``init_beam_pipeline_options`` is
+patched so tests don't touch real GCP resources.
+
+Type-specific projection logic (scalar-type filtering, embedding
+expansion) is exercised in ``embedding_projection_test.py``; this file
+focuses on how ``FeatureProfiler`` wires projection → Beam → diagnostics
+→ sidecar.
+"""
+import itertools
+import tempfile
+from typing import Optional
+from unittest.mock import MagicMock, patch
+
+import apache_beam as beam
+from google.cloud.bigquery import SchemaField
+
+from gigl.analytics.data_analyzer.config import (
+    DataAnalyzerConfig,
+    EdgeTableSpec,
+    NodeTableSpec,
+)
+from gigl.analytics.data_analyzer.feature_profiler import (
+    FeatureProfiler,
+    _collect_profile_tasks,
+    _resolve_projection,
+)
+from gigl.analytics.data_analyzer.types import EmbeddingDiagnosticsResult, TopKEntry
+from gigl.env.pipelines_config import GiglResourceConfigWrapper
+from gigl.src.common.constants.components import GiGLComponents
+from tests.test_assets.test_case import TestCase
+
+# Fixed values used wherever ``FeatureProfiler.profile`` is called from these
+# tests. The exact strings show up in assertions for the Dataflow job-name
+# suffix, so they're constants rather than per-test fixtures.
+_TEST_JOB_NAME_PREFIX = "tp"
+_TEST_RUN_TIMESTAMP = "20260101-0000"
+
+
+def _schema(
+    fields: list[tuple[str, str, str]],
+) -> dict[str, SchemaField]:
+    """Build a schema dict from ``(name, field_type, mode)`` tuples."""
+    return {
+        name: SchemaField(name=name, field_type=field_type, mode=mode)
+        for name, field_type, mode in fields
+    }
+
+
+def _run_profile(prof: FeatureProfiler, **kwargs):
+    """Invoke ``FeatureProfiler.profile`` with the test prefix and timestamp.
+
+    Centralizes the required kwargs (``job_name_prefix``, ``run_timestamp``)
+    so individual tests stay focused on their own setup.
+    """
+    return prof.profile(
+        job_name_prefix=_TEST_JOB_NAME_PREFIX,
+        run_timestamp=_TEST_RUN_TIMESTAMP,
+        **kwargs,
+    )
+
+
+def _make_config(
+    node_specs: Optional[list[NodeTableSpec]] = None,
+    edge_specs: Optional[list[EdgeTableSpec]] = None,
+    output_gcs_path: Optional[str] = None,
+) -> DataAnalyzerConfig:
+    # A temp dir keeps the sidecar write side-effect local to the test.
+    out_path = (
+        output_gcs_path
+        if output_gcs_path is not None
+        else tempfile.mkdtemp(prefix="feature_profiler_test_")
+    )
+    return DataAnalyzerConfig(
+        node_tables=node_specs
+        if node_specs is not None
+        else [
+            NodeTableSpec(
+                bq_table="p.d.users",
+                node_type="user",
+                id_column="uid",
+                feature_columns=["age", "country"],
+            )
+        ],
+        edge_tables=edge_specs
+        if edge_specs is not None
+        else [
+            EdgeTableSpec(
+                bq_table="p.d.follows",
+                edge_type="follows",
+                src_id_column="src",
+                dst_id_column="dst",
+                src_node_type="user",
+                dst_node_type="user",
+                feature_columns=["weight"],
+            )
+        ],
+        output_gcs_path=out_path,
+    )
+
+
+class ResolveProjectionTest(TestCase):
+    def test_auto_infers_from_schema_when_explicit_is_empty(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+            ]
+        )
+        result, schema_error = _resolve_projection(
+            bq_table="p.d.users",
+            explicit=[],
+            excluded={"uid"},
+            bq_utils=bq_utils,
+        )
+        self.assertIsNone(schema_error)
+        self.assertEqual([name for name, _ in result.projection], ["age", "country"])
+        self.assertEqual(result.embedding_columns, [])
+
+    def test_honors_explicit_feature_columns(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+                ("extra_feature", "FLOAT64", "NULLABLE"),
+            ]
+        )
+        result, schema_error = _resolve_projection(
+            bq_table="p.d.users",
+            explicit=["age", "country"],
+            excluded={"uid"},
+            bq_utils=bq_utils,
+        )
+        self.assertIsNone(schema_error)
+        self.assertEqual([name for name, _ in result.projection], ["age", "country"])
+
+    def test_logs_and_drops_explicit_columns_not_in_schema(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.return_value = _schema(
+            [("age", "INT64", "NULLABLE")]
+        )
+        with self.assertLogs(level="WARNING") as cap:
+            result, schema_error = _resolve_projection(
+                bq_table="p.d.users",
+                explicit=["age", "phantom"],
+                excluded=set(),
+                bq_utils=bq_utils,
+            )
+        self.assertIsNone(schema_error)
+        self.assertEqual([name for name, _ in result.projection], ["age"])
+        self.assertTrue(
+            any("phantom" in msg for msg in cap.output),
+            f"expected warning about phantom, got {cap.output}",
+        )
+
+    def test_schema_fetch_failure_returns_empty_projection(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.side_effect = RuntimeError("permission denied")
+        with self.assertLogs(level="WARNING"):
+            result, schema_error = _resolve_projection(
+                bq_table="p.d.missing",
+                explicit=[],
+                excluded=set(),
+                bq_utils=bq_utils,
+            )
+        self.assertEqual(result.projection, [])
+        self.assertEqual(result.embedding_columns, [])
+        self.assertIsNotNone(schema_error)
+        self.assertIn("permission denied", schema_error)
+        self.assertIn("p.d.missing", schema_error)
+
+    def test_bool_scalar_columns_are_cast_to_int64(self) -> None:
+        """BOOL/BOOLEAN scalars must be CAST to INT64 in the projection.
+
+        ``BqTableToRecordBatch`` wraps each scalar value in a single-element
+        list before TFDV consumes it. TFDV's
+        ``get_feature_type_from_arrow_type`` rejects ``list<bool>``; passing
+        a raw BOOL crashes the Dataflow job in BasicStatsGenerator. Casting
+        to INT64 keeps the BOOL semantics profileable as an int feature.
+        """
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("is_active", "BOOL", "NULLABLE"),
+                ("flagged", "BOOLEAN", "NULLABLE"),
+                ("age", "INT64", "NULLABLE"),
+            ]
+        )
+        result, schema_error = _resolve_projection(
+            bq_table="p.d.users",
+            explicit=[],
+            excluded={"uid"},
+            bq_utils=bq_utils,
+        )
+        self.assertIsNone(schema_error)
+        by_name = dict(result.projection)
+        self.assertEqual(by_name["is_active"], "CAST(`is_active` AS INT64)")
+        self.assertEqual(by_name["flagged"], "CAST(`flagged` AS INT64)")
+        # Non-BOOL scalars retain their pass-through expression.
+        self.assertEqual(by_name["age"], "`age`")
+
+    def test_embedding_column_expands_into_hygiene_companions(self) -> None:
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("emb", "FLOAT64", "REPEATED"),
+            ]
+        )
+        result, schema_error = _resolve_projection(
+            bq_table="p.d.users",
+            explicit=[],
+            excluded={"uid"},
+            bq_utils=bq_utils,
+        )
+        self.assertIsNone(schema_error)
+        self.assertEqual(
+            [name for name, _ in result.projection],
+            ["emb_len", "emb_has_nan", "emb_has_inf", "emb_is_all_zero"],
+        )
+        self.assertEqual(result.embedding_columns, ["emb"])
+        # The three boolean hygiene companions must be CAST to INT64;
+        # otherwise TFDV crashes on list<bool>.
+        by_name = dict(result.projection)
+        self.assertIn("CAST", by_name["emb_has_nan"])
+        self.assertIn("AS INT64", by_name["emb_has_nan"])
+        self.assertIn("CAST", by_name["emb_has_inf"])
+        self.assertIn("AS INT64", by_name["emb_has_inf"])
+        self.assertIn("CAST", by_name["emb_is_all_zero"])
+        self.assertIn("AS INT64", by_name["emb_is_all_zero"])
+        # _len stays as a plain ARRAY_LENGTH (already INT64).
+        self.assertNotIn("CAST", by_name["emb_len"])
+
+
+class CollectProfileTasksTest(TestCase):
+    def test_preserves_explicit_feature_columns_and_skips_inference(self) -> None:
+        config = _make_config()
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.side_effect = lambda table: _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+                ("src", "STRING", "REQUIRED"),
+                ("dst", "STRING", "REQUIRED"),
+                ("weight", "FLOAT64", "NULLABLE"),
+            ]
+        )
+        tasks, errors = _collect_profile_tasks(config, bq_utils)
+        self.assertEqual(errors, [])
+        by_key = {t.result_key: t for t in tasks}
+        self.assertEqual(
+            [name for name, _ in by_key["node:user"].projection], ["age", "country"]
+        )
+        self.assertEqual(
+            [name for name, _ in by_key["edge:follows"].projection], ["weight"]
+        )
+        self.assertEqual(by_key["node:user"].embedding_columns, [])
+
+    def test_infers_columns_when_feature_columns_empty(self) -> None:
+        config = _make_config(
+            node_specs=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                    feature_columns=[],
+                )
+            ],
+            edge_specs=[
+                EdgeTableSpec(
+                    bq_table="p.d.follows",
+                    edge_type="follows",
+                    src_id_column="src",
+                    dst_id_column="dst",
+                    src_node_type="user",
+                    dst_node_type="user",
+                    feature_columns=[],
+                )
+            ],
+        )
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.side_effect = lambda table: {
+            "p.d.users": _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                    ("country", "STRING", "NULLABLE"),
+                ]
+            ),
+            "p.d.follows": _schema(
+                [
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight", "FLOAT64", "NULLABLE"),
+                ]
+            ),
+        }[table]
+        tasks, errors = _collect_profile_tasks(config, bq_utils)
+        self.assertEqual(errors, [])
+        by_key = {t.result_key: t for t in tasks}
+        self.assertEqual(
+            [name for name, _ in by_key["node:user"].projection], ["age", "country"]
+        )
+        self.assertEqual(
+            [name for name, _ in by_key["edge:follows"].projection], ["weight"]
+        )
+
+    def test_embedding_columns_surface_on_task(self) -> None:
+        config = _make_config(
+            node_specs=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                    feature_columns=[],
+                )
+            ],
+            edge_specs=[],
+        )
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("emb", "FLOAT64", "REPEATED"),
+            ]
+        )
+        tasks, errors = _collect_profile_tasks(config, bq_utils)
+        self.assertEqual(errors, [])
+        self.assertEqual(len(tasks), 1)
+        task = tasks[0]
+        self.assertEqual(task.embedding_columns, ["emb"])
+        # age + four hygiene companions for emb.
+        self.assertEqual(
+            [name for name, _ in task.projection],
+            ["age", "emb_len", "emb_has_nan", "emb_has_inf", "emb_is_all_zero"],
+        )
+
+    def test_skips_table_when_resolved_projection_is_empty(self) -> None:
+        config = _make_config(
+            node_specs=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                    feature_columns=[],
+                )
+            ],
+            edge_specs=[],
+        )
+        bq_utils = MagicMock()
+        bq_utils.fetch_bq_table_schema.return_value = _schema(
+            [("uid", "STRING", "REQUIRED")]
+        )
+        with self.assertLogs(level="WARNING") as log_capture:
+            tasks, errors = _collect_profile_tasks(config, bq_utils)
+        self.assertEqual(tasks, [])
+        self.assertTrue(
+            any("node:user" in msg for msg in log_capture.output),
+            f"expected warning mentioning node:user, got {log_capture.output}",
+        )
+        self.assertEqual(len(errors), 1)
+        self.assertEqual(errors[0].result_key, "node:user")
+        self.assertEqual(errors[0].stage, "empty_projection")
+        self.assertEqual(errors[0].bq_table, "p.d.users")
+
+    def test_schema_fetch_failure_skips_table_without_crashing(self) -> None:
+        config = _make_config(
+            node_specs=[
+                NodeTableSpec(
+                    bq_table="p.d.broken",
+                    node_type="broken",
+                    id_column="uid",
+                    feature_columns=[],
+                ),
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                    feature_columns=[],
+                ),
+            ],
+            edge_specs=[],
+        )
+        bq_utils = MagicMock()
+
+        def _maybe_raise(table: str):
+            if table == "p.d.broken":
+                raise RuntimeError("permission denied")
+            return _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                ]
+            )
+
+        bq_utils.fetch_bq_table_schema.side_effect = _maybe_raise
+        with self.assertLogs(level="WARNING") as log_capture:
+            tasks, errors = _collect_profile_tasks(config, bq_utils)
+        self.assertEqual([t.result_key for t in tasks], ["node:user"])
+        self.assertTrue(
+            any("broken" in msg for msg in log_capture.output),
+            f"expected warning mentioning broken table, got {log_capture.output}",
+        )
+        self.assertEqual(len(errors), 1)
+        self.assertEqual(errors[0].result_key, "node:broken")
+        self.assertEqual(errors[0].stage, "schema_fetch")
+        self.assertIn("permission denied", errors[0].message)
+
+
+class FeatureProfilerRunTest(TestCase):
+    def setUp(self) -> None:
+        super().setUp()
+        self._resource_config = MagicMock(spec=GiglResourceConfigWrapper)
+        self._resource_config.project = "test-project"
+
+        self._gcp_options = MagicMock(name="GoogleCloudOptions")
+        self._gcp_options.job_name = "test-job-name"
+        self._gcp_options.project = "test-project"
+        self._gcp_options.region = "us-central1"
+
+        pipeline_options = MagicMock(name="PipelineOptions")
+        pipeline_options.view_as = MagicMock(return_value=self._gcp_options)
+
+        self._init_beam_pipeline_options = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.init_beam_pipeline_options",
+            return_value=pipeline_options,
+        ).start()
+
+        self._bq_utils_cls = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.BqUtils",
+        ).start()
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.return_value = {}
+
+        self._diagnostics_cls = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.EmbeddingDiagnostics",
+        ).start()
+        self._diagnostics_cls.return_value.analyze.return_value = {}
+
+        self._pipelines: list[MagicMock] = []
+
+        def _make_pipeline(*args, **kwargs):
+            pipeline = MagicMock(name="Pipeline")
+            pipeline_result = MagicMock(name="PipelineResult")
+            pipeline_result.wait_until_finish = MagicMock(return_value=None)
+            pipeline_result.job_id = MagicMock(return_value="test-job-id")
+            pipeline.run = MagicMock(return_value=pipeline_result)
+            self._pipelines.append(pipeline)
+            return pipeline
+
+        self._pipeline_ctor = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.beam.Pipeline",
+            side_effect=_make_pipeline,
+        ).start()
+
+        self.addCleanup(patch.stopall)
+
+    def test_returns_empty_when_inferred_columns_are_all_ids(self) -> None:
+        config = _make_config(
+            node_specs=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                    feature_columns=[],
+                )
+            ],
+            edge_specs=[
+                EdgeTableSpec(
+                    bq_table="p.d.follows",
+                    edge_type="follows",
+                    src_id_column="src",
+                    dst_id_column="dst",
+                    src_node_type="user",
+                    dst_node_type="user",
+                    feature_columns=[],
+                )
+            ],
+        )
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: {
+                "p.d.users": _schema([("uid", "STRING", "REQUIRED")]),
+                "p.d.follows": _schema(
+                    [
+                        ("src", "STRING", "REQUIRED"),
+                        ("dst", "STRING", "REQUIRED"),
+                    ]
+                ),
+            }[table]
+        )
+        profiler = FeatureProfiler()
+        result = _run_profile(
+            profiler, config=config, resource_config=self._resource_config
+        )
+        self.assertEqual(result.facets_html_paths, {})
+        self.assertEqual(result.stats_paths, {})
+        self.assertEqual(len(self._pipelines), 0)
+
+    def test_launches_one_pipeline_per_feature_table(self) -> None:
+        config = _make_config()
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                    ("country", "STRING", "NULLABLE"),
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight", "FLOAT64", "NULLABLE"),
+                ]
+            )
+        )
+        profiler = FeatureProfiler()
+        result = _run_profile(
+            profiler, config=config, resource_config=self._resource_config
+        )
+        self.assertEqual(len(self._pipelines), 2)
+        self.assertEqual(
+            sorted(result.facets_html_paths.keys()),
+            ["edge:follows", "node:user"],
+        )
+        self.assertEqual(
+            sorted(result.stats_paths.keys()),
+            ["edge:follows", "node:user"],
+        )
+        component_kwargs = [
+            call.kwargs.get("component")
+            for call in self._init_beam_pipeline_options.call_args_list
+        ]
+        self.assertTrue(all(c == GiGLComponents.DataAnalyzer for c in component_kwargs))
+
+    def test_gcs_paths_use_expected_layout(self) -> None:
+        with tempfile.TemporaryDirectory() as tmp:
+            self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+                lambda table: _schema(
+                    [
+                        ("uid", "STRING", "REQUIRED"),
+                        ("age", "INT64", "NULLABLE"),
+                        ("country", "STRING", "NULLABLE"),
+                        ("src", "STRING", "REQUIRED"),
+                        ("dst", "STRING", "REQUIRED"),
+                        ("weight", "FLOAT64", "NULLABLE"),
+                    ]
+                )
+            )
+            profiler = FeatureProfiler()
+            result = _run_profile(
+                profiler,
+                config=_make_config(output_gcs_path=f"{tmp}/run1/"),
+                resource_config=self._resource_config,
+            )
+        # Single-chunk tables produce a list of length 1 with the historical
+        # flat path (no chunk_NN/ subdir).
+        self.assertEqual(
+            result.facets_html_paths["node:user"],
+            [f"{tmp}/run1/feature_profiler/nodes/user/facets.html"],
+        )
+        self.assertEqual(
+            result.stats_paths["node:user"],
+            [f"{tmp}/run1/feature_profiler/nodes/user/stats.tfrecord"],
+        )
+        self.assertEqual(
+            result.facets_html_paths["edge:follows"],
+            [f"{tmp}/run1/feature_profiler/edges/follows/facets.html"],
+        )
+
+    def test_individual_pipeline_failure_is_caught(self) -> None:
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                    ("country", "STRING", "NULLABLE"),
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight", "FLOAT64", "NULLABLE"),
+                ]
+            )
+        )
+        counter = itertools.count(1)
+
+        def _make_pipeline_fail_second(*args, **kwargs):
+            pipeline = MagicMock(name="Pipeline")
+            pipeline_result = MagicMock(name="PipelineResult")
+            pipeline_result.job_id = MagicMock(return_value="test-job-id")
+            if next(counter) == 2:
+                pipeline_result.wait_until_finish = MagicMock(
+                    side_effect=RuntimeError("Dataflow boom")
+                )
+            else:
+                pipeline_result.wait_until_finish = MagicMock(return_value=None)
+            pipeline.run = MagicMock(return_value=pipeline_result)
+            self._pipelines.append(pipeline)
+            return pipeline
+
+        self._pipeline_ctor.side_effect = _make_pipeline_fail_second
+
+        profiler = FeatureProfiler()
+        result = _run_profile(
+            profiler,
+            config=_make_config(),
+            resource_config=self._resource_config,
+        )
+        self.assertEqual(len(self._pipelines), 2)
+        total_keys = set(result.facets_html_paths.keys())
+        self.assertEqual(len(total_keys), 1)
+        self.assertLessEqual(total_keys, {"node:user", "edge:follows"})
+        # The failed pipeline shows up as a structured error, so the report can
+        # surface it instead of silently dropping the table.
+        self.assertEqual(len(result.errors), 1)
+        err = result.errors[0]
+        self.assertIn(err.result_key, {"node:user", "edge:follows"})
+        self.assertEqual(err.stage, "dataflow")
+        self.assertIn("Dataflow boom", err.message)
+        # The Dataflow job id and console URL should be captured so the report
+        # can deep-link to the failed job's logs.
+        self.assertEqual(err.job_id, "test-job-id")
+        self.assertEqual(err.job_name, "test-job-name")
+        self.assertIsNotNone(err.console_url)
+        self.assertIn("test-job-id", err.console_url)
+        self.assertIn("us-central1", err.console_url)
+        self.assertIn("test-project", err.console_url)
+
+    def test_embedding_diagnostics_failure_is_recorded(self) -> None:
+        """When the diagnostics pass raises, every requesting table gets an error."""
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("emb", "FLOAT64", "REPEATED"),
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight_emb", "FLOAT64", "REPEATED"),
+                ]
+            )
+        )
+        self._diagnostics_cls.return_value.analyze.side_effect = RuntimeError(
+            "diagnostics down"
+        )
+        profiler = FeatureProfiler()
+        result = _run_profile(
+            profiler,
+            config=_make_config(
+                node_specs=[
+                    NodeTableSpec(
+                        bq_table="p.d.users",
+                        node_type="user",
+                        id_column="uid",
+                        feature_columns=[],
+                    )
+                ],
+                edge_specs=[
+                    EdgeTableSpec(
+                        bq_table="p.d.follows",
+                        edge_type="follows",
+                        src_id_column="src",
+                        dst_id_column="dst",
+                        src_node_type="user",
+                        dst_node_type="user",
+                        feature_columns=[],
+                    )
+                ],
+            ),
+            resource_config=self._resource_config,
+        )
+        diagnostics_errors = [
+            e for e in result.errors if e.stage == "embedding_diagnostics"
+        ]
+        self.assertEqual(
+            sorted(e.result_key for e in diagnostics_errors),
+            ["edge:follows", "node:user"],
+        )
+        for e in diagnostics_errors:
+            self.assertIn("diagnostics down", e.message)
+
+    def test_uses_data_analyzer_job_name_suffix(self) -> None:
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                    ("country", "STRING", "NULLABLE"),
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight", "FLOAT64", "NULLABLE"),
+                ]
+            )
+        )
+        profiler = FeatureProfiler()
+        _run_profile(
+            profiler,
+            config=_make_config(),
+            resource_config=self._resource_config,
+        )
+        suffixes = {
+            call.kwargs.get("job_name_suffix")
+            for call in self._init_beam_pipeline_options.call_args_list
+        }
+        self.assertEqual(
+            suffixes,
+            {
+                f"{_TEST_JOB_NAME_PREFIX}-{_TEST_RUN_TIMESTAMP}-profile-node-user",
+                f"{_TEST_JOB_NAME_PREFIX}-{_TEST_RUN_TIMESTAMP}-profile-edge-follows",
+            },
+        )
+        # Guard against regressing the rename from "data-analyzer" to "analyzer":
+        # the static prefix lives in the applied_task_identifier kwarg.
+        identifiers = {
+            call.kwargs.get("applied_task_identifier")
+            for call in self._init_beam_pipeline_options.call_args_list
+        }
+        self.assertEqual(identifiers, {"analyzer"})
+
+    def test_runs_embedding_diagnostics_for_tables_with_embeddings(self) -> None:
+        config = _make_config(
+            node_specs=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                    feature_columns=[],
+                )
+            ],
+            edge_specs=[],
+        )
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("emb", "FLOAT64", "REPEATED"),
+            ]
+        )
+        self._diagnostics_cls.return_value.analyze.return_value = {
+            "node:user": {
+                "emb": EmbeddingDiagnosticsResult(
+                    total=100,
+                    unique_count=98,
+                    unique_ratio=0.98,
+                    top_k=[TopKEntry(hash=1, count=2, fraction=0.02)],
+                )
+            }
+        }
+        profiler = FeatureProfiler()
+        result = _run_profile(
+            profiler, config=config, resource_config=self._resource_config
+        )
+
+        # Diagnostics invoked once with a single request for the embedding column.
+        self._diagnostics_cls.return_value.analyze.assert_called_once()
+        call_arg = self._diagnostics_cls.return_value.analyze.call_args[0][0]
+        self.assertEqual(len(call_arg), 1)
+        self.assertEqual(call_arg[0].result_key, "node:user")
+        self.assertEqual(call_arg[0].embedding_columns, ["emb"])
+
+        self.assertIn("node:user", result.embedding_diagnostics)
+        self.assertEqual(
+            result.embedding_diagnostics["node:user"]["emb"].unique_ratio, 0.98
+        )
+
+    def test_skips_embedding_diagnostics_when_no_embeddings(self) -> None:
+        config = _make_config()
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                    ("country", "STRING", "NULLABLE"),
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight", "FLOAT64", "NULLABLE"),
+                ]
+            )
+        )
+        profiler = FeatureProfiler()
+        _run_profile(profiler, config=config, resource_config=self._resource_config)
+        self._diagnostics_cls.return_value.analyze.assert_not_called()
+
+    def test_writes_feature_profile_sidecar(self) -> None:
+        with tempfile.TemporaryDirectory() as tmp:
+            self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+                lambda table: _schema(
+                    [
+                        ("uid", "STRING", "REQUIRED"),
+                        ("age", "INT64", "NULLABLE"),
+                        ("country", "STRING", "NULLABLE"),
+                        ("src", "STRING", "REQUIRED"),
+                        ("dst", "STRING", "REQUIRED"),
+                        ("weight", "FLOAT64", "NULLABLE"),
+                    ]
+                )
+            )
+            profiler = FeatureProfiler()
+            _run_profile(
+                profiler,
+                config=_make_config(output_gcs_path=tmp),
+                resource_config=self._resource_config,
+            )
+            import json
+            from pathlib import Path
+
+            sidecar = Path(tmp) / "feature_profile.json"
+            self.assertTrue(sidecar.exists(), f"{sidecar} was not written")
+
+            payload = json.loads(sidecar.read_text())
+            self.assertEqual(payload["schema_version"], "1")
+            self.assertEqual(payload["component"], "feature_profile")
+            self.assertIn("data", payload)
+
+    def test_forwards_preprocessor_sizing_to_beam_options(self) -> None:
+        """Node and edge tasks pull sizing from their matching preprocessor config.
+
+        Node tasks use ``node_preprocessor_config`` and edge tasks use
+        ``edge_preprocessor_config``: the analyzer reuses the preprocessor's
+        Dataflow sizing for the same kind of table rather than declaring its
+        own block. This follows from mirroring the
+        ``data_preprocessor.lib.transform.utils.transform_features`` pattern and
+        is verified via the kwargs passed to ``init_beam_pipeline_options`` per table.
+        """
+        node_block = self._resource_config.preprocessor_config.node_preprocessor_config
+        node_block.machine_type = "n2d-highmem-64"
+        node_block.num_workers = 4
+        node_block.max_num_workers = 128
+        node_block.disk_size_gb = 300
+        node_block.timeout = 10800
+
+        edge_block = self._resource_config.preprocessor_config.edge_preprocessor_config
+        edge_block.machine_type = "n2d-highmem-32"
+        edge_block.num_workers = 1
+        edge_block.max_num_workers = 64
+        edge_block.disk_size_gb = 200
+        edge_block.timeout = 0  # falsy timeout collapses to None
+
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                    ("country", "STRING", "NULLABLE"),
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight", "FLOAT64", "NULLABLE"),
+                ]
+            )
+        )
+
+        profiler = FeatureProfiler()
+        _run_profile(
+            profiler, config=_make_config(), resource_config=self._resource_config
+        )
+
+        node_suffix = f"{_TEST_JOB_NAME_PREFIX}-{_TEST_RUN_TIMESTAMP}-profile-node-user"
+        edge_suffix = (
+            f"{_TEST_JOB_NAME_PREFIX}-{_TEST_RUN_TIMESTAMP}-profile-edge-follows"
+        )
+        calls_by_suffix = {
+            call.kwargs.get("job_name_suffix"): call
+            for call in self._init_beam_pipeline_options.call_args_list
+        }
+        self.assertIn(node_suffix, calls_by_suffix)
+        self.assertIn(edge_suffix, calls_by_suffix)
+
+        node_kwargs = calls_by_suffix[node_suffix].kwargs
+        self.assertEqual(node_kwargs["machine_type"], "n2d-highmem-64")
+        self.assertEqual(node_kwargs["num_workers"], 4)
+        self.assertEqual(node_kwargs["max_num_workers"], 128)
+        self.assertEqual(node_kwargs["disk_size_gb"], 300)
+        self.assertEqual(node_kwargs["timeout_seconds"], 10800)
+
+        edge_kwargs = calls_by_suffix[edge_suffix].kwargs
+        self.assertEqual(edge_kwargs["machine_type"], "n2d-highmem-32")
+        self.assertEqual(edge_kwargs["num_workers"], 1)
+        self.assertEqual(edge_kwargs["max_num_workers"], 64)
+        self.assertEqual(edge_kwargs["disk_size_gb"], 200)
+        self.assertIsNone(edge_kwargs["timeout_seconds"])
+
+    def test_passes_sharded_read_config_keyed_per_table_kind(self) -> None:
+        """Node tasks shard on ``id_column``; edge tasks shard on ``src_id_column``.
+
+        The ``BigQueryShardedReadConfig`` carries the shard key and points
+        at ``resource_config.temp_assets_bq_dataset_name`` for the BQ
+        export temp dataset, mirroring the data_preprocessor's
+        ``ShardedExportRead`` pattern.
+        """
+        self._resource_config.temp_assets_bq_dataset_name = "gigl_temp_assets"
+
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.side_effect = (
+            lambda table: _schema(
+                [
+                    ("uid", "STRING", "REQUIRED"),
+                    ("age", "INT64", "NULLABLE"),
+                    ("country", "STRING", "NULLABLE"),
+                    ("src", "STRING", "REQUIRED"),
+                    ("dst", "STRING", "REQUIRED"),
+                    ("weight", "FLOAT64", "NULLABLE"),
+                ]
+            )
+        )
+
+        bq_to_record_batch_calls: list[dict] = []
+
+        class _FakeBqTableToRecordBatch(beam.PTransform):
+            def __init__(self, **kwargs):
+                super().__init__()
+                bq_to_record_batch_calls.append(kwargs)
+
+            # Never expanded because the pipeline is mocked, so the body is
+            # excluded from coverage.
+            def expand(self, pbegin):  # pragma: no cover
+                return pbegin
+
+        with patch(
+            "gigl.analytics.data_analyzer.feature_profiler.BqTableToRecordBatch",
+            _FakeBqTableToRecordBatch,
+        ):
+            profiler = FeatureProfiler()
+            _run_profile(
+                profiler, config=_make_config(), resource_config=self._resource_config
+            )
+
+        calls_by_table = {
+            kwargs["bq_table"]: kwargs for kwargs in bq_to_record_batch_calls
+        }
+        self.assertIn("p.d.users", calls_by_table)
+        self.assertIn("p.d.follows", calls_by_table)
+
+        node_config = calls_by_table["p.d.users"]["sharded_read_config"]
+        self.assertEqual(node_config.shard_key, "uid")
+        self.assertEqual(node_config.project_id, "test-project")
+        self.assertEqual(node_config.temp_dataset_name, "gigl_temp_assets")
+        self.assertEqual(node_config.num_shards, 20)
+
+        edge_config = calls_by_table["p.d.follows"]["sharded_read_config"]
+        self.assertEqual(edge_config.shard_key, "src")
+        self.assertEqual(edge_config.project_id, "test-project")
+        self.assertEqual(edge_config.temp_dataset_name, "gigl_temp_assets")
+        self.assertEqual(edge_config.num_shards, 20)
+
+
+class ChunkedProjectionTest(TestCase):
+    """Wide projections split into multiple per-chunk Dataflow pipelines.
+
+    Beam 2.56's runner-v2 cannot reliably iterate the per-key state that TFDV's
+    ``CombinePerKey(PreCombineFn)`` accumulates over very wide projections.
+    The profiler therefore chunks each table's projection into pieces of at most
+    ``max_features_per_chunk`` features and emits one ``_ProfileTask`` per chunk,
+    with ``chunk_NN/`` GCS artifact subdirs and chunk-aware Dataflow job names.
+    """
+
+    def setUp(self) -> None:
+        super().setUp()
+        self._resource_config = MagicMock(spec=GiglResourceConfigWrapper)
+        self._resource_config.project = "test-project"
+        self._resource_config.temp_assets_bq_dataset_name = "gigl_temp_assets"
+
+        self._gcp_options = MagicMock(name="GoogleCloudOptions")
+        self._gcp_options.job_name = "test-job-name"
+        self._gcp_options.project = "test-project"
+        self._gcp_options.region = "us-central1"
+
+        pipeline_options = MagicMock(name="PipelineOptions")
+        pipeline_options.view_as = MagicMock(return_value=self._gcp_options)
+
+        self._init_beam_pipeline_options = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.init_beam_pipeline_options",
+            return_value=pipeline_options,
+        ).start()
+
+        self._bq_utils_cls = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.BqUtils",
+        ).start()
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.return_value = {}
+
+        self._diagnostics_cls = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.EmbeddingDiagnostics",
+        ).start()
+        self._diagnostics_cls.return_value.analyze.return_value = {}
+
+        self._pipelines: list[MagicMock] = []
+
+        def _make_pipeline(*args, **kwargs):
+            pipeline = MagicMock(name="Pipeline")
+            pipeline_result = MagicMock(name="PipelineResult")
+            pipeline_result.wait_until_finish = MagicMock(return_value=None)
+            pipeline_result.job_id = MagicMock(return_value="test-job-id")
+            pipeline.run = MagicMock(return_value=pipeline_result)
+            self._pipelines.append(pipeline)
+            return pipeline
+
+        self._pipeline_ctor = patch(
+            "gigl.analytics.data_analyzer.feature_profiler.beam.Pipeline",
+            side_effect=_make_pipeline,
+        ).start()
+
+        self.addCleanup(patch.stopall)
+
+    def _config_with_chunk_cap(
+        self, max_features_per_chunk: int, output_gcs_path: str
+    ) -> DataAnalyzerConfig:
+        return DataAnalyzerConfig(
+            node_tables=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                    feature_columns=["age", "country", "city", "lang"],
+                )
+            ],
+            edge_tables=[],
+            output_gcs_path=output_gcs_path,
+            max_features_per_chunk=max_features_per_chunk,
+            compute_per_class_feature_stats=False,
+        )
+
+    def test_chunks_wide_projection_into_multiple_pipelines(self) -> None:
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+                ("city", "STRING", "NULLABLE"),
+                ("lang", "STRING", "NULLABLE"),
+            ]
+        )
+        with tempfile.TemporaryDirectory() as tmp:
+            config = self._config_with_chunk_cap(2, output_gcs_path=f"{tmp}/run/")
+            profiler = FeatureProfiler()
+            result = _run_profile(
+                profiler, config=config, resource_config=self._resource_config
+            )
+
+        # 4 explicit feature columns, cap of 2 → 2 chunks (2+2).
+        self.assertEqual(len(self._pipelines), 2)
+        self.assertEqual(len(result.facets_html_paths["node:user"]), 2)
+        self.assertEqual(len(result.stats_paths["node:user"]), 2)
+
+        suffixes = sorted(
+            call.kwargs["job_name_suffix"]
+            for call in self._init_beam_pipeline_options.call_args_list
+        )
+        # Multi-chunk runs append "-chunk-NN-of-MM" to disambiguate Dataflow jobs.
+        self.assertTrue(
+            all("-chunk-00-of-02" in s or "-chunk-01-of-02" in s for s in suffixes)
+        )
+
+        # Per-chunk GCS artifacts land under chunk_NN/.
+        sorted_facets = sorted(result.facets_html_paths["node:user"])
+        self.assertTrue(
+            sorted_facets[0].endswith(
+                "/feature_profiler/nodes/user/chunk_00/facets.html"
+            ),
+            f"got {sorted_facets[0]!r}",
+        )
+        self.assertTrue(
+            sorted_facets[1].endswith(
+                "/feature_profiler/nodes/user/chunk_01/facets.html"
+            ),
+            f"got {sorted_facets[1]!r}",
+        )
+
+    def test_single_chunk_path_is_flat(self) -> None:
+        """When the projection fits in one chunk, the historical flat path is preserved."""
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+                ("city", "STRING", "NULLABLE"),
+                ("lang", "STRING", "NULLABLE"),
+            ]
+        )
+        with tempfile.TemporaryDirectory() as tmp:
+            # Cap >> projection → exactly one chunk.
+            config = self._config_with_chunk_cap(100, output_gcs_path=f"{tmp}/run/")
+            profiler = FeatureProfiler()
+            result = _run_profile(
+                profiler, config=config, resource_config=self._resource_config
+            )
+
+        self.assertEqual(len(self._pipelines), 1)
+        paths = result.facets_html_paths["node:user"]
+        self.assertEqual(len(paths), 1)
+        self.assertTrue(
+            paths[0].endswith("/feature_profiler/nodes/user/facets.html"),
+            f"single-chunk path should stay flat; got {paths[0]!r}",
+        )
+        # No chunk suffix in the Dataflow job-name suffix for single-chunk runs.
+        suffix = self._init_beam_pipeline_options.call_args_list[0].kwargs[
+            "job_name_suffix"
+        ]
+        self.assertNotIn("-chunk-", suffix)
+
+    def test_slice_columns_force_included_in_every_chunk(self) -> None:
+        """Slice columns must appear in every chunk so TFDV slicing applies uniformly."""
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+                ("city", "STRING", "NULLABLE"),
+                ("lang", "STRING", "NULLABLE"),
+                ("node_label", "INT64", "NULLABLE"),
+            ]
+        )
+        bq_to_record_batch_calls: list[dict] = []
+
+        class _FakeBqTableToRecordBatch(beam.PTransform):
+            def __init__(self, **kwargs):
+                super().__init__()
+                bq_to_record_batch_calls.append(kwargs)
+
+            def expand(self, pbegin):  # pragma: no cover
+                return pbegin
+
+        with tempfile.TemporaryDirectory() as tmp:
+            config = DataAnalyzerConfig(
+                node_tables=[
+                    NodeTableSpec(
+                        bq_table="p.d.users",
+                        node_type="user",
+                        id_column="uid",
+                        feature_columns=["age", "country", "city", "lang"],
+                        label_column="node_label",
+                    )
+                ],
+                edge_tables=[],
+                output_gcs_path=f"{tmp}/run/",
+                max_features_per_chunk=3,  # 4 features + label slice → 2 chunks
+                compute_per_class_feature_stats=True,
+            )
+            with patch(
+                "gigl.analytics.data_analyzer.feature_profiler.BqTableToRecordBatch",
+                _FakeBqTableToRecordBatch,
+            ):
+                profiler = FeatureProfiler()
+                _run_profile(
+                    profiler, config=config, resource_config=self._resource_config
+                )
+
+        # Every chunk's projection should include node_label so TFDV slicing
+        # works on each chunk independently.
+        self.assertEqual(len(bq_to_record_batch_calls), 2)
+        for call in bq_to_record_batch_calls:
+            projected_names = {name for name, _ in call["projection"]}
+            self.assertIn("node_label", projected_names, msg=str(projected_names))
+
+    def test_embedding_diagnostics_runs_once_per_table_across_chunks(self) -> None:
+        """The embedding-diagnostics BQ aggregate runs once per table, not once per chunk."""
+        self._bq_utils_cls.return_value.fetch_bq_table_schema.return_value = _schema(
+            [
+                ("uid", "STRING", "REQUIRED"),
+                ("age", "INT64", "NULLABLE"),
+                ("country", "STRING", "NULLABLE"),
+                ("emb", "FLOAT64", "REPEATED"),
+            ]
+        )
+        with tempfile.TemporaryDirectory() as tmp:
+            config = DataAnalyzerConfig(
+                node_tables=[
+                    NodeTableSpec(
+                        bq_table="p.d.users",
+                        node_type="user",
+                        id_column="uid",
+                        feature_columns=[],  # auto-infer
+                    )
+                ],
+                edge_tables=[],
+                output_gcs_path=f"{tmp}/run/",
+                max_features_per_chunk=2,
+                compute_per_class_feature_stats=False,
+            )
+            profiler = FeatureProfiler()
+            _run_profile(profiler, config=config, resource_config=self._resource_config)
+
+        # Wide-enough table to chunk → multiple Dataflow pipelines launched.
+        self.assertGreater(len(self._pipelines), 1)
+        # ...but EmbeddingDiagnostics.analyze fires exactly once for the whole table.
+        self._diagnostics_cls.return_value.analyze.assert_called_once()
+        request_list = self._diagnostics_cls.return_value.analyze.call_args[0][0]
+        self.assertEqual(len(request_list), 1)
+        self.assertEqual(request_list[0].result_key, "node:user")
+        self.assertEqual(request_list[0].embedding_columns, ["emb"])
diff --git a/tests/unit/analytics/data_analyzer/graph_structure_analyzer_test.py b/tests/unit/analytics/data_analyzer/graph_structure_analyzer_test.py
new file mode 100644
index 000000000..a85a3913b
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/graph_structure_analyzer_test.py
@@ -0,0 +1,767 @@
+"""Unit tests for GraphStructureAnalyzer.
+
+All BQ calls are mocked via patching BqUtils. The goal is to exercise the
+orchestration logic (tier ordering, gating, result population) without hitting
+a real BigQuery backend.
+"""
+
+import json
+import tempfile
+from pathlib import Path
+from typing import Any, Optional
+from unittest.mock import MagicMock, patch
+
+from gigl.analytics.data_analyzer.config import (
+    DataAnalyzerConfig,
+    EdgeTableSpec,
+    NodeTableSpec,
+)
+from gigl.analytics.data_analyzer.graph_structure_analyzer import (
+    DataQualityError,
+    GraphStructureAnalyzer,
+)
+from tests.test_assets.test_case import TestCase
+
+
+def _make_config(
+    label_column: Optional[str] = None,
+    compute_reciprocity: bool = False,
+    extra_edge: bool = False,
+) -> DataAnalyzerConfig:
+    edge_tables = [
+        EdgeTableSpec(
+            bq_table="p.d.edges",
+            edge_type="follows",
+            src_id_column="src",
+            dst_id_column="dst",
+            src_node_type="user",
+            dst_node_type="user",
+        )
+    ]
+    if extra_edge:
+        edge_tables.append(
+            EdgeTableSpec(
+                bq_table="p.d.edges2",
+                edge_type="likes",
+                src_id_column="src",
+                dst_id_column="dst",
+                src_node_type="user",
+                dst_node_type="user",
+            )
+        )
+    return DataAnalyzerConfig(
+        node_tables=[
+            NodeTableSpec(
+                bq_table="p.d.nodes",
+                node_type="user",
+                id_column="uid",
+                feature_columns=["f1", "f2"],
+                label_column=label_column,
+            )
+        ],
+        edge_tables=edge_tables,
+        output_gcs_path="gs://bucket/out/",
+        fan_out=[15, 10],
+        compute_reciprocity=compute_reciprocity,
+    )
+
+
+def _make_heterogeneous_config() -> DataAnalyzerConfig:
+    """User -[viewed]-> content bipartite graph."""
+    return DataAnalyzerConfig(
+        node_tables=[
+            NodeTableSpec(
+                bq_table="p.d.users",
+                node_type="user",
+                id_column="uid",
+                feature_columns=["age"],
+            ),
+            NodeTableSpec(
+                bq_table="p.d.content",
+                node_type="content",
+                id_column="cid",
+                feature_columns=["topic"],
+            ),
+        ],
+        edge_tables=[
+            EdgeTableSpec(
+                bq_table="p.d.viewed",
+                edge_type="viewed",
+                src_id_column="user_id",
+                dst_id_column="content_id",
+                src_node_type="user",
+                dst_node_type="content",
+            )
+        ],
+        output_gcs_path="gs://bucket/out/",
+    )
+
+
+def _make_supervision_config(anchor_side: str = "src") -> DataAnalyzerConfig:
+    """User -[viewed]-> content with one supervision_pos + supervision_neg.
+
+    Args:
+        anchor_side: ``"src"`` to anchor on user (default), ``"dst"`` to anchor on content.
+    """
+    if anchor_side == "src":
+        node_anchor = "user"
+    elif anchor_side == "dst":
+        node_anchor = "content"
+    else:
+        raise ValueError(f"anchor_side={anchor_side!r} must be 'src' or 'dst'")
+    return DataAnalyzerConfig(
+        node_tables=[
+            NodeTableSpec(bq_table="p.d.users", node_type="user", id_column="uid"),
+            NodeTableSpec(bq_table="p.d.content", node_type="content", id_column="cid"),
+        ],
+        edge_tables=[
+            EdgeTableSpec(
+                bq_table="p.d.viewed",
+                edge_type="viewed",
+                role="message_passing",
+                src_id_column="user_id",
+                dst_id_column="content_id",
+                src_node_type="user",
+                dst_node_type="content",
+            ),
+            EdgeTableSpec(
+                bq_table="p.d.viewed_pos",
+                edge_type="viewed_pos",
+                role="supervision_pos",
+                node_anchor=node_anchor,
+                src_id_column="user_id",
+                dst_id_column="content_id",
+                src_node_type="user",
+                dst_node_type="content",
+            ),
+            EdgeTableSpec(
+                bq_table="p.d.viewed_neg",
+                edge_type="viewed_neg",
+                role="supervision_neg",
+                src_id_column="user_id",
+                dst_id_column="content_id",
+                src_node_type="user",
+                dst_node_type="content",
+            ),
+        ],
+        output_gcs_path="gs://bucket/out/",
+    )
+
+
+def _mock_row(data: dict[str, Any]) -> MagicMock:
+    """Mock a BigQuery Row supporting both key and attribute access."""
+    row = MagicMock()
+    keys = list(data.keys())
+    values = list(data.values())
+    row.__getitem__ = lambda self, key: (
+        data[key] if isinstance(key, str) else values[key]
+    )
+    row.keys = lambda: keys
+    row.values = lambda: values
+    for k, v in data.items():
+        setattr(row, k, v)
+    return row
+
+
+def _mock_row_iterator(rows: list[dict[str, Any]]) -> MagicMock:
+    """Mock a RowIterator yielding the given row dicts."""
+    mock = MagicMock()
+    mock.__iter__ = lambda self: iter([_mock_row(r) for r in rows])
+    return mock
+
+
+_DEFAULT_SUPERVISION_ROW = {
+    "driver_anchor_count": 1000,
+    "driver_pair_count": 5000,
+    "other_pair_count": 6000,
+    "overlap_pair_count": 0,
+    "driver_anchors_with_zero_other": 50,
+    "avg_other_per_driver_anchor": 4.5,
+    "p50_other_per_driver_anchor": 4,
+    "p90_other_per_driver_anchor": 12,
+    "p99_other_per_driver_anchor": 40,
+    "max_other_per_driver_anchor": 200,
+}
+
+
+def _default_row_for_query(query: str) -> dict[str, Any]:
+    """Return a reasonable 'zero violation, small graph' row for any query."""
+    q = query.lower()
+    if "driver_anchor_count" in q:
+        return dict(_DEFAULT_SUPERVISION_ROW)
+    if "dangling_count" in q:
+        return {"dangling_count": 0}
+    if "missing_src_count" in q:
+        return {"missing_src_count": 0, "missing_dst_count": 0}
+    if "duplicate_count" in q:
+        return {"duplicate_count": 0}
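+    # ``overlap_node_count`` contains ``node_count``; leave it to its own branch.
+    if (
+        "node_count" in q
+        and "overlap_node_count" not in q
+        and "distinct_src_count" not in q
+    ):
+        return {"node_count": 1000}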
+    if "edge_count" in q:
+        return {"edge_count": 5000}
+    if "self_loop_count" in q:
+        return {"self_loop_count": 0}
+    if "isolated_count" in q:
+        return {"isolated_count": 0}
+    if "min_degree" in q or "approx_quantiles" in q:
+        return {
+            "min_degree": 0,
+            "max_degree": 100,
+            "avg_degree": 5.0,
+            "percentiles": list(range(101)),
+        }
+    if "bucket_0_1" in q:
+        return {
+            "bucket_0_1": 10,
+            "bucket_2_10": 900,
+            "bucket_11_100": 80,
+            "bucket_101_1k": 10,
+            "bucket_1k_10k": 0,
+            "bucket_10k_plus": 0,
+        }
+    if "super_hub_count" in q:
+        return {"super_hub_count": 0}
+    if "cold_start_count" in q:
+        return {"cold_start_count": 50}
+    if "null_rate" in q:
+        # One ``<column>_null_rate`` key per feature/ID column used by the test
+        # configs. Extend this dict when adding new columns to those configs.
+        return {
+            "total_rows": 1000,
+            "f1_null_rate": 0.0,
+            "f2_null_rate": 0.01,
+            "uid_null_rate": 0.0,
+            "cid_null_rate": 0.0,
+            "oid_null_rate": 0.0,
+            "age_null_rate": 0.0,
+            "topic_null_rate": 0.0,
+            "is_active_null_rate": 0.0,
+        }
+    if "distinct_src_count" in q:
+        return {"distinct_src_count": 900, "distinct_dst_count": 950}
+    # NC supervision tier sentinel-aware label query: presence of
+    # ``valid_count`` is the unambiguous discriminator.
+    if "valid_count" in q:
+        row: dict[str, Any] = {
+            "total_rows": 1000,
+            "null_count": 50,
+            "valid_count": 950,
+        }
+        for idx in range(5):
+            row[f"sentinel_{idx}"] = 0
+        return row
+    # NC homophily query.
+    if "edge_homophily" in q:
+        return {
+            "edge_homophily": 0.0,
+            "expected_homophily": 0.0,
+            "edge_sample_count": 0,
+        }
+    # NC cross-split id-overlap query. ``overlap_node_count`` contains
+    # ``node_count`` as a substring, so the generic branch must not claim it.
+    if "overlap_node_count" in q:
+        return {"overlap_node_count": 0}
+    if "labeled" in q:
+        return {"total": 1000, "labeled": 800, "coverage": 0.8}
+    if "label" in q and "count" in q:
+        return {"label": 0, "count": 500}
+    # Fallback: one zero-valued scalar
+    return {"count": 0}
+
+
+def _default_rows_for_query(query: str) -> list[dict[str, Any]]:
+    q = query.lower()
+    if "order by degree desc" in q:
+        # Top-K hubs query returns multiple rows
+        return [
+            {"node_id": "u1", "degree": 500},
+            {"node_id": "u2", "degree": 400},
+        ]
+    if "group by " in q and "label" in q and "order by count" in q:
+        return [{"label": 0, "count": 600}, {"label": 1, "count": 400}]
+    # NC per-class degree query: emit the same shape graph_structure
+    # consumes, with one row per class. Distinguished by ``class_count``
+    # + ``cold_start_count`` columns. The six log-bucket counts feed the
+    # per-class sparkline column in the report.
+    if "class_count" in q and "cold_start_count" in q:
+        return [
+            {
+                "class_value": "0",
+                "class_count": 500,
+                "cold_start_count": 25,
+                "mean_degree": 5.0,
+                "percentiles": list(range(101)),
+                "max_degree": 100,
+                "bucket_0_1": 25,
+                "bucket_2_10": 400,
+                "bucket_11_100": 65,
+                "bucket_101_1k": 10,
+                "bucket_1k_10k": 0,
+                "bucket_10k_plus": 0,
+            },
+            {
+                "class_value": "1",
+                "class_count": 300,
+                "cold_start_count": 15,
+                "mean_degree": 6.0,
+                "percentiles": list(range(101)),
+                "max_degree": 110,
+                "bucket_0_1": 15,
+                "bucket_2_10": 230,
+                "bucket_11_100": 50,
+                "bucket_101_1k": 5,
+                "bucket_1k_10k": 0,
+                "bucket_10k_plus": 0,
+            },
+        ]
+    # NC split-value-counts query.
+    if "split_value" in q and "row_count" in q:
+        return [
+            {"split_value": "train", "row_count": 700},
+            {"split_value": "val", "row_count": 100},
+            {"split_value": "test", "row_count": 100},
+        ]
+    return [_default_row_for_query(query)]
+
+
+@patch("gigl.analytics.data_analyzer.graph_structure_analyzer.BqUtils")
+class GraphStructureAnalyzerTest(TestCase):
+    def test_tier1_passes_when_no_violations(self, mock_bq_cls: MagicMock) -> None:
+        """With zero dangling, zero duplicates, zero referential violations, Tier 1 passes."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config())
+        self.assertIsNotNone(result)
+        self.assertEqual(result.dangling_edge_counts["follows"], 0)
+        self.assertEqual(result.duplicate_node_counts["user"], 0)
+        self.assertEqual(result.node_counts["user"], 1000)
+
+    def test_dangling_edges_raises(self, mock_bq_cls: MagicMock) -> None:
+        """A nonzero dangling-edge count raises DataQualityError."""
+        mock_bq = mock_bq_cls.return_value
+
+        def _side_effect(query: str, labels: Optional[dict] = None) -> MagicMock:
+            if "dangling_count" in query:
+                return _mock_row_iterator([{"dangling_count": 42}])
+            return _mock_row_iterator(_default_rows_for_query(query))
+
+        mock_bq.run_query.side_effect = _side_effect
+        analyzer = GraphStructureAnalyzer()
+        with self.assertRaises(DataQualityError) as ctx:
+            analyzer.analyze(_make_config())
+        self.assertEqual(
+            ctx.exception.partial_result.dangling_edge_counts["follows"], 42
+        )
+
+    def test_duplicate_nodes_raises(self, mock_bq_cls: MagicMock) -> None:
+        """A nonzero duplicate-node count raises DataQualityError."""
+        mock_bq = mock_bq_cls.return_value
+
+        def _side_effect(query: str, labels: Optional[dict] = None) -> MagicMock:
+            q = query.lower()
+            # The duplicate_node query groups on id_column with HAVING COUNT(*) > 1.
+            if "duplicate_count" in q and "having count(*) > 1" in q and "uid" in q:
+                return _mock_row_iterator([{"duplicate_count": 5}])
+            return _mock_row_iterator(_default_rows_for_query(query))
+
+        mock_bq.run_query.side_effect = _side_effect
+        analyzer = GraphStructureAnalyzer()
+        with self.assertRaises(DataQualityError):
+            analyzer.analyze(_make_config())
+
+    def test_tier3_skipped_without_label(self, mock_bq_cls: MagicMock) -> None:
+        """Without label_column, class_imbalance and label_coverage dicts are empty."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config(label_column=None))
+        self.assertEqual(result.class_imbalance, {})
+        self.assertEqual(result.label_coverage, {})
+
+    def test_tier3_populated_with_label(self, mock_bq_cls: MagicMock) -> None:
+        """With label_column, class_imbalance and label_coverage are populated."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config(label_column="is_active"))
+        self.assertIn("user", result.class_imbalance)
+        self.assertIn("user", result.label_coverage)
+        self.assertAlmostEqual(result.label_coverage["user"], 0.8)
+
+    def test_tier4_skipped_when_flag_false(self, mock_bq_cls: MagicMock) -> None:
+        """When compute_reciprocity is False, the reciprocity dict is empty."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config(compute_reciprocity=False))
+        self.assertEqual(result.reciprocity, {})
+
+    def test_feature_memory_budget_computed(self, mock_bq_cls: MagicMock) -> None:
+        """feature_memory_bytes is computed from schema metadata in Python, not a BQ query."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config())
+        self.assertIn("user", result.feature_memory_bytes)
+        # 1000 nodes * 2 features * 8 bytes/float64 = 16000
+        self.assertEqual(result.feature_memory_bytes["user"], 1000 * 2 * 8)
+
+    def test_neighbor_explosion_populated(self, mock_bq_cls: MagicMock) -> None:
+        """With fan_out=[15, 10], each edge type gets a positive explosion estimate."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config())
+        self.assertIn("follows", result.neighbor_explosion_estimate)
+        self.assertGreater(result.neighbor_explosion_estimate["follows"], 0)
+
+    def test_edge_type_distribution_populated_for_multiple_edges(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """edge_type_distribution is populated when there are multiple edge types."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config(extra_edge=True))
+        self.assertIn("follows", result.edge_type_distribution)
+        self.assertIn("likes", result.edge_type_distribution)
+
+    def test_degree_stats_bucket_keys_match_report_bucket_order(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """Bucket keys must exactly match BUCKET_ORDER in report/charts.ai.js.
+
+        Regression test for C1: previously, Python emitted lowercase 'k' keys
+        (e.g., '101-1k') while the JS renderer expected uppercase 'K', causing
+        the three highest buckets to silently render as zero.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config())
+        self.assertIn("follows_out", result.degree_stats)
+        stats = result.degree_stats["follows_out"]
+        expected_bucket_keys = ["0-1", "2-10", "11-100", "101-1K", "1K-10K", "10K+"]
+        self.assertEqual(list(stats.buckets.keys()), expected_bucket_keys)
+
+    def test_cold_start_query_includes_both_src_and_dst_columns(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """Cold-start must be computed from total degree (src + dst).
+
+        Regression test for C2: previously only src-side edges were counted,
+        misclassifying pure-destination nodes as cold-start.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        analyzer.analyze(_make_config())
+        cold_start_queries = [
+            call.kwargs["query"]
+            for call in mock_bq.run_query.call_args_list
+            if "cold_start_count" in call.kwargs.get("query", "")
+        ]
+        self.assertGreaterEqual(len(cold_start_queries), 1)
+        for sql in cold_start_queries:
+            self.assertIn("src", sql)
+            self.assertIn("dst", sql)
+            self.assertIn("UNION ALL", sql)
+
+    def test_query_scalar_raises_on_empty_rows(self, mock_bq_cls: MagicMock) -> None:
+        """Scalar queries must fail loudly on unexpected empty results.
+
+        Regression test for I2: previously _query_scalar silently returned 0
+        when BQ returned no rows, hiding driver/auth/schema issues.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            []
+        )
+        analyzer = GraphStructureAnalyzer()
+        with self.assertRaises(RuntimeError) as ctx:
+            analyzer.analyze(_make_config())
+        self.assertIn("expected exactly 1 row", str(ctx.exception))
+
+    def test_writes_graph_structure_sidecar_on_success(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """analyze() writes a Pydantic JSON sidecar to output_gcs_path."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        with tempfile.TemporaryDirectory() as tmp:
+            config = _make_config()
+            config.output_gcs_path = tmp
+            analyzer = GraphStructureAnalyzer()
+            analyzer.analyze(config)
+
+            sidecar = Path(tmp) / "graph_structure.json"
+            self.assertTrue(sidecar.exists())
+            payload = json.loads(sidecar.read_text())
+            self.assertEqual(payload["schema_version"], "1")
+            self.assertEqual(payload["component"], "graph_structure")
+            self.assertIn("data", payload)
+            self.assertEqual(payload["data"]["node_counts"], {"user": 1000})
+
+    def test_writes_graph_structure_sidecar_on_tier1_failure(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """On DataQualityError, the partial_result is persisted to the sidecar."""
+        mock_bq = mock_bq_cls.return_value
+
+        def _side_effect(query: str, labels: Optional[dict] = None) -> MagicMock:
+            if "dangling_count" in query:
+                return _mock_row_iterator([{"dangling_count": 7}])
+            return _mock_row_iterator(_default_rows_for_query(query))
+
+        mock_bq.run_query.side_effect = _side_effect
+        with tempfile.TemporaryDirectory() as tmp:
+            config = _make_config()
+            config.output_gcs_path = tmp
+            analyzer = GraphStructureAnalyzer()
+            with self.assertRaises(DataQualityError):
+                analyzer.analyze(config)
+
+            sidecar = Path(tmp) / "graph_structure.json"
+            self.assertTrue(sidecar.exists())
+            payload = json.loads(sidecar.read_text())
+            self.assertEqual(payload["data"]["dangling_edge_counts"], {"follows": 7})
+
+    def test_supervision_cross_table_empty_when_no_pos_table(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """No supervision_pos / supervision_neg → list is empty, no extra queries."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config())
+        self.assertEqual(result.supervision_cross_table_stats, [])
+        cross_table_calls = [
+            call.kwargs["query"]
+            for call in mock_bq.run_query.call_args_list
+            if "driver_anchor_count" in call.kwargs.get("query", "")
+        ]
+        self.assertEqual(cross_table_calls, [])
+
+    def test_supervision_cross_table_pairs_pos_with_neg_and_mp(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """One pos + one neg + one mp → 3 jobs (pos×neg, pos×mp, neg×mp)."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        config = _make_supervision_config()
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(config)
+        # 3 pairs: pos×neg, pos×mp, neg×mp.
+        self.assertEqual(len(result.supervision_cross_table_stats), 3)
+        pairs = {
+            (s.driver_edge_type, s.other_edge_type)
+            for s in result.supervision_cross_table_stats
+        }
+        self.assertEqual(
+            pairs,
+            {
+                ("viewed_pos", "viewed_neg"),
+                ("viewed_pos", "viewed"),
+                ("viewed_neg", "viewed"),
+            },
+        )
+        for stats in result.supervision_cross_table_stats:
+            self.assertEqual(stats.node_anchor, "user")
+            self.assertEqual(stats.driver_anchor_count, 1000)
+            self.assertEqual(stats.avg_other_per_driver_anchor, 4.5)
+
+    def test_supervision_cross_table_skips_mismatched_node_types(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """A pos table is only paired with neg/mp tables that share its node types."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        config = _make_supervision_config()
+        config.node_tables.append(
+            NodeTableSpec(bq_table="p.d.other", node_type="other", id_column="oid")
+        )
+        config.edge_tables.append(
+            EdgeTableSpec(
+                bq_table="p.d.unrelated",
+                edge_type="unrelated",
+                role="message_passing",
+                src_id_column="src",
+                dst_id_column="dst",
+                src_node_type="other",
+                dst_node_type="other",
+            )
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(config)
+        edge_types_in_results = {
+            s.other_edge_type for s in result.supervision_cross_table_stats
+        }
+        self.assertNotIn("unrelated", edge_types_in_results)
+
+    def test_supervision_cross_table_overlap_flagged(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """When the cross-table query reports overlap > 0, the field is preserved."""
+        mock_bq = mock_bq_cls.return_value
+
+        def _side_effect(query: str, labels: Optional[dict] = None) -> MagicMock:
+            if "driver_anchor_count" in query:
+                row = dict(_DEFAULT_SUPERVISION_ROW)
+                row["overlap_pair_count"] = 17
+                return _mock_row_iterator([row])
+            return _mock_row_iterator(_default_rows_for_query(query))
+
+        mock_bq.run_query.side_effect = _side_effect
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_supervision_config())
+        for stats in result.supervision_cross_table_stats:
+            self.assertEqual(stats.overlap_pair_count, 17)
+
+    def test_supervision_cross_table_resolves_dst_anchor(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """When node_anchor == dst_node_type, query uses dst_id_column for the anchor."""
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        config = _make_supervision_config(anchor_side="dst")
+        analyzer = GraphStructureAnalyzer()
+        analyzer.analyze(config)
+        cross_table_queries = [
+            call.kwargs["query"]
+            for call in mock_bq.run_query.call_args_list
+            if "driver_anchor_count" in call.kwargs.get("query", "")
+        ]
+        self.assertGreaterEqual(len(cross_table_queries), 1)
+        for sql in cross_table_queries:
+            # Anchor column is content_id (dst side); neighbor column is user_id.
+            self.assertIn("content_id AS anchor", sql)
+            self.assertIn("user_id  AS neighbor", sql)
+
+    def test_queries_log_populated_across_sections(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """``result.queries`` carries one entry per BQ-backed report block.
+
+        The report renderer keys disclosures off this dict, so the analyzer
+        must populate it with the conventional ``<section>:<metric>:<scope>``
+        block IDs the JS expects.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_config(label_column="is_active"))
+
+        # At least one block_id from each major section should be recorded.
+        keys = result.queries.keys()
+        self.assertTrue(
+            any(k.startswith("data_quality:dangling_edges:") for k in keys),
+            f"missing data_quality block_ids; got {sorted(keys)}",
+        )
+        self.assertTrue(
+            any(k.startswith("graph_structure:degree:") for k in keys),
+            f"missing graph_structure block_ids; got {sorted(keys)}",
+        )
+        # The label_sentinel sub-block runs for any labeled node type even
+        # when no edge is explicitly tagged as message-passing — assert on
+        # the broader prefix so this stays robust across fixture variations.
+        self.assertTrue(
+            any(k.startswith("nc_supervision:") for k in keys),
+            f"missing nc_supervision block_ids; got {sorted(keys)}",
+        )
+        self.assertTrue(
+            any(k.startswith("advanced:class_imbalance:") for k in keys),
+            f"missing advanced block_ids; got {sorted(keys)}",
+        )
+
+        # Values are non-empty SQL strings.
+        for sql_list in result.queries.values():
+            self.assertGreater(len(sql_list), 0)
+            for sql in sql_list:
+                self.assertIsInstance(sql, str)
+                self.assertGreater(len(sql.strip()), 0)
+
+    def test_queries_log_persisted_on_tier1_failure(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """When Tier 1 fails, the partial result must still carry recorded queries.
+
+        The sidecar written for failed runs is the user's only debugging
+        artifact, so it has to retain the rendered SQL behind the failed
+        check.
+        """
+        mock_bq = mock_bq_cls.return_value
+
+        def _side_effect(query: str, labels: Optional[dict] = None) -> MagicMock:
+            if "dangling_count" in query:
+                return _mock_row_iterator([{"dangling_count": 7}])
+            return _mock_row_iterator(_default_rows_for_query(query))
+
+        mock_bq.run_query.side_effect = _side_effect
+        analyzer = GraphStructureAnalyzer()
+        with self.assertRaises(DataQualityError) as ctx:
+            analyzer.analyze(_make_config())
+        partial = ctx.exception.partial_result
+        self.assertTrue(
+            any(k.startswith("data_quality:dangling_edges:") for k in partial.queries),
+            f"missing dangling_edges block_id; got {sorted(partial.queries)}",
+        )
+
+    def test_heterogeneous_tier1_joins_correct_node_tables(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """For hetero edges, src and dst must join against their own node tables.
+
+        Regression test for I3: previously every edge table was joined against
+        node_tables[0] on both sides, producing false-positive missing_dst
+        violations for bipartite edges like user->content.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_heterogeneous_config())
+        self.assertEqual(result.referential_integrity_violations["viewed"], 0)
+        # Inspect the referential integrity query: src joins user_nodes, dst joins content_nodes.
+        ref_queries = [
+            call.kwargs["query"]
+            for call in mock_bq.run_query.call_args_list
+            if "missing_src_count" in call.kwargs.get("query", "")
+        ]
+        self.assertGreaterEqual(len(ref_queries), 1)
+        ref_sql = ref_queries[0]
+        self.assertIn("`p.d.users`", ref_sql)
+        self.assertIn("`p.d.content`", ref_sql)
+        self.assertIn("e.user_id = src_node.uid", ref_sql)
+        self.assertIn("e.content_id = dst_node.cid", ref_sql)
diff --git a/tests/unit/analytics/data_analyzer/node_classification_supervision_test.py b/tests/unit/analytics/data_analyzer/node_classification_supervision_test.py
new file mode 100644
index 000000000..9df0f7675
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/node_classification_supervision_test.py
@@ -0,0 +1,496 @@
+"""Unit tests for the NC supervision tier in GraphStructureAnalyzer.
+
+Mocks BqUtils to exercise the orchestration logic (label sentinel
+accounting, per-class degree, adjusted homophily, cross-split id-overlap
+hard fail) without hitting a real BigQuery backend.
+"""
+
+from typing import Any, Optional
+from unittest.mock import MagicMock, patch
+
+from gigl.analytics.data_analyzer.config import (
+    DataAnalyzerConfig,
+    EdgeTableSpec,
+    NodeTableSpec,
+)
+from gigl.analytics.data_analyzer.graph_structure_analyzer import (
+    DataQualityError,
+    GraphStructureAnalyzer,
+)
+from tests.test_assets.test_case import TestCase
+
+
+def _make_nc_config(
+    label_sentinel_values: Optional[list[str]] = None,
+    split_column: Optional[str] = None,
+    homophily_sample_cap: int = 0,
+) -> DataAnalyzerConfig:
+    return DataAnalyzerConfig(
+        node_tables=[
+            NodeTableSpec(
+                bq_table="p.d.users",
+                node_type="user",
+                id_column="uid",
+                feature_columns=["f1"],
+                label_column="node_label",
+                label_sentinel_values=label_sentinel_values or [],
+                split_column=split_column,
+            )
+        ],
+        edge_tables=[
+            EdgeTableSpec(
+                bq_table="p.d.edges",
+                edge_type="to",
+                role="message_passing",
+                src_id_column="src",
+                dst_id_column="dst",
+                src_node_type="user",
+                dst_node_type="user",
+            )
+        ],
+        output_gcs_path="gs://bucket/out/",
+        label_homophily_edge_sample_cap=homophily_sample_cap,
+    )
+
+
+def _mock_row(data: dict[str, Any]) -> MagicMock:
+    row = MagicMock()
+    row.__getitem__ = lambda self, key: data[key]
+    row.keys = lambda: list(data.keys())
+    for k, v in data.items():
+        setattr(row, k, v)
+    return row
+
+
+def _mock_row_iterator(rows: list[dict[str, Any]]) -> MagicMock:
+    mock = MagicMock()
+    mock.__iter__ = lambda self: iter([_mock_row(r) for r in rows])
+    return mock
+
+
+def _default_rows_for_query(
+    query: str,
+    sentinel_count: int = 0,
+    overlap_count: int = 0,
+    include_sentinel_class_row: bool = False,
+) -> list[dict[str, Any]]:
+    """Reasonable fixture rows for any query the analyzer issues.
+
+    Returns an NC-friendly fixture: two classes (600 and 200 nodes), mild
+    homophily, no cross-split overlap unless ``overlap_count > 0``.
+
+    When ``include_sentinel_class_row`` is True, the per-class degree
+    fixture includes an additional ``class_value="-1"`` row so tests can
+    exercise the sentinel-vs-class partition in
+    ``_compute_per_class_degree``.
+    """
+    q = query.lower()
+    if "dangling_count" in q:
+        return [{"dangling_count": 0}]
+    if "missing_src_count" in q:
+        return [{"missing_src_count": 0, "missing_dst_count": 0}]
+    if "duplicate_count" in q:
+        return [{"duplicate_count": 0}]
+    if "self_loop_count" in q:
+        return [{"self_loop_count": 0}]
+    if "isolated_count" in q:
+        return [{"isolated_count": 0}]
+    if "super_hub_count" in q:
+        return [{"super_hub_count": 0}]
+    if "cold_start_count" in q and "class_count" not in q:
+        return [{"cold_start_count": 50}]
+    # Per-class degree (the SQL now also carries bucket_* columns) must be
+    # matched before the standalone Tier-2 bucket branch below — both share
+    # ``bucket_0_1`` but only the per-class query has ``class_count``.
+    if "class_count" in q and "approx_quantiles" in q:
+        rows: list[dict[str, Any]] = [
+            {
+                "class_value": "0",
+                "class_count": 600,
+                "cold_start_count": 30,
+                "mean_degree": 5.0,
+                "percentiles": list(range(101)),
+                "max_degree": 100,
+                "bucket_0_1": 30,
+                "bucket_2_10": 500,
+                "bucket_11_100": 60,
+                "bucket_101_1k": 10,
+                "bucket_1k_10k": 0,
+                "bucket_10k_plus": 0,
+            },
+            {
+                "class_value": "1",
+                "class_count": 200,
+                "cold_start_count": 5,
+                "mean_degree": 7.0,
+                "percentiles": list(range(101)),
+                "max_degree": 120,
+                "bucket_0_1": 5,
+                "bucket_2_10": 150,
+                "bucket_11_100": 40,
+                "bucket_101_1k": 5,
+                "bucket_1k_10k": 0,
+                "bucket_10k_plus": 0,
+            },
+        ]
+        if include_sentinel_class_row:
+            # Mirrors a real "-1" sentinel row coming back from BQ alongside
+            # the valid classes; the parser must route it to
+            # ``sentinel_degree_stats`` (not ``per_class_degree``) when "-1"
+            # is declared in ``label_sentinel_values``. Concentrated
+            # cold-start with a long tail mirrors the typical
+            # missing-label pool.
+            rows.append(
+                {
+                    "class_value": "-1",
+                    "class_count": 50,
+                    "cold_start_count": 40,
+                    "mean_degree": 1.5,
+                    "percentiles": list(range(101)),
+                    "max_degree": 80,
+                    "bucket_0_1": 40,
+                    "bucket_2_10": 8,
+                    "bucket_11_100": 2,
+                    "bucket_101_1k": 0,
+                    "bucket_1k_10k": 0,
+                    "bucket_10k_plus": 0,
+                }
+            )
+        return rows
+    if "min_degree" in q or ("approx_quantiles" in q and "class_count" not in q):
+        # Tier 2 degree distribution.
+        if "edge_homophily" in q or "labeled_pairs" in q:
+            pass  # fall through — handled below
+        else:
+            return [
+                {
+                    "min_degree": 0,
+                    "max_degree": 100,
+                    "avg_degree": 5.0,
+                    "percentiles": list(range(101)),
+                }
+            ]
+    if "bucket_0_1" in q:
+        return [
+            {
+                "bucket_0_1": 10,
+                "bucket_2_10": 900,
+                "bucket_11_100": 80,
+                "bucket_101_1k": 10,
+                "bucket_1k_10k": 0,
+                "bucket_10k_plus": 0,
+            }
+        ]
+    if "null_rate" in q:
+        return [
+            {
+                "total_rows": 1000,
+                "f1_null_rate": 0.0,
+                "node_label_null_rate": 0.0,
+                "uid_null_rate": 0.0,
+            }
+        ]
+    if "distinct_src_count" in q:
+        return [{"distinct_src_count": 900, "distinct_dst_count": 950}]
+    # ``overlap_node_count`` is a substring of ``node_count`` so it must
+    # be matched before the generic node-count branch below.
+    if "overlap_node_count" in q:
+        return [{"overlap_node_count": overlap_count}]
+    if "node_count" in q:
+        return [{"node_count": 1000}]
+    if "edge_count" in q:
+        return [{"edge_count": 5000}]
+    # NC-specific queries below.
+    if "valid_count" in q and "null_count" in q:
+        # Sentinel-aware label query.
+        row = {
+            "total_rows": 1000,
+            "null_count": 100,
+            "valid_count": 800 - sentinel_count,
+        }
+        # Emit five sentinel_<idx> columns. Only sentinel_0 carries
+        # ``sentinel_count``; the remaining columns default to 0.
+        for idx in range(5):
+            row[f"sentinel_{idx}"] = sentinel_count if idx == 0 else 0
+        return [row]
+    if "edge_homophily" in q or "labeled_pairs" in q:
+        return [
+            {
+                "edge_homophily": 0.7,
+                "expected_homophily": 0.5,
+                "edge_sample_count": 4500,
+            }
+        ]
+    # ``overlap_node_count`` queries never reach this point: they are matched
+    # above, ahead of the generic ``node_count`` branch.
+    if "split_value" in q and "row_count" in q:
+        return [
+            {"split_value": "train", "row_count": 700},
+            {"split_value": "val", "row_count": 100},
+            {"split_value": "test", "row_count": 100},
+        ]
+    if "order by degree desc" in q:
+        return [
+            {"node_id": "u1", "degree": 500},
+            {"node_id": "u2", "degree": 400},
+        ]
+    if "label" in q and "count" in q and "order by count" in q:
+        return [{"label": 0, "count": 600}, {"label": 1, "count": 400}]
+    if "labeled" in q and "coverage" in q:
+        return [{"total": 1000, "labeled": 800, "coverage": 0.8}]
+    return [{"count": 0}]
+
+
+@patch("gigl.analytics.data_analyzer.graph_structure_analyzer.BqUtils")
+class NodeClassificationSupervisionTierTest(TestCase):
+    def test_skipped_when_no_label_column(self, mock_bq_cls: MagicMock) -> None:
+        config = DataAnalyzerConfig(
+            node_tables=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                )
+            ],
+            edge_tables=[
+                EdgeTableSpec(
+                    bq_table="p.d.edges",
+                    edge_type="to",
+                    role="message_passing",
+                    src_id_column="src",
+                    dst_id_column="dst",
+                    src_node_type="user",
+                    dst_node_type="user",
+                )
+            ],
+            output_gcs_path="gs://bucket/out/",
+        )
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(config)
+        self.assertEqual(result.node_classification_supervision_stats, [])
+
+    def test_populated_when_label_column_set(self, mock_bq_cls: MagicMock) -> None:
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_nc_config())
+        stats_list = result.node_classification_supervision_stats
+        self.assertEqual(len(stats_list), 1)
+        stats = stats_list[0]
+        self.assertEqual(stats.node_type, "user")
+        self.assertEqual(stats.label_column, "node_label")
+        # Sentinel accounting.
+        self.assertEqual(stats.sentinel_stats.total_rows, 1000)
+        self.assertEqual(stats.sentinel_stats.null_count, 100)
+        self.assertEqual(stats.sentinel_stats.valid_label_count, 800)
+        self.assertAlmostEqual(stats.sentinel_stats.valid_label_coverage, 0.8)
+        # Per-class degree.
+        self.assertEqual(len(stats.per_class_degree), 2)
+        self.assertEqual(stats.per_class_degree[0].class_value, "0")
+        self.assertEqual(stats.per_class_degree[0].count, 600)
+        # Per-class log buckets are populated for the report sparkline.
+        self.assertEqual(
+            sorted(stats.per_class_degree[0].buckets.keys()),
+            sorted(["0-1", "2-10", "11-100", "101-1K", "1K-10K", "10K+"]),
+        )
+        self.assertEqual(stats.per_class_degree[0].buckets["2-10"], 500)
+        self.assertEqual(stats.per_class_degree[1].buckets["11-100"], 40)
+        # Rendered SQL was captured for the report's per-class disclosure.
+        self.assertIn("nc_supervision:per_class_degree:user", result.queries)
+        nc_sql_list = result.queries["nc_supervision:per_class_degree:user"]
+        self.assertTrue(any("class_value" in q for q in nc_sql_list))
+        # Homophily.
+        self.assertEqual(len(stats.homophily), 1)
+        self.assertAlmostEqual(stats.homophily[0].edge_homophily, 0.7)
+        # adjusted = (0.7 - 0.5) / (1 - 0.5) = 0.4
+        self.assertAlmostEqual(stats.homophily[0].adjusted_homophily, 0.4)
+        # No split column → no cross_split_overlap.
+        self.assertIsNone(stats.cross_split_overlap)
+
+    def test_sentinel_counts_surface(self, mock_bq_cls: MagicMock) -> None:
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query, sentinel_count=42)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(
+            _make_nc_config(label_sentinel_values=["-1", "unknown"])
+        )
+        sentinel_stats = result.node_classification_supervision_stats[0].sentinel_stats
+        # First sentinel "-1" gets sentinel_0 = 42 from the fixture; the rest are 0.
+        self.assertEqual(sentinel_stats.sentinel_counts["-1"], 42)
+        self.assertEqual(sentinel_stats.sentinel_counts["unknown"], 0)
+
+    def test_sentinel_degree_stats_partitioned_from_per_class(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """Rows whose label matches a declared sentinel get routed to
+        ``sentinel_degree_stats``; valid-class rows stay in
+        ``per_class_degree``. Both use the same ``PerClassDegreeStats``
+        shape, so the sentinel pool's degree distribution can be read the
+        same way as that of a real class.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query, include_sentinel_class_row=True)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_nc_config(label_sentinel_values=["-1"]))
+        stats = result.node_classification_supervision_stats[0]
+
+        # Valid classes only — "-1" must NOT appear here.
+        self.assertEqual(len(stats.per_class_degree), 2)
+        self.assertEqual(
+            sorted(c.class_value for c in stats.per_class_degree), ["0", "1"]
+        )
+
+        # Sentinel "-1" lands in its own list with full degree distribution.
+        self.assertEqual(len(stats.sentinel_degree_stats), 1)
+        sentinel_row = stats.sentinel_degree_stats[0]
+        self.assertEqual(sentinel_row.class_value, "-1")
+        self.assertEqual(sentinel_row.count, 50)
+        self.assertEqual(sentinel_row.cold_start_count, 40)
+        self.assertAlmostEqual(sentinel_row.mean_degree, 1.5)
+        self.assertEqual(sentinel_row.max_degree, 80)
+        # Buckets carry the same log-bucket keys as per_class_degree so the
+        # report sparkline renders identically.
+        self.assertEqual(
+            sorted(sentinel_row.buckets.keys()),
+            sorted(["0-1", "2-10", "11-100", "101-1K", "1K-10K", "10K+"]),
+        )
+        self.assertEqual(sentinel_row.buckets["0-1"], 40)
+
+    def test_sentinel_degree_stats_empty_when_no_sentinels_declared(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        """When ``label_sentinel_values`` is empty, every non-NULL label row
+        is treated as a real class and ``sentinel_degree_stats`` stays empty
+        even if a "-1" row happens to come back from BQ.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query, include_sentinel_class_row=True)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_nc_config())
+        stats = result.node_classification_supervision_stats[0]
+        self.assertEqual(len(stats.sentinel_degree_stats), 0)
+        # "-1" appears as a regular class row when not declared as sentinel.
+        self.assertIn("-1", [c.class_value for c in stats.per_class_degree])
+
+    def test_split_column_no_overlap_passes(self, mock_bq_cls: MagicMock) -> None:
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query, overlap_count=0)
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(_make_nc_config(split_column="split"))
+        overlap = result.node_classification_supervision_stats[0].cross_split_overlap
+        self.assertIsNotNone(overlap)
+        assert overlap is not None
+        self.assertEqual(overlap.overlap_node_count, 0)
+        self.assertEqual(overlap.split_value_counts.get("train"), 700)
+
+    def test_split_column_overlap_raises(self, mock_bq_cls: MagicMock) -> None:
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query, overlap_count=17)
+        )
+        analyzer = GraphStructureAnalyzer()
+        with self.assertRaises(DataQualityError) as ctx:
+            analyzer.analyze(_make_nc_config(split_column="split"))
+        partial = ctx.exception.partial_result
+        self.assertEqual(len(partial.node_classification_supervision_stats), 1)
+        overlap = partial.node_classification_supervision_stats[0].cross_split_overlap
+        assert overlap is not None
+        self.assertEqual(overlap.overlap_node_count, 17)
+
+    def test_homophily_query_includes_sample_filter_when_capped(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        mock_bq = mock_bq_cls.return_value
+
+        def _side_effect(query: str, labels: Optional[dict] = None) -> MagicMock:
+            q = query.lower()
+            if "edge_count" in q and "duplicate_count" not in q:
+                # Force edge_count above the cap so sampling activates.
+                return _mock_row_iterator([{"edge_count": 200_000}])
+            return _mock_row_iterator(_default_rows_for_query(query))
+
+        mock_bq.run_query.side_effect = _side_effect
+        analyzer = GraphStructureAnalyzer()
+        analyzer.analyze(_make_nc_config(homophily_sample_cap=10_000))
+
+        homophily_queries = [
+            call.kwargs["query"]
+            for call in mock_bq.run_query.call_args_list
+            if "edge_homophily" in call.kwargs.get("query", "")
+        ]
+        self.assertGreaterEqual(len(homophily_queries), 1)
+        # When sample_cap > 0 and edge_count > cap, the query carries
+        # a MOD(ABS(FARM_FINGERPRINT(...))) sample filter.
+        for sql in homophily_queries:
+            self.assertIn("FARM_FINGERPRINT", sql)
+            self.assertIn("MOD(", sql)
+
+    def test_homophily_skips_sampling_when_below_cap(
+        self, mock_bq_cls: MagicMock
+    ) -> None:
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        analyzer = GraphStructureAnalyzer()
+        analyzer.analyze(_make_nc_config(homophily_sample_cap=1_000_000))
+        homophily_queries = [
+            call.kwargs["query"]
+            for call in mock_bq.run_query.call_args_list
+            if "edge_homophily" in call.kwargs.get("query", "")
+        ]
+        self.assertGreaterEqual(len(homophily_queries), 1)
+        # The edge_count fixture is 5000, below the 1M cap, so no sampling occurs.
+        for sql in homophily_queries:
+            self.assertNotIn("FARM_FINGERPRINT", sql)
+
+    def test_existing_lp_path_byte_compatible(self, mock_bq_cls: MagicMock) -> None:
+        """An LP-only config (no label_column) leaves the new tier empty.
+
+        Sanity check that the NC tier does not accidentally fire on
+        configs that should remain LP-only. The supervision_cross_table
+        path remains driven by edge ``role`` and is exercised by
+        ``graph_structure_analyzer_test.py``.
+        """
+        mock_bq = mock_bq_cls.return_value
+        mock_bq.run_query.side_effect = lambda query, labels=None: _mock_row_iterator(
+            _default_rows_for_query(query)
+        )
+        config = DataAnalyzerConfig(
+            node_tables=[
+                NodeTableSpec(
+                    bq_table="p.d.users",
+                    node_type="user",
+                    id_column="uid",
+                )
+            ],
+            edge_tables=[
+                EdgeTableSpec(
+                    bq_table="p.d.edges",
+                    edge_type="to",
+                    role="message_passing",
+                    src_id_column="src",
+                    dst_id_column="dst",
+                    src_node_type="user",
+                    dst_node_type="user",
+                )
+            ],
+            output_gcs_path="gs://bucket/out/",
+        )
+        analyzer = GraphStructureAnalyzer()
+        result = analyzer.analyze(config)
+        self.assertEqual(result.node_classification_supervision_stats, [])
diff --git a/tests/unit/analytics/data_analyzer/queries_test.py b/tests/unit/analytics/data_analyzer/queries_test.py
new file mode 100644
index 000000000..e4af16cae
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/queries_test.py
@@ -0,0 +1,224 @@
+from gigl.analytics.data_analyzer.queries import (
+    COLD_START_NODE_COUNT_QUERY,
+    DANGLING_EDGES_QUERY,
+    DEGREE_BUCKET_QUERY,
+    DEGREE_DISTRIBUTION_QUERY,
+    DUPLICATE_NODE_COUNT_QUERY,
+    EDGE_REFERENTIAL_INTEGRITY_QUERY,
+    NODE_COUNT_QUERY,
+    SUPER_HUB_INT16_CLAMP_QUERY,
+    SUPERVISION_CROSS_TABLE_QUERY,
+    TOP_K_HUBS_QUERY,
+    build_null_rates_query,
+    build_per_class_degree_query,
+)
+from tests.test_assets.test_case import TestCase
+
+NODE_TABLE = "project.dataset.user_nodes"
+EDGE_TABLE = "project.dataset.user_edges"
+
+
+class NodeCountQueryTest(TestCase):
+    def test_contains_table_name(self) -> None:
+        sql = NODE_COUNT_QUERY.format(table=NODE_TABLE)
+        self.assertIn(f"`{NODE_TABLE}`", sql)
+        self.assertIn("COUNT(*)", sql)
+
+
+class DanglingEdgesQueryTest(TestCase):
+    def test_contains_null_checks(self) -> None:
+        sql = DANGLING_EDGES_QUERY.format(
+            table=EDGE_TABLE, src_id_column="src_uid", dst_id_column="dst_uid"
+        )
+        self.assertIn("src_uid IS NULL", sql)
+        self.assertIn("dst_uid IS NULL", sql)
+        self.assertIn(f"`{EDGE_TABLE}`", sql)
+
+
+class EdgeReferentialIntegrityQueryTest(TestCase):
+    def test_contains_left_join_homogeneous(self) -> None:
+        """Homogeneous case: src and dst resolve to the same node table."""
+        sql = EDGE_REFERENTIAL_INTEGRITY_QUERY.format(
+            edge_table=EDGE_TABLE,
+            src_node_table=NODE_TABLE,
+            dst_node_table=NODE_TABLE,
+            src_id_column="src_uid",
+            dst_id_column="dst_uid",
+            src_node_id_column="user_id",
+            dst_node_id_column="user_id",
+        )
+        self.assertIn("LEFT JOIN", sql)
+        self.assertIn(f"`{NODE_TABLE}`", sql)
+        self.assertIn(f"`{EDGE_TABLE}`", sql)
+        self.assertIn("IS NULL", sql)
+
+    def test_contains_left_join_heterogeneous(self) -> None:
+        """Heterogeneous case: src and dst resolve to different node tables.
+
+        Regression test for I3: previously the query took a single node_table
+        and always joined both sides against it, producing false-positive
+        missing_dst violations on bipartite graphs.
+        """
+        user_table = "project.dataset.user_nodes"
+        content_table = "project.dataset.content_nodes"
+        sql = EDGE_REFERENTIAL_INTEGRITY_QUERY.format(
+            edge_table=EDGE_TABLE,
+            src_node_table=user_table,
+            dst_node_table=content_table,
+            src_id_column="user_id",
+            dst_id_column="content_id",
+            src_node_id_column="uid",
+            dst_node_id_column="cid",
+        )
+        self.assertIn(f"`{user_table}`", sql)
+        self.assertIn(f"`{content_table}`", sql)
+        self.assertIn("e.user_id = src_node.uid", sql)
+        self.assertIn("e.content_id = dst_node.cid", sql)
+
+
+class DuplicateNodeCountQueryTest(TestCase):
+    def test_contains_group_by_having(self) -> None:
+        sql = DUPLICATE_NODE_COUNT_QUERY.format(table=NODE_TABLE, id_column="user_id")
+        self.assertIn("GROUP BY", sql)
+        self.assertIn("HAVING", sql)
+        self.assertIn("user_id", sql)
+
+
+class DegreeDistributionQueryTest(TestCase):
+    def test_contains_approx_quantiles(self) -> None:
+        sql = DEGREE_DISTRIBUTION_QUERY.format(table=EDGE_TABLE, id_column="src_uid")
+        self.assertIn("APPROX_QUANTILES", sql)
+        self.assertIn("src_uid", sql)
+
+
+class DegreeBucketQueryTest(TestCase):
+    def test_contains_countif_buckets(self) -> None:
+        sql = DEGREE_BUCKET_QUERY.format(table=EDGE_TABLE, id_column="src_uid")
+        self.assertIn("COUNTIF", sql)
+        self.assertIn("src_uid", sql)
+
+
+class NullRatesQueryTest(TestCase):
+    def test_batches_multiple_columns(self) -> None:
+        sql = build_null_rates_query(
+            table=NODE_TABLE, columns=["age", "country", "embedding"]
+        )
+        self.assertIn(f"`{NODE_TABLE}`", sql)
+        self.assertEqual(sql.count("COUNTIF"), 3)
+        self.assertIn("age", sql)
+        self.assertIn("country", sql)
+        self.assertIn("embedding", sql)
+
+
+class SuperHubInt16ClampQueryTest(TestCase):
+    def test_contains_32767_threshold(self) -> None:
+        sql = SUPER_HUB_INT16_CLAMP_QUERY.format(table=EDGE_TABLE, id_column="src_uid")
+        self.assertIn("32767", sql)
+
+
+class TopKHubsQueryTest(TestCase):
+    def test_contains_limit(self) -> None:
+        sql = TOP_K_HUBS_QUERY.format(table=EDGE_TABLE, id_column="src_uid", k=20)
+        self.assertIn("LIMIT 20", sql)
+        self.assertIn("ORDER BY", sql)
+        self.assertIn("DESC", sql)
+
+
+class ColdStartNodeCountQueryTest(TestCase):
+    def test_unions_src_and_dst_columns(self) -> None:
+        """Cold-start is a property of total degree, not out-degree alone.
+
+        Regression test for C2: previously the query only counted src-side
+        edges, which misclassified pure-destination node types (e.g., content
+        receiving likes) as cold-start regardless of in-degree.
+        """
+        sql = COLD_START_NODE_COUNT_QUERY.format(
+            node_table=NODE_TABLE,
+            edge_table=EDGE_TABLE,
+            node_id_column="user_id",
+            src_id_column="src_uid",
+            dst_id_column="dst_uid",
+        )
+        self.assertIn("src_uid", sql)
+        self.assertIn("dst_uid", sql)
+        self.assertIn("UNION ALL", sql)
+        self.assertIn(f"`{NODE_TABLE}`", sql)
+        self.assertIn(f"`{EDGE_TABLE}`", sql)
+
+
+class PerClassDegreeQueryTest(TestCase):
+    def test_emits_six_log_buckets_for_sparkline(self) -> None:
+        """Per-class query carries the same log-bucket counts as the overall
+        degree query so the report can render a per-class sparkline next to
+        each row using the existing histogram helper.
+        """
+        sql = build_per_class_degree_query(
+            node_table=NODE_TABLE,
+            node_id_column="user_id",
+            label_column="label",
+            edge_table=EDGE_TABLE,
+            edge_src_column="src_uid",
+            edge_dst_column="dst_uid",
+        )
+        for column in [
+            "bucket_0_1",
+            "bucket_2_10",
+            "bucket_11_100",
+            "bucket_101_1k",
+            "bucket_1k_10k",
+            "bucket_10k_plus",
+        ]:
+            self.assertIn(column, sql)
+        # And the existing summary projection is unchanged.
+        self.assertIn("class_value", sql)
+        self.assertIn("APPROX_QUANTILES(degree, 100)", sql)
+        self.assertIn("MAX(degree) AS max_degree", sql)
+        self.assertIn("GROUP BY class_value", sql)
+
+    def test_does_not_filter_sentinel_values(self) -> None:
+        """Sentinel-labeled rows must surface as their own ``class_value`` rows so
+        the caller can compute a degree distribution for them. The query no
+        longer filters by ``label_sentinel_values``; partitioning happens in
+        Python after the rows come back.
+        """
+        sql = build_per_class_degree_query(
+            node_table=NODE_TABLE,
+            node_id_column="user_id",
+            label_column="label",
+            edge_table=EDGE_TABLE,
+            edge_src_column="src_uid",
+            edge_dst_column="dst_uid",
+        )
+        self.assertNotIn("NOT IN", sql)
+        self.assertNotIn("'-1'", sql)
+
+
+class SupervisionCrossTableQueryTest(TestCase):
+    def test_query_substitutes_all_table_and_column_placeholders(self) -> None:
+        sql = SUPERVISION_CROSS_TABLE_QUERY.format(
+            driver_table="project.dataset.pos_edges",
+            other_table="project.dataset.neg_edges",
+            driver_anchor_column="user_id",
+            driver_other_column="content_id",
+            other_anchor_column="user_id",
+            other_other_column="content_id",
+        )
+        self.assertIn("`project.dataset.pos_edges`", sql)
+        self.assertIn("`project.dataset.neg_edges`", sql)
+        self.assertIn("user_id AS anchor", sql)
+        self.assertIn("content_id  AS neighbor", sql)
+        # All 10 returned columns must appear in the projection.
+        for column in [
+            "driver_anchor_count",
+            "driver_pair_count",
+            "other_pair_count",
+            "overlap_pair_count",
+            "driver_anchors_with_zero_other",
+            "avg_other_per_driver_anchor",
+            "p50_other_per_driver_anchor",
+            "p90_other_per_driver_anchor",
+            "p99_other_per_driver_anchor",
+            "max_other_per_driver_anchor",
+        ]:
+            self.assertIn(column, sql)
+        self.assertIn("INNER JOIN", sql)
diff --git a/tests/unit/analytics/data_analyzer/report/__init__.py b/tests/unit/analytics/data_analyzer/report/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/unit/analytics/data_analyzer/report/report_generator_test.py b/tests/unit/analytics/data_analyzer/report/report_generator_test.py
new file mode 100644
index 000000000..2e7710cfd
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/report/report_generator_test.py
@@ -0,0 +1,117 @@
+from pathlib import Path
+
+from gigl.analytics.data_analyzer.report.report_generator import generate_report
+from gigl.analytics.data_analyzer.types import DegreeStats, GraphAnalysisResult
+from tests.test_assets.test_case import TestCase
+
+GOLDEN_REPORT_PATH = (
+    Path(__file__).parents[4] / "test_assets" / "analytics" / "golden_report.html"
+)
+
+
+def _make_test_result() -> GraphAnalysisResult:
+    """Deterministic test data for snapshot testing."""
+    return GraphAnalysisResult(
+        duplicate_node_counts={"user": 0},
+        dangling_edge_counts={"follows": 0},
+        referential_integrity_violations={"follows": 0},
+        node_counts={"user": 1000000},
+        edge_counts={"follows": 5000000},
+        null_rates={"p.d.nodes": {"age": 0.05, "country": 0.12}},
+        duplicate_edge_counts={"follows": 150},
+        self_loop_counts={"follows": 0},
+        isolated_node_counts={"user": 8000},
+        degree_stats={
+            "follows_out": DegreeStats(
+                min=0,
+                max=50000,
+                mean=10.0,
+                median=5,
+                p90=25,
+                p99=200,
+                p999=5000,
+                percentiles=list(range(101)),
+                buckets={
+                    "0-1": 100000,
+                    "2-10": 600000,
+                    "11-100": 250000,
+                    "101-1K": 45000,
+                    "1K-10K": 4500,
+                    "10K+": 500,
+                },
+            )
+        },
+        top_hubs={"follows_out": [("hub_1", 50000), ("hub_2", 35000)]},
+        super_hub_int16_clamp_count={"follows_out": 2},
+        cold_start_node_counts={"user": 100000},
+        feature_memory_bytes={"user": 8000000000},
+        neighbor_explosion_estimate={"follows": 75000},
+    )
+
+
+class ReportGeneratorStructuralTest(TestCase):
+    """Structural assertions on the generated HTML."""
+
+    def test_output_is_non_empty_html(self) -> None:
+        html = generate_report(
+            analysis_result=_make_test_result(),
+            profile_result=None,
+        )
+        self.assertIsInstance(html, str)
+        self.assertGreater(len(html), 1000)
+        self.assertIn("<html", html)
+        self.assertIn("GiGL Data Analysis Report", html)
+
+    def test_placeholders_all_replaced(self) -> None:
+        html = generate_report(
+            analysis_result=_make_test_result(),
+            profile_result=None,
+        )
+        # None of the injection placeholders should remain in the output.
+        self.assertNotIn("/* INJECT_STYLES */", html)
+        self.assertNotIn("/* INJECT_SCRIPTS */", html)
+        self.assertNotIn("/* INJECT_ANALYSIS_DATA */", html)
+        self.assertNotIn("/* INJECT_PROFILE_DATA */", html)
+
+    def test_injected_data_present(self) -> None:
+        html = generate_report(
+            analysis_result=_make_test_result(),
+            profile_result=None,
+        )
+        # The JSON data lives inside a hidden script tag.
+        self.assertIn('"node_counts"', html)
+        # Matches both the int (1000000) and str ("1000000") serializations.
+        self.assertIn("1000000", html)
+
+    def test_empty_profile_serializes_as_empty_object(self) -> None:
+        html = generate_report(
+            analysis_result=_make_test_result(),
+            profile_result=None,
+        )
+        # profile_result=None injects an empty JSON object; check its element is emitted.
+        self.assertIn('id="profile-data"', html)
+
+
+class ReportGeneratorSnapshotTest(TestCase):
+    """Golden-file snapshot test to catch structural regressions."""
+
+    def test_snapshot_matches_golden(self) -> None:
+        html = generate_report(
+            analysis_result=_make_test_result(),
+            profile_result=None,
+        )
+        if not GOLDEN_REPORT_PATH.exists():
+            self.fail(
+                f"Golden file missing: {GOLDEN_REPORT_PATH}. "
+                "Create it by writing the current output of generate_report "
+                "with _make_test_result() as input."
+            )
+        golden = GOLDEN_REPORT_PATH.read_text()
+        self.assertEqual(
+            html,
+            golden,
+            msg=(
+                "HTML output changed. If this is intentional, regenerate the "
+                f"golden file at {GOLDEN_REPORT_PATH}."
+            ),
+        )
diff --git a/tests/unit/analytics/data_analyzer/types_test.py b/tests/unit/analytics/data_analyzer/types_test.py
new file mode 100644
index 000000000..cacfc5974
--- /dev/null
+++ b/tests/unit/analytics/data_analyzer/types_test.py
@@ -0,0 +1,351 @@
+"""Unit tests for Pydantic result types and artifact IO."""
+import json
+import tempfile
+from datetime import datetime, timezone
+from pathlib import Path
+
+from pydantic import ValidationError
+
+from gigl.analytics.data_analyzer.types import (
+    SCHEMA_VERSION,
+    DegreeStats,
+    EmbeddingDiagnosticsResult,
+    FeatureProfileArtifact,
+    FeatureProfileError,
+    FeatureProfileResult,
+    GraphAnalysisResult,
+    GraphStructureArtifact,
+    LabelSentinelStats,
+    NodeClassificationSupervisionStats,
+    PerClassDegreeStats,
+    SupervisionCrossTableStats,
+    TopKEntry,
+    load_artifact,
+    write_artifact,
+)
+from tests.test_assets.test_case import TestCase
+
+
+class SchemaVersionTest(TestCase):
+    def test_schema_version_is_one(self) -> None:
+        self.assertEqual(SCHEMA_VERSION, "1")
+
+
+class ResultRoundtripTest(TestCase):
+    def test_graph_analysis_result_roundtrip(self) -> None:
+        original = GraphAnalysisResult(
+            node_counts={"user": 1000},
+            edge_counts={"follows": 5000},
+            degree_stats={
+                "follows_out": DegreeStats(
+                    min=0,
+                    max=100,
+                    mean=5.0,
+                    median=3,
+                    p90=20,
+                    p99=50,
+                    p999=80,
+                    percentiles=list(range(101)),
+                    buckets={"0-1": 100},
+                )
+            },
+            top_hubs={"follows_out": [("u1", 50), ("u2", 40)]},
+        )
+        serialized = original.model_dump_json()
+        rehydrated = GraphAnalysisResult.model_validate_json(serialized)
+        self.assertEqual(rehydrated, original)
+
+    def test_feature_profile_errors_roundtrip(self) -> None:
+        original = FeatureProfileResult(
+            errors=[
+                FeatureProfileError(
+                    result_key="node:user",
+                    bq_table="p.d.users",
+                    stage="schema_fetch",
+                    message="permission denied",
+                ),
+                FeatureProfileError(
+                    result_key="edge:follows",
+                    bq_table="p.d.follows",
+                    stage="dataflow",
+                    message="RuntimeError: boom",
+                ),
+            ],
+        )
+        serialized = original.model_dump_json()
+        rehydrated = FeatureProfileResult.model_validate_json(serialized)
+        self.assertEqual(rehydrated, original)
+
+    def test_feature_profile_result_roundtrip(self) -> None:
+        original = FeatureProfileResult(
+            facets_html_paths={"node:user": ["gs://b/facets.html"]},
+            stats_paths={"node:user": ["gs://b/stats.tfrecord"]},
+            embedding_diagnostics={
+                "node:user": {
+                    "emb": EmbeddingDiagnosticsResult(
+                        total=100,
+                        unique_count=98,
+                        unique_ratio=0.98,
+                        top_k=[TopKEntry(hash=42, count=2, fraction=0.02)],
+                    )
+                }
+            },
+        )
+        serialized = original.model_dump_json()
+        rehydrated = FeatureProfileResult.model_validate_json(serialized)
+        self.assertEqual(rehydrated, original)
+
+    def test_supervision_cross_table_stats_roundtrip(self) -> None:
+        original = GraphAnalysisResult(
+            supervision_cross_table_stats=[
+                SupervisionCrossTableStats(
+                    driver_edge_type="viewed_pos",
+                    driver_role="supervision_pos",
+                    other_edge_type="viewed_neg",
+                    other_role="supervision_neg",
+                    node_anchor="user",
+                    driver_anchor_count=1000,
+                    driver_pair_count=5000,
+                    other_pair_count=6000,
+                    overlap_pair_count=3,
+                    driver_anchors_with_zero_other=50,
+                    avg_other_per_driver_anchor=4.5,
+                    p50_other_per_driver_anchor=4,
+                    p90_other_per_driver_anchor=12,
+                    p99_other_per_driver_anchor=40,
+                    max_other_per_driver_anchor=200,
+                )
+            ],
+        )
+        serialized = original.model_dump_json()
+        rehydrated = GraphAnalysisResult.model_validate_json(serialized)
+        self.assertEqual(rehydrated, original)
+
+    def test_per_class_degree_buckets_roundtrip(self) -> None:
+        """Per-class buckets, sentinel degree stats, and the queries log all
+        survive JSON round-trip.
+        """
+        original = GraphAnalysisResult(
+            node_classification_supervision_stats=[
+                NodeClassificationSupervisionStats(
+                    node_type="user",
+                    label_column="label",
+                    sentinel_stats=LabelSentinelStats(
+                        total_rows=10,
+                        null_count=0,
+                        valid_label_count=10,
+                        valid_label_coverage=1.0,
+                    ),
+                    per_class_degree=[
+                        PerClassDegreeStats(
+                            class_value="0",
+                            count=600,
+                            cold_start_count=30,
+                            mean_degree=5.0,
+                            median_degree=4,
+                            p90_degree=20,
+                            p99_degree=80,
+                            max_degree=100,
+                            buckets={
+                                "0-1": 30,
+                                "2-10": 500,
+                                "11-100": 60,
+                                "101-1K": 10,
+                                "1K-10K": 0,
+                                "10K+": 0,
+                            },
+                        )
+                    ],
+                    sentinel_degree_stats=[
+                        PerClassDegreeStats(
+                            class_value="-1",
+                            count=50,
+                            cold_start_count=40,
+                            mean_degree=1.5,
+                            median_degree=1,
+                            p90_degree=5,
+                            p99_degree=40,
+                            max_degree=80,
+                            buckets={
+                                "0-1": 40,
+                                "2-10": 8,
+                                "11-100": 2,
+                                "101-1K": 0,
+                                "1K-10K": 0,
+                                "10K+": 0,
+                            },
+                        )
+                    ],
+                )
+            ],
+            queries={
+                "nc_supervision:per_class_degree:user": ["SELECT 1"],
+                "graph_structure:degree:follows_out": ["SELECT 2", "SELECT 3"],
+            },
+        )
+        serialized = original.model_dump_json()
+        rehydrated = GraphAnalysisResult.model_validate_json(serialized)
+        self.assertEqual(rehydrated, original)
+        self.assertEqual(
+            rehydrated.queries["graph_structure:degree:follows_out"],
+            ["SELECT 2", "SELECT 3"],
+        )
+        self.assertEqual(
+            rehydrated.node_classification_supervision_stats[0]
+            .sentinel_degree_stats[0]
+            .class_value,
+            "-1",
+        )
+
+    def test_extra_fields_are_rejected(self) -> None:
+        with self.assertRaises(ValidationError):
+            GraphAnalysisResult.model_validate(
+                {"node_counts": {"user": 1}, "unknown_field": 42}
+            )
+
+
+class WriteArtifactTest(TestCase):
+    def test_writes_versioned_envelope_locally(self) -> None:
+        result = GraphAnalysisResult(node_counts={"user": 1})
+        with tempfile.TemporaryDirectory() as tmp:
+            path = write_artifact(
+                result=result,
+                component="graph_structure",
+                output_gcs_path=tmp,
+            )
+            self.assertTrue(path.endswith("/graph_structure.json"))
+            payload = json.loads(Path(path).read_text())
+            self.assertEqual(payload["schema_version"], "1")
+            self.assertEqual(payload["component"], "graph_structure")
+            self.assertIn("generated_at", payload)
+            self.assertEqual(payload["data"]["node_counts"], {"user": 1})
+
+    def test_feature_profile_component_writes_correct_name(self) -> None:
+        result = FeatureProfileResult(
+            facets_html_paths={"node:user": ["x"]},
+        )
+        with tempfile.TemporaryDirectory() as tmp:
+            path = write_artifact(
+                result=result,
+                component="feature_profile",
+                output_gcs_path=tmp,
+            )
+            self.assertTrue(path.endswith("/feature_profile.json"))
+
+    def test_type_mismatch_raises(self) -> None:
+        with tempfile.TemporaryDirectory() as tmp:
+            with self.assertRaises(TypeError):
+                write_artifact(
+                    result=FeatureProfileResult(),  # wrong component pairing
+                    component="graph_structure",
+                    output_gcs_path=tmp,
+                )
+
+    def test_trailing_slash_normalized(self) -> None:
+        result = GraphAnalysisResult()
+        with tempfile.TemporaryDirectory() as tmp:
+            path = write_artifact(
+                result=result,
+                component="graph_structure",
+                output_gcs_path=tmp + "/",
+            )
+            # The final path should be `tmp/graph_structure.json`, not
+            # `tmp//graph_structure.json`.
+            self.assertNotIn("//", path.replace("file://", ""))
+
+    def test_creates_parent_directory_if_missing(self) -> None:
+        result = GraphAnalysisResult()
+        with tempfile.TemporaryDirectory() as tmp:
+            nested = Path(tmp) / "nested" / "dir"
+            path = write_artifact(
+                result=result,
+                component="graph_structure",
+                output_gcs_path=str(nested),
+            )
+            self.assertTrue(Path(path).exists())
+
+
+class LoadArtifactTest(TestCase):
+    def test_round_trip_via_write_then_load(self) -> None:
+        original = GraphAnalysisResult(node_counts={"user": 1000})
+        with tempfile.TemporaryDirectory() as tmp:
+            path = write_artifact(
+                result=original,
+                component="graph_structure",
+                output_gcs_path=tmp,
+            )
+            loaded = load_artifact(path, expected_component="graph_structure")
+        self.assertEqual(loaded, original)
+
+    def test_feature_profile_round_trip(self) -> None:
+        original = FeatureProfileResult(
+            facets_html_paths={"node:user": ["gs://b/facets.html"]},
+            embedding_diagnostics={
+                "node:user": {
+                    "emb": EmbeddingDiagnosticsResult(
+                        total=100,
+                        unique_count=100,
+                        unique_ratio=1.0,
+                    )
+                }
+            },
+        )
+        with tempfile.TemporaryDirectory() as tmp:
+            path = write_artifact(
+                result=original,
+                component="feature_profile",
+                output_gcs_path=tmp,
+            )
+            loaded = load_artifact(path, expected_component="feature_profile")
+        self.assertEqual(loaded, original)
+
+    def test_load_mismatched_component_raises(self) -> None:
+        result = FeatureProfileResult()
+        with tempfile.TemporaryDirectory() as tmp:
+            path = write_artifact(
+                result=result,
+                component="feature_profile",
+                output_gcs_path=tmp,
+            )
+            with self.assertRaises(ValidationError):
+                # Loader expected graph_structure but file is feature_profile.
+                load_artifact(path, expected_component="graph_structure")
+
+
+class EnvelopeValidationTest(TestCase):
+    def test_schema_version_literal_is_enforced(self) -> None:
+        now = datetime.now(timezone.utc)
+        envelope = GraphStructureArtifact(generated_at=now, data=GraphAnalysisResult())
+        self.assertEqual(envelope.schema_version, "1")
+        self.assertEqual(envelope.component, "graph_structure")
+
+    def test_wrong_component_discriminator_is_rejected(self) -> None:
+        # GraphStructureArtifact has component=Literal["graph_structure"]; a
+        # JSON blob with component="feature_profile" must fail validation.
+        payload = {
+            "schema_version": "1",
+            "component": "feature_profile",
+            "generated_at": datetime.now(timezone.utc).isoformat(),
+            "data": {},
+        }
+        with self.assertRaises(ValidationError):
+            GraphStructureArtifact.model_validate(payload)
+
+    def test_feature_profile_envelope_carries_embedding_diagnostics(self) -> None:
+        envelope = FeatureProfileArtifact(
+            generated_at=datetime.now(timezone.utc),
+            data=FeatureProfileResult(
+                embedding_diagnostics={
+                    "node:user": {
+                        "emb": EmbeddingDiagnosticsResult(
+                            total=1,
+                            unique_count=1,
+                            unique_ratio=1.0,
+                        )
+                    }
+                }
+            ),
+        )
+        serialized = envelope.model_dump_json()
+        rehydrated = FeatureProfileArtifact.model_validate_json(serialized)
+        self.assertEqual(rehydrated, envelope)
diff --git a/tests/unit/common/beam/__init__.py b/tests/unit/common/beam/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/unit/common/beam/tfdv_transforms_test.py b/tests/unit/common/beam/tfdv_transforms_test.py
new file mode 100644
index 000000000..48214eb6c
--- /dev/null
+++ b/tests/unit/common/beam/tfdv_transforms_test.py
@@ -0,0 +1,246 @@
+"""Unit tests for the shared TFDV/Beam PTransforms."""
+import os
+import tempfile
+from pathlib import Path
+from unittest.mock import patch
+
+import apache_beam as beam
+import pyarrow as pa
+from apache_beam.io.gcp.bigquery import BigQueryQueryPriority
+from apache_beam.testing.util import assert_that, equal_to
+
+from gigl.common import LocalUri
+from gigl.common.beam.sharded_read import BigQueryShardedReadConfig
+from gigl.common.beam.tfdv_transforms import (
+    BqTableToRecordBatch,
+    GenerateAndVisualizeStats,
+)
+from tests.test_assets.test_case import TestCase
+
+
+class BqTableToRecordBatchTest(TestCase):
+    def test_raises_on_empty_projection(self) -> None:
+        with self.assertRaises(ValueError):
+            BqTableToRecordBatch(bq_table="p.d.t", projection=[])
+
+    def test_query_renders_expr_as_name_for_each_entry(self) -> None:
+        transform = BqTableToRecordBatch(
+            bq_table="proj.ds.users",
+            projection=[("age", "`age`"), ("country", "`country`")],
+        )
+        captured_kwargs: dict = {}
+
+        def _fake_read(**kwargs):
+            captured_kwargs.update(kwargs)
+            return beam.Create([{"age": 1, "country": "US"}])
+
+        with patch(
+            "gigl.common.beam.tfdv_transforms.beam.io.ReadFromBigQuery",
+            side_effect=_fake_read,
+        ):
+            with beam.Pipeline() as p:
+                _ = p | transform
+
+        self.assertEqual(
+            captured_kwargs["query"],
+            "SELECT `age` AS `age`, `country` AS `country` FROM `proj.ds.users`",
+        )
+        self.assertTrue(captured_kwargs["use_standard_sql"])
+        self.assertNotIn("project", captured_kwargs)
+
+    def test_query_supports_derived_expressions(self) -> None:
+        transform = BqTableToRecordBatch(
+            bq_table="proj.ds.users",
+            projection=[
+                ("emb_len", "ARRAY_LENGTH(`emb`)"),
+                ("country", "`country`"),
+            ],
+        )
+        captured_kwargs: dict = {}
+
+        def _fake_read(**kwargs):
+            captured_kwargs.update(kwargs)
+            return beam.Create([{"emb_len": 64, "country": "US"}])
+
+        with patch(
+            "gigl.common.beam.tfdv_transforms.beam.io.ReadFromBigQuery",
+            side_effect=_fake_read,
+        ):
+            with beam.Pipeline() as p:
+                _ = p | transform
+
+        self.assertEqual(
+            captured_kwargs["query"],
+            "SELECT ARRAY_LENGTH(`emb`) AS `emb_len`, `country` AS `country` "
+            "FROM `proj.ds.users`",
+        )
+
+    def test_passes_bq_project_when_given(self) -> None:
+        transform = BqTableToRecordBatch(
+            bq_table="proj.ds.users",
+            projection=[("age", "`age`")],
+            bq_project="billing-project",
+        )
+        captured_kwargs: dict = {}
+
+        def _fake_read(**kwargs):
+            captured_kwargs.update(kwargs)
+            return beam.Create([{"age": 1}])
+
+        with patch(
+            "gigl.common.beam.tfdv_transforms.beam.io.ReadFromBigQuery",
+            side_effect=_fake_read,
+        ):
+            with beam.Pipeline() as p:
+                _ = p | transform
+
+        self.assertEqual(captured_kwargs["project"], "billing-project")
+
+    def test_raises_on_non_positive_num_shards(self) -> None:
+        with self.assertRaises(ValueError):
+            BqTableToRecordBatch(
+                bq_table="p.d.t",
+                projection=[("age", "`age`")],
+                sharded_read_config=BigQueryShardedReadConfig(
+                    shard_key="uid",
+                    project_id="proj",
+                    temp_dataset_name="temp",
+                    num_shards=0,
+                ),
+            )
+
+    def test_sharded_read_emits_one_query_per_shard(self) -> None:
+        """Each shard fans out into its own BQ read with a FARM_FINGERPRINT WHERE clause.
+
+        Mirrors :class:`gigl.common.beam.sharded_read.ShardedExportRead`'s
+        contract: same hashing scheme, ``method=EXPORT``,
+        ``query_priority=INTERACTIVE``, and ``temp_dataset``.
+        """
+        captured: list[dict] = []
+
+        def _fake_read(**kwargs):
+            captured.append(kwargs)
+            return beam.Create([{"age": 1, "country": "US"}])
+
+        config = BigQueryShardedReadConfig(
+            shard_key="uid",
+            project_id="proj",
+            temp_dataset_name="temp_ds",
+            num_shards=3,
+        )
+        transform = BqTableToRecordBatch(
+            bq_table="proj.ds.users",
+            projection=[("age", "`age`"), ("country", "`country`")],
+            sharded_read_config=config,
+        )
+
+        with patch(
+            "gigl.common.beam.tfdv_transforms.beam.io.ReadFromBigQuery",
+            side_effect=_fake_read,
+        ):
+            with beam.Pipeline() as p:
+                _ = p | transform
+
+        self.assertEqual(len(captured), 3)
+        for i, kwargs in enumerate(captured):
+            self.assertIn(
+                f"WHERE ABS(MOD(FARM_FINGERPRINT(CAST(uid AS STRING)), 3)) = {i}",
+                kwargs["query"],
+            )
+            self.assertIn(
+                "SELECT `age` AS `age`, `country` AS `country` FROM `proj.ds.users`",
+                kwargs["query"],
+            )
+            self.assertTrue(kwargs["use_standard_sql"])
+            self.assertEqual(kwargs["method"], "EXPORT")
+            self.assertEqual(
+                kwargs["query_priority"], BigQueryQueryPriority.INTERACTIVE
+            )
+            self.assertEqual(kwargs["temp_dataset"].projectId, "proj")
+            self.assertEqual(kwargs["temp_dataset"].datasetId, "temp_ds")
+
+    def test_emits_record_batches_with_list_typed_columns(self) -> None:
+        rows = [
+            {"age": 30, "country": "US"},
+            {"age": 25, "country": "CA"},
+            {"age": None, "country": "US"},
+        ]
+
+        def _fake_read(**kwargs):
+            return beam.Create(rows)
+
+        def _extract(batch: pa.RecordBatch) -> tuple:
+            age_type = batch.schema.field("age").type
+            country_type = batch.schema.field("country").type
+            return (
+                batch.num_rows,
+                tuple(sorted(batch.schema.names)),
+                pa.types.is_list(age_type),
+                pa.types.is_list(country_type),
+                tuple(batch.column("age").to_pylist()),
+                tuple(batch.column("country").to_pylist()),
+            )
+
+        with patch(
+            "gigl.common.beam.tfdv_transforms.beam.io.ReadFromBigQuery",
+            side_effect=_fake_read,
+        ):
+            with beam.Pipeline() as p:
+                batches = p | BqTableToRecordBatch(
+                    bq_table="p.d.t",
+                    projection=[("age", "`age`"), ("country", "`country`")],
+                    batch_size=10,
+                )
+                summaries = batches | "Summarize batch" >> beam.Map(_extract)
+                assert_that(
+                    summaries,
+                    equal_to(
+                        [
+                            (
+                                3,
+                                ("age", "country"),
+                                True,
+                                True,
+                                ([30], [25], None),
+                                (["US"], ["CA"], ["US"]),
+                            )
+                        ]
+                    ),
+                )
+
+
+class GenerateAndVisualizeStatsTest(TestCase):
+    def test_runs_and_writes_artifacts(self) -> None:
+        """Smoke test: runs the PTransform on a tiny in-memory RecordBatch and
+        verifies that both the Facets HTML and the stats TFRecord are written.
+        """
+        batch = pa.RecordBatch.from_pydict(
+            {
+                "age": pa.array([[30], [25], [40]], type=pa.list_(pa.int64())),
+                "country": pa.array(
+                    [["US"], ["CA"], ["US"]], type=pa.list_(pa.string())
+                ),
+            }
+        )
+        with tempfile.TemporaryDirectory() as tmpdir:
+            facets_path = os.path.join(tmpdir, "facets.html")
+            stats_path = os.path.join(tmpdir, "stats.tfrecord")
+            with beam.Pipeline() as p:
+                _ = (
+                    p
+                    | "Create a single record batch" >> beam.Create([batch])
+                    | GenerateAndVisualizeStats(
+                        facets_report_uri=LocalUri(facets_path),
+                        stats_output_uri=LocalUri(stats_path),
+                    )
+                )
+
+            self.assertTrue(
+                Path(facets_path).exists(),
+                f"Facets HTML not written at {facets_path}",
+            )
+            self.assertGreater(Path(facets_path).stat().st_size, 0)
+            written = list(Path(tmpdir).glob("stats.tfrecord*"))
+            self.assertTrue(
+                written, f"No stats TFRecord written under prefix {stats_path}"
+            )