
java(dsm): fix flaky Test_Dsm_Manual_Checkpoint_Inter_Process by waiting for DSM data #6579

Draft
robcarlan-datadog wants to merge 2 commits into main from fix/dsm-manual-checkpoint-flaky-wait


@robcarlan-datadog

Summary

  • Fix the root cause of flaky Test_Dsm_Manual_Checkpoint_Inter_Process for Java spring-boot (DSMON-1257)
  • Add wait_for() to DsmHelper.assert_checkpoint_presence so it waits up to 30s for matching DSM data before asserting
  • Remove the flaky marking from manifests/java.yml

Root Cause

The Python Flask weblog explicitly calls flush_dsm_checkpoints() after every manual checkpoint operation, forcing the tracer to send DSM stats to the agent immediately. The Java Spring Boot weblog has no equivalent flush — it relies on the Java tracer's periodic flush interval (~10s).

assert_checkpoint_presence did a single pass through collected agent data with no retry. When the Java tracer hadn't flushed yet, the correct consumer checkpoint wasn't in the agent data and the assertion failed.
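
The race can be illustrated with a small model. This is only a sketch: `single_pass_find`, `is_consumer_checkpoint`, and the `direction` field are hypothetical names chosen for illustration and are not taken from the system-tests code.

```python
# Hypothetical model of the race; none of these names come from system-tests.

def single_pass_find(collected, predicate):
    """Scan the data collected so far exactly once, with no retry."""
    return any(predicate(item) for item in collected)


def is_consumer_checkpoint(item):
    # Assumption: a consumer checkpoint is marked with direction "in".
    return item.get("direction") == "in"


# Only the producer checkpoint has reached the agent so far.
collected = [{"direction": "out"}]
found_before_flush = single_pass_find(collected, is_consumer_checkpoint)

# The tracer's periodic flush delivers the consumer checkpoint later.
collected.append({"direction": "in"})
found_after_flush = single_pass_find(collected, is_consumer_checkpoint)
```

In this model the same scan fails or succeeds depending only on whether it runs before or after the flush, which matches the intermittent failure described above.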

Fix

Use the existing interfaces.agent.wait_for() event-driven mechanism (already used by wait_for_remote_config_request, wait_for_client_side_stats_payload, etc.) to wait for matching DSM data to arrive:

  1. Check existing data first — returns immediately if checkpoint is already there (no regression for fast tracers like Python)
  2. If not found, listen for new incoming agent data up to 30s timeout
  3. Final verification pass with detailed logging
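
The three steps above can be sketched as follows. This is a minimal, self-contained illustration, not the actual change: `FakeAgentInterface`, `append_data`, and `snapshot` are hypothetical, and the `wait_for(predicate, timeout)` signature is an assumption rather than the real `interfaces.agent.wait_for()` API.

```python
import threading
import time


class FakeAgentInterface:
    """Hypothetical stand-in for the agent interface; not the system-tests API."""

    def __init__(self):
        self._data = []
        self._cond = threading.Condition()

    def append_data(self, item):
        """Record newly received data and wake up any waiter."""
        with self._cond:
            self._data.append(item)
            self._cond.notify_all()

    def snapshot(self):
        """Return a copy of everything collected so far."""
        with self._cond:
            return list(self._data)

    def wait_for(self, predicate, timeout):
        """Return True as soon as an item matches, False once the timeout expires."""
        deadline = time.monotonic() + timeout
        seen = 0
        with self._cond:
            while True:
                # Steps 1 and 2: check existing data first, then only new arrivals.
                for item in self._data[seen:]:
                    if predicate(item):
                        return True
                seen = len(self._data)
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)


def assert_checkpoint_presence(agent, predicate, timeout=30.0):
    """Wait for matching data, then verify with a final pass (step 3)."""
    agent.wait_for(predicate, timeout)
    if not any(predicate(item) for item in agent.snapshot()):
        raise ValueError("Checkpoint has not been found, please have a look in logs")
```

Because existing data is checked before any waiting happens, a tracer that has already flushed pays no extra delay; only a late flush consumes part of the timeout.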

Test plan

  • CI passes for Java spring-boot INTEGRATIONS scenario
  • Retrigger pipeline multiple times to confirm flake is resolved
  • Other DSM tests unaffected (wait_for returns immediately when data is already present)

🤖 Generated with Claude Code

…ing for DSM data

The Java tracer does not flush DSM stats immediately (unlike Python which
calls flush_dsm_checkpoints()), relying instead on periodic flush. The
assert_checkpoint_presence helper did a single pass through collected data
with no retry, so it would miss checkpoints that hadn't been flushed yet.

Add wait_for() to assert_checkpoint_presence so it waits up to 30s for
matching DSM data to arrive before asserting. This uses the existing
event-driven wait mechanism in ProxyBasedInterfaceValidator. Remove the
flaky marking from manifests/java.yml (DSMON-1257).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@robcarlan-datadog robcarlan-datadog requested review from a team as code owners March 24, 2026 20:30
@robcarlan-datadog robcarlan-datadog marked this pull request as draft March 24, 2026 20:31
@github-actions

CODEOWNERS have been resolved as:

manifests/java.yml                                                      @DataDog/asm-java @DataDog/apm-java
tests/integrations/test_dsm.py                                          @DataDog/system-tests-core

@datadog-prod-us1-4

datadog-prod-us1-4 bot commented Mar 24, 2026

⚠️ Tests


⚠️ Warnings

🧪 7 Tests failed

tests.integrations.test_dsm.Test_Dsm_Manual_Checkpoint_Inter_Process.test_dsm_manual_checkpoint_inter_process[spring-boot] from system_tests_suite
ValueError: Checkpoint has not been found, please have a look in logs

self = <tests.integrations.test_dsm.Test_Dsm_Manual_Checkpoint_Inter_Process object at 0x7fbffc5b2960>

    def test_dsm_manual_checkpoint_inter_process(self):
        assert self.produce_threaded.status_code == 200
        assert self.produce_threaded.text == "ok"
        assert "dd-pathway-ctx-base64" in self.produce_threaded.headers
    
        assert self.consume_threaded.status_code == 200
...
tests.integrations.test_service_overrides.Test_SqlServiceNameSource.test_sql_srv_src[chi] from system_tests_suite
AssertionError: Expected at least one SQL span to have _dd.svc_src set
assert False

self = <tests.integrations.test_service_overrides.Test_SqlServiceNameSource object at 0x7f133adf5eb0>

    def test_sql_srv_src(self):
        assert self.r.status_code == 200
    
        srv_src_found = False
        for _, _, span in interfaces.library.get_spans(request=self.r, full_trace=True):
...
tests.integrations.test_service_overrides.Test_SqlServiceNameSource.test_sql_srv_src[echo] from system_tests_suite
AssertionError: Expected at least one SQL span to have _dd.svc_src set
assert False

self = <tests.integrations.test_service_overrides.Test_SqlServiceNameSource object at 0x7f4207bf7050>

    def test_sql_srv_src(self):
        assert self.r.status_code == 200
    
        srv_src_found = False
        for _, _, span in interfaces.library.get_spans(request=self.r, full_trace=True):
...

ℹ️ Info

No other issues found

❄️ No new flaky tests detected

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 5d0db53
