[Performance Analysis] Adding intra-kernel timing runs by SergioMartin86 · Pull Request #829 · hw-native-sys/simpler

SergioMartin86 · 2026-05-20T10:03:42Z

We want to add the ability to run a task multiple times inside the same kernel launch. This is essential for precise timing and performance evaluation of both orchestration and scheduling.

We add:

Warmup runs: used to discard cache initialization/dlopen/kernel launch noise.
Timed runs: these are actually timed, and an average + stddev is reported.

By running multiple timed runs, we average out the OS/device noise that causes random variations in running time. This noise is significant for these extremely low-latency kernels, so to measure scheduling/orchestration performance precisely we need a statistical analysis over many samples taken inside the same kernel launch.
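The warmup/timed-run scheme described above can be sketched as follows. This is a minimal host-side illustration, not the function added in this PR: `TimingStats`, `summarize`, and `time_task` are hypothetical names, and `std::chrono::steady_clock` stands in for whatever device timer the kernel actually uses.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: mean and sample standard deviation in microseconds.
struct TimingStats {
    double mean_us;
    double stddev_us;
};

// Computes the mean and the sample standard deviation (n - 1 denominator).
inline TimingStats summarize(const std::vector<double>& samples_us) {
    if (samples_us.empty()) return {0.0, 0.0};
    const double n = static_cast<double>(samples_us.size());
    double sum = 0.0;
    for (double s : samples_us) sum += s;
    const double mean = sum / n;
    double sq_dev = 0.0;
    for (double s : samples_us) sq_dev += (s - mean) * (s - mean);
    const double stddev = samples_us.size() > 1 ? std::sqrt(sq_dev / (n - 1.0)) : 0.0;
    return {mean, stddev};
}

// Runs `task` warmup_runs times untimed, then timed_runs times with timing.
template <typename Task>
TimingStats time_task(Task&& task, std::size_t warmup_runs, std::size_t timed_runs) {
    for (std::size_t i = 0; i < warmup_runs; ++i) task();  // results discarded

    std::vector<double> samples_us;
    samples_us.reserve(timed_runs);
    for (std::size_t i = 0; i < timed_runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        task();
        const auto t1 = std::chrono::steady_clock::now();
        samples_us.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return summarize(samples_us);
}
```

A call such as `time_task(run_once, 10, 100)` would discard 10 runs and report statistics over the next 100, where `run_once` is a placeholder for whatever re-runs the task.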

Current blocker:

We are trying (and failing) to reset the SchedulerContext back to its initial state so that it can be re-run in the same kernel. We tried:

deinit(runtime); init(runtime);

But this results in the test failing and a 10x increase in running time.

Relevant Change:

See https://github.com/hw-native-sys/simpler/pull/829/changes#diff-f1bd1d412c7f0c6e99f4f11c3830d67582037fbbd6ef3a981c34edb244f9a849R761 for the main timing function we added.

We appreciate help figuring out how to reset the scheduler context to cleanly re-run the pypto task.

…. This is important for accurate timing

gemini-code-assist

Code Review

This pull request introduces a performance timing framework for AICPU kernels, enabling warmup and timed execution iterations configurable via environment variables. The changes include a new two-phase barrier for thread synchronization, the use of thread-local storage for thread indexing, and enhanced logging. Feedback highlights several critical issues: an operator precedence bug in the thread completion logic that prevents proper cleanup, thread-safety violations when calling initialization routines concurrently, and a break in binary compatibility due to field insertion in the Runtime class. Additionally, improvements are suggested for memory ordering in the barrier, robustness in environment variable parsing, and correcting a log message typo.
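To make the memory-ordering concern concrete, here is a minimal sketch of a reusable spin barrier with explicit acquire/release ordering. It is an assumption-laden illustration and not the barrier from this PR: `SpinBarrier`, `arrive_and_wait`, and `completed_phases` are hypothetical names, and the PR's two-phase design may differ.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical reusable spin barrier (not the PR's implementation).
class SpinBarrier {
public:
    explicit SpinBarrier(std::uint32_t num_threads) : num_threads_(num_threads) {}

    void arrive_and_wait() {
        // Remember the phase we are arriving in before announcing arrival.
        const std::uint32_t phase = phase_.load(std::memory_order_acquire);
        // acq_rel: publish this thread's earlier writes and observe those of
        // threads that arrived before it.
        if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == num_threads_) {
            // Last arriver: reset the counter, then open the next phase.
            arrived_.store(0, std::memory_order_relaxed);
            phase_.store(phase + 1, std::memory_order_release);
        } else {
            // Waiters spin until the phase changes; the acquire load pairs
            // with the release store above.
            while (phase_.load(std::memory_order_acquire) == phase) {
            }
        }
    }

    // Number of barrier rounds completed so far.
    std::uint32_t completed_phases() const { return phase_.load(std::memory_order_acquire); }

private:
    const std::uint32_t num_threads_;
    std::atomic<std::uint32_t> arrived_{0};
    std::atomic<std::uint32_t> phase_{0};
};
```

The counter reset is ordered before the release store of the new phase, so a thread that has observed the new phase also sees the counter at zero when it re-enters the barrier.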

gemini-code-assist · 2026-05-20T10:05:37Z

+            std::string env_timing_iterations_string = std::string(env_timing_iterations);
+            bool isValidValue = false;
+            if (env_timing_iterations_string == "True") { runtime->is_timing_enabled = true; isValidValue = true; }
+            if (env_timing_iterations_string == "False") { runtime->is_timing_enabled = false; isValidValue = true; }
+            if (isValidValue == false) 
+            {
+                LOG_WARN("PTO2_KERNEL_TIMING_ENABLED=%s is invalid, using default: \"False\"", env_timing_iterations);
+                runtime->is_timing_enabled = false;
+            }
+        }


The environment variable parsing for PTO2_KERNEL_TIMING_ENABLED is brittle as it only accepts exact case-sensitive matches for 'True' or 'False'. It would be more robust to support a wider range of boolean representations (e.g., '1', '0', 'true', 'false', 'on', 'off') and perform case-insensitive comparisons.
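One way to implement this suggestion is sketched below. The helper names `parse_bool` and `env_flag` are hypothetical and not part of the PR; the accepted spellings are the ones listed in the comment above, compared case-insensitively.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: parses common boolean spellings, case-insensitively.
// Returns true and sets `out` on success; returns false for unknown input.
inline bool parse_bool(std::string value, bool& out) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (value == "1" || value == "true" || value == "on") {
        out = true;
        return true;
    }
    if (value == "0" || value == "false" || value == "off") {
        out = false;
        return true;
    }
    return false;
}

// Hypothetical helper: reads a boolean flag from the environment, falling
// back to `default_value` when the variable is unset or unrecognized.
inline bool env_flag(const char* name, bool default_value) {
    const char* raw = std::getenv(name);
    if (raw == nullptr) return default_value;
    bool parsed = default_value;
    return parse_bool(raw, parsed) ? parsed : default_value;
}
```

With helpers like these, the block under review could reduce to something like `runtime->is_timing_enabled = env_flag("PTO2_KERNEL_TIMING_ENABLED", false);`, with the warning logged wherever `parse_bool` reports failure.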

ChaoWao · 2026-05-21T03:00:12Z

Please provide comparison data for N separate kernel launches versus a single kernel launch with N inner runs.

Run each 100 times and trim the highest 10 and lowest 10 samples.
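The trimming step requested here could look like the sketch below. `trimmed_mean` is a hypothetical helper, not code from the PR; it sorts the samples, drops the `trim` lowest and `trim` highest, and averages the rest.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: mean after discarding the `trim` lowest and `trim`
// highest samples. Returns 0.0 when trimming would leave no samples.
inline double trimmed_mean(std::vector<double> samples, std::size_t trim) {
    if (samples.size() <= 2 * trim) return 0.0;
    std::sort(samples.begin(), samples.end());
    const auto first = samples.begin() + static_cast<std::ptrdiff_t>(trim);
    const auto last = samples.end() - static_cast<std::ptrdiff_t>(trim);
    const double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

For 100 samples, `trimmed_mean(samples, 10)` averages the middle 80, which keeps a handful of outlier launches from dominating either side of the comparison.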

SergioMartin86 added 5 commits May 19, 2026 15:24

Adding the capability of re-running a task within a single kernel run…

a1ecccf

…. This is important for accurate timing

Adding timing

847b64c

Progress

38e5008

Simplifying

f1953c5

merging with upstream

7c37c6c

gemini-code-assist Bot reviewed May 20, 2026

SergioMartin86 added 3 commits May 20, 2026 12:10

Addressing some agent suggestions

3bf8780

Fixes

f9f25c3

Fix

220d1d9

ChaoWao closed this May 21, 2026

ChaoWao reopened this May 21, 2026

SergioMartin86 added 4 commits May 21, 2026 10:14

Succesffuly running two consecutive inner runs

192ef42

Recovering timing runs

108bfc4

Separated orchestration loading from actual run

55580d6

Separating orhestration from scheduling activities

570660f

SergioMartin86 wants to merge 12 commits into hw-native-sys:main from huawei-csl:intraKernelTiming

2 participants
