[Performance Analysis] Adding intra-kernel timing runs #829
SergioMartin86 wants to merge 12 commits into
…. This is important for accurate timing
Code Review
This pull request introduces a performance timing framework for AICPU kernels, enabling warmup and timed execution iterations configurable via environment variables. The changes include a new two-phase barrier for thread synchronization, the use of thread-local storage for thread indexing, and enhanced logging. Feedback highlights several critical issues: an operator precedence bug in the thread completion logic that prevents proper cleanup, thread-safety violations when calling initialization routines concurrently, and a break in binary compatibility due to field insertion in the Runtime class. Additionally, improvements are suggested for memory ordering in the barrier, robustness in environment variable parsing, and correcting a log message typo.
    std::string env_timing_iterations_string = std::string(env_timing_iterations);
    bool isValidValue = false;
    if (env_timing_iterations_string == "True") { runtime->is_timing_enabled = true; isValidValue = true; }
    if (env_timing_iterations_string == "False") { runtime->is_timing_enabled = false; isValidValue = true; }
    if (isValidValue == false)
    {
        LOG_WARN("PTO2_KERNEL_TIMING_ENABLED=%s is invalid, using default: \"False\"", env_timing_iterations);
        runtime->is_timing_enabled = false;
    }
    }
The environment variable parsing for PTO2_KERNEL_TIMING_ENABLED is brittle as it only accepts exact case-sensitive matches for 'True' or 'False'. It would be more robust to support a wider range of boolean representations (e.g., '1', '0', 'true', 'false', 'on', 'off') and perform case-insensitive comparisons.
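A minimal sketch of what such parsing could look like, assuming a free-standing helper is acceptable here. The name `parse_bool_env` and its signature are hypothetical and not part of this PR; only `PTO2_KERNEL_TIMING_ENABLED`, `runtime->is_timing_enabled` and `LOG_WARN` come from the diff above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of the PR): recognizes common boolean
// spellings, ignoring case. On success it writes the parsed value to *out and
// returns true; for a null or unrecognized input it returns false and leaves
// *out untouched.
inline bool parse_bool_env(const char* raw, bool* out) {
    if (raw == nullptr || out == nullptr) {
        return false;
    }
    std::string value(raw);
    // Lower-case a copy so "True", "TRUE" and "true" all compare equal.
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (value == "1" || value == "true" || value == "on") {
        *out = true;
        return true;
    }
    if (value == "0" || value == "false" || value == "off") {
        *out = false;
        return true;
    }
    return false;
}
```

At the call site, the existing branch could call `parse_bool_env(env_timing_iterations, &runtime->is_timing_enabled)` and keep the current `LOG_WARN` plus the `false` default for the case where it returns `false`.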
We want to add the ability to run a task multiple times inside the same kernel launch. This is essential for precise timing and performance evaluation of both orchestration and scheduling.
We add:
By performing multiple timed runs, we average out the OS/device noise that causes random variations in running time. This noise is significant for these extremely low-latency kernels, so to measure scheduling/orchestration performance precisely we need a statistical analysis over many samples taken inside the same kernel launch.
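As an illustration of the warmup-then-sample idea, here is a minimal sketch of collecting per-run timings and reducing them to summary statistics. It is not the timing function from this PR: `collect_samples`, `summarize` and `TimingSummary` are hypothetical names, and `std::chrono::steady_clock` is assumed as the time source, which may differ from whatever clock the AICPU kernel actually uses.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary type (not part of the PR).
struct TimingSummary {
    double min_us;
    double median_us;
    double mean_us;
    double stddev_us;  // sample standard deviation (n - 1); 0 for fewer than 2 samples
};

// Runs `task` `warmup_runs` times without timing, then `timed_runs` times
// with timing, and returns one duration in microseconds per timed run.
template <typename Task>
std::vector<double> collect_samples(Task&& task, std::size_t warmup_runs, std::size_t timed_runs) {
    for (std::size_t i = 0; i < warmup_runs; ++i) {
        task();
    }
    std::vector<double> samples;
    samples.reserve(timed_runs);
    for (std::size_t i = 0; i < timed_runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        task();
        const auto end = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::micro>(end - start).count());
    }
    return samples;
}

// Reduces the samples to min, median, mean and sample standard deviation.
// Takes the vector by value because it sorts its own copy.
inline TimingSummary summarize(std::vector<double> samples) {
    TimingSummary s{0.0, 0.0, 0.0, 0.0};
    if (samples.empty()) {
        return s;
    }
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    s.min_us = samples.front();
    s.median_us = (n % 2 == 1) ? samples[n / 2]
                               : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    double sum = 0.0;
    for (double x : samples) {
        sum += x;
    }
    s.mean_us = sum / static_cast<double>(n);
    if (n > 1) {
        double squared = 0.0;
        for (double x : samples) {
            squared += (x - s.mean_us) * (x - s.mean_us);
        }
        s.stddev_us = std::sqrt(squared / static_cast<double>(n - 1));
    }
    return s;
}
```

The median and minimum are usually less sensitive to occasional outliers than the mean, which is one reason to report them alongside it for very short kernels.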
Current blocker:
We are trying (and failing) to reset the SchedulerContext back to its initial state so that the task can be re-run within the same kernel launch. We try:
deinit(runtime); init(runtime);
But this results in the test failing and a 10x increase in running time.
Relevant Change:
See https://github.com/hw-native-sys/simpler/pull/829/changes#diff-f1bd1d412c7f0c6e99f4f11c3830d67582037fbbd6ef3a981c34edb244f9a849R761 for the main timing function we added.
We would appreciate help figuring out how to reset the scheduler context so that the pypto task can be cleanly re-run.