
Fix WindowsPerformanceCounter OOM on ARM64 >64 LP systems #661

Open
rudraptpsingh wants to merge 1 commit into microsoft:main from rudraptpsingh:user/rudrasingh/perfcounter_fix

@rudraptpsingh
Contributor

On ARM64 Cobalt 200 machines with >64 logical processors (132 LPs), the .NET [PerformanceCounter.NextValue()] API throws [InvalidOperationException] for every counter instance on every capture interval. The Windows PerfOS.dll library doesn't support processor groups (>64 LPs) — instance names like 0, 1, etc. are discovered successfully, but reading values fails with "Instance N does not exist in the specified Category".

With ~60 counters captured every second over 48 hours, this produces 7M–32M error telemetry events per machine (34–143 GB), ultimately causing OOM and experiment failures. This resulted in a 69% failure rate (164/236 experiments) across 10 racks in idlestress testing.
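As a rough sanity check on these figures, the arithmetic can be sketched as below. It assumes one error event per failing counter read and decimal gigabytes; neither assumption is stated in the description, and the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper for checking the telemetry volume quoted above.
public static class TelemetryVolumeEstimate
{
    // Number of error events if every counter read fails once per second.
    public static long EventsPerRun(int countersPerSecond, int hours)
    {
        return (long)countersPerSecond * hours * 3600;
    }

    // Implied average event size in bytes, assuming 1 GB = 1e9 bytes.
    public static double BytesPerEvent(double gigabytes, double events)
    {
        return gigabytes * 1e9 / events;
    }
}
```

`EventsPerRun(60, 48)` gives 10,368,000 events, which falls inside the reported 7M–32M range. Dividing the reported sizes by the reported event counts implies an average of roughly 4.5 to 4.9 KB per event.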

Fix

  1. Circuit breaker in [WindowsPerformanceCounter]: after 5 consecutive failures, the counter sets [IsDisabled = true] and all subsequent calls return immediately without throwing. This stops the error flood.

  2. WMI fallback via new [WmiPerformanceCounterProvider]: uses a wmic.exe subprocess to query Win32_PerfFormattedData_Counters_ProcessorInformation, which supports multiple processor groups and returns all CPU instances correctly. Includes bidirectional name mapping for 19 known counters so metric names match the legacy format (\Processor(_Total)\% Processor Time).

  3. Startup validation in [WindowsPerformanceCounterMonitor]: after discovering counters, performs a test read. If the legacy API fails for a category, removes those counters and activates the WMI fallback automatically. No configuration changes needed.
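The circuit breaker in item 1 can be illustrated with a minimal sketch. This is not the code from the PR: the class name `CircuitBreakerCounter`, the `TryCapture` method, and the injected `Func<float>` reader are hypothetical, and only `IsDisabled`, `ConsecutiveFailures`, `LastError`, `ResetDisabledState()` and `MaxConsecutiveFailures = 5` are names taken from the description. For simplicity the sketch swallows failures before the threshold as well, which may differ from the actual change.

```csharp
using System;

// Hypothetical stand-in for WindowsPerformanceCounter, reduced to the
// circuit-breaker behavior described in the PR.
public class CircuitBreakerCounter
{
    public const int MaxConsecutiveFailures = 5;

    // Assumed abstraction over PerformanceCounter.NextValue().
    private readonly Func<float> readValue;

    public CircuitBreakerCounter(Func<float> readValue)
    {
        this.readValue = readValue ?? throw new ArgumentNullException(nameof(readValue));
    }

    public bool IsDisabled { get; private set; }

    public int ConsecutiveFailures { get; private set; }

    public Exception LastError { get; private set; }

    // Returns true with a value on success. Returns false when the read
    // failed or when the counter has already been disabled.
    public bool TryCapture(out float value)
    {
        value = 0f;
        if (this.IsDisabled)
        {
            // Short-circuit: no read attempt, no exception, no telemetry.
            return false;
        }

        try
        {
            value = this.readValue();
            this.ConsecutiveFailures = 0; // any success resets the streak
            return true;
        }
        catch (InvalidOperationException exc)
        {
            this.LastError = exc;
            this.ConsecutiveFailures++;
            if (this.ConsecutiveFailures >= MaxConsecutiveFailures)
            {
                this.IsDisabled = true;
            }

            return false;
        }
    }

    // Re-enables the counter. Clearing LastError here is an assumption.
    public void ResetDisabledState()
    {
        this.IsDisabled = false;
        this.ConsecutiveFailures = 0;
        this.LastError = null;
    }
}
```

With a reader that always throws, the underlying API is called exactly five times and every later `TryCapture` call returns `false` without touching it, which is what bounds the error volume.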

Result: Tested on ARM64 132-LP machine: 0 errors, 76K+ metrics collected, WMI fallback activates cleanly.

Changes:

  • WindowsPerformanceCounter: IsDisabled, ConsecutiveFailures, LastError, ResetDisabledState(), MaxConsecutiveFailures=5
  • WmiPerformanceCounterProvider (new): wmic subprocess CSV parser with forward/reverse counter name mapping
  • WindowsPerformanceCounterMonitor: WmiCounters dict, TryActivateWmiFallback, test-read validation in LoadCounters, WMI capture/snapshot in loops
  • Unit tests: circuit breaker, snapshot strategies, WMI mappings, disabled-counter skip in capture loop
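The forward/reverse counter name mapping could look roughly like the sketch below. The class `WmiCounterNameMap` and its methods are hypothetical, and the three entries are illustrative only: they pair real property names of Win32_PerfFormattedData_Counters_ProcessorInformation with the corresponding legacy counter names, but the PR's actual table of 19 counters is not shown on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a bidirectional WMI-property <-> legacy-name map.
public static class WmiCounterNameMap
{
    // Forward map: WMI property name -> legacy counter name (illustrative subset).
    private static readonly Dictionary<string, string> WmiToLegacy =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["PercentProcessorTime"] = "% Processor Time",
            ["PercentUserTime"] = "% User Time",
            ["PercentPrivilegedTime"] = "% Privileged Time"
        };

    // Reverse map, derived from the forward map so the two cannot drift apart.
    private static readonly Dictionary<string, string> LegacyToWmi = BuildReverse();

    public static bool TryGetLegacyName(string wmiProperty, out string legacyName)
    {
        return WmiToLegacy.TryGetValue(wmiProperty, out legacyName);
    }

    public static bool TryGetWmiProperty(string legacyName, out string wmiProperty)
    {
        return LegacyToWmi.TryGetValue(legacyName, out wmiProperty);
    }

    // Builds a metric name in the legacy \Category(Instance)\Counter format.
    public static string ToMetricName(string category, string instance, string wmiProperty)
    {
        if (!WmiToLegacy.TryGetValue(wmiProperty, out string legacyName))
        {
            throw new KeyNotFoundException($"No legacy name is mapped for '{wmiProperty}'.");
        }

        return $@"\{category}({instance})\{legacyName}";
    }

    private static Dictionary<string, string> BuildReverse()
    {
        var reverse = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (KeyValuePair<string, string> pair in WmiToLegacy)
        {
            reverse[pair.Value] = pair.Key;
        }

        return reverse;
    }
}
```

For example, `WmiCounterNameMap.ToMetricName("Processor", "_Total", "PercentProcessorTime")` yields `\Processor(_Total)\% Processor Time`, so downstream consumers would see the same metric names whichever provider produced the value.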

Add circuit breaker to WindowsPerformanceCounter that disables after 5
consecutive failures, preventing the error flood (7M-32M events) that
caused OOM on Cobalt 200 (132 LP) machines.

Add WmiPerformanceCounterProvider as fallback using wmic.exe subprocess
when legacy PerformanceCounter API fails. Supports all Processor/
Processor Information counters with bidirectional name mapping.

WindowsPerformanceCounterMonitor now test-reads counters at startup and
switches to WMI fallback if legacy API is broken for a category.

Tested on ARM64 132-LP machine: 0 errors, 76K+ metrics, exit code 0.

Changes:
- WindowsPerformanceCounter: IsDisabled, ConsecutiveFailures, LastError,
  ResetDisabledState(), MaxConsecutiveFailures=5
- WmiPerformanceCounterProvider (new): wmic subprocess CSV parser with
  forward/reverse counter name mapping
- WindowsPerformanceCounterMonitor: WmiCounters dict, TryActivateWmiFallback,
  test-read validation in LoadCounters, WMI capture/snapshot in loops
- Unit tests: circuit breaker, snapshot strategies, WMI mappings,
  disabled-counter skip in capture loop
Contributor

@brdeyo left a comment

We will not be able to use this fallback scenario. It will work on "regular bare metal" systems but will not work on the Azure Host OS as WMI is not available there.

My recommendation is to create a separate profile MONITORS-DEFAULT-2.json that uses the WMI option vs. the WindowsPerformanceCounter option explicitly.

/// multi-processor groups and returns all CPU cores with "Group,Core" instance naming.
/// Data is collected by invoking wmic.exe and parsing CSV output.
/// </remarks>
public class WmiPerformanceCounterProvider : IPerformanceMetric, IDisposable

Rename to WindowsWmiPerformanceCounter.

This keeps the same naming convention as the existing class, so that the two are more "visibly obvious" as similar.

/// The set of WMI-based fallback counters for categories where the legacy
/// PerformanceCounter API fails (e.g. on systems with >64 logical processors).
/// </summary>
protected IDictionary<string, WmiPerformanceCounterProvider> WmiCounters { get; }

Separate the WMI feature from the base WindowsPerformanceCounter OR alternatively, allow the user/profile author to pass in a flag to the class that defines which counter handler/provider will be used.

We SHOULD be able to use conditional parameters in profiles to identify when we are on a Cobalt system and set the flag.
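One possible shape for such a flag is sketched below. Everything here is hypothetical: the `CounterProvider` enum, the `CounterProviderOptions` class, and the idea that the value arrives as a string from a profile parameter are assumptions for illustration, not part of the PR or of the existing profile schema.

```csharp
using System;

// Hypothetical flag identifying which counter provider a monitor should use.
public enum CounterProvider
{
    PerformanceCounter, // legacy .NET PerformanceCounter API
    Wmi                 // WMI-based provider
}

public static class CounterProviderOptions
{
    // Parses an assumed profile parameter value. A missing value keeps the
    // legacy provider so that existing profiles behave as before.
    public static CounterProvider Parse(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
        {
            return CounterProvider.PerformanceCounter;
        }

        if (Enum.TryParse(value.Trim(), ignoreCase: true, out CounterProvider provider)
            && Enum.IsDefined(typeof(CounterProvider), provider))
        {
            return provider;
        }

        throw new ArgumentException($"Unsupported counter provider '{value}'.", nameof(value));
    }
}
```

Making the choice explicit avoids runtime fallback logic inside the monitor: a profile would opt in to WMI only on systems where it is known to be available, and the default remains unchanged everywhere else.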
