diff --git a/.agents/skills b/.agents/skills
Lines changed: 1 addition & 0 deletions
diff --git a/.claude/skills/fory-performance-optimization/SKILL.md b/.claude/skills/fory-performance-optimization/SKILL.md
Lines changed: 119 additions & 0 deletions
diff --git a/.claude/skills/fory-performance-optimization/agents/openai.yaml b/.claude/skills/fory-performance-optimization/agents/openai.yaml
Lines changed: 21 additions & 0 deletions
diff --git a/.claude/skills/fory-performance-optimization/references/bottleneck-playbook.md b/.claude/skills/fory-performance-optimization/references/bottleneck-playbook.md
Lines changed: 142 additions & 0 deletions
diff --git a/.claude/skills/fory-performance-optimization/references/language-command-matrix.md b/.claude/skills/fory-performance-optimization/references/language-command-matrix.md
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1 @@
+../.claude/skills
@@ -0,0 +1,119 @@
+---
+name: fory-performance-optimization
+description: Run profile-driven bottleneck optimization across Apache Fory implementations (Java, C++, Python/Cython, Go, Rust, Swift, C#, JavaScript/TypeScript, Dart, Kotlin, Scala). Use when improving serialize/deserialize throughput or latency, recovering regressions against a reference commit, diagnosing flamegraphs, fixing perf-related CI failures, or porting proven optimizations across languages without protocol or API regressions.
+---
+
+# Fory Performance Optimization
+
+## Mission
+
+Deliver measurable performance improvements in Apache Fory without protocol drift, correctness regressions, benchmark-shape tricks, or accidental API rollback.
+
+## Operating Principles
+
+- Start from data, not intuition.
+- Profile before changing hot code.
+- Change one bottleneck at a time.
+- Benchmark sequentially on the same machine state (one benchmark process at a time).
+- Keep only measured wins or explicitly requested architecture cleanups.
+- Revert speculative changes that do not pay off.
+- Align with reference runtimes (usually C++ first, then Rust/Java) when behavior and ownership models differ.
+
+## Enforce Hard Constraints
+
+- Preserve wire protocol unless explicitly requested.
+- Preserve cross-language semantics and xlang compatibility.
+- Never run benchmarks concurrently on one host; run exactly one benchmark command at a time.
+- Do not optimize by changing benchmark payload definitions, field encodings, or benchmark methodology.
+- Do not add payload-identity or repeated-input caches that depend on benchmark shape.
+- Do not restore removed APIs/legacy wrappers when the user forbids it.
+- Do not preserve legacy/dead code or stale docs in optimization rounds; remove them when touched.
+- Keep API surface minimal: do not add new API unless required by protocol/correctness or explicitly requested.
+- Never add public hacky API for performance shortcuts; keep optimization helpers internal/private and conceptually clean.
+- Do not hide regressions behind unsafe compiler flags or benchmark-only code paths.
+- Keep optimization surfaces nested-safe; avoid root-only shortcuts unless they are architecturally valid and requested.
+
+## Execute Workflow
+
+1. Read context and constraints.
+
+- Read `tasks/perf_optimization_rounds.md` and `tasks/lessons.md`.
+- Read the relevant spec in `docs/specification/` for any path that may affect wire behavior.
+- Record explicit user constraints (forbidden APIs, naming, architecture, protocol rules).
+
+2. Define target and baseline.
+
+- Identify one primary KPI (for example `Struct Serialize ns/op` or ops/sec).
+- Benchmark current `HEAD`.
+- If a reference commit is provided, benchmark it once and persist the result in a file (for example `tasks/perf_baselines/<id>.md`) to avoid repeated reruns.
+
+3. Profile the hotspot.
+
+- Capture a flamegraph or sampled stacks on the exact benchmark command.
+- Quantify top costs by bucket (runtime bookkeeping, dispatch, allocation/copy, map/cache operations, buffer growth, metadata parse/validation).
+- Tie each bucket to concrete file/line ownership before proposing changes.
+
+4. Form one round hypothesis.
+
+- State one bottleneck and one expected effect.
+- Prefer structural fixes over micro-tweaks.
+- If another runtime already solved the same bottleneck, port its design shape first.
+
+5. Implement minimal change.
+
+- Touch the smallest surface that can validate the hypothesis.
+- Keep invariants explicit: protocol bytes, ownership, cache lifetime, reference semantics, nullability, schema-compatible behavior.
+
+6. Verify correctness.
+
+- Run language-local build/test/lint for the touched implementation.
+- Run cross-language checks when runtime/type/protocol behavior can affect xlang.
+- Confirm serialized sizes and compatibility expectations where applicable.
+
+7. Benchmark and compare.
+
+- Run targeted benchmark at least twice sequentially.
+- Use longer duration when signal is noisy.
+- Run one short full-suite sanity benchmark to catch collateral regressions.
+
+8. Decide keep or revert.
+
+- Keep only if the gain is repeatable, or if the cleanup was explicitly requested and its measured tradeoff is accepted.
+- Revert if performance regresses, or if the gain is within noise and complexity increases.
+- If a required cleanup regresses, redesign inside the new architecture instead of restoring banned patterns.
+
+9. Log every round.
+
+- Append one round entry to `tasks/perf_optimization_rounds.md` before starting the next round.
+- Include hypothesis, code change, exact commands, before/after numbers, and keep/revert decision.
+- Commit retained non-trivial rounds immediately.
+
+10. Re-plan on instability.
+
+- Stop and re-plan when benchmark runs disagree, machine contention is suspected, or the profile does not match the hypothesis.
+- Re-ground on current `HEAD` after reset/rebase/checkout events before making further changes.
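The "one benchmark process at a time, at least twice" rule in steps 7 and 10 can be sketched as a small driver. This is a minimal illustration, not tooling from the repository: the function name `run_sequentially` is hypothetical, and wall-clock time stands in for the benchmark's own ns/op or ops/sec output, which a real harness would parse instead.

```python
import subprocess
import sys
import time


def run_sequentially(commands, repeats=2):
    """Run each command `repeats` times, strictly one process at a time.

    Returns a list of (command, run_index, seconds, returncode) tuples.
    Wall-clock seconds are only a placeholder metric for this sketch.
    """
    results = []
    for command in commands:
        for run_index in range(repeats):
            start = time.perf_counter()
            # subprocess.run blocks until the child exits, so no two
            # benchmark processes started here can overlap.
            completed = subprocess.run(command, capture_output=True, text=True)
            elapsed = time.perf_counter() - start
            results.append((tuple(command), run_index, elapsed, completed.returncode))
    return results


if __name__ == "__main__":
    # Placeholder command; substitute the exact benchmark command under test.
    for row in run_sequentially([[sys.executable, "-c", "print('bench')"]]):
        print(row)
```

Because each call blocks, before/after comparisons driven through one such loop cannot run in parallel by accident.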
+
+## Apply Decision Rules
+
+- Treat movement below roughly 1-2% as noise unless it repeats under controlled runs.
+- Require explicit proof for complexity-increasing optimizations.
+- Prefer deleting dead APIs and dead state quickly after refactors.
+- Keep naming/API cleanup only if performance remains in band.
+- Never run before/after comparisons in parallel.
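One way to apply the noise rule above to repeated runs is sketched here. The 2% band, the name `classify_round`, and the "every run must agree" check are assumptions chosen for illustration; the skill itself only states that movement of about 1-2% should be treated as noise unless repeated.

```python
from statistics import median


def classify_round(before_ns, after_ns, noise_band=0.02):
    """Classify a round from repeated, sequentially collected ns/op samples.

    Returns (decision, relative_change); a negative change means faster.
    The 2% default band is an assumption for illustration.
    """
    if len(before_ns) < 2 or len(after_ns) < 2:
        raise ValueError("need at least two sequential runs per side")
    base = median(before_ns)
    new = median(after_ns)
    change = (new - base) / base
    if abs(change) <= noise_band:
        return "noise", change
    # Outside the band: require every "after" run to beat (or lose to)
    # every "before" run, so a single outlier cannot decide the round.
    if change < 0 and max(after_ns) < min(before_ns):
        return "keep", change
    if change > 0 and min(after_ns) > max(before_ns):
        return "revert", change
    return "rerun", change
```

A `"rerun"` result corresponds to the re-plan condition: the runs overlap, so a longer duration or a quieter machine is needed before deciding.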
+
+## Use References
+
+- Use [`references/workflow-checklist.md`](references/workflow-checklist.md) for execution checklists and stop conditions.
+- Use [`references/language-command-matrix.md`](references/language-command-matrix.md) for per-language build/test/benchmark/profile commands.
+- Use [`references/bottleneck-playbook.md`](references/bottleneck-playbook.md) for hotspot-to-fix mapping.
+- Use [`references/round-template.md`](references/round-template.md) to log each optimization round consistently.
+
+## Produce Output
+
+When finishing an optimization task, report:
+
+- Baseline command and numbers.
+- Final command and numbers.
+- Net delta on primary KPI.
+- Correctness and compatibility verification run.
+- Kept vs reverted rounds and rationale.
@@ -0,0 +1,21 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+interface:
+  display_name: "Fory Perf Optimizer"
+  short_description: "Profile-first Apache Fory perf optimization playbook"
+  default_prompt: "Optimize Apache Fory bottlenecks with a profile-driven, benchmark-verified, cross-language workflow that preserves protocol correctness."
@@ -0,0 +1,142 @@
+# Bottleneck Playbook
+
+## 1) Dispatch And Runtime Bookkeeping
+
+Symptoms:
+
+- High sample counts in runtime access/exclusivity checks or witness-dispatch bookkeeping.
+
+Actions:
+
+- Reduce repeated mutable accesses in tight loops.
+- Collapse helper layering on hot paths.
+- Move costly work from per-field/per-element paths to one-time setup.
+- Prefer concrete/local cursor mutation in critical loops.
+
+Avoid:
+
+- API splits that add extra existential/cross-protocol dispatch in the hottest generic paths.
+
+## 2) Buffer Growth And Materialization
+
+Symptoms:
+
+- High time in allocation, copy, or final materialization to output buffers.
+
+Actions:
+
+- Grow once for max possible bytes when encoding variable-width fields.
+- Use local write cursor and commit once.
+- Keep copy boundaries explicit and minimize conversion churn.
+
+Avoid:
+
+- Rewrites that increase allocation count or add copy steps despite lower-level pointer usage.
+
+## 3) Varint Encode/Decode Overhead
+
+Symptoms:
+
+- Repeated size prepass plus repeated encode work for the same value.
+- Slow varint branches dominating primitive-heavy structs.
+
+Actions:
+
+- Remove value-dependent prepass when safe by reserving maximum bytes.
+- Use packed/loop-based slow paths where appropriate.
+- Keep exact writer-index commit after block write.
+
+Avoid:
+
+- Double-checking varint widths per field when one max-size reservation can cover the block.
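The "reserve maximum bytes, write with a local cursor, commit once" pattern from sections 2 and 3 can be sketched as follows. This is a toy model under stated assumptions: the `Writer` class is hypothetical, and the encoding shown is a generic 7-bits-per-byte unsigned varint used only for illustration. The authoritative Fory encodings are defined in `docs/specification/`.

```python
MAX_VARUINT32_BYTES = 5  # a 32-bit value needs at most ceil(32 / 7) = 5 bytes


class Writer:
    """Toy growable buffer with an explicit writer index (hypothetical)."""

    def __init__(self, capacity=16):
        self.buf = bytearray(capacity)
        self.writer_index = 0

    def reserve(self, extra):
        """Ensure `extra` bytes are writable past writer_index."""
        needed = self.writer_index + extra
        if needed > len(self.buf):
            new_size = max(needed, 2 * len(self.buf))
            self.buf.extend(bytes(new_size - len(self.buf)))

    def write_varuint32_block(self, values):
        # One worst-case reservation replaces a per-value size prepass.
        self.reserve(MAX_VARUINT32_BYTES * len(values))
        buf = self.buf
        pos = self.writer_index  # local cursor; the shared index is untouched
        for value in values:
            if not 0 <= value <= 0xFFFFFFFF:
                raise ValueError("value out of uint32 range")
            while value >= 0x80:
                buf[pos] = (value & 0x7F) | 0x80  # low 7 bits plus continuation bit
                pos += 1
                value >>= 7
            buf[pos] = value
            pos += 1
        self.writer_index = pos  # commit the exact number of bytes written

    def to_bytes(self):
        return bytes(self.buf[: self.writer_index])
```

The reservation may over-allocate, but the committed `writer_index` reflects only the bytes actually written, so serialized size is unchanged.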
+
+## 4) Type Resolver And Metadata Path
+
+Symptoms:
+
+- Heavy cost in compatible type-info lookup, parsing, or temporary wrappers.
+
+Actions:
+
+- Keep canonical type info ownership in resolver/context aligned with reference runtimes.
+- Cache by stable protocol keys (for example, headers), not benchmark payload identity.
+- Reduce redundant wrappers and duplicated metadata ownership.
+
+Avoid:
+
+- Side caches that leak abstractions to callsites (`push/pop/clear` bookkeeping in user-facing flow).
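Caching by a stable protocol key, owned by the resolver, might look like the sketch below. `TypeInfoCache`, the integer `header`, and the `parse` callback are hypothetical names, and the sketch assumes the header uniquely identifies the metadata body, which the real protocol may or may not guarantee for a given path.

```python
class TypeInfoCache:
    """Resolver-owned cache keyed by a stable protocol value (hypothetical)."""

    def __init__(self, parse):
        self._parse = parse      # callable: (header, body) -> parsed type info
        self._by_header = {}     # header -> parsed type info
        self.parse_calls = 0     # counter kept only to make the effect visible

    def resolve(self, header, body):
        info = self._by_header.get(header)
        if info is None:
            # Miss: parse once and let the resolver own the result.
            info = self._parse(header, body)
            self._by_header[header] = info
            self.parse_calls += 1
        return info
```

Call sites only ever call `resolve`; there is no `push/pop/clear` bookkeeping for them to get wrong, and nothing in the key depends on which payload object the benchmark happens to reuse.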
+
+## 5) Context Reset And Map/Array Maintenance
+
+Symptoms:
+
+- Noticeable time in context reset, map clear, array churn, or cache maintenance.
+
+Actions:
+
+- Use O(1) reset for reusable containers.
+- Keep data structures cache-local and simple for hot-path operations.
+- Remove dead fields/methods quickly after refactors.
+
+Avoid:
+
+- Over-engineered multi-path caches unless proven necessary and mirrored by reference runtimes.
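An O(1) reset for a reusable container can be sketched with a generation counter. `GenerationSlots` is an illustrative name, not a type from any Fory runtime, and the fixed-size slot table is an assumption made to keep the example short.

```python
class GenerationSlots:
    """Fixed-size slot table whose reset is O(1) (illustrative only)."""

    def __init__(self, size):
        self._values = [None] * size
        self._stamps = [0] * size   # generation in which each slot was written
        self._generation = 1

    def put(self, index, value):
        self._values[index] = value
        self._stamps[index] = self._generation

    def get(self, index):
        # A slot written in an older generation counts as empty.
        if self._stamps[index] == self._generation:
            return self._values[index]
        return None

    def reset(self):
        # Bumping the counter invalidates every slot without touching them.
        self._generation += 1
```

Two tradeoffs are worth noting: stale values stay referenced until overwritten, and in languages with fixed-width integers the counter needs a wrap-around strategy (for example, a full clear when it overflows).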
+
+## 6) Compatible Schema Read/Write Flow
+
+Symptoms:
+
+- Large compatible-path overhead or regressions after cleanup.
+
+Actions:
+
+- Keep flow aligned with C++/Rust ownership and dispatch model.
+- Move expensive matching/validation to type-info parse stage when possible.
+- Keep typed scoping of pending compatible metadata to avoid nested decode corruption.
+
+Avoid:
+
+- Untyped global compatible slots.
+- Broad helper-shaped replacement paths that bypass established protocol flow.
+
+## 7) Cleanup-Driven Regressions
+
+Symptoms:
+
+- API/abstraction cleanup causes throughput drop.
+
+Actions:
+
+- Keep the cleanup only if the result stays within the benchmark noise band or the user explicitly accepts the tradeoff.
+- Redesign inside the cleaned architecture to recover performance.
+
+Avoid:
+
+- Reverting to banned legacy shapes.
+- Preserving cleanup that harms hot paths without a follow-up recovery plan.
+
+## 8) Cross-Language Porting
+
+Actions:
+
+- Identify the exact structure in the reference runtime (owner, cache key, lifetime, loop shape).
+- Port behavior and data-flow model, not language syntax.
+- Verify xlang semantics after porting.
+
+Avoid:
+
+- Language-specific shortcuts that diverge from shared protocol/runtime concepts.
+
+## Keep/Revert Rubric
+
+Keep when:
+
+- Improvement is repeatable and non-trivial.
+- Correctness/lint/tests remain green.
+- Complexity increase is justified by measured gain.
+
+Revert when:
+
+- Regression is clear or gain is noise.
+- Change introduces benchmark-only behavior.
+- Change violates explicit user constraints.
@@ -0,0 +1,89 @@
+# Language Command Matrix
+
+Use this as the default verification matrix after performance changes. Run commands from the language directory unless noted.
+
+## Swift
+
+- Build: `swift build`
+- Tests: `swift test`
+- Lint: `swiftlint lint --config .swiftlint.yml`
+- Benchmark: `cd benchmarks/swift && swift build -c release && ./.build/release/swift-benchmark --duration <N>`
+- Profile (macOS sample): run benchmark with long duration, then `sample <pid> 10 1 -mayDie -file /tmp/<name>.sample.txt`
+
+## C++
+
+- Build: `bazel build //cpp/...`
+- Tests: `bazel test $(bazel query //cpp/...)`
+- Perf tests: `bazel test $(bazel query //cpp/fory/serialization/...)`
+- Profile: use repository-approved sampling tooling from `CONTRIBUTING.md` and `docs/cpp_debug.md`
+
+## Java
+
+- Build: `mvn -T16 package`
+- Tests: `mvn -T16 test`
+- Format/style checks as needed: `mvn spotless:check`, `mvn checkstyle:check`
+- Profile: JFR or async-profiler on the exact benchmark/test workload
+
+## Python/Cython
+
+- Install: `pip install -v -e .`
+- Tests (python mode): `ENABLE_FORY_CYTHON_SERIALIZATION=0 pytest -v -s .`
+- Tests (cython mode): `ENABLE_FORY_CYTHON_SERIALIZATION=1 pytest -v -s .`
+- Format/lint: `ruff format . && ruff check --fix .`
+- Profile: `py-spy`, `cProfile`, and Cython annotations as needed
+
+## Rust
+
+- Build: `cargo build`
+- Check: `cargo check`
+- Lint: `cargo clippy --all-targets --all-features -- -D warnings`
+- Tests: `cargo test --features tests`
+- Profile: flamegraph/perf tooling on benchmark or targeted test
+
+## Go
+
+- Build: `go build`
+- Tests: `go test -v ./...`
+- Format: `go fmt ./...`
+- Profile: `pprof` (`go test -bench` + cpu/mem profiles)
+
+## C#
+
+- Build: `dotnet build Fory.sln -c Release --no-restore`
+- Tests: `dotnet test Fory.sln -c Release`
+- Format check: `dotnet format Fory.sln --verify-no-changes`
+- Profile: `dotnet-trace` / `dotnet-counters` on benchmark/test runs
+
+## JavaScript/TypeScript
+
+- Install: `npm install`
+- Tests: `node ./node_modules/.bin/jest --ci --reporters=default --reporters=jest-junit`
+- Lint: `git ls-files -- '*.ts' | xargs -P 5 node ./node_modules/.bin/eslint`
+
+## Dart
+
+- Generate: `dart run build_runner build`
+- Tests: `dart test`
+- Analyze/fix: `dart analyze && dart fix --dry-run`
+
+## Kotlin
+
+- Build: `mvn clean package`
+- Tests: `mvn test`
+
+## Scala
+
+- Build: `sbt compile`
+- Tests: `sbt test`
+- Format: `sbt scalafmt`
+
+## Cross-Language Xlang Verification
+
+When changing xlang/runtime semantics, run the relevant Java-driven xlang tests from `java/fory-core`, with debug output enabled, for each impacted language:
+
+- `CPPXlangTest`
+- `CSharpXlangTest`
+- `RustXlangTest`
+- `GoXlangTest`
+- `PythonXlangTest`
+- `SwiftXlangTest`