Commit 30a3919

Merge branch 'main' into id-based-enum
2 parents 5ab49f8 + 5fc06f1 commit 30a3919

398 files changed

Lines changed: 23032 additions & 9814 deletions

.agents/skills

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
../.claude/skills
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
1+
---
2+
name: fory-performance-optimization
3+
description: Run profile-driven bottleneck optimization across Apache Fory implementations (Java, C++, Python/Cython, Go, Rust, Swift, C#, JavaScript/TypeScript, Dart, Kotlin, Scala). Use when improving serialize/deserialize throughput or latency, recovering regressions against a reference commit, diagnosing flamegraphs, fixing perf-related CI failures, or porting proven optimizations across languages without protocol or API regressions.
4+
---
5+
6+
# Fory Performance Optimization
7+
8+
## Mission
9+
10+
Deliver measurable performance improvements in Apache Fory without protocol drift, correctness regressions, benchmark-shape tricks, or accidental API rollback.
11+
12+
## Operating Principles
13+
14+
- Start from data, not intuition.
15+
- Profile before changing hot code.
16+
- Change one bottleneck at a time.
17+
- Benchmark sequentially on the same machine state (one benchmark process at a time).
18+
- Keep only measured wins or explicitly requested architecture cleanups.
19+
- Revert speculative changes that do not pay off.
20+
- Align with reference runtimes (usually C++ first, then Rust/Java) when behavior and ownership models differ.
21+
22+
## Enforce Hard Constraints
23+
24+
- Preserve wire protocol unless explicitly requested.
25+
- Preserve cross-language semantics and xlang compatibility.
26+
- Never run two benchmarks at the same time on one host; run exactly one benchmark command at a time.
27+
- Do not optimize by changing benchmark payload definitions, field encodings, or benchmark methodology.
28+
- Do not add payload-identity or repeated-input caches that depend on benchmark shape.
29+
- Do not restore removed APIs/legacy wrappers when the user forbids it.
30+
- Do not preserve legacy/dead code or stale docs in optimization rounds; remove them when touched.
31+
- Keep API surface minimal: do not add new API unless required by protocol/correctness or explicitly requested.
32+
- Never add public hacky API for performance shortcuts; keep optimization helpers internal/private and conceptually clean.
33+
- Do not hide regressions behind unsafe compiler flags or benchmark-only code paths.
34+
- Keep optimization surfaces nested-safe; avoid root-only shortcuts unless they are architecturally valid and requested.
35+
36+
## Execute Workflow
37+
38+
1. Read context and constraints.
39+
40+
- Read `tasks/perf_optimization_rounds.md` and `tasks/lessons.md`.
41+
- Read the relevant spec in `docs/specification/` for any path that may affect wire behavior.
42+
- Record explicit user constraints (forbidden APIs, naming, architecture, protocol rules).
43+
44+
2. Define target and baseline.
45+
46+
- Identify one primary KPI (for example `Struct Serialize ns/op` or ops/sec).
47+
- Benchmark current `HEAD`.
48+
- If a reference commit is provided, benchmark it once and persist the result in a file (for example `tasks/perf_baselines/<id>.md`) to avoid repeated reruns.
49+
50+
3. Profile the hotspot.
51+
52+
- Capture a flamegraph or sampled stacks on the exact benchmark command.
53+
- Quantify top costs by bucket (runtime bookkeeping, dispatch, allocation/copy, map/cache operations, buffer growth, metadata parse/validation).
54+
- Tie each bucket to concrete file/line ownership before proposing changes.
55+
56+
4. Form one round hypothesis.
57+
58+
- State one bottleneck and one expected effect.
59+
- Prefer structural fixes over micro-tweaks.
60+
- If another runtime already solved the same bottleneck, port its design shape first.
61+
62+
5. Implement minimal change.
63+
64+
- Touch the smallest surface that can validate the hypothesis.
65+
- Keep invariants explicit: protocol bytes, ownership, cache lifetime, reference semantics, nullability, schema-compatible behavior.
66+
67+
6. Verify correctness.
68+
69+
- Run language-local build/test/lint for the touched implementation.
70+
- Run cross-language checks when runtime/type/protocol behavior can affect xlang.
71+
- Confirm serialized sizes and compatibility expectations where applicable.
72+
73+
7. Benchmark and compare.
74+
75+
- Run targeted benchmark at least twice sequentially.
76+
- Use longer duration when signal is noisy.
77+
- Run one short full-suite sanity benchmark to catch collateral regressions.
78+
79+
8. Decide keep or revert.
80+
81+
- Keep only if the gain is repeatable, or if the cleanup was explicitly requested and its measured tradeoff was accepted.
82+
- Revert if performance regresses, or if the gain is within noise and the change increases complexity.
83+
- If a required cleanup regresses, redesign inside the new architecture instead of restoring banned patterns.
84+
85+
9. Log every round.
86+
87+
- Append one round entry to `tasks/perf_optimization_rounds.md` before starting the next round.
88+
- Include hypothesis, code change, exact commands, before/after numbers, and keep/revert decision.
89+
- Commit retained non-trivial rounds immediately.
90+
91+
10. Re-plan on instability.
92+
93+
- Stop and re-plan when benchmark runs conflict, machine contention is suspected, or profile does not match hypothesis.
94+
- Re-ground on current `HEAD` after reset/rebase/checkout events before making further changes.
95+
96+
## Apply Decision Rules
97+
98+
- Treat movement below 1-2% as noise unless it repeats under controlled runs.
99+
- Require explicit proof for complexity-increasing optimizations.
100+
- Prefer deleting dead APIs and dead state quickly after refactors.
101+
- Keep naming/API cleanup only if performance remains in band.
102+
- Never run before/after comparisons in parallel.
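
The noise rule above can be sketched as a small comparison helper. This is an illustrative sketch only: `classify_round`, its parameters, and the 2% default band are assumptions chosen for the example, not part of the Fory tooling. It assumes each side was measured at least twice, sequentially, in ns/op (lower is better).

```python
from statistics import median


def classify_round(before_ns, after_ns, noise_band=0.02):
    """Classify one optimization round from repeated ns/op samples.

    Hypothetical helper: the name and the 2% default band are assumptions.
    Returns a (verdict, relative_delta) pair; a negative delta means faster.
    """
    if len(before_ns) < 2 or len(after_ns) < 2:
        raise ValueError("need at least two sequential runs per side")
    base = median(before_ns)
    new = median(after_ns)
    delta = (new - base) / base
    if abs(delta) <= noise_band:
        # Movement inside the band is treated as noise, not as a win.
        return "noise", delta
    return ("win" if delta < 0 else "regression"), delta
```

A "noise" verdict combined with added complexity would point toward reverting under the rules above; a "win" still needs the correctness checks to pass before the round is kept.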
103+
104+
## Use References
105+
106+
- Use [`references/workflow-checklist.md`](references/workflow-checklist.md) for execution checklists and stop conditions.
107+
- Use [`references/language-command-matrix.md`](references/language-command-matrix.md) for per-language build/test/benchmark/profile commands.
108+
- Use [`references/bottleneck-playbook.md`](references/bottleneck-playbook.md) for hotspot-to-fix mapping.
109+
- Use [`references/round-template.md`](references/round-template.md) to log each optimization round consistently.
110+
111+
## Produce Output
112+
113+
When finishing an optimization task, report:
114+
115+
- Baseline command and numbers.
116+
- Final command and numbers.
117+
- Net delta on primary KPI.
118+
- Correctness and compatibility verification run.
119+
- Kept vs reverted rounds and rationale.
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
# Licensed to the Apache Software Foundation (ASF) under one
2+
# or more contributor license agreements. See the NOTICE file
3+
# distributed with this work for additional information
4+
# regarding copyright ownership. The ASF licenses this file
5+
# to you under the Apache License, Version 2.0 (the
6+
# "License"); you may not use this file except in compliance
7+
# with the License. You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing,
12+
# software distributed under the License is distributed on an
13+
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
# KIND, either express or implied. See the License for the
15+
# specific language governing permissions and limitations
16+
# under the License.
17+
18+
interface:
19+
display_name: "Fory Perf Optimizer"
20+
short_description: "Profile-first Apache Fory perf optimization playbook"
21+
default_prompt: "Optimize Apache Fory bottlenecks with a profile-driven, benchmark-verified, cross-language workflow that preserves protocol correctness."
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
# Bottleneck Playbook
2+
3+
## 1) Dispatch And Runtime Bookkeeping
4+
5+
Symptoms:
6+
7+
- High samples in runtime access/exclusivity or witness dispatch bookkeeping.
8+
9+
Actions:
10+
11+
- Reduce repeated mutable accesses in tight loops.
12+
- Collapse helper layering on hot paths.
13+
- Move costly work from per-field/per-element paths to one-time setup.
14+
- Prefer concrete/local cursor mutation in critical loops.
15+
16+
Avoid:
17+
18+
- API splits that add extra existential/cross-protocol dispatch in hottest generic paths.
19+
20+
## 2) Buffer Growth And Materialization
21+
22+
Symptoms:
23+
24+
- High time in allocation, copy, or final materialization to output buffers.
25+
26+
Actions:
27+
28+
- Grow once for max possible bytes when encoding variable-width fields.
29+
- Use local write cursor and commit once.
30+
- Keep copy boundaries explicit and minimize conversion churn.
31+
32+
Avoid:
33+
34+
- Rewrites that increase allocation count or add copy steps despite lower-level pointer usage.
35+
36+
## 3) Varint Encode/Decode Overhead
37+
38+
Symptoms:
39+
40+
- Repeated size prepass plus repeated encode work for the same value.
41+
- Slow varint branches dominating primitive-heavy structs.
42+
43+
Actions:
44+
45+
- Remove value-dependent prepass when safe by reserving maximum bytes.
46+
- Use packed/loop-based slow paths where appropriate.
47+
- Keep exact writer-index commit after block write.
48+
49+
Avoid:
50+
51+
- Double-checking varint widths per field when one max-size reservation can cover the block.
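
As a rough illustration of sections 2 and 3, the sketch below reserves the maximum possible bytes once for a block of values, writes through a local cursor, and commits the writer index a single time. It uses a generic LEB128-style unsigned 32-bit varint; the `Buffer` class and `write_varuint32_block` function are hypothetical names for this example and are not claimed to match any Fory runtime or its exact wire encoding.

```python
MAX_VARUINT32_BYTES = 5  # a 32-bit value needs at most 5 groups of 7 bits


class Buffer:
    """Hypothetical growable byte buffer with an explicit writer index."""

    def __init__(self, capacity=16):
        self.data = bytearray(capacity)
        self.writer_index = 0

    def ensure(self, extra):
        """Grow (at least doubling) so that `extra` more bytes fit."""
        needed = self.writer_index + extra
        if needed > len(self.data):
            new_size = max(needed, 2 * len(self.data))
            self.data.extend(bytes(new_size - len(self.data)))


def write_varuint32_block(buf, values):
    """Write all values with one reservation and one writer-index commit."""
    # One max-size reservation replaces a per-value size prepass.
    buf.ensure(MAX_VARUINT32_BYTES * len(values))
    data = buf.data
    pos = buf.writer_index  # local cursor; the buffer state is untouched until commit
    for v in values:
        if not 0 <= v <= 0xFFFFFFFF:
            raise ValueError("value out of uint32 range")
        while v >= 0x80:
            data[pos] = (v & 0x7F) | 0x80
            pos += 1
            v >>= 7
        data[pos] = v
        pos += 1
    buf.writer_index = pos  # exact commit after the block write
```

The reservation may over-allocate, but the committed index reflects only the bytes actually written, so the serialized size is unchanged.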
52+
53+
## 4) Type Resolver And Metadata Path
54+
55+
Symptoms:
56+
57+
- Heavy cost in compatible type-info lookup, parsing, or temporary wrappers.
58+
59+
Actions:
60+
61+
- Keep canonical type info ownership in resolver/context aligned with reference runtimes.
62+
- Cache by stable protocol keys (for example, headers), not benchmark payload identity.
63+
- Reduce redundant wrappers and duplicated metadata ownership.
64+
65+
Avoid:
66+
67+
- Side caches that leak abstractions to callsites (`push/pop/clear` bookkeeping in user-facing flow).
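
A minimal sketch of caching by a stable protocol key, under the assumption that a header value uniquely identifies the metadata body it precedes. `TypeInfoCache`, `resolve`, and the `parse` callback are hypothetical names for illustration; real resolvers in each runtime own this state differently.

```python
class TypeInfoCache:
    """Hypothetical resolver-owned cache keyed by a protocol header value."""

    def __init__(self, parse):
        self._parse = parse  # assumed callback: (header, body) -> parsed type info
        self._by_header = {}
        self.parse_calls = 0

    def resolve(self, header, body):
        """Return parsed type info, parsing at most once per distinct header."""
        info = self._by_header.get(header)
        if info is None:
            self.parse_calls += 1
            info = self._parse(header, body)
            self._by_header[header] = info
        return info
```

The key comes from the protocol bytes rather than from the identity of the object being serialized, so the hit rate does not depend on a benchmark reusing the same payload instance.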
68+
69+
## 5) Context Reset And Map/Array Maintenance
70+
71+
Symptoms:
72+
73+
- Noticeable time in context reset, map clear, array churn, or cache maintenance.
74+
75+
Actions:
76+
77+
- Use O(1) reset for reusable containers.
78+
- Keep data structures cache-local and simple for hot-path operations.
79+
- Remove dead fields/methods quickly after refactors.
80+
81+
Avoid:
82+
83+
- Over-engineered multi-path caches unless proven necessary and mirrored by reference runtimes.
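
One common way to get an O(1) reset for a reusable container is a generation stamp: instead of clearing every slot, bump a counter and treat slots stamped with an older generation as empty. The sketch below is an assumption-labeled illustration (`GenerationSlots` is a hypothetical name), not a description of how any Fory runtime implements its context reset.

```python
class GenerationSlots:
    """Hypothetical fixed-size slot table with constant-time reset."""

    def __init__(self, size):
        self._values = [None] * size
        self._stamps = [0] * size
        self._generation = 1  # stamps start at 0, so every slot begins empty

    def put(self, index, value):
        self._values[index] = value
        self._stamps[index] = self._generation

    def get(self, index):
        """Return the value only if it was written in the current generation."""
        if self._stamps[index] == self._generation:
            return self._values[index]
        return None

    def reset(self):
        """Invalidate all slots without touching them."""
        self._generation += 1
```

The tradeoff is that stale values stay referenced until their slot is overwritten, and in languages with fixed-width integers the counter needs an overflow strategy.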
84+
85+
## 6) Compatible Schema Read/Write Flow
86+
87+
Symptoms:
88+
89+
- Large compatible-path overhead or regressions after cleanup.
90+
91+
Actions:
92+
93+
- Keep flow aligned with C++/Rust ownership and dispatch model.
94+
- Move expensive matching/validation to type-info parse stage when possible.
95+
- Keep typed scoping of pending compatible metadata to avoid nested decode corruption.
96+
97+
Avoid:
98+
99+
- Untyped global compatible slots.
100+
- Broad helper-shaped replacement paths that bypass established protocol flow.
101+
102+
## 7) Cleanup-Driven Regressions
103+
104+
Symptoms:
105+
106+
- API/abstraction cleanup causes throughput drop.
107+
108+
Actions:
109+
110+
- Keep cleanup only if in benchmark noise band or user explicitly accepts tradeoff.
111+
- Redesign inside the cleaned architecture to recover performance.
112+
113+
Avoid:
114+
115+
- Reverting to banned legacy shapes.
116+
- Preserving cleanup that harms hot paths without follow-up recovery plan.
117+
118+
## 8) Cross-Language Porting
119+
120+
Actions:
121+
122+
- Identify the exact structure in the reference runtime (owner, cache key, lifetime, loop shape).
123+
- Port behavior and data-flow model, not language syntax.
124+
- Verify xlang semantics after porting.
125+
126+
Avoid:
127+
128+
- Language-specific shortcuts that diverge from shared protocol/runtime concepts.
129+
130+
## Keep/Revert Rubric
131+
132+
Keep when:
133+
134+
- Improvement is repeatable and non-trivial.
135+
- Correctness/lint/tests remain green.
136+
- Complexity increase is justified by measured gain.
137+
138+
Revert when:
139+
140+
- Regression is clear or gain is noise.
141+
- Change introduces benchmark-only behavior.
142+
- Change violates explicit user constraints.
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
1+
# Language Command Matrix
2+
3+
Use this as the default verification matrix after performance changes. Run commands from the language directory unless noted.
4+
5+
## Swift
6+
7+
- Build: `swift build`
8+
- Tests: `swift test`
9+
- Lint: `swiftlint lint --config .swiftlint.yml`
10+
- Benchmark: `cd benchmarks/swift && swift build -c release && ./.build/release/swift-benchmark --duration <N>`
11+
- Profile (macOS sample): run benchmark with long duration, then `sample <pid> 10 1 -mayDie -file /tmp/<name>.sample.txt`
12+
13+
## C++
14+
15+
- Build: `bazel build //cpp/...`
16+
- Tests: `bazel test $(bazel query //cpp/...)`
17+
- Perf tests: `bazel test $(bazel query //cpp/fory/serialization/...)`
18+
- Profile: use repository-approved sampling tooling from `CONTRIBUTING.md` and `docs/cpp_debug.md`
19+
20+
## Java
21+
22+
- Build: `mvn -T16 package`
23+
- Tests: `mvn -T16 test`
24+
- Format/style checks as needed: `mvn spotless:check`, `mvn checkstyle:check`
25+
- Profile: JFR or async-profiler on the exact benchmark/test workload
26+
27+
## Python/Cython
28+
29+
- Install: `pip install -v -e .`
30+
- Tests (python mode): `ENABLE_FORY_CYTHON_SERIALIZATION=0 pytest -v -s .`
31+
- Tests (cython mode): `ENABLE_FORY_CYTHON_SERIALIZATION=1 pytest -v -s .`
32+
- Format/lint: `ruff format . && ruff check --fix .`
33+
- Profile: `py-spy`, `cProfile`, and Cython annotations as needed
34+
35+
## Rust
36+
37+
- Build: `cargo build`
38+
- Check: `cargo check`
39+
- Lint: `cargo clippy --all-targets --all-features -- -D warnings`
40+
- Tests: `cargo test --features tests`
41+
- Profile: flamegraph/perf tooling on benchmark or targeted test
42+
43+
## Go
44+
45+
- Build: `go build`
46+
- Tests: `go test -v ./...`
47+
- Format: `go fmt ./...`
48+
- Profile: `pprof` (`go test -bench` + cpu/mem profiles)
49+
50+
## C#
51+
52+
- Build: `dotnet build Fory.sln -c Release --no-restore`
53+
- Tests: `dotnet test Fory.sln -c Release`
54+
- Format check: `dotnet format Fory.sln --verify-no-changes`
55+
- Profile: `dotnet-trace` / `dotnet-counters` on benchmark/test runs
56+
57+
## JavaScript/TypeScript
58+
59+
- Install: `npm install`
60+
- Tests: `node ./node_modules/.bin/jest --ci --reporters=default --reporters=jest-junit`
61+
- Lint: `git ls-files -- '*.ts' | xargs -P 5 node ./node_modules/.bin/eslint`
62+
63+
## Dart
64+
65+
- Generate: `dart run build_runner build`
66+
- Tests: `dart test`
67+
- Analyze/fix: `dart analyze && dart fix --dry-run`
68+
69+
## Kotlin
70+
71+
- Build: `mvn clean package`
72+
- Tests: `mvn test`
73+
74+
## Scala
75+
76+
- Build: `sbt compile`
77+
- Tests: `sbt test`
78+
- Format: `sbt scalafmt`
79+
80+
## Cross-Language Xlang Verification
81+
82+
When changing xlang/runtime semantics, run relevant Java-driven xlang tests from `java/fory-core` with debug output enabled, for impacted languages:
83+
84+
- `CPPXlangTest`
85+
- `CSharpXlangTest`
86+
- `RustXlangTest`
87+
- `GoXlangTest`
88+
- `PythonXlangTest`
89+
- `SwiftXlangTest`
