Skip to content

ci: add swap during build, use tpchgen-cli#1443

Merged
timsaucer merged 12 commits intoapache:mainfrom
timsaucer:ci/native-linux-arm
Mar 26, 2026
Merged

ci: add swap during build, use tpchgen-cli#1443
timsaucer merged 12 commits intoapache:mainfrom
timsaucer:ci/native-linux-arm

Conversation

@timsaucer
Copy link
Member

@timsaucer timsaucer commented Mar 25, 2026

Which issue does this PR close?

Related to #1429 but we need to verify if it resolves the issue.

Rationale for this change

We are getting OOM errors during build. This adds a swap to the build stage to prevent them.

What changes are included in this PR?

Add swap during build stage.
Added tpchgen-cli to generate TPC-H data, so also committed the answer files.

Are there any user-facing changes?

No

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking into fixing this and testing on my fork,

adding swap helped
https://github.com/kevinjqliu/datafusion-python/pull/4/files#diff-5c3fa597431eda03ac3339ae6bf7f05e1a50d6fc7333679ec38e21b337cb6721R244-R255

we can add it to give us more buffer

# temporarily comment out to verify it works in the PR
# if: inputs.build_mode == 'release'
env:
CARGO_BUILD_JOBS: 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should help, we can even do CARGO_BUILD_JOBS=1 and also add it to the "debug" mode below on L241

Cargo.toml Outdated
Comment on lines +70 to +73
[profile.release.package.substrait]
opt-level = 1
codegen-units = 16

Copy link
Contributor

@kevinjqliu kevinjqliu Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets see if scoping it to substrait will help here.

I think we might need to bite the bullet and do this:

[profile.release]
lto = "thin"
codegen-units = 4

or override the options using env var just for that one job.

but note that this affects the final artifact

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Mar 25, 2026

Claude on why profile.release.package.substrait might not work:
"""
That's a reasonable idea but it won't help with your specific OOM. The problem is that with lto = true (fat LTO), all crates get merged into a single IR blob inside rustc for the final optimization pass — the per-package opt-level and codegen-units overrides apply during the initial compilation of each crate, but by the time the LTO link step runs, rustc is holding everything together regardless.
"""

The priority order that minimizes artifact impact:

@timsaucer
Copy link
Member Author

Thanks @kevinjqliu ! If the swap is good enough to get us around this OOM then I think that's the best option.

@kevinjqliu
Copy link
Contributor

22 unapproved licences. Check rat report: rat.txt

maybe need to add the new examples/tpch/answers_sf1/* files to the dev/release/rat_exclude_files.txt file

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Looks like adding the swap worked.

@kevinjqliu
Copy link
Contributor

        Error: Custom { kind: NotFound, error: "Could not find `protoc`.
      If `protoc` is installed, try setting the `PROTOC` environment
      variable to the path of the `protoc` binary. To install it on Debian,
      run `apt-get install protobuf-compiler`. It is also available at

probably needs


      - name: Install protoc
        uses: arduino/setup-protoc@v3
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}

@timsaucer
Copy link
Member Author

Super frustrating because these same tests pass locally for me every time

@kevinjqliu
Copy link
Contributor

i was able to repro the issue locally. claude says its a datafusion regression, im running df 52 locally to check

@timsaucer
Copy link
Member Author

timsaucer commented Mar 26, 2026

I take it back. Now I'm getting a failure. I must not have built recently. Investigating.

@timsaucer
Copy link
Member Author

Found it: apache/datafusion#21011 and that tpch code depended on it.

@kevinjqliu
Copy link
Contributor

on Mac M4
Tried with 52.3 on this commit (7f484cd) and it worked.

Looks like its due to apache/datafusion#21011
and

"failed_suppliers", F.array_remove(col("failed_suppliers"), lit(None))

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Mar 26, 2026

heres a potential fix kevinjqliu@1fb80a3

according to claude
"""
Note: the upstream fix also changes the semantics — per Trino/Spark, array_remove(array, NULL) should return NULL (not remove nulls from the array). So even after the upstream fix lands, the q21 example would still need the workaround since it relies on null-removal behavior.
"""

@timsaucer
Copy link
Member Author

looks like claude code and tim code came to the same conclusion

@timsaucer timsaucer changed the title ci: restrict number of jobs during build stage ci: add swap during build, use tpchgen-cli Mar 26, 2026
@timsaucer
Copy link
Member Author

here goes nothing...

@timsaucer timsaucer merged commit e09c93b into apache:main Mar 26, 2026
19 checks passed
@timsaucer timsaucer deleted the ci/native-linux-arm branch March 26, 2026 19:45
@kevinjqliu
Copy link
Contributor

Error: The operation was canceled.

https://github.com/apache/datafusion-python/actions/runs/23614615854/job/68778937649

@timsaucer did you cancel it?

@timsaucer
Copy link
Member Author

Nope, I don't think so. I'm going to let the rest finish and rerun that work flow

@kevinjqliu
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants