
Feat(tests): build test infrastructure#144

Open
chen2021673 wants to merge 7 commits into master from CTest-clean

Conversation


chen2021673 (Contributor) commented Apr 14, 2026

Summary

This PR refactors InfiniTrain’s test infrastructure around CTest and GoogleTest.

It consolidates the old test/ and tests/ layout into a single tests/ directory, introduces shared CMake utilities for test registration, and migrates applicable tests to device-parameterized TEST_P so CPU/CUDA cases can share the same test logic where appropriate.

Closes #120.

Changes

  • merge the old test/ directory into tests/
  • add shared CMake/GTest utilities under tests/common/
  • reduce repeated test registration boilerplate in per-suite CMakeLists.txt
  • migrate applicable tests from fixed-device TEST_F to device-parameterized TEST_P
  • replace hardcoded device selection with shared helpers such as GetDevice()
  • improve label-based selection for CPU/CUDA-related tests
  • refactor registration for all tests
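
For readers unfamiliar with the pattern, a minimal sketch of what a migrated device-parameterized suite could look like. The `GetDevice()` body matches the helper quoted in the review thread; the class and macro names follow the commit messages, but the exact signatures are assumptions, not InfiniTrain's verbatim API:

```cpp
// Sketch only: depends on gtest and the project's headers.
class TensorTestBaseP : public ::testing::TestWithParam<infini_train::Device::DeviceType> {
protected:
    // Resolve the device under test from the suite's parameter.
    Device GetDevice() const { return Device(GetParam(), 0); }
};

TEST_P(TensorTestBaseP, SharedAcrossDevices) {
    auto device = GetDevice();  // kCPU or kCUDA, depending on the instantiation
    // ... build tensors on `device` and run the shared assertions ...
}

// Instantiates the CPU and CUDA parameterizations in one line.
INFINI_TRAIN_REGISTER_TEST(TensorTestBaseP);
```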

How to run

ctest --output-on-failure
ctest -L cpu --output-on-failure
ctest -L cuda --output-on-failure
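
For `ctest -L` selection to work, each suite has to be registered with a LABELS property. A plausible sketch of the shared macro (the macro name and the use of `gtest_discover_tests` come from the commit log; the argument list is an assumption):

```cmake
# Hypothetical shape of the macro in tests/common/test_macros.cmake;
# the real signature in this PR may differ.
include(GoogleTest)

macro(infini_train_add_test target label)
    add_executable(${target} ${ARGN})
    target_link_libraries(${target} PRIVATE gtest gtest_main)
    # Discover individual TEST/TEST_P cases and tag them for `ctest -L`.
    gtest_discover_tests(${target} PROPERTIES LABELS "${label}")
endmacro()

# Usage in a per-suite CMakeLists.txt:
# infini_train_add_test(autograd_test cpu autograd_test.cc)
```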

Impact

This is mainly a test infrastructure refactor. It is not intended to change training/runtime behavior, but it does change how tests are organized and registered.

Result

ctest --output-on-failure -j1 (parallel runs may contend for the device, so run serially first)

[screenshot of the ctest run]

luoyueyuguang and others added 5 commits April 28, 2026 08:28
- Add infini_train_add_test CMake macro for simplified test registration
- Integrate gtest_discover_tests for automatic test case discovery
- Refactor all test directories to use unified macro (autograd, optimizer, hook, slow, lora)
- Reduce test CMakeLists.txt code by 68%
- Add LoRA tests (12 test cases)
- Delete TEST_REPORT.md
- Test labels: cpu/cuda/distributed/slow for flexible test execution
- Add shared test_macros.cmake in tests/common/

BREAKING CHANGE: Test registration now uses macro instead of manual add_test()

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Replace TEST_F with TEST_P across all test suites so each suite runs on
both CPU and CUDA without duplicating test logic. Adds InfiniTrainTestP,
TensorTestBaseP, AutogradTestBaseP, and DistributedInfiniTrainTestP base
classes with automatic CUDA/NCCL skip guards. Introduces
INFINI_TRAIN_REGISTER_TEST* C++ macros and infini_train_add_test_suite
CMake macro to eliminate repetitive INSTANTIATE_TEST_SUITE_P /
infini_train_add_test boilerplate. Removes deprecated test/, slow/, and
split optimizer test files; consolidates optimizer tests into a single
binary with creation + step suites.
- Simplify CMakeLists: single CTest target per suite, remove label splitting
- Migrate old test/ directory into tests/ and delete test/
Comment thread CMakeLists.txt
Comment thread CMakeLists.txt
Comment thread CMakeLists.txt Outdated
- Add docs/test_usage_guide.md with build/run/write instructions
- Rename hook_mechanism.md → hook_mechanism_design.md
- Rename lora_usage.md → lora_usage_guide.md
- Add googletest as submodule in .gitmodules
- Add infini_run tool target in CMakeLists.txt, remove stale comments
Comment thread tests/common/test_utils.h Outdated
Comment thread tests/dtype/CMakeLists.txt
Add IsInitialized() to GlobalEnv and guard SetUpTestSuite so a second
test class in the same process skips re-initialization instead of
hitting CHECK(!initialized_). Also print try_compile output on
compile-fail test to surface header-not-found vs real type errors.
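
The re-initialization guard described above can be sketched with a minimal stand-in (`GlobalEnv` here is a hypothetical simplification; the real class takes parallelism parameters in `Init()`):

```cpp
#include <cassert>

// Hypothetical stand-in for the guarded GlobalEnv described above.
class GlobalEnv {
public:
    static GlobalEnv &Instance() {
        static GlobalEnv env;
        return env;
    }
    // Second and later callers (e.g. another test class's SetUpTestSuite)
    // skip re-initialization instead of failing a CHECK(!initialized_).
    void Init() {
        if (initialized_) {
            return;
        }
        initialized_ = true;
    }
    bool IsInitialized() const { return initialized_; }

private:
    bool initialized_ = false;
};
```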

- All autograd tests need `requires_grad=true`
- All autograd tests need their input tensors filled with data
- Forward/backward propagation tests must have input data to verify results. `AutogradTestBase` builds `FillSequentialTensor` in, so each test doesn't have to call it manually
Collaborator:

Just use an Arrange init directly; adding a requires_grad parameter to the Arrange function should be enough, matching torch:
https://docs.pytorch.org/docs/2.11/generated/torch.arange.html
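
As a concrete reading of the torch.arange suggestion, an arange-style init would subsume the sequential fill. A minimal sketch, with `std::vector` standing in for InfiniTrain's Tensor and `requires_grad` shown only to mirror the suggested signature:

```cpp
#include <cstddef>
#include <vector>

// Sketch of torch.arange-style initialization; std::vector stands in for
// the project's Tensor type, and requires_grad is unused in this stand-in
// (it would be forwarded to the Tensor constructor).
std::vector<float> Arange(std::size_t size, float start = 0.0f,
                          bool requires_grad = false) {
    (void)requires_grad;
    std::vector<float> data(size);
    for (std::size_t i = 0; i < size; ++i) {
        data[i] = start + static_cast<float>(i);  // same loop as FillSequentialTensor
    }
    return data;
}
```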

Comment thread tests/common/test_utils.h
for (size_t i = 0; i < size; ++i) { data[i] = start + static_cast<float>(i); }
}

inline void FillConstantTensor(const std::shared_ptr<Tensor> &tensor, float value) {
Collaborator:

Just call Tensor's Fill function directly; there's no real need to wrap it in this helper.

Contributor Author:

Then I'll just delete AutogradTestBase.

Collaborator:

Put this file under the existing InfiniTrain/cmake directory.

Contributor Author:

This file only serves the test-registration logic under tests/; it isn't general-purpose, so is the top-level directory really the right place for it?

Comment thread tests/common/test_utils.h
}
}
Device GetDevice() const { return Device(GetParam(), 0); }
std::shared_ptr<Tensor> createTensor(const std::vector<int64_t> &shape, DataType dtype = DataType::kFLOAT32,
Collaborator:

Just add a requires_grad parameter to the Tensor constructor; this also aligns with torch:
https://docs.pytorch.org/docs/2.11/generated/torch.tensor.html#torch.tensor

Contributor Author:

done

}

TEST_P(TensorCopyTest, CopiesCPUToCUDA) {
ONLY_CUDA();
Collaborator:

Semantically this test shouldn't have a CPU version at all, yet a CPU version still gets registered. It's skipped, but it still feels a bit off:

  1. when CUDA is not used, this function shouldn't even be compiled;
  2. even with CUDA, the CPU version shouldn't be registered (which also removes the need for TEST_P); the registration function may need changes to express this exception.

Contributor Author:

Right; the skip exists because at compile time we can't see inside the test case. Controlling this at compile time would require #ifdef USE_CUDA + TEST_F/TEST, and we also couldn't use infini_train_add_test_suite; a CUDA-only test registration path would be needed. If ONLY_CUDA/ONLY_CPU cases are confirmed to be a small minority, I'd suggest not going that route and keeping the redundant, late skip logic for the sake of registration clarity. Thoughts?

Comment thread tests/common/test_utils.h
#pragma once

#include <algorithm>
#include <gtest/gtest.h>
Collaborator:

Put the gtest include in the group below the cuda_xxx ones, and use double quotes:
https://gxtctab8no8.feishu.cn/docx/ARFVdldxPo87zHxIXe4c5LMwnNl#share-MwLDdV6xeoeEJqxkBc8cifylnfe

Same for the other files.

Comment thread tests/common/test_utils.h
#else
#include <cuda_runtime_api.h>
#endif
#endif
Collaborator:

There's no need to write it this way; if cuda_runtime_api.h can't be found, the compiler will report an error anyway. When USE_CUDA is set, just include it directly.

Comment thread tests/common/test_utils.h

#define INFINI_TRAIN_REGISTER_TEST(TestName) \
INSTANTIATE_TEST_SUITE_P(CPU, TestName, ::testing::Values(infini_train::Device::DeviceType::kCPU)); \
INSTANTIATE_TEST_SUITE_P(CUDA, TestName, ::testing::ValuesIn(infini_train::test::CudaDeviceTypes()))
Collaborator:

When USE_CUDA is off, no CUDA param case should be registered at all; when USE_CUDA is on but no GPU is present, the case can simply fail outright. There's no need to fool-proof to this degree.
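
A compile-time version of the registration macro along these lines might look like the following sketch (the INFINI_TRAIN_REGISTER_CUDA_ helper name is invented; the CPU/CUDA value expressions are taken from the macro quoted above):

```cpp
// Sketch: only emit the CUDA instantiation when USE_CUDA is defined.
#ifdef USE_CUDA
#define INFINI_TRAIN_REGISTER_CUDA_(TestName)                                            \
    INSTANTIATE_TEST_SUITE_P(CUDA, TestName,                                             \
                             ::testing::ValuesIn(infini_train::test::CudaDeviceTypes()))
#else
#define INFINI_TRAIN_REGISTER_CUDA_(TestName) static_assert(true, "")
#endif

#define INFINI_TRAIN_REGISTER_TEST(TestName)                                             \
    INSTANTIATE_TEST_SUITE_P(CPU, TestName,                                              \
                             ::testing::Values(infini_train::Device::DeviceType::kCPU)); \
    INFINI_TRAIN_REGISTER_CUDA_(TestName)
```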

EXPECT_FALSE(config.add_bias_linear);
EXPECT_FALSE(config.tie_weights);
EXPECT_TRUE(config.UseGQA());
}
Collaborator:

These two cases don't seem necessary; referencing example code from a test is a bit odd. If this kind of validation is worth having, add a sanitize method to these config classes in the example instead, similar to:
https://github.com/NVIDIA/Megatron-LM/blob/8de8238844bb7824d3e245efae89d7c8c4211bc7/megatron/core/transformer/transformer_config.py#L2374

void Init(int threads_per_process, int tensor_parallel_size, bool sequence_parallel_enabled,
int pipeline_parallel_size, int virtual_pipeline_parallel_size);

bool IsInitialized() const;
Collaborator:

This is mainly to prevent gtest's RunAllTests from initializing the global env once per thread when it is started multi-threaded. Instead of using the stock gtest_main, you could write your own main that initializes the global env before calling RunAllTests; then this interface wouldn't be needed here.
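
A sketch of the suggested custom main (the commented GlobalEnv call is a placeholder; the real Init arguments are the parallelism parameters listed above):

```cpp
#include <gtest/gtest.h>

// Custom main replacing gtest_main: initialize the global env exactly once,
// before any test runs, instead of guarding SetUpTestSuite.
int main(int argc, char **argv) {
    ::testing::InitGoogleTest(&argc, argv);
    // infini_train::GlobalEnv::Instance().Init(/* threads, tp, sp, pp, vpp */);
    return RUN_ALL_TESTS();
}
```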

@kilinchange kilinchange changed the title [WIP]Feat(tests): build test infrastructure Feat(tests): build test infrastructure May 8, 2026