- Add API for ahead-of-time compilation and export via {py:func}`compilation.export_kernel() <cuda.tile.compilation.export_kernel>`.
  See the {doc}`Compilation and Export </compilation>` section for more details.
- Add API for autotuning via {py:func}`tune.exhaustive_search() <cuda.tile.tune.exhaustive_search>` and the following helpers:
  - Add API {py:meth}`kernel.replace_hints() <cuda.tile.kernel.replace_hints>` to get a new kernel with updated hints.
  - Add API function {py:func}`compiler_timeout() <cuda.tile.compiler_timeout>` for temporarily setting the timeout on the tileiras compiler.

  See the {ref}`Autotuning <autotuning>` section for more details.
- Add API {py:meth}`Array.tiled_view() <cuda.tile.Array.tiled_view>` to create a tiled view of an array with a fixed tile shape and padding mode.
- Add support for specifying `memory_order` and `memory_scope` on `cuda.tile.load` and `cuda.tile.store` operations.
- Improve `print()` to handle tuples and nested f-strings.
- Fix a bug where simple reduce and scan operations on restricted float dtypes did not raise a proper `TileTypeError`.
- Change kernel ABI convention to omit parameters annotated with `cuda.tile.Constant`.
- Support Ampere and Ada (sm80 family) GPUs.
- Support `pip install cuda-tile[tileiras]` to use `tileiras` from a Python environment without a system-wide CTK installation.
- Add `ct.atan2(y, x)` operation for computing the arctangent of y/x.
- Add optional `rounding_mode` parameter for `ct.tanh()`, supporting `RoundingMode.FULL` and `RoundingMode.APPROX`.
- Compiling FP8 operations for sm80 family GPUs will raise `TileUnsupportedFeatureError`.
- Setting `opt_level=0` on `ct.kernel` is no longer required for `ct.printf()` and `ct.print()`.
- Add `ct.static_iter` keyword that enables compile-time `for` loops.
- Add `ct.static_assert` keyword that can be used to assert that a condition is true at compile time.
- Add `ct.static_eval` keyword that enables compile-time evaluation using the host Python interpreter.
- Add `ct.scan()` for custom scan.
- Add `ct.isnan()`.
- Add `print()` and `ct.print()`, which support Python-style printing and f-strings.
- Add optional `mask` parameter to `ct.gather()` and `ct.scatter()` for custom boolean masking.
- Operator `+` can now be used to concatenate tuples.
- Support unpacking nested tuples (e.g., `a, (b, c) = t`) and using square brackets for unpacking (e.g., `[a, b] = 1, 2`).
- Add bytecode-to-cubin disk cache to avoid recompilation of unchanged kernels.
  Controlled by `CUDA_TILE_CACHE_DIR` and `CUDA_TILE_CACHE_SIZE`.
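
The tuple features listed above follow ordinary Python semantics. The sketch below runs in the plain host interpreter, not inside a cuTile kernel. It only illustrates the language rules being referred to, and the function name `concat_and_unpack` is hypothetical.

```python
# Host-Python illustration of the tuple rules named in the notes above.
# Assumption: kernel code is expected to mirror these rules; this is not kernel code.

def concat_and_unpack():
    left = (1, 2)
    right = (3, (4, 5))
    joined = left + right        # `+` concatenates two tuples
    a, b, c, (d, e) = joined     # nested tuple unpacking
    [x, y] = 1, 2                # square brackets as the unpacking target
    return joined, (a, b, c, d, e), (x, y)


joined, flat, pair = concat_and_unpack()
```

Here `joined` keeps the nested tuple as its last element, while `flat` holds the five values after nested unpacking.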
- Fix a bug where `nan != nan` returns False.
- Fix "potentially undefined variable `$retval`" error when a helper function returns after a `while` loop that contains no early return.
- Fix the missing column indicator in error messages when the underlined text is only one character wide.
- Add a missing check for unpacking a tuple with too many values. For example, `a, b = 1, 2, 3` now raises an error instead of silently discarding the extra value.
- Fix a bug where the promoted dtype of uint16 and uint64 was incorrectly set to uint32.
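
For reference, two of the fixes above bring behavior in line with what the host Python interpreter already does. The following is a minimal sketch in plain Python, not kernel code. The helper names are hypothetical, and the exception type that cuTile itself raises for the unpacking case is not stated in these notes.

```python
# Host-Python reference behavior for the NaN comparison and unpacking checks.

def nan_comparisons():
    nan = float("nan")
    # Under IEEE 754, NaN compares unequal to everything, including itself.
    return (nan != nan, nan == nan)


def unpack_too_many():
    values = (1, 2, 3)
    try:
        a, b = values            # three values, two targets
    except ValueError:
        return True              # host Python rejects the extra value
    return False


not_equal, equal = nan_comparisons()
rejected = unpack_too_many()
```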
- Erase the distinction between scalars and zero-dimensional tiles. They are now completely interchangeable.
- `~x` for const boolean `x` will raise a TypeError to prevent inconsistent results compared to `~x` on a boolean Tile.
- Add `TileUnsupportedFeatureError` to the public API.
- Add support for nested functions and lambdas.
- Add support for custom reduction via `ct.reduce()`.
- Add `Array.slice(axis, start, stop)` to create a view of an array sliced along a single axis. The result shares memory with the original array (no data copy).
- Fix reductions with multiple axes specified in non-increasing order.
- Fix a bug when pattern matching (FusedMultiplyAdd) attempts to remove a value that is used by the new operation.
- Allow assignments with type annotations. Type annotations are ignored.
- Support constructors of built-in numeric types (bool, int, float), e.g., `float('inf')`.
- Lift the ban on recursive helper function calls. Instead, add a limit on recursion depth.
  Add a new exception class `TileRecursionError`, thrown at compile time when the recursion limit is reached during function call inlining.
- Improve error messages for type mismatches in control flow statements.
- Relax type checking rules for variables that are assigned a different type depending on the branch taken: it is now only an error if the variable is used afterwards.
- Stricter rules for potentially-undefined variable detection: if a variable is first assigned inside a `for` loop, and then used after the loop, it is now an error because the loop may take zero iterations, resulting in a use of an undefined variable.
- Include a full cuTile traceback in error messages. Improve formatting of code locations; include function names, remove unnecessary characters to reduce line lengths.
- Delay the loading of CUDA driver until kernel launch.
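
The stricter potentially-undefined variable rule above targets a pattern that also fails in host Python, only later: at runtime rather than at compile time. The sketch below is plain Python with hypothetical function names. It shows the failing pattern and one way to restructure it by assigning a default before the loop.

```python
# Host-Python illustration of a variable first assigned inside a `for` loop.

def last_seen(items):
    for item in items:
        last = item
    return last                  # unbound when `items` is empty


def last_seen_safe(items, default=None):
    last = default               # assigned on every path before the loop
    for item in items:
        last = item
    return last


try:
    last_seen([])
    zero_iteration_failed = False
except UnboundLocalError:
    zero_iteration_failed = True
```

Assigning the variable before the loop, as in `last_seen_safe`, defines it on every path, which is the general way to satisfy this kind of check.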
- Expose the `TileError` base class in the public API.
- Add `ct.abs()` for completeness.
- Fix a bug in hash function that resulted in potential performance regression for kernels with many specializations.
- Fix a bug where an if statement within a loop can trigger an internal compiler error.
- Fix SliceType `__eq__` comparison logic.
- Improve error message for `ct.cat()`.
- Support `is not None` comparison.
Initial release.