Automating GPU Kernel Translation with AI Agents: cuTile Python to cuTile.jl
NVIDIA CUDA Tile (cuTile) is a tile-based programming model that lets developers write GPU kernels in terms of tile-level operations (loads, stores, and matrix multiply-accumulate) rather than manually coordinating threads, warps, and shared memory. cuTile.jl brings the same tile-based approach to the dynamic programming language Julia, so users can write custom GPU kernels without dropping down to NVIDIA CUDA C++. Custom kernels are often essential in Julia's scientific computing ecosystem, which spans differential equations, probabilistic programming, and physics simulations. cuTile Python has a growing library of optimized kernels for GPU acceleration, and translating those kernels to cuTile.jl gives the Julia ecosystem immediate access to battle-tested implementations instead of rewriting each one from scratch.

This post covers cross-DSL (domain-specific language) GPU kernel translation, drawn from porting cuTile Python kernels to cuTile.jl (Julia). It shows how to:

- Translate GPU kernels between cuTile Python and cuTile.jl: Walk through a complete matrix multiplication example side by side.
- Avoid semantic traps that break naive translations: Indexing, broadcasting, memory layout, and loop forms all diverge between the two DSLs, and silent mismatches produce wrong results, not compiler errors.
- Build a repeatable, skill-driven AI workflow: The translation knowledge is packaged into an LLM skill in TileGym that produces validated Julia kernels in a single pass, turning a one-off porting effort into a systematic process.

Cross-DSL GPU kernel translation

The cuTile Python and cuTile.jl frontends share the same tiled abstraction, which makes the translation largely mechanical. However, the cumulative surface-level differences between the two languages are non-trivial, as shown in Table 1.

Table 1. Surface-level differences between cuTile Python and cuTile.jl

| Difference | cuTile Python | cuTile.jl |
| --- | --- | --- |
| Block index base | ct.bid(0), ct.bid(1) (0-based) | ct.bid(1), ct.bid(2) (1-based) |
| Axis numbering | ct.num_tiles(A, axis=1, ...) | ct.num_tiles(A, 2, ...) |
| Element-wise multiply | a * b | a .* b (a * b is a matrix multiply) |
| Loop form | for k in range(num_k) | for k in 1:num_k |
| Kernel declaration | @ct.kernel decorator with ct.Constant[int] parameters | plain typed method with a where {T} clause |
| Memory layout | row-major | column-major |

None of these translations are conceptually difficult, but miss one ct.bid(0) that should become ct.bid(1) and you get silent data corruption. Use * instead of .* for an element-wise multiply, and Julia silently performs a matrix multiply instead (a short plain-Julia demo follows the kernel listings below). These are the kinds of bugs that waste hours. A shared abstraction with a finite set of recurring pitfalls is well suited to an AI-assisted workflow, provided the model is taught what to watch out for.

Translating cuTile Python to cuTile.jl

The process is best understood through actual code. The following examples come from TileGym, where the team ported a set of cuTile Python kernels to cuTile.jl and packaged them as a self-contained Julia subproject.

Matrix multiplication example

The running example is matmul, which is complex enough to surface the key translation challenges. Beyond basic syntax differences, the translation must handle loop structure, TF32 tensor core conversion, and the shift from row-major to column-major memory layout.
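Two of those conventions are worth seeing in plain Julia before the kernels, since nothing about them is cuTile-specific: Julia arrays are 1-indexed and column-major, so the first index varies fastest in memory.

```julia
A = [1 2 3;
     4 5 6]      # a 2×3 Matrix{Int}

A[1, 1]          # 1: indices start at 1, not 0
vec(A)           # [1, 4, 2, 5, 3, 6]: columns are contiguous (column-major)
strides(A)       # (1, 2): the row index moves fastest in memory
```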
cuTile Python:

```python
@ct.kernel
def matmul_kernel(A, B, C,
                  tm: ct.Constant[int], tn: ct.Constant[int], tk: ct.Constant[int]):
    bid_m = ct.bid(0)
    bid_n = ct.bid(1)
    num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
    acc = ct.full((tm, tn), 0, dtype=ct.float32)
    dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype
    for k in range(num_k):
        a = ct.load(A, index=(bid_m, k), shape=(tm, tk),
                    padding_mode=ct.PaddingMode.ZERO)
        b = ct.load(B, index=(k, bid_n), shape=(tk, tn),
                    padding_mode=ct.PaddingMode.ZERO)
        a = a.astype(dtype)
        b = b.astype(dtype)
        acc = ct.mma(a, b, acc)
    acc = ct.astype(acc, C.dtype)
    ct.store(C, index=(bid_m, bid_n), tile=acc)
```

cuTile.jl (Julia):

```julia
function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    bid_m = ct.bid(1)
    bid_n = ct.bid(2)
    num_k = ct.num_tiles(A, 2, (tm, tk))
…
```
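The Julia listing is cut off after computing num_k. The rest of the kernel mirrors the Python loop body one-for-one; the sketch below shows the expected shape, with the TF32 conversion step omitted. The exact cuTile.jl signatures used here for ct.full, ct.load, ct.mma, and ct.store are assumptions inferred from the Python API, not confirmed cuTile.jl code.

```julia
    # Hypothetical continuation of matmul_kernel, mirroring the Python loop body.
    # The call signatures below are assumptions inferred from the Python API.
    acc = ct.full((tm, tn), 0f0)               # Float32 accumulator, as in the Python kernel
    for k in 1:num_k                           # Julia ranges are 1-based and inclusive
        a = ct.load(A, (bid_m, k), (tm, tk))   # assumed positional index/shape arguments
        b = ct.load(B, (k, bid_n), (tk, tn))
        acc = ct.mma(a, b, acc)                # tile matrix multiply-accumulate
    end
    ct.store(C, (bid_m, bid_n), acc)           # assumed positional store signature
    return nothing
end
```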
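The broadcasting trap called out earlier is easy to demonstrate in plain Julia. Both lines below are valid code, which is exactly why the bug is silent:

```julia
a = [1.0 2.0; 3.0 4.0]
b = [10.0 20.0; 30.0 40.0]

a * b    # matrix product:       [70.0 100.0; 150.0 220.0]
a .* b   # element-wise product: [10.0  40.0;  90.0 160.0]
```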
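TileGym validates each translated kernel against a reference result. The harness itself isn't shown in this post; a minimal sketch of the idea, assuming a hypothetical launch_matmul! wrapper that invokes the kernel above, looks like this:

```julia
using CUDA, Test

# Hypothetical host-side check. launch_matmul! stands in for however the
# translated kernel is actually invoked; the comparison logic is the point.
A = CUDA.rand(Float32, 256, 512)
B = CUDA.rand(Float32, 512, 128)
C = CUDA.zeros(Float32, 256, 128)

launch_matmul!(C, A, B)    # assumed wrapper around the cuTile.jl kernel

# Loose tolerance because the kernel accumulates through TF32 tensor cores.
@test Array(C) ≈ Array(A) * Array(B) rtol=1e-3
```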

