# Composable triton-to-linalg #81
Replies: 1 comment
I'm 100% on board with breaking up the `triton-to-linalg` pass into more discrete pieces, and this looks like a good approach. For some history, the original design for triton-shared had the pointer/mask analysis separate from the transformation, but we ran into some technical issues that prevented us from implementing it that way. I believe those issues can be addressed with this proposed design, primarily by utilizing the proposed new structured Triton dialect. We may need to iterate on the details of the structured Triton pointers, but since you're close to having a PR, let's handle that when you have it ready; that way we will have all the code and examples in front of us.

Regarding the proposed plan for delivering this work: I want to make sure we don't end up with two versions of this pass. Delivering this work in stages makes sense, as it's going to be quite extensive, so let's commit to completing this (a full functional replacement) and removing the old version before we call it done. My team can help with reviews and landing some of the code along the way.
## TL;DR

`triton-to-linalg` is an extremely powerful pass that does many things (very well, if I may add 🙂). However, it also has a few hard edges, especially in dealing with unsupported IRs. We propose to redesign the pass into a few composable and robust passes, each of which performs a specific set of transformations. The combined effect of these passes will be a better version of today's `triton-to-linalg`. More importantly, these composable passes will hopefully make it easier to take advantage of the codebase, and make future maintenance and improvement easier.

## Introduction
`triton-to-linalg` is an extremely powerful pass for converting Triton MLIR into standard dialects, making a significant portion of Triton kernels target-able in a performant manner by non-GPU backends. However, it also has a few limitations:

- unsupported cases remain, such as `tt.dot`'s operand C when it is a constant splat;
- smaller conversions, such as turning `tt.func` into `func.func`, are bundled with everything else and cannot be applied separately.
The goal of this proposal is to address these limitations. In the rest of this post, we present a high-level technical proposal, some lower-level details, and a tentative execution plan. Feedback is much appreciated!
## Proposal

At a high level, we propose to tease apart the current `triton-to-linalg` into the following three passes (please bear in mind that the exact details may change as we collaboratively make more progress):

### 1. `triton-to-structured`

- Performs pointer and mask analysis, and materializes the results as TritonStructured operations (`tts.make_tptr`, `tts.load`, `tts.store`).
- Converts Triton load (`tt.load`) and store (`tt.store`) operations into TritonStructured load (`tts.load`) and store (`tts.store`) operations.
- In the future, we plan to extend `triton-to-structured` to convert block pointers into `tts.make_tptr`.
### 2. `triton-arith-to-linalg`

- Converts arithmetic operations on tensors into linalg. This pass will also be used to convert `tt.func` into `func.func`, and to convert program IDs into function parameters; see the sketch below.
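To make the function-level rewrite concrete, here is a minimal before/after sketch. It is illustrative only: the position of the appended program-ID arguments and the exact `tt.get_program_id` syntax (which varies across Triton versions) are assumptions, not the final design.

```mlir
// Before: a Triton function that reads its program ID.
tt.func @kernel(%out: !tt.ptr<i32>) {
  %pid = tt.get_program_id x : i32
  tt.store %out, %pid : !tt.ptr<i32>
  tt.return
}

// After (sketch): tt.func becomes func.func, the program IDs are appended
// as ordinary i32 arguments, and each use of tt.get_program_id is replaced
// by the matching argument.
func.func @kernel(%out: !tt.ptr<i32>, %pid0: i32, %pid1: i32, %pid2: i32) {
  tt.store %out, %pid0 : !tt.ptr<i32>
  return
}
```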
### 3. `structured-to-memref`

- Converts `tt.ptr` into unranked memrefs.
- Converts `tts.make_tptr` into `memref.reinterpret_cast`, and `tts.load` / `tts.store` into `memref.copy` with the corresponding bufferization operations, as sketched below.
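To illustrate the intended mapping, here is a rough sketch of what a structured load of 128 contiguous `f32` values could become. The specific ops follow the mapping above, but the strided layout, the `bufferization.to_tensor` step, and the value names (`%base`, `%off`) are assumptions for illustration, not a committed lowering.

```mlir
// %base: unranked memref lowered from tt.ptr; %off: index-typed offset.
// Reinterpret the base as a 128-element strided view at offset %off.
%view  = memref.reinterpret_cast %base to offset: [%off], sizes: [128], strides: [1]
         : memref<*xf32> to memref<128xf32, strided<[1], offset: ?>>
// Materialize the load as an alloc + copy, then hand the result back to
// the tensor world via bufferization.
%alloc = memref.alloc() : memref<128xf32>
memref.copy %view, %alloc : memref<128xf32, strided<[1], offset: ?>> to memref<128xf32>
%val   = bufferization.to_tensor %alloc restrict writable : memref<128xf32>
```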
## Triton Structured Dialect
The goal of this dialect is to cleanly represent the results of pointer and mask analysis without introducing new concepts such as those from the memref dialect. We propose to introduce three operations in this dialect:
### MakeTensorPtrOp (`tts.make_tptr`)

`tts.make_tptr` represents structured pointer patterns. It creates a statically sized multi-dimensional tensor of Triton pointers (a tensor of `tt.ptr`) from a base pointer of the same type, and supports both dynamic and static strides, offsets, and parent_sizes of index type.

"parent_sizes" is used to represent scenarios in which pointers wrap around (expressed by modulo operations in Triton kernels). A static 0 in parent_sizes indicates no wrap-around behavior along that dimension.
This field is also one of the main differentiators between `tts.make_tptr` and the Triton dialect's MakeTensorPtrOp (`tt.make_tensor_ptr`), which does not support such wrap-around behavior. As a side note, `tt.make_tensor_ptr` with row-major order can be treated similarly to `tts.make_tptr` without wrap-around. E.g., consider the following Triton IR:
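(The snippet is an illustrative sketch: the argument names are assumptions, and op syntax varies across Triton versions.) It builds a 1-D tensor of pointers whose offsets wrap around via a modulo.

```mlir
// %arg0: base pointer, %arg1: starting offset, %arg2: parent size.
// Pointer i addresses element (%arg1 + i) % %arg2 of the base.
%range = tt.make_range {start = 0 : i32, end = 4 : i32} : tensor<4xi32>
%start = tt.splat %arg1 : i32 -> tensor<4xi32>
%idx   = arith.addi %start, %range : tensor<4xi32>
%size  = tt.splat %arg2 : i32 -> tensor<4xi32>
%wrap  = arith.remsi %idx, %size : tensor<4xi32>
%base  = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
%ptrs  = tt.addptr %base, %wrap : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
```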
Can be represented as:
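(The pretty-printed assembly below is an assumption; the op's exact format is still to be settled.) The offset and parent size become dynamic index-typed operands, the stride is a static 1, and the nonzero parent_sizes entry encodes the wrap-around.

```mlir
%off  = arith.index_cast %arg1 : i32 to index
%size = arith.index_cast %arg2 : i32 to index
// 4 pointers, stride 1, dynamic offset %off, wrapping around a parent
// dimension of dynamic size %size.
%ptrs = tts.make_tptr %arg0 to sizes: [4], strides: [1], offsets: [%off], parent_sizes: [%size]
        : !tt.ptr<f32> to tensor<4x!tt.ptr<f32>>
```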
Note that wrap-around takes effect after the offset. E.g., for a 1D tensor of pointers, the addresses it generates are `(offsets[0] + i * strides[0]) % parent_sizes[0]`; with offsets[0] = 2, strides[0] = 1, and parent_sizes[0] = 4, element i = 3 resolves to (2 + 3) % 4 = 1. This is the same as how `PtrState` represents pointers in the current codebase.

The conditions below should also be true:
Note that the conditions above are typically true. However, in cases where they are not, the new `triton-to-structured` pass will keep the original Triton operations intact instead of crashing, so later passes can choose to lower them differently.
### LoadOp (`tts.load`)

`tts.load` represents loading from a structured pointer, with optional sub-dimensions and an optional constant fill ("other") value. It always takes a pointer produced directly by `tts.make_tptr`. It can take optional arguments to indicate sub-dimension memory loads, and another optional scalar value to indicate what data to use to fill the rest of the tensor.

With sub-dimensions present, we always start loading data from index 0 up to the size specified by the operand if present, or load the full tensor if not.
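For instance (assumed assembly syntax; `%ptrs`, `%m`, and `%n` are hypothetical values), a 2-D partial load where only the leading `%m` rows and `%n` columns are read from memory and the remainder is filled with a scalar could look like:

```mlir
// %ptrs: a 32x64 tensor of pointers produced by tts.make_tptr;
// %m, %n: index-typed bounds. Only indices [0, %m) x [0, %n) are loaded
// from memory; the rest of the 32x64 result is filled with 0.0.
%fill = arith.constant 0.000000e+00 : f32
%tile = tts.load %ptrs, [%m, %n], %fill
        : tensor<32x64x!tt.ptr<f32>> -> tensor<32x64xf32>
```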
### StoreOp (`tts.store`)

`tts.store` represents storing to a structured pointer, with optional sub-dimensions. It is a mirror of `tts.load`.

## TritonToStructured Pass Example
We are happy to discuss more details of the pass itself later. For brevity of this post, we provide an example below of a 1D masked load.

Triton IR:
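A simplified sketch of such a kernel fragment (the exact `tt.load` assembly differs across Triton versions, and the 0.0 fill value is an assumption): 128 contiguous loads where only the first `%arg1` lanes are valid.

```mlir
tt.func @masked_load_1d(%arg0: !tt.ptr<f32>, %arg1: i32) {
  // Pointers to 128 contiguous f32 values starting at %arg0.
  %range = tt.make_range {start = 0 : i32, end = 128 : i32} : tensor<128xi32>
  %base  = tt.splat %arg0 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>>
  %ptrs  = tt.addptr %base, %range : tensor<128x!tt.ptr<f32>>, tensor<128xi32>
  // Mask: lane i is valid iff i < %arg1; masked-off lanes read 0.0.
  %bound = tt.splat %arg1 : i32 -> tensor<128xi32>
  %mask  = arith.cmpi slt, %range, %bound : tensor<128xi32>
  %other = arith.constant dense<0.000000e+00> : tensor<128xf32>
  %val   = tt.load %ptrs, %mask, %other : tensor<128x!tt.ptr<f32>>
  tt.return
}
```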
Result of `triton-to-structured`:
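And a sketch of what the pass could produce (assumed assembly): the pointer arithmetic collapses into one `tts.make_tptr`, and the mask `i < %arg1` becomes the dimension bound `min(%arg1, 128)` passed to `tts.load`.

```mlir
%c128  = arith.constant 128 : index
%bound = arith.index_cast %arg1 : i32 to index
%len   = arith.minsi %bound, %c128 : index
%other = arith.constant 0.000000e+00 : f32
// 128 contiguous pointers, stride 1, no wrap-around (parent_sizes = 0).
%ptrs  = tts.make_tptr %arg0 to sizes: [128], strides: [1], offsets: [0], parent_sizes: [0]
         : !tt.ptr<f32> to tensor<128x!tt.ptr<f32>>
// Load the first %len elements; fill the rest with 0.0.
%val   = tts.load %ptrs, [%len], %other : tensor<128x!tt.ptr<f32>> -> tensor<128xf32>
```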
## Tentative Execution Plan
The effort required for this work is non-trivial. Below we discuss our plan for tackling it. Contributions are greatly appreciated!
For stability of the codebase and other ongoing work, we will duplicate some code during this effort.
### 1. TritonStructured dialect and `triton-to-structured`

The initial implementation is code complete. The new pass can now process all LIT tests in the repo, failing gracefully where needed. Please expect a PR soon.
Notably, we have not started working on lowering Triton block pointer operations (`tt.make_tensor_ptr` and `tt.advance`). Help is very much appreciated here.
### 2. `triton-arith-to-linalg`

We plan to take this on after the above work (except for the Triton block pointer items) completes. We expect to finish implementing this pass in Jan or Feb.
### 3. `structured-to-memref`

We have not started this work and are actively looking for collaboration on it. Note that this work is independent of `triton-arith-to-linalg` and can proceed in parallel.