Digital Event Horizon
This article walks through what the PyTorch profiler shows on the way from a plain linear layer to a fused MLP. It covers how nn.Linear reaches a GEMM kernel, how tensor views and strides explain the cheap ops in the trace, and how fused Triton kernels and tuned kernels change the picture. The next installment will explore the attention block and its performance characteristics using PyTorch profiling.
PyTorch provides profiling to help optimize model performance and efficiency. The article follows linear layers from a matmul-add pair to nn.Linear and on to fused MLPs. nn.Linear calls torch.nn.functional.linear, which in turn calls aten::linear. The matrix multiplication itself runs as a GEMM (General Matrix Multiplication) kernel. Tensor views and strides explain why ops such as aten::t change only metadata and copy no data. The compiled run pays for pre-ops (Dynamo, guards, prologue) before any GEMM runs. Fused Triton kernels compute gelu(gate) * up in one pass, which removes a round-trip of the intermediate through global memory.
PyTorch, a popular open-source machine learning framework, has been at the forefront of deep learning research and development for several years. To help developers optimize their models, it ships various tools, one of which is the profiler: it records which operators and kernels a model runs and how long each one takes. In this article, we use the profiler to follow linear layers from a matmul-add pair to nn.Linear, and then on to fused MLPs.
The journey begins with nn.Linear, a module wrapper around a matrix multiplication followed by an addition. The only difference between nn.Linear and the original matmul-add pair is that nn.Linear owns its weight and bias as parameters and exposes the familiar forward method. In a profiler trace, nn.Linear calls torch.nn.functional.linear, which in turn calls aten::linear.
aten::linear dispatches to aten::addmm(bias, x, weight.t()), which computes out = x @ weight.T + bias. This is where the magic happens: the cuBLAS GEMM kernel that runs on the GPU has a built-in bias-add variant, and that is the one aten::addmm picks. The add never appears as a separate kernel because it is part of the matmul kernel's writeback, which is exactly what an epilogue is.
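The idea of folding the bias add into the matmul's writeback can be sketched in plain Python. This is only an illustration of the arithmetic, not how cuBLAS is implemented; the function names `linear_unfused` and `addmm_with_epilogue` and the example values are hypothetical.

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def linear_unfused(x, weight, bias):
    """Two passes: a full matmul, then a separate pass that adds the bias."""
    weight_t = transpose(weight)
    product = [
        [sum(a * w for a, w in zip(row, col)) for col in zip(*weight_t)]
        for row in x
    ]
    return [[value + b for value, b in zip(row, bias)] for row in product]


def addmm_with_epilogue(bias, x, weight_t):
    """One pass: the bias is added as each accumulator is written out."""
    out = []
    for row in x:
        out_row = []
        for j, col in enumerate(zip(*weight_t)):
            acc = sum(a * w for a, w in zip(row, col))
            out_row.append(acc + bias[j])  # the "epilogue" step
        out.append(out_row)
    return out


# Illustrative values: weight has shape (out_features, in_features),
# the same layout nn.Linear uses.
x = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.5, -0.5, 1.0]

unfused = linear_unfused(x, weight, bias)
fused = addmm_with_epilogue(bias, x, transpose(weight))
print(unfused == fused)
```

Both functions return the same values. The difference is that the second never materializes the bias-free product as a separate result.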
Reading the trace further requires the concept of tensor views and strides. A tensor stores its data as one flat, contiguous run of numbers in memory. Metadata sits on top of that run and tells PyTorch how to walk it: a stride of (s0, s1) means "step s0 elements to move one row, step s1 to move one column". Change the metadata and you get a different view of the same raw data, with no copy.
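A minimal sketch of this mechanism, assuming a 2-D case and using a plain Python list as the flat storage. The `StridedView` class is hypothetical and only mimics the idea; it is not PyTorch's implementation.

```python
class StridedView:
    """A 2-D view over a flat buffer, described only by shape and strides."""

    def __init__(self, storage, shape, strides):
        self.storage = storage  # shared flat buffer, never copied
        self.shape = shape
        self.strides = strides

    def __getitem__(self, index):
        row, col = index
        s0, s1 = self.strides
        # Step s0 elements per row and s1 elements per column.
        return self.storage[row * s0 + col * s1]

    def t(self):
        """Transpose by swapping shape and strides; the storage is reused."""
        return StridedView(self.storage, self.shape[::-1], self.strides[::-1])

    def to_rows(self):
        rows, cols = self.shape
        return [[self[r, c] for c in range(cols)] for r in range(rows)]


storage = [1, 2, 3, 4, 5, 6]
a = StridedView(storage, shape=(2, 3), strides=(3, 1))
a_t = a.t()

print(a.to_rows())
print(a_t.to_rows())
print(a_t.storage is a.storage)
```

The transposed view has shape (3, 2) and strides (1, 3), yet it reads from the very same list. Writing to `storage` is visible through both views.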
Strides and views matter because of what the eager trace contains. The eager CPU dispatch chain has more in it than the compiled one: inside aten::linear, it is aten::t followed by aten::addmm. aten::t is a view op. It swaps the weight's strides to present the transpose and copies no data, which is why it is cheap.
The next concept is the GEMM kernel, which operates on 2-D matrices. The MLP therefore flattens its input from [batch, seq, dim] to [batch * seq, dim] before the matmul. The command-line invocation used 64 for batch and 128 for seq, which is where the 8192 (batch * seq = 64 * 128) in the profiled shapes comes from.
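The arithmetic behind the flattened shape can be checked with a short sketch. The value of `dim` below is a hypothetical placeholder, since the text only gives batch and seq. The helper functions are illustrative and show why, for a contiguous tensor, the flatten needs no data movement: every element keeps the same offset in the flat buffer.

```python
batch, seq = 64, 128
dim = 16  # hypothetical value, chosen only for illustration

rows = batch * seq
flat_shape = (rows, dim)


def offset_3d(b, s, d, seq_len, width):
    """Flat offset of element (b, s, d) in a contiguous [batch, seq, dim] layout."""
    return b * seq_len * width + s * width + d


def offset_2d(r, d, width):
    """Flat offset of element (r, d) in a contiguous [batch * seq, dim] layout."""
    return r * width + d


# Token s of sample b lands in row b * seq + s of the flattened matrix.
b, s, d = 3, 5, 7
row = b * seq + s
same_offset = offset_3d(b, s, d, seq, dim) == offset_2d(row, d, dim)

print(flat_shape, row, same_offset)
```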
torch.compile changes the host side of this picture. It removes the chain of dispatcher ops aten::linear → aten::t → aten::transpose → aten::matmul → aten::reshape → aten::mm, but the GEMM itself stays the same. The proof is that the kernel names are byte-for-byte identical to eager: ...128x128...stages_32x5_tn for gate and up, and ...128x256...stages_64x3_tn for down.
Then comes the fused Triton kernel, which is the headline of the whole compile lesson. The two eager pointwise kernels (GeLU and mul) plus a reshape collapsed into one kernel, triton_poi_fused__unsafe_view_gelu_mul_0. Let's decode the name:
triton: generated by Inductor's Triton backend (not cuBLAS, not ATen).
poi: pointwise (Inductor tags pointwise kernels poi, reductions red, and persistent reductions per).
fused__unsafe_view_gelu_mul: the ops it merged, namely the _unsafe_view (reshape), the GeLU, and the mul.
0: the unique id within the graph.
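The decoding of the kernel name can be written as a small helper. This is a sketch that assumes the naming pattern seen in this single example (backend, kind tag, "fused", merged ops, numeric id). The function `decode_inductor_kernel_name` is hypothetical and not part of PyTorch, and the pattern is not a documented, stable format.

```python
# Kind tags as described in the text: poi, red, per.
KIND_TAGS = {
    "poi": "pointwise",
    "red": "reduction",
    "per": "persistent reduction",
}


def decode_inductor_kernel_name(name):
    """Split a name like triton_poi_fused_<ops>_<id> into its parts."""
    backend, kind, rest = name.split("_", 2)
    if backend != "triton" or kind not in KIND_TAGS:
        raise ValueError(f"unrecognized kernel name: {name}")
    if not rest.startswith("fused_"):
        raise ValueError(f"expected a fused kernel name: {name}")
    # Op names may themselves contain underscores (e.g. _unsafe_view),
    # so the merged ops are kept as one string rather than split further.
    fused_ops, _, kernel_id = rest[len("fused_"):].rpartition("_")
    return {
        "backend": backend,
        "kind": KIND_TAGS[kind],
        "fused_ops": fused_ops,
        "id": int(kernel_id),
    }


decoded = decode_inductor_kernel_name("triton_poi_fused__unsafe_view_gelu_mul_0")
print(decoded)
```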
The last variant is the LigerGEGLUMLP layer, used through the kernels library. This layer takes one set of launch parameters and reuses it for any shape, with no recompilation. It gives up the last few microseconds that per-shape specialization would buy, in exchange for being robust to changing shapes.
This brings us to kernel tuning. When we say "tuned", we mean two concrete things, and both are visible in the trace. The first is that the fusion is baked in. The LigerGEGLUMLP forward is down_proj(LigerGELUMulFunction.apply(gate_proj(x), up_proj(x))). LigerGELUMulFunction runs a single Triton kernel, _geglu_tanh_forward_kernel, that computes gelu(gate) * up in one pass. This is exactly what we saw from torch.compile, where the intermediate never makes a round-trip through HBM. The second is the single, shape-independent set of launch parameters described above.
The two fused variants differ on the host side. The compiled run pays for pre-ops (Dynamo, guards, prologue) before any GEMM runs. The Liger kernel has no pre-ops: the region of the trace where they would appear is empty. On the device side they behave alike. The Triton kernel reads gate and up once, computes gelu(gate) * up, and writes the result once. One whole round-trip of the intermediate through global memory is gone.
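The saved round-trip can be made concrete by counting element reads and writes in a plain Python model of the two approaches. This is a sketch, not the Liger or Inductor kernel: the functions `geglu_unfused` and `geglu_fused` and the traffic counters are hypothetical, and the tanh approximation of GELU is assumed from the kernel's name.

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU (assumed from the kernel name)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def geglu_unfused(gate, up):
    """Two 'kernels': GELU writes an intermediate, then mul reads it back."""
    traffic = {"reads": 0, "writes": 0}
    activated = []
    for g in gate:  # kernel 1: GELU
        traffic["reads"] += 1
        activated.append(gelu_tanh(g))
        traffic["writes"] += 1
    out = []
    for a, u in zip(activated, up):  # kernel 2: elementwise mul
        traffic["reads"] += 2
        out.append(a * u)
        traffic["writes"] += 1
    return out, traffic


def geglu_fused(gate, up):
    """One 'kernel': read gate and up once, write the result once."""
    traffic = {"reads": 0, "writes": 0}
    out = []
    for g, u in zip(gate, up):
        traffic["reads"] += 2
        out.append(gelu_tanh(g) * u)
        traffic["writes"] += 1
    return out, traffic


gate = [-1.0, 0.0, 0.5, 2.0]
up = [2.0, 3.0, -1.0, 0.25]

out_unfused, traffic_unfused = geglu_unfused(gate, up)
out_fused, traffic_fused = geglu_fused(gate, up)
print(traffic_unfused, traffic_fused)
```

For n elements, the unfused version performs 3n reads and 2n writes, while the fused version performs 2n reads and n writes. The difference is exactly one write and one read of the intermediate.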
In conclusion, the profiler lets us follow the path from linear layers to fused MLPs. nn.Linear is a module wrapper around a matrix multiplication and an addition, and the trace shows how that call reaches a single GEMM kernel, with tensor views explaining the metadata-only ops along the way.
Fused Triton kernels take this further. Whether generated by torch.compile or provided by the LigerGEGLUMLP layer through the kernels library, they compute gelu(gate) * up in one pass and remove a round-trip of the intermediate through global memory.
Kernel tuning and fused Triton kernels are both worth exploring further with the profiler, since the trace makes their effects directly visible.
In the next installment of the Profiling in PyTorch series, we will delve into the attention block and explore its performance characteristics using PyTorch profiling.
Related Information:
https://www.digitaleventhorizon.com/articles/The-Evolution-of-Deep-Learning-Profiling-From-Linear-Layers-to-Fused-MLPs-deh.shtml
https://huggingface.co/blog/torch-mlp-fusion
Published: Thu Jun 11 06:50:34 2026 by llama3.2 3B Q4_K_M