ParaRNN: Large-Scale Nonlinear RNNs, Trainable in Parallel
Recurrent Neural Networks (RNNs) are naturally suited to efficient inference, requiring far less memory and compute than attention-based architectures, but the sequential nature of their computation has historically made it impractical to scale RNNs up to billions of parameters. A new advancement from Apple researchers makes RNN training dramatically more efficient, enabling large-scale training for the first time and widening the set of architecture choices available to practitioners designing LLMs, particularly for resource-constrained deployment.

In ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models, a new paper accepted to ICLR 2026 as an Oral, Apple researchers share a new framework for parallelized RNN training that achieves a 665× speedup over the traditional sequential approach (see Figure 1). This efficiency gain enables the training of the first 7-billion-parameter classical RNNs that achieve language modeling performance competitive with transformers (see Figure 2). To accelerate research in efficient sequence modeling and enable researchers and practitioners to explore new nonlinear RNN models at scale, the ParaRNN codebase has been released as an open-source framework for automatic training-parallelization of nonlinear RNNs.

Why Recurrent Models Still Matter

The computational cost of the attention mechanism in a transformer grows quadratically with sequence length, whereas the computation required for a single forward pass through an RNN is the same regardless of how much context came before. This enables constant-time token generation during inference, making RNNs particularly attractive for efficient deployment. But there is a catch: this efficiency advantage only applies at inference time. Unlike transformer training, RNN training cannot be parallelized along the sequence length.
The Training Bottleneck

The very property that makes RNNs efficient at inference, their sequential, recurrent structure, becomes a fundamental bottleneck during training. Unlike the attention mechanism, which can process all tokens in a sequence simultaneously, an RNN must be unrolled step by step, as illustrated in Figure 3.

A Compromise for Parallelizability: Linearity

Modern recurrent architectures have leveraged a clever workaround to enable sequence parallelization: simplifying the recurrence relationship to be purely linear in the hidden state. Selective state space models (SSMs) like Mamba use a recurrence of the form

    h_t = a(x_t) * h_{t-1} + b(x_t),

while classical RNNs include nonlinearities:

    h_t = f(A h_{t-1} + B x_t), with f a nonlinear function such as tanh.

Linearity enables parallelization because linear operations are associative, meaning the order in which you combine them does not affect the final result, just like (a + b) + c = a + (b + c). This mathematical property allows us to use parallel reduction algorithms (also known as parallel scan) to compute the entire sequence of hidden states simultaneously. The intuition is the same as behind the parallel computation of a cumulative sum: rather than sequentially adding new terms (starting with x_1, add x_2, then x_3, ...), one can compute partial results in parallel (add x_1 and x_2 at the same time as x_3 and x_4, and x_5 and x_6, ...) and combine them in a tree-like structure, as illustrated in Figure 4. This approach transforms the SSM application from O(L) sequential steps to O(log L) parallel steps, which makes for a dramatic speedup for long sequences: doubling the sequence length only requires one extra step,…
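The associative trick above can be sketched in a few lines of plain Python. Below is a minimal, illustrative implementation (not the ParaRNN codebase) of a scan over the linear recurrence h_t = a_t * h_{t-1} + b_t: each step is represented as an affine map, two maps are fused with an associative `combine` operator, and a Hillis-Steele-style scan performs the tree combination. All function names here are hypothetical.

```python
# Sketch only: parallel-scan evaluation of the linear recurrence
# h_t = a_t * h_{t-1} + b_t, using the associative composition
# (a1, b1) then (a2, b2)  ->  (a2*a1, a2*b1 + b2).

def combine(p, q):
    """Compose two affine maps h -> a*h + b (apply p first, then q)."""
    a1, b1 = p
    a2, b2 = q
    return (a2 * a1, a2 * b1 + b2)

def sequential_scan(a, b, h0=0.0):
    """Reference: unroll the recurrence step by step (O(L) steps)."""
    hs, h = [], h0
    for at, bt in zip(a, b):
        h = at * h + bt
        hs.append(h)
    return hs

def parallel_scan(a, b, h0=0.0):
    """Inclusive scan over affine maps; O(log L) rounds of pairwise
    combines (simulated serially here, parallel on real hardware)."""
    elems = list(zip(a, b))
    n, d = len(elems), 1
    while d < n:
        # All combines within a round are independent of each other.
        elems = [
            combine(elems[i - d], elems[i]) if i >= d else elems[i]
            for i in range(n)
        ]
        d *= 2
    # elems[t] is now the composed map for steps 1..t; apply it to h0.
    return [at * h0 + bt for at, bt in elems]

a = [0.5, 2.0, -1.0, 0.25]
b = [1.0, -0.5, 3.0, 0.0]
assert all(abs(x - y) < 1e-12
           for x, y in zip(sequential_scan(a, b), parallel_scan(a, b)))
```

The key point is that `combine` is associative, so the maps can be fused in any tree order; a nonlinear step h_t = f(A h_{t-1} + B x_t) admits no such closed-form composition, which is exactly why classical RNNs resist this parallelization.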

