Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

Liangzu Peng¹, Aditya Chattopadhyay², Luca Zancato², Elvis Nunez², Wei Xia², Stefano Soatto²
University of Pennsylvania¹, AWS Agentic AI²
[email protected]
{achatto,zancato,elvisnun,wxia,soattos}@amazon.com
Work done during an internship at AWS Agentic AI. Correspondence to [email protected]

Abstract

As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize on modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention; and (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilities on short-context tasks, outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context tasks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than $10\%$ relative improvement over other fading memory baselines.

1 Introduction

Large Language Models (LLMs) powered by (softmax) Attention mechanisms [Vaswani-NeurIPS2017] have revolutionized sequence modeling through their ability to form rich associations within their context window. However, a fundamental challenge that LLMs face is that their time complexity scales quadratically and storage grows linearly with their input length.

Recent years have seen intense efforts to develop Attention alternatives. Among them, memory layers based on linear State-Space Models (SSMs) have grown popular for their linear-time computation and constant storage cost in the sequence length [Dao-ICML2024-mamba2, Yang-ICML2024-gla]. These SSMs draw inspiration from classic techniques in adaptive signal processing, and integrating such techniques into modern SSMs leads to principled layer design and enhanced performance [Liu-ICLR2025, Yang-NeurIPS2024, Zancato-NeurIPS2024-BMOJO]. However, pure SSM models still underperform Attention in many settings, especially on long-context tasks. This gap is a consequence of their different memory mechanisms: SSMs have a fading, fixed-dimensional, lossy state of the past, while Attention has an eidetic, ever-increasing KV-cache state [Zancato-NeurIPS2024-BMOJO].

To bridge this gap, we aim at designing a memory layer that enjoys the efficiency of linear SSMs while performing computation conditioned on the exact past. Towards this goal, we first draw insights from the Kalman filter (KF) [Kalman-1960]. In signal processing terms, KF computes the most recent state conditioned on all data seen thus far, and, under mild assumptions, KF is optimal in the Maximum A-Posteriori (MAP) sense. In the LLM context, we use KF to update the state of an SSM layer and predict its output based on all past inputs. However, integrating KF into such a layer is non-trivial and faces two challenges:

• Parallelizable Training. KF is an online algorithm and needs to be parallelized to fully utilize modern hardware that is highly optimized for large-scale LLM training.

• Numerical Stability. KF involves matrix inversion, which can be numerically unstable in low-precision arithmetic.

In this work, we propose Gated KalmaNet (GKA), a memory layer that incorporates KF into its design and is both numerically stable and trainable on highly parallelizable hardware. We start by observing that the KF recursion solves a test-time ridge regression problem. Then, to solve such a regularized problem stably, we make the following choices:

• At the modeling level, we adaptively choose the regularization strength of our test-time objective function based on the Frobenius norm of the regularized data covariance. With this choice we can easily upper bound the condition number of the optimization problem.

• At the algorithmic level, we note that exact solvers (e.g., torch.linalg.solve) are hard to parallelize (in a chunk-wise manner), so we resort to the classic Chebyshev Iteration (CH), which we show has high numerical accuracy and fast convergence compared with alternatives such as (accelerated) gradient descent and conjugate gradient.

To make GKA scalable and efficient, we implement CH with adaptive regularization in Triton in a hardware-aware, chunk-wise manner. Our technical novelty here includes deriving a chunk-wise implementation that back-propagates through the Frobenius norm, for which the difficulty is the presence of a nested recurrence. Furthermore, we combine CH with a gating mechanism that decides the regression residual weights in an input-aware and time-varying fashion, enhancing the contribution of recent inputs and smoothly fading out distant contexts. Overall, to the best of our knowledge, this is a first adoption of the CH method for training sequence modeling layers in LLMs stably at scale.

Finally, we demonstrate the efficacy of GKA on numerous LLM benchmarks. For example, on synthetic recall tasks (MQAR) [arora2023zoology], our method achieves the highest recall accuracy among state-of-the-art linear SSMs, including Mamba2 [Dao-ICML2024-mamba2] and (Gated) DeltaNet [Yang-NeurIPS2024, Yang-ICLR2025]. Also, GKA outperforms existing SSMs on several short-context tasks (from LM-Harness [eval-harness]) and long-context tasks (from RULER [hsieh2024ruler] and HELMET [yen2025helmet]). Specifically, GKA improves upon SSM baselines by at least 10% on real-world long-context tasks such as Retrieval-Augmented Generation and Long Question-Answering, up to 128k tokens.

2 Prior Work and Preliminaries

In this section we briefly review prior work and preliminaries that set the stage for our choice of designing an SSM layer based on the Kalman Filter. For a more detailed exposition of related work, refer to Appendix A.

(Softmax) Attention. At each time $t$ , Attention [Vaswani-NeurIPS2017] linearly projects the $t$ -th input token to obtain three vectors, named query $q_{t}$ , key $k_{t}$ , value $v_{t}$ respectively. Then, it outputs a vector $y_{t}\in\mathbb{R}^{D}$ as a convex combination of all values seen so far, with coefficients $c_{1},\dots,c_{t}$ given by inner products of the current query $q_{t}$ with all seen keys and a softmax mapping:

\displaystyle y_{t}=\sum_{i=1}^{t}c_{i}v_{i},\quad\quad c_{i}:=\frac{\exp(\frac{k_{i}^{\top}q_{t}}{\sqrt{D}})}{\sum_{j=1}^{t}\exp(\frac{k_{j}^{\top}q_{t}}{\sqrt{D}})}.

(Attn)

From an optimization perspective, Eq. (Attn) can be viewed as solving the following regression objective (concretely, for keys and queries of unit norm, Attention is precisely the Nadaraya-Watson estimator [Nadaraya-1964, Watson-1964] with the Gaussian kernel to approximate the conditional expectation of the value given a query; cf. [Chaudhari-TIST2021, Vidal-2022]):

\displaystyle y_{t}=\mathop{\rm argmin}_{v}\sum_{i=1}^{t}\exp\left(\frac{k_{i}^{\top}q_{t}}{\sqrt{D}}\right)\cdot\|v-v_{i}\|_{2}^{2}.

(1)
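The equivalence between Eq. (Attn) and Eq. (1) can be checked numerically: the minimizer of a weighted sum of squared distances is the weighted mean of the values, so the gradient of the objective in Eq. (1) should vanish at the Attention output. The following is a minimal NumPy sketch under our own assumptions: keys and values are stored as rows of `K` and `V`, and the helper name `attention_output` is hypothetical.

```python
import numpy as np

def attention_output(q_t, K, V):
    """Softmax attention at one time step (K, V hold one key/value per row)."""
    D = q_t.shape[0]
    logits = K @ q_t / np.sqrt(D)             # k_i^T q_t / sqrt(D)
    # Shifting by the max rescales all weights by one constant,
    # which does not change the argmin of the weighted objective.
    weights = np.exp(logits - logits.max())
    c = weights / weights.sum()                # convex-combination coefficients
    return c @ V, weights

rng = np.random.default_rng(0)
t, D = 6, 4
K = rng.standard_normal((t, D))
V = rng.standard_normal((t, D))
q_t = rng.standard_normal(D)

y_t, weights = attention_output(q_t, K, V)

# Gradient of sum_i w_i * ||v - v_i||^2 with respect to v, evaluated at v = y_t.
grad_at_y = 2.0 * (weights[:, None] * (y_t[None, :] - V)).sum(axis=0)
print(np.abs(grad_at_y).max())
```

The printed value should be at the level of floating-point round-off, confirming that the Attention output is a stationary point (and, by convexity, the minimizer) of the weighted regression objective.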

The success of Eq. (Attn) is often attributed to its ability to perform verbatim retrieval of relevant context from the entire past. Here, the past refers to the entire key-value pairs observed thus far, also known as the KV-cache, which grows linearly with time $t$. Moreover, the computation is also linear at each time $t$, and doing so for all $t$ results in a quadratic time complexity. This high computation and storage cost of Attention makes its use prohibitive in long-context scenarios.

Linear State-Space Models (SSMs). The high computation cost of Eq. (Attn) has motivated a flurry of work developing new LLM layers, like SSMs, with linear rather than quadratic cost. Most SSMs maintain a state matrix $S_{t}\in\mathbb{R}^{D\times D}$ and update it at each time step via a linear recursion of the form

\displaystyle S_{t}=\gamma_{t}\cdot S_{t-1}+\beta_{t}\cdot v_{t}k_{t}^{\top},\quad y_{t}=S_{t}q_{t},

(Linear-SSM)

where $\gamma_{t},\beta_{t}$ are typically in $[0,1]$. Unlike the verbatim lookup of Eq. (Attn), Eq. (Linear-SSM) essentially compresses the entire KV-cache into a fixed-dimensional representation $S_{t}$. Subsequent computation of the output $y_{t}$ relies on $S_{t}$ and no longer on the exact past. This results in a constant cost of storage and computation at every timestep.
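To make the recursion concrete, here is a minimal sequential sketch of Eq. (Linear-SSM) in NumPy. The function name `linear_ssm` and the row-wise layout of `Q`, `K`, `V` are our own assumptions; practical implementations are chunk-wise and fused, whereas this loop only illustrates the constant-size state.

```python
import numpy as np

def linear_ssm(Q, K, V, gamma, beta):
    """Sequential reference: S_t = gamma_t S_{t-1} + beta_t v_t k_t^T, y_t = S_t q_t."""
    T, D = Q.shape
    S = np.zeros((D, D))          # fixed-size state, independent of T
    Y = np.zeros((T, D))
    for t in range(T):
        S = gamma[t] * S + beta[t] * np.outer(V[t], K[t])
        Y[t] = S @ Q[t]
    return Y

rng = np.random.default_rng(0)
T, D = 8, 4
Q = rng.standard_normal((T, D))
K = rng.standard_normal((T, D))
V = rng.standard_normal((T, D))

# With gamma_t = beta_t = 1 the recursion reduces to causal (unnormalized)
# linear attention: y_t = sum_{i <= t} (k_i^T q_t) v_i.
Y_rec = linear_ssm(Q, K, V, np.ones(T), np.ones(T))
Y_ref = np.tril(Q @ K.T) @ V
print(np.abs(Y_rec - Y_ref).max())
```

The state `S` has shape `(D, D)` regardless of `T`, which is the constant-memory property discussed above; setting `gamma[t] < 1` makes older key-value pairs fade.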

In many linear SSMs (e.g., RetNet [Sun-arXiv2023RetNet], Mamba2 [Dao-ICML2024-mamba2]), the use of $\gamma_{t}$ and $\beta_{t}$ is often heuristic and draws inspiration from nonlinear recurrent neural networks [Hochreiter-NC1997]; in that light, $\gamma_{t}$ and $\beta_{t}$ are called forgetting and input gates, respectively. This basic form of Eq. (Linear-SSM) has been generalized by replacing $\gamma_{t}$ with a diagonal matrix (GLA [Yang-ICML2024-gla], RWKV-6 [Peng-CoLM2024], Longhorn [Liu-ICLR2025]) or a low-rank-plus-identity matrix (Gated DeltaNet [schlag2021linear, Yang-NeurIPS2024, Yang-ICLR2025], DeltaProduct [Siems-arXiv2025], RWKV-7 [Peng-arXiv2025rwkv]).

As with Eq. (Attn), the case of low-rank-plus-identity matrices can often be justified from an optimization perspective. For example, Gated DeltaNet [schlag2021linear, Yang-NeurIPS2024] updates the state via ($I$ is the $D\times D$ identity matrix)

\displaystyle S_{t}=\gamma_{t}\cdot S_{t-1}\left(I-\beta_{t}k_{t}k_{t}^{\top}\right)+\beta_{t}\cdot v_{t}k_{t}^{\top},

(GDN)

which can be viewed as applying one gradient descent step with stepsize $\beta_{t}$ and initialization $\gamma_{t}S_{t-1}$ to the objective

\displaystyle \min_{S}\|Sk_{t}-v_{t}\|_{2}^{2}.

(2)
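The gradient-descent reading of Eq. (GDN) is easy to verify numerically. The sketch below is illustrative only and makes one assumption explicit: we scale the objective in Eq. (2) by $\frac{1}{2}$, so that its gradient is $(Sk_{t}-v_{t})k_{t}^{\top}$ and the stepsize is exactly $\beta_{t}$ (without the factor, the stepsize would be $\beta_{t}/2$).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
S_prev = rng.standard_normal((D, D))
k = rng.standard_normal(D)
k /= np.linalg.norm(k)            # unit-norm key (a common choice, assumed here)
v = rng.standard_normal(D)
gamma, beta = 0.9, 0.5

# Eq. (GDN): low-rank-plus-identity state update.
S_gdn = gamma * S_prev @ (np.eye(D) - beta * np.outer(k, k)) + beta * np.outer(v, k)

# One gradient step on 0.5 * ||S k - v||^2, initialized at gamma * S_prev.
S_init = gamma * S_prev
grad = np.outer(S_init @ k - v, k)
S_gd = S_init - beta * grad

print(np.abs(S_gdn - S_gd).max())
```

Both updates agree up to round-off, which makes the "myopic" nature of the objective visible: only the current pair $(k_{t},v_{t})$ and the previous state enter the step.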

The objective Eq. (2) of Eq. (GDN) and the objective Eq. (1) of Eq. (Attn) are prime examples that expose a general distinction between linear SSMs and Eq. (Attn): The former updates its state based on a regression objective that considers only the previous lossy state and the current time step, whereas the latter uses the entire, exact KV-cache to solve its regression objective Eq. (1).

We hypothesize that this myopic view of SSM objectives results in their lower performance and limited long-context abilities. We then ask: What is an objective or, equivalently, a recursion that considers the entire past as Eq. (Attn) does, while still being solvable in linear time as in Eq. (Linear-SSM)?

3 A Linear SSM Inspired by the Kalman Filter

In Section 3.1 we show how the Kalman Filter (KF) gives insights into a new linear SSM layer that takes all past time instants into account. In Section 3.2 we explain the numerical and efficiency challenges of building such a layer.

3.1 Motivation from Kalman Filter

KF is an established online approach that takes the exact past into account to optimally solve a weighted ridge regression objective (e.g., see [Peng-MoCL2025, Proposition 2 & Lemma 3]). In our context, this means that the optimal state

\displaystyle S_{t}=\mathop{\rm argmin}_{S\in\mathbb{R}^{D\times D}}\lambda\cdot\|S\|_{\text{F}}^{2}+\sum_{i=1}^{t}\eta_{i}\cdot\|Sk_{i}-v_{i}\|_{2}^{2}

(3)

can be computed by the KF recursion

\displaystyle S_{t}=S_{t-1}-\frac{(S_{t-1}k_{t}-v_{t})k_{t}^{\top}\Phi_{t-1}}{1/\eta_{t}+k_{t}^{\top}\Phi_{t-1}k_{t}},

(KF)

where $\eta_{t}$ is the weight for the $t$-th key-value pair, and $\Phi_{t-1}$ is the Hessian inverse of Eq. (3) at time $t-1$ ($\Phi_{t-1}$ itself can be continually updated via the Woodbury matrix identity). It is now clear that objective Eq. (3) takes the entire KV-cache into account, similarly to Eq. (Attn). It is also clear that Eq. (KF) is an efficient update scheme similarly to Eq. (Linear-SSM); indeed, Eq. (KF) is also of low-rank-plus-identity form (cf. Eq. (GDN)).
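As a sanity check on the claim that Eq. (KF) solves Eq. (3), the following NumPy sketch runs the recursion and compares it with the closed-form ridge solution. It rests on assumptions of ours rather than on the paper's code: the initialization $S_{0}=0$, $\Phi_{0}=I/\lambda$, and $\Phi_{t}$ taken as the inverse of $\lambda I+\sum_{i\leq t}\eta_{i}k_{i}k_{i}^{\top}$ (the Hessian up to a constant factor), updated with the rank-one (Sherman-Morrison) case of the Woodbury identity. The helper name `kf_step` is hypothetical.

```python
import numpy as np

def kf_step(S, Phi, k, v, eta):
    """One step of Eq. (KF) plus a rank-one update of Phi."""
    Phi_k = Phi @ k                      # Phi is symmetric, so k^T Phi = (Phi k)^T
    denom = 1.0 / eta + k @ Phi_k
    S_new = S - np.outer(S @ k - v, Phi_k) / denom
    Phi_new = Phi - np.outer(Phi_k, Phi_k) / denom   # Sherman-Morrison
    return S_new, Phi_new

rng = np.random.default_rng(0)
T, D, lam = 12, 4, 0.5
K = rng.standard_normal((T, D))          # one key per row
V = rng.standard_normal((T, D))          # one value per row
eta = rng.uniform(0.5, 1.5, size=T)      # fixed per-token weights

S = np.zeros((D, D))
Phi = np.eye(D) / lam
for t in range(T):
    S, Phi = kf_step(S, Phi, K[t], V[t], eta[t])

# Closed-form minimizer of Eq. (3): S = U (H + lam I)^{-1}.
U = (eta[:, None] * V).T @ K             # sum_i eta_i v_i k_i^T
H = (eta[:, None] * K).T @ K             # sum_i eta_i k_i k_i^T
S_closed = U @ np.linalg.inv(H + lam * np.eye(D))

print(np.abs(S - S_closed).max())
```

In double precision the two agree closely; the concern raised in the next subsection is what happens to this kind of inverse update over long sequences in low precision.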

A key difference from Eq. (Linear-SSM) is that Eq. (KF) leverages second-order information from $\Phi_{t-1}$ to solve Eq. (3) optimally, whereas Eq. (Linear-SSM) relies on instantaneous objectives akin to Eq. (2) (cf. [Yang-NeurIPS2024, Table 2]). It is in this sense that we say Eq. (KF) is more expressive than Eq. (Linear-SSM) or Eq. (GDN). We now detail the differences in the objectives of Eq. (KF) and Eq. (Attn):

• Eq. (KF) computes a parametric linear estimator that enables a constant-sized memory, while Eq. (1) computes a non-parametric point estimate that entails storing the full cache.

• In Eq. (1), the weights of the same residual vary over time as the queries differ at each time, while in Eq. (3) the $i$-th weight $\eta_{i}$ is constant once observed at time $i$. The former results in quadratically many weights (and thus a quadratic time complexity), the latter in linearly many.

• In Eq. (3), the regularizer $\lambda\cdot\|S\|_{\text{F}}^{2}$ prevents overfitting our state to key-value pairs, as only a finite amount of “information” can be stored in a constant-sized memory, beyond which recall becomes “fuzzy”. In this light, $\lambda$ can be thought of as controlling the memorization “capacity” of the state.

3.2 Hurdles Towards Scalable Kalman Filter SSMs

Despite its optimality and (sequential) computational efficiency, the Eq. (KF) recursion lacks a hardware-aware implementation that leverages parallelism in modern Tensor Cores. Moreover, for long sequences it can lose numerical precision due to division (and more significantly due to how the Hessian inverse $\Phi_{t}$ is updated). The final hurdle is conceptual: Fixing weights $\eta_{i}$ and regularization $\lambda$ over time as in Eq. (3) might make a layer less expressive.

We are aware of the use of Eq. (KF) or Eq. (3) in neural network training three decades ago [Shah-1992] and, more recently, in deep continual learning [Zeng-NMI2019, Mcdonnell-NeurIPS2023, Peng-ICLR2025]. We are also aware of recent work that mentions Eq. (3) or attempts to solve it under the name of test-time optimization [Wang-arXiv2025v3, Von-arXiv2025-mesa]. However, to the best of our knowledge, none of the prior work has fully addressed the above hurdles, which need to be solved to design an SSM layer that is trainable in parallel, numerically well-behaved, and sufficiently expressive. In particular, both [Von-arXiv2025-mesa] and [Wang-arXiv2025v3] have overlooked a basic numerical concern: The worst-case numerical error in solving Eq. (3) can be $\epsilon\cdot\kappa$ [Golub-2013], where $\kappa$ is the condition number of the Hessian in Eq. (3) and $\epsilon$ the machine precision; since $\epsilon\approx 0.007$ (bf16), Eq. (3) has to be regularized strongly for $\kappa$ and the worst-case error to be small, regardless of the algorithm used to solve Eq. (3). Indeed, the regularization enforced in [Von-arXiv2025-mesa] lower bounds $\lambda$ by $0.25$, but this is not sufficient: Their $\kappa$ is as large as $500$ [Von-arXiv2025-mesa, Fig. 13], implying a worst-case error of $0.007\times 500=3.5$. (The implementation of [Von-arXiv2025-mesa] available on GitHub is numerically vulnerable; we failed to train it without NaNs in various settings.) Also, the regression objective in [Wang-arXiv2025v3] has no regularization, which makes it numerically ill-posed for low-precision training.

4 Gated KalmaNet (GKA)

We propose Gated KalmaNet (GKA) to address the above hurdles: We enhance numerical stability via adaptive regularization and the classic Chebyshev Iteration (CH), increase expressivity of KF via a standard gating mechanism, and improve parallelism via a hardware-friendly implementation.

4.1 CH with Adaptive Regularization & Weighting

Motivation. As alluded to earlier, solving Eq. (3) via Eq. (KF) is sequential in nature, and here we consider alternatives amenable to parallelizable training. Our first step towards this is to write down a closed-form solution to Eq. (3) and compute the output

\displaystyle y_{t}=S_{t}q_{t}=\left(\sum_{i=1}^{t}\eta_{i}v_{i}k_{i}^{\top}\right)\left(\sum_{i=1}^{t}\eta_{i}k_{i}k_{i}^{\top}+\lambda I\right)^{-1}q_{t}.

With the weighted covariances $U_{t}:=\sum_{i=1}^{t}\eta_{i}v_{i}k_{i}^{\top}$ and $H_{t}:=\sum_{i=1}^{t}\eta_{i}k_{i}k_{i}^{\top}$, we note that $y_{t}$ can be computed by first solving $(H_{t}+\lambda I)x=q_{t}$ for $x$ and then left-multiplying by $U_{t}$. An exact solver (e.g., torch.linalg.solve) can do so with high accuracy, by parallelizing over the batch dimension. However, it is inefficient here for two reasons. First, it takes $O(D^{3})$ time for every $t$. Second, it requires explicitly forming and materializing all $H_{t}$’s, which would entail a large I/O cost. In light of this, we resort to first-order iterative methods that admit a chunk-wise implementation without materializing all $H_{t}$’s, enabling parallelism over chunks and batches. Furthermore, they often take $O(D^{2})$ time per iteration and can converge quickly in a few iterations. The iterative method we choose is the Chebyshev Iteration (CH); we proceed to describe its basic idea, with a justification of using CH deferred to Section 4.2.3.

Chebyshev Iteration (CH). CH can be seen as an accelerated gradient descent method (AGD) that applies the (grad descent) and (momentum) steps to the strongly convex objective $\frac{1}{2}\xi^{\top}H\xi-\xi^{\top}q$, that is, it solves the optimality condition $H\xi=q$ (Algorithm 1). Different from AGD, CH incorporates a (weight schedule) step and makes specific choices of the parameters; these choices make CH optimal, with the fastest convergence among first-order methods [Pedregosa-Chebyshev2021].

We now replace the above exact solver with CH:

\displaystyle\hat{x}_{t}=\text{CH}(H_{t}+\lambda I,q_{t},r),\quad\quad y_{t}=U_{t}\hat{x}_{t}.

Here, $\text{CH}(H_{t}+\lambda I,q_{t},r)$ means $r$ iterations of CH to approximately solve $(H_{t}+\lambda I)x=q_{t}$. To improve stability and expressivity, we next allow the regularization $\lambda$ and weights $\eta_{i}$ to be time-varying and chosen adaptively. We write $\lambda_{t}$ and $\eta_{t,i}$ to make their dependency on time $t$ explicit, with $\eta_{t,i}$ being the weight of the $i$-th token at time $t$.

1 Input: $H\in\mathbb{R}^{D\times D}$, $q\in\mathbb{R}^{D}$, eigenvalue bounds $L,\mu$ with $L\geq\mu>0$, number of iterations $r$;
2 Initialize: $\rho\leftarrow\frac{L-\mu}{L+\mu}$; $\omega_{0}=0$; the first two iterates $\xi_{-1}\leftarrow 0$, $\xi_{0}\leftarrow\frac{2q}{L+\mu}$;
3 For Loop ($i=1,\dots,r$):
4   $\omega_{i}\leftarrow\frac{4}{4-\rho^{2}\omega_{i-1}}$	(weight schedule)
    $\xi_{i}\leftarrow\xi_{i-1}-\frac{2\cdot\omega_{i}}{L+\mu}(H\xi_{i-1}-q)$	(grad descent)
    $\xi_{i}\leftarrow\xi_{i}+(\omega_{i}-1)(\xi_{i-1}-\xi_{i-2})$	(momentum)
5 Output: $\xi_{r}$

Algorithm 1 Chebyshev Iteration to solve $H\xi=q$
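A direct NumPy transcription of Algorithm 1 may help in reading the three update steps. This is a sketch rather than the paper's Triton kernel: the function name `chebyshev_iteration` is ours, the initialization (including $\omega_{0}=0$) follows the listing as printed, and the test matrix with eigenvalues in $[\mu,L]=[1,3]$ is an arbitrary choice for illustration.

```python
import numpy as np

def chebyshev_iteration(H, q, L, mu, r):
    """r steps of Algorithm 1 for H xi = q, with eigenvalues of H in [mu, L]."""
    rho = (L - mu) / (L + mu)
    omega = 0.0                          # omega_0, as printed in Algorithm 1
    xi_prev = np.zeros_like(q)           # xi_{-1}
    xi = 2.0 * q / (L + mu)              # xi_0
    for _ in range(r):
        omega = 4.0 / (4.0 - rho**2 * omega)                      # weight schedule
        xi_new = xi - 2.0 * omega / (L + mu) * (H @ xi - q)      # grad descent
        xi_new = xi_new + (omega - 1.0) * (xi - xi_prev)         # momentum
        xi_prev, xi = xi, xi_new
    return xi

rng = np.random.default_rng(0)
D = 8
# Symmetric test matrix with prescribed eigenvalues in [1, 3].
Q_orth, _ = np.linalg.qr(rng.standard_normal((D, D)))
H_demo = Q_orth @ np.diag(np.linspace(1.0, 3.0, D)) @ Q_orth.T
q_demo = rng.standard_normal(D)

xi_demo = chebyshev_iteration(H_demo, q_demo, L=3.0, mu=1.0, r=60)
rel_residual = np.linalg.norm(H_demo @ xi_demo - q_demo) / np.linalg.norm(q_demo)
print(rel_residual)
```

Each iteration costs one matrix-vector product with $H$, which is the $O(D^{2})$ per-iteration cost mentioned earlier; the only problem-dependent inputs are the eigenvalue bounds $L$ and $\mu$.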

Adaptive Regularization. As mentioned, the condition number $\kappa_{t}$ of $H_{t}+\lambda_{t}I$ has to be controlled for any method to be numerically stable. We choose $\lambda_{t}$ to be proportional to the Frobenius norm $\|H_{t}\|_{\text{F}}$, that is, we set $\lambda_{t}=a\cdot\|H_{t}\|_{\text{F}}$ for some constant $a>0$. Since $H_{t}$ is positive semidefinite, so that $\lambda_{\text{min}}(H_{t})\geq 0$ and $\lambda_{\text{max}}(H_{t})\leq\|H_{t}\|_{\text{F}}$, this choice yields an upper bound on $\kappa_{t}$:

\displaystyle \kappa_{t}=\frac{\lambda_{\text{max}}(H_{t})+\lambda_{t}}{\lambda_{\text{min}}(H_{t})+\lambda_{t}}\leq\frac{\|H_{t}\|_{\text{F}}+\lambda_{t}}{\lambda_{t}}=\frac{a+1}{a}.

(4)

Here $\lambda_{\text{max}}(H_{t}),\lambda_{\text{min}}(H_{t})$ are the maximum and minimum eigenvalues of $H_{t}$, respectively. Given this choice of $\lambda_{t}$, we set $L=\|H_{t}\|_{\text{F}}+\lambda_{t}$ and $\mu=\lambda_{t}$ for Algorithm 1.
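The bound in Eq. (4) and the resulting eigenvalue bounds $L,\mu$ can be checked on a random positive semidefinite matrix. The snippet below is a small illustration under our own choices ($a=0.5$, unweighted keys); it is not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, a = 32, 8, 0.5
K = rng.standard_normal((T, D))
H = K.T @ K                              # PSD covariance sum_i k_i k_i^T

fro = np.linalg.norm(H, "fro")
lam = a * fro                            # adaptive regularization lambda_t = a ||H_t||_F

eigs = np.linalg.eigvalsh(H)             # ascending eigenvalues of H
kappa = (eigs[-1] + lam) / (eigs[0] + lam)
kappa_bound = (a + 1.0) / a              # Eq. (4)

# Eigenvalue bounds passed to the Chebyshev Iteration.
L, mu = fro + lam, lam
print(kappa, kappa_bound)
```

Note that the bound depends only on $a$, not on the data: with $a=0.5$ the condition number never exceeds $3$, however ill-conditioned $H_{t}$ itself is.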

Adaptive Weighting (Gating). We use weights $\eta_{t,i}$ that are exponentially decaying in time: For all $t\geq i$, we parameterize $\eta_{t,i}=\prod_{j=i+1}^{t}\gamma_{j}$, with each $\gamma_{j}\in[0,1]$ learnable. The fading weights encode the “prior” of recency bias that has been shown to exist in LLMs [fang2025large], without explicitly computing the weights from the query-key dot products as in Eq. (Attn). Similarly to Eq. (Attn), the weights on the residuals are now time-varying; unlike Eq. (Attn), the exponential-decay parameterization allows for a linear-time implementation.

Forward Recurrence. We now summarize our recurrence, which arms CH with adaptive regularization and weighting:

\begin{split}H_{t}&=\gamma_{t}\cdot H_{t-1}+k_{t}k_{t}^{\top},\ \ U_{t}=\gamma_{t}\cdot U_{t-1}+v_{t}k_{t}^{\top},\\ y_{t}&=U_{t}\hat{x}_{t},\quad\hat{x}_{t}=\text{CH}(H_{t}+\lambda_{t}I,q_{t},r).\end{split}

(CH)
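Before turning to the chunk-wise implementation, it may help to see Eq. (CH) as a plain sequential loop. The following NumPy sketch is a readability-oriented reference, not the paper's kernel: the names `chebyshev_iteration` and `gka_forward_reference`, the zero initial states, and the row-wise layout of `Q`, `K`, `V` are our assumptions, and the `exact` flag swaps CH for a direct solve for comparison.

```python
import numpy as np

def chebyshev_iteration(H, q, L, mu, r):
    """Algorithm 1 as printed (omega_0 = 0)."""
    rho, omega = (L - mu) / (L + mu), 0.0
    xi_prev, xi = np.zeros_like(q), 2.0 * q / (L + mu)
    for _ in range(r):
        omega = 4.0 / (4.0 - rho**2 * omega)
        xi_new = xi - 2.0 * omega / (L + mu) * (H @ xi - q)
        xi_new = xi_new + (omega - 1.0) * (xi - xi_prev)
        xi_prev, xi = xi, xi_new
    return xi

def gka_forward_reference(Q, K, V, gamma, a, r, exact=False):
    """Sequential reference for Eq. (CH) with lambda_t = a * ||H_t||_F."""
    T, D = Q.shape
    H = np.zeros((D, D))
    U = np.zeros((D, D))
    Y = np.zeros((T, D))
    for t in range(T):
        H = gamma[t] * H + np.outer(K[t], K[t])
        U = gamma[t] * U + np.outer(V[t], K[t])
        fro = np.linalg.norm(H, "fro")
        lam = a * fro
        A = H + lam * np.eye(D)
        if exact:
            x_hat = np.linalg.solve(A, Q[t])
        else:
            # Eigenvalue bounds from the adaptive regularization: L = ||H||_F + lam, mu = lam.
            x_hat = chebyshev_iteration(A, Q[t], L=fro + lam, mu=lam, r=r)
        Y[t] = U @ x_hat
    return Y

rng = np.random.default_rng(0)
T, D = 16, 4
Q = rng.standard_normal((T, D))
K = rng.standard_normal((T, D))
V = rng.standard_normal((T, D))
gamma = rng.uniform(0.8, 1.0, size=T)

Y_ch = gka_forward_reference(Q, K, V, gamma, a=1.0, r=30)
Y_exact = gka_forward_reference(Q, K, V, gamma, a=1.0, r=30, exact=True)
print(np.abs(Y_ch - Y_exact).max())
```

With $a=1$ the condition number is at most $2$, so a few dozen iterations already match the exact solve in double precision. The loop materializes every $H_{t}$, which is precisely what the chunk-wise formulation is designed to avoid.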

4.2 Chunk-wise Implementation

In this subsection, we describe our hardware-aware implementation of the forward and backward passes for Eq. (CH). More details can be found in Appendix B.

4.2.1 Forward Pass

Similarly to prior work [Dao-ICML2024-mamba2, Yang-ICML2024-gla, Yang-NeurIPS2024], we now describe a chunk-wise implementation for Eq. (CH). In Eq. (CH), given $U_{t}$ and $\hat{x}_{t}$, computing $y_{t}=U_{t}\hat{x}_{t}$ in a chunk-wise fashion is similar to that of Eq. (Linear-SSM); also similar is the calculation of $H_{t}\xi_{i-1}$ as needed in the (grad descent) step of Algorithm 1. For these we refer the reader to [Yang-ICML2024-gla, Yang-NeurIPS2024] for details. Our algorithmic novelty here is a chunk-wise computational formula for $\|H_{t}\|_{\text{F}}$, presented next.

Let $T$ be the sequence length and $C$ the chunk size such that $N:=T/C$ is an integer. For $t=0,\dots,N-1$, write $[t]:=tC$. The core idea of a chunk-wise implementation is as follows. First, we compute and store the initial state $H_{[t]}$ of every chunk. This gives us implicit access to $H_{[t]+c}$ via unrolling the recurrence of $H_{t}$ for $c$ steps and therefore allows us to carry out computation with $H_{[t]+c}$; for example, by the recurrence in Eq. (CH), we can compute the matrix-vector product $H_{[t]+1}\xi$ via $\gamma_{[t]+1}H_{[t]}\xi+k_{[t]+1}k_{[t]+1}^{\top}\xi$. This is done without forming $H_{[t]+1}$ explicitly, thereby reducing the number of states to materialize on chip. To implement such a scheme, we need to precompute all $H_{[t]}$’s sequentially, and then do the computation with parallelism over chunks and within each chunk.

We now make this idea precise for computing all $\|H_{t}\|_{\text{F}}$’s within a chunk. Since the computation of each chunk is the same, we simplify by working with the first one, where we have access to the initial state $H_{0}$, gates $\gamma_{1},\dots,\gamma_{C}$, and keys $K_{C}=[k_{1},\dots,k_{C}]\in\mathbb{R}^{D\times C}$, and we aim to compute $\|H_{1}\|_{\text{F}},\|H_{2}\|_{\text{F}},\dots,\|H_{C}\|_{\text{F}}$. With this notation, we first compute the $C$-dimensional vector $\zeta=[\zeta_{1},\dots,\zeta_{C}]^{\top}$ of cumulative products of $\gamma_{i}$’s, with $\zeta_{c}=\prod_{i=1}^{c}\gamma_{i}$. Then, we form the $C\times C$ upper triangular matrix $M$ whose $(j,c)$-th entry $M_{j,c}$ is $\zeta_{c}/\zeta_{j}$ for all $c\geq j$ (and zero otherwise). Now, unroll the recurrence of $H_{c}$:

\displaystyle H_{c}=\zeta_{c}H_{0}+\sum_{j=1}^{c}M_{j,c}k_{j}k_{j}^{\top}=\zeta_{c}H_{0}+\sum_{j=1}^{C}M_{j,c}k_{j}k_{j}^{\top}.

Expanding $\|H_{c}\|_{\text{F}}^{2}$ gives the following sum of three terms:

\displaystyle\zeta_{c}^{2}\|H_{0}\|_{\text{F}}^{2}+2\zeta_{c}\sum_{j=1}^{C}M_{j,c}k_{j}^{\top}H_{0}k_{j}+\Big\|\sum_{j=1}^{C}M_{j,c}k_{j}k_{j}^{\top}\Big\|_{\text{F}}^{2}.

With $\zeta$ , the first term $\zeta_{c}^{2}\cdot\|H_{0}\|_{\text{F}}^{2}$ is easily computed in parallel for all $c$ . For the second term, we first compute the vector of quadratic forms $k_{j}^{\top}H_{0}k_{j}$ for all $j$ in parallel, broadcast it and multiply it with $M$ element-wise, sum over each column, and multiply the result with $2\zeta$ element-wise. Finally, with Gram matrix $G_{C}:=K_{C}^{\top}K_{C}$ , one verifies the third term can be computed in parallel for all $c$ via the following pseudocode:

\displaystyle\text{column-sum}(((G_{C}\odot G_{C})M)\odot M).

(5)

Here $\odot$ denotes element-wise multiplication and the sum is over each column. Summing the three terms and taking the square root, we obtain $\|H_{1}\|_{\text{F}},\dots,\|H_{C}\|_{\text{F}}$ , as desired.
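The three-term formula can be sketched in NumPy and compared against a sequential loop that materializes every $H_{c}$. This is an illustrative single-chunk version under our own assumptions (function name `chunk_frobenius_norms`, strictly positive gates so that $\zeta_{c}/\zeta_{j}$ is well defined, double precision); the actual kernels are written in Triton.

```python
import numpy as np

def chunk_frobenius_norms(H0, K_C, gamma):
    """Return ||H_1||_F, ..., ||H_C||_F for one chunk without forming any H_c.

    H0: (D, D) initial state, K_C: (D, C) keys as columns, gamma: (C,) gates > 0.
    """
    zeta = np.cumprod(gamma)                          # zeta_c = prod_{i<=c} gamma_i
    M = np.triu(zeta[None, :] / zeta[:, None])        # M[j, c] = zeta_c / zeta_j for c >= j
    # First term: zeta_c^2 * ||H_0||_F^2.
    term1 = zeta**2 * np.sum(H0 * H0)
    # Second term: 2 * zeta_c * sum_j M[j, c] * k_j^T H_0 k_j.
    quad = np.einsum("dj,de,ej->j", K_C, H0, K_C)     # quadratic forms k_j^T H_0 k_j
    term2 = 2.0 * zeta * (quad[:, None] * M).sum(axis=0)
    # Third term, Eq. (5): column-sum(((G * G) M) * M) with Gram matrix G.
    G = K_C.T @ K_C
    term3 = (((G * G) @ M) * M).sum(axis=0)
    return np.sqrt(term1 + term2 + term3)

rng = np.random.default_rng(0)
D, C = 4, 6
X = rng.standard_normal((D, D))
H0 = X @ X.T                                          # PSD initial state
K_C = rng.standard_normal((D, C))
gamma = rng.uniform(0.8, 1.0, size=C)

norms_chunk = chunk_frobenius_norms(H0, K_C, gamma)

# Sequential reference: H_c = gamma_c H_{c-1} + k_c k_c^T.
H, norms_seq = H0.copy(), []
for c in range(C):
    H = gamma[c] * H + np.outer(K_C[:, c], K_C[:, c])
    norms_seq.append(np.linalg.norm(H, "fro"))
norms_seq = np.array(norms_seq)
print(np.abs(norms_chunk - norms_seq).max())
```

All operations in `chunk_frobenius_norms` are matrix products, element-wise products, or reductions over a $C\times C$ or $D\times C$ array, which is what makes the formula friendly to parallel hardware.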

4.2.2 Backward Pass

Motivation. Typically, the backward pass is done automatically via torch.autograd. However, for iterative methods such as CH (Algorithm 1), torch.autograd would store some activations or intermediate iterates, entailing a large storage cost. While in principle we can back-propagate through CH without storing any intermediate activations or iterates (by our trick of reverting the CH iterations, cf. Section C.1), under this trick it is difficult to compute all the gradients in a chunk-wise fashion. Therefore, we resort to the implicit differentiation trick, which is practically efficient and chunk-wise implementable, for backpropagation through the linear equations that CH approximately solves.

Implicit Differentiation. We derive the backward pass for our method with the standard implicit differentiation trick. It assumes we find an exact solution $x^{*}_{t}$ to the equations $(H_{t}+\lambda_{t}I)x=q_{t}$ . In the backward pass, we are given the gradient $dx^{*}_{t}:=\frac{d\mathcal{L}}{dx^{*}_{t}}$ of some loss function $\mathcal{L}$ , and need to compute the corresponding gradients at $q_{t},k_{t},\gamma_{t}$ . For example, via the chain rule we obtain $dq_{t}:=\frac{d\mathcal{L}}{dq_{t}}$ via

\displaystyle dq_{t}=(H_{t}+\lambda_{t}I)^{-1}dx^{*}_{t},

(6)

that is, to solve linear equations similarly to the forward pass. Since the forward pass computes an approximate solution $\hat{x}_{t}$ via CH, we receive an approximate upstream gradient $d\hat{x}_{t}$ (not exactly $dx_{t}^{*}$). Thus we employ CH to obtain an approximate gradient $d\hat{q}_{t}=\text{CH}(H_{t}+\lambda_{t}I,d\hat{x}_{t},r)$; cf. Table 1.
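Eq. (6) can be checked against finite differences for a single time step. In the sketch below, the loss is an assumed stand-in, $\mathcal{L}=g^{\top}x^{*}$ with a random vector $g$ playing the role of the upstream gradient $dx^{*}_{t}$, and the exact solver replaces CH so that only the implicit-differentiation formula is being tested.

```python
import numpy as np

rng = np.random.default_rng(0)
D, a = 6, 0.5
K = rng.standard_normal((10, D))
H = K.T @ K
A = H + a * np.linalg.norm(H, "fro") * np.eye(D)   # H_t + lambda_t I (symmetric)
q = rng.standard_normal(D)
g = rng.standard_normal(D)                         # stand-in for the upstream gradient dx*

def loss(q_vec):
    """L(q) = g^T x*(q) with x*(q) = A^{-1} q."""
    return g @ np.linalg.solve(A, q_vec)

# Eq. (6): the gradient w.r.t. q is another linear solve with the same matrix.
dq_implicit = np.linalg.solve(A, g)

# Central finite differences along each coordinate direction.
eps = 1e-6
dq_fd = np.array([(loss(q + eps * e) - loss(q - eps * e)) / (2 * eps) for e in np.eye(D)])
print(np.abs(dq_implicit - dq_fd).max())
```

Because the backward solve uses the same matrix $H_{t}+\lambda_{t}I$ as the forward solve, it inherits the same condition-number bound and the same eigenvalue bounds $L,\mu$ for CH.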

Backward Recurrence. Besides $dq_{t}$ , we need to compute $dH_{t}$ from which we obtain $dk_{t}$ and $d\gamma_{t}$ via the chain rule. We describe $d\gamma_{t}$ in the Appendix. Here we analyze $dk_{t}$ :

Lemma 1.

With $\lambda_{t}=a\|H_{t}\|_{\text{F}}$ and $w_{t}=\frac{2a\cdot(x_{t}^{*})^{\top}dq_{t}}{\|H_{t}\|_{\text{F}}}$, we have

\displaystyle dk_{i}=\sum_{t\geq i}M_{i,t}\left(-dq_{t}(x_{t}^{*})^{\top}-x_{t}^{*}dq_{t}^{\top}-w_{t}H_{t}\right)k_{i}.

(7)
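Lemma 1 can be sanity-checked numerically on a short sequence. The sketch below is our own test harness, not the paper's code: it assumes $H_{0}=0$, uses the exact solver in place of CH, defines the loss as $\mathcal{L}=\sum_{t}g_{t}^{\top}x_{t}^{*}$ with random $g_{t}$ standing in for $dx_{t}^{*}$, and compares Eq. (7) with central finite differences over every key entry. The helper names `states`, `loss`, and `dk_lemma1` are hypothetical.

```python
import numpy as np

def states(K, gamma):
    """H_t = gamma_t H_{t-1} + k_t k_t^T with H_0 = 0; K holds one key per row."""
    T, D = K.shape
    H, out = np.zeros((D, D)), []
    for t in range(T):
        H = gamma[t] * H + np.outer(K[t], K[t])
        out.append(H.copy())
    return out

def loss(K, Q, G, gamma, a):
    total = 0.0
    for t, H in enumerate(states(K, gamma)):
        lam = a * np.linalg.norm(H, "fro")
        total += G[t] @ np.linalg.solve(H + lam * np.eye(H.shape[0]), Q[t])
    return total

def dk_lemma1(K, Q, G, gamma, a):
    """Gradient w.r.t. all keys via Eq. (7), accumulated over t >= i."""
    T, D = K.shape
    zeta = np.cumprod(gamma)
    dK = np.zeros_like(K)
    for t, H in enumerate(states(K, gamma)):
        fro = np.linalg.norm(H, "fro")
        A = H + a * fro * np.eye(D)
        x = np.linalg.solve(A, Q[t])          # x_t^*
        dq = np.linalg.solve(A, G[t])         # Eq. (6)
        w = 2.0 * a * (x @ dq) / fro
        E = -np.outer(dq, x) - np.outer(x, dq) - w * H
        for i in range(t + 1):
            dK[i] += (zeta[t] / zeta[i]) * (E @ K[i])   # M_{i,t} = zeta_t / zeta_i
    return dK

rng = np.random.default_rng(0)
T, D, a = 5, 4, 0.5
K = rng.standard_normal((T, D))
Q = rng.standard_normal((T, D))
G = rng.standard_normal((T, D))
gamma = rng.uniform(0.8, 1.0, size=T)

dK_formula = dk_lemma1(K, Q, G, gamma, a)

eps = 1e-5
dK_fd = np.zeros_like(K)
for i in range(T):
    for d in range(D):
        P = np.zeros_like(K)
        P[i, d] = eps
        dK_fd[i, d] = (loss(K + P, Q, G, gamma, a) - loss(K - P, Q, G, gamma, a)) / (2 * eps)
print(np.abs(dK_formula - dK_fd).max())
```

The term $-w_{t}H_{t}$ is the contribution of the adaptive regularization: it appears because $\lambda_{t}$ itself depends on the keys through $\|H_{t}\|_{\text{F}}$.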

With $A_{i}:=\sum_{t\geq i}M_{i,t}\cdot dq_{t}(x_{t}^{*})^{\top}$, we can compute the first two terms $-A_{i}k_{i}$ and $-A_{i}^{\top}k_{i}$ in Eq. (7), similarly to Eq. (Linear-SSM). Specifically, $A_{i}$ satisfies the recursion

\displaystyle A_{i}=\gamma_{i+1}A_{i+1}+dq_{i}(x_{i}^{*})^{\top},

(8)

thus calculating $A_{i}k_{i}$ amounts to calculating $S_{t}q_{t}$ in Eq. (Linear-SSM); a difference is that the recursion here runs backwards.

Similarly, with $B_{i}=\sum_{t\geq i}M_{i,t}w_{t}H_{t}$, the third term in Eq. (7) can be written recursively as

\displaystyle B_{i}=\gamma_{i+1}B_{i+1}+w_{i}H_{i},\quad o_{i}=B_{i}k_{i}.

(9)
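The two backward recursions, Eq. (8) and Eq. (9), can be compared with their defining sums on a toy sequence. In this sketch the arrays `dq`, `x`, and `w` are random placeholders for $dq_{t}$, $x_{t}^{*}$, and $w_{t}$ (the recursions hold for any values), and we assume zero terminal states beyond the last time step and $H_{0}=0$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 3
gamma = rng.uniform(0.8, 1.0, size=T)
K = rng.standard_normal((T, D))
dq = rng.standard_normal((T, D))     # placeholder for dq_t
x = rng.standard_normal((T, D))      # placeholder for x_t^*
w = rng.standard_normal(T)           # placeholder for w_t

# Forward states H_t = gamma_t H_{t-1} + k_t k_t^T (H_0 = 0 assumed).
H, Hs = np.zeros((D, D)), []
for t in range(T):
    H = gamma[t] * H + np.outer(K[t], K[t])
    Hs.append(H.copy())

zeta = np.cumprod(gamma)

def M(i, t):
    """M_{i,t} = zeta_t / zeta_i = prod_{j=i+1}^{t} gamma_j."""
    return zeta[t] / zeta[i]

# Defining sums over t >= i.
A_sum = [sum(M(i, t) * np.outer(dq[t], x[t]) for t in range(i, T)) for i in range(T)]
B_sum = [sum(M(i, t) * w[t] * Hs[t] for t in range(i, T)) for i in range(T)]

# Backward recursions, Eq. (8) and Eq. (9), run from the last step to the first.
A, B = np.zeros((D, D)), np.zeros((D, D))
A_rec, B_rec = [None] * T, [None] * T
for i in reversed(range(T)):
    g_next = gamma[i + 1] if i + 1 < T else 0.0   # no contribution beyond the sequence end
    A = g_next * A + np.outer(dq[i], x[i])
    B = g_next * B + w[i] * Hs[i]
    A_rec[i], B_rec[i] = A, B

err_A = max(np.abs(A_rec[i] - A_sum[i]).max() for i in range(T))
err_B = max(np.abs(B_rec[i] - B_sum[i]).max() for i in range(T))
print(err_A, err_B)
```

The additive term in the $A_{i}$ recursion is rank one, whereas the additive term $w_{i}H_{i}$ in the $B_{i}$ recursion is a full state produced by the forward recurrence, which is the nested structure discussed next.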

Chunk-wise Recurrence. As indicated, a chunk-wise implementation for computing $A_{i}k_{i}$ is known. On the other hand, computing $B_{i}k_{i}$ is more challenging than $A_{i}k_{i}$, as the additive term $w_{i}H_{i}$ in the backward recursion Eq. (9) is not necessarily rank-$1$; rather, $H_{i}$ itself is defined via the forward recursion in Eq. (CH). Our contribution here is a derivation for computing $B_{i}k_{i}$ efficiently in a chunk-wise manner.

We begin by unrolling $B_{i}$ to $B_{C+1}$ :

\begin{split}B_{i}=&M_{i,C}\cdot\gamma_{C+1}B_{C+1}+B_{i}^{\text{intra}}\\ B_{i}^{\text{intra}}:=&\sum_{c=i}^{C}M_{i,c}\cdot w_{c}H_{c}\end{split}

(10)

We next discuss the intra-chunk term $B_{i}^{\text{intra}}k_{i}$ and cross-chunk term $M_{i,C}\cdot\gamma_{C+1}B_{C+1}$ in succession.

Intra-chunk Computation. We now unroll $H_{c}$ and obtain an expression for $B_{i}^{\text{intra}}$ that is more amenable to parallelism (recall that $M_{i,c}=0$ for $c<i$):

\begin{split}B_{i}^{\text{intra}}&=\sum_{c=i}^{C}M_{i,c}\cdot w_{c}H_{c}\\ &=\sum_{c=1}^{C}M_{i,c}w_{c}\Big(\zeta_{c}H_{0}+\sum_{j=1}^{C}M_{j,c}k_{j}k_{j}^{\top}\Big)\\ &=H_{0}\sum_{c=1}^{C}M_{i,c}w_{c}\zeta_{c}+\sum_{j=1}^{C}k_{j}k_{j}^{\top}\sum_{c=1}^{C}M_{i,c}w_{c}M_{j,c}.\end{split}

The coefficients of $H_{0}$ , written as $b_{i}$ , are easily computed in parallel for all $i$ via element-wise operations, broadcasting, and summing. The coefficient of $k_{j}k_{j}^{\top}$ is precisely the $(i,j)$ -th entry of the matrix $M_{w}:=M\operatorname{diag}(w_{1},\dots,w_{C})M^{\top}$ . Thus $[B_{1}^{\text{intra}}k_{1},\dots,B_{C}^{\text{intra}}k_{C}]$ is equal to

\displaystyle H_{0}K_{C}\operatorname{diag}(b_{1},\dots,b_{C})+K_{C}\left((K_{C}^{\top}K_{C})\odot M_{w}\right).

Here the mask $M_{w}$ is in general a full matrix with no zero entries, as opposed to the triangular matrix in the case of Eq. (Linear-SSM). While the triangular mask in the backward pass allows the error feedback from future tokens to be leveraged for learning past tokens, here our full mask $M_{w}$ allows all tokens to interact with all other tokens in the backward pass, which facilitates the information flow and learning.
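The intra-chunk formula can be verified against a direct evaluation of $B_{i}^{\text{intra}}k_{i}$. The NumPy sketch below works on a single chunk with random placeholder weights $w_{c}$ and strictly positive gates (both assumptions of ours); in the parallel expression, the $i$-th column $H_{0}k_{i}$ is scaled by $b_{i}=\sum_{c}M_{i,c}w_{c}\zeta_{c}$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 4, 6
K_C = rng.standard_normal((D, C))            # keys as columns
gamma = rng.uniform(0.8, 1.0, size=C)
w = rng.standard_normal(C)                   # placeholder for w_c
X = rng.standard_normal((D, D))
H0 = X @ X.T                                 # symmetric initial state of the chunk

zeta = np.cumprod(gamma)
M = np.triu(zeta[None, :] / zeta[:, None])   # M[i, c] = zeta_c / zeta_i for c >= i

# Direct evaluation: materialize every H_c, then B_i^intra k_i column by column.
H, Hs = H0.copy(), []
for c in range(C):
    H = gamma[c] * H + np.outer(K_C[:, c], K_C[:, c])
    Hs.append(H.copy())
direct = np.stack(
    [sum(M[i, c] * w[c] * Hs[c] for c in range(i, C)) @ K_C[:, i] for i in range(C)],
    axis=1,
)

# Parallel evaluation from the unrolled expression.
b = M @ (w * zeta)                           # b_i = sum_c M[i, c] w_c zeta_c
M_w = M @ np.diag(w) @ M.T                   # full (non-triangular) mask
parallel = (H0 @ K_C) * b[None, :] + K_C @ ((K_C.T @ K_C) * M_w)

print(np.abs(direct - parallel).max())
```

The parallel form touches only $H_{0}$, the keys of the chunk, and two $C\times C$ masks, so no intermediate state $H_{c}$ needs to be written to memory.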

Figure 1: CH converges with smaller errors than CG and is more numerically stable. Convergence of different methods in residual norms during the forward pass with batch size $8$, sequence length 2048, 8 heads, head dimension 128 (a), and relative gradient differences from the exact solver (torch.linalg.solve) to CG (b, c) or CH (d, e). The backward pass is via implicit differentiation (impl) or torch.autograd (auto); cf. Table 1. In (b, d) the gradients are those of $[q_{t},k_{t}]$; in (c, e) the gradients are those of network weights.

Cross-chunk Computation. In Eq.˜10, both $\gamma_{C+1}$ and $B_{C+1}$ are from the future chunks, thus we revise Eq.˜10 into the cross chunk recursion of $\widetilde{B}_{C+1}:=\gamma_{C+1}B_{C+1}$ which allows us to maintain a single term $\widetilde{B}_{C+1}$ from the future:

	$\displaystyle\widetilde{B}_{1}=\zeta_{C}\cdot\widetilde{B}_{C+1}+\widetilde{B}^{\text{intra}}_{1},\ \ \ \widetilde{B}^{\text{intra}}_{1}:=\sum_{c=1}^{C}\zeta_{c}\cdot w_{c}H_{c}.$

In our intra-chunk computation, we store the intra-chunk term $\widetilde{B}^{\text{intra}}_{1}$ of all chunks, implement the above with a simple for loop, and collect the terms $\zeta_{C}\cdot\widetilde{B}_{C+1}k_{i}$ .

4.2.3 Comparison to Other Iterative Solvers

Here we validate our choice of Chebyshev Iteration (CH) by benchmarking it against other iterative methods.

Convergence in the Forward Pass. We generate random regression problems, which we solve via CH and three baselines: gradient descent (GD), accelerated GD with Nesterov’s momentum (AGD), and conjugate gradient (CG). GD and AGD are run with stepsizes that are optimal for regression problems. Fig. 1a shows that CG converges the fastest within the first few iterations, while CH reaches the same accuracy as CG at iteration 10 and eventually attains the smallest errors.

Table 1: Implicit differentiation for computing $dq_{t}$.

	forward pass	backward pass
exact	$x^{*}_{t}=(H_{t}+\lambda_{t}I)^{-1}q_{t}$	$dq_{t}=(H_{t}+\lambda_{t}I)^{-1}dx_{t}$
CG	$\hat{x}_{t}=\text{CG}(H_{t}+\lambda_{t}I,q_{t},r)$	$d\hat{q}_{t}=\text{CG}(H_{t}+\lambda_{t}I,d\hat{x}_{t},r)$
CH	$\hat{x}_{t}=\text{CH}(H_{t}+\lambda_{t}I,q_{t},r)$	$d\hat{q}_{t}=\text{CH}(H_{t}+\lambda_{t}I,d\hat{x}_{t},r)$

Stability of the Backward Pass. We next measure the gradient stability of CG and CH, whose backward passes are implemented either via implicit differentiation as per Table 1 (impl), or via torch.autograd (auto).
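To make the "exact" row of Table 1 concrete, here is a minimal NumPy sketch of implicit differentiation through a linear solve, checked against central finite differences. The helper names, the toy loss $g^{\top}x$ and the problem sizes are assumptions made for illustration; the paper's implementation operates on batched tensors with custom kernels.

```python
import numpy as np

def solve_forward(H, lam, q):
    """Forward pass: x = (H + lam I)^{-1} q."""
    A = H + lam * np.eye(H.shape[0])
    return np.linalg.solve(A, q)

def solve_backward_implicit(H, lam, dx):
    """Backward pass: given dx = dLoss/dx, return dq = dLoss/dq.

    The Jacobian dx/dq is A^{-1}; since A is symmetric, the
    vector-Jacobian product is another solve with the same matrix.
    """
    A = H + lam * np.eye(H.shape[0])
    return np.linalg.solve(A, dx)

rng = np.random.default_rng(0)
D = 5
K = rng.standard_normal((D, 8))
H = K @ K.T                      # symmetric positive semidefinite
lam = 0.5
q = rng.standard_normal(D)
g = rng.standard_normal(D)       # dLoss/dx for the toy loss g . x

dq_implicit = solve_backward_implicit(H, lam, g)

# Central finite differences of the toy loss with respect to each q entry.
eps = 1e-6
dq_fd = np.array([
    (g @ solve_forward(H, lam, q + eps * e)
     - g @ solve_forward(H, lam, q - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
```

The CG and CH rows of Table 1 follow the same pattern, with the exact solve in the backward pass replaced by the corresponding iterative solver.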

In Fig. 1b, CG (impl) as a standalone layer has its gradient close to that of the exact solver, up to a $10^{-3}$ relative difference. In Fig. 1c, this difference is amplified to almost $1$ in a 5-layer LLAMA where Eq. (Attn) is replaced with Eq. 3. This indicates that CG (impl) deviates completely from the reference gradient (exact), defeating its purpose of training the network from the regression feedback. In contrast, the gradients of CH (impl) and CH (auto) are eventually close to that of the exact solver, either as a single layer (Fig. 1d) or within multiple layers (Fig. 1e), up to a $10^{-6}$ difference. Moreover, the curves for CH (impl) and CH (auto) nearly overlap, suggesting that their gradients may be close. The following lemma confirms and formalizes this intuition (see Appendix C for a proof), thereby justifying our choice of CH over the alternatives:

Lemma 2.

Let $dq_{t}$ be the exact gradient of $q_{t}$ for CH, e.g., computed by CH (auto). Let $d\hat{q}_{t}$ be the gradient of CH (impl), computed as per Table 1. We have $dq_{t}=d\hat{q}_{t}$ .
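An informal way to see why such an identity is plausible is sketched below; this is not the proof from Appendix C. It assumes that CH starts from zero and that its stepsizes and weights depend only on the spectral bounds, not on $q_{t}$. Under these assumptions, $r$ iterations of CH apply a fixed polynomial $P_{r}$ of the symmetric matrix $A_{t}$ to $q_{t}$, so the map is linear in $q_{t}$ with a symmetric Jacobian:

```latex
\begin{align*}
\hat{x}_{t} &= \text{CH}(A_{t}, q_{t}, r) = P_{r}(A_{t})\,q_{t},
  \qquad A_{t} := H_{t}+\lambda_{t}I = A_{t}^{\top}, \\
\frac{\partial \hat{x}_{t}}{\partial q_{t}} &= P_{r}(A_{t}), \\
dq_{t} &= P_{r}(A_{t})^{\top}\,d\hat{x}_{t}
        = P_{r}(A_{t})\,d\hat{x}_{t}
        = \text{CH}(A_{t}, d\hat{x}_{t}, r)
        = d\hat{q}_{t}.
\end{align*}
```

The same argument does not carry over to CG, whose step coefficients are computed from the residuals and therefore depend on $q_{t}$, making the map nonlinear in $q_{t}$.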

4.3 Architectural Consideration

Our GKA layer in Fig. 2 includes two components (in green) on top of established practices (in blue). The CH component is described in Section 4.1, so here we introduce the $\alpha$ -connection. First, the sigmoid activation ensures $\alpha_{t}\in[0,1]$ , so the output of the $\alpha$ -connection is a convex combination of the original query $q_{t}$ and the output $\hat{x}_{t}$ of CH. Second, it plays a role similar to a residual connection, establishing a direct path that facilitates the gradient flow and improves training; we show this is indeed the case in Section F.3. Finally, the full architecture for GKA is the standard Transformer, with its attention layer replaced by the GKA layer.

5 Experiments

In this section, we empirically validate the efficacy of our approach. We first evaluate memorization ability on synthetic associative recall tasks (Section 5.1). We then report training throughput of GKA (Section 5.2). Finally, we examine performance on short-context language understanding benchmarks such as commonsense reasoning and long-context modeling abilities in Section 5.3. The Appendix details our experimental settings (Appendix D) and ablations of various modeling choices (Appendix F, Appendix G).

Baselines. All experiments consider the following state-of-the-art linear SSM-based fading memory layers as baselines: Mamba2 [Dao-ICML2024-mamba2], DeltaNet [Yang-NeurIPS2024], Gated DeltaNet (GDN) [Yang-ICLR2025], and Gated Linear Attention (GLA) [Yang-ICML2024-gla]. Each of these layers relies on instantaneous objectives that depend on the previous lossy state and current tokens (e.g., Eq. 2), as opposed to the entire history of tokens observed so far as in GKA. Finally, we contrast our results with (Softmax) Attention, which serves as our paragon. For our Attention-based model, we adopt the architecture proposed in Qwen3 models [yang2025qwen3].

5.1 GKA on Synthetic Associative Recall Tasks

We first assess the capability of our models to recall information on the multi-Query Associative Recall (MQAR) task, a synthetic task introduced by Arora et al. [arora2023zoology]. This task presents the model with a sequence of key-value pairs to memorize, followed by a sequence of queries. For each query, the model must retrieve the corresponding key from memory and accurately recall its associated value. Attention based layers perform the best in this task, while SSM-based memory layers are known to struggle as their memory fades away as the context length grows.
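For intuition, the structure of one MQAR-style example can be sketched as follows. This is a deliberately simplified generator: the vocabulary split, the token layout and the function name are assumptions made for illustration and do not reproduce the exact data format of Arora et al.

```python
import random

def make_mqar_example(num_pairs, num_queries, vocab_size, seed=0):
    """Build a toy associative-recall example.

    Keys are drawn from the lower half of the vocabulary and values from
    the upper half, so a key token can never be confused with a value token.
    Requires num_queries <= num_pairs <= vocab_size // 2.
    """
    rng = random.Random(seed)
    half = vocab_size // 2
    keys = rng.sample(range(half), num_pairs)                 # distinct keys
    values = [rng.randrange(half, vocab_size) for _ in keys]  # one value per key
    kv = dict(zip(keys, values))
    # Context: k1 v1 k2 v2 ... (pairs to memorize).
    context = [tok for k in keys for tok in (k, kv[k])]
    # Queries: a subset of previously seen keys; targets: their values.
    queries = rng.sample(keys, num_queries)
    targets = [kv[q] for q in queries]
    return context, queries, targets

context, queries, targets = make_mqar_example(num_pairs=8, num_queries=4, vocab_size=64)
```

A model with perfect recall would output `targets[j]` when prompted with `queries[j]` after reading `context`; longer contexts and more pairs make the task harder for fixed-size states.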

We compare GKA with Attention and other linear SSM baselines on this task. For each memory layer type, we train 2-layer models on MQAR training data and evaluate on a held-out test set. We repeat this experiment for $4$ different learning rates spanning from $10^{-4}$ to $10^{-2}$ . As shown in Fig. 3a, GKA improves upon every other linear SSM baseline at all sequence lengths and model dimensions considered. Note that the complexity of the task increases with increasing sequence length and number of key-value pairs, while larger model dimensions improve memorization capacity through increased state size. The success of our layer can be attributed to our modeling choice: unlike other fading memory designs (like GDN or Mamba2), we construct states based on the optimal MAP estimate conditioned on the entire history, enabling better retention of remote information.

5.2 Training Throughput of GKA

In Fig. 3b we measure the running time (forward + backward) of a single GKA layer and compare it with FlashAttention [golden2024flash], DeltaNet, and GDN. Our layer achieves running time comparable to GDN, a state-of-the-art SSM layer, despite having a more computationally expensive state update, Eq. (CH), than that of GDN, Eq. (GDN). This demonstrates that our chunk-wise parallelization strategy effectively compensates for the additional computational cost.

Table 2: On average GKA improves upon all fading memory baselines across all tasks. We report results for zero-shot evaluation of 2.8B language models for short-context tasks. For each task, bold indicates highest value followed by underlined.

Model	ARC-C	ARC-E	BoolQ	COPA	HellaSWAG	PIQA	SciQ	Winogrande	FDA	SWDE	Avg
	acc_n $\uparrow$	acc_n $\uparrow$	acc $\uparrow$	acc $\uparrow$	acc_n $\uparrow$	acc_n $\uparrow$	acc_n $\uparrow$	acc $\uparrow$	contains $\uparrow$	contains $\uparrow$	
Transformer	32.25	56.10	64.28	80.00	60.96	73.56	79.50	61.72	58.53	72.28	63.92
Gated Linear Attention	27.82	50.80	52.57	78.00	48.83	70.13	69.60	54.54	2.81	20.43	47.55
DeltaNet	32.85	58.16	42.51	81.00	61.13	73.78	43.90	61.72	11.80	46.08	51.29
Mamba2	32.24	59.64	58.72	82.00	62.23	73.78	79.80	62.19	7.71	41.13	55.94
Gated DeltaNet	32.59	60.02	62.75	82.00	62.80	74.32	80.60	62.35	8.26	44.28	57.00
Gated KalmaNet (Ours)	32.51	59.89	61.68	85.00	63.84	74.81	83.20	64.17	12.89	50.95	58.89

5.3 GKA on Language Modeling

5.3.1 Short-context Tasks

Setup. For this set of experiments, we construct a 2.8B-parameter LLM for each memory layer (GKA and the baselines described in Section 5) by cascading blocks, each consisting of a memory layer followed by a Multi-Layer Perceptron (MLP). (For the Mamba2 baseline, we cascade Mamba2 layers alone, since a single Mamba2 layer already contains both the Mamba2 SSM and an MLP.) Hereafter, we refer to each 2.8B model by the name of the layer used to construct it. We then train each model on DCLM [li2024datacomp], a generic pre-training dataset, for $100$ B tokens at $4$ K context length using the AdamW optimizer with a peak Learning Rate (LR) of $10^{-3}$ and gradient clipping of $1.0$ . We use a cosine LR scheduler with a warmup period of $5$ B tokens and a global batch size of $2$ M tokens. All models employ the GPT2 tokenizer with a vocabulary size of $50$ K tokens.
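The token budgets above translate into optimizer steps as in the sketch below. Several details are assumptions, since the setup does not state them: that "2M" means exactly $2\times 10^{6}$ tokens per step, that warmup is linear, and that the cosine schedule decays to a final LR of zero. The constant and function names are hypothetical.

```python
import math

TOTAL_TOKENS = 100e9    # 100B training tokens
WARMUP_TOKENS = 5e9     # 5B warmup tokens
BATCH_TOKENS = 2e6      # 2M tokens per step (assumed to mean 2 * 10^6)
PEAK_LR = 1e-3

TOTAL_STEPS = int(TOTAL_TOKENS // BATCH_TOKENS)     # 50,000 steps
WARMUP_STEPS = int(WARMUP_TOKENS // BATCH_TOKENS)   # 2,500 steps

def learning_rate(step, min_lr=0.0):
    """Linear warmup (assumed) followed by cosine decay to min_lr (assumed 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, warmup covers the first 5% of training and the LR peaks at step 2,500.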

Tasks. Following prior works [Zancato-NeurIPS2024-BMOJO, Yang-ICML2024-gla, Yang-ICLR2025], to assess the language modeling capabilities of our model we perform zero-shot evaluation on the following eight common-sense reasoning tasks from LM-Harness [eval-harness]: ARC-E, ARC-C, BoolQ, COPA, HellaSWAG, PIQA, SciQ, Winogrande. We also evaluate models on FDA and SWDE, real-world recall-intensive tasks which focus on extracting structured information like tagged content from raw text (for example, HTML files). All these tasks are relatively short ( $<2$ K tokens).

Results. We report our results in Table 2. GKA outperforms all fading memory baselines on average across all tasks owing to its ability to better manage its state via solving Eq. 3. In particular, GKA outperforms both GDN and Mamba2 on recall-intensive tasks (FDA and SWDE) by about $10\%$ (rel. improvement). We note that although GKA improves upon existing SSM layers, there is still a gap with the Attention-based Transformer, especially on recall tasks, owing to the eidetic capabilities of Attention. Nevertheless, as discussed in Section 1, Attention's advantage comes at a quadratic cost at training time, whereas our layer’s computational complexity is still comparable to existing SSM layers (cf. Section 5.2). In Appendix I we extend our results to Hybrid models (stacks of SSM and Attention layers) and show that the gap with full Transformer models becomes negligible (while still benefiting from the SSM’s computational advantages). Finally, in Appendix E we show that GKA exhibits stronger scaling with compute than other SSM baseline models.

5.3.2 Long-context Tasks

Setup. To enable long-context capabilities of our models, as is common practice, we perform continued pre-training of our 2.8B models obtained in Section˜5.3.1 on $25$ B tokens of long documents at $128$ K context length (cf. Appendix). To the best of our knowledge we are the first to train and evaluate SSM models up to $128$ K context (e.g., previous work [Yang-ICLR2025] only considered up to $4$ K/ $8$ K context).

Tasks. For long-context, we refrain from using perplexity as it is known to have limitations at assessing long-context performance of LLMs [nunez2024expansion, fang2024wrong, gao2025train]. Instead, we turn to recently proposed benchmarks that mix synthetic and real datasets comprising several long-context tasks: synthetic Recall, Retrieval-Augmented Generation (RAG), Many shot In-Context Learning (ICL) and Long Question-Answering (LongQA). For synthetic Recall and LongQA we consider tasks from the RULER benchmark [hsieh2024ruler]. For RAG and ICL we consider tasks from HELMET [yen2025helmet].

Results. Fig. 4 reports our results. GKA shows strong RAG and LongQA capabilities, outperforming all fading memory baselines by at least $10$ % (rel. improvement). Interestingly, on synthetic Recall tasks from RULER, GKA is competitive only at $4$ K context length and starts to fall behind afterwards. We attribute this divergence to the fundamental differences between these task types. While both RAG and LongQA can be thought of as finding relevant information in long streams of text, they involve more realistic linguistic patterns and semantic relationships that align with natural text distributions seen during pretraining. In contrast, synthetic Recall tasks require models to find specific words, numbers, or UUIDs verbatim from long contexts filled with random distractors. This artificial setting does not reflect natural text distributions. Since GKA computes MAP estimates of the latent state based on learned representations of observed tokens, it relies on its pretrained weights to determine which information is important to retain. The synthetic and unnatural structure of Recall tasks makes it difficult for the model to identify what should be retained, as pretrained knowledge provides little signal about importance in these artificial contexts. This suggests that our approach excels in realistic scenarios where pretrained knowledge about natural language structure can guide information selection, but struggles when the signal-to-noise distinction is purely artificial.

6 Kalman Filter for Optimally Modelling Fading Memory

In this section, we show how the Kalman Filter (KF) provides a principled solution for constructing an optimal fading memory that accounts for the entire history. We begin by describing the standard Kalman Filter recurrence in the context of memory modeling. However, the KF has a fundamental limitation: its inherently sequential nature makes it impractical for large-scale training on modern hardware accelerators (Section 3.2). To address this, we make simplifying assumptions that make the KF amenable to parallelization on modern hardware accelerators. We then demonstrate that several recent state-space models (DeltaNet, Gated DeltaNet, and Kimi Delta Attention) can be viewed as approximations to the KF recurrence. Specifically, these methods approximate the “optimal” Kalman gain matrix while ignoring dependencies on the past. In contrast, GKA computes the exact Kalman gain by considering the full history. This theoretical advantage translates to improved empirical performance, as we demonstrate in Section 5.

6.1 A Dynamical System for Fading Memory

The Kalman filter is a classical algorithm for online optimal inference in linear Gaussian State-Space Models. It gives a principled way to maintain and update a compact state as new noisy observations arrive. The latent state serves as a compressed "memory" of the past. More formally, it is a minimal sufficient statistic that makes past observations conditionally independent of future ones given the state.

We begin by describing a linear Gaussian model for fading memory.

	$\displaystyle s_{t}$	$\displaystyle=A_{t}s_{t-1}+B_{t}u_{t}+w_{t},\quad$	$\displaystyle w_{t}\sim\mathcal{N}(0,Q_{t})$		(LGM)
	$\displaystyle v_{t}$	$\displaystyle=k_{t}^{\top}s_{t}+\mu_{t},\quad$	$\displaystyle\mu_{t}\sim\mathcal{N}(0,r_{t}),$		(LGM)

where $s_{t}\in\mathbb{R}^{n}$ is a latent state that summarizes the past, $u_{t}\in\mathbb{R}^{n}$ is the control input that updates the state and $v_{t}$ is the scalar measurement observed at time $t$ . $A_{t},B_{t}\in\mathbb{R}^{n\times n}$ are the state transition and input selection matrices, and $k_{t}\in\mathbb{R}^{n}$ is the emission (readout) vector. Finally, $w_{t}$ and $\mu_{t}$ are Gaussian process and measurement noise, respectively.

Parameter interpretation. $A_{t}$ and $B_{t}$ control the forgetting (fading of the remote past) and input selectivity rates respectively, determining how the state evolves over time. The measurement noise $\mu_{t}$ naturally gives rise to gating mechanisms commonly used in modern SSM layers, as we will show in Section 6.4.

Extension to multi-channel measurements. In attention mechanisms, the memory consists of verbatim key-value pairs that can be queried to retrieve past information [Zancato-NeurIPS2024-BMOJO]. Similarly, we want our state to reconstruct past values from their corresponding keys. To achieve this, we extend to a matrix-valued state $S_{t}\in\mathbb{R}^{n\times d}$ , where each column independently follows the dynamics in Eq. (LGM).

Specifically, for the $i^{\text{th}}$ channel:

	$\displaystyle s_{t,i}$	$\displaystyle=A_{t,i}s_{t-1,i}+B_{t,i}u_{t,i}+w_{t,i},\quad$	$\displaystyle w_{t,i}\sim\mathcal{N}(0,Q_{t,i})$
	$\displaystyle v_{t,i}$	$\displaystyle=k_{t}^{\top}s_{t,i}+\mu_{t,i},\quad$	$\displaystyle\mu_{t,i}\sim\mathcal{N}(0,r_{t,i}),$

where $(k_{t},v_{t})$ is the key-value pair at time $t$ and $v_{t,i}$ is the $i^{\text{th}}$ element of $v_{t}$ . In what follows, we focus on a single channel and drop the subscript $i$ from the state for notational clarity.

6.2 Kalman Filter for Optimal Inference

Given the model in Eq. (LGM) and a sequence of measurements $\{v_{1},v_{2},\ldots,v_{t}\}$ , the Kalman Filter computes the Maximum A-Posteriori (MAP) estimate of the latent state at time $t$ :

	$\displaystyle\hat{s}_{t}=\arg\max_{s}p(s\mid v_{1},v_{2},\ldots,v_{t}),$		(11)

where $p$ is the probability density function. The MAP estimate is optimal in the sense that it minimizes the expected squared error between the true state and its estimation given all measurements up to time $t$ .

The KF recursion. The Kalman Filter updates the state estimate recursively as new measurements arrive. At time $t$ , the update is:

	$\displaystyle\hat{s}_{t}=\underset{\textrm{Predicted state}}{\underbrace{A_{t}\hat{s}_{t-1}+B_{t}u_{t}}}+G_{t}\Big(\overbrace{v_{t,i}-k_{t}^{\top}\underset{\textrm{Predicted state}}{\underbrace{\Big[A_{t}\hat{s}_{t-1}+B_{t}u_{t}\Big]}}}^{\textrm{Innovation}}\Big),$		(12)

where the innovation measures the discrepancy between the actual measurement $v_{t}$ and the predicted measurement based on the predicted state estimate.

The Kalman gain $G_{t}$ determines how much to trust the new measurement versus the predicted state. It is computed as follows:

	$\displaystyle G_{t}=\frac{\Big[A_{t}\Sigma_{t-1}A_{t}^{\top}+Q_{t}\Big]k_{t}}{k_{t}^{\top}\Big[A_{t}\Sigma_{t-1}A_{t}^{\top}+Q_{t}\Big]k_{t}+r_{t}}.$		(13)

The error covariance $\Sigma_{t}$ quantifies the uncertainty in the state estimate. It represents the covariance of the estimation error $(s_{t}-\hat{s}_{t})$ conditioned on all measurements up to time $t$ . The covariance is updated as:

	$\displaystyle\Sigma_{t}=\Big(I-G_{t}k_{t}^{\top}\Big)\Big(A_{t}\Sigma_{t-1}A_{t}^{\top}+Q_{t}\Big).$		(14)

Equations (12), (13) and (14) constitute the KF recursion. We initialize with $\hat{s}_{0}=0$ and $\Sigma_{0}=\sigma I_{n}$ , where $I_{n}$ is the $n\times n$ identity matrix and $\sigma$ represents our prior uncertainty about the state before observing any measurements.
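The three equations map directly onto a few lines of NumPy. The sketch below handles a single channel with a scalar measurement; the function name `kf_step` and the toy inputs are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def kf_step(s_hat, Sigma, k, v, A, B, u, Q, r):
    """One Kalman Filter update for a scalar measurement v = k^T s + noise."""
    s_pred = A @ s_hat + B @ u                    # predicted state
    P = A @ Sigma @ A.T + Q                       # predicted error covariance
    G = P @ k / (k @ P @ k + r)                   # Kalman gain, Eq. (13)
    s_new = s_pred + G * (v - k @ s_pred)         # state update, Eq. (12)
    Sigma_new = (np.eye(len(k)) - np.outer(G, k)) @ P   # covariance update, Eq. (14)
    return s_new, Sigma_new

# Toy usage with hypothetical values: s_0 = 0 and Sigma_0 = sigma * I, sigma = 1.
n = 3
rng = np.random.default_rng(0)
s_hat, Sigma = np.zeros(n), 1.0 * np.eye(n)
A, B, Q = np.eye(n), np.zeros((n, n)), 0.01 * np.eye(n)
u = np.zeros(n)
k = rng.standard_normal(n)
v = 2.0

# With nearly noiseless measurements, the updated state reproduces v along k.
s_new, Sigma_new = kf_step(s_hat, Sigma, k, v, A, B, u, Q, r=1e-12)
```

Note that each call consumes the previous `s_hat` and `Sigma`, which is the sequential dependency that makes the general recursion hard to parallelize.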

6.3 Gated KalmaNet: A Steady-State Dynamical System for Large-Scale Training

Despite its optimality, the KF recursion in its most general form is inherently sequential; each update depends on the previous state estimate. This sequential dependency prevents the parallelization necessary for efficient large-scale training on modern hardware.

To enable parallelization, we make a key simplifying assumption: the underlying state remains static over time. This reduces the problem from tracking a dynamic state to estimating a fixed but unknown parameter from sequential noisy measurements. Formally, we assume a steady-state model:

	$\displaystyle s_{t}$	$\displaystyle=s_{t-1}$			(15)
	$\displaystyle v_{t,i}$	$\displaystyle=k_{t}^{\top}s_{t}+\mu_{t},\quad$	$\displaystyle\mu_{t}\sim\mathcal{N}(0,r_{t}),$		(15)

where $A_{t}=I_{n}$ , $B_{t}=0$ , and $w_{t}=0$ (i.e., no state evolution, no control input, and no process noise).

Adapting to evolving context. The steady-state assumption may initially seem restrictive, since contexts naturally evolve as topics change. GKA addresses this through adaptive weighting (Section 4.1). By assigning higher weights to recent measurements, older observations are naturally faded out over time, allowing the model to track shifting context despite the static formulation.

Under this simplification, the KF recursion reduces to:

	$\displaystyle\hat{s}_{t}=\hat{s}_{t-1}+G_{t}(v_{t,i}-k_{t}^{\top}\hat{s}_{t-1}).$		(16)
	$\displaystyle G_{t}=\frac{\Sigma_{t-1}k_{t}}{k_{t}^{\top}\Sigma_{t-1}k_{t}+r_{t}}.$
	$\displaystyle\Sigma_{t}=\Big(I-G_{t}k_{t}^{\top}\Big)\Sigma_{t-1}.$

Collecting all channels, these equations can be written compactly in matrix form as shown in (KF), with the columns of $S_{t}$ transposed into rows to be consistent with the notation in (KF), and with the noise variance taken as $r_{t}=\frac{1}{\eta_{t}}$ . A key insight of this work is that the KF recursion for the steady-state model admits an efficient parallel implementation via chunked processing (detailed in Section 4) that results in Gated KalmaNet.

Critically, the KF recursion accounts for the entire history when computing state estimates. The Kalman gain $G_{t}$ at each step depends on all previous measurements through $\Sigma_{t-1}$ . This contrasts with most existing SSMs, which we show next can be viewed as approximations that ignore historical dependencies when computing their gain matrices. This principled treatment of the full history is a key advantage of our approach.
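The dependence on the full history can be checked numerically. By the standard recursive-least-squares identity, the steady-state recursion (16) started from $\hat{s}_{0}=0$ and $\Sigma_{0}=\sigma I_{n}$ should coincide with the solution of a weighted ridge regression over all measurements seen so far, with weights $1/r_{t}$ and regularization $1/\sigma$. The sketch below verifies this on random data; the function names and sizes are illustrative assumptions.

```python
import numpy as np

def steady_state_kf(keys, values, r, sigma):
    """Run recursion (16) over all (k_t, v_t) pairs for a single channel."""
    n = keys.shape[1]
    s_hat = np.zeros(n)
    Sigma = sigma * np.eye(n)
    for k, v, r_t in zip(keys, values, r):
        G = Sigma @ k / (k @ Sigma @ k + r_t)
        s_hat = s_hat + G * (v - k @ s_hat)
        Sigma = (np.eye(n) - np.outer(G, k)) @ Sigma
    return s_hat

def weighted_ridge(keys, values, r, sigma):
    """Closed form: (sum_t k_t k_t^T / r_t + I / sigma)^{-1} sum_t k_t v_t / r_t."""
    n = keys.shape[1]
    weights = 1.0 / r
    H = (keys * weights[:, None]).T @ keys + np.eye(n) / sigma
    b = keys.T @ (weights * values)
    return np.linalg.solve(H, b)

rng = np.random.default_rng(0)
T, n = 20, 4
keys = rng.standard_normal((T, n))      # row t is k_t
values = rng.standard_normal(T)         # v_t for one channel
r = rng.uniform(0.5, 2.0, size=T)       # per-step measurement noise variances
sigma = 3.0

s_kf = steady_state_kf(keys, values, r, sigma)
s_ridge = weighted_ridge(keys, values, r, sigma)
```

The closed form has no sequential dependency across time steps beyond accumulating sums, which is what makes a chunk-wise parallel implementation possible.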

6.4 Connection with Existing SSM Layers

DeltaNet [Yang-DeltaNet] approximates the KF recursion in (16) by assuming fixed error covariance: $\Sigma_{t}=I_{n}$ for all $t$ . This simplifies the Kalman gain to:

	$\displaystyle G_{t}=\frac{k_{t}}{k_{t}^{\top}k_{t}+r_{t}}=\frac{k_{t}}{1+r_{t}},$		(17)

where the second equality assumes unit-normalized keys, a common assumption in practical instantiations of DeltaNet. Substituting (17) into the state update (16) and defining $\beta_{t}~=~(1+r_{t})^{-1}$ yields:

	$\displaystyle\hat{s}_{t}=(I-\beta_{t}k_{t}k_{t}^{\top})\hat{s}_{t-1}+\beta_{t}k_{t}v_{t,i},$		(DeltaNet)

which is the DeltaNet recurrence. By fixing $\Sigma_{t}$ , DeltaNet avoids tracking the evolving uncertainty in the state estimate, a key simplification that sacrifices optimality for computational efficiency. In contrast, GKA maintains the full error covariance $\Sigma_{t}$ , allowing it to optimally weight measurements based on the entire history.
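For completeness, the substitution can be spelled out step by step. This only restates (16) and (17) with $G_{t}=\beta_{t}k_{t}$ and $\beta_{t}=(1+r_{t})^{-1}$:

```latex
\begin{align*}
\hat{s}_{t}
  &= \hat{s}_{t-1} + \beta_{t}k_{t}\,\big(v_{t,i} - k_{t}^{\top}\hat{s}_{t-1}\big) \\
  &= \hat{s}_{t-1} - \beta_{t}k_{t}k_{t}^{\top}\hat{s}_{t-1} + \beta_{t}k_{t}v_{t,i} \\
  &= \big(I - \beta_{t}k_{t}k_{t}^{\top}\big)\,\hat{s}_{t-1} + \beta_{t}k_{t}v_{t,i}.
\end{align*}
```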

Gated DeltaNet (GDN) [Yang-ICLR2025] extends DeltaNet by incorporating explicit forgetting through a time-dependent decay factor $\alpha_{t}$ . Like DeltaNet, GDN can be viewed as fixing $\Sigma_{t}=I_{n}$ , but applying this approximation to the KF recursion for a fading dynamical system where the state decays over time.

Specifically, GDN assumes

	$\displaystyle s_{t}$	$\displaystyle=\alpha_{t}s_{t-1}+w_{t}\quad$	$\displaystyle w_{t}\sim\mathcal{N}(0,I_{n})$		(18)
	$\displaystyle v_{t,i}$	$\displaystyle=k_{t}^{\top}s_{t}+\mu_{t},\quad$	$\displaystyle\mu_{t}\sim\mathcal{N}(0,r_{t}),$		(18)

where $\alpha_{t}\in[0,1]$ is a learned decay factor controlling how much past information to retain. This corresponds to setting $A_{t}=\alpha_{t}I_{n}$ in (LGM). When $\alpha_{t}\to 0$ , the state "forgets" the past completely; when $\alpha_{t}\to 1$ , the state is fully retained.

Under the identity covariance assumption $\Sigma_{t}=I_{n}$ , the Kalman gain becomes:

	$\displaystyle G_{t}=\frac{(\alpha_{t}^{2}+1)k_{t}}{(\alpha_{t}^{2}+1)k_{t}^{\top}k_{t}+r_{t}}=\frac{k_{t}}{1+r_{t}/(\alpha_{t}^{2}+1)},$		(19)

where the second equality again assumed unit-normalized keys (as in DeltaNet). Defining $\beta_{t}~=~(1+\frac{r_{t}}{\alpha_{t}^{2}+1})^{-1}$ and substituting into the state update (12) yields:

	$\displaystyle\hat{s}_{t}$	$\displaystyle=\alpha_{t}\hat{s}_{t-1}+\beta_{t}k_{t}(v_{t,i}-k_{t}^{\top}\Big[\alpha_{t}\hat{s}_{t-1}\Big]),$		(GDN)
		$\displaystyle=\Big[I_{n}-\beta_{t}k_{t}k_{t}^{\top}\Big]\alpha_{t}\hat{s}_{t-1}+\beta_{t}k_{t}v_{t,i},$		(GDN)

which recovers the GDN recurrence. In practice, $\beta_{t}$ is an input-dependent learnable parameter.

Like DeltaNet, GDN avoids tracking the evolving uncertainty $\Sigma_{t}$ , trading optimality for computational simplicity. The key difference is that GDN’s explicit forgetting factor $\alpha_{t}$ provides additional control over the memory horizon. However, by fixing $\Sigma_{t}=I_{n}$ , GDN still ignores how measurement history should optimally influence the Kalman gain, leading to suboptimal performance compared to GKA (see Section˜5).
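The GDN correspondence can be verified numerically by plugging $A_{t}=\alpha_{t}I_{n}$, $Q_{t}=I_{n}$ and $\Sigma_{t-1}=I_{n}$ into the general gain (13) and update (12). The sketch below uses arbitrary toy values for $\alpha_{t}$, $r_{t}$ and $v_{t,i}$; the helper name `kalman_gain` is an assumption for illustration.

```python
import numpy as np

def kalman_gain(Sigma_prev, A, Q, k, r):
    """General Kalman gain, Eq. (13), for a scalar measurement."""
    P = A @ Sigma_prev @ A.T + Q
    return P @ k / (k @ P @ k + r)

n = 4
rng = np.random.default_rng(1)
k = rng.standard_normal(n)
k /= np.linalg.norm(k)                 # unit-normalized key
alpha, r, v = 0.9, 0.3, 1.7            # hypothetical decay, noise variance, measurement
s_prev = rng.standard_normal(n)
I = np.eye(n)

# KF quantities under the GDN assumptions: A = alpha I, Q = I, Sigma_{t-1} = I.
G = kalman_gain(I, alpha * I, I, k, r)
s_kf = alpha * s_prev + G * (v - k @ (alpha * s_prev))

# GDN recurrence with beta = (1 + r / (alpha^2 + 1))^{-1}.
beta = 1.0 / (1.0 + r / (alpha**2 + 1.0))
s_gdn = (I - beta * np.outer(k, k)) @ (alpha * s_prev) + beta * k * v
```

Both the gain and the updated state agree, and neither depends on any key other than the current one, which illustrates how fixing $\Sigma_{t}$ discards the measurement history.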

Kimi Delta Attention (KDA) [team2025kimi] further extends GDN by using channel-specific decay factors $\alpha_{t,i}$ in place of the global $\alpha_{t}$ . This allows different channels to have independent memory horizons. In the KF framework, this corresponds to:

	$\displaystyle s_{t,i}=\alpha_{t,i}s_{t-1,i}+w_{t,i},\quad w_{t,i}\sim\mathcal{N}(0,I_{n}),$		(20)

for each channel $i$ . While this added flexibility can improve expressiveness, KDA still assumes $\Sigma_{t}=I_{n}$ and therefore does not optimally consider the entire past when computing its state update. Like DeltaNet and GDN, KDA sacrifices optimality for computational simplicity.

7 Discussions and Limitations

Thanks to its expressive test-time ridge regression objective, Gated KalmaNet extends previous fading memory layers like Mamba2, LongHorn and Gated DeltaNet, all of which only depend on an instantaneous test-time objective. However, GKA is only optimal among linear memory layers; solving our test-time objective with non-linear updates while still maintaining hardware efficiency and numerical stability is an interesting area for future research. Despite the efficient kernels we implemented, we believe even faster implementations of our idea are possible, e.g., via sketching (see Appendix H for preliminary results). Finally, while we have shown promising results in combining GKA with Attention layers into Hybrid models (Appendix I), further scaling beyond 3B-parameter models is required to validate GKA on more challenging real-world problems.

Appendix A Related Work

Since the introduction of Self-Attention [Vaswani-NeurIPS2017], significant research has been conducted to reduce its quadratic cost in processing long input sequences. As models and systems scale to million-token contexts, Attention’s bottlenecks have become critical blockers to frontier agentic applications in coding, information gathering, and scientific discovery [chen2024scienceagentbench, cui2025curie, jimenez2023swe]. Prior works have proposed various approximation schemes to overcome these limitations. For example, Reformer [kitaev2020reformer] uses locality-sensitive hashing to group tokens with similar embeddings. This enables the model to attend only to a subset of tokens rather than the entire sequence. Other works equip Transformer models with "compressed" memory tokens that are updated dynamically and causally over sliding windows on entire sequence chunks [dai2019transformerxl, munkhdalai2024leave, mohtashami2023landmark]. While much prior work has focused on reducing the quadratic complexity of Attention with sparse approximations [nunez2024expansion, yuan2025NSA], this work focuses on linear approximations of Attention.

A.1 Linear Attention

Linear Attention methods approximate the Attention mechanism with constant-size recurrent dynamical systems [Dao-ICML2024-mamba2, Yang-ICML2024-gla, beck2024xlstm, Yang-ICLR2025]. Numerous State-Space Model (SSM) variations have been proposed, ranging from those closely resembling Linear Attention [sun2023retentive] or Linear Time-Invariant dynamical systems [gu2021combining, zancato2022stacked], to those introducing novel adaptive or gated state updates [Yang-ICML2024-gla, Dao-ICML2024-mamba2, orvieto2023resurrecting].

Despite their differences, all SSMs follow the same basic working principle inspired by classical state-space models [Kalman-1960]: they process the input sequence by maintaining a fixed-size state that acts as a compressed (lossy) representation of all processed tokens. Moreover, when implemented in hardware, the state must have finite precision and “fades away the past” as more samples are processed. Successful SSM layers typically employ hardware-aware implementations that efficiently utilize modern matrix multiplication accelerators through highly parallelizable and scalable primitives, including associative scans [gu2023mamba, de2024griffin], chunking mechanisms [Dao-ICML2024-mamba2, Yang-ICML2024-gla], and techniques that avoid materializing the entire state in slow high-bandwidth memory [gu2023mamba].

From a modeling perspective, most Linear Attention implementations introduce data-dependent gating factors to control the speed of their “fading” memory, balancing expressivity with scalability. For example, the transition from Mamba to Mamba2 replaced channel-wise data-dependent gating with head-wise gating for better scalability and Tensor Cores utilization. Input-dependent Gating has been shown to empirically improve training stability [arora2023zoology, Yang-ICLR2025] and has driven the development of Linear Attention models (e.g., from S4 [alber_gu_s4] to Mamba [gu2023mamba] and from DeltaNet [Yang-DeltaNet] to Gated DeltaNet [Yang-ICLR2025]). In our work, we demonstrate that gating emerges naturally as a consequence of solving a weighted least squares objective function, establishing a connection to the favorable numerical properties classically described in the adaptive filtering literature [LJUNG_RLS_stability, sayed2003fundamentals, sayed2011adaptive].

A.2 Hybrid State Space Attention Models

While extending the recurrent state in SSM layers has yielded performant models, they typically underperform on tasks requiring recall of information from the distant past [waleffe2024empirical, jelassi2024repeat]. Hybrid State-Space Models address this limitation by complementing SSMs’ “fading” state with Attention layers [dao2024transformers, de2024griffin, lieber2024jamba, glorioso2024zamba]. Early architectures simply stacked SSMs and Attention layers with different blending ratios [waleffe2024empirical, gu2023mamba, Dao-ICML2024-mamba2] or replaced full Attention layers with Sliding Window Attention [de2024griffin]. More sophisticated designs have recently emerged [glorioso2024zamba, Zancato-NeurIPS2024-BMOJO].

Notably, B’MOJO [Zancato-NeurIPS2024-BMOJO] complements SSMs’ fading state with "eidetic" memory by combining SSMs with Sliding Window Attention (SWA) in a single layer. Within the window, tokens can attend to a selected set of past tokens that were deemed difficult to predict using an asynchronous causal selection mechanism. B’MOJO was the first hybrid model to propose a parallel fusion of SSM and SWA at the layer level. Subsequent works [dong2024hymba, bae2025hybrid] have shown this parallel fusion approach to be more performant (at equivalent compute) than the stacked approach of earlier works.

Thanks to their lower memory footprint and test-time scalability over long sequences, Hybrid architectures are expanding into long-range agentic tasks and have recently been trained with Reinforcement Learning at scale [chen2025minimax]. When coupled with system-level optimizations like prefix caching [pan2024marconi] and specialized inference engines [kwon2023efficient], Hybrid models can increase the number of rollouts (exploration), thereby improving end-to-end performance in Reinforcement Learning loops.

Appendix B Forward and Backward Passes of Chebyshev Iteration (Details)

In Section 4.2 we described our chunk-wise implementation of the CH method with adaptive regularization and gating. We now give full details omitted there.

B.1 Forward Pass

CH in Detail. We begin by describing the CH method (Algorithm 1) in more detail. Assume we have a linear system of equations $H\xi=q$ where $H$ is a $D\times D$ positive definite matrix. We assume all eigenvalues of $H$ lie in the interval $[\mu,L]$ and that the values of $\mu$ and $L$ are known. Note that solving this system is equivalent to solving the following quadratic problem:

\displaystyle\min_{\xi\in\mathbb{R}^{D}}\ \frac{1}{2}\xi^{\top}H\xi-\xi^{\top}q.

(21)

The classic Chebyshev Iteration in its standard form is presented in Algorithm 1. In the initialization phase, we set $\rho=\frac{L-\mu}{L+\mu}$, which is the typical convergence rate of gradient descent applied to the above quadratic problem with stepsize $\frac{2}{L+\mu}$; roughly speaking, in this setting, this stepsize choice is optimal (i.e., it allows gradient descent to converge as fast as possible). Algorithm 1 initializes two points, $\xi_{-1}$ and $\xi_{0}$. Here $\xi_{-1}$ is zero, and $\xi_{0}$ is a gradient step for Eq. 21 starting at $\xi_{-1}$ with stepsize $\frac{2}{L+\mu}$. The final component of the initialization is the weight $\omega_{0}=2$. This is the starting point for the weight schedule recursion of $\omega_{i}$ in Eq. (weight schedule). Similarly, $\xi_{-1},\xi_{0}$ are the starting points for computing $\xi_{i}$, whose update consists of Eq. (grad descent) and Eq. (momentum). Note that Eq. (grad descent) uses stepsize $2\cdot\omega_{i}/(L+\mu)$. Since $\omega_{i}>1$, this stepsize is strictly larger than $2/(L+\mu)$, the latter being the optimal stepsize for vanilla gradient descent. Such a large stepsize alone might not guarantee convergence, but it is balanced by the momentum term $\xi_{i-1}-\xi_{i-2}$ in Eq. (momentum) with positive weight $\omega_{i}-1$, so that the convergence of the Chebyshev iterative method is ensured.
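The description above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the chunk-wise kernel of the paper: the function names `matvec` and `chebyshev_solve`, the toy matrix, and the eigenvalue bounds are assumptions made for this example, and the update is written in the combined form $\xi_{i}=\xi_{i-1}-\frac{2\omega_{i}}{L+\mu}(H\xi_{i-1}-q)+(\omega_{i}-1)(\xi_{i-1}-\xi_{i-2})$.

```python
def matvec(H, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def chebyshev_solve(H, q, mu, L, r):
    """Approximate the solution of H xi = q, assuming eig(H) lies in [mu, L]."""
    rho = (L - mu) / (L + mu)
    step = 2.0 / (L + mu)
    xi_prev = [0.0] * len(q)              # xi_{-1} = 0
    xi = [step * qi for qi in q]          # xi_0: one gradient step from xi_{-1}
    omega = 2.0                           # omega_0
    for _ in range(r):
        omega = 4.0 / (4.0 - rho ** 2 * omega)   # weight schedule
        Hxi = matvec(H, xi)
        xi_next = [
            x - step * omega * (hx - qi)         # gradient step, stepsize 2*omega/(L+mu)
            + (omega - 1.0) * (x - xp)           # momentum with weight omega - 1
            for x, xp, hx, qi in zip(xi, xi_prev, Hxi, q)
        ]
        xi_prev, xi = xi, xi_next
    return xi


# Toy problem (assumed): Gershgorin discs place the eigenvalues of H in [0.8, 2.5].
H = [[2.0, 0.5, 0.0], [0.5, 1.5, 0.2], [0.0, 0.2, 1.0]]
q = [1.0, -2.0, 0.5]
xi = chebyshev_solve(H, q, mu=0.8, L=2.5, r=30)
residual = max(abs(h - qi) for h, qi in zip(matvec(H, xi), q))
print(residual)
```

Each iteration needs a single product with $H$, which is what turns into a matrix-matrix multiplication in a batched implementation.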

Numerical Stability Considerations. Now we analyze the numerical properties of the Chebyshev Iteration. The major computation consists of matrix-vector multiplication; in a batched parallel implementation, this becomes matrix-matrix multiplication. For this, the numerical accuracy is well controlled (e.g., in Triton we can specify the accuracy in tl.dot). The update of $\omega_{i}$ in Eq. (weight schedule) might raise numerical concerns as it involves division. That said, we show that this division operates in a numerically well-behaved range, as $\omega_{i}$ is decreasing in $i$ yet lower bounded by $1$:

Lemma 3.

For any $r$ , we have $2=\omega_{0}\geq\cdots\geq\omega_{r}\geq\omega^{*}_{1}>1$ , where $\omega^{*}_{1}$ is defined as

\displaystyle\omega^{*}_{1}:=\frac{2(1-\sqrt{1-\rho^{2}})}{\rho^{2}}.

As a consequence, we have $4-\rho^{2}\omega_{i}\in[2,4]$ for all $i=0,\dots,r$ .

Proof.

If $L=\mu$, then $H$ is a scaled identity matrix, and the algorithm simplifies considerably. So we assume $L>\mu$ in what follows. With $L>\mu>0$ we have $\rho\in(0,1)$. Since $\omega_{0}=2$, we have $4-\rho^{2}\omega_{0}\geq 2$ and therefore $0<\omega_{1}\leq 2$. Repeating this argument, we see that $\omega_{i}\in(0,2]$ for all $i$. By the definition of $\omega_{i}$, showing $\omega_{i}\leq\omega_{i-1}$ amounts to showing

\displaystyle\frac{4}{4-\rho^{2}\omega_{i-1}}\leq\omega_{i-1}\Leftrightarrow g(\omega_{i-1})\leq 0

where $g$ is defined as $g(\omega)=\rho^{2}\omega^{2}-4\omega+4$. Note that $g(\omega)$ has two roots, $\omega_{1}^{*}$, as defined earlier, and $\omega_{2}^{*}=\frac{2(1+\sqrt{1-\rho^{2}})}{\rho^{2}}$; $\omega_{1}^{*},\omega_{2}^{*}$ are the two fixed points of the update Eq. (weight schedule). Observe that $\omega_{0}=2$ lies in the interval $(\omega_{1}^{*},\omega_{2}^{*})$. Moreover, for any $i\geq 1$, if $\omega_{i-1}>\omega_{1}^{*}$ we must have

\displaystyle\omega_{i}=\frac{4}{4-\rho^{2}\omega_{i-1}}>\frac{4}{4-\rho^{2}\omega_{1}^{*}}=\omega_{1}^{*}.

This proves $\omega_{i}>\omega_{1}^{*}$ for all $i=1,\dots,r$. Next, since $\omega_{0}=2$ lies in the interval $(\omega_{1}^{*},\omega_{2}^{*})$, where $g(\omega)<0$, we have $\omega_{1}\leq\omega_{0}$. Thus $\omega_{1}$ lies in $(\omega_{1}^{*},\omega_{2}^{*})$ again. We can then conclude inductively that $\omega_{1}^{*}<\omega_{i}\leq\omega_{i-1}$ for all $i=1,\dots,r$. ∎
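As a quick numerical illustration of Lemma 3, the sketch below runs the weight recursion $\omega_{i}=\frac{4}{4-\rho^{2}\omega_{i-1}}$ with $\omega_{0}=2$ and compares it against $\omega_{1}^{*}$. The helper name `omega_schedule` and the values $\mu=0.02$, $L=1.02$, $r=30$ are choices made for this example only.

```python
import math


def omega_schedule(mu, L, r):
    """Return [omega_0, ..., omega_r] and the limit omega_1^* from Lemma 3."""
    rho2 = ((L - mu) / (L + mu)) ** 2
    omegas = [2.0]
    for _ in range(r):
        omegas.append(4.0 / (4.0 - rho2 * omegas[-1]))
    omega_star = 2.0 * (1.0 - math.sqrt(1.0 - rho2)) / rho2
    return omegas, omega_star, rho2


omegas, omega_star, rho2 = omega_schedule(mu=0.02, L=1.02, r=30)
denominators = [4.0 - rho2 * w for w in omegas]   # the quantity being divided by
print(omegas[:4], omega_star)
print(min(denominators), max(denominators))
```

In this example the denominators stay well away from zero, which is the property the lemma guarantees in general.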

From Lemma 3 we know that the update of $\omega_{i}$ in Eq. (weight schedule) would not create much numerical concern in a forward pass, as we have $\omega_{i}\in[1,2]$ for all $i$. Furthermore, we can bound the rate at which $\omega_{i}$ converges to $\omega_{1}^{*}$:

Lemma 4.

Define $\kappa:=\frac{L}{\mu}$ . For any $i=1,\dots,r$ , we have

\displaystyle(\omega_{i}-\omega_{1}^{*})\leq R^{i}\cdot(\omega_{0}-\omega_{1}^{*}),

where $R$ is defined as

\displaystyle R:=\frac{\kappa-1}{\kappa+1}\cdot\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}.

Proof.

From the update rule of $\omega_{i}$ in Eq.˜weight schedule and the fixed point property of $\omega_{1}^{*}$ , we have

	$\displaystyle\omega_{i}-\omega_{1}^{*}$	$\displaystyle=\frac{4}{4-\rho^{2}\omega_{i-1}}-\frac{4}{4-\rho^{2}\omega_{1}^{*}}$
		$\displaystyle=\frac{4\rho^{2}}{(4-\rho^{2}\omega_{i-1})(4-\rho^{2}\omega_{1}^{*})}\cdot(\omega_{i-1}-\omega_{1}^{*})$
		$\displaystyle\overset{\text{(i)}}{=}\frac{\rho^{2}\omega_{1}^{*}}{4-\rho^{2}\omega_{i-1}}\cdot(\omega_{i-1}-\omega_{1}^{*})$
		$\displaystyle\overset{\text{(ii)}}{\leq}\frac{\rho^{2}\omega_{i-1}\omega_{1}^{*}}{4}\cdot(\omega_{i-1}-\omega_{1}^{*})$
		$\displaystyle\overset{\text{(iii)}}{\leq}\left(1-\sqrt{1-\rho^{2}}\right)\cdot(\omega_{i-1}-\omega_{1}^{*})$
		$\displaystyle\overset{\text{(iv)}}{=}\left(\frac{\kappa-1}{\kappa+1}\cdot\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)\cdot(\omega_{i-1}-\omega_{1}^{*})$

Here, (i) follows from the fact that $\omega_{1}^{*}$ is a fixed point, (ii) follows from the identity $\frac{1}{4-\rho^{2}\omega_{i-1}}=\frac{\omega_{i}}{4}$ together with $\omega_{i}\leq\omega_{i-1}$ (Lemma 3), (iii) follows from the definition of $\omega_{1}^{*}$ and the fact that $\omega_{i-1}\leq 2$, and (iv) follows from the definitions of $\kappa$ and $\rho$. The proof is concluded by unrolling the above recurrence. ∎

Remark 1.

Here, we call $R$ the linear convergence rate (or contraction factor) of $\omega_{i}$ to $\omega_{1}^{*}$ . First-order methods for solving $H\xi=q$ converge at most at a rate $R_{a}:=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ , and we see $\omega_{i}$ converges at an even faster rate. Numerically, assuming $\kappa=\frac{L}{\mu}=\frac{1.02}{0.02}=51$ , we then have:

	$\displaystyle R$	$\displaystyle\approx 0.7253,\quad R^{5}\approx 0.2,\quad R^{10}\approx 0.04,\quad R^{20}\approx 0.0016,\quad R^{30}\approx 6\times 10^{-5}$
	$\displaystyle R_{a}$	$\displaystyle\approx 0.7543,\quad R_{a}^{5}\approx 0.244,\quad R_{a}^{10}\approx 0.0597,\quad R_{a}^{20}\approx 0.0036,\quad R_{a}^{30}\approx 0.0002.$

Thus, with $\kappa=51$, the update of $\omega_{i}$ in Eq. (weight schedule) converges in at most 20 iterations up to bfloat16 precision.
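The numbers in Remark 1 can be reproduced with a short script. The sketch below is illustrative only: `contraction_factors` is a name introduced here, and comparing $R^{20}$ with $2^{-8}$ (the spacing of bfloat16 values just below 1) is our reading of "bfloat16 precision", not a statement from the paper. It also compares the observed gap $\omega_{i}-\omega_{1}^{*}$ with the bound of Lemma 4.

```python
import math


def contraction_factors(kappa):
    """Return (R, R_a) from Lemma 4 and Remark 1 for condition number kappa."""
    root = math.sqrt(kappa)
    R_a = (root - 1.0) / (root + 1.0)
    R = (kappa - 1.0) / (kappa + 1.0) * R_a
    return R, R_a


kappa = 51.0
R, R_a = contraction_factors(kappa)

# Observed gap omega_i - omega_1^* versus the bound R^i * (omega_0 - omega_1^*).
rho2 = ((kappa - 1.0) / (kappa + 1.0)) ** 2
omega_star = 2.0 * (1.0 - math.sqrt(1.0 - rho2)) / rho2
omega, gaps = 2.0, []
for i in range(1, 31):
    omega = 4.0 / (4.0 - rho2 * omega)
    gaps.append((i, omega - omega_star, R ** i * (2.0 - omega_star)))

for i, gap, bound in gaps:
    if i in (5, 10, 20, 30):
        print(i, gap, bound, R_a ** i)
```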

B.2 Backward Pass

We now give details for backpropagation through the Chebyshev Iteration (Algorithm 1) via implicit differentiation.

Computing $\frac{dL}{dq_{t}}$ and $\frac{dL}{dk_{t}}$. First, we follow Table 1 and Lemma 2, and compute $dq_{t}$. Then, given the equation $(H_{t}+\lambda_{t}I)dq_{t}=dx_{t}^{*}$, we have that

\displaystyle d(H_{t}+\lambda_{t}I)=-dq_{t}(x_{t}^{*})^{\top}.

(22)

Therefore $d\lambda_{t}=\text{tr}(d(H_{t}+\lambda_{t}I))=-(x_{t}^{*})^{\top}dq_{t}$ . Since we set $\lambda_{t}=a\cdot\|H_{t}\|_{\text{F}}$ , this indicates

\displaystyle dH_{t}=-dq_{t}(x_{t}^{*})^{\top}-a\cdot\frac{H_{t}}{\|H_{t}\|_{\text{F}}}\cdot\left((x_{t}^{*})^{\top}dq_{t}\right).

(23)

Note that this expression of $dH_{t}$ is partial: it accounts only for the upstream gradient that flows through $dq_{t}$, whereas all subsequent states also depend on $H_{t}$. We will accumulate these gradients later when needed.

Now, the recursion of $H_{t}$ in Eq. (CH) implies

	$\displaystyle dk_{i}$	$\displaystyle=\sum_{t\geq i}\left(dH_{t}+(dH_{t})^{\top}\right)k_{i}\cdot\frac{\zeta_{t}}{\zeta_{i}}$		(24)
		$\displaystyle=\sum_{t\geq i}\frac{\zeta_{t}}{\zeta_{i}}\left(-dq_{t}\otimes(x_{t}^{*})^{\top}-x_{t}^{*}\otimes(dq_{t})^{\top}+w_{t}H_{t}\right)k_{i},$		(25)

which proves Lemma 1. We refer the reader to Section B.2.1 for more detailed derivations of $dq_{t}$ and $dk_{t}$.

Derivatives for Gating. In practice we often parameterize $\gamma_{t}$ in the log space to ensure numerical stability. Thus, let us first revise our notations for this case. Let $g_{t}=\log\gamma_{t}$ and $G_{t}:=\sum_{i=1}^{t}g_{i}=\log\left(\prod_{i=1}^{t}\gamma_{i}\right)$ . Then the mask matrix $M$ is

\displaystyle M_{i,j}=\begin{cases}\exp(G_{j}-G_{i})&j\geq i;\\ 0&\text{otherwise}.\end{cases}

(26)

Now, since for any $c=1,\dots,C$ we have

\displaystyle H_{c}=\exp(G_{c})\cdot H_{0}+\sum_{j=1}^{c}\exp(G_{c}-G_{j})\cdot k_{j}k_{j}^{\top},

(27)

for any $G_{i}$ we have the following basic derivatives:

	$\displaystyle\frac{dH_{c}}{dG_{i}}$	$\displaystyle=\begin{cases}0&c<i;\\ \exp(G_{i})\cdot H_{0}+\sum_{j=1}^{i-1}\exp(G_{i}-G_{j})\cdot k_{j}k_{j}^{\top}&c=i;\\ -\exp(G_{c}-G_{i})k_{i}k_{i}^{\top}&c>i,\end{cases}\quad\quad i=1,\dots,C,\quad c=1,\dots,C;$		(28)
	$\displaystyle\frac{dH_{C+1}}{dG_{i}}$	$\displaystyle=-\exp(G_{C+1}-G_{i})k_{i}k_{i}^{\top}\quad\quad\quad i=1,\dots,C$		(29)
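The case analysis in Eq. 28 can be sanity-checked numerically. The sketch below is a toy check under stated assumptions: it uses the scalar case $D=1$ (so $k_{j}k_{j}^{\top}$ becomes $k_{j}^{2}$), treats $G_{1},\dots,G_{C}$ as independent variables as in Eq. 28, and uses made-up values for $G$, $k$ and $H_{0}$. The helper names are hypothetical.

```python
import math


def H_state(G, k, H0, c):
    """Eq. 27 for D = 1; G[1..C] and k[1..C] are 1-indexed (index 0 is padding)."""
    return math.exp(G[c]) * H0 + sum(
        math.exp(G[c] - G[j]) * k[j] ** 2 for j in range(1, c + 1))


def dH_dG(G, k, H0, c, i):
    """Closed-form derivative of H_c with respect to G_i (Eq. 28, D = 1)."""
    if c < i:
        return 0.0
    if c == i:
        return math.exp(G[i]) * H0 + sum(
            math.exp(G[i] - G[j]) * k[j] ** 2 for j in range(1, i))
    return -math.exp(G[c] - G[i]) * k[i] ** 2


def finite_difference(G, k, H0, c, i, eps=1e-6):
    """Central difference of H_c in G_i, with all other G's held fixed."""
    up, down = list(G), list(G)
    up[i] += eps
    down[i] -= eps
    return (H_state(up, k, H0, c) - H_state(down, k, H0, c)) / (2.0 * eps)


G = [0.0, -0.1, -0.3, -0.6, -1.0]   # hypothetical prefix sums G_1..G_4
k = [0.0, 0.5, -1.2, 0.8, 0.3]      # hypothetical scalar keys k_1..k_4
H0 = 0.7
C = len(G) - 1
max_err = max(abs(dH_dG(G, k, H0, c, i) - finite_difference(G, k, H0, c, i))
              for c in range(1, C + 1) for i in range(1, C + 1))
print(max_err)
```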

With $dH_{C+1}$ being the aggregated gradient from the future, we have for $i=1,\dots,C$ that

$\displaystyle dG_{i}$	$\displaystyle=\sum_{c=i}^{C+1}\langle dH_{c},\frac{dH_{c}}{dG_{i}}\rangle$	(30)
	$\displaystyle=e^{G_{i}}\langle dH_{i},H_{0}\rangle+\sum_{j=1}^{i-1}e^{G_{i}-G_{j}}\langle dH_{i},k_{j}k_{j}^{\top}\rangle-\sum_{c=i+1}^{C}e^{G_{c}-G_{i}}\langle dH_{c},k_{i}k_{i}^{\top}\rangle-e^{G_{C+1}-G_{i}}\langle dH_{C+1},k_{i}k_{i}^{\top}\rangle$	(31)
	$\displaystyle=e^{G_{i}}\langle dH_{i},H_{0}\rangle+\sum_{j=1}^{i}e^{G_{i}-G_{j}}\langle dH_{i},k_{j}k_{j}^{\top}\rangle-\sum_{c=i}^{C}e^{G_{c}-G_{i}}\langle dH_{c},k_{i}k_{i}^{\top}\rangle-e^{G_{C+1}-G_{i}}\langle dH_{C+1},k_{i}k_{i}^{\top}\rangle$	(32)
$\displaystyle dG_{C+1}$	$\displaystyle=e^{G_{C+1}}\langle dH_{C+1},H_{0}\rangle+\sum_{j=1}^{C}e^{G_{C+1}-G_{j}}\langle dH_{C+1},k_{j}k_{j}^{\top}\rangle$	(33)

Note that in Eq. 32 we add and subtract the term $\langle dH_{i},k_{i}k_{i}^{\top}\rangle$, which simplifies the implementation.

Recall that $dH_{t}=-dq_{t}(x_{t}^{*})^{\top}-\frac{1}{2}\cdot w_{t}H_{t}$ with $w_{t}=\frac{2a\cdot(x_{t}^{*})^{\top}dq_{t}}{\|H_{t}\|_{\text{F}}}$. In computing the derivatives with respect to $G_{i}$, the first term $dq_{t}(x_{t}^{*})^{\top}$ is the standard term that also arises for Eq. (Linear-SSM), so we omit it here. We now focus on the second term $\frac{1}{2}\cdot w_{t}H_{t}$. This implies that the gradients $dG_{i}$ and $dG_{C+1}$ are partly given, respectively, by (using the notation of Section 4.2.2 and omitting some algebraic operations)

\displaystyle\frac{1}{2}\cdot\langle w_{i}H_{i},H_{i}\rangle-\frac{1}{2}\cdot k_{i}^{\top}B_{i}k_{i}\quad\quad\text{and}\quad\quad\frac{1}{2}\cdot e^{G_{C}}\langle\widetilde{B}_{C+1},H_{0}\rangle+\frac{1}{2}\cdot\sum_{j=1}^{C}e^{G_{C}-G_{j}}\cdot k_{j}^{\top}\widetilde{B}_{C+1}k_{j}.

(34)

Computing the first term $\langle w_{i}H_{i},H_{i}\rangle$ in parallel is easy by invoking the definition of $w_{i}$ and the Frobenius norm of $H_{i}$ that we stored during the forward pass. Computing the quadratic terms $k_{i}^{\top}B_{i}k_{i}$ and $k_{j}^{\top}\widetilde{B}_{C+1}k_{j}$ in parallel is also easy and follows from our computation of $B_{i}k_{i}$ and $\widetilde{B}_{C+1}k_{i}$ for $dk_{i}$ in Section 4.2.2. Computing $\langle\widetilde{B}_{C+1},H_{0}\rangle$ is easy since we recompute the initial states $H_{0}$ of each chunk and have them available during the backward pass, while $\widetilde{B}_{C+1}$ is updated backwards in a for loop.

B.2.1 Computing $\frac{dL}{dq_{t}}$ and $\frac{dL}{dk_{t}}$ .

In the forward pass we solve

(H_{t}+\lambda_{t}I)x_{t}=q_{t}

	$\displaystyle x_{t}$	$\displaystyle=(H_{t}+\lambda_{t}I)^{-1}q_{t}$		(35)
	$\displaystyle\implies dx_{t}$	$\displaystyle=\underset{J_{q_{t}\to x_{t}}}{\underbrace{(H_{t}+\lambda_{t}I)^{-1}}}dq_{t}$		(35)

Recall that the gradient is the transpose of the Jacobian applied to the upstream gradient. Since $(H_{t}+\lambda_{t}I)^{-1}$ is symmetric, we obtain

\frac{dL}{dq_{t}}=(H_{t}+\lambda_{t}I)^{-1}\frac{dL}{dx_{t}}.

(36)

Thus, we can obtain $\frac{dL}{dq_{t}}$ by running a Chebyshev iteration to solve (for $z$ ) the linear system of equations

(H_{t}+\lambda_{t}I)z=\frac{dL}{dx_{t}}.

Now we have

$\displaystyle dx_{t}$	$\displaystyle=d\big[(H_{t}+\lambda_{t}I)^{-1}\big]q_{t}$	(37)
	$\displaystyle=-(H_{t}+\lambda_{t}I)^{-1}d(H_{t}+\lambda_{t}I)(H_{t}+\lambda_{t}I)^{-1}q_{t}$
	$\displaystyle=-(H_{t}+\lambda_{t}I)^{-1}d(H_{t}+\lambda_{t}I)x_{t}$
	$\displaystyle=(x_{t}^{\top}\otimes-(H_{t}+\lambda_{t}I)^{-1})\textrm{vec}(d(H_{t}+\lambda_{t}I))$

In the last equality we have used the identity $\textrm{vec}(ABC)=(C^{\top}\otimes A)\textrm{vec}(B)$ .

Now we will compute the Jacobian of $\lambda$ with respect to $H_{t}$ :

$\displaystyle\lambda_{t}$	$\displaystyle=a\|H_{t}\|_{F}$	(38)
	$\displaystyle=a\sqrt{\textrm{Tr}(H_{t}^{\top}H_{t})}$
$\displaystyle\implies d\lambda_{t}$	$\displaystyle=a\,d\Big(\sqrt{\textrm{Tr}(H_{t}^{\top}H_{t})}\Big)$
	$\displaystyle=\frac{a}{2\|H_{t}\|_{F}}\textrm{Tr}((dH_{t})^{\top}H_{t}+H_{t}^{\top}(dH_{t}))$
	$\displaystyle=\frac{a}{\|H_{t}\|_{F}}\textrm{vec}(H_{t})^{\top}d\textrm{vec}(H_{t})$

Substituting (38) into (37) gives

\displaystyle dx_{t}=(x_{t}^{\top}\otimes-(H_{t}+\lambda_{t}I)^{-1})\Big(\textrm{vec}(dH_{t})+\frac{a}{\|H_{t}\|_{F}}\textrm{vec}(I)\textrm{vec}(H_{t})^{\top}\textrm{vec}(dH_{t})\Big)

(39)

Thus, we can obtain $\textrm{vec}(\frac{dL}{dH_{t}})$ as,

\textrm{vec}(\frac{dL}{dH_{t}})=(x_{t}\otimes-(H_{t}+\lambda_{t}I)^{-1})\frac{dL}{dx_{t}}+\frac{a}{||H_{t}||_{F}}\textrm{vec}(H_{t})\textrm{vec}(I)^{\top}(x_{t}\otimes-(H_{t}+\lambda_{t}I)^{-1})\frac{dL}{dx_{t}}

(40)

Substituting from (36),

	$\displaystyle\textrm{vec}(\frac{dL}{dH_{t}})$	$\displaystyle=-(x_{t}\otimes\frac{dL}{dq_{t}})-\frac{a}{\|H_{t}\|_{F}}\textrm{vec}(H_{t})\textrm{vec}(I)^{\top}(x_{t}\otimes\frac{dL}{dq_{t}})$		(41)
		$\displaystyle=-(x_{t}\otimes\frac{dL}{dq_{t}})-\frac{a}{\|H_{t}\|_{F}}\textrm{vec}(H_{t})\langle\frac{dL}{dq_{t}},x_{t}\rangle$		(41)
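In matrix form, Eq. 41 reads $\frac{dL}{dH_{t}}=-\frac{dL}{dq_{t}}x_{t}^{\top}-\frac{a}{\|H_{t}\|_{F}}\langle\frac{dL}{dq_{t}},x_{t}\rangle H_{t}$. The sketch below checks this against central finite differences on a $2\times 2$ toy problem. Everything in it is an assumption made for illustration: the loss $L=\langle c,x_{t}\rangle$ (so that $\frac{dL}{dx_{t}}=c$), the numerical values, the exact $2\times 2$ solve standing in for the Chebyshev Iteration, and the helper names.

```python
import math


def solve2(A, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def fro(H):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(h * h for row in H for h in row))


def regularized(H, a):
    """Return H + lambda * I with lambda = a * ||H||_F."""
    lam = a * fro(H)
    return [[H[0][0] + lam, H[0][1]], [H[1][0], H[1][1] + lam]]


def loss(H, q, c, a):
    """Toy loss <c, x> with x = (H + a * ||H||_F * I)^{-1} q."""
    x = solve2(regularized(H, a), q)
    return c[0] * x[0] + c[1] * x[1]


def grad_H(H, q, c, a):
    """Eq. 41 in matrix form, with upstream gradient dL/dx = c."""
    A = regularized(H, a)
    x = solve2(A, q)
    dq = solve2(A, c)                                  # Eq. 36
    s = a / fro(H) * (dq[0] * x[0] + dq[1] * x[1])
    return [[-dq[i] * x[j] - s * H[i][j] for j in range(2)] for i in range(2)]


def finite_difference(H, q, c, a, eps=1e-6):
    """Central differences of the loss in every entry of H."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            up, down = [row[:] for row in H], [row[:] for row in H]
            up[i][j] += eps
            down[i][j] -= eps
            grad[i][j] = (loss(up, q, c, a) - loss(down, q, c, a)) / (2.0 * eps)
    return grad


H = [[1.0, 0.3], [0.3, 0.5]]
q, c, a = [0.4, -0.7], [1.0, 2.0], 0.02
analytic, numeric = grad_H(H, q, c, a), finite_difference(H, q, c, a)
max_err = max(abs(analytic[i][j] - numeric[i][j]) for i in range(2) for j in range(2))
print(max_err)
```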

Now, with gating, we have $H_{t}=\gamma_{t}H_{t-1}+k_{t}k_{t}^{\top}$, which can be unrolled as

H_{l}=\sum_{i=0}^{l}\Big(\prod_{j=i}^{l-1}\gamma_{j}\Big)k_{i}k_{i}^{\top}

(42)

We now compute $\frac{dL}{dk_{l}}$, to which every state $H_{t}$ with $t\geq l$ contributes:

\displaystyle\frac{dL}{dk_{l}}=\sum_{t\geq l}\frac{d\textrm{vec}(H_{t})}{dk_{l}}\textrm{vec}(\frac{dL}{dH_{t}})

(43)

To compute $\frac{d\textrm{vec}(H_{t})}{dk_{l}}$ for some $t\geq l$, we write

H_{t}=\Big(\prod_{i=l}^{t-1}\gamma_{i}\Big)k_{l}k_{l}^{\top}+\textrm{terms independent of $k_{l}$.}

(44)

Taking differentials on both sides,

$\displaystyle dH_{t}$	$\displaystyle=\prod_{i=l}^{t-1}\gamma_{i}\Big[dk_{l}k_{l}^{\top}+k_{l}(dk_{l})^{\top}\Big]$	(45)
$\displaystyle d(\textrm{vec}(H_{t}))$	$\displaystyle=\prod_{i=l}^{t-1}\gamma_{i}\Big[\textrm{vec}(dk_{l}k_{l}^{\top})+\textrm{vec}(k_{l}(dk_{l})^{\top})\Big]$
	$\displaystyle=\underset{J_{k_{l}\to\textrm{vec}(H)}}{\underbrace{\Big(\prod_{i=l}^{t-1}\gamma_{i}\Big)\Big[(k_{l}\otimes I)+(I\otimes k_{l})\Big]}}dk_{l}$

where in the last equality we used the identity $\textrm{vec}(k_{l}dk_{l}^{\top})=dk_{l}\otimes k_{l}=(I\otimes k_{l})dk_{l}$ .

Substituting the Jacobian (transposed for gradients) from (45) into (43), we obtain

\displaystyle\frac{dL}{dk_{l}}=\sum_{t\geq l}\Big(\prod_{i=l}^{t-1}\gamma_{i}\Big)\Big[(k_{l}^{\top}\otimes I)+(I\otimes k_{l}^{\top})\Big]\textrm{vec}(\frac{dL}{dH_{t}})

(46)

Substituting the expression for $\text{vec}(\frac{dL}{dH_{t}})$ into equation (46) we get:

$\displaystyle\frac{dL}{dk_{l}}$	$\displaystyle=-\sum_{t\geq l}\left(\prod_{i=l}^{t-1}\gamma_{i}\right)\Big[(k_{l}^{\top}\otimes I)+(I\otimes k_{l}^{\top})\Big]\Big[(x_{t}\otimes\frac{dL}{dq_{t}})+\frac{a}{\|H_{t}\|_{F}}\textrm{vec}(H_{t})\langle\frac{dL}{dq_{t}},x_{t}\rangle\Big]$	(47)
	$\displaystyle=-\sum_{t\geq l}\left(\prod_{i=l}^{t-1}\gamma_{i}\right)\Big[(k_{l}^{\top}\otimes I)(x_{t}\otimes\frac{dL}{dq_{t}})+\frac{a}{\|H_{t}\|_{F}}(k_{l}^{\top}\otimes I)\textrm{vec}(H_{t})\langle\frac{dL}{dq_{t}},x_{t}\rangle$
	$\displaystyle\qquad+(I\otimes k_{l}^{\top})(x_{t}\otimes\frac{dL}{dq_{t}})+\frac{a}{\|H_{t}\|_{F}}(I\otimes k_{l}^{\top})\textrm{vec}(H_{t})\langle\frac{dL}{dq_{t}},x_{t}\rangle\Big]$

Note that the following equations hold:

	$\displaystyle(k_{l}^{\top}\otimes I)(x_{t}\otimes\frac{dL}{dq_{t}})$	$\displaystyle=(k_{l}^{\top}x_{t}\otimes\frac{dL}{dq_{t}})=\langle k_{l},x_{t}\rangle\frac{dL}{dq_{t}}$		(48)
	$\displaystyle(I\otimes k_{l}^{\top})(x_{t}\otimes\frac{dL}{dq_{t}})$	$\displaystyle=x_{t}\otimes k_{l}^{\top}\frac{dL}{dq_{t}}=\langle k_{l},\frac{dL}{dq_{t}}\rangle x_{t}$		(48)

since $(A\otimes B)(C\otimes D)=AC\otimes BD$ and, after the simplification, each Kronecker product is a scalar times a vector.

For the other terms it holds that:

	$\displaystyle(k_{l}^{\top}\otimes I)\textrm{vec}(H_{t})$	$\displaystyle=\textrm{vec}(H_{t}k_{l})$		(49)
	$\displaystyle(I\otimes k_{l}^{\top})\textrm{vec}(H_{t})$	$\displaystyle=\textrm{vec}(k_{l}^{\top}H_{t})=\textrm{vec}(H_{t}^{\top}k_{l})$		(49)

where we used the fact $\textrm{vec}(AXB)=(B^{\top}\otimes A)\textrm{vec}(X)$ and the fact that the vec operator applied to a row vector returns the same result as applying it to its transpose (so we go from $k_{l}^{\top}H_{t}$ to $H_{t}^{\top}k_{l}$). Since $H_{t}$ is symmetric, the two contributions are equal, and their sum is twice $H_{t}k_{l}$.

Eventually we get:

\boxed{\begin{aligned} \frac{dL}{dk_{l}}=-\sum_{t\geq l}\left(\prod_{i=l}^{t-1}\gamma_{i}\right)\Bigg[&\langle k_{l},x_{t}\rangle\frac{dL}{dq_{t}}+\langle k_{l},\frac{dL}{dq_{t}}\rangle x_{t}+\frac{2a}{||H_{t}||_{F}}\langle x_{t},\frac{dL}{dq_{t}}\rangle H_{t}k_{l}\Bigg]\end{aligned}}

(50)

Or equivalently, collecting the terms that are linear in the gradient:

\boxed{\frac{dL}{dk_{l}}=-\sum_{t\geq l}\left(\prod_{i=l}^{t-1}\gamma_{i}\right)\left[\langle k_{l},x_{t}\rangle\frac{dL}{dq_{t}}+\langle k_{l},\frac{dL}{dq_{t}}\rangle x_{t}+\frac{2a\langle x_{t},\frac{dL}{dq_{t}}\rangle}{||H_{t}||_{F}}H_{t}k_{l}\right]}

(51)

Note: The last term creates a dependence on $k_{l}$ through $H_{t}k_{l}$ , which is expected since the regularization $\lambda_{t}$ couples the gradient computation.

Appendix C Proof of Lemma˜2

Lemma 2 describes an interesting phenomenon where, for the CH method (Algorithm 1), the gradient $d\hat{q}$ obtained from implicit differentiation coincides with the exact gradient $dq$ obtained via backpropagation (chain rule). To prove this result, one way is to derive an analytic expression for $dq$ (Section C.1) and then inspect the recursions. However, this can be algebraically involved. Here, we present a clear proof based on some simple observations.

First, note that the output $\xi_{r}$ is linear in $q$ and moreover there is a matrix function $p_{r}(H)\in\mathbb{R}^{D\times D}$ such that

\displaystyle\xi_{r}=p_{r}(H)\cdot q.

(52)

Here $p_{r}(H)$ is a polynomial function of $H$ that encodes the Chebyshev iteration (Algorithm 1). Conversely, $p_{r}(H)\cdot q$ can be computed by applying the Chebyshev iteration with $H,q$ for $r$ iterations (together with other parameters such as $\mu,L$). Then, given the output gradient $d\xi_{r}$, we have

\displaystyle dq=p_{r}(H)^{\top}\cdot d\xi_{r}=p_{r}(H)\cdot d\xi_{r},

(53)

where the last equality follows since $H$ is symmetric, which implies that $p_{r}(H)$ is symmetric. The proof is finished by observing that $p_{r}(H)\cdot d\xi_{r}$ can be computed via Algorithm 1 with inputs $H$, $q=d\xi_{r}$ and the same remaining parameters, which gives us $dq$.
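The argument can be illustrated numerically for a small, unconverged run. The sketch below is a toy check with assumed values: it builds $p_{r}(H)$ column by column by applying the iteration to unit vectors, forms the chain-rule gradient $p_{r}(H)^{\top}d\xi_{r}$, and compares it with a rerun of the same iteration on $d\xi_{r}$. The names `chebyshev` and `matvec`, the matrix, and $r=3$ are choices made for this example.

```python
def matvec(H, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def chebyshev(H, q, mu, L, r):
    """r steps of the Chebyshev Iteration for H xi = q, started from xi_{-1} = 0."""
    rho2 = ((L - mu) / (L + mu)) ** 2
    step = 2.0 / (L + mu)
    prev, cur, omega = [0.0] * len(q), [step * qi for qi in q], 2.0
    for _ in range(r):
        omega = 4.0 / (4.0 - rho2 * omega)
        Hc = matvec(H, cur)
        nxt = [x - step * omega * (hx - qi) + (omega - 1.0) * (x - xp)
               for x, xp, hx, qi in zip(cur, prev, Hc, q)]
        prev, cur = cur, nxt
    return cur


H = [[2.0, 0.5, 0.0], [0.5, 1.5, 0.2], [0.0, 0.2, 1.0]]
mu, L, r = 0.8, 2.5, 3          # deliberately few iterations: xi_r is not converged
n = len(H)

# Column j of p_r(H) is the output of the iteration applied to the j-th unit vector.
columns = [chebyshev(H, [1.0 if i == j else 0.0 for i in range(n)], mu, L, r)
           for j in range(n)]
P = [[columns[j][i] for j in range(n)] for i in range(n)]

d_xi = [0.3, -1.0, 2.0]         # hypothetical upstream gradient d xi_r
dq_chain_rule = [sum(P[i][j] * d_xi[i] for i in range(n)) for j in range(n)]  # P^T d_xi
dq_rerun = chebyshev(H, d_xi, mu, L, r)     # Algorithm 1 with q replaced by d_xi
gap = max(abs(u - v) for u, v in zip(dq_chain_rule, dq_rerun))
asymmetry = max(abs(P[i][j] - P[j][i]) for i in range(n) for j in range(n))
print(gap, asymmetry)
```

Both quantities are at rounding level even though three iterations are far from convergence, in line with the statement that the identity holds for any $r$.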

C.1 The Exact Backward Pass for $dq$ and $dH$

Here we show how to obtain the exact gradients $dH$ and $dq$ in Algorithm 1 given the output gradient $d\xi_{r}$, which might be of independent interest. The key insight here is that the Chebyshev iteration can be reversed.

Backward Pass for dq. Let $I$ be the identity matrix of suitable size. To derive a backward pass of Algorithm 1, we first write down the update of $\xi_{i}$ concisely in the following recursion

\displaystyle\xi_{i}=A_{i}\xi_{i-1}+b_{i}\xi_{i-2}+c_{i}q,

(54)

where $A_{i},b_{i},c_{i}$ are defined as

\displaystyle A_{i}=\omega_{i}I-\frac{2\cdot\omega_{i}}{L+\mu}H,\quad b_{i}=-(\omega_{i}-1),\quad c_{i}=\frac{2\cdot\omega_{i}}{L+\mu}.

(55)

Note that $A_{i}$ is symmetric. Define $d\xi_{i}:=\frac{d\mathcal{L}}{d\xi_{i}}$ for every $i$. With some loss function $\mathcal{L}$, assume we are now given $d\xi_{r}$, and our goal is to compute $dq:=\frac{d\mathcal{L}}{dq}$. Since $q$ appears in Eq. 54 for every $i$, we know $\xi_{0},\xi_{1},\dots,\xi_{r}$ all depend on $q$. Therefore, with $c_{0}:=\frac{2}{L+\mu}$, we have

\displaystyle dq=\sum_{i=0}^{r}c_{i}\cdot d\xi_{i}.

It remains to compute $d\xi_{i}$ for every $i$. Applying the chain rule to Eq. 54, we obtain

\begin{split}d\xi_{r-1}&=A_{r}\cdot d\xi_{r}\\ d\xi_{i-2}&=A_{i-1}\cdot d\xi_{i-1}+b_{i}\cdot d\xi_{i},\quad\forall i=r,\dots,2.\end{split}

(56)

Note that $A_{i},b_{i},c_{i}$ depend on some constant terms and $\omega_{i}$. Thus, to compute them backward we assume access to $\omega_{r}$ and these constants. By reversing Eq. (weight schedule) we derive the following recursion:

\begin{split}\nu_{r}&\leftarrow\omega_{r}\\ \nu_{i-1}&\leftarrow\frac{4}{\rho^{2}}\left(1-\frac{1}{\nu_{i}}\right),\quad\forall i=r,\dots,1.\end{split}

(57)

Similarly to how $\omega_{i}$ decreases with $i$ and converges to $\omega_{1}^{*}$ , we may prove $\nu_{i}$ is convergent to the other fixed point, $\omega_{2}^{*}$ , as $i$ decreases (and the iterate does not stop at $i=1$ ).

Backward Pass for dA. From Eq. 54 and Eq. 55 we see that each iteration $i$ contributes a term to $dH$ through $A_{i}$:

\displaystyle dA_{i}=d\xi_{i}\otimes\xi_{i-1}^{\top},\quad dH=-\sum_{i=1}^{r}\frac{2\cdot\omega_{i}}{L+\mu}\cdot dA_{i}

(58)

where $\otimes$ denotes the Kronecker product; here it is the outer product of $d\xi_{i}$ and $\xi_{i-1}$, as both are vectors.

Reverse Chebyshev Iteration. At first glance, computing $dA_{i}=d\xi_{i}\otimes\xi_{i-1}^{\top}$ requires storing $\xi_{i-1}$ in the forward pass, and the actual calculation of $dA_{i}$ is done after we run the backward pass for $d\xi_{i}$ in Eq. 56. However, storing all $\xi_{i}$’s would be memory-inefficient. To address this issue, a main insight here is that we can reverse Eq. 54 by solving it for $\xi_{i-2}$ and write

\displaystyle\xi_{i-2}=\frac{1}{b_{i}}(\xi_{i}-A_{i}\xi_{i-1}-c_{i}q).

(59)

This implies that we can recover all the iterates $\xi_{r},\dots,\xi_{0}$ as soon as we have access to the last two, $\xi_{r},\xi_{r-1}$. Therefore, to obtain $dA_{i}$, we can run the two iteration schemes in Eq. 56 and Eq. 59 simultaneously.
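A small experiment makes the reversal concrete. Solving Eq. 54 for $\xi_{i-2}$ gives $\xi_{i-2}=\frac{1}{b_{i}}(\xi_{i}-A_{i}\xi_{i-1}-c_{i}q)$, and Eq. 57 recovers the weights backward. The sketch below is illustrative only: the function names, the toy matrix, and the bounds $\mu=0.05$, $L=1$ are assumptions, and the forward pass stores all iterates solely so that the recovered ones can be compared against them.

```python
def matvec(H, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def forward(H, q, mu, L, r):
    """Run Eq. 54 for r steps; every iterate is kept only to check the reversal."""
    rho2 = ((L - mu) / (L + mu)) ** 2
    step = 2.0 / (L + mu)
    xs = [[0.0] * len(q), [step * qi for qi in q]]    # xi_{-1}, xi_0
    omega = 2.0
    for _ in range(r):
        omega = 4.0 / (4.0 - rho2 * omega)
        Hx = matvec(H, xs[-1])
        xs.append([omega * (x - step * hx) - (omega - 1.0) * xp + step * omega * qi
                   for x, xp, hx, qi in zip(xs[-1], xs[-2], Hx, q)])
    return xs, omega                                   # xs[i + 1] holds xi_i


def reverse(H, q, mu, L, r, xi_r, xi_rm1, omega_r):
    """Recover xi_{r-2}, ..., xi_{-1} from the last two iterates and omega_r."""
    rho2 = ((L - mu) / (L + mu)) ** 2
    step = 2.0 / (L + mu)
    nu, cur, prev = omega_r, xi_r, xi_rm1
    recovered = []
    for _ in range(r):                                 # i = r, ..., 1
        Hp = matvec(H, prev)
        older = [-(c - nu * (p - step * hp) - step * nu * qi) / (nu - 1.0)
                 for c, p, hp, qi in zip(cur, prev, Hp, q)]
        recovered.append(older)
        cur, prev = prev, older
        nu = 4.0 / rho2 * (1.0 - 1.0 / nu)             # Eq. 57
    return recovered


# Toy problem (assumed): Gershgorin discs place the eigenvalues of H in [0.2, 0.8].
H = [[0.6, 0.2, 0.0], [0.2, 0.5, 0.1], [0.0, 0.1, 0.3]]
q = [1.0, -0.5, 0.25]
mu, L, r = 0.05, 1.0, 10
xs, omega_r = forward(H, q, mu, L, r)
recovered = reverse(H, q, mu, L, r, xs[-1], xs[-2], omega_r)
stored = xs[:-2][::-1]                                 # xi_{r-2}, ..., xi_{-1}
err = max(abs(u - v) for a, b in zip(recovered, stored) for u, v in zip(a, b))
print(err)
```

Only $\xi_{r}$, $\xi_{r-1}$ and $\omega_{r}$ need to be kept from the forward pass in this scheme.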

Remark 2.

We find that being able to run the iterative update backward in a numerically stable fashion is a main feature of the Chebyshev iterative method (or more generally, gradient descent variants with momentum). Vanilla gradient descent cannot efficiently reverse its iterate $\xi_{i}=\xi_{i-1}-\gamma_{i}(H\xi_{i-1}-q)$ with stepsize $\gamma_{i}$, as it requires inverting $(I-\gamma_{i}H)$. Moreover, the reversal in Eq. 59 can be done stably, as $b_{i}$ is often in a good numerical range, which means division by $b_{i}$ in Eq. 59 is not an issue. To see this, first note that by Lemma 3 we have

\displaystyle 1\geq-b_{i}=\omega_{i}-1\geq\omega_{1}^{*}-1.

Note that $\omega_{1}^{*}$ defined in Lemma 3 is an increasing function of $\rho$ and therefore of $\kappa$. We then have that $-b_{i}\in[0.25,1]$ for any $\kappa\geq 10$ (we will not consider the case $\kappa<10$, as this means we need to add a very large regularization strength, which might harm the minimization of the regression loss). In comparison, if we were to reverse the CG iteration, we would need to divide by a quantity that is often numerically as small as $10^{-3}$ or as large as $10^{10}$ (see Fig. 5). This is why it is numerically unstable to reverse CG.
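The range of $-b_{i}$ quoted above can be checked directly from the definition of $\omega_{1}^{*}$. In the sketch below, `omega_star` is a helper name introduced for this example and the list of condition numbers is arbitrary.

```python
import math


def omega_star(kappa):
    """Limit point omega_1^* of the weight schedule for condition number kappa > 1."""
    rho2 = ((kappa - 1.0) / (kappa + 1.0)) ** 2
    return 2.0 * (1.0 - math.sqrt(1.0 - rho2)) / rho2


kappas = [10.0, 51.0, 1e3, 1e6]
lower_bounds = [omega_star(kappa) - 1.0 for kappa in kappas]   # lower bound on -b_i
for kappa, lb in zip(kappas, lower_bounds):
    print(kappa, lb)
```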

Algorithm 2 Backward Pass of Chebyshev Iteration

Input: $H$, $d\xi_{r}$, $L$, $\mu$, number of iterations $r$, the final weight $\omega_{r}$;

Initialize $\rho\leftarrow\frac{L-\mu}{L+\mu}$, $d\xi_{r+1}\leftarrow 0$, $\nu_{r}\leftarrow\omega_{r}$, $\nu_{r+1}\leftarrow 0$, $\nu_{0}\leftarrow 1$, $dq\leftarrow\frac{2\nu_{r}}{L+\mu}\cdot d\xi_{r}$, $dH\leftarrow-\frac{2\nu_{r}}{L+\mu}d\xi_{r}\otimes\xi_{r-1}^{\top}$;

For $i=r,\dots,1$:

$\displaystyle\xi_{i-2}$	$\displaystyle\leftarrow-\frac{1}{\nu_{i}-1}\left(\xi_{i}-\left(\nu_{i}I-\frac{2\cdot\nu_{i}}{L+\mu}H\right)\xi_{i-1}-\frac{2\cdot\nu_{i}}{L+\mu}q\right)$	(60)
$\displaystyle d\xi_{i-1}$	$\displaystyle\leftarrow\left(\nu_{i}\cdot d\xi_{i}-\frac{2\cdot\nu_{i}}{L+\mu}\cdot H\cdot d\xi_{i}\right)-(\nu_{i+1}-1)\cdot d\xi_{i+1}$	(61)
$\displaystyle\nu_{i-1}$	$\displaystyle\leftarrow\frac{4}{\rho^{2}}\left(1-\frac{1}{\nu_{i}}\right)$	(62)
$\displaystyle dq$	$\displaystyle\leftarrow dq+\frac{2\nu_{i-1}}{L+\mu}\cdot d\xi_{i-1}$	(63)
$\displaystyle dH$	$\displaystyle\leftarrow dH-\frac{2\nu_{i-1}}{L+\mu}\cdot d\xi_{i-1}\otimes\xi_{i-2}^{\top}$	(64)

Output: $dq$, $dH$;

Appendix D Experimental Setup

Model Configurations. We consider models of 3 different sizes: 440M, 1B, and 2.8B. This is summarized in Table 3. All models use the GPT2 tokenizer, as in [Von-arXiv2025-mesa].

Table 3: Model sizes and the corresponding architectural configurations.

Model Size	Number of Layers	Number of Heads	Hidden Dimension
440M	28	8	1024
1B	28	12	1536
2.8B	32	20	2560

Training Configurations. All models are trained with the AdamW optimizer with an initial learning rate of $10^{-3}$, $5\%$ warm-up steps, a cosine schedule, and gradient clipping with maximum norm $1$.
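For concreteness, one common way to realize such a schedule is sketched below. Several details are assumptions, since the text does not specify them: warm-up is taken to be linear, $10^{-3}$ is treated as the peak value reached at the end of warm-up, the cosine decays to `min_lr = 0`, and the step count assumes that the global batch size in Table 4 is measured in tokens. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-3, warmup_frac=0.05, min_lr=0.0):
    """Linear warm-up followed by cosine decay (assumed shapes, see lead-in)."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 8000   # hypothetical: 8B tokens at 1M tokens per step (440M setting)
schedule = [learning_rate(s, total_steps) for s in range(total_steps)]
print(schedule[0], schedule[399], schedule[4000], schedule[-1])
```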

Table 4: Model sizes and the corresponding training configurations.

Model Size	Global Batch Size	Total Number of Training Tokens	Sequence Length
440M	1M	8B	2048
1B	2M	20B	2048
2.8B	2M	100B	4096

Models of the same scale use the same training configurations. Specifically (see also Table 4):

• For 440M models, we use sequence length 2048 and 8B DCLM tokens.
• For 1B models, we use sequence length 2048 and 20B DCLM tokens.
• For 2.8B models, we use sequence length 4096 and 100B DCLM tokens.

Model Hyperparameters. We use default parameters for all other models as given in the Flash-Linear-Attention v0.4.0 library (except the ones mentioned in Table 3). For our approach, we use $\lambda_{t}=0.02\cdot\|H_{t}\|_{\text{F}}$, with gating and $\alpha$-connection enabled by default, unless otherwise specified. We also run CH for $30$ iterations for all experiments.
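This choice of $\lambda_{t}$ ties in with the value $\kappa=51$ used in Remark 1. As a short sketch, assuming only that $H_{t}$ is symmetric positive semidefinite (so that its eigenvalues lie in $[0,\|H_{t}\|_{2}]$ and $\|H_{t}\|_{2}\leq\|H_{t}\|_{\text{F}}$) and writing $a=0.02$:

```latex
\lambda_{\min}(H_{t}+\lambda_{t}I)\ \geq\ \lambda_{t}=a\,\|H_{t}\|_{\text{F}},
\qquad
\lambda_{\max}(H_{t}+\lambda_{t}I)\ \leq\ \|H_{t}\|_{2}+\lambda_{t}\ \leq\ (1+a)\,\|H_{t}\|_{\text{F}},
\\
\kappa(H_{t}+\lambda_{t}I)\ \leq\ \frac{(1+a)\,\|H_{t}\|_{\text{F}}}{a\,\|H_{t}\|_{\text{F}}}=\frac{1.02}{0.02}=51 .
```

Under this assumption, $\mu=a\|H_{t}\|_{\text{F}}$ and $L=(1+a)\|H_{t}\|_{\text{F}}$ are valid eigenvalue bounds for the CH method; whether the implementation uses exactly these bounds is not stated in this appendix.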

Individual Experiments. We now describe the setups for each individual experiment.

In Fig. 1a, we randomly generate tensors $k\in\mathbb{R}^{B\times T\times H\times D}$ and $q$ and normalize them along the last dimension ($D$). Here $B,T,H,D$ simulate the batch size, sequence length, number of heads, and head dimension, respectively. Then we compute the covariance matrices $H\in\mathbb{R}^{B\times T\times H\times D\times D}$ of $k$ and normalize each $D\times D$ slice by its Frobenius norm. The code to generate the data is shown below.

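    import torch

    # B, T, H, D (batch size, sequence length, heads, head dimension), dtype,
    # and ridge_strength are assumed to be set by the caller.
    k = torch.randn(B, T, H, D).to(dtype).to('cuda')
    q = torch.randn(B, T, H, D).to(dtype).to('cuda')

    # Normalize queries and keys along the head dimension D.
    q = q / torch.linalg.vector_norm(q, dim=-1, keepdim=True).to(q)
    k = k / torch.linalg.vector_norm(k, dim=-1, keepdim=True).to(k)

    # Cumulative (over time) sums of the outer products k k^T; each D x D
    # slice is then normalized by its Frobenius norm.
    kk = torch.einsum('...i,...j->...ij', k, k).cumsum(1)
    kk = kk / torch.linalg.matrix_norm(kk, ord='fro')[..., None, None]

    # Add the ridge term to the diagonal of every slice, in place.
    kk.diagonal(dim1=-2, dim2=-1).add_(ridge_strength)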

For Fig. 1c and Fig. 1e, we generate random input ids with vocabulary size 5000, sequence length 2048 within a 5-layer LLAMA; we set 2 heads and head dimension 128 for this architecture.

In the MQAR experiments of Fig. 3a, we follow the standard experimental setting but consider a strictly harder setting with smaller model dimension (or hidden dimension). Indeed, in the setting of [arora2023zoology], the model dimension is always larger than or equal to the number of KV pairs, while in the setting here, in some cases the model dimension is smaller than the number of KV pairs, in which case linear SSMs could not perfectly memorize all KV pairs.

In the main paper, Fig. 3a is without any gating or $\alpha$-connection.

In Fig. 4 we considered the following tasks for long-context evaluations. The reported result for each task is the average of the scores obtained on the individual datasets in that task.

• Retrieval-Augmented Generation (RAG): These tasks consist of open-domain question answering where the model is given a gold passage (a passage containing the answer) interspersed between many other retrieved passages from a corpus [petroni2021kilt, Wikipedia dump split into 100-word passages]. The model is tasked with answering the question based on the obtained passages. We consider the following datasets from HELMET [yen2025helmet] for this task: Natural Questions, TriviaQA, PopQA, HotpotQA.
• Many-shot In-Context Learning (ICL): ICL tests an LLM's ability to learn new skills from a few examples. Here the task is to learn to classify between different concepts based on several in-context examples of the said concept. We consider the following datasets from HELMET [yen2025helmet] for this task: TREC Coarse, TREC Fine, NLU, BANKING77, CLINIC150.
• Synthetic Recall: These tasks are variations of the “Needle-in-a-Haystack” task [needle2024haystack] where the goal is to retrieve an important piece of information, the “needle”, from a long context of distractor tokens, the “haystack”. These variations also test multi-hop tracing and aggregation capabilities of the model. We consider the following datasets from RULER [hsieh2024ruler] for this task: S-NIAH-1/2/3, MK-NIAH-1/2/3, MV-NIAH, MQ-NIAH, VT, CWE, FWE.
• LongQA: These are long-document question-answering tasks. The documents are typically made long by randomly sampling different paragraphs from the same dataset along with the paragraph that contains the answer. We consider the following datasets from RULER [hsieh2024ruler] for this task: SQuAD, HotpotQA.

Appendix E How Does the Performance of GKA Scale with Compute?

We consider models at three different scales: 440M, 1B and 2.8B. For training configurations and architecture, refer to Appendix D. We use prototypical tasks from LM-Harness (see Section 5.3.1 for the list of tasks) to evaluate the language modeling capabilities of GKA and compare with baseline SSM/fading memory layers. Table 5 shows that at the 440M scale, GKA is competitive with GDN and DeltaNet. However, differences emerge at larger scales, with GKA showing increasing benefits. In particular, the retrieval capabilities of our model, as measured by FDA and SWDE, consistently outperform all SSM baselines at the 1B and 2.8B scales. We also report the results of an equal-sized Transformer for completeness, which serves as a performance ceiling at each scale.

Table 5: GKA shows stronger scaling with compute than other SSM baseline models. LM-Harness results for models at different scales: 440M, 1B and 2.8B. All models were trained from scratch. The 440M and 1B models were trained on 8B and 20B tokens, respectively, in accordance with the Chinchilla scaling laws [hoffmann2022empirical]. The 2.8B model was trained on 100B tokens.

Model	ARC-C (acc_n ↑)	ARC-E (acc_n ↑)	BoolQ (acc ↑)	COPA (acc ↑)	HellaSWAG (acc_n ↑)	PIQA (acc_n ↑)	SciQ (acc_n ↑)	Winogrande (acc ↑)	FDA (contains ↑)	SWDE (contains ↑)	Avg
440M Models
Transformer	24.40	42.26	59.88	70.00	36.19	64.15	61.50	51.70	5.17	35.64	45.09
Gated Linear Attention	24.06	40.28	56.57	71.00	32.70	62.24	57.80	50.67	1.00	9.18	40.55
Gated DeltaNet	25.17	41.96	58.23	72.00	36.96	64.69	63.60	51.70	1.91	11.88	42.81
DeltaNet	25.09	41.92	61.13	65.00	37.20	64.47	64.00	49.49	2.81	14.31	42.54
Gated KalmaNet (Ours)	24.57	43.22	56.94	71.00	37.22	64.47	62.80	50.83	1.45	14.04	42.65
1B Models
Transformer	26.62	46.42	59.94	77.00	44.01	67.14	68.30	54.06	8.35	45.18	49.70
Mamba2	28.07	46.63	60.21	70.00	44.57	67.57	65.50	54.30	1.45	15.75	45.40
Gated Linear Attention	25.94	42.00	58.84	70.00	36.34	63.60	58.20	51.85	1.45	10.53	41.88
Gated DeltaNet	27.05	47.98	59.54	74.00	44.27	67.36	66.20	53.83	2.18	17.82	46.02
DeltaNet	27.56	46.25	59.97	71.00	43.18	67.74	65.90	55.41	3.09	20.61	46.07
Gated KalmaNet (Ours)	25.43	46.55	60.73	74.00	44.59	68.88	67.60	52.41	6.17	21.87	46.82
2.8B Models
Transformer	32.25	56.10	64.28	80.00	60.96	73.56	79.50	61.72	58.53	72.28	63.92
Mamba2	32.24	59.64	58.72	82.00	62.23	73.78	79.80	62.19	7.71	41.13	55.94
Gated Linear Attention	27.82	50.80	52.57	78.00	48.83	70.13	69.60	54.54	2.81	20.43	47.55
Gated DeltaNet	32.59	60.02	62.75	82.00	62.80	74.32	80.60	62.35	8.26	44.28	57.00
DeltaNet	32.85	58.16	42.51	81.00	61.13	73.78	43.90	61.72	11.80	46.08	51.29
Gated KalmaNet (Ours)	32.51	59.89	61.68	85.00	63.84	74.81	83.20	64.17	12.89	50.95	58.89

Appendix F Ablations

In this section we consider ablations for various modeling choices made in arriving at our final GKA model. For all ablations, we consider 2.8B models trained on 100B tokens on DCLM at 4K context length (unless mentioned otherwise). We use the same architecture and training configurations for these ablations as mentioned in Appendix D.

F.1 Does Adaptive Regularization Help?

As discussed in Section 4.1, we introduced adaptive regularization to control the condition number of $H_{t}+\lambda_{t}I$ for numerical stability. Here we ablate this choice by comparing the following runs.

1. Adaptive regularization. We train a model with $\lambda_{t}=a\|H_{t}\|_{F}$. We report results for $a=0.02$ for this run.

2. Constant regularization. We train the same model architecture (as above) with $\lambda_{t}=0.25$ (a constant). This choice of $0.25$ is motivated by concurrent work [Von-arXiv2025-mesa], which explored a similar ridge regression objective for LLM training.

As shown in Fig. 6, without strict condition number control, gradient norms spike during training, leading to a higher cross-entropy loss than in the run with adaptive regularization.
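The effect of the two choices on conditioning can be illustrated on a toy problem. The sketch below is not taken from our training code: the 2x2 matrix, the key values, and the helper names (`sym2x2_eigs`, `cond_regularized`, `frobenius`) are hypothetical, and $H$ is assumed to be a rank-deficient positive semidefinite Gram matrix of keys. Since $\|H\|_{2}\leq\|H\|_{F}$, the adaptive choice $\lambda=a\|H\|_{F}$ keeps the condition number of $H+\lambda I$ at or below $1+1/a$ regardless of the scale of $H$, whereas a constant $\lambda$ gives a condition number that grows with that scale.

```python
import math

def sym2x2_eigs(h11, h12, h22):
    """Eigenvalues (min, max) of the symmetric matrix [[h11, h12], [h12, h22]]."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return mean - radius, mean + radius

def cond_regularized(h11, h12, h22, lam):
    """Condition number of H + lam * I for a PSD 2x2 matrix H."""
    lo, hi = sym2x2_eigs(h11, h12, h22)
    return (hi + lam) / (max(lo, 0.0) + lam)

def frobenius(h11, h12, h22):
    """Frobenius norm of the symmetric 2x2 matrix."""
    return math.sqrt(h11**2 + 2 * h12**2 + h22**2)

# Hypothetical collinear keys, so H = sum_i k_i k_i^T has rank 1.
keys = [(1.0, 1.0), (2.0, 2.0)]
a = 0.02                      # adaptive coefficient, as in run 1
bound = 1.0 + 1.0 / a         # upper bound on cond(H + a * ||H||_F * I)

results = []
for scale in (1.0, 10.0, 100.0):          # growing key magnitude
    h11 = sum((scale * x) ** 2 for x, _ in keys)
    h12 = sum((scale * x) * (scale * y) for x, y in keys)
    h22 = sum((scale * y) ** 2 for _, y in keys)
    cond_const = cond_regularized(h11, h12, h22, 0.25)
    cond_adapt = cond_regularized(h11, h12, h22, a * frobenius(h11, h12, h22))
    results.append((scale, cond_const, cond_adapt))
    print(f"scale={scale:6.1f}  constant: {cond_const:10.1f}  adaptive: {cond_adapt:6.2f}")
```

In this toy setting the constant-regularization condition number grows quadratically with the key scale, while the adaptive one stays at the bound $1+1/a=51$.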

F.2 Does Adaptive Weighting Help?

In Section 4.1, we discussed increasing the expressivity of our layer by introducing adaptive weights $\eta_{t,i}$, which re-weight the past to be exponentially decaying in time. Given constant-sized memory, we hypothesize that this adaptive weighting (gating) allows GKA to learn an effective representation by incorporating recency bias into its computation. In this subsection we test this hypothesis by carrying out the following runs.

1. Adaptive weighting (gating). We train a model with adaptive weights. Specifically, for all $t\geq i$, we parameterize the weight for the $i^{\textrm{th}}$ sample at time-step $t$ as $\eta_{t,i}=\prod_{j=i+1}^{t}\gamma_{j}$, with each $\gamma_{j}\in[0,1]$ learnable.

2. No weighting. We train the same model architecture as above, but with no weights. This results in an unweighted ridge regression objective, obtained by setting $\eta_{i}=1$ in Eq. 3.

Table 6 shows clear benefits of adaptive weighting, with improvements on all LM-Harness tasks considered, thereby validating our hypothesis.
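The product form of the weights is what makes gating compatible with constant memory. The following sketch is illustrative only: it assumes the weighted Gram matrix is $H_{t}=\sum_{i\leq t}\eta_{t,i}k_{i}k_{i}^{\top}$ (a definition not restated in this appendix), and the dimensions, seed, and variable names are hypothetical. Under that assumption, the parameterization $\eta_{t,i}=\prod_{j=i+1}^{t}\gamma_{j}$ implies the recursion $H_{t}=\gamma_{t}H_{t-1}+k_{t}k_{t}^{\top}$, which the code checks numerically against the direct sum.

```python
import random

def direct_weights(gammas, t):
    """eta_{t,i} = prod_{j=i+1}^{t} gamma_j for i = 1..t (1-indexed; gammas[0] unused)."""
    etas = []
    for i in range(1, t + 1):
        eta = 1.0
        for j in range(i + 1, t + 1):
            eta *= gammas[j]
        etas.append(eta)
    return etas

def outer(k):
    """Outer product k k^T as a list of rows."""
    return [[ki * kj for kj in k] for ki in k]

rng = random.Random(0)
T, D = 6, 3                                   # hypothetical sequence length and key dimension
keys = [None] + [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
gammas = [None] + [rng.uniform(0.5, 1.0) for _ in range(T)]

# Direct form: H_T = sum_i eta_{T,i} k_i k_i^T (needs all past keys).
etas = direct_weights(gammas, T)
H_direct = [[0.0] * D for _ in range(D)]
for i in range(1, T + 1):
    kk = outer(keys[i])
    for r in range(D):
        for c in range(D):
            H_direct[r][c] += etas[i - 1] * kk[r][c]

# Recursive form: H_t = gamma_t * H_{t-1} + k_t k_t^T (constant memory).
H_rec = [[0.0] * D for _ in range(D)]
for t in range(1, T + 1):
    kk = outer(keys[t])
    H_rec = [[gammas[t] * H_rec[r][c] + kk[r][c] for c in range(D)] for r in range(D)]

max_gap = max(abs(H_direct[r][c] - H_rec[r][c]) for r in range(D) for c in range(D))
print("max |H_direct - H_rec| =", max_gap)
```

Because every $\gamma_{j}\leq 1$, the weights in `etas` are non-decreasing in $i$: older samples are never weighted more heavily than recent ones, which is the recency bias discussed above.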

Table 6: Adaptive weighting outperforms across the board on LM-Harness tasks. Results for 2.8B models trained on 100B tokens from DCLM with and without adaptive weights as introduced in Section 4.1.

Adaptive Weights	ARC-C (acc_n ↑)	ARC-E (acc_n ↑)	BoolQ (acc ↑)	COPA (acc ↑)	HellaSWAG (acc_n ↑)	PIQA (acc_n ↑)	SciQ (acc_n ↑)	Winogrande (acc ↑)	FDA (contains ↑)	SWDE (contains ↑)	Avg
✗	28.24	51.73	57.68	-	53.87	71.87	71.60	54.38	6.08	33.03	50.45
✓	32.51	59.89	61.68	85.00	63.84	74.81	83.20	64.17	12.89	50.95	58.89

F.3 Does the $\alpha$-connection Improve Training of GKA?

In Section 4.3, we introduce the $\alpha$-connection as a residual connection that establishes a direct path for gradient flow through the GLA solution, improving training stability. This allows the model to fall back on the GLA solution when CH produces poor-quality results because the iterative solver has not converged within the fixed iteration budget. To validate this design choice, we perform two runs.

R1. With $\alpha$-connection. We train a model with the $\alpha$-connection as shown in our GKA block in Fig. 2.

R2. Without $\alpha$-connection. We train the same model architecture as above, but with no $\alpha$-connection. This can be understood as setting $\alpha_{t}=1$ for all time-steps $t$ in Fig. 2.

On LM-Harness, both models perform similarly, with R1 and R2 achieving aggregate scores of 58.89 and 58.39, respectively. However, clear differences emerge under long-context evaluation, for which we trained both models on an additional 25B tokens from long documents at 128K context length. Fig. 7 shows that GKA without the $\alpha$-connection exhibits inferior long-context performance on average, with Synthetic Recall and LongQA showing major degradation.

Appendix G Effects of Different Regularization Strengths

Recall that we proposed setting the adaptive regularization as $\lambda_{t}=a\cdot\|H_{t}\|_{\text{F}}$. We now present experiments validating this choice.

Synthetic Experiments. First, we generate data as per Fig. 1a, where the covariance matrix is normalized by its Frobenius norm. In this case we set $\lambda_{t}=a$ for $a$ varying in $\{0.01,0.02,0.05,0.1\}$. Fig. 8 shows that the maximum regularized residual norm (computed as the maximum of $\|(H_{t}+\lambda_{t}I)\xi_{i}-q\|_{2}$ over all dimensions, where $\xi_{i}$ is the estimate of CH at iteration $i$) decreases as we enlarge $\lambda_{t}$. This is because a larger $\lambda_{t}$ reduces the condition number. The downside of a large $\lambda_{t}$ is that it reduces the memorization capacity: it might enlarge $\|H_{t}\xi_{i}-q\|_{2}$, the true residual of interest.
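The qualitative trend can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions, not the setup of Fig. 8: it uses a standard textbook form of Chebyshev iteration (which may differ in detail from Algorithm 1), random Gaussian keys and query, a hypothetical budget of 30 iterations, and the spectral interval $[a,1+a]$, which is valid because $H$ is positive semidefinite with $\|H\|_{2}\leq\|H\|_{F}=1$ after normalization.

```python
import math
import random

def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def chebyshev_solve(A, b, lam_min, lam_max, num_iters):
    """Textbook Chebyshev iteration for A x = b, spectrum(A) in [lam_min, lam_max]."""
    theta = 0.5 * (lam_max + lam_min)    # center of the spectral interval
    delta = 0.5 * (lam_max - lam_min)    # half-width of the spectral interval
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = [0.0] * len(b)
    r = list(b)                          # residual b - A x at x = 0
    d = [ri / theta for ri in r]
    for _ in range(num_iters):
        x = [xi + di for xi, di in zip(x, d)]
        Ad = matvec(A, d)
        r = [ri - adi for ri, adi in zip(r, Ad)]
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        d = [rho_next * rho * di + (2.0 * rho_next / delta) * ri
             for di, ri in zip(d, r)]
        rho = rho_next
    return x

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

rng = random.Random(0)
D, T = 8, 16                             # hypothetical dimension and number of keys
keys = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
H = [[sum(k[r] * k[c] for k in keys) for c in range(D)] for r in range(D)]
fro = math.sqrt(sum(h * h for row in H for h in row))
H = [[h / fro for h in row] for row in H]            # normalize so ||H||_F = 1
q = [rng.gauss(0, 1) for _ in range(D)]

residuals = {}
for a in (0.01, 0.02, 0.05, 0.1):
    A = [[H[r][c] + (a if r == c else 0.0) for c in range(D)] for r in range(D)]
    xi = chebyshev_solve(A, q, lam_min=a, lam_max=1.0 + a, num_iters=30)
    residuals[a] = norm([ai - qi for ai, qi in zip(matvec(A, xi), q)])
    print(f"a={a:<5} regularized residual = {residuals[a]:.3e}")
```

With a fixed iteration budget, the regularized residual shrinks as $a$ grows, mirroring the conditioning argument above; the sketch does not measure the true residual $\|H_{t}\xi_{i}-q\|_{2}$, which is where the cost of a large $\lambda_{t}$ appears.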

GKA with different regularization strengths. We train several 2.8B models with varying regularization strength by choosing $a\in\{0.01,0.02,0.05,0.1\}$. While performance on LM-Harness (Table 7) shows little discrepancy, we observe noticeable differences in long-context performance (Fig. 9), where memorization capacity matters most. Specifically, the long-context performance of GKA improves initially as we decrease $a$ from $0.1\to 0.05$. This is expected, since a smaller $a$ increases the memorization capacity of the model. However, decreasing $a$ further from $0.05\to 0.02\to 0.01$ causes performance to decrease. This can be attributed to the increasing condition number of the problem, which reduces the quality of the solution computed by CH (Fig. 8).

Table 7: Ablation over different choices of regularization strength $\lambda_{t}=a\cdot\|H_{t}\|_{\text{F}}$. Short-context performance on LM-Harness shows little discrepancy with different regularization strengths.

a	ARC-C (acc_n ↑)	ARC-E (acc_n ↑)	BoolQ (acc ↑)	COPA (acc ↑)	HellaSWAG (acc_n ↑)	PIQA (acc_n ↑)	SciQ (acc_n ↑)	Winogrande (acc ↑)	FDA (contains ↑)	SWDE (contains ↑)	Avg
0.01	33.45	58.63	62.63	85.00	63.36	73.99	81.40	63.14	11.16	51.49	58.43
0.02	32.51	59.89	61.68	85.00	63.84	74.81	83.20	64.17	12.89	50.95	58.89
0.05	32.68	61.66	53.57	79.00	63.46	74.84	82.60	63.77	11.98	49.68	57.32
0.1	32.76	59.85	63.52	84.00	63.95	75.08	83.20	63.54	11.43	51.22	58.86

Table 8: Latent sketching increases training throughput by about 9% (7.65 to 8.37) while reducing the LM-Harness average by 1.32 points (58.89 to 57.57). Training throughput is reported in billions of tokens/day/node. It is measured on a single H200 GPU with a batch size of 1M tokens. Long-context performance, however, is adversely affected by sketching, with up to a 60% relative drop in performance. Future work will address this by exploring the use of sketching adaptively depending on the "complexity" of the task.

Sketch dimension	LM-Harness avg.	Training throughput
32	57.57	8.37
no-sketch	58.89	7.65

Appendix H Latent Sketching for Approximate Solutions

We introduce the idea of sketching from random matrix theory to further control the trade-off between FLOPs and accuracy in GKA. Sketching involves down-projecting the normal equations into a low-dimensional subspace, solving the equations in this subspace, and finally up-projecting the solution back to the original space. This reduces the worst-case computational complexity of our approach from $\mathcal{O}(D^{2}r)$ to $\mathcal{O}(d^{2}r)$, where $d\ll D$ and $r$ is the number of iterations in Algorithm 1. To the best of our knowledge, our work is the first to introduce sketching as a viable way to increase the efficiency of neural network layers that are defined implicitly by the solution to an optimization problem. Sketching can be thought of as analogous to the Multi-head Latent Attention idea introduced by DeepSeek, but applied to fading memory layers. Table 8 shows preliminary results of this idea applied to GKA. Both models (no-sketch and sketch dim 32) are trained from scratch at 2.8B scale on 100B tokens.
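The three steps (down-project, solve, up-project) can be sketched as follows. This is a generic sketch-and-solve illustration, not the implementation behind Table 8: the Gaussian sketch matrix `P`, the choice to regularize after projection with $\lambda I$, the direct solver (used here in place of the iterative solver), and all names and sizes are assumptions made for the example.

```python
import random

def matmul(A, B):
    """Dense matrix product of A (n x m) and B (m x p)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def sketched_ridge_solve(H, q, lam, P):
    """Approximate (H + lam I)^{-1} q using a d x D sketch matrix P."""
    d = len(P)
    Pt = transpose(P)
    H_small = matmul(matmul(P, H), Pt)          # d x d down-projected matrix
    for i in range(d):
        H_small[i][i] += lam                    # regularize in the subspace (assumption)
    q_small = [sum(p * qi for p, qi in zip(row, q)) for row in P]   # down-project q
    z = solve(H_small, q_small)                 # solve the small d x d system
    return [sum(Pt[i][j] * z[j] for j in range(d)) for i in range(len(q))]  # up-project

rng = random.Random(0)
D, d, lam = 6, 2, 0.02                          # hypothetical sizes
keys = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(12)]
H = [[sum(k[r] * k[c] for k in keys) for c in range(D)] for r in range(D)]
q = [rng.gauss(0, 1) for _ in range(D)]

P = [[rng.gauss(0, 1) / d**0.5 for _ in range(D)] for _ in range(d)]
xi_sketch = sketched_ridge_solve(H, q, lam, P)  # approximate solution in R^D

# Sanity check: with P = identity (d = D) the sketch reproduces the exact solve.
identity = [[1.0 if r == c else 0.0 for c in range(D)] for r in range(D)]
xi_full = sketched_ridge_solve(H, q, lam, identity)
H_reg = [[H[r][c] + (lam if r == c else 0.0) for c in range(D)] for r in range(D)]
xi_exact = solve(H_reg, q)
```

The small system is $d\times d$, so each matrix-vector product inside an iterative solver would cost $\mathcal{O}(d^{2})$ instead of $\mathcal{O}(D^{2})$, at the price of an approximate solution whenever $d<D$.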

Appendix I Hybrid Gated KalmaNet

As discussed in Section A.2, augmenting SSM models with Attention layers has proven to be an effective way of improving performance on tasks that require recalling information from the distant past. In this section, we show that our Gated KalmaNet layer can be interleaved with Attention layers to yield even stronger models. Our Hybrid GKA model is based on the Qwen3 architecture [yang2025qwen3]. Namely, our Hybrid model consists of a stack of “decoder” blocks, each of which contains a sequence mixer (either Attention or GKA) followed by an MLP. Similar to Qwen3, our Attention layers use QK normalization layers. Our Hybrid model consists of 30 decoder blocks, 26 of which use GKA as the sequence mixer and 4 of which use Attention. The Attention decoder blocks are at indices 6, 14, 22, and 29. Our Hybrid models follow the same training procedure as our non-Hybrid models. Specifically, we pretrain our Hybrid model on 100B tokens with a 4K context size, followed by fine-tuning on 25B tokens at a 128K context size.
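The layer layout described above can be written down as a small configuration sketch. The names `NUM_BLOCKS`, `ATTENTION_INDICES`, and `build_layout` are hypothetical, and the indices are assumed to be zero-based, which is consistent with index 29 being the last of 30 blocks.

```python
NUM_BLOCKS = 30
ATTENTION_INDICES = {6, 14, 22, 29}   # assumed zero-based block indices

def build_layout(num_blocks, attention_indices):
    """Return the sequence-mixer type of each decoder block, in order."""
    return ["attention" if i in attention_indices else "gka"
            for i in range(num_blocks)]

layout = build_layout(NUM_BLOCKS, ATTENTION_INDICES)
for index, mixer in enumerate(layout):
    # Each decoder block is a sequence mixer followed by an MLP.
    print(f"block {index:2d}: {mixer} -> mlp")
```

Replacing `ATTENTION_INDICES` with an empty set gives the layout of the non-Hybrid model, in which every block uses GKA.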

When evaluating our pretrained Hybrid model on standard NLP benchmarks, we observe that it improves substantially on recall-oriented tasks (FDA and SWDE) compared to the non-Hybrid model, as shown in Table 9. (Our non-Hybrid model shares the same architecture as the Hybrid model, except that all 4 Attention layers are replaced with GKA layers.) Further, when evaluating our fine-tuned long-context model on tasks that require effective modeling of long-range dependencies, we observe a significant improvement across all context lengths, as shown in Fig. 10.

Table 9: Our Hybrid GKA + Attention model improves language modeling performance. When interleaving Attention layers into our GKA models, we observe a significant improvement on recall-oriented tasks, such as FDA and SWDE, while preserving a similar performance on short-context tasks.

Model	ARC-C (acc_n ↑)	ARC-E (acc_n ↑)	BoolQ (acc ↑)	COPA (acc ↑)	HellaSWAG (acc_n ↑)	PIQA (acc_n ↑)	SciQ (acc_n ↑)	Winogrande (acc ↑)	FDA (contains ↑)	SWDE (contains ↑)	Avg
Gated KalmaNet (Hybrid)	33.02	59.47	64.07	80.00	62.74	74.59	81.40	64.64	53.18	72.46	64.56
Gated KalmaNet	32.51	59.89	61.68	85.00	63.84	74.81	83.20	64.17	12.89	50.95	58.89
