[Transformer] 20. SOFT: Softmax-free Transformer with Linear Complexity

This paper proposes a new method named SOFT, aimed at the computational complexity of self-attention-based Transformers. By removing the softmax operation and introducing a Gaussian kernel function, the authors obtain a softmax-free self-attention that is cheaper to compute. A low-rank regularization based on matrix decomposition further reduces the cost. Experiments show that the method significantly improves Transformer efficiency while maintaining performance.


This paper was published at NeurIPS 2021.

Paper: https://arxiv.org/pdf/2110.11945.pdf
Code: https://github.com/fudan-zvg/SOFT

1. Background

Although self-attention-based Transformers achieve strong results, both their computation and memory scale quadratically with the input size (the number of input tokens, i.e., the input resolution).

The authors argue that this computational limitation comes from the softmax self-attention used to compute the attention probabilities.

Standard self-attention is obtained as a softmax-normalized inner product between token features; keeping this softmax operation makes the subsequent linearization difficult.

Therefore, the authors propose the softmax-free Transformer (SOFT): the softmax is removed from self-attention, the inner product is replaced with a Gaussian kernel function, and the resulting self-attention matrix can then be approximated via low-rank matrix decomposition.


2. Method

2.1 Softmax-free self-attention formulation


Given an input X, the attention module first obtains Q, K, and V through linear projections.
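The projection equations appear only as an image in the original post; the standard linear form is

$$
Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V,
$$

where $X \in \mathbb{R}^{n \times d}$ holds the $n$ token features.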

Self-attention is then computed by weighting the values V with an attention matrix built from Q and K.
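A reconstruction of the equation (it is shown only as an image in the post): the output is the attention map applied to the values,

$$
y_i = \sum_{j=1}^{n} \alpha(q_i, k_j)\, v_j
\qquad\Longleftrightarrow\qquad
Y = \alpha(Q, K)\, V .
$$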

Here $\alpha$ denotes the operation that produces the self-attention map; it is composed of a nonlinear function $\beta$ and a relation function $\gamma$.

In the common instantiation, $\beta$ is a (row-wise) softmax and $\gamma$ is the scaled dot product between Q and K. To simplify the computation and ease linearization, the authors drop the softmax and use a Gaussian kernel as the relation function instead, as reconstructed below.
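A reconstruction of the two formulations (the original post shows them only as images; the $2\sqrt{d}$ scaling follows my reading of the paper and should be checked against the original). The standard attention map is

$$
S = \alpha(Q, K) = \beta\big(\gamma(Q, K)\big),\qquad
\beta = \mathrm{softmax},\quad
\gamma(Q, K) = \frac{QK^{\top}}{\sqrt{d}},
$$

while SOFT makes $\beta$ the identity and uses a Gaussian kernel for $\gamma$:

$$
S_{ij} = \exp\!\left(-\frac{\lVert q_i - k_j \rVert_2^2}{2\sqrt{d}}\right).
$$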

To keep the attention matrix symmetric, the authors use the same projection for the queries and keys (so K = Q), and the paper's self-attention matrix is built from Q alone, as written below.
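With the shared projection, the softmax-free attention matrix (as I reconstruct it) reads

$$
S_{ij} = \exp\!\left(-\frac{\lVert q_i - q_j \rVert_2^2}{2\sqrt{d}}\right),
$$

which is symmetric with $S_{ii} = 1$ and all entries in $(0, 1]$.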

2.2 Low-rank regularization via matrix decomposition

To reduce the computational cost, the authors follow the Nyström method [38] to build a low-rank approximation, so the full self-attention matrix does not have to be computed.

The regularized self-attention matrix $\hat S$ is assembled from the attention between all tokens and a small set of sampled bottleneck tokens, as sketched below.
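As I read the paper, $m$ bottleneck tokens $\tilde X$ are derived from $X$ (via convolution or average pooling), and the full $n \times n$ matrix is approximated by three much smaller factors, with $\dagger$ the Moore–Penrose pseudo-inverse:

$$
\hat S \;=\; S(X, \tilde X)\; S(\tilde X, \tilde X)^{\dagger}\; S(\tilde X, X).
$$

Below is a minimal PyTorch sketch of this idea, not the authors' implementation (see the linked repo for that). Landmark tokens are picked by plain uniform sampling for simplicity, the $2\sqrt{d}$ scaling is assumed from the paper, and all names (`gaussian_kernel_attention`, `soft_attention_nystrom`, `num_landmarks`) are illustrative.

```python
import torch

def gaussian_kernel_attention(q, k, d):
    """Softmax-free scores S_ij = exp(-||q_i - k_j||^2 / (2 * sqrt(d)))."""
    sq_q = (q ** 2).sum(dim=-1, keepdim=True)           # (n, 1)
    sq_k = (k ** 2).sum(dim=-1, keepdim=True).T         # (1, m)
    sq_dist = sq_q + sq_k - 2.0 * q @ k.T               # (n, m) squared distances
    return torch.exp(-sq_dist / (2.0 * d ** 0.5))

def soft_attention_nystrom(x, w_qk, w_v, num_landmarks=49):
    """Gaussian-kernel attention with a shared Q/K projection and a
    Nystrom-style low-rank approximation (uniform landmark sampling here;
    the paper derives bottleneck tokens with convolution / average pooling)."""
    n, _ = x.shape
    q = x @ w_qk                                        # shared projection, so K = Q
    v = x @ w_v
    d = q.shape[-1]
    idx = torch.linspace(0, n - 1, num_landmarks).long()
    q_tilde = q[idx]                                    # (m, d) bottleneck tokens
    a = gaussian_kernel_attention(q, q_tilde, d)        # S(X, X~): (n, m)
    b = gaussian_kernel_attention(q_tilde, q_tilde, d)  # S(X~, X~): (m, m)
    # \hat{S} V = A B^+ (A^T V), evaluated right-to-left so the full
    # n x n attention matrix is never materialized (linear in n).
    return a @ (torch.linalg.pinv(b) @ (a.T @ v))

# Toy usage: 196 tokens (14 x 14 patches) with 64-dim features.
torch.manual_seed(0)
x = torch.randn(196, 64)
w_qk = 0.1 * torch.randn(64, 32)
w_v = 0.1 * torch.randn(64, 32)
out = soft_attention_nystrom(x, w_qk, w_v)
print(out.shape)  # torch.Size([196, 32])
```

Evaluating the product right-to-left keeps every intermediate at size $n \times m$ or smaller, which is where the linear complexity in the number of tokens comes from.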

3. Results

The SOFT variant configurations and the experimental comparisons on accuracy and efficiency are reported in the paper's tables and figures, which are not reproduced here.
