The attention mechanism has seen wide applications in computer vision and natural language processing. Recent works developed the dot-product attention mechanism and applied it to various vision and language tasks. However, the memory and computational costs of dot-product attention grow quadratically with the spatiotemporal size of the input. Such growth prohibits the application of the mechanism on large inputs, e.g., long sequences, high-resolution images, or large videos. To remedy this drawback, this paper proposes a novel efficient attention mechanism, which is equivalent to dot-product attention but has substantially lower memory and computational costs. The resource efficiency allows more widespread and flexible incorporation of efficient attention modules into a neural network, which leads to improved accuracies. Further, the resource efficiency of the mechanism democratizes attention to complicated models, which were unable to incorporate the original dot-product attention due to its prohibitively high costs. As an exemplar, an efficient attention-augmented model achieved state-of-the-art accuracies for stereo depth estimation on the Scene Flow dataset.
Attention is a mechanism in neural networks that focuses on long-range dependency modeling, a key challenge to deep learning that convolution and recurrence struggle to solve. A recent series of works developed the highly successful dot-product attention mechanism, which facilitates easy integration into a deep neural network.
The mechanism computes the response at every position as a weighted sum of features at all positions in the previous layer.
In contrast to the limited spatial and temporal receptive fields of convolution and recurrence,
dot-product attention expands the receptive field to the entire input in one pass.
Using dot-product attention to efficiently model long-range dependencies allows convolution and recurrence to focus on local dependency modeling, in which they specialize. Dot-product attention-based models now hold state-of-the-art records on nearly all tasks in natural language processing [25, 20, 6, 21]. The non-local module [26], an adaptation of dot-product attention for computer vision, achieved state-of-the-art performance on video classification [26] and generative adversarial image modeling [29, 3] and demonstrated significant improvements on object detection [26], instance segmentation [26], person re-identification [14], and image deraining [12], etc.
However, global dependency modeling on large inputs, e.g., long sequences, high-definition images, and large videos, remains an unsolved problem. The quadratic memory and computational complexities of existing dot-product attention modules with respect to the input size inhibit their application on such large inputs. The high memory and computational costs constrain the application of dot-product attention to the low-resolution or short-temporal-span parts of models [26, 29, 3] and prohibit its use for resolution-sensitive or resource-hungry tasks.
The need for global dependency modeling on large inputs strongly motivates the search for a resource-efficient attention algorithm. An investigation into the non-local module revealed an intriguing phenomenon. The attention maps at each position, despite being generated independently, are correlated. As [29] analyzed, the attention map of a position mainly focuses on semantically related regions.
Figure 1 shows the learned attention maps in a non-local module. When generating an image of a bird before a bush, pixels on the legs tend to attend to other leg pixels for structural consistency. Similarly, body pixels mainly attend to the body, and background pixels focus on the bush.
Figure 1. An illustration of the learned attention maps in a non-local module. The first image identifies five query positions with colored dots. Each of the subsequent images illustrates the attention map for one of the positions. Adapted from [29].
This observation inspired the design of the efficient attention mechanism that this paper proposes. The mechanism first generates a key feature map, a query feature map, and a value feature map from an input. It interprets each channel of the key feature map as a global attention map. Using each global attention map as the weights, efficient attention aggregates the value feature map to produce a global context vector that summarizes an aspect of the global features. Then, at each position, the module regards the query feature as a set of coefficients over the global context vectors. Finally, the module computes a sum of the global context vectors with the query feature as the weights to produce the output feature at the position. This algorithm avoids the generation of the pairwise attention matrix, whose size is quadratic in the spatiotemporal size of the input. Therefore, it achieves linear complexities with respect to the input size and obtains significant efficiency improvements.
The efficient attention mechanism, step by step:
1. Generate a key feature map, a query feature map, and a value feature map from the input.
2. Interpret each channel of the key feature map as a global attention map and use it to aggregate the value feature map into a global context vector.
3. At each position, regard the query feature as a set of coefficients over the global context vectors.
4. Sum the global context vectors with those coefficients to produce the output feature at that position.
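To make these steps concrete, below is a minimal NumPy sketch of the softmax variant of the mechanism (the normalization choices follow Section 3.2; all tensor names and sizes are illustrative):

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    """q, k: (n, d_k); v: (n, d_v); n is the number of positions."""
    q = softmax(q, axis=1)   # each position's query becomes coefficients over the d_k context vectors
    k = softmax(k, axis=0)   # each key channel becomes a global attention map over the n positions
    context = k.T @ v        # (d_k, d_v): one global context vector per key channel
    return q @ context       # (n, d_v): per-position weighted sum of the context vectors

# Toy usage: 16 positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
out = efficient_attention(rng.normal(size=(16, 8)),
                          rng.normal(size=(16, 8)),
                          rng.normal(size=(16, 8)))
print(out.shape)  # (16, 8)
```

Note that no n × n matrix is ever formed; the only intermediate is the d_k × d_v matrix of global context vectors.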
The principal contribution of this paper is the efficient attention mechanism, which:
1. has linear memory and computational complexities with respect to the spatiotemporal size of the input;
2. possesses the same representational power as the widely adopted dot-product attention mechanism;
3. allows the incorporation of significantly more attention modules into a neural network, which brings substantial performance boosts to tasks such as object detection and instance segmentation (on MS-COCO 2017) and image classification (on ImageNet); and
4. makes attention affordable for complicated models that could not incorporate dot-product attention due to its prohibitively high costs; as an exemplar, an efficient attention-augmented model achieved state-of-the-art accuracies for stereo depth estimation on the Scene Flow dataset.
2. Related Work
Bahdanau et al. proposed the initial formulation of the dot-product attention mechanism to improve word alignment in machine translation. Successively, [Attention Is All You Need] proposed to completely replace recurrence with attention and named the resultant architecture the Transformer. Transformer-based architectures are highly successful on sequence tasks. They hold the state-of-the-art records on virtually all tasks in natural language processing [6, 21, 28] and are highly competitive on end-to-end speech recognition [7, 19]. [26] first adapted dot-product attention for computer vision and proposed the non-local module. It achieved state-of-the-art performance on video classification and demonstrated significant improvements on object detection, instance segmentation, and pose estimation. Subsequent works applied it to various fields in computer vision, including image restoration [Non-local recurrent network for image restoration], and substantially advanced the state-of-the-art using the non-local module.
Efficient attention mainly builds upon the version of dot-product attention in the non-local module. Following [26], this paper conducts most of its experiments on object detection and instance segmentation. The paper compares the resource efficiency of the efficient attention module against the non-local module under the same performance, and their performance under the same resource constraints.
Besides dot-product attention, there is a separate set of techniques that the literature refers to as attention. This section refers to them as scaling attention. While dot-product attention is effective for global dependency modeling, scaling attention focuses on emphasizing important features and suppressing uninformative ones. For example, the squeeze-and-excitation (SE) module uses global average pooling and a linear layer to compute a scaling factor for each channel and then scales the channels accordingly. SE-enhanced models achieved state-of-the-art performance on image classification and substantial improvements on scene segmentation and object detection. On top of SE, CBAM added global max pooling beside global average pooling and an extra spatial attention submodule, which further improved SE's performance.
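For concreteness, a minimal PyTorch-style sketch of the channel-scaling idea described above (names and sizes are illustrative; the single linear layer follows the text's description, whereas the original SE block uses a two-layer bottleneck):

```python
import torch
import torch.nn as nn

class ChannelScaling(nn.Module):
    """Scaling attention: compute one scaling factor per channel and rescale the channels."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                       # x: (batch, channels, height, width)
        pooled = x.mean(dim=(2, 3))             # global average pooling -> (batch, channels)
        scale = torch.sigmoid(self.fc(pooled))  # per-channel scaling factors in (0, 1)
        return x * scale[:, :, None, None]      # emphasize or suppress each channel

# Toy usage:
print(ChannelScaling(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```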
Despite both names containing attention,
dot-product attention and scaling attention are two completely separate sets of techniques with very different goals.
When appropriate, one might take both techniques and let them work in conjunction.
Therefore, this paper does not compare efficient attention against scaling attention techniques.
3. Method
This section introduces the efficient attention mechanism. It is mathematically equivalent to the widely adopted dot-product attention mechanism in computer vision (i.e., the attention mechanism in the Transformer and the non-local module). However, efficient attention has linear memory and computational complexities with respect to the number of pixels or words (hereafter referred to as positions).
Section 3.1 reviews the dot-product attention mechanism and identifies its critical drawback on large inputs to motivate efficient attention. The introduction of the efficient attention mechanism is in Section 3.2. Section 3.3 shows the equivalence between dot-product and efficient attention. Section 3.4 discusses the interpretation of the mechanism. Section 3.5 analyzes its efficiency advantage over dot-product attention.
3.1. Dot-Product Attention
Bahdanau et al. initially proposed dot-product attention for machine translation. Subsequently, the Transformer [Attention Is All You Need] adopted the mechanism to model long-range temporal dependencies between words. The non-local module [26] introduced dot-product attention for the modeling of long-range dependencies between pixels in image and video understanding.
For each input feature vector $x_i \in \mathbb{R}^d$ that corresponds to the $i$-th position, dot-product attention first uses three linear layers to convert $x_i$ into three feature vectors, i.e., the query feature $q_i \in \mathbb{R}^{d_k}$, the key feature $k_i \in \mathbb{R}^{d_k}$, and the value feature $v_i \in \mathbb{R}^{d_v}$. The query and key features must have the same feature dimension $d_k$. One can measure the similarity between the $i$-th query and the $j$-th key as $s_{ij} = \rho(q_i^{\top} k_j)$, where $\rho$ is a normalization function. In general, the similarities are asymmetric, since the query and key features are the outputs of two separate layers. The dot-product attention module calculates the similarities between all pairs of positions. Using the similarities as weights, position $i$ aggregates the value features from all positions via weighted summation to obtain its output feature.
Writing the query, key, and value features of all $n$ positions in matrix form as $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$, respectively, the output of dot-product attention is
$$D(Q, K, V) = \rho\left(Q K^{\top}\right) V. \tag{1}$$
The normalization function $\rho$ has two common choices:
$$\rho(Y) = \sigma_{\mathrm{row}}(Y) \ \text{(softmax normalization)} \qquad \text{or} \qquad \rho(Y) = \frac{Y}{n} \ \text{(scaling normalization)}, \tag{2}$$
where $\sigma_{\mathrm{row}}(Y)$ denotes applying the softmax function along each row of matrix $Y$. An illustration of the dot-product attention module is in Figure 2 (left).
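As a reference point for the comparison that follows, here is a minimal NumPy sketch of Equation (1) with the softmax normalization (matrix names follow the text; sizes are illustrative):

```python
import numpy as np

def softmax_rows(y):
    y = y - y.max(axis=1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=1, keepdims=True)

def dot_product_attention(q, k, v):
    """Equation (1): D(Q, K, V) = rho(Q K^T) V with softmax normalization.
    q, k: (n, d_k); v: (n, d_v)."""
    s = q @ k.T                  # (n, n) pairwise similarity matrix -- the quadratic-cost term
    return softmax_rows(s) @ v   # each row of softmax_rows(s) is one position's attention map
```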
The critical drawback of this mechanism is its resource demands. Since it computes a similarity between each pair of positions, there are $n^2$ such similarities, which results in $O(n^2)$ memory complexity and $O(d_k n^2)$ computational complexity. Therefore, dot-product attention's resource demands get prohibitively high on large inputs. In practice, application of the mechanism is only possible on low-resolution features.
Figure 2. Illustration of the architectures of dot-product and efficient attention. Each box represents an input, output, or intermediate matrix. Above each box is the name of the corresponding matrix. The variable name and the size of the matrix are inside each box. $\otimes$ denotes matrix multiplication.
3.2. Efficient Attention
Observing the critical drawback of dot-product attention, this paper proposes the efficient attention mechanism, which is mathematically equivalent to dot-product attention but much faster and more memory-efficient. In efficient attention, the individual feature vectors $x_i$ still pass through three linear layers to form the query features $Q \in \mathbb{R}^{n \times d_k}$, key features $K \in \mathbb{R}^{n \times d_k}$, and value features $V \in \mathbb{R}^{n \times d_v}$. However, instead of interpreting the key features as $n$ feature vectors in $\mathbb{R}^{d_k}$, the module regards them as $d_k$ single-channel feature maps. Efficient attention uses each of these feature maps as a weighting over all positions and aggregates the value features from all positions through weighted summation to form a global context vector. The name reflects the fact that the vector does not correspond to a specific position, but is a global description of the input features.
The following equation characterizes the efficient attention mechanism:
$$E(Q, K, V) = \rho_q(Q)\left(\rho_k(K)^{\top} V\right), \tag{3}$$
where $\rho_q$ and $\rho_k$ are normalization functions for the query and key features, respectively. The implementations of the same two normalization methods as for dot-product attention are
$$\rho_q(Y) = \sigma_{\mathrm{row}}(Y),\ \rho_k(Y) = \sigma_{\mathrm{col}}(Y) \ \text{(softmax normalization)}; \qquad \rho_q(Y) = \rho_k(Y) = \frac{Y}{\sqrt{n}} \ \text{(scaling normalization)}, \tag{4}$$
where $\sigma_{\mathrm{row}}$ and $\sigma_{\mathrm{col}}$ denote applying the softmax function along each row or column of matrix $Y$, respectively. The efficient attention module is a concrete implementation of the mechanism for computer vision data. For an input feature map $X \in \mathbb{R}^{d \times h \times w}$, the module flattens it to a matrix of shape $n \times d$ (with $n = hw$), applies the efficient attention mechanism on it, and reshapes the result to $d_v \times h \times w$. If $d_v \neq d$, it further applies a $1 \times 1$ convolution to restore the dimensionality to $d$. Finally, it adds the resultant features to the input features to form a residual structure.
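A minimal PyTorch sketch of the module just described, under illustrative assumptions ($1 \times 1$ convolutions as the three linear layers, the softmax normalization variant, a single head, a batch dimension added); an actual implementation may differ in details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientAttention(nn.Module):
    """Sketch of the efficient attention module for an input feature map of shape
    (batch, d, h, w), following Equations (3)-(4) with the softmax normalization."""
    def __init__(self, d: int, d_k: int, d_v: int):
        super().__init__()
        self.to_q = nn.Conv2d(d, d_k, kernel_size=1)   # the three linear layers as 1x1 convolutions
        self.to_k = nn.Conv2d(d, d_k, kernel_size=1)
        self.to_v = nn.Conv2d(d, d_v, kernel_size=1)
        # restore the channel dimensionality to d when d_v != d
        self.restore = nn.Conv2d(d_v, d, kernel_size=1) if d_v != d else nn.Identity()

    def forward(self, x):                              # x: (b, d, h, w)
        b, _, h, w = x.shape
        q = self.to_q(x).flatten(2)                    # (b, d_k, n) with n = h * w
        k = self.to_k(x).flatten(2)                    # (b, d_k, n)
        v = self.to_v(x).flatten(2)                    # (b, d_v, n)
        q = F.softmax(q, dim=1)                        # rho_q: normalize each position's query over d_k channels
        k = F.softmax(k, dim=2)                        # rho_k: each key channel becomes a global attention map
        context = k @ v.transpose(1, 2)                # (b, d_k, d_v): global context vectors
        out = context.transpose(1, 2) @ q              # (b, d_v, n): mix the context vectors per position
        out = self.restore(out.reshape(b, -1, h, w))   # reshape to (b, d_v, h, w) and restore channels
        return x + out                                 # residual connection

# Toy usage:
print(EfficientAttention(d=64, d_k=32, d_v=32)(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```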
3.3. Equivalence between Dot-Product and Efficient Attention
Following is a formal proof of the equivalence between dot-product and efficient attention when using scaling normalization. Substituting the scaling normalization formula in Equation (2) into Equation (1) gives
$$D(Q, K, V) = \frac{Q K^{\top}}{n} V. \tag{5}$$
Similarly, plugging the scaling normalization formulae in Equation (4) into Equation (3) results in
$$E(Q, K, V) = \frac{Q}{\sqrt{n}} \left( \left(\frac{K}{\sqrt{n}}\right)^{\!\top} V \right). \tag{6}$$
Since scalar multiplication commutes with matrix multiplication and matrix multiplication is associative, we have
$$E(Q, K, V) = \frac{Q K^{\top}}{n} V. \tag{7}$$
Comparing Equations (5) and (7), we get
$$E(Q, K, V) = D(Q, K, V). \tag{8}$$
Thus, the proof is complete.
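The equivalence also holds numerically. A small NumPy check under the scaling normalization (random matrices of illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 32, 8, 16
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

# Equation (5): dot-product attention with scaling normalization
D = (Q @ K.T / n) @ V
# Equation (6): efficient attention with scaling normalization
E = (Q / np.sqrt(n)) @ ((K / np.sqrt(n)).T @ V)

print(np.allclose(D, E))  # True, up to floating-point error
```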
The above proof works for the softmax normalization variant with one caveat. The two softmax operations on $Q$ and $K$ are not exactly equivalent to the single softmax on $Q K^{\top}$. However, they closely approximate the effect of the original softmax function. The critical property of $\sigma_{\mathrm{row}}(Q K^{\top})$ is that each row of it sums up to 1 and represents a normalized attention distribution over all positions. The matrix $\sigma_{\mathrm{row}}(Q)\,\sigma_{\mathrm{col}}(K)^{\top}$ shares this property. Therefore, the softmax variant of efficient attention is a close approximation of that variant of dot-product attention. Section 4.1 demonstrates this claim empirically.
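The row-normalization property claimed above is easy to verify numerically; the sketch below forms the implicit attention matrix $\sigma_{\mathrm{row}}(Q)\,\sigma_{\mathrm{col}}(K)^{\top}$ only for inspection (the module itself never materializes it):

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 32, 8
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))

implicit_attention = softmax(Q, axis=1) @ softmax(K, axis=0).T  # (n, n), never formed by the module
print(np.allclose(implicit_attention.sum(axis=1), 1.0))         # True: each row sums to 1
print((implicit_attention >= 0).all())                          # True: non-negative attention weights
```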
3.4. Interpretation of Efficient Attention
This section discusses the interpretation of the attention mechanism. In dot-product attention, selecting position $i$ as the reference position, one can collect the similarities of all positions to position $i$ and form an attention map $a_i$ for that position. The attention map $a_i$ represents the degree to which position $i$ attends to each position $j$ in the input. A higher value for position $j$ on $a_i$ means position $i$ attends more to position $j$. In dot-product attention, every position $i$ has such an attention map $a_i$, which the mechanism uses to aggregate the value features $V$ to produce the output at position $i$.
In contrast, efficient attention does not generate an attention map for each position. Instead, it interprets the key features $K$ as $d_k$ attention maps $k^0, k^1, \ldots, k^{d_k-1}$. Each $k^j$ is a global attention map that does not correspond to any specific position. Instead, each of them corresponds to a semantic aspect of the entire input. For example, one such attention map might cover the persons in the input. Another might correspond to the background. Section 4.3 gives several concrete examples. Efficient attention uses each $k^j$ to aggregate the value features $V$ and produce a global context vector $g_j$. Since $k^j$ describes a global, semantic aspect of the input, $g_j$ also summarizes a global, semantic aspect of the input. Then, position $i$ uses $q_i$ as a set of coefficients over $g_0, g_1, \ldots, g_{d_k-1}$. Using the previous example, a person pixel might place a large weight on the global context vector for persons to refine its representation. A pixel at the boundary of an object might have large weights on the global context vectors for both the object and the background to enhance the contrast.
3.5. Efficiency Advantage
This section analyzes the efficiency advantage of efficient attention over dot-product attention in memory and computation. The reason behind the efficiency advantage is that efficient attention does not compute a similarity between each pair of positions, which would occupy $O(n^2)$ memory and require $O(d_k n^2)$ computation to generate. Instead, it only generates $d_k$ global context vectors in $\mathbb{R}^{d_v}$. This change eliminates the $O(n^2)$ terms from both the memory and computational complexities of the module. Consequently, efficient attention has $O\!\left(n(d_k + d_v) + d_k d_v\right)$ memory complexity and $O(n d_k d_v)$ computational complexity, both linear in $n$. Table 1 shows complexity formulae of the efficient attention module and the non-local module (using dot-product attention) in detail. In computer vision, this complexity difference is very significant. Firstly, $n$ itself is quadratic in the image side length and often very large in practice. Secondly, $d_k$ is a parameter of the module, which the designer of a network can tune to meet different resource requirements. Section 4.2.3 shows that, within a reasonable range, this parameter has minimal impact on performance. This result means that an efficient attention module can typically have a small $d_k$, which further increases its efficiency advantage over dot-product attention. Table 2 compares the complexities of the efficient attention module with those of the ResBlock. The table shows that the resource demands of the efficient attention module are on par with (less than, in most cases) those of the ResBlock, which gives an intuitive idea of the level of efficiency of the module.
Table 1. Comparison of the resource usage of the efficient attention and non-local modules. The table assumes the same setting of $d_k$ and $d_v$ as all experiments in Section 4, which is also a common setting in the literature for dot-product attention.
The rest of this section gives several concrete examples comparing the resource demands of the efficient attention and non-local modules. Figure 3 compares their resource consumption for image features of different sizes. Directly substituting for the non-local module on the 64 × 64 feature map in SAGAN [29] yields a 17-fold saving of memory and a 32-fold saving of computation. The gap widens rapidly as the input size increases. For a 256 × 256 feature map, a non-local module would require impractical amounts of memory (17.2 GB) and computation (412 GMACC). With the same input size, an efficient attention module uses 1/260 the amount of memory and 1/515 the amount of computation. The difference is more prominent for videos. Replacing the non-local module on the tiny 28 × 28 × 4 feature volume in res3 of the non-local I3D-ResNet-50 network [26] results in a two-fold memory and computational saving. On a larger 64 × 64 × 32 feature volume, an efficient attention module requires 1/32 the amount of memory and 1/1025 the amount of computation.
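The 17.2 GB figure above can be reproduced with back-of-the-envelope arithmetic: for a 256 × 256 feature map, $n = 65{,}536$, and the pairwise attention matrix alone has $n^2$ entries (the snippet below assumes 4-byte floats and counts only that matrix):

```python
# Memory for the n x n pairwise attention matrix alone, assuming 4-byte floats.
for side in (64, 128, 256):
    n = side * side
    gigabytes = n * n * 4 / 1e9
    print(f"{side} x {side} feature map: n = {n}, attention matrix needs about {gigabytes:.2f} GB")
# 64 x 64: ~0.07 GB;  128 x 128: ~1.07 GB;  256 x 256: ~17.18 GB
```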
Table 2. Comparison of the resource usage of the efficient attention module and the ResBlock. Since the ResBlock does not have the parameters $d_k$ and $d_v$, this table sets them to their typical values.
Figure 3. Resource requirements under different input sizes. The blue and orange bars depict the resource requirements of the efficient attention and non-local modules, respectively. The calculation assumes the same settings of $d_k$ and $d_v$ as in Table 1. The figure is in log scale.