Allows the model to jointly attend to information from different representation subspaces. See reference: Attention Is All You Need
Usage
nnf_multi_head_attention_forward(
query,
key,
value,
embed_dim_to_check,
num_heads,
in_proj_weight,
in_proj_bias,
bias_k,
bias_v,
add_zero_attn,
dropout_p,
out_proj_weight,
out_proj_bias,
training = TRUE,
key_padding_mask = NULL,
need_weights = TRUE,
attn_mask = NULL,
avg_weights = TRUE,
use_separate_proj_weight = FALSE,
q_proj_weight = NULL,
k_proj_weight = NULL,
v_proj_weight = NULL,
static_k = NULL,
static_v = NULL,
batch_first = FALSE
)
Arguments
- query
\((L, N, E)\) where L is the target sequence length, N is the batch size, E is the embedding dimension. If batch_first is TRUE, the first two dimensions are transposed.
- key
\((S, N, E)\), where S is the source sequence length, N is the batch size, E is the embedding dimension. If batch_first is TRUE, the first two dimensions are transposed.
- value
\((S, N, E)\) where S is the source sequence length, N is the batch size, E is the embedding dimension. If batch_first is TRUE, the first two dimensions are transposed.
- embed_dim_to_check
total dimension of the model.
- num_heads
number of parallel attention heads.
- in_proj_weight
input projection weight.
- in_proj_bias
input projection bias.
- bias_k
bias of the key sequence to be added at dim=0.
- bias_v
bias of the value sequence to be added at dim=0.
- add_zero_attn
add a new batch of zeros to the key and value sequences at dim=1.
- dropout_p
probability of an element to be zeroed.
- out_proj_weight
the output projection weight.
- out_proj_bias
output projection bias.
- training
apply dropout if TRUE.
- key_padding_mask
\((N, S)\) where N is the batch size, S is the source sequence length. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, positions with the value of TRUE will be ignored while positions with the value of FALSE will be unchanged.
- need_weights
Logical; whether to output attn_output_weights.
- attn_mask
2D mask \((L, S)\) where L is the target sequence length, S is the source sequence length, or 3D mask \((N*num_heads, L, S)\) where N is the batch size, L is the target sequence length, S is the source sequence length. attn_mask ensures that position i is allowed to attend only the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with the value of TRUE are not allowed to attend while positions with the value of FALSE will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
- avg_weights
Logical; whether to average attn_output_weights over the attention heads before outputting them. This doesn't change the returned value of attn_output; it only affects the returned attention weight matrix.
- use_separate_proj_weight
the function accepts the projection weights for query, key, and value in different forms. If FALSE, in_proj_weight will be used, which is a combination of q_proj_weight, k_proj_weight, and v_proj_weight.
- q_proj_weight
query projection weight, used when use_separate_proj_weight is TRUE.
- k_proj_weight
key projection weight, used when use_separate_proj_weight is TRUE.
- v_proj_weight
value projection weight, used when use_separate_proj_weight is TRUE.
- static_k
static key tensor used for the attention operator.
- static_v
static value tensor used for the attention operator.
- batch_first
Logical; whether to expect query, key, and value with the batch size as their first dimension, and to return the output with the batch dimension first.
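Examples

A minimal sketch of a call, assuming the torch package is attached. The dimensions, random weights, zero biases, the padding mask pattern, and the use of NULL for bias_k and bias_v are illustrative assumptions, not values prescribed by this documentation.

library(torch)

embed_dim <- 16   # E; assumed divisible by num_heads
num_heads <- 4
tgt_len   <- 5    # L, target sequence length
src_len   <- 7    # S, source sequence length
batch     <- 2    # N, batch size

# Default layout is (L, N, E) / (S, N, E) because batch_first = FALSE
query <- torch_randn(tgt_len, batch, embed_dim)
key   <- torch_randn(src_len, batch, embed_dim)
value <- torch_randn(src_len, batch, embed_dim)

# in_proj_weight stacks the query, key, and value projections row-wise
in_proj_weight <- torch_randn(3 * embed_dim, embed_dim)
in_proj_bias   <- torch_zeros(3 * embed_dim)

out_proj_weight <- torch_randn(embed_dim, embed_dim)
out_proj_bias   <- torch_zeros(embed_dim)

# BoolTensor key_padding_mask of shape (N, S): mask the last two source
# positions of the second batch element (an illustrative padding pattern)
key_padding_mask <- torch_tensor(rbind(
  rep(FALSE, src_len),
  c(rep(FALSE, src_len - 2), TRUE, TRUE)
))

out <- nnf_multi_head_attention_forward(
  query, key, value,
  embed_dim_to_check = embed_dim,
  num_heads = num_heads,
  in_proj_weight = in_proj_weight,
  in_proj_bias = in_proj_bias,
  bias_k = NULL,
  bias_v = NULL,
  add_zero_attn = FALSE,
  dropout_p = 0,
  out_proj_weight = out_proj_weight,
  out_proj_bias = out_proj_bias,
  training = FALSE,
  key_padding_mask = key_padding_mask
)

# out is expected to be a list: attn_output of shape (L, N, E) and
# attn_output_weights of shape (N, L, S) when avg_weights = TRUE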