In the multi-head attention implementation, a ReLU is applied to the queries, keys, and values. Is this correct? The paper does not mention a ReLU in Eq. 5. Also, it seems the ReLU would make every entry of the attention score matrix non-negative.
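For reference, here is a minimal sketch of the pattern being asked about. This is illustrative PyTorch, not this repo's actual code; all names and hyperparameters are assumptions. It shows why applying ReLU to the query and key projections forces the pre-softmax scores to be non-negative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentionWithReLU(nn.Module):
    """Illustrative sketch of multi-head attention with a ReLU applied
    to the Q/K/V projections, as described in the question above."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # The ReLU in question: every entry of q and k becomes >= 0 ...
        q = F.relu(self.w_q(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = F.relu(self.w_k(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = F.relu(self.w_v(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # ... so every pre-softmax score in q @ k^T is a sum of products
        # of non-negative numbers, hence non-negative, which is the
        # effect pointed out in the question.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)

# Quick check that the raw scores are indeed always non-negative.
mha = MultiHeadAttentionWithReLU(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)
with torch.no_grad():
    q = F.relu(mha.w_q(x))
    k = F.relu(mha.w_k(x))
    print((q @ k.transpose(-2, -1) >= 0).all())  # tensor(True)
```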