Hierarchical Attention Networks

Sources disagree on who first proposed self-attention: Attention Is All You Need attributes it to Jakob (2016), while An Attentive Survey of Attention Models attributes it to Yang et al. 2016. This note covers the latter.

Introduction

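Core ideas:

Hierarchical structure: first build word-to-sentence representations, then aggregate them to the document level, i.e. sentence to document.
Attention: the information a word or sentence carries, and how important it is, depends on context. To account for this, the authors use two levels of attention.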

The authors never use the term self-attention, presumably because they had not yet realized how powerful the idea was.

Hierarchical Attention Networks

Encoder

The encoder is built on GRUs; the principle and structure of the GRU are omitted here.

Hierarchical Attention

Notation

sentences: $\vec{s_i}$, $i = 1, 2, \dots, L$
words: $w_{it}$, $t \in [1, T]$; sentence $\vec{s_i}$ contains $T$ words

Word Encoder

Embed the words first, run them through a bidirectional GRU, then concatenate the hidden states of the two directions.

word embedding: $W_e$, $x_{it}=W_e w_{it}$
forward GRU: $\overrightarrow{h}_{it}=\overrightarrow{GRU}(x_{it}),\ t \in [1, T]$
backward GRU: $\overleftarrow{h}_{it}=\overleftarrow{GRU}(x_{it}),\ t \in [T, 1]$
concatenate: $h_{it}=[\overrightarrow{h}_{it},\overleftarrow{h}_{it}]$
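
The direction handling and the concatenation can be sketched in a few lines of plain Python. This is only an illustration: `toy_cell` is a hypothetical stand-in (a plain tanh recurrence, not a GRU), and the names `run` and `bidirectional` are assumptions of this sketch rather than anything from the paper.

```python
import math

def toy_cell(x, h, w_x, w_h):
    """Stand-in recurrent step (plain tanh RNN, NOT a real GRU).

    Works elementwise, so x and h must have the same length.
    """
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h)]

def run(cell, xs, h0):
    """Apply `cell` over xs in the given order; return the state at every step."""
    states, h = [], h0
    for x in xs:
        h = cell(x, h)
        states.append(h)
    return states

def bidirectional(cell_fwd, cell_bwd, xs, h0):
    """Return h_t = [forward state, backward state] for every position t."""
    fwd = run(cell_fwd, xs, h0)               # reads t = 1..T
    bwd = run(cell_bwd, xs[::-1], h0)[::-1]   # reads t = T..1, then re-aligned to 1..T
    return [f + b for f, b in zip(fwd, bwd)]  # list "+" concatenates the two halves

# Usage: three 2-dimensional inputs give three 4-dimensional annotations.
xs = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
hs = bidirectional(lambda x, h: toy_cell(x, h, 0.5, 0.8),
                   lambda x, h: toy_cell(x, h, -0.3, 0.6),
                   xs, [0.0, 0.0])
```

Each annotation is twice as long as a single-direction hidden state, which is why the later attention layers operate on the concatenated size.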

Word Attention

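Extract the words that matter most to the meaning of the sentence and aggregate them into a sentence vector. First, every $h_{it}$ ($t \in [1, T]$) passes through a fully connected layer to produce the key $u_{it}$. Next, its similarity to a randomly initialized query $u_w$ is computed and normalized into a distribution $\alpha$. Finally, the original $h_{it}$ serve as the values, and their weighted sum gives the sentence vector $s_i$. All of the information comes from $h_{it}$.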

$\begin{aligned} u_{it} &= \tanh(W_w h_{it} + b_w) && \text{(FC layer)} \\ \alpha_{it} &= \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t'} \exp(u_{it'}^{\top} u_w)} && \text{(measure similarity and normalize)} \\ s_i &= \sum_{t} \alpha_{it} h_{it} && \text{(weighted sum)} \end{aligned}$

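Here $u_w$ (the word context vector) is randomly initialized and then learned during training. It can be treated as a fixed query that represents which information in the sentence is important.

Dimensions: each sentence yields a single vector $s_i$, whose length equals that of $h_{it}$, the concatenated BiGRU hidden state of a single word (not necessarily the length of the word embedding $x_{it}$).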
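
The three steps (FC layer, similarity plus softmax, weighted sum) map directly onto code. The following is a minimal pure-Python sketch, not the authors' implementation; the function name `attention_pool` and the toy numbers are assumptions made for illustration, and the max-shift inside the softmax is a standard numerical-stability trick that does not change the result.

```python
import math

def attention_pool(H, W, b, u_ctx):
    """Attention pooling over a sequence of hidden vectors.

    H:     list of T hidden vectors h_t, each of length d
    W, b:  FC layer parameters (W is d_a x d, b has length d_a)
    u_ctx: context vector (the learned query), length d_a
    Returns (alpha, s): the T attention weights and the pooled vector of length d.
    """
    # Keys: u_t = tanh(W h_t + b)
    U = [[math.tanh(sum(w * x for w, x in zip(row, h_t)) + b_k)
          for row, b_k in zip(W, b)]
         for h_t in H]
    # Similarity of every key to the query: u_t^T u_ctx
    scores = [sum(u * c for u, c in zip(u_t, u_ctx)) for u_t in U]
    # Softmax over positions (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Values are the ORIGINAL hidden states, not the keys
    d = len(H[0])
    s = [sum(a * h_t[k] for a, h_t in zip(alpha, H)) for k in range(d)]
    return alpha, s

# Usage with toy numbers: T = 3 words, d = 2, d_a = 3.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
alpha, s = attention_pool(H, W, [0.0, 0.0, 0.0], [1.0, 0.5, -0.5])
```

Note that the pooled vector has length `d` (the size of $h_{it}$), regardless of the size `d_a` chosen for the keys and the query.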

Sentence Encoder

The sentence encoder mirrors the word encoder: run the sentence vectors through a bidirectional GRU, then concatenate.

forward GRU: $\overrightarrow{h}_{i}=\overrightarrow{GRU}(s_{i}),\ i \in [1, L]$
backward GRU: $\overleftarrow{h}_{i}=\overleftarrow{GRU}(s_{i}),\ i \in [L, 1]$
concatenate: $h_{i}=[\overrightarrow{h}_{i},\overleftarrow{h}_{i}]$

Sentence Attention

This part works exactly like Word Attention, only one level up.

$\begin{aligned} u_{i} &= \tanh(W_s h_{i} + b_s) && \text{(FC layer)} \\ \alpha_{i} &= \frac{\exp(u_{i}^{\top} u_s)}{\sum_{i'} \exp(u_{i'}^{\top} u_s)} && \text{(measure similarity and normalize)} \\ v &= \sum_{i} \alpha_{i} h_{i} && \text{(weighted sum)} \end{aligned}$

This represents the whole document as a single vector $v$, whose length equals that of $h_{i}$, the concatenated BiGRU hidden state of a single sentence.
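
To make the two-level structure concrete, here is a rough sketch that applies the same pooling routine twice: once over the words of each sentence, once over the resulting sentence vectors. It is a simplification under stated assumptions: both BiGRU encoders are omitted (the inputs are taken to be word-level hidden states already, and the sentence vectors are fed straight into sentence attention), and the names `attend` and `document_vector` are hypothetical.

```python
import math

def attend(H, W, b, u_ctx):
    """Attention pooling: tanh FC layer, softmax over scores, weighted sum of H."""
    U = [[math.tanh(sum(w * x for w, x in zip(row, h)) + b_k)
          for row, b_k in zip(W, b)] for h in H]
    scores = [sum(u * c for u, c in zip(u_t, u_ctx)) for u_t in U]
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    pooled = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(H[0]))]
    return alpha, pooled

def document_vector(doc, word_params, sent_params):
    """doc: list of sentences, each a list of word-level hidden vectors h_it.

    word_params / sent_params: (W, b, u_ctx) tuples for the two attention levels.
    ASSUMPTION: encoders are skipped, so s_i is used directly in place of h_i.
    """
    # Level 1: one vector s_i per sentence (sentences may differ in length)
    sent_vecs = [attend(words, *word_params)[1] for words in doc]
    # Level 2: one vector v for the whole document
    sent_alpha, v = attend(sent_vecs, *sent_params)
    return sent_alpha, v

# Usage: a document with two sentences of 3 and 2 words, hidden size 2.
doc = [[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
       [[0.2, -0.4], [0.9, 0.1]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.0, 0.0], [1.0, -1.0])
sent_alpha, v = document_vector(doc, params, params)
```

In the actual model the two levels have separate parameters ($W_w, b_w, u_w$ and $W_s, b_s, u_s$); sharing `params` above is only to keep the usage example short.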

Document Classification

This part is simple: the document vector $v$ goes through a linear layer followed by softmax, and the model is trained with log loss.

$\begin{aligned} p &= \mathrm{softmax}(W_c v + b_c) && \text{(softmax)} \\ L &= -\sum_{d} \log p_{dj} && \text{(log loss)} \end{aligned}$

Here $j$ is the label of document $d$; the loss is computed only on the correct label.
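
As a quick sanity check of the two formulas, the sketch below computes the class distribution and the summed log loss in plain Python. The function names and toy values are assumptions for illustration; with all-zero classifier weights the prediction is uniform, so the loss should equal the number of documents times $\log$ of the number of classes.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify(v, W_c, b_c):
    """p = softmax(W_c v + b_c) for a single document vector v."""
    logits = [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(W_c, b_c)]
    return softmax(logits)

def log_loss(doc_vectors, labels, W_c, b_c):
    """L = -sum_d log p_dj, where j is the correct label of document d."""
    return -sum(math.log(classify(v, W_c, b_c)[j])
                for v, j in zip(doc_vectors, labels))

# Usage: 2 documents, 3 classes, untrained (all-zero) classifier.
docs = [[0.3, -0.2], [1.0, 0.5]]
labels = [0, 2]
W_c = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_c = [0.0, 0.0, 0.0]
loss = log_loss(docs, labels, W_c, b_c)  # equals 2 * log(3) for uniform predictions
```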

Results and analysis

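Two documents from Yelp 2013: the one on the left is a positive review with 5 stars, and the one on the right is a negative review labeled 0 (the lowest rating). The model captures which words are important.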