Sequence Model - Sequence Models & Attention Mechanism
2021/8/23 23:36:39
Various Sequence To Sequence Architectures
Basic Models
Sequence to sequence model
Image captioning
first run the image through a CNN (e.g. AlexNet) with the final softmax removed to get a 4096-dimensional feature vector, then feed that vector into an RNN that generates the caption word by word
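A toy numpy sketch of the pipeline, with made-up vocabulary and hidden-state sizes and random untrained weights (so the output word ids are meaningless); the point is only the wiring: the CNN feature initializes the RNN state, then words are decoded greedily one at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden, vocab = 4096, 128, 10  # 4096-d CNN feature, tiny demo vocab

# Hypothetical weights; in practice these are learned jointly with the CNN.
W_init = rng.normal(0, 0.01, (hidden, feat_dim))  # image feature -> h0
W_hh = rng.normal(0, 0.01, (hidden, hidden))      # recurrent weights
W_xh = rng.normal(0, 0.01, (hidden, vocab))       # previous word -> hidden
W_hy = rng.normal(0, 0.01, (vocab, hidden))       # hidden -> word scores

def caption(image_feature, steps=5):
    """Greedy decoding sketch: the 4096-d CNN vector initializes the RNN
    state, then one word id is emitted per step (fed back in one-hot)."""
    h = np.tanh(W_init @ image_feature)
    x = np.zeros(vocab)
    words = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h + W_xh @ x)
        w = int(np.argmax(W_hy @ h))  # most likely next word id
        words.append(w)
        x = np.zeros(vocab)
        x[w] = 1.0
    return words
```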
Picking the Most Likely Sentence
Translate a French sentence \(x\) into the most likely English sentence \(y\); that is, find
\[\argmax_{y^{<1>}, \dots, y^{<T_y>}} P(y^{<1>}, \dots, y^{<T_y>} | x) \]
Why not a greedy search (picking the most likely next word one at a time)? Because maximizing each word locally does not maximize the joint probability: common words tend to get picked, yielding wordier, less likely sentences such as "Jane is going to be visiting Africa in September" instead of "Jane is visiting Africa in September".
Beam Search
- Set the beam width \(B = 3\) and keep the \(3\) most likely first words.
- For each of those, consider every possible next word, and keep the \(B\) most likely partial sentences overall.
- Repeat until \(<EOS>\) is generated.

If \(B = 1\), beam search reduces to greedy search.
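The loop above can be sketched in a few lines; `step_log_probs` is a hypothetical callback standing in for the decoder, mapping a prefix to \(\log P(\text{word} \mid x, \text{prefix})\) for every candidate next word.

```python
import math

def beam_search(step_log_probs, B=3, max_len=10, eos="<EOS>"):
    """Minimal beam search sketch over a word-level decoder."""
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        # Extend every surviving prefix by every candidate next word.
        candidates = []
        for prefix, score in beams:
            for word, lp in step_log_probs(tuple(prefix)).items():
                candidates.append((prefix + [word], score + lp))
        # Keep only the B highest-scoring partial sentences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    completed.extend(beams)
    return max(completed, key=lambda c: c[1])
```

Keeping the top \(B\) extensions at every step is what distinguishes this from greedy search; with \(B = 1\) the sort keeps only the single best extension.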
Refinements to beam search
Length normalization
\[\argmax_{y} \prod_{t = 1}^{T_y} P(y^{<t>}|x, y^{<1>}, \dots, y^{<t - 1>}) \]
Each \(P\) is much less than \(1\) (close to \(0\)), so the product underflows numerically; take the \(\log\):
\[\argmax_{y} \sum_{t = 1}^{T_y} \log P(y^{<t>}|x, y^{<1>}, \dots, y^{<t - 1>}) \]
This objective still tends to favor short sentences: every additional word adds another negative log-probability term.
So you can normalize it (\(\alpha\) is a hyperparameter)
\[\argmax_{y} \frac 1 {T_y^{\alpha}} \sum_{t = 1}^{T_y} \log P(y^{<t>}|x, y^{<1>}, \dots, y^{<t - 1>}) \]
Beam search discussion
- larger \(B\) : better results, but slower and more memory-hungry
- smaller \(B\) : worse results, but faster
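A minimal sketch of the normalized objective; the probabilities \(0.2\) and \(0.5\) below are made up just to show the effect.

```python
import math

def normalized_score(step_log_probs, alpha=0.7):
    """Length-normalized objective: (1 / T_y^alpha) * sum of per-step
    log-probabilities. alpha = 1 is full length normalization, alpha = 0
    is none; values like 0.7 sit in between."""
    return sum(step_log_probs) / (len(step_log_probs) ** alpha)

# An unlikely one-word sentence vs. a plausible four-word sentence.
short = [math.log(0.2)]
long = [math.log(0.5)] * 4
```

Without normalization, `sum(short) > sum(long)`, so the search prefers the short sentence; dividing by \(T_y^{\alpha}\) flips the preference.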
Error Analysis in Beam Search
Let \(y^*\) be a high-quality human translation and \(\hat y\) the algorithm's output; score both with the same RNN.
- \(P(y^* | x) > P(\hat y | x)\) : beam search is at fault (the model prefers \(y^*\), but the search failed to find it)
- \(P(y^* | x) \le P(\hat y | x)\) : the RNN model is at fault (it assigns the worse translation a higher probability)
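This bookkeeping is easy to automate over a dev set; here `examples` holds hypothetical \((\log P(y^*|x), \log P(\hat y|x))\) pairs, both scored by the same RNN.

```python
def attribute_errors(examples):
    """Fraction of errors due to beam search (the RNN scored y* higher but
    the search missed it) vs. due to the RNN model itself."""
    search = sum(1 for logp_ystar, logp_yhat in examples
                 if logp_ystar > logp_yhat)
    return {"beam search": search / len(examples),
            "RNN": 1 - search / len(examples)}
```

Whichever fraction dominates tells you where to spend effort: a larger beam width, or a better model.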
Bleu(bilingual evaluation understudy) Score
Given one or more good human reference translations, Bleu scores how closely the machine output matches them.
\[p_n = \frac{\sum_{\text{n-grams} \in \hat y} \text{Count}_{\text{clip}}(\text{n-gram})} {\sum_{\text{n-grams} \in \hat y} \text{Count}(\text{n-gram})} \]
where \(\text{Count}_{\text{clip}}\) caps an n-gram's count at the maximum number of times it appears in any single reference.
Bleu details
combine the precisions as \(\text{BP} \cdot \exp(\frac{1}{4} \sum_{n = 1}^4 \log p_n)\)
BP = brevity penalty
\[BP = \begin{cases} 1 & \text{if~~MT\_output\_length > reference\_output\_length}\\ \exp(1 - \text{reference\_output\_length / MT\_output\_length}) & \text{otherwise} \end{cases} \]
It penalizes translations that are too short.
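A sketch of the whole metric, assuming whitespace-tokenized sentences; this simplified version takes the closest reference length for BP and returns \(0\) when any \(p_n\) is zero instead of smoothing.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref, n).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

def bleu(candidate, references, N=4):
    """Combined score: BP * exp((1/N) * sum of log p_n)."""
    ps = [modified_precision(candidate, references, n) for n in range(1, N + 1)]
    if min(ps) == 0:
        return 0.0
    ref_len = min((len(r) for r in references),
                  key=lambda l: abs(l - len(candidate)))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in ps) / N)
```

The clipping stops degenerate outputs like "the the the the the the the" from scoring a perfect unigram precision.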
Attention Model Intuition
It is hard for an encoder-decoder network to memorize a whole long sentence in one fixed-length vector, so translation quality drops as sentences get longer.
Instead, compute attention weights so that each output word is predicted from the most relevant part of the input context.
Attention Model
Use a BiRNN or BiLSTM.
\[\begin{aligned} a^{<t'>} &= (\overrightarrow a^{<t'>}, \overleftarrow a^{<t'>})\\ \sum_{t'} \alpha^{<t, t'>} &= 1\\ c^{<t>} &= \sum_{t'} \alpha^{<t, t'>} a^{<t'>} \end{aligned} \]
Computing attention
\[\begin{aligned} \alpha^{<t, t'>} &= \text{amount of "attention" } y^{<t>} \text{ should pay to } a^{<t'>}\\ &= \frac{\exp(e^{<t, t'>})}{\sum_{t'' = 1}^{T_x} \exp(e^{<t, t''>})} \end{aligned} \]
The scores \(e^{<t, t'>}\) are produced by a very small network (e.g. one hidden layer) whose inputs are the previous decoder state \(s^{<t - 1>}\) and \(a^{<t'>}\); it is trained jointly with everything else.
The complexity is \(\mathcal O(T_x T_y)\) (quadratic cost), which is large but acceptable for sentence-length sequences.
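Computing the weights for one output step \(t\) is just a softmax over the alignment scores \(e^{<t, t'>}\), followed by a weighted sum of the encoder activations (a plain-Python sketch; in practice this is vectorized).

```python
import math

def attention_weights(e_row):
    """Softmax over the alignment scores e^{<t,t'>} for one output step t."""
    m = max(e_row)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in e_row]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(alphas, a):
    """c^{<t>} = sum over t' of alpha^{<t,t'>} * a^{<t'>}; `a` is the list
    of bidirectional encoder activations, one vector per input step."""
    dim = len(a[0])
    return [sum(al * vec[j] for al, vec in zip(alphas, a)) for j in range(dim)]
```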
Speech Recognition - Audio Data
Speech recognition
\(x(\text{audio clip}) \to y(\text{transcript})\)
Attention model for speech recognition
generate character by character
CTC cost for speech recognition
CTC(Connectionist temporal classification)
"ttt_h_eee___ ____qqq\(\dots\)" \(\rightarrow\) "the quick brown fox"
Basic rule: collapse repeated characters not separated by "blank"
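The collapse rule is simple to implement (using `_` for the blank symbol):

```python
def ctc_collapse(path, blank="_"):
    """Collapse repeated characters not separated by the blank, then drop
    blanks: a character is emitted only when it differs from the previous
    frame's symbol."""
    out = []
    prev = None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)
```

Repeated characters survive only when a blank sits between them, which is how CTC can still output double letters like the "ll" in "hello".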
Trigger Word Detection
In the labels, output \(0\) at every time step except just after the trigger word has been said, where the targets are set to \(1\) (often a short run of \(1\)s rather than a single one).
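A sketch of how such targets might be constructed; the run length of 50 ones is an illustrative choice, not a value from the notes.

```python
def trigger_labels(num_steps, trigger_end, ones=50):
    """Target sequence for trigger-word detection: all zeros, except a run
    of 1s starting right after the trigger word ends. A run (rather than a
    single 1) makes the positive targets less sparse; `ones=50` is a
    hypothetical choice."""
    y = [0] * num_steps
    for t in range(trigger_end + 1, min(trigger_end + 1 + ones, num_steps)):
        y[t] = 1
    return y
```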