RNN (recurrent neural network), Part 2: Implementing an RNN by Hand


In the previous post, RNN (recurrent neural network), Part 1: Fundamentals, we went over the structure of an RNN and how it runs, so you should already have a fairly clear picture of both. This post takes on a harder task: writing an RNN by hand. After working through the implementation yourself, you will understand RNNs much more deeply. The post has two parts: first a derivation of backpropagation for the RNN, and then a Python implementation.


Note: the code in this post is taken almost entirely from the assignments of Andrew Ng's deep learning course.

1. Forward propagation in an RNN

First, here is the figure from Ng's Sequence Models course showing a single cell:

This is a single cell; chaining the cells together gives the full forward propagation (figure also from Ng's Sequence Models course):

Figure 2: RNN structure

The forward-propagation equations are therefore:





a^{<t>} = \tanh(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b_a)

\widehat{y}^{<t>} = \mathrm{softmax}(W_{ya}a^{<t>} + b_y)
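As a quick numerical preview of these two equations, here is a minimal single-step sketch in numpy (sizes chosen arbitrarily; the full, reusable implementation comes in section 5):

import numpy as np

n_x, n_a, n_y = 27, 50, 27                     # input dim, hidden units, output dim (e.g. vocabulary size)
x_t = np.zeros((n_x, 1)); x_t[3] = 1           # one-hot input at time t
a_prev = np.zeros((n_a, 1))                    # previous hidden state a<t-1>
Wax = np.random.randn(n_a, n_x) * 0.01
Waa = np.random.randn(n_a, n_a) * 0.01
Wya = np.random.randn(n_y, n_a) * 0.01
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

a_t = np.tanh(Wax @ x_t + Waa @ a_prev + ba)   # a<t> = tanh(Wax x<t> + Waa a<t-1> + ba)
z_t = Wya @ a_t + by
y_hat = np.exp(z_t) / np.exp(z_t).sum()        # softmax(Wya a<t> + by)
print(a_t.shape, y_hat.shape)                  # (50, 1) (27, 1); y_hat is a probability distribution over the 27 characters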





2. The loss function

Every machine learning task needs an objective, and that objective is the loss function; for an RNN we use the cross-entropy loss. I will go through the RNN loss in some detail so that everything is clear. As the running example we use Ng's assignment of generating dinosaur names, which boils down to generating a word one character at a time: given the first k characters, predict the most plausible (k+1)-th character. Suppose the training set contains n samples (sequences); one sample is one word (a dinosaur name), and it can be viewed as a sequence because the characters are ordered. For a single sequence the RNN has the structure shown in Figure 2, and for a single cell the loss is:





L^{(i)<t>}(\widehat{y}^{(i)<t>}, y^{(i)<t>}) = -\sum_{k=1}^{|C|} y_k^{(i)<t>} \log \widehat{y}_k^{(i)<t>}
Here C is the set of classes and |C| its size, which is usually the vocabulary size. y^{<t>} is the ground-truth label, normally a one-hot vector whose length equals the vocabulary size, and \widehat{y}^{<t>} is also a vector of that length: the probability distribution output by the softmax layer. Inside the sum the two vectors are multiplied element-wise, but because y^{<t>} is one-hot, only the single entry that equals 1 contributes; every other term is 0.
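As a quick check of this "only one term survives" observation, here is a minimal numpy sketch (vocabulary size and values chosen arbitrarily):

import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])       # softmax output, \widehat{y}^{<t>}
y     = np.array([0.0, 1.0, 0.0])       # one-hot ground truth, y^{<t>}
loss_full = -np.sum(y * np.log(y_hat))  # the full cross-entropy sum
loss_pick = -np.log(y_hat[1])           # just picking out the log-probability of the true class
print(loss_full, loss_pick)             # both print the same value, about 0.3567

The implementation in section 5.2 uses exactly the second, "pick one entry" form.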

Therefore, for one sequence the loss is:





\begin{aligned} L(\widehat{y}^{(i)}, y^{(i)}) &= \sum_{t=1}^{T_x} L^{(i)<t>}(\widehat{y}^{(i)<t>}, y^{(i)<t>}) \\ &= -\sum_{t=1}^{T_x}\sum_{k=1}^{|C|} y_k^{(i)<t>} \log \widehat{y}_k^{(i)<t>} \end{aligned}
Here T_x is the length of the sequence. Over the whole training set, the loss is:





\begin{aligned} L(\widehat{y}, y) &= \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T_x} L^{(i)<t>}(\widehat{y}^{(i)<t>}, y^{(i)<t>}) \\ &= -\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T_x}\sum_{k=1}^{|C|} y_k^{(i)<t>} \log \widehat{y}_k^{(i)<t>} \end{aligned}
We take the mean so that the loss describes how well the network fits the data in a way that is comparable across datasets. For example, if one dataset has 100,000 samples and another has 1,000, then without averaging the network trained on 100,000 samples could end up with a larger loss even if it actually fits its data better.

3. Backward propagation (back-propagation through time, BPTT)

Strictly speaking, backprop in an RNN is called back-propagation through time (BPTT). It is still ordinary backpropagation, just with a more vivid name. Our goal is to compute the gradients (derivatives) of the loss with respect to W_{ax}, W_{aa}, W_{ya}, b_y and b_a. I have to show the figure once more, because we really need it here.



Before differentiating with respect to those parameters, let us first work out the derivative of tanh(x), because it is needed in many places below. tanh(x) is defined as:





\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}
Its derivative is therefore:





\begin{aligned} \tanh'(x) &= \frac{2e^{-2x}\,(1 + e^{-2x}) - (1 - e^{-2x})\cdot(-2e^{-2x})}{(1 + e^{-2x})^{2}} \\ &= \frac{4e^{-2x}}{(1 + e^{-2x})^{2}} \\ &= 1 - \left(\frac{1 - e^{-2x}}{1 + e^{-2x}}\right)^{2} \\ &= 1 - \tanh^2(x) \end{aligned}
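A two-line numerical sanity check of this identity, using a central finite difference at an arbitrary point:

import numpy as np

x, eps = 0.37, 1e-6
numeric  = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)   # finite-difference derivative
analytic = 1 - np.tanh(x)**2                                   # the identity derived above
print(abs(numeric - analytic) < 1e-9)                          # True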
























By the chain rule:





\begin{aligned} \frac{\partial L}{\partial W_{ax}} &= \frac{\partial L}{\partial a^{<t>}}\cdot \frac{\partial a^{<t>}}{\partial W_{ax}} \\ &= \frac{\partial L}{\partial a^{<t>}}\cdot \tanh'(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b_a)\cdot x^{<t>} \\ &= \frac{\partial L}{\partial a^{<t>}}\cdot \left(1 - \tanh^2(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b_a)\right)\cdot x^{<t>} \end{aligned}

\begin{aligned} \frac{\partial L}{\partial W_{aa}} &= \frac{\partial L}{\partial a^{<t>}}\cdot \frac{\partial a^{<t>}}{\partial W_{aa}} \\ &= \frac{\partial L}{\partial a^{<t>}}\cdot \tanh'(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b_a)\cdot a^{<t-1>} \\ &= \frac{\partial L}{\partial a^{<t>}}\cdot \left(1 - \tanh^2(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b_a)\right)\cdot a^{<t-1>} \end{aligned}

\begin{aligned} \frac{\partial L}{\partial b_{a}} &= \frac{\partial L}{\partial a^{<t>}}\cdot \frac{\partial a^{<t>}}{\partial b_{a}} \\ &= \frac{\partial L}{\partial a^{<t>}}\cdot \tanh'(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b_a) \\ &= \frac{\partial L}{\partial a^{<t>}}\cdot \left(1 - \tanh^2(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b_a)\right) \end{aligned}
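These formulas are easy to verify numerically. Here is a minimal sketch (sizes and values arbitrary) that checks one entry of the dWax formula against a finite difference, using the surrogate loss L = sum(da * a) so that the upstream gradient dL/da<t> is simply da:

import numpy as np

np.random.seed(0)
n_a, n_x = 4, 3
Wax = np.random.randn(n_a, n_x); Waa = np.random.randn(n_a, n_a); ba = np.random.randn(n_a, 1)
x   = np.random.randn(n_x, 1);   a_prev = np.random.randn(n_a, 1)
da  = np.random.randn(n_a, 1)                      # pretend upstream gradient dL/da<t>

a = np.tanh(Wax @ x + Waa @ a_prev + ba)
dWax = ((1 - a**2) * da) @ x.T                     # analytic gradient from the formula above

eps = 1e-6                                         # numerical gradient of Wax[0, 0]
Wax_p = Wax.copy(); Wax_p[0, 0] += eps
a_p = np.tanh(Wax_p @ x + Waa @ a_prev + ba)
numeric = np.sum(da * (a_p - a)) / eps
print(dWax[0, 0], numeric)                         # the two values agree up to finite-difference error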
























Next we need the gradients of W_{ya} and b_y, and we also need \frac{\partial L}{\partial a^{<t>}}. All of these involve differentiating the softmax, which is slightly more involved. In an RNN cell the prediction is:





\widehat{y}^{<t>} = \mathrm{softmax}(W_{ya}a^{<t>} + b_y)
To simplify the derivation, define Z^{<t>} = W_{ya}a^{<t>} + b_y, so that:





\begin{aligned} \widehat{y}^{<t>}_k &= \mathrm{softmax}(Z^{<t>})_k \\ &= \frac{e^{Z^{<t>}_k}}{\sum_{l=1}^{C} e^{Z^{<t>}_l}} \end{aligned}
Here k indexes the class that the sample is predicted to belong to, and C is the number of classes; for ordinary binary classification C = 2, while in NLP sequence tasks C is usually the vocabulary size.
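The implementation in section 5 calls a softmax helper that the original post does not show (it is presumably provided by the course's utility code). A minimal, numerically stable version consistent with this definition might look like:

import numpy as np

def softmax(z):
	# z: column vector of scores Z<t>, shape (n_y, 1); subtracting the max avoids overflow in exp
	e = np.exp(z - np.max(z))
	return e / e.sum(axis=0)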


When differentiating the softmax there is one thing to be careful about. The softmax outputs the vector [\widehat{y}^{<t>}_1, \widehat{y}^{<t>}_2, \dots, \widehat{y}^{<t>}_C]. Suppose we want \frac{\partial L}{\partial Z^{<t>}_j}: every one of \widehat{y}^{<t>}_1, \dots, \widehat{y}^{<t>}_C contains Z^{<t>}_j, but only \widehat{y}^{<t>}_j contains it in both the numerator and the denominator, while the other C-1 outputs (\widehat{y}^{<t>}_k with k \neq j) contain it only in the denominator. So the derivative has to be taken separately for these two cases.

Therefore, for a single cell (think of it as the last cell, which may make it easier to picture), we split the loss into two parts:





\begin{aligned} L(\widehat{y}^{<t>}, y^{<t>}) &= \frac{1}{n}\sum_{i=1}^{n} L^{(i)<t>}(\widehat{y}^{(i)<t>}, y^{(i)<t>}) \\ &= -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{|C|} y_k^{(i)<t>} \log \widehat{y}_k^{(i)<t>} \\ &= -\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{k=1,\,k\neq j}^{|C|} y_k^{(i)<t>} \log \widehat{y}_k^{(i)<t>} + y_j^{(i)<t>} \log \widehat{y}_j^{(i)<t>}\Big] \end{aligned}
Now we differentiate, starting with the first part, the terms with k \neq j:





\begin{aligned} \frac{\partial L}{\partial Z^{<t>}_j} &= -\sum_{k=1,\,k\neq j}^{|C|} y_{k}^{<t>}\cdot \frac{1}{\widehat{y}_{k}^{<t>}}\cdot \frac{\partial \widehat{y}^{<t>}_k}{\partial Z^{<t>}_j} \\ &= -\sum_{k=1,\,k\neq j}^{|C|} y_{k}^{<t>}\cdot \frac{1}{\widehat{y}_{k}^{<t>}}\cdot e^{Z_{k}^{<t>}}\cdot \frac{-1}{\big(\sum_{l=1}^{C} e^{Z_{l}^{<t>}}\big)^{2}}\cdot e^{Z_{j}^{<t>}} \\ &= -\sum_{k=1,\,k\neq j}^{|C|} y_{k}^{<t>}\cdot \frac{1}{\widehat{y}_{k}^{<t>}}\cdot(-1)\cdot\frac{e^{Z_{k}^{<t>}}}{\sum_{l=1}^{C} e^{Z_{l}^{<t>}}}\cdot \frac{e^{Z_{j}^{<t>}}}{\sum_{l=1}^{C} e^{Z_{l}^{<t>}}} \\ &= -\sum_{k=1,\,k\neq j}^{|C|} y_{k}^{<t>}\cdot \frac{1}{\widehat{y}_{k}^{<t>}}\cdot(-1)\cdot \widehat{y}_{k}^{<t>}\cdot \widehat{y}_{j}^{<t>} \\ &= \sum_{k=1,\,k\neq j}^{|C|} y_{k}^{<t>}\cdot \widehat{y}_{j}^{<t>} \end{aligned}
Next, the second part, the term with k = j:





\begin{aligned} \frac{\partial L^{(j)}}{\partial Z^{<t>}_j} &= -y_{j}^{<t>}\cdot \frac{1}{\widehat{y}_{j}^{<t>}}\cdot \frac{\partial \widehat{y}^{<t>}_j}{\partial Z^{<t>}_j} \\ &= -y_{j}^{<t>}\cdot \frac{1}{\widehat{y}_{j}^{<t>}}\cdot \frac{e^{Z_{j}^{<t>}}\cdot \sum_{l=1}^{C} e^{Z_{l}^{<t>}} - e^{Z_{j}^{<t>}}\cdot e^{Z_{j}^{<t>}}}{\big(\sum_{l=1}^{C} e^{Z_{l}^{<t>}}\big)^{2}} \\ &= -y_{j}^{<t>}\cdot \frac{1}{\widehat{y}_{j}^{<t>}}\cdot \frac{e^{Z_{j}^{<t>}}\cdot \big(\sum_{l=1}^{C} e^{Z_{l}^{<t>}} - e^{Z_{j}^{<t>}}\big)}{\big(\sum_{l=1}^{C} e^{Z_{l}^{<t>}}\big)^{2}} \\ &= -y_{j}^{<t>}\cdot \frac{1}{\widehat{y}_{j}^{<t>}}\cdot\frac{e^{Z_{j}^{<t>}}}{\sum_{l=1}^{C} e^{Z_{l}^{<t>}}}\cdot \Big(1 - \frac{e^{Z_{j}^{<t>}}}{\sum_{l=1}^{C} e^{Z_{l}^{<t>}}}\Big) \\ &= -y_{j}^{<t>}\cdot \frac{1}{\widehat{y}_{j}^{<t>}}\cdot \widehat{y}_{j}^{<t>}\cdot \big(1 - \widehat{y}_{j}^{<t>}\big) \\ &= -y_{j}^{<t>}\cdot \big(1 - \widehat{y}_{j}^{<t>}\big) \end{aligned}
Now add the two parts together:





\begin{aligned} \frac{\partial L}{\partial Z^{<t>}_j} &= \sum_{k=1,\,k\neq j}^{|C|} y_{k}^{<t>}\cdot \widehat{y}_{j}^{<t>} + \big[-y_{j}^{<t>}\cdot (1 - \widehat{y}_{j}^{<t>})\big] \\ &= \sum_{k=1,\,k\neq j}^{|C|} y_{k}^{<t>}\cdot \widehat{y}_{j}^{<t>} + y_{j}^{<t>}\cdot \widehat{y}_{j}^{<t>} - y_{j}^{<t>} \\ &= \sum_{k=1}^{|C|} y_{k}^{<t>}\cdot \widehat{y}_{j}^{<t>} - y_{j}^{<t>} \\ &= \widehat{y}_{j}^{<t>}\cdot \sum_{k=1}^{|C|} y_{k}^{<t>} - y_{j}^{<t>} \\ &= \widehat{y}_{j}^{<t>} - y_{j}^{<t>} \end{aligned}
A word on how the second-to-last step becomes the last one: \sum_{k=1}^{|C|} y_{k}^{<t>} equals 1 because y_{k}^{<t>} are the entries of the ground-truth vector, and among the C entries exactly one class is 1 while all the others are 0 (think one-hot), so they sum to 1.
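This \widehat{y} - y result is easy to verify numerically. A minimal sketch (scores and true class chosen arbitrarily):

import numpy as np

Z = np.array([1.0, -0.5, 2.0])                 # scores Z<t>
y = np.array([0.0, 0.0, 1.0])                  # one-hot ground truth
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
loss = lambda z: -np.sum(y * np.log(softmax(z)))

analytic = softmax(Z) - y                      # the dL/dZ formula derived above
eps = 1e-6
numeric = np.array([(loss(Z + eps * np.eye(3)[j]) - loss(Z - eps * np.eye(3)[j])) / (2 * eps)
                    for j in range(3)])
print(np.max(np.abs(analytic - numeric)))      # prints a tiny number (just the finite-difference error)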

With this result it is easy to obtain \frac{\partial L}{\partial a^{<t>}}. Since we defined Z^{<t>} = W_{ya}a^{<t>} + b_y, we have:





\begin{aligned} \frac{\partial L}{\partial a^{<t>}} &= \frac{\partial L}{\partial Z^{<t>}}\cdot \frac{\partial Z^{<t>}}{\partial a^{<t>}} \\ &= (\widehat{y}^{<t>} - y^{<t>})\cdot W_{ya} \end{aligned}

\begin{aligned} \frac{\partial L}{\partial W_{ya}} &= \frac{\partial L}{\partial Z^{<t>}}\cdot \frac{\partial Z^{<t>}}{\partial W_{ya}} \\ &= (\widehat{y}^{<t>} - y^{<t>})\cdot a^{<t>} \end{aligned}

That completes the BPTT derivation. The factor \frac{1}{n} should normally be included, for the reason discussed earlier, but since it is just a constant it has no effect on the direction of the derivatives; we simply add it in the implementation.

4. Gradient clipping

Everything you need to know about the RNN itself has now been covered, but training an RNN is not always a pleasant experience. We would like the loss to fall steadily with the number of iterations, like the blue curve, yet in practice it often looks like the green curve, and sometimes the loss even blows up to NaN. (Figure from Hung-yi Lee's RNN slides.)

Gradients explode or vanish because the RNN loss surface (loss as a function of the parameters) often contains steep cliffs like the one shown below (figure also from Hung-yi Lee's RNN slides). In the lucky case an update lands on the flat plateau and the loss keeps decreasing slowly; in an unlucky case the parameters (w1, w2) jump straight up onto the cliff, the loss explodes, and if they then jump back down the loss collapses again, which is exactly the green curve above. Worst of all, sometimes an update lands right on the cliff wall and the next step flings the parameters off entirely, and the loss goes straight to NaN. At that point the RNN cannot be trained any further.

Mikolov (the inventor of word2vec) was the first to come up with a fix: gradient clipping. The idea is remarkably plain and simple, but coming up with it and having it work is already quite an achievement.

The idea of gradient clipping: if a gradient is larger than some threshold, or smaller than the negative of that threshold, just set it to the threshold (or its negative).

For example, with a threshold of 10, any gradient component larger than 10 is set to 10, and any component smaller than -10 is set to -10. The benefit is obvious: going back to the figure, the trajectory that flew off the surface did so because an overly large gradient blew up the loss after the update; with clipping, that trajectory becomes the dashed one, stays on the surface, and training can continue.

5. Implementing the RNN


Note: the implementation follows Ng's deep learning course assignment. One point to be aware of is that our sentences (words) have different lengths, so there are two ways to handle them:

1. Pad (align) the sentences to a fixed length of your choosing (for example the length of the longest word, filling the rest with a padding token; in Keras, pad_sequences() can be used to truncate or pad). When computing the loss you then must not count the padded positions at each timestep, which is usually done with a mask matrix that marks which positions to ignore (a minimal masking sketch follows right after this note). Because it allows parallel matrix computation, this is the approach most implementations use today.

2. Do not pad. Instead run the forward and backward passes one sample at a time, i.e. loop over the examples with a for loop, which costs somewhat more computation. Ng's assignment takes this route.

Since this post sticks with Ng's assignment scenario of generating dinosaur names, we use the second method. If I find the time I will also implement the first one, since it is the more mainstream choice.
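For reference, a minimal sketch of what the masked loss in method 1 could look like (the shapes and the convention that index 0 marks padding are my own assumptions here, not part of the original assignment):

import numpy as np

T, C = 6, 27
y_hat = np.full((T, C), 1.0 / C)               # dummy predicted distributions for one padded sequence
y     = np.array([3, 7, 1, 0, 0, 0])           # integer labels: real length 3, then padding (0)
mask  = (y != 0).astype(float)                 # 1 for real timesteps, 0 for padded ones
loss  = -np.sum(mask * np.log(y_hat[np.arange(T), y])) / mask.sum()
print(loss)                                    # cross-entropy averaged over the real timesteps only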

Now for the implementation itself. The hard part is backprop, and there is a trick that follows from the chain rule. It is easier to show with an example than to describe: suppose y = t^2 and t = 4x + 1, and we want \frac{\partial y}{\partial x}. By the chain rule, \frac{\partial y}{\partial x} = \frac{\partial y}{\partial t} \cdot \frac{\partial t}{\partial x}. If \frac{\partial y}{\partial t} is already given, we do not care how many stages came before it; we only need \frac{\partial t}{\partial x}. So when implementing backprop, each function simply takes the upstream gradient as an argument (as in the figure below: once da^{<2>}, the two red arrows, is given, we can ignore everything downstream of it and just pass da^{<2>} in as a parameter), and each cell (layer) only has to implement the derivatives with respect to its own parameters.
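In code form, the same toy example (a minimal sketch):

x = 2.0
t = 4 * x + 1            # t = 4x + 1
dy_dt = 2 * t            # suppose the upstream gradient dy/dt = 2t is handed to us
dt_dx = 4                # this stage only knows its own local derivative
dy_dx = dy_dt * dt_dx    # chain rule: 2t * 4 = 8(4x + 1), which is 72 at x = 2
print(dy_dx)             # 72.0

The rnn_step_backward function below works the same way: it receives the upstream gradients (dy and da_next) and only computes the local derivatives of its own cell.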



The basic RNN implementation has four parts: parameter initialization, forward propagation, backward propagation, and the parameter update (plus gradient clipping).

5.1 Parameter initialization

import numpy as np

def initialize_parameters(n_a, n_x, n_y):
	"""
	Initialize parameters with small random values
	Returns:
	parameters -- python dictionary containing:
				  Wax -- Weight matrix multiplying the input, of shape (n_a, n_x)
				  Waa -- Weight matrix multiplying the hidden state, of shape (n_a, n_a)
				  Wya -- Weight matrix relating the hidden-state to the output, of shape (n_y, n_a)
				  ba -- Bias, of shape (n_a, 1)
				  by -- Bias relating the hidden-state to the output, of shape (n_y, 1)
	"""
	np.random.seed(1)
	Wax = np.random.randn(n_a, n_x) * 0.01  # input to hidden
	Waa = np.random.randn(n_a, n_a) * 0.01  # hidden to hidden
	Wya = np.random.randn(n_y, n_a) * 0.01  # hidden to output
	ba = np.zeros((n_a, 1))  # hidden bias
	by = np.zeros((n_y, 1))  # output bias
	parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}

	return parameters

5.2 Forward propagation

A single cell:

def rnn_step_forward(xt, a_prev, parameters):
	"""
	Implements a single forward step of the RNN-cell that uses a tanh activation function
	Arguments:
	xt -- the input data at timestep "t", of shape (n_x, 1)
	a_prev -- Hidden state at timestep "t-1", of shape (n_a, 1)
	**here, n_x denotes the dimension of the word vector, n_a denotes the number of hidden units in an RNN cell
	parameters -- python dictionary containing:
				  Wax -- Weight matrix multiplying the input, of shape (n_a, n_x)
				  Waa -- Weight matrix multiplying the hidden state, of shape (n_a, n_a)
				  Wya -- Weight matrix relating the hidden-state to the output, of shape (n_y, n_a)
				  ba -- Bias, of shape (n_a, 1)
				  by -- Bias relating the hidden-state to the output, of shape (n_y, 1)
	Returns:
	a_next -- next hidden state, of shape (n_a, 1)
	yt_pred -- prediction at timestep "t", of shape (n_y, 1)
	"""
	# get parameters from "parameters"
	Wax = parameters["Wax"] #(n_a, n_x)
	Waa = parameters["Waa"] #(n_a, n_a)
	Wya = parameters["Wya"] #(n_y, n_a)
	ba = parameters["ba"]   #(n_a, 1)
	by = parameters["by"]   #(n_y, 1)

	a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
	yt_pred = softmax(np.dot(Wya, a_next) + by) #(n_y,1)
	
	return a_next, yt_pred

The full forward pass:

def rnn_forward(X, Y, a0, parameters, vocab_size=27):
	x, a, y_hat = {}, {}, {}
	a[-1] = np.copy(a0)
	# initialize your loss to 0
	loss = 0
	for t in range(len(X)):
		# Set x[t] to be the one-hot vector representation of the t'th character in X.
		# if X[t] == None, we just have x[t]=0. This is used to set the input for the first timestep to the zero vector.
		x[t] = np.zeros((vocab_size, 1))
		if (X[t] != None):
			x[t][X[t]] = 1
		# Run one step forward of the RNN
		a[t], y_hat[t] = rnn_step_forward(x[t], a[t - 1], parameters)  # a[t]: (n_a,1), y_hat[t]: (n_y,1)
		# Update the loss by subtracting the cross-entropy term of this time-step from it.
		# Since the true label is also one-hot, we only need to pick out the predicted probability
		# at the position where the label vector is 1.
		loss -= np.log(y_hat[t][Y[t], 0])

	cache = (y_hat, a, x)

	return loss, cache

5.3 Backward propagation

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):

	gradients['dWya'] += np.dot(dy, a.T)
	gradients['dby'] += dy
	Wya = parameters['Wya']
	Waa = parameters['Waa']
	da = np.dot(Wya.T, dy) + gradients['da_next']  # each cell has two upstream gradient paths: one through da_next, one through y_hat
	dtanh = (1 - a * a) * da  # backprop through tanh nonlinearity
	gradients['dba'] += dtanh
	gradients['dWax'] += np.dot(dtanh, x.T)
	gradients['dWaa'] += np.dot(dtanh, a_prev.T)
	gradients['da_next'] = np.dot(Waa.T, dtanh)

	return gradients


def rnn_backward(X, Y, parameters, cache):
	# Initialize gradients as an empty dictionary
	gradients = {}

	# Retrieve from cache and parameters
	(y_hat, a, x) = cache
	Waa, Wax, Wya, by, ba = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['ba']

	# each one should be initialized to zeros of the same dimension as its corresponding parameter
	gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
	gradients['dba'], gradients['dby'] = np.zeros_like(ba), np.zeros_like(by)
	gradients['da_next'] = np.zeros_like(a[0])

	# Backpropagate through time
	for t in reversed(range(len(X))):
		dy = np.copy(y_hat[t])
		dy[Y[t]] -= 1  # computes y_hat - y (prediction minus ground truth): the ground truth is one-hot with a single 1,
		# so we only subtract 1 at that position; subtracting 0 everywhere else changes nothing
		gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t - 1])

	return gradients, a

5.4 Parameter update

# gradient clipping
def clip(gradients, maxValue):
	'''
	Clips the gradients' values between minimum and maximum.
	Arguments:
	gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "dba", "dby"
	maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
	Returns:
	gradients -- a dictionary with the clipped gradients.
	'''
	dWaa, dWax, dWya, dba, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['dba'], gradients['dby']
	# clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, dba, dby].
	for gradient in [dWax, dWaa, dWya, dba, dby]:
		np.clip(gradient, -maxValue, maxValue, gradient)
	gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "dba": dba, "dby": dby}
	return gradients

# update the parameters with gradient descent
def update_parameters(parameters, gradients, lr):
	parameters['Wax'] += -lr * gradients['dWax']
	parameters['Waa'] += -lr * gradients['dWaa']
	parameters['Wya'] += -lr * gradients['dWya']
	parameters['ba']  += -lr * gradients['dba']
	parameters['by']  += -lr * gradients['dby']
	return parameters
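Putting the four pieces together, a single optimization step on one training example could look like the following sketch (this mirrors the optimize function in Ng's assignment; the function name, clipping threshold and learning rate here are my choices, not fixed by the post):

def optimize(X, Y, a_prev, parameters, learning_rate=0.01):
	# one step: forward pass, backward pass through time, clip, then gradient descent update
	loss, cache = rnn_forward(X, Y, a_prev, parameters)
	gradients, a = rnn_backward(X, Y, parameters, cache)
	gradients = clip(gradients, maxValue=5)
	parameters = update_parameters(parameters, gradients, learning_rate)
	return loss, gradients, a[len(X) - 1]  # also return the last hidden state to seed the next example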

That is essentially the whole RNN for the word-generation setting. In this task, at every time step we generate one character conditioned on the previous ones; the output of each step is a softmax vector (a probability distribution over the character set), and we sample the next character from it. Here is the sampling code:

def sample(parameters, char_to_ix, seed):
	"""
	Sample a sequence of characters according to a sequence of probability distributions output of the RNN
	Arguments:
	parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b.
	char_to_ix -- python dictionary mapping each character to an index.
	seed -- used for grading purposes. Do not worry about it.
	Returns:
	indices -- a list of length n containing the indices of the sampled characters.
	"""

	# Retrieve parameters and relevant shapes from "parameters" dictionary
	Waa, Wax, Wya, by, ba = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['ba']
	vocab_size = by.shape[0]
	n_a = Waa.shape[1]
	# Step 1: Create the one-hot vector x for the first character (initializing the sequence generation).
	x = np.zeros((vocab_size, 1))
	# Step 1': Initialize a_prev as zeros
	a_prev = np.zeros((n_a, 1))

	# Create an empty list of indices, this is the list which will contain the list of indices of the characters to generate
	indices = []

	# Idx is a flag to detect a newline character, we initialize it to -1
	idx = -1

	# Loop over time-steps t. At each time-step, sample a character from a probability distribution and append
	# its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well
	# trained model), which helps debugging and prevents entering an infinite loop.
	counter = 0
	newline_character = char_to_ix['\n']

	while (idx != newline_character and counter != 50):
		# Step 2: Forward propagate x using the equations (1), (2) and (3)
		a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + ba)
		z = np.dot(Wya, a) + by
		y = softmax(z)

		# for grading purposes
		np.random.seed(counter + seed)

		# Step 3: Sample the index of a character within the vocabulary from the probability distribution y
		idx = np.random.choice(vocab_size, p = y.ravel())  # equivalent to np.random.choice([0, 1, ..., vocab_size-1], p=y.ravel());
		# the first argument can be an array (sample from its elements) or an int a (sample from np.arange(a))
		# Append the index to "indices"
		indices.append(idx)
		# Step 4: Overwrite the input character as the one corresponding to the sampled index.
		# The character generated at this step becomes the input of the next time step.
		x = np.zeros((vocab_size, 1))
		x[idx] = 1

		# Update "a_prev" to be "a"
		a_prev = a
		# for grading purposes
		seed += 1
		counter += 1
	if (counter == 50):
		indices.append(char_to_ix['\n'])

	return indices
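The sampled indices can be turned back into a readable name with the inverse character map (ix_to_char here is the usual index-to-character dictionary from the assignment; building it is not shown in this post):

sampled_indices = sample(parameters, char_to_ix, seed=0)
name = ''.join(ix_to_char[ix] for ix in sampled_indices)
print(name)   # prints the sampled characters, ending with the newline character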

Full code and the dataset (dinos.txt):

https://github.com/tz28/deep-learning/blob/master/rnn.py

Dataset:

https://github.com/tz28/deep-learning/blob/master/dinos.txt

Postscript: because of my internship, this post took a full two months to finish. I hope that does not happen again.


