A Simple Way to Initialize Recurrent Networks of Rectified Linear Units

Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton

arXiv:1504.00941v2 [cs.NE] 7 Apr 2015
Abstract
Learning long term dependencies in recurrent networks is difficult due to vanishing and exploding gradients. To overcome this difficulty, researchers have developed sophisticated optimization techniques and network architectures. In this paper, we propose a simpler solution that uses recurrent neural networks composed of rectified linear units. Key to our solution is the use of the identity matrix or its scaled version to initialize the recurrent weight matrix. We find that our solution is comparable to a standard implementation of LSTMs on our four benchmarks: two toy problems involving long-range temporal structures, a large language modeling problem and a benchmark speech recognition problem.
1 Introduction
Recurrent neural networks (RNNs) are very powerful dynamical systems and they are the natural way of using neural networks to map an input sequence to an output sequence, as in speech recognition and machine translation, or to predict the next term in a sequence, as in language modeling. However, training RNNs by using back-propagation through time [30] to compute error-derivatives can be difficult. Early attempts suffered from vanishing and exploding gradients [15] and this meant that they had great difficulty learning long-term dependencies. Many different methods have been proposed for overcoming this difficulty.
A method that has produced some impressive results [23, 24] is to abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimization method. HF operates on large mini-batches and is able to detect promising directions in the weight-space that have very small gradients but even smaller curvature. Subsequent work, however, suggested that similar results could be achieved by using stochastic gradient descent with momentum provided the weights were initialized carefully [34] and large gradients were clipped [28]. Further developments of the HF approach look promising [35, 25] but are much harder to implement than popular simple methods such as stochastic gradient descent with momentum [34] or adaptive learning rates for each weight that depend on the history of its gradients [5, 14].
The most successful technique to date is the Long Short Term Memory (LSTM) Recurrent Neural Network, which uses stochastic gradient descent but changes the hidden units in such a way that the backpropagated gradients are much better behaved [16]. LSTM replaces logistic or tanh hidden units with "memory cells" that can store an analog value. Each memory cell has its own input and output gates that control when inputs are allowed to add to the stored analog value and when this value is allowed to influence the output. These gates are logistic units with their own learned weights on connections coming from the input and also from the memory cells at the previous time-step. There is also a forget gate with learned weights that controls the rate at which the analog value stored in the memory cell decays. For periods when the input and output gates are off and the forget gate is not causing decay, a memory cell simply holds its value over time, so the gradient of the error with respect to its stored value stays constant when backpropagated over those periods.
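As a point of reference for the gating mechanism described above, the following is a minimal NumPy sketch of one forward step of a standard LSTM cell; the function names, the concatenated-input parameterization and the packing order of the gate pre-activations are assumptions of this sketch, not details taken from [16]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell: input, forget and output gates
    control what is written to, kept in, and read from the memory cell c."""
    z = W @ np.concatenate([x, h_prev]) + b        # shape (4 * n_hidden,)
    i_g, f_g, o_g, g = np.split(z, 4)              # gate and candidate pre-activations
    i_g, f_g, o_g = sigmoid(i_g), sigmoid(f_g), sigmoid(o_g)
    c = f_g * c_prev + i_g * np.tanh(g)            # update the stored analog value
    h = o_g * np.tanh(c)                           # gated output of the cell
    return h, c
```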
The first major success of LSTMs was for the task of unconstrained handwriting recognition [12]. Since then, they have achieved impressive results on many other tasks including speech recognition [13, 10], handwriting generation [8], sequence to sequence mapping [36], machine translation [22, 1], image captioning [38, 18], parsing [37] and predicting the outputs of simple computer programs [39].
The impressive results achieved using LSTMs make it important to discover which aspects of the rather complicated architecture are actually needed for its success. It seems unlikely that Hochreiter and Schmidhuber's [16] initial design combined with the subsequent introduction of forget gates [6, 7] is the optimal design: at the time, the important issue was to find any scheme that could learn long-range dependencies rather than to find the minimal or optimal scheme. One aim of this paper is to cast light on what aspects of the design are responsible for the success of LSTMs.
Recent research on deep feedforward networks has also produced some impressive results [19, 3] and there is now a consensus that for deep networks, rectified linear units (ReLUs) are easier to train than the logistic or tanh units that were used for many years [27, 40]. At first sight, ReLUs seem inappropriate for RNNs because they can have very large outputs, so they might be expected to be far more likely to explode than units that have bounded values. A second aim of this paper is to explore whether ReLUs can be made to work well in RNNs and whether the ease of optimizing them in feedforward nets transfers to RNNs.
2 The initialization trick
In this paper, we demonstrate that, with the right initialization of the weights, RNNs composed of rectified linear units are relatively easy to train and are good at modeling long-range dependencies. The RNNs are trained by using backpropagation through time to get error-derivatives for the weights and by updating the weights after each small mini-batch of sequences. Their performance on test data is comparable with LSTMs, both for toy problems involving very long-range temporal structures and for real tasks like predicting the next word in a very large corpus of text.
We initialize the recurrent weight matrix to be the identity matrix and the biases to be zero. This means that each new hidden state vector is obtained by simply copying the previous hidden vector, then adding on the effect of the current inputs and replacing all negative states by zero. In the absence of input, an RNN that is composed of ReLUs and initialized with the identity matrix (which we call an IRNN) just stays in the same state indefinitely. The identity initialization has the very desirable property that when the error derivatives for the hidden units are backpropagated through time they remain constant provided no extra error-derivatives are added. This is the same behavior as LSTMs when their forget gates are set so that there is no decay, and it makes it easy to learn very long-range temporal dependencies.
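To make the initialization concrete, here is a minimal NumPy sketch of an IRNN as described above; the function names and the small scale used for the input weights are assumptions of this sketch rather than details taken from the paper:

```python
import numpy as np

def init_irnn(n_hidden, n_input, input_scale=0.001, seed=0):
    """Identity initialization for a ReLU RNN (IRNN): the recurrent matrix is
    the identity and the hidden biases are zero.  The small random scale for
    the input weights is an assumption of this sketch."""
    rng = np.random.RandomState(seed)
    W_hh = np.eye(n_hidden)                              # recurrent weights = identity
    W_xh = rng.randn(n_hidden, n_input) * input_scale    # small random input weights
    b_h = np.zeros(n_hidden)                             # zero biases
    return W_hh, W_xh, b_h

def irnn_step(h_prev, x, W_hh, W_xh, b_h):
    """One forward step: copy the previous hidden state, add the effect of the
    current input, then replace negative values with zero (ReLU)."""
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x + b_h)
```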
We also find that for tasks that exhibit fewer long-range dependencies, scaling the identity matrix by a small scalar is an effective mechanism to forget long-range effects. This is the same behavior as LSTMs when their forget gates are set so that the memory decays fast.
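The scaled variant only changes the recurrent matrix of the sketch above; the scale value shown here is illustrative, not a value prescribed by the paper:

```python
def init_scaled_irnn(n_hidden, n_input, scale=0.9, **kwargs):
    """Scaled-identity variant: multiplying the identity by a scalar below 1
    makes the copied part of the hidden state decay at each step."""
    W_hh, W_xh, b_h = init_irnn(n_hidden, n_input, **kwargs)
    return scale * W_hh, W_xh, b_h
```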
Our initialization scheme bears some resemblance to the idea of Mikolov et al. [26], where a part of the weight matrix is fixed to identity or close to identity. The main difference of their work to ours is the fact that our network uses rectified linear units and the identity matrix is only used for initialization. A scaled identity initialization was also proposed in Socher et al. [32] in the context of recursive neural networks. Our work is also related to the work of Saxe et al. [31], who study the use of orthogonal matrices as initialization in deep networks.
3 Overview of the experiments
In the first (adding) problem, the input at each time step consists of two units: the first input unit has a real value and the second input unit has a value of 0 or 1, as shown in the figure. The task is to report the sum of the two real values that are marked by having a 1 as the second input [16, 15, 24]. IRNNs can learn to handle sequences with a length of 300, which is a challenging regime for other algorithms.
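For clarity, a small data generator for this kind of task might look like the following; the [0, 1] range of the real values and the uniform choice of the two marked positions are assumptions of this sketch:

```python
import numpy as np

def adding_problem_batch(batch_size, seq_len, rng=None):
    """Generate one batch of the adding problem described above.  Each
    sequence has two input channels: a real value and a 0/1 marker that is 1
    at exactly two positions; the target is the sum of the two marked values."""
    rng = rng or np.random.RandomState(0)
    values = rng.uniform(0.0, 1.0, size=(batch_size, seq_len))
    markers = np.zeros((batch_size, seq_len))
    targets = np.zeros(batch_size)
    for i in range(batch_size):
        a, b = rng.choice(seq_len, size=2, replace=False)  # two marked positions
        markers[i, a] = markers[i, b] = 1.0
        targets[i] = values[i, a] + values[i, b]
    inputs = np.stack([values, markers], axis=-1)          # (batch, seq_len, 2)
    return inputs, targets
```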