
A Simple Way to Initialize Recurrent Networks of Rectified Linear Units

Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton

Google

arXiv:1504.00941v2 [cs.NE] 7 Apr 2015

Abstract

Learning long term dependencies in recurrent networks is difficult due to vanishing and exploding gradients. To overcome this difficulty, researchers have developed sophisticated optimization techniques and network architectures. In this paper, we propose a simpler solution that uses recurrent neural networks composed of rectified linear units. Key to our solution is the use of the identity matrix or its scaled version to initialize the recurrent weight matrix. We find that our solution is comparable to a standard implementation of LSTMs on our four benchmarks: two toy problems involving long-range temporal structures, a large language modeling problem and a benchmark speech recognition problem.

1 Introduction

Recurrent neural networks (RNNs) are very powerful dynamical systems and they are the natural way of using neural networks to map an input sequence to an output sequence, as in speech recognition and machine translation, or to predict the next term in a sequence, as in language modeling. However, training RNNs by using back-propagation through time [30] to compute error-derivatives can be difficult. Early attempts suffered from vanishing and exploding gradients [15], and this meant that they had great difficulty learning long-range dependencies. Many different methods have been proposed for overcoming this difficulty.

A method that has produced some impressive results [23, 24] is to abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimizer. This method operates on large mini-batches and is able to detect promising directions in the weight-space that have very small gradients but even smaller curvature. Subsequent work, however, suggested that similar results could be achieved by using stochastic gradient descent with momentum, provided the weights were initialized carefully [34] and large gradients were clipped [28]. Further developments of the HF approach look promising [35, 25] but are much harder to implement than popular simple methods such as stochastic gradient descent with momentum [34] or adaptive learning rates for each weight that depend on the history of its gradients [5, 14].

The most successful technique to date is the Long Short Term Memory (LSTM) Recurrent Neural Network, which uses stochastic gradient descent but changes the hidden units in such a way that the backpropagated gradients are much better behaved [16]. LSTM replaces logistic or tanh hidden units with "memory cells" that can store an analog value. Each memory cell has its own input and output gates that control when inputs are allowed to add to the stored analog value and when this value is allowed to influence the output. These gates are logistic units with their own learned weights on connections coming from the input and from the memory cells at the previous time-step. There is also a forget gate with learned weights that controls the rate at which the analog value stored in the memory cell decays. For periods when the input and output gates are off and the forget gate is not causing decay, a memory cell simply holds its value over time, so the gradient of the error w.r.t. its stored value stays constant when backpropagated over those periods.
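
To make the gating mechanism above concrete, here is a minimal NumPy sketch of a single step of a basic LSTM cell. It is our own illustration, not the implementation benchmarked in this paper; the weight names (W_i, W_f, W_o, W_c) and the omission of peephole connections are simplifying assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W_i, W_f, W_o, W_c, b_i, b_f, b_o, b_c):
        """One step of a basic LSTM cell (illustrative sketch)."""
        z = np.concatenate([x, h_prev])   # current input and previous output
        i = sigmoid(W_i @ z + b_i)        # input gate: when inputs may add to the stored value
        f = sigmoid(W_f @ z + b_f)        # forget gate: controls how fast the stored value decays
        o = sigmoid(W_o @ z + b_o)        # output gate: when the stored value may influence the output
        g = np.tanh(W_c @ z + b_c)        # candidate value proposed by the current input
        c = f * c_prev + i * g            # with f = 1 and i = 0 the stored value is carried unchanged
        h = o * np.tanh(c)                # exposed output
        return h, c

When the forget gate stays at 1 and the input and output gates stay at 0, the stored value c, and the gradient with respect to it, is carried across time-steps unchanged, which is exactly the behavior described in the paragraph above.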

The first major success of LSTMs was for the task of unconstrained handwriting recognition [12]. Since then, they have achieved impressive results on many other tasks including speech recognition [13, 10], handwriting generation [8], sequence to sequence mapping [36], machine translation [22, 1], image captioning [38, 18], parsing [37] and predicting the outputs of simple computer programs [39].

The impressive results achieved using LSTMs make it important to discover which aspects of the rather complicated architecture are crucial for its success. It seems unlikely that Hochreiter and Schmidhuber's [16] initial design combined with the subsequent introduction of forget gates [6, 7] is the optimal design: at the time, the important issue was to find any scheme that could learn long-range dependencies rather than to find the minimal or optimal scheme. One aim of this paper is to cast light on what aspects of the design are responsible for the success of LSTMs.

Recent research on deep feedforward networks has also produced some impressive results [19, 3] and there is now a consensus that for deep networks, rectified linear units (ReLUs) are easier to train than the logistic or tanh units that were used for many years [27, 40]. At first sight, ReLUs seem inappropriate for RNNs because they can have very large outputs, so they might be expected to be far more likely to explode than units with bounded values. A second aim of this paper is to explore whether ReLUs can be made to work well in RNNs and whether the ease of optimizing them in feedforward nets transfers to RNNs.

2 The initialization trick

In this paper, we demonstrate that, with the right initialization of the weights, RNNs composed of rectified linear units are relatively easy to train and are good at modeling long-range dependencies. The RNNs are trained by using backpropagation through time to get error-derivatives for the weights and by updating the weights after each small mini-batch of sequences. Their performance on test data is comparable with LSTMs, both for toy problems involving very long-range temporal structures and for real tasks like predicting the next word in a very large corpus of text.

We initialize the recurrent weight matrix to be the identity matrix and the biases to be zero. This means that each new hidden state vector is obtained by simply copying the previous hidden vector, then adding on the effect of the current inputs and replacing all negative states by zero. In the absence of input, an RNN that is composed of ReLUs and initialized with the identity matrix (which we call an IRNN) just stays in the same state indefinitely. The identity initialization has the very desirable property that, when the error derivatives for the hidden units are backpropagated through time, they remain constant provided no extra error-derivatives are added. This is the same behavior as LSTMs when their forget gates are set so that there is no decay, and it makes it easy to learn very long-range temporal dependencies.
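
The following is a minimal NumPy sketch of this initialization and of one forward step. It is our own illustration of the idea described above, not the authors' code; in particular, the standard deviation of 0.001 for the input-to-hidden weights is an illustrative choice.

    import numpy as np

    def init_irnn(n_input, n_hidden, scale=1.0, seed=0):
        """Identity (or scaled-identity) initialization for a ReLU RNN (IRNN)."""
        rng = np.random.default_rng(seed)
        W_hh = scale * np.eye(n_hidden)                           # recurrent weights: (scaled) identity
        W_xh = rng.normal(0.0, 0.001, size=(n_hidden, n_input))   # small random input weights (illustrative)
        b_h = np.zeros(n_hidden)                                  # biases initialized to zero
        return W_hh, W_xh, b_h

    def irnn_step(x, h_prev, W_hh, W_xh, b_h):
        """h_t = ReLU(W_hh h_{t-1} + W_xh x_t + b_h): copy the previous state,
        add the effect of the current input, and set negative states to zero."""
        return np.maximum(0.0, W_hh @ h_prev + W_xh @ x + b_h)

With W_hh equal to the identity and no input, a non-negative hidden state is copied forward unchanged at every step, which is the constant-memory behavior compared above to an LSTM whose forget gate allows no decay.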

We also find that for tasks that exhibit less long-range dependencies, scaling the identity matrix by a small scalar is an effective mechanism to forget long-range effects. This is the same behavior as LSTMs when their forget gates are set so that the memory decays fast.
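
As a quick check of this analogy (our own observation, not a claim from the paper): with the recurrent matrix initialized to aI for some 0 < a < 1, zero biases and no input, a non-negative hidden state evolves as h_t = ReLU(a I h_{t-1}) = a^t h_0, so the stored information decays geometrically at rate a, much like an LSTM whose forget gate is held at the constant value a.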

Our initialization scheme bears some resemblance to the idea of Mikolov et al. [26], where a part of the weight matrix is fixed to be the identity or close to the identity. The main difference of their work to ours is the fact that our network uses rectified linear units and the identity matrix is only used for initialization. A scaled identity initialization was also proposed in Socher et al. [32] in the context of tree-structured (recursive) networks. Our work is also related to the work of Saxe et al. [31], who study the use of orthogonal matrices as initialization in deep networks.

3 Overview of the experiments

In the first toy problem, the "adding problem", the input at each timestep consists of two units: the first input unit has a real value and the second input unit has a value of 0 or 1, as shown in the figure. The task is to report the sum of the two real values that are marked by having a 1 as the second input [16, 15, 24]. IRNNs can learn to handle sequences with a length of 300, which is a challenging regime for other algorithms.
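
As a concrete illustration of this task, the sketch below (ours; the exact sampling details used in the paper may differ, e.g. the range of the random values) builds one training example:

    import numpy as np

    def adding_problem_example(T=300, seed=0):
        """One sequence for the adding problem: at each timestep the first input
        is a real value and the second is a 0/1 marker; the target is the sum of
        the two real values whose marker is 1."""
        rng = np.random.default_rng(seed)
        values = rng.uniform(0.0, 1.0, size=T)         # first input unit (assumed uniform in [0, 1])
        markers = np.zeros(T)
        i, j = rng.choice(T, size=2, replace=False)    # positions of the two marked values
        markers[i] = markers[j] = 1.0
        inputs = np.stack([values, markers], axis=1)   # sequence of (value, marker) pairs, shape (T, 2)
        target = values[i] + values[j]
        return inputs, target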
