Estimating the Support of a High-Dimensional Distribution

Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, Robert C. Williamson

Microsoft Research Ltd, 1 Guildhall Street, Cambridge CB2 3NH, UK
Microsoft Research, 1 Microsoft Way, Redmond, WA, USA
Royal Holloway, University of London, Egham, UK
Department of Engineering, Australian National University, Canberra 0200, Australia

27 November 1999

Technical Report MSR-TR-99-87
Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052

Abstract

Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S is bounded by some a priori specified ν between 0 and 1. We propose a method to approach this problem by trying to estimate a function f which is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a preliminary theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.

Keywords: Support Vector Machines, Kernel Methods, Density Estimation, Unsupervised Learning, Novelty Detection

1 Introduction

During recent years, a new set of kernel techniques for supervised learning has been developed (Vapnik, 1995; Schölkopf et al., 1999a). Specifically, support vector (SV) algorithms for pattern recognition and regression estimation have received considerable attention. There have been a few attempts to transfer the idea of using kernels to other domains of learning. The problems in that domain are, however, less precisely specified. Generally, they can be characterized as estimating functions of the data which tell you something interesting about the underlying distribution. For instance, kernel PCA can be characterized as computing functions which on the training data produce unit variance outputs while having minimum norm in feature space (Schölkopf et al., 1999b). Another kernel-based unsupervised learning technique, regularized principal manifolds (Smola et al., 1999), computes functions which give a mapping onto a lower-dimensional manifold minimizing a regularized quantization error. Clustering algorithms are further examples of unsupervised learning techniques which can be kernelized (Schölkopf et al., 1999b). An extreme point of view is that unsupervised learning is about estimating densities. Clearly, knowledge of the density of P would then allow us to solve whatever problem can be solved on the basis of the data.

The present work addresses an easier problem: it proposes an algorithm which computes a binary function which is supposed to capture regions in input space where the probability density lives (its support), i.e. a function such that most of the data will live in the region where the function is nonzero (Schölkopf et al., 1999). In doing so, it is in line with Vapnik's principle never to solve a problem which is more general than the one we actually need to solve. Moreover, it is applicable also in cases where the density of the data's distribution is not even well-defined. After a review of some previous work in Sec. 2, we propose SV algorithms for the above problem. Sec. 4 gives details on the implementation of the optimization procedure, followed by theoretical results characterizing the present approach. In Sec. 6, we apply the algorithm to artificial as well as real-world data. We conclude with a discussion.

2 Previous Work

Part of the motivation for the present work was a paper of Ben-David and Lindenbaum (1997). It turns out that there is a considerable amount of prior work in the statistical literature, and in this section we briefly review it. We do not attempt a detailed comparison of the proof techniques or of the specific results achieved, but confine ourselves to a summary of the methods. In order to summarize the methods, it is convenient to introduce the following definition of a (multi-dimensional) quantile function (introduced by Einmahl and Mason (1992)). Let C be a class of measurable subsets of X and let λ be a real-valued function defined on C. The quantile function with respect to (P, λ, C) is

U(α) = inf{λ(C) : P(C) ≥ α, C ∈ C},  0 < α ≤ 1.

If P_ℓ is the empirical distribution (P_ℓ(C) = (1/ℓ) Σ_i 1(x_i ∈ C)), the empirical quantile function is obtained by replacing P with P_ℓ. We denote by C(α) and C_ℓ(α) the (not necessarily unique) C ∈ C that attains the infimum (when it is achievable). The most common choice of λ is Lebesgue measure, in which case C(α) is the minimum volume set C ∈ C containing at least a fraction α of the probability mass. Estimators of the form C_ℓ(α) are called minimum volume estimators.
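In the simplest one-dimensional setting, the empirical quantile function can be computed exactly. The following sketch is not from the paper; it takes C to be the class of closed intervals on the real line and λ to be interval length, and finds the shortest interval containing at least a fraction α of the sample:

```python
import math

def empirical_quantile_interval(xs, alpha):
    """Empirical quantile function for C = closed intervals on R and
    lambda = interval length: the shortest interval containing at least
    a fraction alpha of the sample."""
    xs = sorted(xs)
    l = len(xs)
    kk = math.ceil(alpha * l)             # number of points the interval must cover
    best = None
    for i in range(l - kk + 1):           # candidate interval xs[i] .. xs[i+kk-1]
        length = xs[i + kk - 1] - xs[i]
        if best is None or length < best[0]:
            best = (length, xs[i], xs[i + kk - 1])
    return best                            # (lambda(C), left end, right end)

# The shortest interval covering 60% (3 of 5) of the points below is [4, 6].
print(empirical_quantile_interval([0, 4, 5, 6, 10], 0.6))
```

Note that such an exhaustive search is only feasible for very simple classes C; the algorithm proposed in this paper works with an implicitly defined class in a kernel feature space instead.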

Observe that for C being all Borel measurable sets, C(1) is the support of the density p corresponding to P, assuming it exists. (Note that C(1) is well defined even when p does not exist.) For smaller classes C, C(1) is the minimum volume C ∈ C containing the support of p.

The problem of estimating C(1) appears to have first been studied by Geffroy (1964). There have been a number of works studying a natural nonparametric estimator of C(1) (e.g. Chevalier (1976); Devroye and Wise (1980); Cuevas (1990); see (Gayraud, 1997) for further references). The nonparametric estimator is simply

Ĉ_ℓ = ∪_{i=1}^{ℓ} B(x_i, ε_ℓ)   (1)

where B(x, ε) is the ball of radius ε centered at x and ε_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (1) with respect to the symmetric difference between Ĉ_ℓ and C(1). Cuevas (1990) did the same, and Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1). The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) or the center of C(1). (See also (Korostelev and Tsybakov, 1993, Chapter 8).)

Estimating high probability regions (α ≠ 1). Turning to the case α < 1, it seems the first work was reported by Sager (1979) and then Hartigan (1987), who considered X = R² with C being the class of closed convex sets in X. (They actually considered density contour clusters; see below for a definition.) Nolan (1991) considered ellipsoids in higher dimensions. Tsybakov (1997) has studied an estimator based on piecewise polynomial approximation of C(α) and has shown it attains the asymptotically minimax rate for certain classes of densities. Polonik (1997) has studied the estimation of C(α) by the empirical minimizer C_ℓ(α). He derived asymptotic rates of convergence in terms of various measures of richness of C. He considers both VC classes and classes with a log covering number with bracketing of polynomial order.

His paper summarizes a number of other previous results. Polonik (1995b) has also studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α). Define the excess mass over C at level α as

E_C(α) = sup{H_α(C) : C ∈ C},

where H_α(C) = (P − αλ)(C) and λ again denotes Lebesgue measure. Replacing P by the empirical measure P_ℓ in these definitions yields their empirical counterparts E_{C,ℓ}(α) and H_{α,ℓ}(C). In other words, whilst the excess mass functional is clearly different from the quantile function U, it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, E_C is more closely related to the determination of density contour clusters at level α,

C_α := {x : p(x) ≥ α}.

Simultaneously, and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating C(α). They too made use of VC classes but stated their results in a stronger form which is meaningful for finite sample sizes.

Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy. Let X be a random variable with density p on X. Define the differential entropy of X by

h(X) = −∫_X p(x) log₂ p(x) dx.

For ε > 0 and ℓ ∈ N, define the typical set A_ε^(ℓ) with respect to p by

A_ε^(ℓ) = {(x_1, ..., x_ℓ) ∈ X^ℓ : |−(1/ℓ) log₂ p(x_1, ..., x_ℓ) − h(X)| ≤ ε},

where p(x_1, ..., x_ℓ) = Π_{i=1}^{ℓ} p(x_i).

If (a_ℓ) and (b_ℓ) are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log₂(a_ℓ/b_ℓ) = 0. Let vol S_δ^(ℓ) denote the smallest volume of a set in X^ℓ containing probability at least 1 − δ. Cover and Thomas (1991, p. 227) show that for all ε, δ < 1/2,

vol S_δ^(ℓ) ≐ vol A_ε^(ℓ) ≐ 2^{ℓ h(X)}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is 2^h. This provides an interpretation of differential entropy."

The applications that have been considered include problems in medical diagnosis (Tarassenko et al., 1995), marketing (Ben-David and Lindenbaum, 1997), condition monitoring of machines (Devroye and Wise, 1980), estimating manufacturing yields (Stoneking, 1999), econometrics and generalized nonlinear principal curves (Tsybakov, 1997; Korostelev and Tsybakov, 1993), regression and spectral analysis (Polonik, 1997), tests for multimodality and clustering (Polonik, 1995b) and others (Müller, 1992).

Polonik (1995a) has also suggested the use of such estimators for density estimation under qualitative shape assumptions. A point of doing this is that it allows one to encode a range of prior assumptions about the true density. He has shown asymptotic consistency and rates of convergence for densities belonging to VC classes.

The present paper describes an algorithm which finds regions close to C(α). Our class C is defined implicitly via a kernel k, and the region is estimated by thresholding a kernel expansion. We do not try to estimate a density and we do not try to minimize the volume of C in input space. On the other hand, our algorithm has tractable computational complexity, even in high-dimensional cases. Our theory, which uses very similar tools to those used by Polonik, gives results that we expect will be of more use in a finite sample size setting.

3 Algorithms

We first consider training data

x_1, ..., x_ℓ ∈ X   (2)

where ℓ ∈ N is the number of observations and X is some set. For simplicity, we think of it as a compact subset of R^N. Let Φ: X → F be a

feature map, i.e. a map into a dot product space F such that the dot product in the image of Φ can be computed by evaluating some simple kernel (Boser et al., 1992; Vapnik, 1995; Schölkopf et al., 1999a)

k(x, y) = (Φ(x) · Φ(y)),   (3)

such as the Gaussian kernel

k(x, y) = exp(−||x − y||² / c).   (4)

Indices i and j are understood to range over 1, ..., ℓ (in compact notation: i, j ∈ [ℓ]). Boldface greek letters denote ℓ-dimensional vectors whose components are labelled using normal face typeset.

In the remainder of this section, we shall develop an algorithm which returns a function f that takes the value +1 in a "small" region capturing most of the data points, and −1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel, and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space. Via the freedom to utilize different types of kernel functions, this simple geometric picture corresponds to a variety of nonlinear estimators in input space.

To separate the data set from the origin, we solve the following quadratic program:

min_{w ∈ F, ξ ∈ R^ℓ, ρ ∈ R}  (1/2)||w||² + (1/(νℓ)) Σ_i ξ_i − ρ   (5)

subject to  (w · Φ(x_i)) ≥ ρ − ξ_i,  ξ_i ≥ 0.   (6)

Here, ν ∈ (0, 1) is a parameter whose meaning will become clear later. Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

f(x) = sgn((w · Φ(x)) − ρ)   (7)

will be positive for most examples x_i contained in the training set, while the SV-type regularization term ||w|| will still be small. The actual trade-off between these two goals is controlled by ν. Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian

L(w, ξ, ρ, α, β) = (1/2)||w||² + (1/(νℓ)) Σ_i ξ_i − ρ − Σ_i α_i ((w · Φ(x_i)) − ρ + ξ_i) − Σ_i β_i ξ_i   (8)
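The Gaussian kernel (4) is straightforward to sketch in code. The following illustration (plain Python, not from the paper; the width parameter is named c as in the text) also checks two of its properties used later in Sec. 7: k(x, x) = 1 and positivity of all kernel values:

```python
import math

def gaussian_kernel(x, y, c):
    """k(x, y) = exp(-||x - y||^2 / c), Eq. (4)."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / c)

def kernel_matrix(X, c):
    """Gram matrix K_ij = k(x_i, x_j)."""
    return [[gaussian_kernel(xi, xj, c) for xj in X] for xi in X]

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(X, c=2.0)
print(all(abs(K[i][i] - 1.0) < 1e-12 for i in range(3)))     # k(x, x) = 1: unit length in F
print(all(K[i][j] > 0 for i in range(3) for j in range(3)))  # all dot products positive
```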

and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

w = Σ_i α_i Φ(x_i),   (9)

α_i = 1/(νℓ) − β_i ≤ 1/(νℓ),  Σ_i α_i = 1.   (10)

In (9), all patterns {x_i : α_i > 0} are called Support Vectors (SVs). Together with (3), the SV expansion transforms the decision function (7) into a kernel expansion

f(x) = sgn(Σ_i α_i k(x_i, x) − ρ).   (11)

Substituting (9)-(10) into (8), and using (3), we obtain the dual problem:

min_α  (1/2) Σ_{ij} α_i α_j k(x_i, x_j)  subject to  0 ≤ α_i ≤ 1/(νℓ),  Σ_i α_i = 1.   (12)

One can show that at the optimum, the two inequality constraints (6) become equalities if α_i and β_i are nonzero, i.e. if 0 < α_i < 1/(νℓ). Therefore, we can recover ρ by exploiting that for any such α_i, the corresponding pattern x_i satisfies

ρ = (w · Φ(x_i)) = Σ_j α_j k(x_j, x_i).   (13)

Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity, i.e. the first inequality constraint in (12) becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (5). It is still a feasible problem, since we have placed no restriction on ρ, so ρ can become a large negative number in order to satisfy (6). If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σ_i α_i ≥ 1 instead of the corresponding equality constraint in (12).

To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Schölkopf et al. (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),

min_{R ∈ R, ξ ∈ R^ℓ, c ∈ F}  R² + (1/(νℓ)) Σ_i ξ_i

subject to  ||Φ(x_i) − c||² ≤ R² + ξ_i,  ξ_i ≥ 0  for i ∈ [ℓ].   (14)
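Given any feasible coefficients α for the dual (12), the decision function (11) and the recovery of ρ via (13) can be sketched as follows. The coefficients below are made up for illustration: they satisfy the constraints of (12) for ν = 0.5 but are not the solution of any particular training problem.

```python
import math

def k(x, y, c=2.0):
    """Gaussian kernel, Eq. (4)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

def recover_rho(alphas, X, nu):
    """Eq. (13): rho = sum_j alpha_j k(x_j, x_i) for any i with 0 < alpha_i < 1/(nu*l)."""
    upper = 1.0 / (nu * len(X))
    for i, a in enumerate(alphas):
        if 1e-9 < a < upper - 1e-9:        # non-bound SV: sits exactly on the hyperplane
            return sum(aj * k(xj, X[i]) for aj, xj in zip(alphas, X))
    raise ValueError("no non-bound support vector found")

def decision(x, alphas, X, rho):
    """Eq. (11): f(x) = sgn(sum_i alpha_i k(x_i, x) - rho)."""
    val = sum(a * k(xi, x) for a, xi in zip(alphas, X)) - rho
    return 1 if val >= 0 else -1

X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.2), (5.0, 5.0)]
alphas = [0.5, 0.3, 0.2, 0.0]   # feasible for nu = 0.5: 0 <= alpha_i <= 0.5, sum = 1
rho = recover_rho(alphas, X, nu=0.5)
print(decision(X[1], alphas, X, rho))        # a non-bound SV lies on the boundary: +1
print(decision((9.0, 9.0), alphas, X, rho))  # far from all data: -1
```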

This leads to the dual

min_α  Σ_{ij} α_i α_j k(x_i, x_j) − Σ_i α_i k(x_i, x_i)   (15)

subject to  0 ≤ α_i ≤ 1/(νℓ)   (16)

and  Σ_i α_i = 1,   (17)

and the solution c = Σ_i α_i Φ(x_i), corresponding to a decision function of the form

f(x) = sgn(R² − Σ_{ij} α_i α_j k(x_i, x_j) + 2 Σ_i α_i k(x_i, x) − k(x, x)).   (18)

Similar to the above, R² is recovered such that for any x_i with 0 < α_i < 1/(νℓ), the argument of the sgn is zero. For kernels k(x, y) which only depend on x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem (15)-(16) turns out to be equivalent to (12). It can be shown that the same holds true for the decision function, hence the two algorithms coincide in that case.

4 Optimization

The last section has formulated quadratic programs (QPs) which can in principle be solved using standard packages for constrained optimization problems. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In the present section, we describe an algorithm which takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (Sequential Minimal Optimization), an SV training algorithm originally proposed for classification (Platt, 1999), and subsequently adapted to regression estimation (Smola and Schölkopf, 1998).

The strategy of SMO is to break up the constrained minimization of (12) into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly rendering the constraint unsatisfiable. We therefore resort to optimizing over pairs of variables.
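The claimed equivalence of the ball and hyperplane formulations for constant k(x, x) can be checked numerically: with the same α, and R² fixed by requiring the argument of (18) to vanish at a non-bound SV, the argument of (18) is exactly twice the argument of (11). A sketch (Gaussian kernel, made-up feasible coefficients; not from the paper):

```python
import math, random

def k(x, y, c=2.0):
    """Gaussian kernel, Eq. (4); k(x, x) = 1 for all x."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.2), (5.0, 5.0)]
alphas = [0.5, 0.3, 0.2, 0.0]   # feasible for nu = 0.5 (box [0, 0.5], sum = 1)
sv = 1                          # index of a non-bound SV

rho = sum(a * k(xi, X[sv]) for a, xi in zip(alphas, X))        # Eq. (13)
quad = sum(ai * aj * k(xi, xj) for ai, xi in zip(alphas, X)
                               for aj, xj in zip(alphas, X))
# R^2 chosen so that the argument of (18) vanishes at the same non-bound SV
R2 = quad + k(X[sv], X[sv]) - 2 * rho

random.seed(0)
for _ in range(100):
    x = (random.uniform(-2, 8), random.uniform(-2, 8))
    plane = sum(a * k(xi, x) for a, xi in zip(alphas, X)) - rho   # argument of (11)
    ball = (R2 - quad
            + 2 * sum(a * k(xi, x) for a, xi in zip(alphas, X))
            - k(x, x))                                            # argument of (18)
    assert abs(ball - 2 * plane) < 1e-12  # same sign everywhere, since k(x, x) = 1
print("ball and hyperplane decisions coincide")
```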

Elementary optimization step. For instance, consider optimizing over α_1 and α_2 with all other variables fixed. Using the shorthand K_{ij} := k(x_i, x_j), (12) then reduces to

min_{α_1, α_2}  (1/2) Σ_{i,j=1}^{2} α_i α_j K_{ij} + Σ_{i=1}^{2} α_i C_i + C,   (19)

with C_i := Σ_{j=3}^{ℓ} α_j K_{ij} and C := Σ_{i,j=3}^{ℓ} α_i α_j K_{ij}, subject to

0 ≤ α_1, α_2 ≤ 1/(νℓ),  Σ_{i=1}^{2} α_i = Δ,   (20)

where Δ := 1 − Σ_{i=3}^{ℓ} α_i. We discard C, which is independent of α_1 and α_2, and eliminate α_1 to obtain

min_{α_2}  (1/2)(Δ − α_2)² K_{11} + (Δ − α_2) α_2 K_{12} + (1/2) α_2² K_{22} + (Δ − α_2) C_1 + α_2 C_2,   (21)

with the derivative

−(Δ − α_2) K_{11} + (Δ − 2α_2) K_{12} + α_2 K_{22} − C_1 + C_2.   (22)

Setting this to zero and solving for α_2, we get

α_2 = (Δ(K_{11} − K_{12}) + C_1 − C_2) / (K_{11} − 2K_{12} + K_{22}).   (23)

Once α_2 is found, α_1 can be recovered from α_1 = Δ − α_2. If the new point (α_1, α_2) is outside of [0, 1/(νℓ)]², the constrained optimum is found by projecting α_2 from (23) into the region allowed by the constraints, and then recomputing α_1.

Additional insight can be obtained by rewriting the last equation in terms of the outputs of the kernel expansion on the examples x_1 and x_2 before the optimization step. Let α_1*, α_2* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. (11)) read

O_i := K_{1i} α_1* + K_{2i} α_2* + C_i.   (24)

Using the latter to eliminate the C_i, we end up with an update equation for α_2 which does not explicitly depend on α_1*,

α_2 = α_2* + (O_1 − O_2) / (K_{11} − 2K_{12} + K_{22}),   (25)

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction. Clearly, the same elementary optimization step can be applied to any pair of two variables, not just α_1 and α_2. We next briefly describe how to do the overall optimization.
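The elementary step, Eqs. (20)-(25), with the clipping into the feasible region, can be sketched as a small routine (Python, not from the paper; the fall-back heuristics for degenerate directions are omitted here):

```python
def smo_pair_step(alphas, K, i, j, nu):
    """One analytic step on (alpha_i, alpha_j) for the dual (12); cf. Eqs. (20)-(25).

    K is the Gram matrix; the box is [0, 1/(nu*l)] and alpha_i + alpha_j
    (Delta in Eq. (20)) is preserved."""
    l = len(alphas)
    upper = 1.0 / (nu * l)
    delta = alphas[i] + alphas[j]
    # outputs of the kernel expansion before the step, Eq. (24)
    O_i = sum(a * K[n][i] for n, a in enumerate(alphas))
    O_j = sum(a * K[n][j] for n, a in enumerate(alphas))
    eta = K[i][i] - 2 * K[i][j] + K[j][j]   # second derivative along the feasible direction
    if eta <= 0:
        return                               # degenerate; a full implementation picks another pair
    new_j = alphas[j] + (O_i - O_j) / eta    # unconstrained optimum, Eq. (25)
    # clip into the region allowed by the constraints (20)
    new_j = min(max(new_j, max(0.0, delta - upper)), min(upper, delta))
    alphas[j] = new_j
    alphas[i] = delta - new_j

K = [[1.0, 0.5, 0.1],
     [0.5, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
alphas = [0.4, 0.3, 0.3]        # feasible for nu = 0.5, l = 3 (upper bound 2/3)
smo_pair_step(alphas, K, 0, 1, nu=0.5)
print(alphas)                    # alpha_0 + alpha_1 is preserved; the sum stays 1
```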

Initialization of the algorithm. We start by setting a fraction ν of all α_i, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure Σ_i α_i = 1. Moreover, we set the initial ρ to max{O_i : i ∈ [ℓ], α_i > 0}.

Optimization algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SV_nb for the indices of variables which are not at bound, i.e. SV_nb := {i : 0 < α_i < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane, and that will therefore have a strong influence on its precise position.

(i) We scan over the entire data set [1] until we find a variable violating a KKT condition (Bertsekas, 1995). Once we have found one, say α_i, we pick α_j according to

j = arg max_{n ∈ SV_nb} |O_i − O_n|.   (26)

(ii) Same as (i), but the scan is only performed over SV_nb.

In practice, one scan of type (i) is followed by multiple scans of type (ii), until there are no KKT violators in SV_nb, whereupon the optimization goes back to a single scan of type (i). If the type (i) scan finds no KKT violators, the optimization terminates.

In unusual circumstances, the choice heuristic (26) cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These heuristics are the same as in the case of pattern recognition, cf. (Platt, 1999), and have been found to work well in our experiments with SMO applied to distribution support estimation. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly, cf. (Keerthi et al., 1999).

We end this section by stating a trick which is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points which lie on the margin, we need to subtract this tolerance from ρ at the end.

[1] This scan can be accelerated by not checking patterns which are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
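A toy end-to-end solver can be written by repeatedly applying the elementary step to pairs of variables. The sketch below deliberately simplifies the procedure described above: initialization is uniform rather than random, and pairs are chosen at random instead of by the KKT scan and heuristic (26), so it is slower, but it converges to the same optimum on small problems:

```python
import math, random

def gaussian(x, y, c):
    """Gaussian kernel, Eq. (4)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

def train_one_class(X, nu, c, sweeps=200, seed=0):
    """Approximately solve the dual (12) by pairwise coordinate descent."""
    l = len(X)
    upper = 1.0 / (nu * l)
    K = [[gaussian(xi, xj, c) for xj in X] for xi in X]
    alphas = [1.0 / l] * l               # feasible: sums to 1, below 1/(nu*l) for nu <= 1
    rng = random.Random(seed)
    for _ in range(sweeps):
        for i in range(l):
            j = rng.randrange(l - 1)
            if j >= i:
                j += 1                   # pick j != i uniformly
            delta = alphas[i] + alphas[j]
            O_i = sum(a * K[n][i] for n, a in enumerate(alphas))
            O_j = sum(a * K[n][j] for n, a in enumerate(alphas))
            eta = K[i][i] - 2 * K[i][j] + K[j][j]
            if eta < 1e-12:
                continue
            new_j = alphas[j] + (O_i - O_j) / eta   # Eq. (25)
            new_j = min(max(new_j, max(0.0, delta - upper)), min(upper, delta))
            alphas[j], alphas[i] = new_j, delta - new_j
    nb = [i for i, a in enumerate(alphas) if 1e-8 < a < upper - 1e-8]
    i0 = nb[0] if nb else max(range(l), key=lambda i: alphas[i])
    rho = sum(a * K[n][i0] for n, a in enumerate(alphas))       # Eq. (13)
    return alphas, rho, K

random.seed(1)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(30)]
alphas, rho, K = train_one_class(X, nu=0.2, c=2.0)
outputs = [sum(a * K[n][i] for n, a in enumerate(alphas)) - rho for i in range(len(X))]
print("outliers:", sum(1 for o in outputs if o < -1e-3), "of", len(X))
```

Consistent with Proposition 4 below, the number of training points with negative output should not exceed a fraction ν of the sample.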

5 Theory

In this section, we analyse the algorithm theoretically, starting with the uniqueness of the hyperplane (Proposition 2). We then describe the connection to binary classification (Proposition 3), and show that the parameter ν characterizes the fractions of SVs and outliers (Proposition 4). Following that, we give a robustness result for the soft margin (Proposition 5) and finally we briefly state error bounds (Theorem 9). In this section, we will use italic letters to denote the feature space images of the corresponding patterns in input space, i.e.

x_i := Φ(x_i).   (27)

Definition 1 A data set

x_1, ..., x_ℓ   (28)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].

Proposition 2 If the data set (28) is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

min_{w ∈ F}  (1/2)||w||²  subject to  (w · x_i) ≥ ρ,  i ∈ [ℓ].   (29)

Proof Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follows from the supporting hyperplane theorem (e.g. Bertsekas, 1995). Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). By the Cauchy-Schwartz inequality, the distance of the hyperplane {z : (w · z) = ρ} to the origin is ρ/||w||. Therefore the optimal hyperplane is obtained by minimizing ||w|| subject to these constraints, i.e. by the solution of (29).

The following result elucidates the relationship between single-class classification and binary classification.

Proposition 3 (i) Suppose (w, ρ) parametrizes the supporting hyperplane for the data (28). Then (w, 0) parametrizes the optimal separating hyperplane (passing through the origin (Vapnik, 1995)) for the labelled data set

{(x_1, +1), ..., (x_ℓ, +1), (−x_1, −1), ..., (−x_ℓ, −1)}.   (30)

(ii) Suppose (w, 0) parametrizes the optimal separating hyperplane passing through the origin for a labelled data set

{(x_1, y_1), ..., (x_ℓ, y_ℓ)},  y_i ∈ {±1} for i ∈ [ℓ],   (31)

aligned such that (w · x_i) is positive whenever y_i = +1. Suppose, moreover, that ρ/||w|| is the margin of the optimal hyperplane. Then (w, ρ) parametrizes the supporting hyperplane for the unlabelled data set

{y_1 x_1, ..., y_ℓ x_ℓ}.   (32)

Proof Ad (i). Observe that (w, 0) parametrizes the supporting hyperplane for the data set reflected through the origin, and that it is parallel to the one given by (w, ρ). It thus provides an optimal separation of the two sets, with distance 2ρ/||w||. Ad (ii). By assumption, w is the shortest vector satisfying y_i (w · x_i) ≥ ρ (note that the offset is 0) (cf. Vapnik, 1995). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].

Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (which are either on the wrong side of the separating hyperplane or which fall inside the margin) translate into outliers in single-class classification, i.e. points on the wrong side of the hyperplane. Proposition 3 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.

The utility of Proposition 3 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf et al., 1999c) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.

Proposition 4 Assume the solution of (5)-(6) satisfies ρ ≠ 0. The following statements hold:

(i) ν is an upper bound on the fraction of outliers.

(ii) ν is a lower bound on the fraction of SVs.

(iii) Suppose the data (28) were generated independently from a distribution P which does not contain discrete components. Suppose, moreover, that the kernel is analytic and non-constant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.

Proof (i) and (ii) follow directly from Proposition 3 and the fact that outliers are dealt with in exactly the same way as margin errors in the optimization problem for the binary classification case (Schölkopf et al., 1999c). The basic idea is that (10) imposes constraints on the fraction of patterns that can have α_i = 1/(νℓ), upper bounding the fraction of outliers, and on the fraction of patterns that must have α_i > 0, lower bounding the fraction of SVs. Alternatively, the result can be proven directly based on the primal objective function (5), as sketched presently: to this end, note that when changing ρ, the term Σ_i ξ_i will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points which are just about to get a nonzero ξ_i, i.e. which sit on the hyperplane (the SVs). At the optimum of (5), we therefore have (i) and (ii). Part (iii) can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well-behaved, the fraction of points which lie exactly on the hyperplane is asymptotically negligible (cf. Schölkopf et al., 1999c).

Proposition 5 (Resistance) Local movements of outliers parallel to w do not change the hyperplane.

Proof Suppose x_o is an outlier, i.e. ξ_o > 0, hence by the KKT conditions (e.g. Bertsekas, 1995) α_o = 1/(νℓ). Transforming it into x_o' := x_o + δ · w, where |δ| < ξ_o/||w||, leads to a slack which is still nonzero, i.e. ξ_o' > 0; hence we still have α_o = 1/(νℓ). Therefore, α is still feasible, as is the primal solution (w, ξ', ρ). Here, we use ξ_i' = ξ_i for i ≠ o, and ξ_o' as computed from (6). Finally, the KKT conditions are still satisfied, as still α_o = 1/(νℓ) (Bertsekas, 1995). Thus, although the outlier has moved, the hyperplane does not change.

We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside of the estimated region. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
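The counting argument behind (i) and (ii) can be checked mechanically: for any feasible α in (12), m coefficients at the upper bound 1/(νℓ) force m/(νℓ) ≤ Σ_i α_i = 1, i.e. m/ℓ ≤ ν, while n nonzero coefficients force n/(νℓ) ≥ Σ_i α_i = 1, i.e. n/ℓ ≥ ν. A sketch with made-up feasible coefficients (not from the paper):

```python
def nu_property(alphas, nu):
    """Fractions of at-bound coefficients (outlier candidates) and of
    nonzero coefficients (SVs) for a feasible alpha of the dual (12)."""
    l = len(alphas)
    upper = 1.0 / (nu * l)
    at_bound = sum(1 for a in alphas if abs(a - upper) < 1e-12)
    nonzero = sum(1 for a in alphas if a > 1e-12)
    return at_bound / l, nonzero / l

# feasible for nu = 0.25, l = 8: box [0, 0.5], coefficients sum to 1
alphas = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
out_frac, sv_frac = nu_property(alphas, nu=0.25)
print(out_frac, "<=", 0.25, "<=", sv_frac)   # 0.125 <= 0.25 <= 0.375
```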

Definition 6 Let (X, d) be a pseudo-metric space [2], let A be a subset of X and ε > 0. A set B ⊆ X is an ε-cover for A if, for every a ∈ A, there exists b ∈ B such that d(a, b) ≤ ε. The ε-covering number of A, N_d(ε, A), is the minimal cardinality of an ε-cover for A (if there is no such finite cover then it is defined to be ∞).

The idea is that A should be finitely describable up to precision ε with respect to d. We will use the l_∞ distance over a finite sample X = (x_1, ..., x_ℓ) as the pseudo-metric in the space of functions,

d_X(f, g) := max_{i ∈ [ℓ]} |f(x_i) − g(x_i)|,   (33)

and write N(ε, F, ℓ) := sup_{X ∈ X^ℓ} N_{d_X}(ε, F). A first result of this kind (Lemma 7), whose proof skeleton is (Shawe-Taylor et al., 1998, Lemma 3.9), states, informally, that with high probability over an ℓ-sample, if we find f ∈ F with f(x_i) ≥ γ for all i ∈ [ℓ], then the probability that a new point falls below the level is small, at a rate governed by the covering number N(γ/2, F, 2ℓ). We now consider the case where a small number of training points may fall below that level; in the algorithm, this corresponds to having a non-zero slack variable ξ_i, where we take θ = ρ and use the class of functions f_w(x) = (w · Φ(x)).

Definition 8 Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X, define d(x, f, θ) := max(0, θ − f(x)). Similarly, for a training sequence X = (x_1, ..., x_ℓ), we define D(X, f, θ) := Σ_{x ∈ X} d(x, f, θ). Consider a fixed but unknown distribution P on X.

[2] i.e. a distance function that differs from a metric in that it is only semidefinite.

Theorem 9 Fix θ ∈ R. With probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F, the probability that a new point x drawn from P satisfies f(x) < θ − γ is bounded by a quantity of order

(1/ℓ) (log N(γ/2, F, 2ℓ) + D(X, f, θ)/γ + log(1/δ)).

The proof is based on similar proofs for the classification case in (Shawe-Taylor and Cristianini, 1999, Theorem 3). The theorem bounds the probability of a new point falling in the region for which f has value less than θ − γ, i.e. a slightly smaller region than the one for which it has value less than θ. In the algorithm described in this paper, one would use the hyperplane shifted by γ towards the origin to define such a region. Note that there is no restriction placed on the class of functions F. The choice of γ gives a trade-off between the size of the region over which the bound holds (increasing γ increases the size of the region) and the size of the probability with which it holds (increasing γ decreases the size of the log covering numbers).

The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the margin slack vector. This result is stronger than related results given by Ben-David and Lindenbaum (1997), since their bound involves the square root of the ratio of the Pollard dimension (the fat shattering dimension when γ tends to 0) and the number of training examples.

The above bounds are, nevertheless, not entirely satisfactory, and their inclusion here is much more as a sanity check than as a "closed case". Whilst most of the apparent technical gaps can be readily filled (for example, determining the covering numbers for the class of functions induced by use of a particular kernel, using methods as in Williamson et al. (1999)), the gaps do not invalidate the algorithm; they simply indicate an incomplete theory, a situation similar to the difficulty of choosing parameters in the support vector machine case. Furthermore, whilst not immediately apparent, the results stated do not actually give guidance as to how to choose the kernel parameter. Also, the bounds do not depend on the margin of separation from the origin; although a tighter connection is not necessary, it could be motivated by noting that it

ν, kernel width c      0.5, 0.5     0.5, 0.5     0.1, 0.5     0.5, 0.1
fraction of SVs/OLs    0.54, 0.43   0.59, 0.47   0.24, 0.03   0.65, 0.38
margin ρ/||w||         0.84         0.70         0.62         0.48

Figure 1: First two pictures: A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². Note how in both cases, at least a fraction of 1 − ν of all examples is in the estimated region (cf. table). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these 'outliers' (OLs) into account by changing the kernel width (4): in the fourth picture, using c = 0.1, ν = 0.5, the data is effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.

is plausible that if we obtain a very large margin of separation to the origin, we would be more likely to accept a large γ (with the associated risk of ending up with more false positives from the "unknown" class). Measuring γ relative to the margin would then lead to bounds which depend on the margin. Equivalently, we could argue that we try to maximize the margin in order to have the freedom to use a large γ, leading to smaller values of the error bounds, while still not including the "unknown" class in the estimated region. Admittedly, this argument implicitly makes prior assumptions about the unknown class, in particular that it is in some sense concentrated around the origin. The algorithm could be modified to accommodate this case, but presently, we shall not go into further detail on that matter.

6 Experiments

We apply the method to artificial and real-world data. Figure 1 displays 2-D toy examples, and shows how the parameter settings influence the solution. Figure 2 shows a plot of the outputs on training and test sets of the US postal service database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We trained the

Figure 2: Experiments on the US postal service database of handwritten digits. Output histograms for the recognizer of digit 0: exemplars of 0 in the training and test set, and test exemplars of other digits, with the threshold marked in each panel. The x-axis gives the output values, i.e. the argument of the sgn function in (11). For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with Proposition 4), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false positive rate increases to 7%. Note, finally, that the plots smooth the histograms; in reality, many examples sit exactly at the threshold value (the non-bound SVs). Since this peak is smoothed out by the estimator, the fraction of outliers in the training set appears slightly larger than it should be.

algorithm, using a Gaussian kernel (4) of width c = 0.5 · 256 (a common value for SVM classifiers on that data set, cf. Schölkopf et al. (1995)), on the training set exemplars of digit 0. As shown in Figure 2, ν = 50% leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the 0's in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of the 0's in the test set, with a false positive rate of 7%. Although leading to encouraging results, this experiment did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we utilized the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers — it is folklore in the community that the USPS test set (Fig. 3) contains a number of patterns which are hard or impossible to classify, due to segmentation errors or mislabelling (e.g. Vapnik, 1995). In the experiment, we augmented the input patterns by ten extra dimensions corresponding to the class labels of the digits. The rationale for this is that if we disregarded the labels, there would be no hope to identify mislabelled patterns as outliers; with the labels, the algorithm has the chance to identify both unusual patterns and usual patterns with unusual labels. Fig. 4 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns which are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above, and a ν value of 5%. In the last experiment, we tested the scaling behaviour of the proposed SMO solver which is used for training the learning machine (Fig. 5). It turned out that, for the small values of ν which are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.

Figure 3: A subset of 20 examples randomly drawn from the USPS test set, with class labels.

ν                  1%     2%     3%     4%     5%     6%     7%     8%     9%    10%    20%    30%    40%    50%    60%    70%    80%    90%
fraction of OLs   0.0%   0.0%   0.1%   0.6%   1.4%   1.8%   2.6%   4.1%   5.4%   6.2%  16.9%  27.5%  37.1%  47.4%  58.2%  68.3%  78.5%  89.4%
fraction of SVs  10.0%  10.0%  10.0%  10.1%  10.6%  11.2%  11.5%  12.0%  12.9%  13.7%  22.6%  31.8%  41.7%  51.2%  61.0%  70.7%  80.5%  90.1%

Table 1: Experimental results for various values of the outlier control constant ν (USPS test set). Note that ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. Proposition 4). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. Sec. 4). The system used in the outlier detection experiments is shown in boldface.

Outlier scores: −5139, −5071, −4580, −3771, −2827, −2162, −2003, −1869, −1795, −1620, −1533, −1436, −1286, −1230, −1177, −935, −780, −587, −526, −483

Figure 4: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of the sgn in (11)). The outputs (for convenience in units of 10⁻⁵) are written underneath each image in italics, together with the (alleged) class labels. Note that most of the examples are "difficult" in that they are either atypical or even mislabelled.

7 Discussion

One could view the present work as an attempt to provide a new algorithm which is in line with Vapnik's principle never to solve a problem which is more general than the one that one is actually interested in. E.g., in situations where one is only interested in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.

Mathematically speaking, a density will only exist if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say the sets measurable in Borel's sense, is not solvable (for a discussion, see e.g. Vapnik, 1998). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator which accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region, and then estimates a region with the desired property. Often, there will be many such regions — the solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel. Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears already in classical density estimation: if we

Figure 5: Training times vs. data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz). Left plot: ν = 1%, 2%, 5%, 10%; right plot: ν = 20%, 30%, 40%, 50%. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling behaviour: for large values of ν, training times are roughly proportional to the sample size raised to a fixed power (right plot); for the small values of ν (left plot) typically used in outlier detection experiments (in Fig. 4, we used ν = 5%), the scaling is at most quadratic. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. Note that the scalings are better than the cubic one that one would expect when solving the optimization problem using all patterns at once. As in the other experiments, we used the same kernel width; here, however, we only trained on subsets of the USPS test set.

perform a (nonlinear) coordinate transformation in the input domain, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions are transformed accordingly.

An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense which depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some dot product space (Smola et al., 1998; Girosi, 1998), and take a look at the dual objective function that we minimize: it equals the squared norm ||Pf||² of the function f = Σ_i α_i k(x_i, ·) under the regularization operator P. The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio and Girosi, 1990) — therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior over the function class favouring smooth estimates.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem) — trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.

The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabelled data points by separating it from the origin using a hyperplane (Vapnik and Lerner, 1963; Vapnik and Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, both in terms of algorithms and in terms of the theoretical development of statistical learning theory which originated in those days.

From an algorithmic point of view, we can identify two shortcomings of the original approach which may have caused research in this direction to stop for more than three decades. Firstly, the original algorithm in (Vapnik and Chervonenkis, 1974) was limited to linear decision rules in input space; secondly, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe — a generic data set need not be separable from the origin by a hyperplane in input space.

The two modifications that we have incorporated dispose of these shortcomings. Firstly, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space, and thereby increases the chances of separability from the origin. In particular, using a Gaussian kernel (4), such a separation exists for any data set: to see this, note that k(x_i, x_j) > 0 for all i, j, thus all dot products between mapped patterns are positive, implying that all mapped patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. The second modification directly allows for the possibility of outliers. We have incorporated this 'softness' of the decision rule using the ν-trick (Schölkopf et al., 1999c) and thus obtained a direct handle on the fraction of outliers.

We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a Gaussian kernel) have to be tackled. It is our expectation that the theoretical results which we have briefly outlined will provide a foundation for this formidable task.

Acknowledgement. This work was supported by the ARC and the DFG (# Ja 379/9-1). Parts of it were done while BS and AS were with GMD FIRST, Berlin.

References

S. Ben-David and M. Lindenbaum. Learning distributions by their density levels: a paradigm for learning without a teacher. Journal of Computer and System Sciences, 55:171–182, 1997.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
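The separability argument for Gaussian kernels can also be verified numerically: since all kernel values are positive, the mean of the mapped patterns, w = (1/ℓ) Σ_j Φ(x_j), satisfies (w · Φ(x_i)) = (1/ℓ) Σ_j k(x_j, x_i) > 0 for every i, so it already separates the mapped data from the origin. A sketch (random data; the width c = 0.5 is an arbitrary choice for illustration):

```python
import math, random

def k(x, y, c=0.5):
    """Gaussian kernel, Eq. (4)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

random.seed(0)
X = [tuple(random.uniform(-2, 2) for _ in range(3)) for _ in range(50)]

# (w . Phi(x_i)) for w = mean of the mapped patterns, via the kernel trick
margins = [sum(k(xj, xi) for xj in X) / len(X) for xi in X]
print(all(m > 0 for m in margins))                     # separable from the origin
print(all(abs(k(xi, xi) - 1.0) < 1e-12 for xi in X))   # all mapped patterns have unit length
```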

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, 1992. ACM Press.

J. Chevalier. Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré. Section B. Calcul des Probabilités et Statistique. Nouvelle Série, 12(4):339–364, 1976.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.

A. Cuevas. [...], 19(6):26–33, 1990.

A. Cuevas and R. Fraiman. A plug-in approach to support estimation. Annals of Statistics, 25(6):2300–2312, 1997.

L. Devroye and G. L. Wise. Detection of abnormal behavior via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3):480–488, 1980.

J. H. J. Einmahl and D. M. Mason. Generalized quantile processes. Annals of Statistics, 20(2):1062–1078, 1992.

G. Gayraud. Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1):26–46, 1997.

J. Geffroy. Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13:191–210, 1964.

F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455–1480, 1998.

J. A. Hartigan. Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82:267–270, 1987.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184. MIT Press, Cambridge, MA, 1999.

S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Technical Report CD-99-14, Dept. of Mechanical and Production Engineering, National University of Singapore, Singapore, 1999.

A. P. Korostelev and A. B. Tsybakov. Minimax Theory of Image Reconstruction. Springer, New York, 1993.

D. W. Müller. The excess mass approach in statistics. Beiträge zur Statistik, Universität Heidelberg, 1992.

D. Nolan. The excess mass ellipsoid. Journal of Multivariate Analysis, 39:348–371, 1991.

J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208. MIT Press, Cambridge, MA, 1999.

T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1990.

W. Polonik. Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1):61–81, 1995a.

W. Polonik. Measuring mass concentrations and estimating density contour clusters — an excess mass approach. Annals of Statistics, 23(3):855–881, 1995b.

W. Polonik. Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69:1–24, 1997.

T. W. Sager. An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366):329–339, 1979.

B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995.

B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods — Support Vector Learning. MIT Press, Cambridge, MA, 1999a.

B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 327–352. MIT Press, Cambridge, MA, 1999b.

B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 1999c. To appear. Also: NeuroCOLT2 Technical Report NC2-TR-1998-031.

B. Schölkopf, R. C. Williamson, A. J. Smola, and J. Shawe-Taylor. Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, and N. Tishby, editors, Unsupervised Learning, Dagstuhl-Seminar-Report 235, pages 19–20, 1999.

J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

J. Shawe-Taylor and N. Cristianini. Margin distribution bounds on generalization. In Paul Fischer and Hans Ulrich Simon, editors, Computational Learning Theory, 4th European Conference, EuroCOLT'99. Springer, 1999. Also: NeuroCOLT2 Technical Report NC2-TR-1998-030.

A. J. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT Technical Report, Royal Holloway College, University of London, UK, 1998.

A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637–649, 1998.

A. J. Smola, R. C. Williamson, S. Mika, and B. Schölkopf. Regularized principal manifolds. In Computational Learning Theory: 4th European Conference, volume 1572 of Lecture Notes in Artificial Intelligence, pages 214–229. Springer, 1999.

D. Stoneking. Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6):70–76, 1999.

L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady. Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks, pages 442–447, Cambridge, 1995.

D. M. J. Tax and R. P. W. Duin. Data domain description by support vectors. In M. Verleysen, editor, Proceedings ESANN, pages 251–256, Brussels, 1999.

A. B. Tsybakov. On nonparametric estimation of density level sets. Annals of Statistics, 25(3):948–969, 1997.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Avtomatika i Telemekhanika, 24:774–780, 1963.

R. C. Williamson, A. J. Smola, and B. Schölkopf. Entropy numbers, operators and support vector kernels. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 127–144. MIT Press, Cambridge, MA, 1999.

