Streaming Gibbs Sampling for LDA Model

Yang Gao
Electrical Engineering and Computer Science
University of California, Berkeley, Berkeley, CA 94710
[email protected]

Jianfei Chen, Jun Zhu
Dept. of Comp. Sci & Tech; TNList Lab; State Key Lab of Intell. Tech & Sys.
Tsinghua University, Beijing, 100084, China
[email protected]  [email protected]

arXiv:1601.01142v1 [cs.LG] 6 Jan 2016

Abstract

Streaming variational Bayes (SVB) is successful in learning LDA models in an online manner. However, previous attempts toward developing online Monte Carlo methods for LDA have had little success, often yielding much worse perplexity than their batch counterparts. We present a streaming Gibbs sampling (SGS) method, an online extension of collapsed Gibbs sampling (CGS). Our empirical study shows that SGS can reach perplexity similar to CGS, much better than SVB. Our distributed version of SGS, DSGS, is much more scalable than SVB, mainly because its updates have small communication complexity.

1 Introduction

Topic models such as Latent Dirichlet Allocation (LDA) [1] have gained increasing attention. LDA provides an interpretable low-dimensional representation of documents, uncovering the latent topics of the corpus. The model has proven useful in many fields, such as natural language processing, information retrieval and recommendation systems [2, 3]. Companies such as Google [4], Yahoo! [5], and Tencent [6] have used the model extensively. LDA has gradually become a standard tool for analyzing documents from a semantic perspective.

With the Internet generating a huge amount of data each day, accommodating new data requires periodic retraining with traditional batch inference algorithms such as variational Bayes (VB) [1], collapsed Gibbs sampling (CGS) [7] or their variants [8, 9, 10], which can waste both computational and storage resources due to the need to recompute and store all historical data. It is better to learn the model incrementally, with online (streaming) algorithms. An online algorithm is one that learns the model in a single pass over the data and can analyze a test document at any time during learning.

Stochastic variational inference (SVI) is an offline stochastic learning algorithm that has enjoyed great experimental success on LDA [8]. Although online methods often perform worse than offline ones, streaming variational Bayes (SVB) [11] achieves the same performance as the offline SVI, which is impressive. However, these variational methods need to make unwarranted mean-field assumptions and require model-specific derivations. In contrast, Monte Carlo methods (e.g., CGS) are generally applicable and asymptotically converge to the target posterior. By exploiting the sparsity structure, CGS has been adopted in many scalable algorithms for LDA [12, 5]. However, attempts toward developing streaming Monte Carlo methods for LDA have had little success, often achieving worse performance than CGS. For instance, the perplexity of OLDA [13] on the NIPS dataset is much worse than that of CGS, and the particle filter approach [14] could only achieve 50% of CGS's performance in terms of normalized mutual information on a labeled corpus.

In this paper, we fill this gap by presenting a streaming Gibbs sampling (SGS) algorithm. SGS naturally extends collapsed Gibbs sampling to the online learning setting. We empirically verify that, with weight decay, SGS can reach inference quality similar to CGS. We further present a distributed version of SGS (DSGS) in order to deal with large-scale datasets on multiple compute nodes. Empirical results demonstrate that the distributed SGS achieves inference quality comparable to the non-distributed version, while scaling up dramatically.
Since DSGS can be implemented with sparse data structures, demanding much lower communication bandwidth, it is more scalable than SVB. Note that SGS without weight decay is the same as OLDA without "topic mixing", where "topic mixing" is the key component of OLDA [13]. Moreover, OLDA targets a different generative model, rather than the usual LDA model.

The rest of the paper is structured as follows. Section 2 presents the streaming Gibbs sampling (SGS) algorithm for LDA and shows its relationship with Conditional Density Filtering (CDF) [15]. We also propose distributed SGS (DSGS), which takes advantage of sparsity in sampling to handle very large datasets. Section 3 presents the experimental settings and results for SGS and DSGS, demonstrating their inference quality and speed. Section 4 concludes with a discussion of future work.

2 Streaming Gibbs Sampling for LDA

2.1 Basics of LDA

LDA [1] is a hierarchical Bayesian model that describes a generative process for the topic proportions of each document and the topic assignment of every position in the documents; the words in the documents are then sampled from distributions specified by the topic assignments.

Let D, K, V be the number of documents, topics and unique words, respectively. φ_k is a V-dimensional categorical distribution over words with symmetric Dirichlet prior β, and θ_d is the topic proportion for document d, with Dirichlet prior α. The generative process of LDA is as follows:

    φ_k ∼ Dir(β),   ∀d ∈ {1,...,D}: θ_d ∼ Dir(α),  z_di ∼ Mult(θ_d),  w_di ∼ Mult(φ_{z_di}),

where L_d is the length of document d, i ∈ {1,...,L_d}, and z_di, w_di are the topic assignment and word at position i of document d. Dir(·) and Mult(·) denote the Dirichlet and multinomial distributions. Denote by θ = (θ_1,...,θ_D) the matrix formed by all topic proportions, and likewise for Φ, Z, W.

Given a set of documents W, inferring the exact posterior distribution p(θ, Φ, Z | W) is intractable; we must resort to either variational approximation or Monte Carlo simulation. Among the various algorithms, collapsed Gibbs sampling (CGS) [7] is of particular interest due to its simplicity and sparsity. CGS exploits the conjugacy of the Dirichlet and multinomial distributions to integrate out (θ, Φ) from the posterior:

    p(z_di = k | Z^{-di}, W) ∝ (N_kd^{-di} + α) (N_{k,v_di}^{-di} + β) / (N_k^{-di} + Vβ),    (1)

where N_kd, N_kv are sufficient statistics (counts) for the Dirichlet-multinomial distribution: N_kd = Σ_{i=1}^{L_d} I(z_di = k), N_kv = Σ_d Σ_{i=1}^{L_d} I(w_di = v, z_di = k), and N_k = Σ_d N_kd = Σ_v N_kv. The superscript -di stands for excluding the token at position i of document d. N_kv, N_kd, N_k denote the matrices or vector formed by all the corresponding counts. Pseudocode for CGS is given as Alg. 1.

The sparsity of the sufficient statistic matrices N_kv, N_kd leads to many fast sparsity-aware algorithms [16, 17, 12]. This sparsity is also an important factor that makes our algorithm more scalable than SVB.

Algorithm 1: Collapsed Gibbs Sampling
    Input: data W, number of iterations N
    Initialize Z, N_kv, N_k, N_dk
    for iter = 1 to N do
        for each token z_di in the documents do
            Sample z_di ∼ p(z_di | Z^{-di}, W)
            Update N_kv, N_k, N_dk
    Output posterior mean:
        φ_kv = (N_kv + β) / (N_k + Vβ),   θ_dk = (N_dk + α) / (N_d + Kα)

Algorithm 2: Streaming Gibbs Sampling (SGS)
    Input: number of iterations N, decay factor λ
    for t = 1 to ∞ do
        Input: data W^t (the current mini-batch)
        Initialize Z^t and update N^t_kv, N^t_k, N^t_dk
        for iter = 1 to N do
            for each token z_di in the mini-batch do
                Sample z_di ∼ p(z_di | Z^{1:t}_{-di}, W^{1:t})
                Update N^t_kv, N^t_k, N^t_dk
        Decay: N^t_kv = λ N^t_kv
        Output posterior mean:
            φ^t_kv = (N^t_kv + β) / (N^t_k + Vβ),   θ^t_dk = (N^t_dk + α) / (N^t_d + Kα)

2.2 Streaming Gibbs Sampling

Given a Bayesian model P(x | Θ) with prior P(Θ), and incoming data mini-batches X^1, X^2, ..., X^t, ..., let X^{1:t} = {X^1,...,X^t}. Bayesian streaming learning is the process of obtaining a series of posterior distributions P(Θ | X^{1:t}) via the recurrence relation

    P(Θ | X^{1:t}) ∝ P(Θ | X^{1:t-1}) P(X^t | Θ).    (2)

Therefore, the posterior learnt from X^{1:t-1} is used as the prior when learning from X^t.
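To make Eq. (1) and its role in Algorithms 1 and 2 concrete, the following is a minimal Python/NumPy sketch of one collapsed Gibbs sweep and of the per-mini-batch SGS update, including the count decay. Variable names, the random initialization, and the float count arrays are illustrative assumptions; they are not the authors' C++ implementation.

```python
import numpy as np

def gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha, beta, V):
    """One sweep of the collapsed Gibbs update of Eq. (1).

    docs[d][i] is the word id at position i of document d and z[d][i] its
    current topic; n_kv, n_dk, n_k are the topic-word, document-topic and
    topic counts (for SGS, n_kv and n_k also carry the counts accumulated
    from earlier mini-batches, which act as the prior).
    """
    K = n_k.shape[0]
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k = z[d][i]
            # Remove the current token from the counts (the "-di" statistics).
            n_kv[k, v] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
            # Unnormalized conditional of Eq. (1), as a length-K vector.
            p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
            k = np.random.choice(K, p=p / p.sum())
            # Put the token back with its new assignment.
            z[d][i] = k
            n_kv[k, v] += 1; n_dk[d, k] += 1; n_k[k] += 1

def sgs_minibatch(docs, n_kv, n_k, alpha, beta, V, iters=20, decay=1.0):
    """Process one mini-batch W^t as in Alg. 2 (sketch only).

    The paper initializes Z^t with a "progressive online" scheme
    (Section 3.1); plain random initialization is used here for brevity.
    Counts are float arrays so the decay can be applied in place.
    """
    K = n_k.shape[0]
    n_dk = np.zeros((len(docs), K))
    z = [list(np.random.randint(K, size=len(doc))) for doc in docs]
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            n_kv[z[d][i], v] += 1; n_dk[d, z[d][i]] += 1; n_k[z[d][i]] += 1
    for _ in range(iters):
        gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha, beta, V)
    n_kv *= decay                 # Decay: N^t_kv = lambda * N^t_kv
    n_k[:] = n_kv.sum(axis=1)
    return n_kv, n_k
```

With a single "mini-batch" containing the whole corpus and λ = 1 this reduces to CGS (Alg. 1); feeding mini-batches in sequence with λ ≤ 1 gives SGS (Alg. 2).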
Note that the amount of data in a stream might be infinite, so a streaming learning algorithm can neither store all previous data, nor update the model at time t with a time complexity that grows even linearly in t. Ideally, the algorithm should have only constant storage and constant update complexity at each time step.

As depicted in Alg. 2, we propose a streaming Gibbs sampling (SGS) algorithm for LDA. SGS is an online extension of CGS: it fixes the topic assignments Z^{1:t-1} of the previously arrived document mini-batches, and then samples Z^t of the current mini-batch using the normal CGS update. This is in contrast with CGS, which can come back and refine some Z^t after it is first sampled. In fact, Z^{1:t-1} is not even stored in SGS; we store only the sufficient statistic N_kv.

One can understand SGS via the recurrence relation (2): without any data, the initial φ_k has prior parameter β^0_k = β; after incorporating mini-batch W^1, the parameter is updated to β^1_k = β^0_k + N^1_kv[k], which is used as the prior for subsequent mini-batches, where N^t_kv[k] is the k-th row of the matrix N^t_kv. In general, SGS updates the prior using the recurrence β^t_k = β^{t-1}_k + N^t_kv[k], where β^t_k is the prior for φ_k at time t.

The decay factor λ serves to forget the history. When the decay factor is plugged in, the update equation becomes β^t_k = λ(β^{t-1}_k + N^t_kv[k]); λ can thus be understood as weakening the posterior contributed by previous data. The decay factor improves the performance of SGS, especially when the topic-word distribution evolves over time.

SGS requires only constant memory: it stores just the current mini-batch W^t, Z^t, but not W^{1:t-1}, Z^{1:t-1} or W^{t+1:∞}, Z^{t+1:∞}. The total time complexity is the same as for CGS, namely O(KN |W^{1:t}|), where |W^{1:t}| is the number of tokens in mini-batches 1 to t. In practice, we use a smaller number of iterations than CGS, because SGS iterates over a smaller number of documents (one mini-batch) and thus converges faster.

2.3 Relation to Conditional Density Filtering

In this section, we consider the special case of SGS in which the decay factor λ is set to 1.0. We relate this special case to Conditional Density Filtering (CDF) [15] and show that SGS can be seen as an improved version of the CDF framework when applied to the LDA model.

2.3.1 Conditional Density Filtering

CDF is an algorithm that can sample from a sequence of gradually evolving distributions. Given a probabilistic model P(D^{1:t} | Θ), where Θ = (θ_1, ..., θ_k) is a k-dimensional parameter vector and D^{1:t} is the data observed so far, we define the Surrogate Conditional Sufficient Statistics (SCSS) as follows.

Definition 1 (SCSS). Assume p(θ_j | θ_{-j}, D^t) can be written as p(θ_j | θ_{-j,1}, h(D^t, θ_{-j,2})), where θ_{-j} = Θ \ θ_j, θ_{-j,1} and θ_{-j,2} are a partition of θ_{-j}, and h is some known function. If θ̂^t_{-j,2} is a consistent estimator of θ_{-j,2} at time t, then C^t_j = g(C^{t-1}_j, h(D^t, θ̂^t_{-j,2})) is defined as the SCSS of θ_j at time t, for some known function g. We use p(θ_j | θ_{-j,1}, C^t_j) to approximate p(θ_j | θ_{-j}, D^{1:t}).

SCSS is an extension of sufficient statistics (SS), in the sense that both summarize the historical observations sampled from a class of distributions. SS is exact and summarizes a whole distribution, whereas SCSS is approximate and only summarizes conditional distributions.

If the parameter set of a probabilistic model can be partitioned into two sets I_1 and I_2, where each parameter's SCSS only depends on the parameters in the other set, then we can use the CDF algorithm (Alg. 3) to infer the posterior of the parameters.
Algorithm 3: Conditional Density Filtering (CDF)
    for t = 1 to ∞ do
        for s ∈ {1, 2} do
            for j ∈ I_s do
                C^t_{js} = g(C^{t-1}_{js}, h(D^t, Θ̂_{-s}))
                Sample θ_j ∼ p(θ_j | θ_{-js}, C^t_{js})

Algorithm 5: CDF-LDA
    Initialize: Φ = Uniform(0,1), N̂^0_kv = 0
    for t = 1 to ∞ do
        Input: a single document w^t
        Initialize z^t
        SCSS of z: Φ̂^t = Φ
        for each token i in doc t do
            Sample z_ti ∼ p(z_ti | z^t_{-i}, Φ̂^t)
        SCSS of Φ: N̂^t_kv = N̂^{t-1}_kv + Σ_i I(z_ti = k ∧ w_ti = v)
        for k = 1 to K do
            Sample φ_k ∼ Dir(N̂^t_kv[k] + β)

Algorithm 4: Distributed SGS (DSGS)
    Input: number of iterations N, decay factor λ
    Initialize N_kv = 0
    for each mini-batch W^t at some worker do
        Copy global N_kv to local N^local_kv
        ΔN^local_kv = CGS(α, β + N^local_kv, W^t)
        Update global N_kv = λ(N_kv + ΔN^local_kv)

Under a semi-collapsed representation of LDA, where θ_d is collapsed, we can partition the parameters into two sets: I_1 = {φ_kv} and I_2 = {z_di}. The conditional distributions are

    p(Φ | Z, W) = Π_{k=1}^K Dir(φ_k | N_kv[k] + β),    p(z_di = k | z_d^{-di}, Φ, W) ∝ (N_kd^{-di} + α) φ_{k,v_di}.

By Definition 1, we can verify that the SCSS of Φ and z_d at time t are N_kv and Φ, respectively. Thus we obtain the CDF solution of LDA shown in Algorithm 5.

2.3.2 Relationship between SGS and CDF-LDA

Our SGS method can be viewed as an improved version of CDF-LDA in the following aspects:

• In CDF-LDA, Φ^t, which serves as the SCSS of z, is directly sampled from a Dirichlet distribution, which unnecessarily introduces an extra source of randomness. Replacing this sampling with the expectation φ̂^t_kv = (N̂^{t-1}_kv + β) / (N̂^{t-1}_k + Vβ) gives better performance in practice. This corresponds to a fully collapsed representation of LDA, which is statistically more efficient by the Rao-Blackwell theorem.

• CDF-LDA's sampling update of z_ti does not condition on the other tokens in the current document t; it can be improved by taking the current document into account. This is especially useful in the early iterations, because it allows the topic preferences of document t to be propagated immediately.

• It is hard to say how a single document decomposes into topics without looking at the other documents, yet this is exactly the situation at the beginning of CDF-LDA. It results in inaccurate z_ti assignments, which pollute N_kv and finally lead to a low convergence rate overall. Our method avoids this problem by processing a mini-batch of documents at a time and allowing multiple iterations over the mini-batch, which enables topic assignments to be propagated locally.

To sum up, SGS without weight decay can be seen as an improvement over CDF-LDA, not only by adopting a fully collapsed representation, but also by enabling timely and repeated cross-document information flow.

2.4 Distributed SGS

Many distributed CGS samplers exist for LDA, including divide-and-conquer [18], parameter server [5, 19] and model-parallel [15] approaches. Although some of them lack theoretical guarantees, they all perform well in practice. Here, we adopt the parameter-server approach and present a distributed SGS sampler.

As in CGS, the global parameter in SGS is the topic-word count matrix N_kv. SGS can be viewed as a sequence of calls to the CGS procedure,

    ΔN^t_kv = CGS(α, β + N^{t-1}_kv, W^t),

where the k-th row of β + N^{t-1}_kv is the prior of φ_k. The topic-word count matrix is then updated by N^t_kv = λ(N^{t-1}_kv + ΔN^t_kv).

In the parameter-server architecture, we store N_kv on a central server. Each worker fetches the most recent parameter N_kv, runs CGS(α, β + N^{t-1}_kv, W^t) to obtain the update ΔN^t_kv, and pushes the update back to the server. Upon receiving an update, the server updates its parameters. In our implementation, the workers run in a fully asynchronous (hogwild) fashion. Although workers may therefore see slightly stale parameters compared to the serial version, which can affect convergence, this hogwild approach is shown to work well in [18, 5] as well as in our experiments, owing to the fact that N_kv changes slowly.
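A rough sketch of the worker-side loop of Alg. 4 is given below, reusing the sgs_minibatch sketch from Section 2.2. The parameter-server client and its pull()/push() methods are assumptions for illustration only; the paper's own server is built on pyRpc/ZeroMQ (Section 3.1).

```python
def dsgs_worker(server, minibatch_stream, V, alpha, beta, iters=20):
    """Worker-side loop of Alg. 4 against a hypothetical parameter-server
    client: pull() returns the current global N_kv, push(delta) sends back
    this mini-batch's contribution. The decay is applied on the server."""
    for docs in minibatch_stream:          # one mini-batch W^t at a time
        n_kv_global = server.pull()        # copy global N_kv to local
        n_kv_local = n_kv_global.copy()    # CGS runs against beta + N_kv^local
        n_k_local = n_kv_local.sum(axis=1)
        # Reuse the sgs_minibatch sketch from Section 2.2; decay=1.0 here
        # because Alg. 4 applies lambda at the global update.
        sgs_minibatch(docs, n_kv_local, n_k_local, alpha, beta, V,
                      iters=iters, decay=1.0)
        delta = n_kv_local - n_kv_global   # Delta N_kv^local for this batch
        server.push(delta)                 # server: N_kv <- lambda*(N_kv + delta)
```

Because delta only touches the words that actually occur in the mini-batch, it can be sent as a sparse matrix, which is the source of DSGS's low communication cost.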
Better schemes such as stale synchronous parallel [20] could also be used; this is simply a matter of choosing a parameter-server implementation. Pseudocode for DSGS is given as Alg. 4.

In the experiments, we will see that this simple parallel framework scales SGS up almost linearly with negligible precision loss. Due to the sparseness of N_kv, distributed SGS (DSGS) has much less communication overhead between the master and workers, and is hence more scalable than SVB.

3 Experiments

We evaluate the inference quality and computational efficiency of SGS, and assess how parameters such as the mini-batch size, the decay factor and the number of iterations affect performance. We compare with the online variational Bayes approach SVB [11], which has been shown to reach inference quality similar to offline stochastic methods such as SVI [8].

3.1 Implementation Details

As Canini et al. [14] mention, initialization can have a big impact on the solution quality of these inference algorithms, so purely random initialization of Z is often not good. We therefore use a kind of "progressive online" initialization for SGS, CGS and SVB. Taking SGS as an example: in the first iteration over each mini-batch, for each document we sample z_d from the posterior distribution given everything seen up to the current point, then update the posterior with the current document and proceed to the next one. Such a "sampling from the posterior" initialization ensures that the starting state is reasonably good. We use a similar initialization for SVB and CGS.

All core implementations (sampling and computing the variational approximation) are in C++. We use Python and MATLAB wrappers for computationally inexpensive operations, such as timing. Our implementation of SVB is kept as similar as possible to SGS for a fair speed comparison; its speed is similar to that of the implementation in [11]. The experiments are run on a 3-node cluster, with each node equipped with 12 cores of Intel Xeon E5645 @ 2.40 GHz, 24 GB of memory and 1 Gbps Ethernet.

In the distributed experiments, the data W^t is pre-partitioned and stored separately on each node; in practice it could be stored in a distributed file system such as HDFS, or a publish-subscribe pattern could be used to handle streaming data. For simplicity, we implement our own parameter server using pyRpc, a simple remote procedure call (RPC) module built on ZeroMQ. We use a pipeline on the worker nodes to hide communication overhead. Parameters on the master are stored with atomic updates, and model replicas are kept on the master to ensure high availability while updates are applied. For production use, existing high-performance parameter servers such as [19] could be employed for better performance and scalability. Again, the system layer is orthogonal to the algorithm and is not the main focus of this paper.

3.2 Setups

In the following experiments, the hyper-parameters α and β are set to 0.1 and 0.03 respectively, and the number of topics K is set to 50. Different settings yield the same conclusions, so we stick to this setting for simplicity. Multiple random restarts of SGS and SVB do not result in significant differences. Unless otherwise noted, for each mini-batch both SGS and SVB are run until convergence. Specifically, SGS stops iterating when the training perplexity on the current mini-batch has not improved for 10 consecutive iterations, or when it reaches a maximum of 400 iterations. SVB stops the inner iteration when ||θ_d^old - θ_d^new||_1 / K < 10^-5 or after a maximum of 100 iterations, and stops the outer iteration when ||φ^new - φ^old||_1 / (KV) < 10^-3.
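The per-mini-batch stopping rule for SGS is simple to state in code; the sketch below mirrors it. The callback names and the helper structure are illustrative assumptions, not the authors' implementation.

```python
def run_minibatch_until_converged(sweep, train_perplexity,
                                  patience=10, max_iters=400):
    """Iterate Gibbs sweeps on the current mini-batch until the training
    perplexity has not improved for `patience` consecutive iterations or
    `max_iters` is reached (the SGS criterion described above).

    `sweep` runs one Gibbs sweep over the mini-batch; `train_perplexity`
    evaluates the current mini-batch. Both are assumed callbacks.
    """
    best = float("inf")
    stalled = 0
    for it in range(1, max_iters + 1):
        sweep()
        perp = train_perplexity()
        if perp < best:
            best, stalled = perp, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return it, best
```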
The predictive performance of LDA can be measured by the probability it assigns to held-out documents; this is formalized by perplexity. First we partition each dataset and use 80% for training and 20% for testing. Let M denote the model trained on the training data. Given a held-out document w_d, we infer θ_d from the first half of the tokens in d, and then calculate the exponentiated negative average of the log probability. Formally,

    per(w_d | M) = exp{ - (Σ_i log p(w_di | M)) / |w_d| },

where log p(w_di | M) = log Σ_k φ_{k,v_di} θ_dk.

To compare the performance of the algorithms in various settings, we test them on three datasets.¹ The small NIPS dataset contains the articles published at the Neural Information Processing Systems (NIPS) conferences from 1988 to 2000. A larger dataset consists of news articles from the New York Times (NYT), and the largest one contains the abstracts of publications on PubMed [21]. Table 1 gives the detailed statistics of each dataset.

¹ All three datasets can be downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Table 1: Dataset statistics, where K and M stand for thousand and million respectively.

              #Docs    #Tokens    Vocab-Size
    NIPS      1740     2M         13K
    NYT       300K     100M       100K
    PubMed    8.2M     730M       141K

3.3 Results for various mini-batch sizes

We run SGS and SVB on the NIPS and NYT datasets, varying the mini-batch size. To simplify the comparison, we set the decay factor for SGS to λ = 1. Fig. 1(a) and Fig. 1(b) show that SGS consistently performs better than SVB, especially on the bigger dataset and with smaller mini-batch sizes.

[Figure 1: The perplexities of SVB, SGS and CGS on the small NIPS (a) and large NYT (b) datasets, plotted against mini-batch size.]

[Figure 2: (a, b) The change of SGS's perplexity w.r.t. the decay factor λ on the NIPS and NYT datasets; (c) the trend of perplexity as new mini-batches arrive, mini-batch size = 3200.]

The difference in the performance gap between SGS and SVB on the NYT and NIPS datasets can be understood as reflecting different levels of redundancy. To learn effectively in a streaming setting, the dataset must be redundant. There are only 1740 documents in the NIPS corpus, which is far from redundant, and thus the different streaming algorithms perform alike on this dataset. From this trend with respect to dataset size, we can expect SGS to perform even better than SVB on larger datasets. Another interesting phenomenon is that SVB is more sensitive to the mini-batch size than SGS. The inherent ability of SGS to perform much better than SVB on smaller mini-batches is an important advantage in practice.

Note that SGS is equivalent to CGS when the mini-batch size equals the whole training-set size. The green horizontal dashed lines in Fig. 1(a) and 1(b) mark the perplexity of CGS. We can see that on the large NYT dataset, SGS achieves a large improvement over SVB when compared to this best-achievable perplexity.

3.4 Results for different decay factors

We also investigate how the decay factor λ affects the performance of SGS on the different datasets. The results are similar to those above: the larger dataset shows clearer trends.
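Before turning to the figures, it may help to see numerically how λ discounts history. Under the update N^t_kv = λ(N^{t-1}_kv + ΔN^t_kv) of Alg. 2, a mini-batch that arrived s steps ago survives in the current counts with weight λ^{s+1}. The toy sketch below, a purely illustrative assumption rather than part of the experiments, just unrolls this recurrence.

```python
import numpy as np

def effective_weights(lam, num_batches):
    """Weight with which each past mini-batch's counts survive in N_kv after
    repeatedly applying N_t = lam * (N_{t-1} + dN_t), as in Alg. 2."""
    weights = np.zeros(num_batches)
    for t in range(num_batches):
        weights[t] = 1.0          # mini-batch t arrives with its raw counts
        weights[:t + 1] *= lam    # the decay then scales everything so far
    return weights                # equals lam ** (num_batches - j) for batch j

# lam = 0.7, five mini-batches, oldest first:
print(effective_weights(0.7, 5))  # ~ [0.168, 0.240, 0.343, 0.490, 0.700]
```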
As shown in Fig. 2(a) and 2(b), when the mini-batch is too small, the decay factor has a negative effect. This is probably because a small mini-batch can only learn a limited amount of knowledge in a single round, so forgetting that knowledge is not desirable. When the mini-batch size gets bigger, the decay factor improves the performance, and the optimal value of λ gets smaller. This can be explained in a similar manner: bigger mini-batches learn more, and the next mini-batch has a greater discrepancy with the current one.

Let us examine a specific setting where the batch size on the NYT dataset is set to 3200. SVB yields a perplexity of 6511, while SGS without decay reaches 5240. After applying a decay factor of 0.7, SGS yields a perplexity of 4640, which is close to the batch perplexity of 4300. We conclude that, if the decay factor is set properly, SGS is much better than SVB and can almost reach the same precision as its batch counterpart.

3.5 Learning Process

The mean perplexity of each mini-batch is not a full description of the learning process; we should also consider how the inference quality evolves as new mini-batches arrive. In Fig. 2(c), we partition each mini-batch into training and testing sets, and plot the testing perplexity of each mini-batch for SVB and SGS. We can clearly see that decayed SGS is strictly better than the non-decayed version, and that the latter outperforms SVB. In other words, SGS consistently learns a better model every day. All three models show a perplexity burst over a few initial mini-batches because they over-fit the first few mini-batches.

3.6 Computational Efficiency

[Figure 3: Computational efficiency of SGS, SVB and CGS. (a) Convergence of SGS and SVB on a mini-batch (perplexity vs. wall-clock time); (b) time consumption of SGS, SVB and CGS for various batch sizes.]

Since online algorithms usually run over a long time span, months or years, they face fewer computational challenges than their offline counterparts. Still, we would like the online inference method not to use excessive computational resources. In this section, we compare the computational efficiency of SGS, SVB and CGS. Since SGS and SVB have outer loops that process arriving mini-batches and inner loops (iterations), we investigate the time per mini-batch and the time on the whole dataset separately.

In Fig. 3(a), we run SGS and SVB through some initial mini-batches and then investigate their convergence on the same intermediate mini-batch. The perplexity on held-out documents is plotted against time. SGS starts from a better point than SVB because of its better results on previous mini-batches; furthermore, SGS also converges faster than SVB.

In Fig. 3(b), both SGS and SVB are run until convergence, using the criteria stated in Section 3.2. We can see that SVB and SGS converge within a similar time. Also, since the online version searches over a smaller number of configurations, the smaller the mini-batch size is, the faster it converges. SGS can be faster than CGS with smaller mini-batch sizes.

3.7 Distributed Experiments

In this section we compare distributed SGS and SVB² on the NYT dataset, and examine the scalability of DSGS on the larger PubMed dataset. Comparing DSGS with SVB in Fig. 4(a), we conclude that although the perplexity of DSGS gets worse as the number of cores increases, it still consistently outperforms SVB. Fig. 4(b) shows the throughput of DSGS and SVB in tokens per second. Since the topic-word assignment update of DSGS is sparse while the corresponding variational parameter update of SVB is dense, the speedup of DSGS is much better than that of SVB. Fig. 4(c) and 4(d) show the scalability results on the larger PubMed dataset.
In general, we can conclude that DSGS enjoys a nice speedup while retaining a similar level of perplexity.

² SVB here refers to both the distributed and the single-threaded variants.

[Figure 4: Scalability results. (a) Perplexity of DSGS and SVB on NYT; (b) tokens per second of DSGS and SVB; (c) perplexity of DSGS on the PubMed dataset; (d) time consumption of DSGS on PubMed, for various numbers of threads and batch sizes.]

4 Discussion

We have developed a streaming Gibbs sampling algorithm (SGS) for the LDA model. Our method can be seen as an online extension of the collapsed Gibbs sampling approach. Our experimental results demonstrate that SGS improves perplexity over previous online methods, while maintaining a similar computational burden. We have also shown that SGS can be parallelized well using techniques similar to those adopted for SVB.

In the future, SGS can be further improved by making the decay factor λ, the mini-batch size and the number of iterations for each document self-evolving as more data is fed into the system. Intuitively, the algorithm learns fast at the beginning and slows down later on; for example, it might be tempting to decrease the iteration count for each document toward some constant over time. The scheme for this evolution deserves future research and needs strong theoretical guidance.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. "Latent Dirichlet Allocation". In: Journal of Machine Learning Research 3 (2003), pp. 993-1022.
[2] J. Mitchell and M. Lapata. "Vector-based Models of Semantic Composition". In: Annual Meeting of the Association for Computational Linguistics (ACL). 2008.
[3] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. "Bad News Travel Fast: A Content-Based Analysis of Interestingness on Twitter". In: Proceedings of the International Web Science Conference. 2011.
[4] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. "PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing". In: Transactions on Intelligent Systems and Technology (TIST) (2011).
[5] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. "Scalable Inference in Latent Variable Models". In: International Conference on Web Search and Data Mining (WSDM). ACM, 2012.
[6] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, C. Law, and J. Zeng. "Peacock: Learning Long-Tail Topic Features for Industrial Applications". In: arXiv:1405.4402 (2014).
[7] T. L. Griffiths and M. Steyvers. "Finding Scientific Topics". In: Proceedings of the National Academy of Sciences of the United States of America 101.Suppl 1 (2004), pp. 5228-5235.
[8] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. "Stochastic Variational Inference". In: Journal of Machine Learning Research 14.1 (2013), pp. 1303-1347.
[9] J. Foulds, L. Boyles, C. DuBois, P. Smyth, and M. Welling. "Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation". In: International Conference on Knowledge Discovery and Data Mining (SIGKDD). 2013.
[10] S. Patterson and Y. W. Teh. "Stochastic gradient Riemannian Langevin dynamics on the probability simplex". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[11] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. Jordan. "Streaming Variational Bayes". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[12] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma.
    "LightLDA: Big Topic Models on Modest Compute Clusters". In: arXiv:1412.1576 (2014).
[13] L. AlSumait, D. Barbará, and C. Domeniconi. "On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking". In: International Conference on Data Mining (ICDM). 2008.
[14] K. R. Canini, L. Shi, and T. L. Griffiths. "Online inference of topics with latent Dirichlet allocation". In: International Conference on Artificial Intelligence and Statistics (AISTATS). 2009.
[15] R. Guhaniyogi, S. Qamar, and D. B. Dunson. "Bayesian Conditional Density Filtering for Big Data". In: arXiv:1401.3632 (2014).
[16] L. Yao, D. Mimno, and A. McCallum. "Efficient methods for topic model inference on streaming document collections". In: International Conference on Knowledge Discovery and Data Mining (SIGKDD). 2009.
[17] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. "Reducing the sampling complexity of topic models". In: International Conference on Knowledge Discovery and Data Mining (SIGKDD). 2014.
[18] D. Newman, A. U. Asuncion, P. Smyth, and M. Welling. "Distributed inference for latent Dirichlet allocation". In: Advances in Neural Information Processing Systems (NIPS). 2007.
[19] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. "Scaling distributed machine learning with the parameter server". In: Operating Systems Design and Implementation (OSDI). 2014.
[20] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. "More effective distributed ML via a stale synchronous parallel parameter server". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[21] K. Bache and M. Lichman. UCI Machine Learning Repository - Bag of Words Data Set. 2013. URL: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.