Streaming Gibbs Sampling for LDA Model

Yang Gao
Electrical Engineering and Computer Science
University of California, Berkeley, Berkeley, CA 94710
[email protected]

Jianfei Chen, Jun Zhu
Dept. of Comp. Sci & Tech; TNList Lab; State Key Lab of Intell. Tech & Sys.
Tsinghua University, Beijing, 100084, China
[email protected]  [email protected]

arXiv:1601.01142v1 [cs.LG] 6 Jan 2016

Abstract

Streaming variational Bayes (SVB) is successful in learning LDA models in an online manner. However, previous attempts toward developing online Monte Carlo methods for LDA have had little success, often yielding much worse perplexity than their batch counterparts. We present a streaming Gibbs sampling (SGS) method, an online extension of collapsed Gibbs sampling (CGS). Our empirical study shows that SGS can reach perplexity similar to CGS, much better than SVB. Our distributed version of SGS, DSGS, is much more scalable than SVB, mainly because its updates have small communication complexity.

1 Introduction

Topic models such as Latent Dirichlet Allocation (LDA) [1] have gained increasing attention. LDA provides an interpretable low-dimensional representation of documents, uncovering the latent topics of the corpus. The model has proven useful in many fields, such as natural language processing, information retrieval and recommendation systems [2, 3]. Companies such as Google [4], Yahoo! [5], and Tencent [6] have used the model extensively. LDA has gradually become a standard tool for analyzing documents from a semantic perspective.

With the Internet generating a huge amount of data each day, accommodating new data requires periodic retraining with traditional batch inference algorithms such as variational Bayes (VB) [1], collapsed Gibbs sampling (CGS) [7] or their variants [8, 9, 10], which can waste both computational and storage resources due to the need to recompute and store all historical data. It is better to learn the model incrementally, with online (streaming) algorithms. An online algorithm is one that learns the model in a single pass over the data and can analyze a test document at any time during learning.

Stochastic variational inference (SVI) is an offline stochastic learning algorithm that has enjoyed great experimental success on LDA [8]. Although online methods often perform worse than offline ones, streaming variational Bayes (SVB) [11] achieves the same performance as the offline SVI, which is impressive. However, these variational methods need to make unwarranted mean-field assumptions and require model-specific derivations. In contrast, Monte Carlo methods (e.g., CGS) are generally applicable and asymptotically converge to the target posterior. By exploiting the sparsity structure, CGS has been adopted in many scalable algorithms for LDA [12, 5]. However, attempts toward developing streaming Monte Carlo methods for LDA have had little success, often achieving worse performance than CGS. For instance, the perplexity of OLDA [13] on the NIPS dataset is much worse than that of CGS, and the particle filter approach [14] could only achieve 50% of CGS's performance in terms of normalized mutual information on a labeled corpus.

In this paper, we fill this gap by presenting a streaming Gibbs sampling (SGS) algorithm. SGS naturally extends collapsed Gibbs sampling to the online learning setting. We empirically verify that, with weight decay, SGS can reach inference quality similar to CGS. We further present a distributed version of SGS (DSGS) in order to deal with large-scale datasets on multiple compute nodes. Empirical results demonstrate that the distributed SGS achieves inference quality comparable to the non-distributed version, while scaling up dramatically.
Since DSGS can be implemented with sparse data structures, demanding much lower communication bandwidth, it is more scalable than SVB. Note that SGS without weight decay is the same as OLDA without "topic mixing", where "topic mixing" is the key component of OLDA [13]. Moreover, OLDA targets a different generative model, rather than the usual LDA model.

The rest of the paper is structured as follows. Section 2 presents the streaming Gibbs sampling (SGS) algorithm for LDA and shows its relationship with Conditional Density Filtering (CDF) [15]. We also propose distributed SGS (DSGS), which takes advantage of sparsity in sampling to handle very large datasets. Section 3 presents the experimental settings and results for SGS and DSGS, demonstrating their inference quality and speed. Section 4 concludes with a discussion of future work.

2 Streaming Gibbs Sampling for LDA

2.1 Basics of LDA

LDA [1] is a hierarchical Bayesian model that describes a generative process for the topic proportions of each document and the topic assignment of every position in the documents; the words in the documents are then sampled from distributions specified by the topic assignments.

Let D, K, V be the number of documents, topics and unique words, respectively. φ_k is a V-dimensional categorical distribution over words with symmetric Dirichlet prior β, and θ_d is the topic proportion for document d, with Dirichlet prior α. The generative process of LDA is as follows:

    φ_k ∼ Dir(β),   ∀d ∈ {1,...,D}: θ_d ∼ Dir(α),  z_di ∼ Mult(θ_d),  w_di ∼ Mult(φ_{z_di}),

where L_d is the length of document d, i ∈ {1,...,L_d}, and z_di, w_di are the topic assignment and word at position i of document d. Dir(·) and Mult(·) denote the Dirichlet and multinomial distributions. Denote by θ = (θ_1,...,θ_D) the matrix formed by all topic proportions, and likewise for Φ, Z, W.

Given a set of documents W, inferring the exact posterior distribution p(θ, Φ, Z | W) is intractable; we must resort to either variational approximation or Monte Carlo simulation. Among the various algorithms, collapsed Gibbs sampling (CGS) [7] is of particular interest due to its simplicity and sparsity. CGS exploits the conjugacy of the Dirichlet and multinomial distributions to integrate out (θ, Φ) from the posterior:

    p(z_di = k | Z^{-di}, W) ∝ (N_kd^{-di} + α) (N_{k,v_di}^{-di} + β) / (N_k^{-di} + Vβ),    (1)

where N_kd, N_kv are sufficient statistics (counts) for the Dirichlet-multinomial distribution: N_kd = Σ_{i=1}^{L_d} I(z_di = k), N_kv = Σ_d Σ_{i=1}^{L_d} I(w_di = v, z_di = k), and N_k = Σ_d N_kd = Σ_v N_kv. The superscript -di stands for excluding the token at position i of document d. N_kv, N_kd, N_k denote the matrices or vector formed by all the corresponding counts. Pseudocode for CGS is given as Alg. 1.

The sparsity of the sufficient statistic matrices N_kv, N_kd leads to many fast sparsity-aware algorithms [16, 17, 12]. This sparsity is also an important factor that makes our algorithm more scalable than SVB.

Algorithm 1: Collapsed Gibbs Sampling
    Input: data W, number of iterations N
    Initialize Z, N_kv, N_k, N_dk
    for iter = 1 to N do
        for each token z_di in the documents do
            Sample z_di ∼ p(z_di | Z^{-di}, W)
            Update N_kv, N_k, N_dk
    Output posterior mean:
        φ_kv = (N_kv + β) / (N_k + Vβ),   θ_dk = (N_dk + α) / (N_d + Kα)

Algorithm 2: Streaming Gibbs Sampling (SGS)
    Input: number of iterations N, decay factor λ
    for t = 1 to ∞ do
        Input: data W^t (the current mini-batch)
        Initialize Z^t and update N^t_kv, N^t_k, N^t_dk
        for iter = 1 to N do
            for each token z_di in the mini-batch do
                Sample z_di ∼ p(z_di | Z^{1:t}_{-di}, W^{1:t})
                Update N^t_kv, N^t_k, N^t_dk
        Decay: N^t_kv = λ N^t_kv
        Output posterior mean:
            φ^t_kv = (N^t_kv + β) / (N^t_k + Vβ),   θ^t_dk = (N^t_dk + α) / (N^t_d + Kα)

2.2 Streaming Gibbs Sampling

Given a Bayesian model P(x | Θ) with prior P(Θ), and incoming data mini-batches X^1, X^2, ..., X^t, ..., let X^{1:t} = {X^1,...,X^t}. Bayesian streaming learning is the process of obtaining a series of posterior distributions P(Θ | X^{1:t}) via the recurrence relation

    P(Θ | X^{1:t}) ∝ P(Θ | X^{1:t-1}) P(X^t | Θ).    (2)

Therefore, the posterior learnt from X^{1:t-1} is used as the prior when learning from X^t.
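To make Eq. (1) and its role in Algorithms 1 and 2 concrete, the following is a minimal Python/NumPy sketch of one collapsed Gibbs sweep and of the per-mini-batch SGS update, including the count decay. Variable names, the random initialization, and the float count arrays are illustrative assumptions; they are not the authors' C++ implementation.

```python
import numpy as np

def gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha, beta, V):
    """One sweep of the collapsed Gibbs update of Eq. (1).

    docs[d][i] is the word id at position i of document d and z[d][i] its
    current topic; n_kv, n_dk, n_k are the topic-word, document-topic and
    topic counts (for SGS, n_kv and n_k also carry the counts accumulated
    from earlier mini-batches, which act as the prior).
    """
    K = n_k.shape[0]
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k = z[d][i]
            # Remove the current token from the counts (the "-di" statistics).
            n_kv[k, v] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
            # Unnormalized conditional of Eq. (1), as a length-K vector.
            p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
            k = np.random.choice(K, p=p / p.sum())
            # Put the token back with its new assignment.
            z[d][i] = k
            n_kv[k, v] += 1; n_dk[d, k] += 1; n_k[k] += 1

def sgs_minibatch(docs, n_kv, n_k, alpha, beta, V, iters=20, decay=1.0):
    """Process one mini-batch W^t as in Alg. 2 (sketch only).

    The paper initializes Z^t with a "progressive online" scheme
    (Section 3.1); plain random initialization is used here for brevity.
    Counts are float arrays so the decay can be applied in place.
    """
    K = n_k.shape[0]
    n_dk = np.zeros((len(docs), K))
    z = [list(np.random.randint(K, size=len(doc))) for doc in docs]
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            n_kv[z[d][i], v] += 1; n_dk[d, z[d][i]] += 1; n_k[z[d][i]] += 1
    for _ in range(iters):
        gibbs_sweep(docs, z, n_kv, n_dk, n_k, alpha, beta, V)
    n_kv *= decay                 # Decay: N^t_kv = lambda * N^t_kv
    n_k[:] = n_kv.sum(axis=1)
    return n_kv, n_k
```

With a single "mini-batch" containing the whole corpus and λ = 1 this reduces to CGS (Alg. 1); feeding mini-batches in sequence with λ ≤ 1 gives SGS (Alg. 2).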
Note that the amount of data in a stream might be infinite, so a streaming learning algorithm can neither store all previous data, nor update the model at time t with a time complexity that grows even linearly in t. Ideally, the algorithm should have only constant storage and constant update complexity at each time step.

As depicted in Alg. 2, we propose a streaming Gibbs sampling (SGS) algorithm for LDA. SGS is an online extension of CGS: it fixes the topic assignments Z^{1:t-1} of the previously arrived document mini-batches, and then samples Z^t of the current mini-batch using the normal CGS update. This is in contrast with CGS, which can come back and refine some Z^t after it is first sampled. In fact, Z^{1:t-1} is not even stored in SGS; we store only the sufficient statistic N_kv.

One can understand SGS via the recurrence relation (2): without any data, the initial φ_k has prior parameter β^0_k = β; after incorporating mini-batch W^1, the parameter is updated to β^1_k = β^0_k + N^1_kv[k], which is used as the prior for subsequent mini-batches, where N^t_kv[k] is the k-th row of the matrix N^t_kv. In general, SGS updates the prior using the recurrence β^t_k = β^{t-1}_k + N^t_kv[k], where β^t_k is the prior for φ_k at time t.

The decay factor λ serves to forget the history. When the decay factor is plugged in, the update equation becomes β^t_k = λ(β^{t-1}_k + N^t_kv[k]); λ can thus be understood as weakening the posterior contributed by previous data. The decay factor improves the performance of SGS, especially when the topic-word distribution evolves over time.

SGS requires only constant memory: it stores just the current mini-batch W^t, Z^t, but not W^{1:t-1}, Z^{1:t-1} or W^{t+1:∞}, Z^{t+1:∞}. The total time complexity is the same as for CGS, namely O(KN |W^{1:t}|), where |W^{1:t}| is the number of tokens in mini-batches 1 to t. In practice, we use a smaller number of iterations than CGS, because SGS iterates over a smaller number of documents (one mini-batch) and thus converges faster.

2.3 Relation to Conditional Density Filtering

In this section, we consider the special case of SGS in which the decay factor λ is set to 1.0. We relate this special case to Conditional Density Filtering (CDF) [15] and show that SGS can be seen as an improved version of the CDF framework when applied to the LDA model.

2.3.1 Conditional Density Filtering

CDF is an algorithm that can sample from a sequence of gradually evolving distributions. Given a probabilistic model P(D^{1:t} | Θ), where Θ = (θ_1, ..., θ_k) is a k-dimensional parameter vector and D^{1:t} is the data observed so far, we define the Surrogate Conditional Sufficient Statistics (SCSS) as follows.

Definition 1 (SCSS). Assume p(θ_j | θ_{-j}, D^t) can be written as p(θ_j | θ_{-j,1}, h(D^t, θ_{-j,2})), where θ_{-j} = Θ \ θ_j, θ_{-j,1} and θ_{-j,2} are a partition of θ_{-j}, and h is some known function. If θ̂^t_{-j,2} is a consistent estimator of θ_{-j,2} at time t, then C^t_j = g(C^{t-1}_j, h(D^t, θ̂^t_{-j,2})) is defined as the SCSS of θ_j at time t, for some known function g. We use p(θ_j | θ_{-j,1}, C^t_j) to approximate p(θ_j | θ_{-j}, D^{1:t}).

SCSS is an extension of sufficient statistics (SS), in the sense that both summarize the historical observations sampled from a class of distributions. SS is exact and summarizes a whole distribution, whereas SCSS is approximate and only summarizes conditional distributions.

If the parameter set of a probabilistic model can be partitioned into two sets I_1 and I_2, where each parameter's SCSS only depends on the parameters in the other set, then we can use the CDF algorithm (Alg. 3) to infer the posterior of the parameters.
Algorithm 3: Conditional Density Filtering (CDF)
    for t = 1 to ∞ do
        for s ∈ {1, 2} do
            for j ∈ I_s do
                C^t_{js} = g(C^{t-1}_{js}, h(D^t, Θ̂_{-s}))
                Sample θ_j ∼ p(θ_j | θ_{-js}, C^t_{js})

Algorithm 5: CDF-LDA
    Initialize: Φ = Uniform(0,1), N̂^0_kv = 0
    for t = 1 to ∞ do
        Input: a single document w^t
        Initialize z^t
        SCSS of z: Φ̂^t = Φ
        for each token i in doc t do
            Sample z_ti ∼ p(z_ti | z^t_{-i}, Φ̂^t)
        SCSS of Φ: N̂^t_kv = N̂^{t-1}_kv + Σ_i I(z_ti = k ∧ w_ti = v)
        for k = 1 to K do
            Sample φ_k ∼ Dir(N̂^t_kv[k] + β)

Algorithm 4: Distributed SGS (DSGS)
    Input: number of iterations N, decay factor λ
    Initialize N_kv = 0
    for each mini-batch W^t at some worker do
        Copy global N_kv to local N^local_kv
        ΔN^local_kv = CGS(α, β + N^local_kv, W^t)
        Update global N_kv = λ(N_kv + ΔN^local_kv)

Under a semi-collapsed representation of LDA, where θ_d is collapsed, we can partition the parameters into two sets: I_1 = {φ_kv} and I_2 = {z_di}. The conditional distributions are

    p(Φ | Z, W) = Π_{k=1}^K Dir(φ_k | N_kv[k] + β),    p(z_di = k | z_d^{-di}, Φ, W) ∝ (N_kd^{-di} + α) φ_{k,v_di}.

By Definition 1, we can verify that the SCSS of Φ and z_d at time t are N_kv and Φ, respectively. Thus we obtain the CDF solution of LDA shown in Algorithm 5.

2.3.2 Relationship between SGS and CDF-LDA

Our SGS method can be viewed as an improved version of CDF-LDA in the following aspects:

• In CDF-LDA, Φ^t, which serves as the SCSS of z, is directly sampled from a Dirichlet distribution, which unnecessarily introduces an extra source of randomness. Replacing this sampling with the expectation φ̂^t_kv = (N̂^{t-1}_kv + β) / (N̂^{t-1}_k + Vβ) gives better performance in practice. This corresponds to a fully collapsed representation of LDA, which is statistically more efficient by the Rao-Blackwell theorem.

• CDF-LDA's sampling update of z_ti does not condition on the other tokens in the current document t; it can be improved by taking the current document into account. This is especially useful in the early iterations, because it allows the topic preferences of document t to be propagated immediately.

• It is hard to say how a single document decomposes into topics without looking at the other documents, yet this is exactly the situation at the beginning of CDF-LDA. It results in inaccurate z_ti assignments, which pollute N_kv and finally lead to a low convergence rate overall. Our method avoids this problem by processing a mini-batch of documents at a time and allowing multiple iterations over the mini-batch, which enables topic assignments to be propagated locally.

To sum up, SGS without weight decay can be seen as an improvement over CDF-LDA, not only by adopting a fully collapsed representation, but also by enabling timely and repeated cross-document information flow.

2.4 Distributed SGS

Many distributed CGS samplers exist for LDA, including divide-and-conquer [18], parameter server [5, 19] and model-parallel [15] approaches. Although some of them lack theoretical guarantees, they all perform well in practice. Here, we adopt the parameter-server approach and present a distributed SGS sampler.

As in CGS, the global parameter in SGS is the topic-word count matrix N_kv. SGS can be viewed as a sequence of calls to the CGS procedure,

    ΔN^t_kv = CGS(α, β + N^{t-1}_kv, W^t),

where the k-th row of β + N^{t-1}_kv is the prior of φ_k. The topic-word count matrix is then updated by N^t_kv = λ(N^{t-1}_kv + ΔN^t_kv).

In the parameter-server architecture, we store N_kv on a central server. Each worker fetches the most recent parameter N_kv, runs CGS(α, β + N^{t-1}_kv, W^t) to obtain the update ΔN^t_kv, and pushes the update back to the server. Upon receiving an update, the server updates its parameters. In our implementation, the workers run in a fully asynchronous (hogwild) fashion. Although workers may therefore see slightly stale parameters compared to the serial version, which can affect convergence, this hogwild approach is shown to work well in [18, 5] as well as in our experiments, owing to the fact that N_kv changes slowly.
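A rough sketch of the worker-side loop of Alg. 4 is given below, reusing the sgs_minibatch sketch from Section 2.2. The parameter-server client and its pull()/push() methods are assumptions for illustration only; the paper's own server is built on pyRpc/ZeroMQ (Section 3.1).

```python
def dsgs_worker(server, minibatch_stream, V, alpha, beta, iters=20):
    """Worker-side loop of Alg. 4 against a hypothetical parameter-server
    client: pull() returns the current global N_kv, push(delta) sends back
    this mini-batch's contribution. The decay is applied on the server."""
    for docs in minibatch_stream:          # one mini-batch W^t at a time
        n_kv_global = server.pull()        # copy global N_kv to local
        n_kv_local = n_kv_global.copy()    # CGS runs against beta + N_kv^local
        n_k_local = n_kv_local.sum(axis=1)
        # Reuse the sgs_minibatch sketch from Section 2.2; decay=1.0 here
        # because Alg. 4 applies lambda at the global update.
        sgs_minibatch(docs, n_kv_local, n_k_local, alpha, beta, V,
                      iters=iters, decay=1.0)
        delta = n_kv_local - n_kv_global   # Delta N_kv^local for this batch
        server.push(delta)                 # server: N_kv <- lambda*(N_kv + delta)
```

Because delta only touches the words that actually occur in the mini-batch, it can be sent as a sparse matrix, which is the source of DSGS's low communication cost.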
Better schemes such as stale synchronous parallel [20] could also be used; this is simply a matter of choosing a parameter-server implementation. Pseudocode for DSGS is given as Alg. 4.

In the experiments, we will see that this simple parallel framework scales SGS up almost linearly with negligible precision loss. Due to the sparseness of N_kv, distributed SGS (DSGS) has much less communication overhead between the master and workers, and is hence more scalable than SVB.

3 Experiments

We evaluate the inference quality and computational efficiency of SGS, and assess how parameters such as the mini-batch size, the decay factor and the number of iterations affect performance. We compare with the online variational Bayes approach SVB [11], which has been shown to reach inference quality similar to offline stochastic methods such as SVI [8].

3.1 Implementation Details

As Canini et al. [14] mention, initialization can have a big impact on the solution quality of these inference algorithms, so purely random initialization of Z is often not good. We therefore use a kind of "progressive online" initialization for SGS, CGS and SVB. Taking SGS as an example: in the first iteration over each mini-batch, for each document we sample z_d from the posterior distribution given everything seen up to the current point, then update the posterior with the current document and proceed to the next one. Such a "sampling from the posterior" initialization ensures that the starting state is reasonably good. We use a similar initialization for SVB and CGS.

All core implementations (sampling and computing the variational approximation) are in C++. We use Python and MATLAB wrappers for computationally inexpensive operations, such as timing. Our implementation of SVB is kept as similar as possible to SGS for a fair speed comparison; its speed is similar to that of the implementation in [11]. The experiments are run on a 3-node cluster, with each node equipped with 12 cores of Intel Xeon E5645 @ 2.40 GHz, 24 GB of memory and 1 Gbps Ethernet.

In the distributed experiments, the data W^t is pre-partitioned and stored separately on each node; in practice it could be stored in a distributed file system such as HDFS, or a publish-subscribe pattern could be used to handle streaming data. For simplicity, we implement our own parameter server using pyRpc, a simple remote procedure call (RPC) module built on ZeroMQ. We use a pipeline on the worker nodes to hide communication overhead. Parameters on the master are stored with atomic updates, and model replicas are kept on the master to ensure high availability while updates are applied. For production use, existing high-performance parameter servers such as [19] could be employed for better performance and scalability. Again, the system layer is orthogonal to the algorithm and is not the main focus of this paper.

3.2 Setups

In the following experiments, the hyper-parameters α and β are set to 0.1 and 0.03 respectively, and the number of topics K is set to 50. Different settings yield the same conclusions, so we stick to this setting for simplicity. Multiple random restarts of SGS and SVB do not result in significant differences. Unless otherwise noted, for each mini-batch both SGS and SVB are run until convergence. Specifically, SGS stops iterating when the training perplexity on the current mini-batch has not improved for 10 consecutive iterations, or when it reaches a maximum of 400 iterations. SVB stops the inner iteration when ||θ_d^old - θ_d^new||_1 / K < 10^-5 or after a maximum of 100 iterations, and stops the outer iteration when ||φ^new - φ^old||_1 / (KV) < 10^-3.
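The per-mini-batch stopping rule for SGS is simple to state in code; the sketch below mirrors it. The callback names and the helper structure are illustrative assumptions, not the authors' implementation.

```python
def run_minibatch_until_converged(sweep, train_perplexity,
                                  patience=10, max_iters=400):
    """Iterate Gibbs sweeps on the current mini-batch until the training
    perplexity has not improved for `patience` consecutive iterations or
    `max_iters` is reached (the SGS criterion described above).

    `sweep` runs one Gibbs sweep over the mini-batch; `train_perplexity`
    evaluates the current mini-batch. Both are assumed callbacks.
    """
    best = float("inf")
    stalled = 0
    for it in range(1, max_iters + 1):
        sweep()
        perp = train_perplexity()
        if perp < best:
            best, stalled = perp, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return it, best
```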
The predictive performance of LDA can be measured by the probability it assigns to held-out documents; this is formalized by perplexity. First we partition each dataset and use 80% for training and 20% for testing. Let M denote the model trained on the training data. Given a held-out document w_d, we infer θ_d from the first half of the tokens in d, and then calculate the exponentiated negative average of the log probability. Formally,

    per(w_d | M) = exp{ - (Σ_i log p(w_di | M)) / |w_d| },

where log p(w_di | M) = log Σ_k φ_{k,v_di} θ_dk.

To compare the performance of the algorithms in various settings, we test them on three datasets.¹ The small NIPS dataset contains the articles published at the Neural Information Processing Systems (NIPS) conferences from 1988 to 2000. A larger dataset consists of news articles from the New York Times (NYT), and the largest one contains the abstracts of publications on PubMed [21]. Table 1 gives the detailed statistics of each dataset.

¹ All three datasets can be downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Table 1: Dataset statistics, where K and M stand for thousand and million respectively.

              #Docs    #Tokens    Vocab-Size
    NIPS      1740     2M         13K
    NYT       300K     100M       100K
    PubMed    8.2M     730M       141K

3.3 Results for various mini-batch sizes

We run SGS and SVB on the NIPS and NYT datasets, varying the mini-batch size. To simplify the comparison, we set the decay factor for SGS to λ = 1. Fig. 1(a) and Fig. 1(b) show that SGS consistently performs better than SVB, especially on the bigger dataset and with smaller mini-batch sizes.

[Figure 1: The perplexities of SVB, SGS and CGS on the small NIPS (a) and large NYT (b) datasets, plotted against mini-batch size.]

[Figure 2: (a, b) The change of SGS's perplexity w.r.t. the decay factor λ on the NIPS and NYT datasets; (c) the trend of perplexity as new mini-batches arrive, mini-batch size = 3200.]

The difference in the performance gap between SGS and SVB on the NYT and NIPS datasets can be understood as reflecting different levels of redundancy. To learn effectively in a streaming setting, the dataset must be redundant. There are only 1740 documents in the NIPS corpus, which is far from redundant, and thus the different streaming algorithms perform alike on this dataset. From this trend with respect to dataset size, we can expect SGS to perform even better than SVB on larger datasets. Another interesting phenomenon is that SVB is more sensitive to the mini-batch size than SGS. The inherent ability of SGS to perform much better than SVB on smaller mini-batches is an important advantage in practice.

Note that SGS is equivalent to CGS when the mini-batch size equals the whole training-set size. The green horizontal dashed lines in Fig. 1(a) and 1(b) mark the perplexity of CGS. We can see that on the large NYT dataset, SGS achieves a large improvement over SVB when compared to this best-achievable perplexity.

3.4 Results for different decay factors

We also investigate how the decay factor λ affects the performance of SGS on the different datasets. The results are similar to those above: the larger dataset shows clearer trends.
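Before turning to the figures, it may help to see numerically how λ discounts history. Under the update N^t_kv = λ(N^{t-1}_kv + ΔN^t_kv) of Alg. 2, a mini-batch that arrived s steps ago survives in the current counts with weight λ^{s+1}. The toy sketch below, a purely illustrative assumption rather than part of the experiments, just unrolls this recurrence.

```python
import numpy as np

def effective_weights(lam, num_batches):
    """Weight with which each past mini-batch's counts survive in N_kv after
    repeatedly applying N_t = lam * (N_{t-1} + dN_t), as in Alg. 2."""
    weights = np.zeros(num_batches)
    for t in range(num_batches):
        weights[t] = 1.0          # mini-batch t arrives with its raw counts
        weights[:t + 1] *= lam    # the decay then scales everything so far
    return weights                # equals lam ** (num_batches - j) for batch j

# lam = 0.7, five mini-batches, oldest first:
print(effective_weights(0.7, 5))  # ~ [0.168, 0.240, 0.343, 0.490, 0.700]
```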
As shown in Fig. 2(a) and 2(b), when the mini-batch is too small, the decay factor has a negative effect. This is probably because a small mini-batch can only learn a limited amount of knowledge in a single round, so forgetting that knowledge is not desirable. When the mini-batch size gets bigger, the decay factor improves the performance, and the optimal value of λ gets smaller. This can be explained in a similar manner: bigger mini-batches learn more, and the next mini-batch has a greater discrepancy with the current one.

Let us examine a specific setting where the batch size on the NYT dataset is set to 3200. SVB yields a perplexity of 6511, while SGS without decay reaches 5240. After applying a decay factor of 0.7, SGS yields a perplexity of 4640, which is close to the batch perplexity of 4300. We conclude that, if the decay factor is set properly, SGS is much better than SVB and can almost reach the same precision as its batch counterpart.

3.5 Learning Process

The mean perplexity of each mini-batch is not a full description of the learning process; we should also consider how the inference quality evolves as new mini-batches arrive. In Fig. 2(c), we partition each mini-batch into training and testing sets, and plot the testing perplexity of each mini-batch for SVB and SGS. We can clearly see that decayed SGS is strictly better than the non-decayed version, and that the latter outperforms SVB. In other words, SGS consistently learns a better model every day. All three models show a perplexity burst over a few initial mini-batches because they over-fit the first few mini-batches.

3.6 Computational Efficiency

[Figure 3: Computational efficiency of SGS, SVB and CGS. (a) Convergence of SGS and SVB on a mini-batch (perplexity vs. wall-clock time); (b) time consumption of SGS, SVB and CGS for various batch sizes.]

Since online algorithms usually run over a long time span, months or years, they face fewer computational challenges than their offline counterparts. Still, we would like the online inference method not to use excessive computational resources. In this section, we compare the computational efficiency of SGS, SVB and CGS. Since SGS and SVB have outer loops that process arriving mini-batches and inner loops (iterations), we investigate the time per mini-batch and the time on the whole dataset separately.

In Fig. 3(a), we run SGS and SVB through some initial mini-batches and then investigate their convergence on the same intermediate mini-batch. The perplexity on held-out documents is plotted against time. SGS starts from a better point than SVB because of its better results on previous mini-batches; furthermore, SGS also converges faster than SVB.

In Fig. 3(b), both SGS and SVB are run until convergence, using the criteria stated in Section 3.2. We can see that SVB and SGS converge within a similar time. Also, since the online version searches over a smaller number of configurations, the smaller the mini-batch size is, the faster it converges. SGS can be faster than CGS with smaller mini-batch sizes.

3.7 Distributed Experiments

In this section we compare distributed SGS and SVB² on the NYT dataset, and examine the scalability of DSGS on the larger PubMed dataset. Comparing DSGS with SVB in Fig. 4(a), we conclude that although the perplexity of DSGS gets worse as the number of cores increases, it still consistently outperforms SVB. Fig. 4(b) shows the throughput of DSGS and SVB in tokens per second. Since the topic-word assignment update of DSGS is sparse while the corresponding variational parameter update of SVB is dense, the speedup of DSGS is much better than that of SVB. Fig. 4(c) and 4(d) show the scalability results on the larger PubMed dataset.
In general, we can conclude that DSGS enjoys a nice speedup while retaining a similar level of perplexity.

² SVB here refers to both the distributed and the single-threaded variants.

[Figure 4: Scalability results. (a) Perplexity of DSGS and SVB on NYT; (b) tokens per second of DSGS and SVB; (c) perplexity of DSGS on the PubMed dataset; (d) time consumption of DSGS on PubMed, for various numbers of threads and batch sizes.]

4 Discussion

We have developed a streaming Gibbs sampling algorithm (SGS) for the LDA model. Our method can be seen as an online extension of the collapsed Gibbs sampling approach. Our experimental results demonstrate that SGS improves perplexity over previous online methods, while maintaining a similar computational burden. We have also shown that SGS can be parallelized well using techniques similar to those adopted for SVB.

In the future, SGS can be further improved by making the decay factor λ, the mini-batch size and the number of iterations for each document self-evolving as more data is fed into the system. Intuitively, the algorithm learns fast at the beginning and slows down later on; for example, it might be tempting to decrease the iteration count for each document toward some constant over time. The scheme for this evolution deserves future research and needs strong theoretical guidance.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. "Latent Dirichlet Allocation". In: Journal of Machine Learning Research 3 (2003), pp. 993-1022.
[2] J. Mitchell and M. Lapata. "Vector-based Models of Semantic Composition". In: Annual Meeting of the Association for Computational Linguistics (ACL). 2008.
[3] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. "Bad News Travel Fast: A Content-Based Analysis of Interestingness on Twitter". In: Proceedings of the International Web Science Conference. 2011.
[4] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. "PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing". In: Transactions on Intelligent Systems and Technology (TIST) (2011).
[5] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. "Scalable Inference in Latent Variable Models". In: International Conference on Web Search and Data Mining (WSDM). ACM, 2012.
[6] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, C. Law, and J. Zeng. "Peacock: Learning Long-Tail Topic Features for Industrial Applications". In: arXiv:1405.4402 (2014).
[7] T. L. Griffiths and M. Steyvers. "Finding Scientific Topics". In: Proceedings of the National Academy of Sciences of the United States of America 101.Suppl 1 (2004), pp. 5228-5235.
[8] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. "Stochastic Variational Inference". In: Journal of Machine Learning Research 14.1 (2013), pp. 1303-1347.
[9] J. Foulds, L. Boyles, C. DuBois, P. Smyth, and M. Welling. "Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation". In: International Conference on Knowledge Discovery and Data Mining (SIGKDD). 2013.
[10] S. Patterson and Y. W. Teh. "Stochastic gradient Riemannian Langevin dynamics on the probability simplex". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[11] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. Jordan. "Streaming Variational Bayes". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[12] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma.
    "LightLDA: Big Topic Models on Modest Compute Clusters". In: arXiv:1412.1576 (2014).
[13] L. AlSumait, D. Barbará, and C. Domeniconi. "On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking". In: International Conference on Data Mining (ICDM). 2008.
[14] K. R. Canini, L. Shi, and T. L. Griffiths. "Online inference of topics with latent Dirichlet allocation". In: International Conference on Artificial Intelligence and Statistics (AISTATS). 2009.
[15] R. Guhaniyogi, S. Qamar, and D. B. Dunson. "Bayesian Conditional Density Filtering for Big Data". In: arXiv:1401.3632 (2014).
[16] L. Yao, D. Mimno, and A. McCallum. "Efficient methods for topic model inference on streaming document collections". In: International Conference on Knowledge Discovery and Data Mining (SIGKDD). 2009.
[17] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. "Reducing the sampling complexity of topic models". In: International Conference on Knowledge Discovery and Data Mining (SIGKDD). 2014.
[18] D. Newman, A. U. Asuncion, P. Smyth, and M. Welling. "Distributed inference for latent Dirichlet allocation". In: Advances in Neural Information Processing Systems (NIPS). 2007.
[19] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. "Scaling distributed machine learning with the parameter server". In: Operating Systems Design and Implementation (OSDI). 2014.
[20] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. "More effective distributed ML via a stale synchronous parallel parameter server". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[21] K. Bache and M. Lichman. UCI Machine Learning Repository - Bag of Words Data Set. 2013. URL: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.