loading

Logout succeed

Logout succeed. See you again!

ebook img

From Community Detection to Community Profiling PDF

file size2.8 MB

Preview From Community Detection to Community Profiling

From Community Detection to Community Profiling Hongyun Cai†‡, Vincent W. Zheng†, Fanwei Zhu#, Kevin Chen-Chuan Chang(cid:5), Zi Huang‡ † AdvancedDigitalSciencesCenter,Singapore ‡ SchoolofITEE,TheUniversityofQueensland,Australia # ZhejiangUniversityCityCollege,China (cid:5) UniversityofIllinoisatUrbana-Champaign,USA {hongyun.c,vincent.zheng}@adsc.com.sg,[email protected],[email protected],[email protected] ABSTRACT Inthispaper,weforthefirsttimeformalizetheconceptof“com- munityprofile”.Weasktwofundamentalquestions: 7 Mostexistingcommunity-relatedstudiesfocusondetection,which 1 aim to find the community membership for each user from user •Whatisacommunityprofile?Byname,theprofileshouldcharac- 0 friendshiplinks. However,membershipalone,withoutacomplete terizeacommunity,bothinternally(i.e.,whatitis)andexternally 2 profileofwhatacommunityisandhowitinteractswithothercom- (i.e., how it interacts with others). Since a community is an ag- munities, has limited applications. This motivates us to consider gregationofusers,itsprofileisessentiallyanaggregationofuser n a systematically profiling the communities and thereby developing information. DenoteX assometypeofuserinformation. Toac- J usefulcommunity-levelapplications. Inthispaper,weforthefirst commodateuncertaintyinX,wedefineaninternalprofileasprob- timeformalizetheconceptofcommunityprofiling. Withrichuser abilitiesof“community-X”,andanexternalprofileasprobabilities 7 information on the network, such as user published content and of“community-community-X”. HerewefocusonX ascontent, 1 userdiffusionlinks,wecharacterizeacommunityintermsofboth which is the primary user information in many social networks. ] itsinternalcontentprofileandexternaldiffusionprofile. Thediffi- E.g., in Twitter, users write tweets and retweet from others; in I cultyofcommunityprofilingisoftenunderestimated. Wenovelly DBLP,authorspublishpapersandcitepapersfromothers.Wecall S identify three unique challenges and propose a joint Community the probabilities of “community-content” as content profile (i.e., . s ProfilingandDetection(CPD)modeltoaddressthemaccordingly. whatacommunityisabout),andthoseof“community-community- c Wealsocontributeascalableinferencealgorithm,whichscaleslin- content”asdiffusionprofile(i.e.,howacommunitydiffusescertain [ earlywiththedatasizeanditiseasilyparallelizable. Weevaluate content with another). Other types of X’s may exist in different 1 CPDonlarge-scalereal-worlddatasets,andshowthatitissignifi- networks,e.g.,attributesinFacebook. Thus,“communityprofile” v cantlybetterthanthestate-of-the-artbaselinesinvarioustasks. isaflexibleconcept.WeleaveothertypesofX’sasfuturework. 8 •Isacommunityprofilegood?Duetonetworkhomophily,usersin 2 1. INTRODUCTION thesamecommunitytendtohavesimilarbehaviors. Thus,acom- 5 Thankstothepioneerstudiesoncommunitydetection[19,38], munity’sprofileshouldexplainthecommonbehaviorsofitsusers. 4 0 we have been able to model a community in terms of its mem- Inotherwords,wedonotseeany“community-content”distribu- . berusers. Suchcommunitymembershipassistsustobetterunder- tion as a good content profile; instead, only that well explaining 1 standthenetworkstructure. However,membershipalone,without theobservationsofusercontentasgeneratedbythecommunities 0 knowingwhatacommunityisandhowitinteractswithothers,has isa“good”one. Analogously, onlythe“community-community- 7 onlylimitedapplications–e.g.,wecannotrankcommunitiesbyde- content”distributionthatwellexplainstheobservationsofuser-to- 1 siredcharacteristics,exploitinter-communitydiffusions,andvisu- usercontentdiffusionasgeneratedbythecommunitiesisa“good” v: alize communities and their interactions. With this critical lack- diffusionprofile. Thisqualitycriterionwilllaterguideustoesti- i ingofcommunity“understanding”,thispaperproposessystematic mate the profiles accurately. It is also a key to differentiating us X communityprofiling–tocharacterizetheintrinsicnatureandextrin- fromotherwork–somepriorattemptsimplyaggregatesuserinfor- r sicbehaviorofacommunity–therebyenablingusefulcommunity- mationtooutputcommunityproperties(mostlyinternalones)[14], a level applications. As social networks increasingly capture more butitdoesnotrequiresuchpropertiestobestexplaintheobserva- andricheruserinformation,itisnowfeasibletoprofilecommuni- tions of user behaviors as generated by the communities through ties. E.g.,beyondtraditionalfriendshiplinkswhichconnectusers them. onasocialnetwork,therearealsousers’attributes,publishedcon- Weconsider“communityprofiling”asanewproblemtosolve tent, diffused content and so on. We can leverage such rich user duetothreereasons.Firstly,communityprofilingisdifferentfrom datatoestimatethecommunityprofiles. community detection, because detection focuses on getting com- munitymembershipforeachuser,whereasprofilingfocusesonget- tingthe“community-content”and“community-community-content” probabilities.Secondly,communityprofilehasneverbeendefined. Some recent work exploits rich user information [15, 30, 32, 33, This work is licensed under the Creative Commons Attribution- 40,41,42]toimprovecommunitydetectionandoutputsomeuser NonCommercial-NoDerivatives4.0InternationalLicense. Toviewacopy informationaggregationasaby-product. Buttheyneitherdefinea ofthislicense,visithttp://creativecommons.org/licenses/by-nc-nd/4.0/.For communityprofilebothinternallyandexternally, nortrytoiden- anyusebeyondthosecoveredbythislicense,obtainpermissionbyemailing tifynewapplicationsthatcommunityprofilesenable. Finally,the [email protected]. difficultyofcommunityprofilingisoftenunderestimated. Aswe ProceedingsoftheVLDBEndowment,Vol.10,No.6 Copyright2017VLDBEndowment2150-8097/17/02. shall discuss soon, due to the inter-dependency with community Friendship link Diffusion n Friendship link Diffusion link Community c Community-aware diffusion Document link Profiling & Detectio (Sect. 3.2) caossmigmDnumoneictnuyt mentc 1 c2 Taosspiigcn meCnDto ifncftu1e sCCniotoo npmmc rp2mmo rTTTfouuooiolfnnpppeil iiiiiecottccc3y y f o zzz ccfc213 1 2 13c plication Prediction (Sect. 5) Q(“ cePiPuo.gremh.or,oy mfn iqeule” n )-i dtiersi:v ce1n, c c2Qt ooupmeiWcrmysi l ul rneittwyTtec ocero1o t dap? m>i n Kf mf kuc cui2son ne m> igqt i m ecsu3: nci2t,i ecs3 nt c z1 z2 z3 1 Ap Profile-driven community visualization Joi 3 Diffusion Diffusion on a specific with topic c1->c1 c1->c2 c1->c3 topic aggregation (a) Problem input (Sect. 3.1) (b) Problem output after inference (Sect. 4) (c) Applications (Sect. 6) Figure1:Theframeworkofjointcommunityprofilinganddetection. detection, the heterogeneity of social observations and the non- 2)webuildaninteractivesystem1forprofile-drivencommunityvi- conformity of user behaviors, finding a good community profile sualization and ranking, which for the first time allows people to ischallenging. Noneoftheexistingworkhaseveridentifiedand freelybrowsethecommunitiesbybothcontentanddiffusion[4]. addressedsuchchallenges(morediscussionsinSect.2). Thedifficultyofcommunityprofilingisoftenlargelyunderesti- Ourgoalistoinfercontentprofileanddiffusionprofileforeach mated;asweshalldiscussnext,thereexistmanychallenges: community, andultimatelyenablenewapplications. InFig.1(a), • Inter-dependency with community detection. A straightforward we show the input for community profiling: a set of users, each approachofcommunityprofilingistofirstdetectcommunitiesand ofwhompublishesdocuments; usersareconnectedbyfriendship thenaggregateeachcommunity’suserobservationsastheprofiles. links,andinteractwitheachotherbydiffusionlinks.E.g.,inTwit- However,becausethisapproachdoesnottryto“bestexplain”the ter, each user posts tweets, users are connected by followership user observations as generated by the communities through their links,andtheyretweeteachothertodiffuseinformation.InFig.1(b), profiles,itisoftensuboptimal.Takecontentprofileasanexample. foreachcommunity,weoutput:acontentprofile(e.g.,community Denoteauserasuandacommunityasc.Forsimplicity,wedenote c1 tendstopublishtopicsz1 andz2)andadiffusionprofile(e.g., c’scontentprofileasp(content|c)andthelikelihoodofu’scontent c1tendstodiffuseitselfandc2onz1).InFig.1(c),weenablethree asp(content|u). Tobestexplaintheusercontentasgeneratedby new applications as follows (novelty to be discussed in Sect. 2, thecommunitiesthroughtheircontentprofiles,weeffectivelysolve applicationstobeconcretizedinSect.5andevaluatedinSect.6): (cid:81) (cid:81) (cid:80) • Community-aware diffusion. As community profiles aggregate max up(content|u)= u cp(content|c)p(c|u), (1) user behaviors, we can use them to more robustly model the dif- where p(c|u) is the probability of user u assigned to community fusioninacommunitylevel,ratherthananindividuallevel[9,22, c. Ideally, to optimize Eq. 1, we shall optimize both the profile 25]. E.g.,wecanexplainaretweethappensasoneuser’scommu- p(content|c)’sandthecommunityassignmentp(c|u)’s. Butinthe nitiesoftenretweettheother’sonacertaintopic.Weacknowledge straightforwardapproach,thedetectionfirstfixesthep(c|u)’s,then diffusionasacomplexdecision–beyondcommunityprofiles,there thebestresultthisaggregationcanreturnisthep(content|c)’sthat are also nonconformity factors such as individual preference and maximize Eq. 1. It is clear that, the maximal likelihood we get topicpopularity. Thispartiallyexplainswhycommunityprofiling withfixedp(c|u)’sissuboptimal,unlessthep(c|u)’sare“perfect”. ischallenging–wecannotaccountcommunityprofilesforallthe A perfect detection of p(c|u)’s also needs to maximize the like- diffusions; instead, we have to model different factors, to accu- lihood in Eq. 1, which depends on the profile p(content|c)’s. In ratelyestimatetheprofilesandthecommunity-awarediffusion. all, content profiles and community detection are coupled. Sim- •Profile-drivencommunityranking. Weoftenneedtotargetaudi- ilarly, we can show that diffusion profile and community detec- encesfordisseminatinginformationinthenetworks. E.g.,acom- tionarealsocoupled. Denoteonemoreuserasv, andonemore panywantstotargetcommunities,whicharemostlikelytoretweet communityasc(cid:48). Forsimplicity,wedenotec’sdiffusionprofileas aboutitsproduct, soastolaunchacampaign. Afundingagency p(diffusion|content,c,c(cid:48))’s, eachasaprobabilityofhavingadif- wants to target communities, which actively cite papers about its fusionbetweencandc(cid:48)aboutsomecontent. Then,tobestexplain grantthemeon“deeplearning”,soastodisseminatethegrantcall. theuser-to-userdiffusionasgeneratedbythecommunitiesthrough Sincewehaveknownwhatcontenteachcommunityisinterestedin theirdiffusionprofiles,weeffectivelysolve andhowitdiffusesthatcontentwithothers,wecanrankthecom- (cid:81) munities. Profile-driven community ranking is different from the max p(diffusion|content,u,v) (2) (u,v) traditionalcommunityrecommendation,whichoftenreliesononly =(cid:81) (cid:80) (cid:80) p(diffusion|content,c,c(cid:48))p(c|u)p(c(cid:48)|v), “community-X”propertiesandisunawareofdiffusion[7,14]. (u,v) c c(cid:48) •Profile-drivencommunityvisualization. Holisticmodelingleads wheretheproductover(u,v)istakenoveralltheuserpairshaving torichvisualization–wecannowvisualizenotonlyhowcommuni- adiffusionlink. TooptimizethelikelihoodinEq.2,weshallop- tiesfeaturedistinctcontents(e.g.,whatanITcommunitytweets), timizethediffusionprofilep(diffusion|content,c,c(cid:48))’s,aswellas but also how they interact (e.g., how an IT community retweets thecommunityassignmentsp(c|u)’sandp(c(cid:48)|v)’s. others)whichisoftenoverlookedbefore[8,23]. •Heterogeneityofsocialobservations. Socialobservations,espe- Wemaketworemarksabouttheaboveapplications:1)wecom- ciallytheuserlinks(i.e.,friendshiplinksanddiffusionlinks),often pleteonetaskofcommunityprofilingtosupportmultipleapplica- tionsatatime,thuscommunityprofilingisonlydoneonceoffline; 1http://sociallens.adsc.com.sg/ carrydifferentsemantics; e.g., friendshiplinksindicateusercon- •WedevelopascalableinferencealgorithmforCPD,andwefur- nections and diffusion links indicate user interactions. Tradition- therparallelizeitbytakingthedataskewnessintoaccount. ally, weoftentrytoenforceuserconnectionstobedenserwithin eachcommunitythanacrosscommunities[16, 19]. Butindiffu- • We perform extensive experiments to evaluate CPD over large- sion, the “weak ties” theory recognizes that the inter-community scaledatasets,andshowbothitseffectivenessandscalability. interactionsmaynotbeweak[12].E.g.,softwareengineeringcom- munitycitesmorepapersfrommachinelearningcommunitythan 2. RELATEDWORK itselfon“deeplearning.”Thismeanswehavetoseparatethemod- elingofuserconnectionanduserdiffusion. Suchuserlinkhetero- In this section, we review the related work on community de- geneity is largely overlooked in the previous work [32, 35], thus tection and relevant applications, and distinguish the differences howtomodelheterogeneoususerlinkstogetherremainsunclear. between existing work and our community profiling model. We • Nonconformity of user behaviors. User behaviors, especially furtherorganizesuchdifferencesinTable4. theirdiffusiondecisions,canhappenformanyreasons.Community- CommunityDetection. Detectingcommunitiesfromvariousnet- levelconformityisjustonereason,thuswehavetoconsiderother works has been extensively studied in the last decade. There ex- factorsaswell. E.g.,somediffusionhappensduetoitstopic(e.g., ist comprehensive surveys [39, 19, 36] on community detection, presidential election) being popular at the moment or its author which review different community detection methods in terms of (e.g.,LadyGaga)beingpreferredasacelebrity. Suchtopicpopu- detectionalgorithms,qualitymeasures,benchmarksandsoon. larityanduserpreferencearetheothertwotypicalnonconformity Conventionally,acommunityisdefinedasagroupofnodes,in factors for diffusion, and we must accommodate them. No prior which intra-group connections are much denser than inter-group workhasexploredbothcommunityfactorandnonconformityfac- ones [11, 38]. The pioneer community detection studies aim to tors[17,25],anditisnotclearhowtobalancethemindiffusion. generate the community membership for each node purely based Our technical novelty is identifying the above challenges and onthelinksamongstthem[19,38]. Theprevalenceofsocialnet- developing a unified Community Profiling and Detection (CPD) works offers a rich collection of user links to use for community model(Sect.3)toaddressthemaccordingly. detection,suchasthefollowershipinTwitter[32],Flickr[30]and • To model the inter-dependency with community detection, we Facebook/Google+[26, 41], theco-authorshipinDBLP[40, 42], proposetotakeanovelprofile-awaregenerativeapproach–were- the email exchange [32]. However, most of these existing work alizethedetectionbylatentmembershipvariablesandtheprofiling only consider one single type of links. There are other different bylatentcommunityprofilevariables,whichtogethergeneratethe typesofuserlinks; e.g., userscomment/replyotherusersindigg user friendship links, user content and user diffusion links in the [23],contact/co-contact/co-subscribeotherusersinYouTube[35], network. Thenweinfertheselatentvariablesbymaximizingthe follow/reply/retweetotherusersinTwitter[31].Butthesedifferent likelihood. None of the existing work has taken a profile-aware linkswereoftenmodeledinthesameway.Sofarasweknow,none generative approach– they may use a generative model for com- oftheexistingcommunityworkconsiderstheheterogeneityamong munitydetection[26,31,33],buttheyneverconsiderinternaland userlinks(i.e.,friendshiplinksanddiffusionlinks)aswedo. externalprofilestogetherwithdetection. Recentstudiesstarttoexploittherichuserinformation,suchas •Toaddresstheheterogeneityofsocialobservations,wepropose content[32],attribute[33,40],action[23,31],toimprovethede- toseparatethegenerationoffriendshiplinksfromlatentcommu- tection.Consequently,inadditiontocommunitymembership,they nityassignmentsandthegenerationofdiffusionlinksfromlatent alsooccationallyoutputsome“community-X”associations,such profiles. In particular, we require that two users are more likely as“community-content”[32],“community-attribute”[33,40]and to share a friendship link if they have similar community assign- “community-action”[23,31].Inourwork,wesimultaneouslydis- ments. Thus maximizing the likelihood of observing the friend- cover communities and characterize them with both internal and shiplinksenforcesintra-communityfriendshiplinkstobedenser externalprofiles.Althoughsomeformsofinternalcommunitypro- than inter-community ones. In contrast, we use the community filesmaybeobtainedinsomepriorwork([23,31,32,33,40])as diffusion profiles to generate the diffusion links, but we do not theby-products,theexternalprofilesaregreatlyoverlooked. require inter-community diffusion strengths to be always smaller Therearesomerecentstudiesonaggregatingeachcommunity’s thanintra-communityones;instead,thediffusionprofilesarefreely userpreferencesassomeformofcommunityprofiles,soastoen- learnedinmaximizingthelikelihoodofthediffusionlinkobserva- ableitemrecommendationtoeachcommunity. Theirworkisdif- tions. ferentfromoursintwoaspects. Ononehand,mostofthesecom- •Toaccommodatethenonconformityofuserbehaviors, wepro- munityrecommendationstudiesaregiventhecommunitiesasin- pose to define the generative probability of observing a diffusion put [14, 29]. Even though some of them did try to detect com- linkasalogisticfunctionovermultiplefactors,includingthetopic- munities[1,27],theirdefinitionofacommunityisagroupofusers awarecommunitydiffusionprofiles, thetime-sensitivetopicpop- whosharesimilarpreferencestoarecommendeditem,whichisnot ularities and the individual user preferences. By maximizing the basedonnetworklinksatall.Incontrast,ourcommunityisagroup likelihoodofdiffusionlinkobservations,welearnthediffusionpro- ofdenselyconnectedusers,whosharesimilarinterestsanddiffu- files,aswellastheweightstocombinethesedifferentfactors. sionbehaviours. Ontheotherhand,theircommunityprofileisob- Finally,wedesignascalableinferencealgorithmforCPD(Sect. tainedbyaggregatingtheusers’preferences,whichisusuallybased 4). Asshownlater, ourinferencealgorithmscaleslinearlytothe onaleastmiseryoraggregatevotingapproach.Incontrast,wefor- data set size. We further parallelize our inference algorithm, by malizethecommunityprofilesastheprobabilitiesof“community- takingthedataskewnessintoaccount. X”andprobabilitiesof“community-community-X”. Besides,we Wesummarizeourcontributionsasfollows: estimatethesecommunityprofilesbyagenerativemodeltogether withcommunitydetection. • We identify a new problem of community profiling, which to- getherwithdetectionenablesaholisticmodelingofcommunities. Community-awareApplications.Thecommunityprofilesdeepen ourunderstandingofthedetectedcommunitiesandthusbenefita • We identify three unique challenges and design a novel CPD lot of community-level applications. Here we review the related modelforjointcommunityprofilinganddetection. Notations Description |U|,|W| Thenumberofusersandwords |C|,|Z| Thenumberofcommunitiesandtopics |F|,|E| Thenumberoffriendshiplinksanddiffusionlinks dui Thei-thdocumentpublishedbyuseru CPD(ours)COLD[17]CRM[15]WTM[37]PMTLM[43]HAM[16]BlackHole[21]topcgo[10]INFEST[25]TopicInfluence[24]LADP[22]Influlearner[9]CFF[27]GF[1]GenClus[33]TURCM[31]PMM[35]MetaFac[23]SA-Cluster[42]BAGC[40]CESNA[41]SocialCircle[26]CODICIL[30]SN-LDA[32]MaxFlow[11]Methods EπDWwcFuuuuituuijvi,ikzui TTAAMTThhhhudfeeeerliitfkssceifeenon-utttomdshoomismoffwhinwidauoplonlroicdldinrituidnyksimsnktafrireisdfnobnrsoomuitdcgsmtouindpocmmuounusecbemoenulnvrietmtsenudharetenuntcdddoitoubimtuioystmpoeuirucdsvneoaricstuusieimgsnesmnpteejcnitfiatfcottiromdueusiteru •• •• ••• • • •• tex θφc MMuullttiinnoommiiaallddiissttrriibbuuttiioonnoovveerrtwoopridcssssppeecciifificcttooctoopmicmzunityc t z • • •••• attribu ηναc,,βc(cid:48),zρ TPDrhioreibcpahablriealtimtpyeriotoefrrscsofmormmuonditeylicngdiifnfduisvinidgucaolmdimffuusniiotyncp(cid:48)reofnerteonpciecz te userfeatures user-itempairsuser-itempairs interactionsinteractionuseractions nodefeatureData winogr,kcotomomurutnhirtyeedeifxfaumsipoTlneaabanpldepl2cio:cmaNtimootnuasnt,iitoiynncvsliusduianlgizcaotimonm.uFniirtsytlrya,nfok-r frien cnoitmiemsbuansiteydroannkuinsegr,sm’ionstteroefsetsxiosntinthgesmtu,dii.ees.,[t1o4,fi7n]drtahnekfcaovmoumriute- • • ••••••• •••••••••••d . communities for users. Moreover, the communities to be ranked T link are often already predefined over the networks. In our work, the able1: ••••• ••••• diff.lin ccdooifmmfummsiuuonnniittpiieerossfiablryeesnbtooottghperttohhveeirid.reTidnhtaeissrntwhaelililcnohpneutltep,nattnhpderoocuofirmlefpsoacanunysd/iasuetxtohteorrranntakol C k om in choosethepromisingcommunitytopromotetheirproducts/papers d asmuchaspossible. Secondly, forcommunitydiffusion, incon- parison • •• ••••• ividualDiffu tmraosdtetlosaorueractotmhemiunndiitvyi-dleuvaellledviefflu[s9io,n22m,o2d5e].lliRnge,cemntolsyt,dthifefruesiaorne withthe ••• • communitysionfactor stoshoivoemenrreltfohasoectuktcodeordisme[st1mot5hgu,aen1ttih7ctie]oe.rns,Bsiainedrdseeiirdvpedirdsei,fudfuaeulnfislfinioakecndetoao[tr1ut0irhs]memocoriodstmeshilemningtugoniponiitfcy[v-1ala7ewr]viaoearulne,sdnbdeutiostfspfeuiiics-- r s elate • • • topic pmoupnuiltayrivtiysufaaclitzoartiisonm,iaslstihnoguignh[a15lo].tLofasetf,fbourttsnhoatvtehebeleeanstd,efvoortceodmto- d w to communitydetection,onlyafewofthemfurthervisualizethere- or pic sultstofacilitatethedeepanalysisandsemanticinterpretation. In k. •• • • • • ex [8], theauthors proposea community detectionand visualization tra model, whichdifferentiatestheinnernodesandthebordernodes c t. forvisualizingtheinteractionsbetweencommunities. Whiletheir c o objectiveistodesignalayoutalgorithmforclearlydisplayingthe m m communitiesandtheirinteractions,wefocusondemonstratingthe u ••• ••• •••••••••••••nity topic-awareuserinteractionstrengthsamongthecommunities. d e te 3. JOINTPROFILINGANDDETECTION c t.diffuTasks muIlnattehethfeojloloinwtincogm,wmeufinirtsytpdreofifinleinsgomanedkdeeytencotitoionnpsr;otbhleenmw.eTafbolre- s ••••• ••••• io 2summarizesthenotationsusedinthispaper. n p red DEFINITION 1. A social graph is G = (U, D,F,E), where . u ∈ U isauserandd ∈ Disauserpublisheddocument. There c o m are two types of links in G. Fuv ∈ F is a friendship link from m u userutouserv; Eij ∈ E isadiffusionlinkfromdocumentito n • •• ity documentj.Bothtypesoflinksaredirected. p rofile Fuo;rFaTwreitpterersneenttwsotrhka,tDusuer⊂uDfoilslotwhessuesteorfvt;wEeetsrpeopsrteesdenbtysuthseart uv ij tweet i is a retweet of tweet j. For a DBLP network, D is the u setofpaperspublishedbyauthoru; F representsthatauthoru uv co-authorswithauthorv;E representsthatpapericitespaperj. ij Toenablecontentmodeling,wefirstdefinetopic. Heterogeneous anauthorpublishesapaperondeeplearning,becausesheisfrom user interactions F ρ themachinelearning(ML)community,whichstudiesdeeplearn- uv F ing. Aswedealwithshortdocuments(e.g.,tweetsinTwitterand papertitlesinDBLP)andashortdocumentislikelytobeaboutone singletopic[31,17],weassignonesingletopictoeachdocument η Et c π c ij ui u inourmodel.Secondly,weconsiderauserutopublishadocument C E Joint profiling α and detection β dui oftopicz,whichdiffusesanotheruserv’sdocumentdvj,due ν tobothusers’communityassignmentscui andcvj,aswellasthe θ Cdioffmupsiroenh menosdiveeli ng z w φ caopmapmeurnointysodfiftwfuasrieonrepproosfiitloerηiecsu,ia,cnvdjzc.itEes.ga.n,oatnhearutahuotrhourpvu’sblpisahpeesr c ui uik z C Wun D Z ondeeplearning,becauseuisfromthesoftwareengineering(SE) u U community,visfromtheMLcommunity,andSEcommunitytends tocitepapersondeeplearningfromtheMLcommunity. Finally, Figure2:GraphicalmodelofCPD. weconsiderauserutoformafriendshiplinkwithauserv,dueto theirsimilarcommunitymembershipsπ andπ . E.g.,anauthor u v uisaco-authorwithanotherauthorv,becausetheyarebothfrom DEFINITION 2. Atopicz∈{1,...,|Z|}isa|W|-dimensional theMLcommunity. multinomialdistributionφ overwords,whereeachdimensionφ z z,w Addressingdataheterogeneity.WemodelthefriendshiplinksF istheprobabilityofawordw∈{1,...,|W|}belongingtoz. andthediffusionlinksEdifferently. Conventionally,agoodcom- Then,wedefinethecommunitymembership,aswellasourcom- munityneedstohavelowconductance,whichmeansthefriendship munitycontentprofileanddiffusionprofile. linksshouldbedenserinsideacommunitythanoutsideacommu- nity. Specifically,wedefinetheprobabilityofhavingafriendship DEFINITION 3. A user u’s community membership is a |C|- linkbetweentwousersuandvasasigmoidfunction,parameter- dimensional multinomial distribution π , where each dimension u izedbytheircommunitymembershipsimilarity: π istheprobabilityofubelongingtocommunityc,∀c∈{1,...,|C|}. u,c P(F =1)=σ(πˆTπˆ ), (3) uv u v DEFINITION 4. The content profile of community c is a |Z|- dimensional multinomial distribution θc over topics, where each where πˆu = [πˆu,1,...,πˆu,|C|]T is an estimation of πu based on dimensionθ istheprobabilityofcdiscussingtopicz. theaggregationofu’scommunityassignments. Inothewords,we c,z use πˆ and πˆ , instead of π and π , to generate the F ’s in u v u v uv DEFINITION 5. Thediffusionprofileofcommunitycisa|C|× Fig.2. Suchadesignismotivatedby[5,6]tosimplifytheinfer- |Z|-dimensionalmatrixηc,whereeachentryηc,c(cid:48)zistheprobabil- ence.σ(x)=1/(1+e−x)isasigmoidfunction.Themoresimilar ityofcdiffusinganothercommunityc(cid:48)ontopicz. πˆ andπˆ are,themorelikelyF exists. Inotherwords,F is u v uv uv large if u and v are from the same communities. This naturally Takecommunityc1inFig.1asanexample. Asc1’suserspublish enforces denser friendship links within a community than across morecontentonz1andz2,theresultingθc1,z1 andθc1,z2 arebig- communities,thusleadingtolowconductance.Incontrastwiththe ger. Besides,asc1’susersoftenretweet/citethemselvesonz1,the friendship links, the inter-community diffusion is not necessarily resultingηc1,c1z1 isbig. AsmotivatedinSect. 1,weformalizea “weak”[12]. Infact,thecommunity-leveldiffusionstrengthsvary jointprofilinganddetectionproblemtosolveinthispaper. overtopics,whichbreakstheassumptionofhavingtomaintainthe lowconductancewithinacommunity. Weneedtoresorttoadif- PROBLEM 1. GivenasocialgraphG =(U,D,F,E),thetask ferentmodelingofdiffusionlinks,aswediscussnext. ofjointcommunityprofilinganddetectionistoinfer:1)eachuser u’scommunitymembershipπ ,∀u ∈ U; 2)eachcommunityc’s Accommondatingnonconformity. Differentfactorscanaccount u contentprofileθ anddiffusionprofileη ,∀c∈{1,...,|C|}. for a diffusion decision. Take Twitter as an example; user u is c c likelytoretweetv’stweetd asheri-thtweetd attimetif: 1) vj ui 3.1 ModelDesign thecommunity-leveldiffusionstrengthbetweenc (thecommu- ui Next,weconcretizeourmodeldesignw.r.t. thethreetechnical nity u belongs to when she generates document dui) and cvj on challengesforcommunityprofilingasdiscussedinSect.1.Wewill topiczvjisstrong;2)thetopiczvjofdvjistrendingattimet;3)u laterevaluatehowwellweaddresseachchallengeinSect.6.2. hasanindividualpreferencetoretweetfromv.Thesefactorsshow threetypicalperspectivestomakeadiffusiondecision:community Profile-aware generative model. Community detection aims to perspective(ifacommunityismorelikelytoretweetanothercom- infer a community membership assignment π for each user u u munity),contentperspective(ifatopicismorepopularatthetime) basedonthefriendshiplinksF ’s. Communityprofilingaimsto uv and user perspective (if a user is more likely to retweet another inferacontentprofileθ andadiffusionprofileη foreachcom- c c user).Next,wecharacterizethethreetypicalfactors. munitycbasedonitsmemberusers’publishedcontentD ’sand u diffusionlinksEt ’s. Wecanreinforceprofilinganddetection,by •Communitydiffusionpreference: weconsiderauserutodiffuse ij lettingthemleverageeachother’sdata. Asaresult,wewishtoin- anotheruserv ontopicz,ifthecommunitiesofuandv areboth ferasetofcommunity-levellatentvariables,includingπu’s,θc’s interestedinzandtheyoftendiffuseeachotheronz. Denotes ∈ andη ’s,togetherfromalltheobservations(D,F,E). {0,1}asanindicatorforadiffusionlinkinEtohappen.Then,the c Since joint profiling and detection is an unsupervised task, we probabilityofhavingadiffusions=1fromutovonzis adoptagenerativeframeworkforourCPDmodel.WedesignCPD asagraphicalmodelinFig.2, whereweusecommunitiestoex- p(s=1,z|u,v)=1 (cid:88)(cid:88)p(s=1|c,c(cid:48),z)p(z|c)p(z|c(cid:48))p(c(cid:48)|v) plainalltheuserobservationsonthenetwork. Firstly,weconsider c c(cid:48) aniutysearssuigtnompeunbtlicsuhiaadnodctuhmeecnotmdmuiunoiftytocpoinctzen,tdpureotfioleheθrcucio.mEm.gu.-, ∝2 (cid:88)c(cid:88)c(cid:48)ηc,c(cid:48)zθˆc,zπˆu,cθˆc(cid:48),zπˆv,c(cid:48), (4) where at step 1 we expand p(s = 1,z|u,v) by introducing the 4. SCALABLEMODELINFERENCE communitymembershipp(c|u)andp(c(cid:48)|v),thecommunities’in- WedevelopascalableinferencealgorithmforCPD.Weaimto terestsonthetopicp(z|c)andp(z|c(cid:48)),aswellasthetopic-sensitive inferthetopicassignmentandcommunityassignmentlatentvari- communitydiffusionprobabilityp(s = 1|c,c(cid:48),z). Atstep2, we ables {Z,C} from the observations {W,F,E}, where W is the estimatep(s=1|c,c(cid:48),z)withηc,c(cid:48)z,theprobabilityofcretweet- wordsinD. WeusecollapsedGibbssampling[6,15,32]forthe ing/citingc(cid:48) onz. Besides, weestimatep(c|u)withπˆu,c, which inference. Wealsoestimatethevariationalparameters{π,θ,φ} istheempiricalprobabilityofcommunitycbeingassignedtouser andthemodelparameters{ν,η}byvariationalExpectationMaxi- u; similarly we estimate p(c(cid:48)|v) with πˆv,c(cid:48). Finally we estimate mization(EM)[5,6].Welaterparallelizeourinferencealgorithm. p(z|c)withθˆ ,whichistheempiricalprobabilityoftopicz as- c,z signedtothedocumentsfromc;similarlyweestimatep(z|c(cid:48))with 4.1 CollapsedGibbsSampling θˆc(cid:48),z.Denoteθˆ·,z =[θˆ1,z,...,θˆ|C|,z]T andη¯ =vec([η1,...,η|C|]), To derive the Gibbs sampler, we start with computing the col- wherevec(A)concatenatestherowvectorsinamatrixAtoavec- lapsedposteriordistributionofourmodel: tor. Foradiffusionbetweend andd , whichshares thesame ui vj p(W,F,E,C,Z,f,ν,η¯|ρ,α,β) topicz,wedenotec¯ =vec((πˆ πˆT)◦(θˆ θˆT )),where◦isan (6) ij u v ·,z ·,z =p(C|ρ)p(Z|C,α)p(W|Z,β)p(F|C)p(E|C,η,Z,ν,f), element-wiseproduct.ThenEq.(4)becomesc¯Tη¯. ij wherep(F|C)(abbreviatedasp(F)inthefollowing)istheprob- •Topicpopularity:wemodelthepopularityofatopicataspecific ability for the friendship links F generated by the communities timestamptasthecountoftopiczatt,whichisdenotedasnt. z C; p(E|C,η¯,Z,ν,f) (abbreviated as p(E) in the following) is theprobabilityforthediffusionlinksE generatedbythecommu- • Individual preference: we model user u’s preference to diffuse informationfromuservwithalinearfunctionνTf ,whereν is nities C. We follow [5] to model observed links only in Eq. 6. uv (cid:81) Thus, we define p(F) = P(F = 1) and p(E) = aexpaamrapmlee;twere,fcuovnsisidaerfetwatourfeeavteucrteosrffoorruu:a1n)dusve.rTpaokpeuTlawriittyte,rwahsiacnh (cid:81)(i,j)∈Ep(Eitj = 1), where(ut,vis)∈tFhe timeusvtamp of the diffusion is defined as the number of u’s followers divided by that of her link(i,j). followees|Followers(u)|;2)useractiveness,whichisdefinedasthe InthegenerativeprocessofCPDmodel,wemodelbothP(Fuv = |Followees(u)| 1) (step 3.c) and p(Et = 1) (steps 3.d) with sigmoid functions numberofu’sretweetsdividedbythatofhertweets |Retweets(u)|. ij |Tweets(u)| σ(·). Bayesianinferencewithsigmoidfunctionisknownashard, Weextractv’sfeaturesandconcatenatethemwithu’sasf . uv because it is analytically inconvenient to construct a Gibbs sam- Inordertosystematicallycombinethethreediffusionfactors,we plerforthesigmoidfunction[28]. Wearemotivatedbythedata introduceasigmoidfunctiontodefinetheprobabilityofdocument augmentationapproach[2,6],whichintroducesPo´lya-Gammaran- d diffusingdocumentd oftopiczattimestamptas: vj ui domvariablestoderiveanexactmixturerepresentationofthesig- moidfunctionforeasierinference.HenceweintroducetwoPo´lya- p(Et =1|u,v,z,t)=σ(c¯Tη¯+nt +νTf ). (5) ij ij z uv Gammavariablesλandδastheaugmentedvariablesforp(F)and p(E)respectively. Formally,arandomvariablexfollowsaPo´lya- Welearntheparametersη¯andν,sothatweknowhowmucheach Gammadistributionx∼PG(a,b)(a>0,b>0),if factorcontributesinthediffusion. Generativeprocess. WesummarizetheCPDmodel’sgenerative x= 2π12 (cid:80)∞k=1 (k−1/2)2g+kb2/(4π2), processbelow.Denote1 asanall-onevectoroflength(cid:96). (cid:96)×1 whereg ∼ Gamma(a,1)isaGammarandomvariable. Ithas k 1. Foreachtopicz = 1,...,|Z|,drawitsworddistributionfrom beenshownin[28],alogisticfunctioncanberepresentedasamix- aDirichletpriorparameterizedbyβ:φ |β ∼Dir(β1 ); tureofGaussiansw.r.t.aPo´lya-Gammadistribution: z |W|×1 2. Foreachcommunityc=1,...,|C|,drawitstopicdistribution 1 1(cid:90) ∞ fromaDirichletpriorparameterizedbyα:θc|α∼Dir(α1|Z|×1); 1+e−w = 2 ψ(w,x)p(x|1,0)dx, (7) 0 3. Foreachuseru=1,...,|U| w−xw2 whereψ(w,x)=e 2 andx∼PG(1,0).Then,forp(Fuv = (a) Drawhercommunitydistributionπu|ρ∼Dir(ρ1|C|×1); 1)asdefinedinEq.3,wecanintroduceaPo´lya-Gammavariable (b) Forthei-thdocumentduiofuseru λuv ∼PG(1,0),suchthatwegetajointprobability i. Drawacommunityassignmentcui|π∼Multi(πu), p(F =1,λ )= 1ψ(πˆTπˆ ,λ )p(λ |1,0). (8) byu’smultinomialcommunitydistributionπ ; uv uv 2 u v uv uv u ii. Drawatopiczui|c,θ∼Multi(θcui),bycui’smulti- Similarly,forp(Eitj = 1)asdefinedinEq.5,wecanintroducea nomialtopicdistributionθc; Po´lya-Gammavariablesδij ∼PG(1,0),suchthatweget iii. Draw each word w |z,φ ∼ Multi(φ ), ∀k = 1,...,|Wui|,byzui’usikmultinomialworddizsutriibution; p(Eitj =1,δij)= 12ψ(c¯Tijη¯+ntz+νTfuv,δij)p(δij|1,0). (9) (c) Foreachfriendshiplinkfromuserutouserv,drawF |π∼ uv Consideringallthefriendshiplinksanddiffusionlinks,wehave Ber(σ(πˆTπˆ ))byaBernollidistribution(Eq.3); (d) For each duiffvusion link Eitj from document dui to docu- p(F,λ)=(cid:89) ψ(πˆTuπˆv,λuv)p(λuv|1,0), (10) mentd attimet,drawEt |C,η,Z,ν,f ∼Ber(σ(c¯Tη¯+ (u,v)∈F ntz+νvTjfuv))byaBernoiljlidistribution(Eq.5); ij p(E,δ)= (cid:89) ψ(c¯Tijη¯+ntz+νTfuv,δij)p(δij|1,0). (11) (i,j)∈E Instep3.b.iii, sinceshorttextoftenhassingletopic[31, 17], we NextweinferZandC,togetherwithλandδ.Specifically,aug- sampleallwordsind fromthesametopic-worddistributionφ . mented with two Po´lya-Gamma variables λ and δ, the collapsed ui z posteriordistributionofourmodelbecomes: Weefficientlysampleλ byanalternateexponentiallytiltedJa- uv cobidistribution[28]. p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) •Forδ:theconditionaldistributionofδisalsoPo´lya-Gamma, =p(C|ρ)p(Z|C,α)p(W|Z,β)p(F,λ|C)p(E,δ|C,η¯,Z,ν,f) p(δ |W,F,E,C,Z,f,ν,η,δ) (cid:82) (cid:82) ij = P(C|π)P(π|ρ)dπ· p(Z|C,θ)P(θ|α)dθ· (cid:82) π θ −δij(¯cTijη¯+ntz+νTfuv)2 φP(W|Z,φ)P(φ|β)dφ·p(F,λ)·p(E,δ) ∝e 2 p(δij|1,0) (16) = |(cid:81)U| ∆(ncu+ρ) · |(cid:81)C| ∆(nzc+α) · |(cid:81)Z| ∆(nwz+β) ·p(F,λ)·p(E,δ), =PG(1,c¯Tijη¯+ntz+νTfuv). ∆(ρ) ∆(α) ∆(β) u=1 c=1 z=1 4.2 ModelParameterEstimation (12) where∆(x)= (cid:81)dii=m(1x)Γ(xi).BasedonEq.12,wecaninferZ,C, WeusevariationalEMtoiterativelyestimatethevariationalpa- Γ((cid:80)dii=m(1x)xi) rameters{π,θ,φ}andthemodelparameters{ν,η}: λandδonebyoneasfollows. 1. (E-step)UsethesamplesofcollapsedGibbssamplingtoes- •ForZ:theprobabilityofassigningtopicztoduiis timatetheparametersπ,θandφ,givenν,η. p(z =z|C,Z ,W,F,E,f,ν,η¯,λ,δ) 2. (M-step)OptimizeνandηbymaximizingEq.(6),giventhe ui ¬{ui} parametersπ,θ,φestimatedintheE-step. = p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) p(W,F,E,cui=c,C¬{ui},Z¬{ui},f,ν,η¯,λ,δ,ρ,α,β) In the E-step, the Gibbs sampler iteratively draws samples of Z, ∝(cid:81)|C| ∆(nzc+α) ·(cid:81)|Z| ∆(nwz+β) · C, λandδ byEqs. 13–16. Basedonthesamples, weestimate: c=1 ∆(nzc,¬{ui}+α) z=1 ∆(nwz,¬{ui}+β) π = ncu+ρ ,θ = nzc+α andφ = nwz+β .Inthe p(F|λ,C¬{ui})·p(E|δ,C¬{ui},Z¬{ui}) Mu-,sctep,nw(u·e)+fi|Crs|tρestci,mzatenη(cc·,)c+(cid:48)z|Z’s|αby aggrze,wgatingn(zt·h)+e|Wco|mβmunity nz +α (cid:81)|W| (cid:81)nwui(nw +β+i−1) andtopicassignmentsw.r.tallthedocuments,basedonthelastit- = c,¬{ui} · w=1 i=1 z,¬{ui} · (13) n(c·,)¬{ui}+|Z|α (cid:81)nj=(u·1i)(n(z·,)¬{ui}+|W|β+j−1) earllatoitohnerofvsaarimabplleinsgfi.xTehde–nthwiseiessteismseantetiνallbyyfimttainxgimaizloinggisEticq.r6egwreitsh- (cid:81) ψ(πˆTπˆ ,λ |C )· sionfunction;tosolveit,werandomlysamplethesameamountof v∈Λu u v uv ¬{ui} (cid:81) ψ(c¯Tη¯+nt +νTf ,δ |C ,Z ), non-observeddiffusionlinksasnegativeinstancesforoptimization. j∈Λi ij z uv ij ¬{ui} ¬{ui} As α and ρ are used to sample the πu’s andθc’s, we followthe where Λu = {v|(u,v) ∈ F or (v,u) ∈ F} is user u’s neigh- convention[13]tosettheirvaluesas50dividedbyπu’sdimension bors in F. Λi = {j|(i,j) ∈ E or (j,i) ∈ E} is document andθc’sdimensionrespectively, i.e., α = 50/|Z|, ρ = 50/|C|. i’s neighbors in E. nz and n(·) denote the number of Asβisusedtosampletheworddistributionφz’sandthenumber c,¬{ui} c,¬{ui} ofwordsislarge,wefollow[13]againtosetβ =0.1. times that topic z is assigned to community c and the number of times that any topic is assigned to c, excluding the current doc- 4.3 Scalability ument dui. Similarly, nwz,¬{ui} and n(z·,)¬{ui} are the number of WesummarizeourinferencealgorithminAlg.1. Insteps3–10, timesthatwordw isassignedtotopicz andthenumberoftimes wetakeanE-stepforcollapsedGibbssampling.Insteps11-14,we thatanywordisassignedtoz,excludingd . nw andn(·)arethe takeanM-stepfortrainingthemodelparameters. ui ui ui numberoftimesthatwordw occursinthedocumentd andthe ui Time complexity. In steps 4–6, as we compute the community numberofwordsind . Finally,ψ(πˆTπˆ ,λ |C )denotes ui u v uv ¬{ui} assignmentsandtopicassignmentsforeachdocumentofeachuser, estimating ψ(πˆTuπˆv,λuv) based on C¬{ui} instead of the whole ittakesO(|D|×|C|+|W|×|Z|). Insteps7–8,aswecompute C;ψ(c¯Tijη¯+ntz+νTfuv,δij|C¬{ui},Z¬{ui})denotesestimating πˆTuπˆv foreachfriendshiplink,ittakesO(|C|×|F|). Insteps9– ψ(c¯Tijη¯+ntz +νTfuv,δij)basedonC¬{ui} andZ¬{ui} instead 10,aswecompute(c¯Tijη¯+ntz+νTfuv)foreachdiffusionlink,it ofthewholeCandZ. takes|C|2×|E|. Insteps11–12,asweaggregatethecommunity assignmentsandtopicassignmentsforeachdiffusionlink,ittakes •ForC:theprobabilityofassigningcommunityctouatd is ui O(|E|).Insteps13–14,aswecomputegradientsforνoverallthe p(cui =c|C¬{ui},Z,W,F,E,f,ν,η¯,λ,δ) diffusionlinks,ittakesO(|E|×T2).Intotal,forT1iterations,the = p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) overallcomplexityisO((|D|×|C|+|W|×|Z|+|C|×|F|+ p(C¬{ui},zui=z,Z¬{ui},W,F,E,f,ν,η¯,λ,δ|ρ,α,β) |C|2×|E|+|E|+|E|×T2)×T1).Aswecansee,Alg.1’stime ∝(cid:81)|U| ∆(ncu+ρ) (cid:81)|C| ∆(nzc+α) · complexityislineartothedatasize(i.e.,|D|,|F|,and|E|). u=1 ∆(nc +ρ) c=1 ∆(nz +α) u,¬{ui} c,¬{ui} Parallelization.WeconsidermultithreadparallelizationofAlg.1. p(F|λ,C¬{ui})·p(E|δ,C¬{ui},Z¬{ui}) (14) Weleavemulti-machineparallelizationasfuturework.Inourvari- = ncu,¬{ui}+ρ · nzc,¬{ui}+α ·(cid:81) ψ(πˆTπˆ ,λ |C ) ationalEMalgorithm, wefindtheE-steptakesmuchlongertime n(·) +|C|ρ n(·) +|Z|α v∈Λu u v uv ¬{ui} thantheM-step,because:1)theE-step’scollapsedGibbssampling u,¬{ui} c,¬{ui} ·(cid:81) ψ(c¯Tη¯+nt +νTf ,δ |C ,Z ), hastobedoneiterativelyoveralltheobservations,includingdoc- j∈Λi ij z uv ij ¬{ui} ¬{ui} uments (thus words), friendship links and diffusion links; 2) the wherenc andn(·) arethenumberofdocumentsfrom M-step’smodelparameterestimationiscomparativelymucheas- u,¬{ui} u,¬{ui} ier, since optimizing ν is basically solving logistic regression on useruthatareassignedtocommunitycandthenumberofdocu- thediffusionlinks(andthesameamountofnegativelinks)andηis mentsfromuseruexcludingd ,respectively. ui donebysimplyaggregatingtheexistingcommunityandtopicas- •Forλ:theconditionaldistributionofλisPo´lya-Gamma,i.e., signments.Thus,inthispaperwefocusonparallelizingtheE-step. •Segmentingdatatoreduceinter-dependency.RecallinSect.4.1, p(λ |W,F,E,C,Z,f,ν,η,δ) uv thesamplingrequirescomputing:1)anumberofcounters,includ- ∝e−λuv(πˆ2Tuπˆv)2p(λuv|1,0)=PG(1,πˆTuπˆv). (15) ing the community-topic counter nzc, the topic-word counter nwz, Algorithm1ScalableinferenceforCPD document and each link, based on a serial implementation of the Input: UsersU,docsD,friendshiplinksF,diffusionlinksE; samplingalgorithmoverallthedata.Then,basedonthenumberof Output: TopicassignmentsZ,communityassignmentsC,model documentsandlinksauserhas,weestimatetheaverageworkload parametersνandη; ofprocessingthatuser. Finally,wesumuptheaverageworkload ofalltheusersinthei-thdatasegmentaso . 1: Initializeν,η,α,β,ρ; i Weevaluateourinferencealgorithm’sefficiencyinSect.6.4. 2: foriter=1:T do 1 3: foreachuseru∈U do 4: foreachdocumentd ∈D do 5. APPLICATIONS ui u 5: SampleatopiclabelzuiaccordingtoEq.13; We concretize how to enable the following three community- 6: SampleacommunitylabelcuiaccordingtoEq.14; aware applications based on five CPD outputs, including: 1) the 7: foreachfriendshiplink(u,v)∈F do communityassignmentforusersπ ’s;2)thecommunitycontent u,c 8: SampleaugmentedvariableλuvaccordingtoEq.15; profile θc,z’s; 3) the community diffusion profile ηc,c(cid:48)z’s; 4) the 9: foreachdiffusionlink(i,j)∈Edo topicassignmentforwordsφw,z’s;5)theindividualdiffusionpref- 10: Sampleaugmentedvariableδ accordingtoEq.16; erenceparametersν. ij 11: foreachdiffusionlink(i,j)∈Edo Community-awarediffusion.Giveninputofadocumentd pub- vj 12: Updateηcui,cvjzui byaggregatingc(cid:48)uis,cvj’sandzui’s; lishedbyuserv,weoutputtheprobabilitythatanotheruseruwill 13: forsubiter=1:T2do publishadocumentduitoretweetorcitedvj attimestamptas 14: GradientdescentforνoverthediffusionlinksE. p(Et =1|u,v,d ,t)=1 (cid:80) p(Et =1|u,v,z,t)p(z|d ) ij vj z ij vj theuser-communitycounterncu;2)anumberoflinkprobabilities, =2 (cid:80)zσ((cid:80)c(cid:80)c(cid:48)πu,cθc,zηc,c(cid:48)zπv,c(cid:48)θc(cid:48),z+ntz+νTfuv)p((z1|8d)vj). includingthefriendshiponeψ(πˆTuπˆv,λuv)andthediffusionone whereatstep1weexpandp(Eitj =1|u,v,dvj,t)bythetopicsof ψ(c¯Tijη¯+ntz+νTfuv,δij).Amongthesecomputations,bothtopic dvj.Atstep2,wepluginthedefinitionofp(Eitj =1|u,v,z,t)by andcommunityassignmentsareappliedtodocuments(thustheir Eq.5.Aswecansee,Eq.18comprehensivelymodelsthediffusion users), the friendship link probability is applied to users, and the bytakingthecommunityassignmentsπ,thecommunityprofilesθ diffusion link probability is applied to two documents (thus their andη,andtheindividualdiffusionpreferenceνintoaccount. users). Therefore,exceptthewordtopicassignment,thevastma- Profile-drivencommunityranking. Giveninputofaqueryq ∈ jorityofcomputationsaredoneonusersanddocuments.Thismo- Wk(∀k≥1),weoutputtherankingofcommunitiesbasedontheir tivatesustosegmentthedatabyusersanddocuments,sothatdif- probabilitiestodiffuseinformationaboutq.Denotetheprobability ferentthreadscanworkondifferentdatasegmentswithlittleinter- ofacommunityctogenerateadiffusionlinks=1ofqueryqas dependency. It may be possible to take words into consideration for the data segment as well, but it is not obvious and we leave p(s=1|c,q)=1 (cid:80) (cid:80) p(s=1|c,c(cid:48),z)p(z|q,c(cid:48))p(c(cid:48)|q) itforfuturework. Consideringthatauseroftenhasmanydocu- z c(cid:48) ments(especiallyinTwitter),wedesigntwoguidelinestosegment ∝2 (cid:80)z(cid:80)c(cid:48)ηc,c(cid:48)zp(z|q,c(cid:48))∝3 (cid:80)z(cid:80)c(cid:48)ηc,c(cid:48)zθc(cid:48)z(cid:81)w∈qφz,w, thedatabyusersanddocuments: 1)wekeepauser’sdocuments (19) inthesamedatasegment,becauseotherwisetherearelikelymany whereatstep1weexpandp(s = 1|c,q)bythecommunitydiffu- conflictingupdatesaboutthesameuserfrommultiplethreads; 2) sionprofilep(s = 1|c,c(cid:48),z),thetopicassignmentforqinacom- wepreferkeepingthesame-topicdocumentsinthesamedataseg- munityp(z|q,c(cid:48))andtheprobabilitythatqisfromthatcommunity ment,becauseithelpstoreducetheconflictingupdatesaboutthe p(c(cid:48)|q). Atstep2wepluginthedefinitionofp(s = 1|c,c(cid:48),z) ∝ same topic from multiple threads. Overall, we first run LDA [3] ηc,c(cid:48)z andconsiderqcancomefromanycommunitywithp(c(cid:48)|q) onalltheusers’documentswith|Z|topics; thenwepartitionthe uniformly. At step 3, we estimate the probability p(z|q,c(cid:48)) in a usersinto|Z|segments,basedoneachuser’smostfrequentlyas- similar way as Eq. 13. We skip the details but explain the ratio- signedtopicinherdocuments. Ineachsegment,eachuserhasher nalofthisestimation: p(z|q,c(cid:48))isproportionaltotheprobability documents,relatedfriendshiplinksanddiffusionlinks. ofcommunityc(cid:48)generatingtopicz(i.e.,capturedbyθc(cid:48),z)andthe (cid:81) • Distributing workload to avoid data skewness. We aim to dis- probabilityofqbelongingtotopicz(i.e.,capturedby w∈qφz,w). tributethe|Z|datasegmentstoM threads,suchthattheworkload Profile-drivencommunityvisualization. Wecanvisualizeeach on each thread is balanced. Note that M is set as the number of community’scontentprofileanditsdiffusionprofile, asFig.1(b) physicalCPUcoresinthiswork. Ourapproachistofirstestimate shows.Inparticular,weareinterestedinthediffusionvisualization, theworkloadofeachdatasegment,andthencastthissegmentallo- as it is new. In our experiments, we visualize how a community cationtaskassolvingMstandard0-1knapsackproblems2.Denote interactswiththeothersintwotypicalsettings: 1)diffusionona tahlletih-ethddaatatasseeggmmeenntts’siswOork=loa(cid:80)daMi=s1oioi∈. DRe+n,othteusatbhienwaroyrkinlodaicdaftoorr tsopecc(cid:48)ifiucndtoepritco,pwichezr;e2w)ediuffsuesiηocn,c(cid:48)wzitahsttohpeicdiaffgugsrieognastitorenn,gwthhefrroemwce asxi ∈{0,1}.Thenforeachthread,wesolve use(cid:80)zηc,c(cid:48)zasthediffusionstrengthfromctoc(cid:48). max (cid:80)M o x , s.t. (cid:80)M o x ≤ 1 O, (17) i=1 i i i=1 i i M 6. EXPERIMENTS whichtriestofindasubsetofthedatasegmentstohaveascloseto WetestCPDwithtwolarge-scalereal-worlddatasets. Wede- O workloadaspossible. Onecanfinetunetheobjectivefunction M sign experiments to: 1) evaluate how well we address each chal- of Eq. 17 in practice to best allocate the data segments for even lengelistedinSect.1;2)evaluateCPD’sperformance,bycompar- workloadamongthethreads.Weestimateeachworkloado asfol- i ingwiththestate-of-the-artbaselinesindifferentapplications. lows.Firstofall,weestimatetheaverageprocessingtimeforeach 6.1 SetUp 2https://en.wikipedia.org/wiki/Knapsack_problem #(user) #(friend.link) #(diff.link) #(doc.) #(word) [17]todetectthecommunities,andfurtheraggregatetheuserob- Twitter 137,325 3,589,811 992,522 39,952,379 2,316,020 servations in each detected community as the profiles. After ap- DBLP 916,907 3,063,186 10,210,652 4,121,213 330,334 plyingCRMandCOLD,wegetthecommunityassignmentprob- Table3:Datasetstatistics. abilitiesforeachuserutoeachcommunityc,whichwedenoteas π∗ ’s. Togetaggregatedcontentprofile,wefirstrunLDA[3]on u,c Data Diffusionfactors Tasks alltheusers’documentswith|Z|topics,andforeachuseru’si-th Methods textfriend diff indiv-commtopic topic comm diff comm documentd ,wegetits|Z|-dimensionalmultinomialtopicdistri- links links idual extract detect predprofile butionasθ∗ui . Denotecommunityc’saggregatedcontentprofile PMTLM[43] • • • • • du,i WTM[37] • • • • • asθ∗c.Wehave CCORLMD[[1157]] • • •• • •• • •• •• θ∗ =(cid:80)|U| π∗ (cid:80)|Du| θ∗du,i. (20) Ours • • • • • • • • • • c u=1 u,c i=1 |Du| To get aggregated diffusion profile, we aggregate each diffusion Table4:Differenceswithbaselines. linkbetweend andd inEw.r.t. theirusers’communitieson u,i v,j atopicz. Denoteaggregateddiffusionprofileη∗ astheproba- c,c(cid:48)z bilityofcommunitycdiffusingcommunityc(cid:48)ontopicz.Thenwe WedoexperimentsonLinuxcomputersequippedwithIntel(R) have 3.50GHzCPUsand16GBRAMs. Wedo10-foldcrossvalidation andreportaveragescoresforallthequantitativeresults. Wealso ηc∗,c(cid:48)z ∝(cid:80)(i,j)∈Eπu∗,cπv∗,c(cid:48)θd∗u,i,zθd∗v,j,z. (21) reportsignificanttestresultswhenevernecessary. Inall,weobtaintwomorebaselines,whichimplementthestraight- Data sets. We use two public data sets: Twitter [20] and DBLP forward“firstdetection,thenaggregation”profilingapproach. [34]. TheTwitterdatasetwascollectedinMay2011. TheDBLP • CRM+Agg. It uses CRM [15] to detect communities; then it datasetcontainsthepublicationsindexedbyDBLP3from1936to usesEq.20andEq.21foruseraggregationtogetthecommunity 2010. Wepre-processedthetweetsandthepapertitles,byremov- contentprofilesanddiffusionprofiles,respectively. ingstopwords,stemmingandPOStagging4. Weonlykeptnouns, •COLD+Agg.ItusesCOLD[17]todetectcommunities,thensim- verbsandhashtags.Afterthat,weremovethedocumentswithless ilarlyusesEq.20andEq.21togetcontentanddiffusionprofiles. thantwowords,andthenremovetheuserswithnodocument. Ta- WecomparewithbothCRM+AggandCOLD+Aggindiffusion ble3summarizesthestatisticsofourdatasetsafterpre-processing. linkpredictionandcommunityranking. Baselines.Wechoosebaselinesbasedonthefollowingguidelines: Evaluation. Sincewejointlyprofileanddetectcommunities,we 1)theyarethestateofthearttomodelheterogeneoususerobser- willevaluatethequalityofcommunitydetectionandprofiles. vationsatthedatalevel;2)theymodeldiffusionpredictionatthe •Detectionquality.Weconsidertwowaystoevaluatetheresulting tasklevel;3)preferablytheymodelcommunity.Finally,wechoose communities:1)howdensetheyare;2)howwelltheycanbeused fourbaselinesbelow,andlistourdifferenceswiththeminTable4. to explain the friendship link observations. For 1), we use con- •PoissonMixed-TopicLinkModel(PMTLM)[43]. Itmodelsthe ductance[18,19]asthemetric. Asourcommunityassignmentis documentnetworkandusesthedocumenttopicassignmenttogen- probabilistic,wefollow[17]toleteachuserbelongtohertopfive erate the links. We adapt PMTLM for community detection and communitiesinconductanceevaluation(andalsolatercommunity friendshiplinkpredictioncomparison,byaggregatingthetopicas- rankingevaluation).Thesmallerconductanceis,thebetter.For2), signmentsofeachuser’sdocumentsasthecommunitymembership wefollow[17]todesignalinkpredictiontask,whereweuseEq.3 forthatuser. WealsocomparewithPMTLMondiffusionpredic- topredictwhethertoobserveafriendshiplinkbasedontwousers’ tion,asitalsomodelsthedocumentlinks. communities. Asthereisnopredefinedthresholdforlinkpredic- tion,weuseAUC(AreaUnderthereceiveroperatingcharacteristic •WhomtoMention(WTM)[37].Itmodelstheuserdiffusionlinks Curve) [17, 15] as the metric. Given a ranking of non-observed withusercontentandfriendship.Itdoesnotmodelcommunity.We links,wecalculatetheAUCscoreastheprobabilityofarandomly comparewithWTMondiffusionprediction. chosentruepositivelinkbeingrankedhigherthanarandomlycho- •CommunityRoleModel(CRM)[15]. Itmodelsfriendshiplinks sen true negative link. In the 10-fold cross validation, each time anddiffusionlinksbasedontheuser’scommunityassignmentand weuse10%ofthepositivelinksandsamplethesameamountof roleassignmenttogether. WecomparewithCRMoncommunity negativelinkstocalculateAUC.ThehigherAUCis,thebetter. detection,friendshiplinkpredictionanddiffusionprediction. •Profilequality.Duetolackofgroundtruth,wegenerallyevaluate •COmmunityLevelDiffusion(COLD)[17]. Itmodelsthecontent thecontentanddiffusionprofiles’qualitythroughtheapplications in Sect. 5. For community-aware diffusion, as there is no prede- and diffusion links based on communities. Thus it is the closest finedthresholdfordiffusionlinkprediction,weagainuseAUCas worktoours. Butitmodelsneitherfriendshiplinksincommunity the evaluation metric. For profile-driven community ranking, as detection,norindividualfactorandtopicfactorindiffusionpredic- thecommunitiesdetectedbydifferentalgorithmsaredifferent,for tion.WecomparewithCOLDoncommunitydetection,friendship faircomparison,weevaluatethequalityofeachrankedcommunity linkpredictionanddiffusionprediction.AsCOLDhascommunity intermsofitsusers–givenaqueryq, wecheckhowmanyusers diffusionstrength,wealsocompareitoncommunityranking. Inadditiontotheaboveexistingbaselines,wealsodesignsome inthetopK rankedcommunitiesreallyretweet(orcite)aboutq. Thennaturallywecomputeprecisionandrecallforeachcommu- morebaselinestovalidatethatwearebetterthanastraightforward nity in the ranking list. Denote the users who mention q in their communityprofilingapproachof“firstdetectingcommunities,then retweets(orcitationpapertitles)asU∗. Denotetheusersbelong- aggregatingeachcommunity’suserobservations”.Specifically,we q ingtoanyofthetopK communitiesasU . Theprecisionofthe adopt the two state-of-the-art algorithms, CRM [15] and COLD K topK communitiesforqueryq isP(K,q) = |U∗∩U |/|U |, q K K 3http://dblp.uni-trier.de/ and the recall is R(K,q) = |U∗ ∩U |/|U∗|. We define mean q K q 4http://nlp.stanford.edu/software/tagger.shtml average precision (MAP) over all the queries as MAP@K = ecnatcudnoC 0000....00007689....51555697820 50 NNOoou r HJsoeitne1tr 0Mo0goedneeliitnyg 150 )CUA( noitciderP kniL 000000......00666767..2684426720 NNOoou rHJso5ei0tnetr Mogoedneeliitnyg100 150 )CUA( noitciderP noisuffiD 000000......00788887..6246888920 NNOouo rJHs5oe0intetr Mogoedneeliitnyg100 150 ecnatcudnoC 000000......34567844444420 50 NNOouo r JHsoeinte1tr 0Mo0goedneeliitnyg 150 Number of Communities Number of Communities Number of Communities Number of Communities (a) Communitydetect.(Twitter) (b)Friendshiplinkpred.(Twitter) (c) Diffusionlinkpred.(Twitter) (d) Communitydetect.(DBLP) )CUA( noitciderP kniL 000000......79678872722720 50 NNOoou r HJsoeint1et0r Mo0goedneeliitnyg 150 )CUA( noitciderP noisuffiD 00000.....756898888820 NNOouo rJHso5ein0tetr Mogoedneeliitnyg100 150 )CUA( noitciderP noisuffiD 0000....7788272720 NNOouo rTIsn5od0pivicidual & T1o0p0ic 150 )CUA( noitciderP noisuffiD 0000....8899161620 50NNOoou rITsnodpivicid1u0a0l & Topic 150 Number of Communities Number of Communities Number of Communities Number of Communities (e) Friendshiplinkpred.(DBLP) (f) Diffusionlinkpred.(DBLP) (g) Diffusionlinkpred.(Twitter) (h) Diffusionlinkpred.(DBLP) Figure3:Studyofourmodeldesign. (cid:80)MmhbiniaegstqAahoe(ndepR(cid:80)rioac@MnKivm=KetA1rooapFdP=giece(ilssii(cid:80)n,,F,gwq1tq)he[/(e3a(cid:80)aKs]dbtoM)eKio=/tpte|t1eAQvorRa.Fn|l(ue@Iaiane,ntKxqdeat)rid/mtads=Kewiqtai)iuond/2ane×|MlQa,liMytvAya|e.u.sAPrsEaFP@tehgfid@KfneeeKam+cclr×tlMeeoiytvMcn,rAeaitwcAelRlylneR(@,tp(@pdKMepeeKrrrfiApop.lnfilReeelx)xTeitithhatyiyeess) )CUA( noitciderP noisuffiD 00000.....456785555520 WCCOO5T0LLMDD+Agg 10CCO0RRurMMs+Agg150 )CUA( noitciderP noisuffiD 000000......56894755555520 PCCMOOLLTDDL5M+0Agg CCO1RRu0rMMs0+Agg 150 of a content profile measures how well it generates the user con- Number of Communities Number of Communities (a) Twitter (b) DBLP tentobservations,andweusethesamedefinitionofperplexityas in[17].Thelowerperplexityis,thebetter. Figure4:Resultsofcommunity-awarediffusion. 6.2 ModelDesign DBLP respectively; the topic factor is able to contribute another Wewanttoevaluatehowwellweaddresseachcommunitypro- 3.6%and10.5%absoluteAUCimprovementoneachdataset. filingchallengeasintroducedinSect.1. Toachievethisgoal,we In all, we conclude that our model design well addresses the designsomebaselinesbasedonthedegeneratedversionsofCPD, threechallengesincommunityprofiling. for validating the advantages of our model design. We compare 6.3 ComparisonwithBaselines CPDwiththesebaselines,andevaluatethequalityofdetectedcom- munities and profiles through three tasks: community detection, WeevaluateCPDandthebaselinesonvariousapplications. friendshiplinkpredictionanddiffusionlinkprediction. 6.3.1 Community-awareDiffusion • Modeling the inter-dependency with community detection. We Quantitativeanalysis. Infigure4,wesummarizetheresultcom- designabaseline“nojointmodeling”,wherewefirstdetectcom- parisonwiththebaselinesintroducedinSect.6.1. PMLTMisnot munitiesonlyfromthefriendshiplinksthroughagenerativemodel applicabletoTwitter,sinceitisdesignedsolelyforcitationnetwork– byEq.3,thenweextracttheprofilesthroughagenerativemodelas it predicts a citation based on the similarity between two docu- inCPDexcepthavingthecommunitiesfixed.AsshowninFigures ments,butinTwitteratweetanditsretweetarealmostidentical.As 3(a)–3(f),oursisalwaysbetterthan“nojointmodeling”. showninFig.4,ourmodelconsistentlyoutperformsallthebase- •Addressingtheheterogeneityofsocialobservations.Wedesigna lines,thanksto: 1)ourmodelingvariousdiffusionfactorsandhet- baseline“noheterogeneity”,whereweadaptCPDtomodelfriend- erogeneoususerlinks,incontrastwiththebaselinesinTable4;2) shiplinksanddiffusionlinksinthesamewaybyEq. (3),butkeep ourjointdetectionandprofiling,incontrastwiththetwo“firstde- theotherpartsofCPDmodelingunchanged. AsshowninFigures tectionthenaggregation”baselines. When|C|=100,weachieve 3(a)-3(f),oursisbetterthan“noheterogeneity”ondiffusionpre- 24.2%–91.6%and5.1%–108.0%relativeAUCimprovementsthan diction,andcomparablewithitoncommunitydetectionandfriend- thebaselinesinTwitterandDBLP,respectively.Theimprovements shiplinkprediction.Thisimplies:1)diffusionlinksandfriendship arestatisticallysignificantoverthe10-foldcrossvalidationresults, links are different, and diffusion links require more sophisticated withstudent’st-testone-tailedp-valuep<0.01. modeling than friendship links; 2) friendship links and diffusion Casestudy. WeexaminethethreediffusionfactorsinEq.5with linksarecorrelated;diffusionlinksdonotsignificantlychangethe theDBLPdata. Firstly,inFig.5(a)weplotthenumberofpapers communitystructureoncethefriendshiplinksaregiven. ausercitesw.r.t.heractiveness,andthenumberofcitationsauser •Accommodatethenonconformityofuserbehaviors. Wedesign hasw.r.t.herpopularity.Useractivenessandpopularityaredefined two baselines: 1) “no individual & topic”, where we exclude the inSect.3.1. Generally, themoreactiveauseris(i.e., publishing individualfactorandtopicfactorfromEq.5inCPD;2)“notopic”, morepapers),themorepapersshecites;besides,themorepopular where we exclude only the topic factor from Eq. 5 in CPD. As auseris(i.e.,amoreestablishedresearcher),themorecitationsher showninfigures3(g)and3(h),theindividualfactorisabletocon- papersget. Thisobservationsupportsourdesignofmodelingboth tribute4.8%and6.8%absoluteAUCimprovementonTwitterand useractivenessandpopularityastheindividualfactorsindiffusion.

See more

The list of books you might like