From Community Detection to Community Proﬁling Hongyun Cai†‡, Vincent W. Zheng†, Fanwei Zhu#, Kevin Chen-Chuan Chang(cid:5), Zi Huang‡ † AdvancedDigitalSciencesCenter,Singapore ‡ SchoolofITEE,TheUniversityofQueensland,Australia # ZhejiangUniversityCityCollege,China (cid:5) UniversityofIllinoisatUrbana-Champaign,USA {hongyun.c,vincent.zheng}@adsc.com.sg,[email protected],[email protected],[email protected] ABSTRACT Inthispaper,wefortheﬁrsttimeformalizetheconceptof“com- munityproﬁle”.Weasktwofundamentalquestions: 7 Mostexistingcommunity-relatedstudiesfocusondetection,which 1 aim to ﬁnd the community membership for each user from user •Whatisacommunityproﬁle?Byname,theproﬁleshouldcharac- 0 friendshiplinks. However,membershipalone,withoutacomplete terizeacommunity,bothinternally(i.e.,whatitis)andexternally 2 proﬁleofwhatacommunityisandhowitinteractswithothercom- (i.e., how it interacts with others). Since a community is an ag- munities, has limited applications. This motivates us to consider gregationofusers,itsproﬁleisessentiallyanaggregationofuser n a systematically proﬁling the communities and thereby developing information. DenoteX assometypeofuserinformation. Toac- J usefulcommunity-levelapplications. Inthispaper,wefortheﬁrst commodateuncertaintyinX,wedeﬁneaninternalproﬁleasprob- timeformalizetheconceptofcommunityproﬁling. Withrichuser abilitiesof“community-X”,andanexternalproﬁleasprobabilities 7 information on the network, such as user published content and of“community-community-X”. HerewefocusonX ascontent, 1 userdiffusionlinks,wecharacterizeacommunityintermsofboth which is the primary user information in many social networks. ] itsinternalcontentproﬁleandexternaldiffusionproﬁle. Thedifﬁ- E.g., in Twitter, users write tweets and retweet from others; in I cultyofcommunityproﬁlingisoftenunderestimated. Wenovelly DBLP,authorspublishpapersandcitepapersfromothers.Wecall S identify three unique challenges and propose a joint Community the probabilities of “community-content” as content proﬁle (i.e., . s ProﬁlingandDetection(CPD)modeltoaddressthemaccordingly. whatacommunityisabout),andthoseof“community-community- c Wealsocontributeascalableinferencealgorithm,whichscaleslin- content”asdiffusionproﬁle(i.e.,howacommunitydiffusescertain [ earlywiththedatasizeanditiseasilyparallelizable. Weevaluate content with another). Other types of X’s may exist in different 1 CPDonlarge-scalereal-worlddatasets,andshowthatitissigniﬁ- networks,e.g.,attributesinFacebook. Thus,“communityproﬁle” v cantlybetterthanthestate-of-the-artbaselinesinvarioustasks. isaﬂexibleconcept.WeleaveothertypesofX’sasfuturework. 8 •Isacommunityproﬁlegood?Duetonetworkhomophily,usersin 2 1. INTRODUCTION thesamecommunitytendtohavesimilarbehaviors. Thus,acom- 5 Thankstothepioneerstudiesoncommunitydetection[19,38], munity’sproﬁleshouldexplainthecommonbehaviorsofitsusers. 4 0 we have been able to model a community in terms of its mem- Inotherwords,wedonotseeany“community-content”distribu- . berusers. Suchcommunitymembershipassistsustobetterunder- tion as a good content proﬁle; instead, only that well explaining 1 standthenetworkstructure. However,membershipalone,without theobservationsofusercontentasgeneratedbythecommunities 0 knowingwhatacommunityisandhowitinteractswithothers,has isa“good”one. Analogously, onlythe“community-community- 7 onlylimitedapplications–e.g.,wecannotrankcommunitiesbyde- content”distributionthatwellexplainstheobservationsofuser-to- 1 siredcharacteristics,exploitinter-communitydiffusions,andvisu- usercontentdiffusionasgeneratedbythecommunitiesisa“good” v: alize communities and their interactions. With this critical lack- diffusionproﬁle. Thisqualitycriterionwilllaterguideustoesti- i ingofcommunity“understanding”,thispaperproposessystematic mate the proﬁles accurately. It is also a key to differentiating us X communityproﬁling–tocharacterizetheintrinsicnatureandextrin- fromotherwork–somepriorattemptsimplyaggregatesuserinfor- r sicbehaviorofacommunity–therebyenablingusefulcommunity- mationtooutputcommunityproperties(mostlyinternalones)[14], a level applications. As social networks increasingly capture more butitdoesnotrequiresuchpropertiestobestexplaintheobserva- andricheruserinformation,itisnowfeasibletoproﬁlecommuni- tions of user behaviors as generated by the communities through ties. E.g.,beyondtraditionalfriendshiplinkswhichconnectusers them. onasocialnetwork,therearealsousers’attributes,publishedcon- Weconsider“communityproﬁling”asanewproblemtosolve tent, diffused content and so on. We can leverage such rich user duetothreereasons.Firstly,communityproﬁlingisdifferentfrom datatoestimatethecommunityproﬁles. community detection, because detection focuses on getting com- munitymembershipforeachuser,whereasproﬁlingfocusesonget- tingthe“community-content”and“community-community-content” probabilities.Secondly,communityproﬁlehasneverbeendeﬁned. Some recent work exploits rich user information [15, 30, 32, 33, This work is licensed under the Creative Commons Attribution- 40,41,42]toimprovecommunitydetectionandoutputsomeuser NonCommercial-NoDerivatives4.0InternationalLicense. Toviewacopy informationaggregationasaby-product. Buttheyneitherdeﬁnea ofthislicense,visithttp://creativecommons.org/licenses/by-nc-nd/4.0/.For communityproﬁlebothinternallyandexternally, nortrytoiden- anyusebeyondthosecoveredbythislicense,obtainpermissionbyemailing tifynewapplicationsthatcommunityproﬁlesenable. Finally,the [email protected]. difﬁcultyofcommunityproﬁlingisoftenunderestimated. Aswe ProceedingsoftheVLDBEndowment,Vol.10,No.6 Copyright2017VLDBEndowment2150-8097/17/02. shall discuss soon, due to the inter-dependency with community Friendship link Diffusion n Friendship link Diffusion link Community c Community-aware diffusion Document link Profiling & Detectio (Sect. 3.2) caossmigmDnumoneictnuyt mentc 1 c2 Taosspiigcn meCnDto ifncftu1e sCCniotoo npmmc rp2mmo rTTTfouuooiolfnnpppeil iiiiiecottccc3y y f o zzz ccfc213 1 2 13c plication Prediction (Sect. 5) Q(“ cePiPuo.gremh.or,oy mfn iqeule” n )-i dtiersi:v ce1n, c c2Qt ooupmeiWcrmysi l ul rneittwyTtec ocero1o t dap? m>i n Kf mf kuc cui2son ne m> igqt i m ecsu3: nci2t,i ecs3 nt c z1 z2 z3 1 Ap Profile-driven community visualization Joi 3 Diffusion Diffusion on a specific with topic c1->c1 c1->c2 c1->c3 topic aggregation (a) Problem input (Sect. 3.1) (b) Problem output after inference (Sect. 4) (c) Applications (Sect. 6) Figure1:Theframeworkofjointcommunityproﬁlinganddetection. detection, the heterogeneity of social observations and the non- 2)webuildaninteractivesystem1forproﬁle-drivencommunityvi- conformity of user behaviors, ﬁnding a good community proﬁle sualization and ranking, which for the ﬁrst time allows people to ischallenging. Noneoftheexistingworkhaseveridentiﬁedand freelybrowsethecommunitiesbybothcontentanddiffusion[4]. addressedsuchchallenges(morediscussionsinSect.2). Thedifﬁcultyofcommunityproﬁlingisoftenlargelyunderesti- Ourgoalistoinfercontentproﬁleanddiffusionproﬁleforeach mated;asweshalldiscussnext,thereexistmanychallenges: community, andultimatelyenablenewapplications. InFig.1(a), • Inter-dependency with community detection. A straightforward we show the input for community proﬁling: a set of users, each approachofcommunityproﬁlingistoﬁrstdetectcommunitiesand ofwhompublishesdocuments; usersareconnectedbyfriendship thenaggregateeachcommunity’suserobservationsastheproﬁles. links,andinteractwitheachotherbydiffusionlinks.E.g.,inTwit- However,becausethisapproachdoesnottryto“bestexplain”the ter, each user posts tweets, users are connected by followership user observations as generated by the communities through their links,andtheyretweeteachothertodiffuseinformation.InFig.1(b), proﬁles,itisoftensuboptimal.Takecontentproﬁleasanexample. foreachcommunity,weoutput:acontentproﬁle(e.g.,community Denoteauserasuandacommunityasc.Forsimplicity,wedenote c1 tendstopublishtopicsz1 andz2)andadiffusionproﬁle(e.g., c’scontentproﬁleasp(content|c)andthelikelihoodofu’scontent c1tendstodiffuseitselfandc2onz1).InFig.1(c),weenablethree asp(content|u). Tobestexplaintheusercontentasgeneratedby new applications as follows (novelty to be discussed in Sect. 2, thecommunitiesthroughtheircontentproﬁles,weeffectivelysolve applicationstobeconcretizedinSect.5andevaluatedinSect.6): (cid:81) (cid:81) (cid:80) • Community-aware diffusion. As community proﬁles aggregate max up(content|u)= u cp(content|c)p(c|u), (1) user behaviors, we can use them to more robustly model the dif- where p(c|u) is the probability of user u assigned to community fusioninacommunitylevel,ratherthananindividuallevel[9,22, c. Ideally, to optimize Eq. 1, we shall optimize both the proﬁle 25]. E.g.,wecanexplainaretweethappensasoneuser’scommu- p(content|c)’sandthecommunityassignmentp(c|u)’s. Butinthe nitiesoftenretweettheother’sonacertaintopic.Weacknowledge straightforwardapproach,thedetectionﬁrstﬁxesthep(c|u)’s,then diffusionasacomplexdecision–beyondcommunityproﬁles,there thebestresultthisaggregationcanreturnisthep(content|c)’sthat are also nonconformity factors such as individual preference and maximize Eq. 1. It is clear that, the maximal likelihood we get topicpopularity. Thispartiallyexplainswhycommunityproﬁling withﬁxedp(c|u)’sissuboptimal,unlessthep(c|u)’sare“perfect”. ischallenging–wecannotaccountcommunityproﬁlesforallthe A perfect detection of p(c|u)’s also needs to maximize the like- diffusions; instead, we have to model different factors, to accu- lihood in Eq. 1, which depends on the proﬁle p(content|c)’s. In ratelyestimatetheproﬁlesandthecommunity-awarediffusion. all, content proﬁles and community detection are coupled. Sim- •Proﬁle-drivencommunityranking. Weoftenneedtotargetaudi- ilarly, we can show that diffusion proﬁle and community detec- encesfordisseminatinginformationinthenetworks. E.g.,acom- tionarealsocoupled. Denoteonemoreuserasv, andonemore panywantstotargetcommunities,whicharemostlikelytoretweet communityasc(cid:48). Forsimplicity,wedenotec’sdiffusionproﬁleas aboutitsproduct, soastolaunchacampaign. Afundingagency p(diffusion|content,c,c(cid:48))’s, eachasaprobabilityofhavingadif- wants to target communities, which actively cite papers about its fusionbetweencandc(cid:48)aboutsomecontent. Then,tobestexplain grantthemeon“deeplearning”,soastodisseminatethegrantcall. theuser-to-userdiffusionasgeneratedbythecommunitiesthrough Sincewehaveknownwhatcontenteachcommunityisinterestedin theirdiffusionproﬁles,weeffectivelysolve andhowitdiffusesthatcontentwithothers,wecanrankthecom- (cid:81) munities. Proﬁle-driven community ranking is different from the max p(diffusion|content,u,v) (2) (u,v) traditionalcommunityrecommendation,whichoftenreliesononly =(cid:81) (cid:80) (cid:80) p(diffusion|content,c,c(cid:48))p(c|u)p(c(cid:48)|v), “community-X”propertiesandisunawareofdiffusion[7,14]. (u,v) c c(cid:48) •Proﬁle-drivencommunityvisualization. Holisticmodelingleads wheretheproductover(u,v)istakenoveralltheuserpairshaving torichvisualization–wecannowvisualizenotonlyhowcommuni- adiffusionlink. TooptimizethelikelihoodinEq.2,weshallop- tiesfeaturedistinctcontents(e.g.,whatanITcommunitytweets), timizethediffusionproﬁlep(diffusion|content,c,c(cid:48))’s,aswellas but also how they interact (e.g., how an IT community retweets thecommunityassignmentsp(c|u)’sandp(c(cid:48)|v)’s. others)whichisoftenoverlookedbefore[8,23]. •Heterogeneityofsocialobservations. Socialobservations,espe- Wemaketworemarksabouttheaboveapplications:1)wecom- ciallytheuserlinks(i.e.,friendshiplinksanddiffusionlinks),often pleteonetaskofcommunityproﬁlingtosupportmultipleapplica- tionsatatime,thuscommunityproﬁlingisonlydoneonceofﬂine; 1http://sociallens.adsc.com.sg/ carrydifferentsemantics; e.g., friendshiplinksindicateusercon- •WedevelopascalableinferencealgorithmforCPD,andwefur- nections and diffusion links indicate user interactions. Tradition- therparallelizeitbytakingthedataskewnessintoaccount. ally, weoftentrytoenforceuserconnectionstobedenserwithin eachcommunitythanacrosscommunities[16, 19]. Butindiffu- • We perform extensive experiments to evaluate CPD over large- sion, the “weak ties” theory recognizes that the inter-community scaledatasets,andshowbothitseffectivenessandscalability. interactionsmaynotbeweak[12].E.g.,softwareengineeringcom- munitycitesmorepapersfrommachinelearningcommunitythan 2. RELATEDWORK itselfon“deeplearning.”Thismeanswehavetoseparatethemod- elingofuserconnectionanduserdiffusion. Suchuserlinkhetero- In this section, we review the related work on community de- geneity is largely overlooked in the previous work [32, 35], thus tection and relevant applications, and distinguish the differences howtomodelheterogeneoususerlinkstogetherremainsunclear. between existing work and our community proﬁling model. We • Nonconformity of user behaviors. User behaviors, especially furtherorganizesuchdifferencesinTable4. theirdiffusiondecisions,canhappenformanyreasons.Community- CommunityDetection. Detectingcommunitiesfromvariousnet- levelconformityisjustonereason,thuswehavetoconsiderother works has been extensively studied in the last decade. There ex- factorsaswell. E.g.,somediffusionhappensduetoitstopic(e.g., ist comprehensive surveys [39, 19, 36] on community detection, presidential election) being popular at the moment or its author which review different community detection methods in terms of (e.g.,LadyGaga)beingpreferredasacelebrity. Suchtopicpopu- detectionalgorithms,qualitymeasures,benchmarksandsoon. larityanduserpreferencearetheothertwotypicalnonconformity Conventionally,acommunityisdeﬁnedasagroupofnodes,in factors for diffusion, and we must accommodate them. No prior which intra-group connections are much denser than inter-group workhasexploredbothcommunityfactorandnonconformityfac- ones [11, 38]. The pioneer community detection studies aim to tors[17,25],anditisnotclearhowtobalancethemindiffusion. generate the community membership for each node purely based Our technical novelty is identifying the above challenges and onthelinksamongstthem[19,38]. Theprevalenceofsocialnet- developing a uniﬁed Community Proﬁling and Detection (CPD) works offers a rich collection of user links to use for community model(Sect.3)toaddressthemaccordingly. detection,suchasthefollowershipinTwitter[32],Flickr[30]and • To model the inter-dependency with community detection, we Facebook/Google+[26, 41], theco-authorshipinDBLP[40, 42], proposetotakeanovelproﬁle-awaregenerativeapproach–were- the email exchange [32]. However, most of these existing work alizethedetectionbylatentmembershipvariablesandtheproﬁling only consider one single type of links. There are other different bylatentcommunityproﬁlevariables,whichtogethergeneratethe typesofuserlinks; e.g., userscomment/replyotherusersindigg user friendship links, user content and user diffusion links in the [23],contact/co-contact/co-subscribeotherusersinYouTube[35], network. Thenweinfertheselatentvariablesbymaximizingthe follow/reply/retweetotherusersinTwitter[31].Butthesedifferent likelihood. None of the existing work has taken a proﬁle-aware linkswereoftenmodeledinthesameway.Sofarasweknow,none generative approach– they may use a generative model for com- oftheexistingcommunityworkconsiderstheheterogeneityamong munitydetection[26,31,33],buttheyneverconsiderinternaland userlinks(i.e.,friendshiplinksanddiffusionlinks)aswedo. externalproﬁlestogetherwithdetection. Recentstudiesstarttoexploittherichuserinformation,suchas •Toaddresstheheterogeneityofsocialobservations,wepropose content[32],attribute[33,40],action[23,31],toimprovethede- toseparatethegenerationoffriendshiplinksfromlatentcommu- tection.Consequently,inadditiontocommunitymembership,they nityassignmentsandthegenerationofdiffusionlinksfromlatent alsooccationallyoutputsome“community-X”associations,such proﬁles. In particular, we require that two users are more likely as“community-content”[32],“community-attribute”[33,40]and to share a friendship link if they have similar community assign- “community-action”[23,31].Inourwork,wesimultaneouslydis- ments. Thus maximizing the likelihood of observing the friend- cover communities and characterize them with both internal and shiplinksenforcesintra-communityfriendshiplinkstobedenser externalproﬁles.Althoughsomeformsofinternalcommunitypro- than inter-community ones. In contrast, we use the community ﬁlesmaybeobtainedinsomepriorwork([23,31,32,33,40])as diffusion proﬁles to generate the diffusion links, but we do not theby-products,theexternalproﬁlesaregreatlyoverlooked. require inter-community diffusion strengths to be always smaller Therearesomerecentstudiesonaggregatingeachcommunity’s thanintra-communityones;instead,thediffusionproﬁlesarefreely userpreferencesassomeformofcommunityproﬁles,soastoen- learnedinmaximizingthelikelihoodofthediffusionlinkobserva- ableitemrecommendationtoeachcommunity. Theirworkisdif- tions. ferentfromoursintwoaspects. Ononehand,mostofthesecom- •Toaccommodatethenonconformityofuserbehaviors, wepro- munityrecommendationstudiesaregiventhecommunitiesasin- pose to deﬁne the generative probability of observing a diffusion put [14, 29]. Even though some of them did try to detect com- linkasalogisticfunctionovermultiplefactors,includingthetopic- munities[1,27],theirdeﬁnitionofacommunityisagroupofusers awarecommunitydiffusionproﬁles, thetime-sensitivetopicpop- whosharesimilarpreferencestoarecommendeditem,whichisnot ularities and the individual user preferences. By maximizing the basedonnetworklinksatall.Incontrast,ourcommunityisagroup likelihoodofdiffusionlinkobservations,welearnthediffusionpro- ofdenselyconnectedusers,whosharesimilarinterestsanddiffu- ﬁles,aswellastheweightstocombinethesedifferentfactors. sionbehaviours. Ontheotherhand,theircommunityproﬁleisob- Finally,wedesignascalableinferencealgorithmforCPD(Sect. tainedbyaggregatingtheusers’preferences,whichisusuallybased 4). Asshownlater, ourinferencealgorithmscaleslinearlytothe onaleastmiseryoraggregatevotingapproach.Incontrast,wefor- data set size. We further parallelize our inference algorithm, by malizethecommunityproﬁlesastheprobabilitiesof“community- takingthedataskewnessintoaccount. X”andprobabilitiesof“community-community-X”. Besides,we Wesummarizeourcontributionsasfollows: estimatethesecommunityproﬁlesbyagenerativemodeltogether withcommunitydetection. • We identify a new problem of community proﬁling, which to- getherwithdetectionenablesaholisticmodelingofcommunities. Community-awareApplications.Thecommunityproﬁlesdeepen ourunderstandingofthedetectedcommunitiesandthusbeneﬁta • We identify three unique challenges and design a novel CPD lot of community-level applications. Here we review the related modelforjointcommunityproﬁlinganddetection. Notations Description |U|,|W| Thenumberofusersandwords |C|,|Z| Thenumberofcommunitiesandtopics |F|,|E| Thenumberoffriendshiplinksanddiffusionlinks dui Thei-thdocumentpublishedbyuseru CPD(ours)COLD[17]CRM[15]WTM[37]PMTLM[43]HAM[16]BlackHole[21]topcgo[10]INFEST[25]TopicInﬂuence[24]LADP[22]Inﬂulearner[9]CFF[27]GF[1]GenClus[33]TURCM[31]PMM[35]MetaFac[23]SA-Cluster[42]BAGC[40]CESNA[41]SocialCircle[26]CODICIL[30]SN-LDA[32]MaxFlow[11]Methods EπDWwcFuuuuituuijvi,ikzui TTAAMTThhhhudfeeeerliitfkssceifeenon-utttomdshoomismoffwhinwidauoplonlroicdldinrituidnyksimsnktafrireisdfnobnrsoomuitdcgsmtouindpocmmuounusecbemoenulnvrietmtsenudharetenuntcdddoitoubimtuioystmpoeuirucdsvneoaricstuusieimgsnesmnpteejcnitﬁatfcottiromdueusiteru •• •• ••• • • •• tex θφc MMuullttiinnoommiiaallddiissttrriibbuuttiioonnoovveerrtwoopridcssssppeecciiﬁﬁccttooctoopmicmzunityc t z • • •••• attribu ηναc,,βc(cid:48),zρ TPDrhioreibcpahablriealtimtpyeriotoefrrscsofmormmuonditeylicngdiifnfduisvinidgucaolmdimffuusniiotyncp(cid:48)reofnerteonpciecz te userfeatures user-itempairsuser-itempairs interactionsinteractionuseractions nodefeatureData winogr,kcotomomurutnhirtyeedeifxfaumsipoTlneaabanpldepl2cio:cmaNtimootnuasnt,iitoiynncvsliusduianlgizcaotimonm.uFniirtsytlrya,nfok-r frien cnoitmiemsbuansiteydroannkuinsegr,sm’ionstteroefsetsxiosntinthgesmtu,dii.ees.,[t1o4,ﬁ7n]drtahnekfcaovmoumriute- • • ••••••• •••••••••••d . communities for users. Moreover, the communities to be ranked T link are often already predeﬁned over the networks. In our work, the able1: ••••• ••••• diff.lin ccdooifmmfummsiuuonnniittpiieerossﬁablryeesnbtooottghperttohhveeirid.reTidnhtaeissrntwhaelililcnohpneutltep,nattnhpderoocuoﬁrmlefpsoacanunysd/iasuetxtohteorrranntakol C k om in choosethepromisingcommunitytopromotetheirproducts/papers d asmuchaspossible. Secondly, forcommunitydiffusion, incon- parison • •• ••••• ividualDiffu tmraosdtetlosaorueractotmhemiunndiitvyi-dleuvaellledviefflu[s9io,n22m,o2d5e].lliRnge,cemntolsyt,dthifefruesiaorne withthe ••• • communitysionfactor stoshoivoemenrreltfohasoectuktcodeordisme[st1mot5hgu,aen1ttih7ctie]oe.rns,Bsiainedrdseeiirdvpedirdsei,fudfuaeulnﬁslfinioakecndetoao[tr1ut0irhs]memocoriodstmeshilemningtugoniponiitfcy[v-1ala7ewr]viaoearulne,sdnbdeutiostfspfeuiiics-- r s elate • • • topic pmoupnuiltayrivtiysufaaclitzoartiisonm,iaslstihnoguignh[a15lo].tLofasetf,fbourttsnhoatvtehebeleeanstd,efvoortceodmto- d w to communitydetection,onlyafewofthemfurthervisualizethere- or pic sultstofacilitatethedeepanalysisandsemanticinterpretation. In k. •• • • • • ex [8], theauthors proposea community detectionand visualization tra model, whichdifferentiatestheinnernodesandthebordernodes c t. forvisualizingtheinteractionsbetweencommunities. Whiletheir c o objectiveistodesignalayoutalgorithmforclearlydisplayingthe m m communitiesandtheirinteractions,wefocusondemonstratingthe u ••• ••• •••••••••••••nity topic-awareuserinteractionstrengthsamongthecommunities. d e te 3. JOINTPROFILINGANDDETECTION c t.diffuTasks muIlnattehethfeojloloinwtincogm,wmeuﬁnirtsytpdreoﬁﬁnleinsgomanedkdeeytencotitoionnpsr;otbhleenmw.eTafbolre- s ••••• ••••• io 2summarizesthenotationsusedinthispaper. n p red DEFINITION 1. A social graph is G = (U, D,F,E), where . u ∈ U isauserandd ∈ Disauserpublisheddocument. There c o m are two types of links in G. Fuv ∈ F is a friendship link from m u userutouserv; Eij ∈ E isadiffusionlinkfromdocumentito n • •• ity documentj.Bothtypesoflinksaredirected. p roﬁle Fuo;rFaTwreitpterersneenttwsotrhka,tDusuer⊂uDfoilslotwhessuesteorfvt;wEeetsrpeopsrteesdenbtysuthseart uv ij tweet i is a retweet of tweet j. For a DBLP network, D is the u setofpaperspublishedbyauthoru; F representsthatauthoru uv co-authorswithauthorv;E representsthatpapericitespaperj. ij Toenablecontentmodeling,weﬁrstdeﬁnetopic. Heterogeneous anauthorpublishesapaperondeeplearning,becausesheisfrom user interactions F ρ themachinelearning(ML)community,whichstudiesdeeplearn- uv F ing. Aswedealwithshortdocuments(e.g.,tweetsinTwitterand papertitlesinDBLP)andashortdocumentislikelytobeaboutone singletopic[31,17],weassignonesingletopictoeachdocument η Et c π c ij ui u inourmodel.Secondly,weconsiderauserutopublishadocument C E Joint profiling α and detection β dui oftopicz,whichdiffusesanotheruserv’sdocumentdvj,due ν tobothusers’communityassignmentscui andcvj,aswellasthe θ Cdioffmupsiroenh menosdiveeli ng z w φ caopmapmeurnointysodfiftwfuasrieonrepproosﬁitloerηiecsu,ia,cnvdjzc.itEes.ga.n,oatnhearutahuotrhourpvu’sblpisahpeesr c ui uik z C Wun D Z ondeeplearning,becauseuisfromthesoftwareengineering(SE) u U community,visfromtheMLcommunity,andSEcommunitytends tocitepapersondeeplearningfromtheMLcommunity. Finally, Figure2:GraphicalmodelofCPD. weconsiderauserutoformafriendshiplinkwithauserv,dueto theirsimilarcommunitymembershipsπ andπ . E.g.,anauthor u v uisaco-authorwithanotherauthorv,becausetheyarebothfrom DEFINITION 2. Atopicz∈{1,...,|Z|}isa|W|-dimensional theMLcommunity. multinomialdistributionφ overwords,whereeachdimensionφ z z,w Addressingdataheterogeneity.WemodelthefriendshiplinksF istheprobabilityofawordw∈{1,...,|W|}belongingtoz. andthediffusionlinksEdifferently. Conventionally,agoodcom- Then,wedeﬁnethecommunitymembership,aswellasourcom- munityneedstohavelowconductance,whichmeansthefriendship munitycontentproﬁleanddiffusionproﬁle. linksshouldbedenserinsideacommunitythanoutsideacommu- nity. Speciﬁcally,wedeﬁnetheprobabilityofhavingafriendship DEFINITION 3. A user u’s community membership is a |C|- linkbetweentwousersuandvasasigmoidfunction,parameter- dimensional multinomial distribution π , where each dimension u izedbytheircommunitymembershipsimilarity: π istheprobabilityofubelongingtocommunityc,∀c∈{1,...,|C|}. u,c P(F =1)=σ(πˆTπˆ ), (3) uv u v DEFINITION 4. The content proﬁle of community c is a |Z|- dimensional multinomial distribution θc over topics, where each where πˆu = [πˆu,1,...,πˆu,|C|]T is an estimation of πu based on dimensionθ istheprobabilityofcdiscussingtopicz. theaggregationofu’scommunityassignments. Inothewords,we c,z use πˆ and πˆ , instead of π and π , to generate the F ’s in u v u v uv DEFINITION 5. Thediffusionproﬁleofcommunitycisa|C|× Fig.2. Suchadesignismotivatedby[5,6]tosimplifytheinfer- |Z|-dimensionalmatrixηc,whereeachentryηc,c(cid:48)zistheprobabil- ence.σ(x)=1/(1+e−x)isasigmoidfunction.Themoresimilar ityofcdiffusinganothercommunityc(cid:48)ontopicz. πˆ andπˆ are,themorelikelyF exists. Inotherwords,F is u v uv uv large if u and v are from the same communities. This naturally Takecommunityc1inFig.1asanexample. Asc1’suserspublish enforces denser friendship links within a community than across morecontentonz1andz2,theresultingθc1,z1 andθc1,z2 arebig- communities,thusleadingtolowconductance.Incontrastwiththe ger. Besides,asc1’susersoftenretweet/citethemselvesonz1,the friendship links, the inter-community diffusion is not necessarily resultingηc1,c1z1 isbig. AsmotivatedinSect. 1,weformalizea “weak”[12]. Infact,thecommunity-leveldiffusionstrengthsvary jointproﬁlinganddetectionproblemtosolveinthispaper. overtopics,whichbreakstheassumptionofhavingtomaintainthe lowconductancewithinacommunity. Weneedtoresorttoadif- PROBLEM 1. GivenasocialgraphG =(U,D,F,E),thetask ferentmodelingofdiffusionlinks,aswediscussnext. ofjointcommunityproﬁlinganddetectionistoinfer:1)eachuser u’scommunitymembershipπ ,∀u ∈ U; 2)eachcommunityc’s Accommondatingnonconformity. Differentfactorscanaccount u contentproﬁleθ anddiffusionproﬁleη ,∀c∈{1,...,|C|}. for a diffusion decision. Take Twitter as an example; user u is c c likelytoretweetv’stweetd asheri-thtweetd attimetif: 1) vj ui 3.1 ModelDesign thecommunity-leveldiffusionstrengthbetweenc (thecommu- ui Next,weconcretizeourmodeldesignw.r.t. thethreetechnical nity u belongs to when she generates document dui) and cvj on challengesforcommunityproﬁlingasdiscussedinSect.1.Wewill topiczvjisstrong;2)thetopiczvjofdvjistrendingattimet;3)u laterevaluatehowwellweaddresseachchallengeinSect.6.2. hasanindividualpreferencetoretweetfromv.Thesefactorsshow threetypicalperspectivestomakeadiffusiondecision:community Proﬁle-aware generative model. Community detection aims to perspective(ifacommunityismorelikelytoretweetanothercom- infer a community membership assignment π for each user u u munity),contentperspective(ifatopicismorepopularatthetime) basedonthefriendshiplinksF ’s. Communityproﬁlingaimsto uv and user perspective (if a user is more likely to retweet another inferacontentproﬁleθ andadiffusionproﬁleη foreachcom- c c user).Next,wecharacterizethethreetypicalfactors. munitycbasedonitsmemberusers’publishedcontentD ’sand u diffusionlinksEt ’s. Wecanreinforceproﬁlinganddetection,by •Communitydiffusionpreference: weconsiderauserutodiffuse ij lettingthemleverageeachother’sdata. Asaresult,wewishtoin- anotheruserv ontopicz,ifthecommunitiesofuandv areboth ferasetofcommunity-levellatentvariables,includingπu’s,θc’s interestedinzandtheyoftendiffuseeachotheronz. Denotes ∈ andη ’s,togetherfromalltheobservations(D,F,E). {0,1}asanindicatorforadiffusionlinkinEtohappen.Then,the c Since joint proﬁling and detection is an unsupervised task, we probabilityofhavingadiffusions=1fromutovonzis adoptagenerativeframeworkforourCPDmodel.WedesignCPD asagraphicalmodelinFig.2, whereweusecommunitiestoex- p(s=1,z|u,v)=1 (cid:88)(cid:88)p(s=1|c,c(cid:48),z)p(z|c)p(z|c(cid:48))p(c(cid:48)|v) plainalltheuserobservationsonthenetwork. Firstly,weconsider c c(cid:48) aniutysearssuigtnompeunbtlicsuhiaadnodctuhmeecnotmdmuiunoiftytocpoinctzen,tdpureotﬁoleheθrcucio.mEm.gu.-, ∝2 (cid:88)c(cid:88)c(cid:48)ηc,c(cid:48)zθˆc,zπˆu,cθˆc(cid:48),zπˆv,c(cid:48), (4) where at step 1 we expand p(s = 1,z|u,v) by introducing the 4. SCALABLEMODELINFERENCE communitymembershipp(c|u)andp(c(cid:48)|v),thecommunities’in- WedevelopascalableinferencealgorithmforCPD.Weaimto terestsonthetopicp(z|c)andp(z|c(cid:48)),aswellasthetopic-sensitive inferthetopicassignmentandcommunityassignmentlatentvari- communitydiffusionprobabilityp(s = 1|c,c(cid:48),z). Atstep2, we ables {Z,C} from the observations {W,F,E}, where W is the estimatep(s=1|c,c(cid:48),z)withηc,c(cid:48)z,theprobabilityofcretweet- wordsinD. WeusecollapsedGibbssampling[6,15,32]forthe ing/citingc(cid:48) onz. Besides, weestimatep(c|u)withπˆu,c, which inference. Wealsoestimatethevariationalparameters{π,θ,φ} istheempiricalprobabilityofcommunitycbeingassignedtouser andthemodelparameters{ν,η}byvariationalExpectationMaxi- u; similarly we estimate p(c(cid:48)|v) with πˆv,c(cid:48). Finally we estimate mization(EM)[5,6].Welaterparallelizeourinferencealgorithm. p(z|c)withθˆ ,whichistheempiricalprobabilityoftopicz as- c,z signedtothedocumentsfromc;similarlyweestimatep(z|c(cid:48))with 4.1 CollapsedGibbsSampling θˆc(cid:48),z.Denoteθˆ·,z =[θˆ1,z,...,θˆ|C|,z]T andη¯ =vec([η1,...,η|C|]), To derive the Gibbs sampler, we start with computing the col- wherevec(A)concatenatestherowvectorsinamatrixAtoavec- lapsedposteriordistributionofourmodel: tor. Foradiffusionbetweend andd , whichshares thesame ui vj p(W,F,E,C,Z,f,ν,η¯|ρ,α,β) topicz,wedenotec¯ =vec((πˆ πˆT)◦(θˆ θˆT )),where◦isan (6) ij u v ·,z ·,z =p(C|ρ)p(Z|C,α)p(W|Z,β)p(F|C)p(E|C,η,Z,ν,f), element-wiseproduct.ThenEq.(4)becomesc¯Tη¯. ij wherep(F|C)(abbreviatedasp(F)inthefollowing)istheprob- •Topicpopularity:wemodelthepopularityofatopicataspeciﬁc ability for the friendship links F generated by the communities timestamptasthecountoftopiczatt,whichisdenotedasnt. z C; p(E|C,η¯,Z,ν,f) (abbreviated as p(E) in the following) is theprobabilityforthediffusionlinksE generatedbythecommu- • Individual preference: we model user u’s preference to diffuse informationfromuservwithalinearfunctionνTf ,whereν is nities C. We follow [5] to model observed links only in Eq. 6. uv (cid:81) Thus, we deﬁne p(F) = P(F = 1) and p(E) = aexpaamrapmlee;twere,fcuovnsisidaerfetwatourfeeavteucrteosrffoorruu:a1n)dusve.rTpaokpeuTlawriittyte,rwahsiacnh (cid:81)(i,j)∈Ep(Eitj = 1), where(ut,vis)∈tFhe timeusvtamp of the diffusion is deﬁned as the number of u’s followers divided by that of her link(i,j). followees|Followers(u)|;2)useractiveness,whichisdeﬁnedasthe InthegenerativeprocessofCPDmodel,wemodelbothP(Fuv = |Followees(u)| 1) (step 3.c) and p(Et = 1) (steps 3.d) with sigmoid functions numberofu’sretweetsdividedbythatofhertweets |Retweets(u)|. ij |Tweets(u)| σ(·). Bayesianinferencewithsigmoidfunctionisknownashard, Weextractv’sfeaturesandconcatenatethemwithu’sasf . uv because it is analytically inconvenient to construct a Gibbs sam- Inordertosystematicallycombinethethreediffusionfactors,we plerforthesigmoidfunction[28]. Wearemotivatedbythedata introduceasigmoidfunctiontodeﬁnetheprobabilityofdocument augmentationapproach[2,6],whichintroducesPo´lya-Gammaran- d diffusingdocumentd oftopiczattimestamptas: vj ui domvariablestoderiveanexactmixturerepresentationofthesig- moidfunctionforeasierinference.HenceweintroducetwoPo´lya- p(Et =1|u,v,z,t)=σ(c¯Tη¯+nt +νTf ). (5) ij ij z uv Gammavariablesλandδastheaugmentedvariablesforp(F)and p(E)respectively. Formally,arandomvariablexfollowsaPo´lya- Welearntheparametersη¯andν,sothatweknowhowmucheach Gammadistributionx∼PG(a,b)(a>0,b>0),if factorcontributesinthediffusion. Generativeprocess. WesummarizetheCPDmodel’sgenerative x= 2π12 (cid:80)∞k=1 (k−1/2)2g+kb2/(4π2), processbelow.Denote1 asanall-onevectoroflength(cid:96). (cid:96)×1 whereg ∼ Gamma(a,1)isaGammarandomvariable. Ithas k 1. Foreachtopicz = 1,...,|Z|,drawitsworddistributionfrom beenshownin[28],alogisticfunctioncanberepresentedasamix- aDirichletpriorparameterizedbyβ:φ |β ∼Dir(β1 ); tureofGaussiansw.r.t.aPo´lya-Gammadistribution: z |W|×1 2. Foreachcommunityc=1,...,|C|,drawitstopicdistribution 1 1(cid:90) ∞ fromaDirichletpriorparameterizedbyα:θc|α∼Dir(α1|Z|×1); 1+e−w = 2 ψ(w,x)p(x|1,0)dx, (7) 0 3. Foreachuseru=1,...,|U| w−xw2 whereψ(w,x)=e 2 andx∼PG(1,0).Then,forp(Fuv = (a) Drawhercommunitydistributionπu|ρ∼Dir(ρ1|C|×1); 1)asdeﬁnedinEq.3,wecanintroduceaPo´lya-Gammavariable (b) Forthei-thdocumentduiofuseru λuv ∼PG(1,0),suchthatwegetajointprobability i. Drawacommunityassignmentcui|π∼Multi(πu), p(F =1,λ )= 1ψ(πˆTπˆ ,λ )p(λ |1,0). (8) byu’smultinomialcommunitydistributionπ ; uv uv 2 u v uv uv u ii. Drawatopiczui|c,θ∼Multi(θcui),bycui’smulti- Similarly,forp(Eitj = 1)asdeﬁnedinEq.5,wecanintroducea nomialtopicdistributionθc; Po´lya-Gammavariablesδij ∼PG(1,0),suchthatweget iii. Draw each word w |z,φ ∼ Multi(φ ), ∀k = 1,...,|Wui|,byzui’usikmultinomialworddizsutriibution; p(Eitj =1,δij)= 12ψ(c¯Tijη¯+ntz+νTfuv,δij)p(δij|1,0). (9) (c) Foreachfriendshiplinkfromuserutouserv,drawF |π∼ uv Consideringallthefriendshiplinksanddiffusionlinks,wehave Ber(σ(πˆTπˆ ))byaBernollidistribution(Eq.3); (d) For each duiffvusion link Eitj from document dui to docu- p(F,λ)=(cid:89) ψ(πˆTuπˆv,λuv)p(λuv|1,0), (10) mentd attimet,drawEt |C,η,Z,ν,f ∼Ber(σ(c¯Tη¯+ (u,v)∈F ntz+νvTjfuv))byaBernoiljlidistribution(Eq.5); ij p(E,δ)= (cid:89) ψ(c¯Tijη¯+ntz+νTfuv,δij)p(δij|1,0). (11) (i,j)∈E Instep3.b.iii, sinceshorttextoftenhassingletopic[31, 17], we NextweinferZandC,togetherwithλandδ.Speciﬁcally,aug- sampleallwordsind fromthesametopic-worddistributionφ . mented with two Po´lya-Gamma variables λ and δ, the collapsed ui z posteriordistributionofourmodelbecomes: Weefﬁcientlysampleλ byanalternateexponentiallytiltedJa- uv cobidistribution[28]. p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) •Forδ:theconditionaldistributionofδisalsoPo´lya-Gamma, =p(C|ρ)p(Z|C,α)p(W|Z,β)p(F,λ|C)p(E,δ|C,η¯,Z,ν,f) p(δ |W,F,E,C,Z,f,ν,η,δ) (cid:82) (cid:82) ij = P(C|π)P(π|ρ)dπ· p(Z|C,θ)P(θ|α)dθ· (cid:82) π θ −δij(¯cTijη¯+ntz+νTfuv)2 φP(W|Z,φ)P(φ|β)dφ·p(F,λ)·p(E,δ) ∝e 2 p(δij|1,0) (16) = |(cid:81)U| ∆(ncu+ρ) · |(cid:81)C| ∆(nzc+α) · |(cid:81)Z| ∆(nwz+β) ·p(F,λ)·p(E,δ), =PG(1,c¯Tijη¯+ntz+νTfuv). ∆(ρ) ∆(α) ∆(β) u=1 c=1 z=1 4.2 ModelParameterEstimation (12) where∆(x)= (cid:81)dii=m(1x)Γ(xi).BasedonEq.12,wecaninferZ,C, WeusevariationalEMtoiterativelyestimatethevariationalpa- Γ((cid:80)dii=m(1x)xi) rameters{π,θ,φ}andthemodelparameters{ν,η}: λandδonebyoneasfollows. 1. (E-step)UsethesamplesofcollapsedGibbssamplingtoes- •ForZ:theprobabilityofassigningtopicztoduiis timatetheparametersπ,θandφ,givenν,η. p(z =z|C,Z ,W,F,E,f,ν,η¯,λ,δ) 2. (M-step)OptimizeνandηbymaximizingEq.(6),giventhe ui ¬{ui} parametersπ,θ,φestimatedintheE-step. = p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) p(W,F,E,cui=c,C¬{ui},Z¬{ui},f,ν,η¯,λ,δ,ρ,α,β) In the E-step, the Gibbs sampler iteratively draws samples of Z, ∝(cid:81)|C| ∆(nzc+α) ·(cid:81)|Z| ∆(nwz+β) · C, λandδ byEqs. 13–16. Basedonthesamples, weestimate: c=1 ∆(nzc,¬{ui}+α) z=1 ∆(nwz,¬{ui}+β) π = ncu+ρ ,θ = nzc+α andφ = nwz+β .Inthe p(F|λ,C¬{ui})·p(E|δ,C¬{ui},Z¬{ui}) Mu-,sctep,nw(u·e)+ﬁ|Crs|tρestci,mzatenη(cc·,)c+(cid:48)z|Z’s|αby aggrze,wgatingn(zt·h)+e|Wco|mβmunity nz +α (cid:81)|W| (cid:81)nwui(nw +β+i−1) andtopicassignmentsw.r.tallthedocuments,basedonthelastit- = c,¬{ui} · w=1 i=1 z,¬{ui} · (13) n(c·,)¬{ui}+|Z|α (cid:81)nj=(u·1i)(n(z·,)¬{ui}+|W|β+j−1) earllatoitohnerofvsaarimabplleinsgﬁ.xTehde–nthwiseiessteismseantetiνallbyyﬁmttainxgimaizloinggisEticq.r6egwreitsh- (cid:81) ψ(πˆTπˆ ,λ |C )· sionfunction;tosolveit,werandomlysamplethesameamountof v∈Λu u v uv ¬{ui} (cid:81) ψ(c¯Tη¯+nt +νTf ,δ |C ,Z ), non-observeddiffusionlinksasnegativeinstancesforoptimization. j∈Λi ij z uv ij ¬{ui} ¬{ui} As α and ρ are used to sample the πu’s andθc’s, we followthe where Λu = {v|(u,v) ∈ F or (v,u) ∈ F} is user u’s neigh- convention[13]tosettheirvaluesas50dividedbyπu’sdimension bors in F. Λi = {j|(i,j) ∈ E or (j,i) ∈ E} is document andθc’sdimensionrespectively, i.e., α = 50/|Z|, ρ = 50/|C|. i’s neighbors in E. nz and n(·) denote the number of Asβisusedtosampletheworddistributionφz’sandthenumber c,¬{ui} c,¬{ui} ofwordsislarge,wefollow[13]againtosetβ =0.1. times that topic z is assigned to community c and the number of times that any topic is assigned to c, excluding the current doc- 4.3 Scalability ument dui. Similarly, nwz,¬{ui} and n(z·,)¬{ui} are the number of WesummarizeourinferencealgorithminAlg.1. Insteps3–10, timesthatwordw isassignedtotopicz andthenumberoftimes wetakeanE-stepforcollapsedGibbssampling.Insteps11-14,we thatanywordisassignedtoz,excludingd . nw andn(·)arethe takeanM-stepfortrainingthemodelparameters. ui ui ui numberoftimesthatwordw occursinthedocumentd andthe ui Time complexity. In steps 4–6, as we compute the community numberofwordsind . Finally,ψ(πˆTπˆ ,λ |C )denotes ui u v uv ¬{ui} assignmentsandtopicassignmentsforeachdocumentofeachuser, estimating ψ(πˆTuπˆv,λuv) based on C¬{ui} instead of the whole ittakesO(|D|×|C|+|W|×|Z|). Insteps7–8,aswecompute C;ψ(c¯Tijη¯+ntz+νTfuv,δij|C¬{ui},Z¬{ui})denotesestimating πˆTuπˆv foreachfriendshiplink,ittakesO(|C|×|F|). Insteps9– ψ(c¯Tijη¯+ntz +νTfuv,δij)basedonC¬{ui} andZ¬{ui} instead 10,aswecompute(c¯Tijη¯+ntz+νTfuv)foreachdiffusionlink,it ofthewholeCandZ. takes|C|2×|E|. Insteps11–12,asweaggregatethecommunity assignmentsandtopicassignmentsforeachdiffusionlink,ittakes •ForC:theprobabilityofassigningcommunityctouatd is ui O(|E|).Insteps13–14,aswecomputegradientsforνoverallthe p(cui =c|C¬{ui},Z,W,F,E,f,ν,η¯,λ,δ) diffusionlinks,ittakesO(|E|×T2).Intotal,forT1iterations,the = p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) overallcomplexityisO((|D|×|C|+|W|×|Z|+|C|×|F|+ p(C¬{ui},zui=z,Z¬{ui},W,F,E,f,ν,η¯,λ,δ|ρ,α,β) |C|2×|E|+|E|+|E|×T2)×T1).Aswecansee,Alg.1’stime ∝(cid:81)|U| ∆(ncu+ρ) (cid:81)|C| ∆(nzc+α) · complexityislineartothedatasize(i.e.,|D|,|F|,and|E|). u=1 ∆(nc +ρ) c=1 ∆(nz +α) u,¬{ui} c,¬{ui} Parallelization.WeconsidermultithreadparallelizationofAlg.1. p(F|λ,C¬{ui})·p(E|δ,C¬{ui},Z¬{ui}) (14) Weleavemulti-machineparallelizationasfuturework.Inourvari- = ncu,¬{ui}+ρ · nzc,¬{ui}+α ·(cid:81) ψ(πˆTπˆ ,λ |C ) ationalEMalgorithm, weﬁndtheE-steptakesmuchlongertime n(·) +|C|ρ n(·) +|Z|α v∈Λu u v uv ¬{ui} thantheM-step,because:1)theE-step’scollapsedGibbssampling u,¬{ui} c,¬{ui} ·(cid:81) ψ(c¯Tη¯+nt +νTf ,δ |C ,Z ), hastobedoneiterativelyoveralltheobservations,includingdoc- j∈Λi ij z uv ij ¬{ui} ¬{ui} uments (thus words), friendship links and diffusion links; 2) the wherenc andn(·) arethenumberofdocumentsfrom M-step’smodelparameterestimationiscomparativelymucheas- u,¬{ui} u,¬{ui} ier, since optimizing ν is basically solving logistic regression on useruthatareassignedtocommunitycandthenumberofdocu- thediffusionlinks(andthesameamountofnegativelinks)andηis mentsfromuseruexcludingd ,respectively. ui donebysimplyaggregatingtheexistingcommunityandtopicas- •Forλ:theconditionaldistributionofλisPo´lya-Gamma,i.e., signments.Thus,inthispaperwefocusonparallelizingtheE-step. •Segmentingdatatoreduceinter-dependency.RecallinSect.4.1, p(λ |W,F,E,C,Z,f,ν,η,δ) uv thesamplingrequirescomputing:1)anumberofcounters,includ- ∝e−λuv(πˆ2Tuπˆv)2p(λuv|1,0)=PG(1,πˆTuπˆv). (15) ing the community-topic counter nzc, the topic-word counter nwz, Algorithm1ScalableinferenceforCPD document and each link, based on a serial implementation of the Input: UsersU,docsD,friendshiplinksF,diffusionlinksE; samplingalgorithmoverallthedata.Then,basedonthenumberof Output: TopicassignmentsZ,communityassignmentsC,model documentsandlinksauserhas,weestimatetheaverageworkload parametersνandη; ofprocessingthatuser. Finally,wesumuptheaverageworkload ofalltheusersinthei-thdatasegmentaso . 1: Initializeν,η,α,β,ρ; i Weevaluateourinferencealgorithm’sefﬁciencyinSect.6.4. 2: foriter=1:T do 1 3: foreachuseru∈U do 4: foreachdocumentd ∈D do 5. APPLICATIONS ui u 5: SampleatopiclabelzuiaccordingtoEq.13; We concretize how to enable the following three community- 6: SampleacommunitylabelcuiaccordingtoEq.14; aware applications based on ﬁve CPD outputs, including: 1) the 7: foreachfriendshiplink(u,v)∈F do communityassignmentforusersπ ’s;2)thecommunitycontent u,c 8: SampleaugmentedvariableλuvaccordingtoEq.15; proﬁle θc,z’s; 3) the community diffusion proﬁle ηc,c(cid:48)z’s; 4) the 9: foreachdiffusionlink(i,j)∈Edo topicassignmentforwordsφw,z’s;5)theindividualdiffusionpref- 10: Sampleaugmentedvariableδ accordingtoEq.16; erenceparametersν. ij 11: foreachdiffusionlink(i,j)∈Edo Community-awarediffusion.Giveninputofadocumentd pub- vj 12: Updateηcui,cvjzui byaggregatingc(cid:48)uis,cvj’sandzui’s; lishedbyuserv,weoutputtheprobabilitythatanotheruseruwill 13: forsubiter=1:T2do publishadocumentduitoretweetorcitedvj attimestamptas 14: GradientdescentforνoverthediffusionlinksE. p(Et =1|u,v,d ,t)=1 (cid:80) p(Et =1|u,v,z,t)p(z|d ) ij vj z ij vj theuser-communitycounterncu;2)anumberoflinkprobabilities, =2 (cid:80)zσ((cid:80)c(cid:80)c(cid:48)πu,cθc,zηc,c(cid:48)zπv,c(cid:48)θc(cid:48),z+ntz+νTfuv)p((z1|8d)vj). includingthefriendshiponeψ(πˆTuπˆv,λuv)andthediffusionone whereatstep1weexpandp(Eitj =1|u,v,dvj,t)bythetopicsof ψ(c¯Tijη¯+ntz+νTfuv,δij).Amongthesecomputations,bothtopic dvj.Atstep2,wepluginthedeﬁnitionofp(Eitj =1|u,v,z,t)by andcommunityassignmentsareappliedtodocuments(thustheir Eq.5.Aswecansee,Eq.18comprehensivelymodelsthediffusion users), the friendship link probability is applied to users, and the bytakingthecommunityassignmentsπ,thecommunityproﬁlesθ diffusion link probability is applied to two documents (thus their andη,andtheindividualdiffusionpreferenceνintoaccount. users). Therefore,exceptthewordtopicassignment,thevastma- Proﬁle-drivencommunityranking. Giveninputofaqueryq ∈ jorityofcomputationsaredoneonusersanddocuments.Thismo- Wk(∀k≥1),weoutputtherankingofcommunitiesbasedontheir tivatesustosegmentthedatabyusersanddocuments,sothatdif- probabilitiestodiffuseinformationaboutq.Denotetheprobability ferentthreadscanworkondifferentdatasegmentswithlittleinter- ofacommunityctogenerateadiffusionlinks=1ofqueryqas dependency. It may be possible to take words into consideration for the data segment as well, but it is not obvious and we leave p(s=1|c,q)=1 (cid:80) (cid:80) p(s=1|c,c(cid:48),z)p(z|q,c(cid:48))p(c(cid:48)|q) itforfuturework. Consideringthatauseroftenhasmanydocu- z c(cid:48) ments(especiallyinTwitter),wedesigntwoguidelinestosegment ∝2 (cid:80)z(cid:80)c(cid:48)ηc,c(cid:48)zp(z|q,c(cid:48))∝3 (cid:80)z(cid:80)c(cid:48)ηc,c(cid:48)zθc(cid:48)z(cid:81)w∈qφz,w, thedatabyusersanddocuments: 1)wekeepauser’sdocuments (19) inthesamedatasegment,becauseotherwisetherearelikelymany whereatstep1weexpandp(s = 1|c,q)bythecommunitydiffu- conﬂictingupdatesaboutthesameuserfrommultiplethreads; 2) sionproﬁlep(s = 1|c,c(cid:48),z),thetopicassignmentforqinacom- wepreferkeepingthesame-topicdocumentsinthesamedataseg- munityp(z|q,c(cid:48))andtheprobabilitythatqisfromthatcommunity ment,becauseithelpstoreducetheconﬂictingupdatesaboutthe p(c(cid:48)|q). Atstep2wepluginthedeﬁnitionofp(s = 1|c,c(cid:48),z) ∝ same topic from multiple threads. Overall, we ﬁrst run LDA [3] ηc,c(cid:48)z andconsiderqcancomefromanycommunitywithp(c(cid:48)|q) onalltheusers’documentswith|Z|topics; thenwepartitionthe uniformly. At step 3, we estimate the probability p(z|q,c(cid:48)) in a usersinto|Z|segments,basedoneachuser’smostfrequentlyas- similar way as Eq. 13. We skip the details but explain the ratio- signedtopicinherdocuments. Ineachsegment,eachuserhasher nalofthisestimation: p(z|q,c(cid:48))isproportionaltotheprobability documents,relatedfriendshiplinksanddiffusionlinks. ofcommunityc(cid:48)generatingtopicz(i.e.,capturedbyθc(cid:48),z)andthe (cid:81) • Distributing workload to avoid data skewness. We aim to dis- probabilityofqbelongingtotopicz(i.e.,capturedby w∈qφz,w). tributethe|Z|datasegmentstoM threads,suchthattheworkload Proﬁle-drivencommunityvisualization. Wecanvisualizeeach on each thread is balanced. Note that M is set as the number of community’scontentproﬁleanditsdiffusionproﬁle, asFig.1(b) physicalCPUcoresinthiswork. Ourapproachistoﬁrstestimate shows.Inparticular,weareinterestedinthediffusionvisualization, theworkloadofeachdatasegment,andthencastthissegmentallo- as it is new. In our experiments, we visualize how a community cationtaskassolvingMstandard0-1knapsackproblems2.Denote interactswiththeothersintwotypicalsettings: 1)diffusionona tahlletih-ethddaatatasseeggmmeenntts’siswOork=loa(cid:80)daMi=s1oioi∈. DRe+n,othteusatbhienwaroyrkinlodaicdaftoorr tsopecc(cid:48)iﬁucndtoepritco,pwichezr;e2w)ediuffsuesiηocn,c(cid:48)wzitahsttohpeicdiaffgugsrieognastitorenn,gwthhefrroemwce asxi ∈{0,1}.Thenforeachthread,wesolve use(cid:80)zηc,c(cid:48)zasthediffusionstrengthfromctoc(cid:48). max (cid:80)M o x , s.t. (cid:80)M o x ≤ 1 O, (17) i=1 i i i=1 i i M 6. EXPERIMENTS whichtriestoﬁndasubsetofthedatasegmentstohaveascloseto WetestCPDwithtwolarge-scalereal-worlddatasets. Wede- O workloadaspossible. Onecanﬁnetunetheobjectivefunction M sign experiments to: 1) evaluate how well we address each chal- of Eq. 17 in practice to best allocate the data segments for even lengelistedinSect.1;2)evaluateCPD’sperformance,bycompar- workloadamongthethreads.Weestimateeachworkloado asfol- i ingwiththestate-of-the-artbaselinesindifferentapplications. lows.Firstofall,weestimatetheaverageprocessingtimeforeach 6.1 SetUp 2https://en.wikipedia.org/wiki/Knapsack_problem #(user) #(friend.link) #(diff.link) #(doc.) #(word) [17]todetectthecommunities,andfurtheraggregatetheuserob- Twitter 137,325 3,589,811 992,522 39,952,379 2,316,020 servations in each detected community as the proﬁles. After ap- DBLP 916,907 3,063,186 10,210,652 4,121,213 330,334 plyingCRMandCOLD,wegetthecommunityassignmentprob- Table3:Datasetstatistics. abilitiesforeachuserutoeachcommunityc,whichwedenoteas π∗ ’s. Togetaggregatedcontentproﬁle,weﬁrstrunLDA[3]on u,c Data Diffusionfactors Tasks alltheusers’documentswith|Z|topics,andforeachuseru’si-th Methods textfriend diff indiv-commtopic topic comm diff comm documentd ,wegetits|Z|-dimensionalmultinomialtopicdistri- links links idual extract detect predproﬁle butionasθ∗ui . Denotecommunityc’saggregatedcontentproﬁle PMTLM[43] • • • • • du,i WTM[37] • • • • • asθ∗c.Wehave CCORLMD[[1157]] • • •• • •• • •• •• θ∗ =(cid:80)|U| π∗ (cid:80)|Du| θ∗du,i. (20) Ours • • • • • • • • • • c u=1 u,c i=1 |Du| To get aggregated diffusion proﬁle, we aggregate each diffusion Table4:Differenceswithbaselines. linkbetweend andd inEw.r.t. theirusers’communitieson u,i v,j atopicz. Denoteaggregateddiffusionproﬁleη∗ astheproba- c,c(cid:48)z bilityofcommunitycdiffusingcommunityc(cid:48)ontopicz.Thenwe WedoexperimentsonLinuxcomputersequippedwithIntel(R) have 3.50GHzCPUsand16GBRAMs. Wedo10-foldcrossvalidation andreportaveragescoresforallthequantitativeresults. Wealso ηc∗,c(cid:48)z ∝(cid:80)(i,j)∈Eπu∗,cπv∗,c(cid:48)θd∗u,i,zθd∗v,j,z. (21) reportsigniﬁcanttestresultswhenevernecessary. Inall,weobtaintwomorebaselines,whichimplementthestraight- Data sets. We use two public data sets: Twitter [20] and DBLP forward“ﬁrstdetection,thenaggregation”proﬁlingapproach. [34]. TheTwitterdatasetwascollectedinMay2011. TheDBLP • CRM+Agg. It uses CRM [15] to detect communities; then it datasetcontainsthepublicationsindexedbyDBLP3from1936to usesEq.20andEq.21foruseraggregationtogetthecommunity 2010. Wepre-processedthetweetsandthepapertitles,byremov- contentproﬁlesanddiffusionproﬁles,respectively. ingstopwords,stemmingandPOStagging4. Weonlykeptnouns, •COLD+Agg.ItusesCOLD[17]todetectcommunities,thensim- verbsandhashtags.Afterthat,weremovethedocumentswithless ilarlyusesEq.20andEq.21togetcontentanddiffusionproﬁles. thantwowords,andthenremovetheuserswithnodocument. Ta- WecomparewithbothCRM+AggandCOLD+Aggindiffusion ble3summarizesthestatisticsofourdatasetsafterpre-processing. linkpredictionandcommunityranking. Baselines.Wechoosebaselinesbasedonthefollowingguidelines: Evaluation. Sincewejointlyproﬁleanddetectcommunities,we 1)theyarethestateofthearttomodelheterogeneoususerobser- willevaluatethequalityofcommunitydetectionandproﬁles. vationsatthedatalevel;2)theymodeldiffusionpredictionatthe •Detectionquality.Weconsidertwowaystoevaluatetheresulting tasklevel;3)preferablytheymodelcommunity.Finally,wechoose communities:1)howdensetheyare;2)howwelltheycanbeused fourbaselinesbelow,andlistourdifferenceswiththeminTable4. to explain the friendship link observations. For 1), we use con- •PoissonMixed-TopicLinkModel(PMTLM)[43]. Itmodelsthe ductance[18,19]asthemetric. Asourcommunityassignmentis documentnetworkandusesthedocumenttopicassignmenttogen- probabilistic,wefollow[17]toleteachuserbelongtohertopﬁve erate the links. We adapt PMTLM for community detection and communitiesinconductanceevaluation(andalsolatercommunity friendshiplinkpredictioncomparison,byaggregatingthetopicas- rankingevaluation).Thesmallerconductanceis,thebetter.For2), signmentsofeachuser’sdocumentsasthecommunitymembership wefollow[17]todesignalinkpredictiontask,whereweuseEq.3 forthatuser. WealsocomparewithPMTLMondiffusionpredic- topredictwhethertoobserveafriendshiplinkbasedontwousers’ tion,asitalsomodelsthedocumentlinks. communities. Asthereisnopredeﬁnedthresholdforlinkpredic- tion,weuseAUC(AreaUnderthereceiveroperatingcharacteristic •WhomtoMention(WTM)[37].Itmodelstheuserdiffusionlinks Curve) [17, 15] as the metric. Given a ranking of non-observed withusercontentandfriendship.Itdoesnotmodelcommunity.We links,wecalculatetheAUCscoreastheprobabilityofarandomly comparewithWTMondiffusionprediction. chosentruepositivelinkbeingrankedhigherthanarandomlycho- •CommunityRoleModel(CRM)[15]. Itmodelsfriendshiplinks sen true negative link. In the 10-fold cross validation, each time anddiffusionlinksbasedontheuser’scommunityassignmentand weuse10%ofthepositivelinksandsamplethesameamountof roleassignmenttogether. WecomparewithCRMoncommunity negativelinkstocalculateAUC.ThehigherAUCis,thebetter. detection,friendshiplinkpredictionanddiffusionprediction. •Proﬁlequality.Duetolackofgroundtruth,wegenerallyevaluate •COmmunityLevelDiffusion(COLD)[17]. Itmodelsthecontent thecontentanddiffusionproﬁles’qualitythroughtheapplications in Sect. 5. For community-aware diffusion, as there is no prede- and diffusion links based on communities. Thus it is the closest ﬁnedthresholdfordiffusionlinkprediction,weagainuseAUCas worktoours. Butitmodelsneitherfriendshiplinksincommunity the evaluation metric. For proﬁle-driven community ranking, as detection,norindividualfactorandtopicfactorindiffusionpredic- thecommunitiesdetectedbydifferentalgorithmsaredifferent,for tion.WecomparewithCOLDoncommunitydetection,friendship faircomparison,weevaluatethequalityofeachrankedcommunity linkpredictionanddiffusionprediction.AsCOLDhascommunity intermsofitsusers–givenaqueryq, wecheckhowmanyusers diffusionstrength,wealsocompareitoncommunityranking. Inadditiontotheaboveexistingbaselines,wealsodesignsome inthetopK rankedcommunitiesreallyretweet(orcite)aboutq. Thennaturallywecomputeprecisionandrecallforeachcommu- morebaselinestovalidatethatwearebetterthanastraightforward nity in the ranking list. Denote the users who mention q in their communityproﬁlingapproachof“ﬁrstdetectingcommunities,then retweets(orcitationpapertitles)asU∗. Denotetheusersbelong- aggregatingeachcommunity’suserobservations”.Speciﬁcally,we q ingtoanyofthetopK communitiesasU . Theprecisionofthe adopt the two state-of-the-art algorithms, CRM [15] and COLD K topK communitiesforqueryq isP(K,q) = |U∗∩U |/|U |, q K K 3http://dblp.uni-trier.de/ and the recall is R(K,q) = |U∗ ∩U |/|U∗|. We deﬁne mean q K q 4http://nlp.stanford.edu/software/tagger.shtml average precision (MAP) over all the queries as MAP@K = ecnatcudnoC 0000....00007689....51555697820 50 NNOoou r HJsoeitne1tr 0Mo0goedneeliitnyg 150 )CUA( noitciderP kniL 000000......00666767..2684426720 NNOoou rHJso5ei0tnetr Mogoedneeliitnyg100 150 )CUA( noitciderP noisuffiD 000000......00788887..6246888920 NNOouo rJHs5oe0intetr Mogoedneeliitnyg100 150 ecnatcudnoC 000000......34567844444420 50 NNOouo r JHsoeinte1tr 0Mo0goedneeliitnyg 150 Number of Communities Number of Communities Number of Communities Number of Communities (a) Communitydetect.(Twitter) (b)Friendshiplinkpred.(Twitter) (c) Diffusionlinkpred.(Twitter) (d) Communitydetect.(DBLP) )CUA( noitciderP kniL 000000......79678872722720 50 NNOoou r HJsoeint1et0r Mo0goedneeliitnyg 150 )CUA( noitciderP noisuffiD 00000.....756898888820 NNOouo rJHso5ein0tetr Mogoedneeliitnyg100 150 )CUA( noitciderP noisuffiD 0000....7788272720 NNOouo rTIsn5od0pivicidual & T1o0p0ic 150 )CUA( noitciderP noisuffiD 0000....8899161620 50NNOoou rITsnodpivicid1u0a0l & Topic 150 Number of Communities Number of Communities Number of Communities Number of Communities (e) Friendshiplinkpred.(DBLP) (f) Diffusionlinkpred.(DBLP) (g) Diffusionlinkpred.(Twitter) (h) Diffusionlinkpred.(DBLP) Figure3:Studyofourmodeldesign. (cid:80)MmhbiniaegstqAahoe(ndepR(cid:80)rioac@MnKivm=KetA1rooapFdP=giece(ilssii(cid:80)n,,F,gwq1tq)he[/(e3a(cid:80)aKs]dbtoM)eKio=/tpte|t1eAQvorRa.Fn|l(ue@Iaiane,ntKxqdeat)rid/mtads=Kewiqtai)iuond/2ane×|MlQa,liMytvAya|e.u.sAPrsEaFP@tehgfid@KfneeeKam+cclr×tlMeeoiytvMcn,rAeaitwcAelRlylneR(@,tp(@pdKMepeeKrrrﬁApop.lnﬁlReeelx)xTeitithhatyiyeess) )CUA( noitciderP noisuffiD 00000.....456785555520 WCCOO5T0LLMDD+Agg 10CCO0RRurMMs+Agg150 )CUA( noitciderP noisuffiD 000000......56894755555520 PCCMOOLLTDDL5M+0Agg CCO1RRu0rMMs0+Agg 150 of a content proﬁle measures how well it generates the user con- Number of Communities Number of Communities (a) Twitter (b) DBLP tentobservations,andweusethesamedeﬁnitionofperplexityas in[17].Thelowerperplexityis,thebetter. Figure4:Resultsofcommunity-awarediffusion. 6.2 ModelDesign DBLP respectively; the topic factor is able to contribute another Wewanttoevaluatehowwellweaddresseachcommunitypro- 3.6%and10.5%absoluteAUCimprovementoneachdataset. ﬁlingchallengeasintroducedinSect.1. Toachievethisgoal,we In all, we conclude that our model design well addresses the designsomebaselinesbasedonthedegeneratedversionsofCPD, threechallengesincommunityproﬁling. for validating the advantages of our model design. We compare 6.3 ComparisonwithBaselines CPDwiththesebaselines,andevaluatethequalityofdetectedcom- munities and proﬁles through three tasks: community detection, WeevaluateCPDandthebaselinesonvariousapplications. friendshiplinkpredictionanddiffusionlinkprediction. 6.3.1 Community-awareDiffusion • Modeling the inter-dependency with community detection. We Quantitativeanalysis. Inﬁgure4,wesummarizetheresultcom- designabaseline“nojointmodeling”,whereweﬁrstdetectcom- parisonwiththebaselinesintroducedinSect.6.1. PMLTMisnot munitiesonlyfromthefriendshiplinksthroughagenerativemodel applicabletoTwitter,sinceitisdesignedsolelyforcitationnetwork– byEq.3,thenweextracttheproﬁlesthroughagenerativemodelas it predicts a citation based on the similarity between two docu- inCPDexcepthavingthecommunitiesﬁxed.AsshowninFigures ments,butinTwitteratweetanditsretweetarealmostidentical.As 3(a)–3(f),oursisalwaysbetterthan“nojointmodeling”. showninFig.4,ourmodelconsistentlyoutperformsallthebase- •Addressingtheheterogeneityofsocialobservations.Wedesigna lines,thanksto: 1)ourmodelingvariousdiffusionfactorsandhet- baseline“noheterogeneity”,whereweadaptCPDtomodelfriend- erogeneoususerlinks,incontrastwiththebaselinesinTable4;2) shiplinksanddiffusionlinksinthesamewaybyEq. (3),butkeep ourjointdetectionandproﬁling,incontrastwiththetwo“ﬁrstde- theotherpartsofCPDmodelingunchanged. AsshowninFigures tectionthenaggregation”baselines. When|C|=100,weachieve 3(a)-3(f),oursisbetterthan“noheterogeneity”ondiffusionpre- 24.2%–91.6%and5.1%–108.0%relativeAUCimprovementsthan diction,andcomparablewithitoncommunitydetectionandfriend- thebaselinesinTwitterandDBLP,respectively.Theimprovements shiplinkprediction.Thisimplies:1)diffusionlinksandfriendship arestatisticallysigniﬁcantoverthe10-foldcrossvalidationresults, links are different, and diffusion links require more sophisticated withstudent’st-testone-tailedp-valuep<0.01. modeling than friendship links; 2) friendship links and diffusion Casestudy. WeexaminethethreediffusionfactorsinEq.5with linksarecorrelated;diffusionlinksdonotsigniﬁcantlychangethe theDBLPdata. Firstly,inFig.5(a)weplotthenumberofpapers communitystructureoncethefriendshiplinksaregiven. ausercitesw.r.t.heractiveness,andthenumberofcitationsauser •Accommodatethenonconformityofuserbehaviors. Wedesign hasw.r.t.herpopularity.Useractivenessandpopularityaredeﬁned two baselines: 1) “no individual & topic”, where we exclude the inSect.3.1. Generally, themoreactiveauseris(i.e., publishing individualfactorandtopicfactorfromEq.5inCPD;2)“notopic”, morepapers),themorepapersshecites;besides,themorepopular where we exclude only the topic factor from Eq. 5 in CPD. As auseris(i.e.,amoreestablishedresearcher),themorecitationsher showninﬁgures3(g)and3(h),theindividualfactorisabletocon- papersget. Thisobservationsupportsourdesignofmodelingboth tribute4.8%and6.8%absoluteAUCimprovementonTwitterand useractivenessandpopularityastheindividualfactorsindiffusion.

From Community Detection to Community Profiling PDF

2.8 MB

by Hongyun Cai

#journals #arxiv

Checking for file health...

Preview From Community Detection to Community Profiling

From Community Detection to Community Proﬁling Hongyun Cai†‡, Vincent W. Zheng†, Fanwei Zhu#, Kevin Chen-Chuan Chang(cid:5), Zi Huang‡ † AdvancedDigitalSciencesCenter,Singapore ‡ SchoolofITEE,TheUniversityofQueensland,Australia # ZhejiangUniversityCityCollege,China (cid:5) UniversityofIllinoisatUrbana-Champaign,USA {hongyun.c,vincent.zheng}@adsc.com.sg,[email protected],[email protected],[email protected] ABSTRACT Inthispaper,wefortheﬁrsttimeformalizetheconceptof“com- munityproﬁle”.Weasktwofundamentalquestions: 7 Mostexistingcommunity-relatedstudiesfocusondetection,which 1 aim to ﬁnd the community membership for each user from user •Whatisacommunityproﬁle?Byname,theproﬁleshouldcharac- 0 friendshiplinks. However,membershipalone,withoutacomplete terizeacommunity,bothinternally(i.e.,whatitis)andexternally 2 proﬁleofwhatacommunityisandhowitinteractswithothercom- (i.e., how it interacts with others). Since a community is an ag- munities, has limited applications. This motivates us to consider gregationofusers,itsproﬁleisessentiallyanaggregationofuser n a systematically proﬁling the communities and thereby developing information. DenoteX assometypeofuserinformation. Toac- J usefulcommunity-levelapplications. Inthispaper,wefortheﬁrst commodateuncertaintyinX,wedeﬁneaninternalproﬁleasprob- timeformalizetheconceptofcommunityproﬁling. Withrichuser abilitiesof“community-X”,andanexternalproﬁleasprobabilities 7 information on the network, such as user published content and of“community-community-X”. HerewefocusonX ascontent, 1 userdiffusionlinks,wecharacterizeacommunityintermsofboth which is the primary user information in many social networks. ] itsinternalcontentproﬁleandexternaldiffusionproﬁle. Thedifﬁ- E.g., in Twitter, users write tweets and retweet from others; in I cultyofcommunityproﬁlingisoftenunderestimated. Wenovelly DBLP,authorspublishpapersandcitepapersfromothers.Wecall S identify three unique challenges and propose a joint Community the probabilities of “community-content” as content proﬁle (i.e., . s ProﬁlingandDetection(CPD)modeltoaddressthemaccordingly. whatacommunityisabout),andthoseof“community-community- c Wealsocontributeascalableinferencealgorithm,whichscaleslin- content”asdiffusionproﬁle(i.e.,howacommunitydiffusescertain [ earlywiththedatasizeanditiseasilyparallelizable. Weevaluate content with another). Other types of X’s may exist in different 1 CPDonlarge-scalereal-worlddatasets,andshowthatitissigniﬁ- networks,e.g.,attributesinFacebook. Thus,“communityproﬁle” v cantlybetterthanthestate-of-the-artbaselinesinvarioustasks. isaﬂexibleconcept.WeleaveothertypesofX’sasfuturework. 8 •Isacommunityproﬁlegood?Duetonetworkhomophily,usersin 2 1. INTRODUCTION thesamecommunitytendtohavesimilarbehaviors. Thus,acom- 5 Thankstothepioneerstudiesoncommunitydetection[19,38], munity’sproﬁleshouldexplainthecommonbehaviorsofitsusers. 4 0 we have been able to model a community in terms of its mem- Inotherwords,wedonotseeany“community-content”distribu- . berusers. Suchcommunitymembershipassistsustobetterunder- tion as a good content proﬁle; instead, only that well explaining 1 standthenetworkstructure. However,membershipalone,without theobservationsofusercontentasgeneratedbythecommunities 0 knowingwhatacommunityisandhowitinteractswithothers,has isa“good”one. Analogously, onlythe“community-community- 7 onlylimitedapplications–e.g.,wecannotrankcommunitiesbyde- content”distributionthatwellexplainstheobservationsofuser-to- 1 siredcharacteristics,exploitinter-communitydiffusions,andvisu- usercontentdiffusionasgeneratedbythecommunitiesisa“good” v: alize communities and their interactions. With this critical lack- diffusionproﬁle. Thisqualitycriterionwilllaterguideustoesti- i ingofcommunity“understanding”,thispaperproposessystematic mate the proﬁles accurately. It is also a key to differentiating us X communityproﬁling–tocharacterizetheintrinsicnatureandextrin- fromotherwork–somepriorattemptsimplyaggregatesuserinfor- r sicbehaviorofacommunity–therebyenablingusefulcommunity- mationtooutputcommunityproperties(mostlyinternalones)[14], a level applications. As social networks increasingly capture more butitdoesnotrequiresuchpropertiestobestexplaintheobserva- andricheruserinformation,itisnowfeasibletoproﬁlecommuni- tions of user behaviors as generated by the communities through ties. E.g.,beyondtraditionalfriendshiplinkswhichconnectusers them. onasocialnetwork,therearealsousers’attributes,publishedcon- Weconsider“communityproﬁling”asanewproblemtosolve tent, diffused content and so on. We can leverage such rich user duetothreereasons.Firstly,communityproﬁlingisdifferentfrom datatoestimatethecommunityproﬁles. community detection, because detection focuses on getting com- munitymembershipforeachuser,whereasproﬁlingfocusesonget- tingthe“community-content”and“community-community-content” probabilities.Secondly,communityproﬁlehasneverbeendeﬁned. Some recent work exploits rich user information [15, 30, 32, 33, This work is licensed under the Creative Commons Attribution- 40,41,42]toimprovecommunitydetectionandoutputsomeuser NonCommercial-NoDerivatives4.0InternationalLicense. Toviewacopy informationaggregationasaby-product. Buttheyneitherdeﬁnea ofthislicense,visithttp://creativecommons.org/licenses/by-nc-nd/4.0/.For communityproﬁlebothinternallyandexternally, nortrytoiden- anyusebeyondthosecoveredbythislicense,obtainpermissionbyemailing tifynewapplicationsthatcommunityproﬁlesenable. Finally,the [email protected]. difﬁcultyofcommunityproﬁlingisoftenunderestimated. Aswe ProceedingsoftheVLDBEndowment,Vol.10,No.6 Copyright2017VLDBEndowment2150-8097/17/02. shall discuss soon, due to the inter-dependency with community Friendship link Diffusion n Friendship link Diffusion link Community c Community-aware diffusion Document link Profiling & Detectio (Sect. 3.2) caossmigmDnumoneictnuyt mentc 1 c2 Taosspiigcn meCnDto ifncftu1e sCCniotoo npmmc rp2mmo rTTTfouuooiolfnnpppeil iiiiiecottccc3y y f o zzz ccfc213 1 2 13c plication Prediction (Sect. 5) Q(“ cePiPuo.gremh.or,oy mfn iqeule” n )-i dtiersi:v ce1n, c c2Qt ooupmeiWcrmysi l ul rneittwyTtec ocero1o t dap? m>i n Kf mf kuc cui2son ne m> igqt i m ecsu3: nci2t,i ecs3 nt c z1 z2 z3 1 Ap Profile-driven community visualization Joi 3 Diffusion Diffusion on a specific with topic c1->c1 c1->c2 c1->c3 topic aggregation (a) Problem input (Sect. 3.1) (b) Problem output after inference (Sect. 4) (c) Applications (Sect. 6) Figure1:Theframeworkofjointcommunityproﬁlinganddetection. detection, the heterogeneity of social observations and the non- 2)webuildaninteractivesystem1forproﬁle-drivencommunityvi- conformity of user behaviors, ﬁnding a good community proﬁle sualization and ranking, which for the ﬁrst time allows people to ischallenging. Noneoftheexistingworkhaseveridentiﬁedand freelybrowsethecommunitiesbybothcontentanddiffusion[4]. addressedsuchchallenges(morediscussionsinSect.2). Thedifﬁcultyofcommunityproﬁlingisoftenlargelyunderesti- Ourgoalistoinfercontentproﬁleanddiffusionproﬁleforeach mated;asweshalldiscussnext,thereexistmanychallenges: community, andultimatelyenablenewapplications. InFig.1(a), • Inter-dependency with community detection. A straightforward we show the input for community proﬁling: a set of users, each approachofcommunityproﬁlingistoﬁrstdetectcommunitiesand ofwhompublishesdocuments; usersareconnectedbyfriendship thenaggregateeachcommunity’suserobservationsastheproﬁles. links,andinteractwitheachotherbydiffusionlinks.E.g.,inTwit- However,becausethisapproachdoesnottryto“bestexplain”the ter, each user posts tweets, users are connected by followership user observations as generated by the communities through their links,andtheyretweeteachothertodiffuseinformation.InFig.1(b), proﬁles,itisoftensuboptimal.Takecontentproﬁleasanexample. foreachcommunity,weoutput:acontentproﬁle(e.g.,community Denoteauserasuandacommunityasc.Forsimplicity,wedenote c1 tendstopublishtopicsz1 andz2)andadiffusionproﬁle(e.g., c’scontentproﬁleasp(content|c)andthelikelihoodofu’scontent c1tendstodiffuseitselfandc2onz1).InFig.1(c),weenablethree asp(content|u). Tobestexplaintheusercontentasgeneratedby new applications as follows (novelty to be discussed in Sect. 2, thecommunitiesthroughtheircontentproﬁles,weeffectivelysolve applicationstobeconcretizedinSect.5andevaluatedinSect.6): (cid:81) (cid:81) (cid:80) • Community-aware diffusion. As community proﬁles aggregate max up(content|u)= u cp(content|c)p(c|u), (1) user behaviors, we can use them to more robustly model the dif- where p(c|u) is the probability of user u assigned to community fusioninacommunitylevel,ratherthananindividuallevel[9,22, c. Ideally, to optimize Eq. 1, we shall optimize both the proﬁle 25]. E.g.,wecanexplainaretweethappensasoneuser’scommu- p(content|c)’sandthecommunityassignmentp(c|u)’s. Butinthe nitiesoftenretweettheother’sonacertaintopic.Weacknowledge straightforwardapproach,thedetectionﬁrstﬁxesthep(c|u)’s,then diffusionasacomplexdecision–beyondcommunityproﬁles,there thebestresultthisaggregationcanreturnisthep(content|c)’sthat are also nonconformity factors such as individual preference and maximize Eq. 1. It is clear that, the maximal likelihood we get topicpopularity. Thispartiallyexplainswhycommunityproﬁling withﬁxedp(c|u)’sissuboptimal,unlessthep(c|u)’sare“perfect”. ischallenging–wecannotaccountcommunityproﬁlesforallthe A perfect detection of p(c|u)’s also needs to maximize the like- diffusions; instead, we have to model different factors, to accu- lihood in Eq. 1, which depends on the proﬁle p(content|c)’s. In ratelyestimatetheproﬁlesandthecommunity-awarediffusion. all, content proﬁles and community detection are coupled. Sim- •Proﬁle-drivencommunityranking. Weoftenneedtotargetaudi- ilarly, we can show that diffusion proﬁle and community detec- encesfordisseminatinginformationinthenetworks. E.g.,acom- tionarealsocoupled. Denoteonemoreuserasv, andonemore panywantstotargetcommunities,whicharemostlikelytoretweet communityasc(cid:48). Forsimplicity,wedenotec’sdiffusionproﬁleas aboutitsproduct, soastolaunchacampaign. Afundingagency p(diffusion|content,c,c(cid:48))’s, eachasaprobabilityofhavingadif- wants to target communities, which actively cite papers about its fusionbetweencandc(cid:48)aboutsomecontent. Then,tobestexplain grantthemeon“deeplearning”,soastodisseminatethegrantcall. theuser-to-userdiffusionasgeneratedbythecommunitiesthrough Sincewehaveknownwhatcontenteachcommunityisinterestedin theirdiffusionproﬁles,weeffectivelysolve andhowitdiffusesthatcontentwithothers,wecanrankthecom- (cid:81) munities. Proﬁle-driven community ranking is different from the max p(diffusion|content,u,v) (2) (u,v) traditionalcommunityrecommendation,whichoftenreliesononly =(cid:81) (cid:80) (cid:80) p(diffusion|content,c,c(cid:48))p(c|u)p(c(cid:48)|v), “community-X”propertiesandisunawareofdiffusion[7,14]. (u,v) c c(cid:48) •Proﬁle-drivencommunityvisualization. Holisticmodelingleads wheretheproductover(u,v)istakenoveralltheuserpairshaving torichvisualization–wecannowvisualizenotonlyhowcommuni- adiffusionlink. TooptimizethelikelihoodinEq.2,weshallop- tiesfeaturedistinctcontents(e.g.,whatanITcommunitytweets), timizethediffusionproﬁlep(diffusion|content,c,c(cid:48))’s,aswellas but also how they interact (e.g., how an IT community retweets thecommunityassignmentsp(c|u)’sandp(c(cid:48)|v)’s. others)whichisoftenoverlookedbefore[8,23]. •Heterogeneityofsocialobservations. Socialobservations,espe- Wemaketworemarksabouttheaboveapplications:1)wecom- ciallytheuserlinks(i.e.,friendshiplinksanddiffusionlinks),often pleteonetaskofcommunityproﬁlingtosupportmultipleapplica- tionsatatime,thuscommunityproﬁlingisonlydoneonceofﬂine; 1http://sociallens.adsc.com.sg/ carrydifferentsemantics; e.g., friendshiplinksindicateusercon- •WedevelopascalableinferencealgorithmforCPD,andwefur- nections and diffusion links indicate user interactions. Tradition- therparallelizeitbytakingthedataskewnessintoaccount. ally, weoftentrytoenforceuserconnectionstobedenserwithin eachcommunitythanacrosscommunities[16, 19]. Butindiffu- • We perform extensive experiments to evaluate CPD over large- sion, the “weak ties” theory recognizes that the inter-community scaledatasets,andshowbothitseffectivenessandscalability. interactionsmaynotbeweak[12].E.g.,softwareengineeringcom- munitycitesmorepapersfrommachinelearningcommunitythan 2. RELATEDWORK itselfon“deeplearning.”Thismeanswehavetoseparatethemod- elingofuserconnectionanduserdiffusion. Suchuserlinkhetero- In this section, we review the related work on community de- geneity is largely overlooked in the previous work [32, 35], thus tection and relevant applications, and distinguish the differences howtomodelheterogeneoususerlinkstogetherremainsunclear. between existing work and our community proﬁling model. We • Nonconformity of user behaviors. User behaviors, especially furtherorganizesuchdifferencesinTable4. theirdiffusiondecisions,canhappenformanyreasons.Community- CommunityDetection. Detectingcommunitiesfromvariousnet- levelconformityisjustonereason,thuswehavetoconsiderother works has been extensively studied in the last decade. There ex- factorsaswell. E.g.,somediffusionhappensduetoitstopic(e.g., ist comprehensive surveys [39, 19, 36] on community detection, presidential election) being popular at the moment or its author which review different community detection methods in terms of (e.g.,LadyGaga)beingpreferredasacelebrity. Suchtopicpopu- detectionalgorithms,qualitymeasures,benchmarksandsoon. larityanduserpreferencearetheothertwotypicalnonconformity Conventionally,acommunityisdeﬁnedasagroupofnodes,in factors for diffusion, and we must accommodate them. No prior which intra-group connections are much denser than inter-group workhasexploredbothcommunityfactorandnonconformityfac- ones [11, 38]. The pioneer community detection studies aim to tors[17,25],anditisnotclearhowtobalancethemindiffusion. generate the community membership for each node purely based Our technical novelty is identifying the above challenges and onthelinksamongstthem[19,38]. Theprevalenceofsocialnet- developing a uniﬁed Community Proﬁling and Detection (CPD) works offers a rich collection of user links to use for community model(Sect.3)toaddressthemaccordingly. detection,suchasthefollowershipinTwitter[32],Flickr[30]and • To model the inter-dependency with community detection, we Facebook/Google+[26, 41], theco-authorshipinDBLP[40, 42], proposetotakeanovelproﬁle-awaregenerativeapproach–were- the email exchange [32]. However, most of these existing work alizethedetectionbylatentmembershipvariablesandtheproﬁling only consider one single type of links. There are other different bylatentcommunityproﬁlevariables,whichtogethergeneratethe typesofuserlinks; e.g., userscomment/replyotherusersindigg user friendship links, user content and user diffusion links in the [23],contact/co-contact/co-subscribeotherusersinYouTube[35], network. Thenweinfertheselatentvariablesbymaximizingthe follow/reply/retweetotherusersinTwitter[31].Butthesedifferent likelihood. None of the existing work has taken a proﬁle-aware linkswereoftenmodeledinthesameway.Sofarasweknow,none generative approach– they may use a generative model for com- oftheexistingcommunityworkconsiderstheheterogeneityamong munitydetection[26,31,33],buttheyneverconsiderinternaland userlinks(i.e.,friendshiplinksanddiffusionlinks)aswedo. externalproﬁlestogetherwithdetection. Recentstudiesstarttoexploittherichuserinformation,suchas •Toaddresstheheterogeneityofsocialobservations,wepropose content[32],attribute[33,40],action[23,31],toimprovethede- toseparatethegenerationoffriendshiplinksfromlatentcommu- tection.Consequently,inadditiontocommunitymembership,they nityassignmentsandthegenerationofdiffusionlinksfromlatent alsooccationallyoutputsome“community-X”associations,such proﬁles. In particular, we require that two users are more likely as“community-content”[32],“community-attribute”[33,40]and to share a friendship link if they have similar community assign- “community-action”[23,31].Inourwork,wesimultaneouslydis- ments. Thus maximizing the likelihood of observing the friend- cover communities and characterize them with both internal and shiplinksenforcesintra-communityfriendshiplinkstobedenser externalproﬁles.Althoughsomeformsofinternalcommunitypro- than inter-community ones. In contrast, we use the community ﬁlesmaybeobtainedinsomepriorwork([23,31,32,33,40])as diffusion proﬁles to generate the diffusion links, but we do not theby-products,theexternalproﬁlesaregreatlyoverlooked. require inter-community diffusion strengths to be always smaller Therearesomerecentstudiesonaggregatingeachcommunity’s thanintra-communityones;instead,thediffusionproﬁlesarefreely userpreferencesassomeformofcommunityproﬁles,soastoen- learnedinmaximizingthelikelihoodofthediffusionlinkobserva- ableitemrecommendationtoeachcommunity. Theirworkisdif- tions. ferentfromoursintwoaspects. Ononehand,mostofthesecom- •Toaccommodatethenonconformityofuserbehaviors, wepro- munityrecommendationstudiesaregiventhecommunitiesasin- pose to deﬁne the generative probability of observing a diffusion put [14, 29]. Even though some of them did try to detect com- linkasalogisticfunctionovermultiplefactors,includingthetopic- munities[1,27],theirdeﬁnitionofacommunityisagroupofusers awarecommunitydiffusionproﬁles, thetime-sensitivetopicpop- whosharesimilarpreferencestoarecommendeditem,whichisnot ularities and the individual user preferences. By maximizing the basedonnetworklinksatall.Incontrast,ourcommunityisagroup likelihoodofdiffusionlinkobservations,welearnthediffusionpro- ofdenselyconnectedusers,whosharesimilarinterestsanddiffu- ﬁles,aswellastheweightstocombinethesedifferentfactors. sionbehaviours. Ontheotherhand,theircommunityproﬁleisob- Finally,wedesignascalableinferencealgorithmforCPD(Sect. tainedbyaggregatingtheusers’preferences,whichisusuallybased 4). Asshownlater, ourinferencealgorithmscaleslinearlytothe onaleastmiseryoraggregatevotingapproach.Incontrast,wefor- data set size. We further parallelize our inference algorithm, by malizethecommunityproﬁlesastheprobabilitiesof“community- takingthedataskewnessintoaccount. X”andprobabilitiesof“community-community-X”. Besides,we Wesummarizeourcontributionsasfollows: estimatethesecommunityproﬁlesbyagenerativemodeltogether withcommunitydetection. • We identify a new problem of community proﬁling, which to- getherwithdetectionenablesaholisticmodelingofcommunities. Community-awareApplications.Thecommunityproﬁlesdeepen ourunderstandingofthedetectedcommunitiesandthusbeneﬁta • We identify three unique challenges and design a novel CPD lot of community-level applications. Here we review the related modelforjointcommunityproﬁlinganddetection. Notations Description |U|,|W| Thenumberofusersandwords |C|,|Z| Thenumberofcommunitiesandtopics |F|,|E| Thenumberoffriendshiplinksanddiffusionlinks dui Thei-thdocumentpublishedbyuseru CPD(ours)COLD[17]CRM[15]WTM[37]PMTLM[43]HAM[16]BlackHole[21]topcgo[10]INFEST[25]TopicInﬂuence[24]LADP[22]Inﬂulearner[9]CFF[27]GF[1]GenClus[33]TURCM[31]PMM[35]MetaFac[23]SA-Cluster[42]BAGC[40]CESNA[41]SocialCircle[26]CODICIL[30]SN-LDA[32]MaxFlow[11]Methods EπDWwcFuuuuituuijvi,ikzui TTAAMTThhhhudfeeeerliitfkssceifeenon-utttomdshoomismoffwhinwidauoplonlroicdldinrituidnyksimsnktafrireisdfnobnrsoomuitdcgsmtouindpocmmuounusecbemoenulnvrietmtsenudharetenuntcdddoitoubimtuioystmpoeuirucdsvneoaricstuusieimgsnesmnpteejcnitﬁatfcottiromdueusiteru •• •• ••• • • •• tex θφc MMuullttiinnoommiiaallddiissttrriibbuuttiioonnoovveerrtwoopridcssssppeecciiﬁﬁccttooctoopmicmzunityc t z • • •••• attribu ηναc,,βc(cid:48),zρ TPDrhioreibcpahablriealtimtpyeriotoefrrscsofmormmuonditeylicngdiifnfduisvinidgucaolmdimffuusniiotyncp(cid:48)reofnerteonpciecz te userfeatures user-itempairsuser-itempairs interactionsinteractionuseractions nodefeatureData winogr,kcotomomurutnhirtyeedeifxfaumsipoTlneaabanpldepl2cio:cmaNtimootnuasnt,iitoiynncvsliusduianlgizcaotimonm.uFniirtsytlrya,nfok-r frien cnoitmiemsbuansiteydroannkuinsegr,sm’ionstteroefsetsxiosntinthgesmtu,dii.ees.,[t1o4,ﬁ7n]drtahnekfcaovmoumriute- • • ••••••• •••••••••••d . communities for users. Moreover, the communities to be ranked T link are often already predeﬁned over the networks. In our work, the able1: ••••• ••••• diff.lin ccdooifmmfummsiuuonnniittpiieerossﬁablryeesnbtooottghperttohhveeirid.reTidnhtaeissrntwhaelililcnohpneutltep,nattnhpderoocuoﬁrmlefpsoacanunysd/iasuetxtohteorrranntakol C k om in choosethepromisingcommunitytopromotetheirproducts/papers d asmuchaspossible. Secondly, forcommunitydiffusion, incon- parison • •• ••••• ividualDiffu tmraosdtetlosaorueractotmhemiunndiitvyi-dleuvaellledviefflu[s9io,n22m,o2d5e].lliRnge,cemntolsyt,dthifefruesiaorne withthe ••• • communitysionfactor stoshoivoemenrreltfohasoectuktcodeordisme[st1mot5hgu,aen1ttih7ctie]oe.rns,Bsiainedrdseeiirdvpedirdsei,fudfuaeulnﬁslfinioakecndetoao[tr1ut0irhs]memocoriodstmeshilemningtugoniponiitfcy[v-1ala7ewr]viaoearulne,sdnbdeutiostfspfeuiiics-- r s elate • • • topic pmoupnuiltayrivtiysufaaclitzoartiisonm,iaslstihnoguignh[a15lo].tLofasetf,fbourttsnhoatvtehebeleeanstd,efvoortceodmto- d w to communitydetection,onlyafewofthemfurthervisualizethere- or pic sultstofacilitatethedeepanalysisandsemanticinterpretation. In k. •• • • • • ex [8], theauthors proposea community detectionand visualization tra model, whichdifferentiatestheinnernodesandthebordernodes c t. forvisualizingtheinteractionsbetweencommunities. Whiletheir c o objectiveistodesignalayoutalgorithmforclearlydisplayingthe m m communitiesandtheirinteractions,wefocusondemonstratingthe u ••• ••• •••••••••••••nity topic-awareuserinteractionstrengthsamongthecommunities. d e te 3. JOINTPROFILINGANDDETECTION c t.diffuTasks muIlnattehethfeojloloinwtincogm,wmeuﬁnirtsytpdreoﬁﬁnleinsgomanedkdeeytencotitoionnpsr;otbhleenmw.eTafbolre- s ••••• ••••• io 2summarizesthenotationsusedinthispaper. n p red DEFINITION 1. A social graph is G = (U, D,F,E), where . u ∈ U isauserandd ∈ Disauserpublisheddocument. There c o m are two types of links in G. Fuv ∈ F is a friendship link from m u userutouserv; Eij ∈ E isadiffusionlinkfromdocumentito n • •• ity documentj.Bothtypesoflinksaredirected. p roﬁle Fuo;rFaTwreitpterersneenttwsotrhka,tDusuer⊂uDfoilslotwhessuesteorfvt;wEeetsrpeopsrteesdenbtysuthseart uv ij tweet i is a retweet of tweet j. For a DBLP network, D is the u setofpaperspublishedbyauthoru; F representsthatauthoru uv co-authorswithauthorv;E representsthatpapericitespaperj. ij Toenablecontentmodeling,weﬁrstdeﬁnetopic. Heterogeneous anauthorpublishesapaperondeeplearning,becausesheisfrom user interactions F ρ themachinelearning(ML)community,whichstudiesdeeplearn- uv F ing. Aswedealwithshortdocuments(e.g.,tweetsinTwitterand papertitlesinDBLP)andashortdocumentislikelytobeaboutone singletopic[31,17],weassignonesingletopictoeachdocument η Et c π c ij ui u inourmodel.Secondly,weconsiderauserutopublishadocument C E Joint profiling α and detection β dui oftopicz,whichdiffusesanotheruserv’sdocumentdvj,due ν tobothusers’communityassignmentscui andcvj,aswellasthe θ Cdioffmupsiroenh menosdiveeli ng z w φ caopmapmeurnointysodfiftwfuasrieonrepproosﬁitloerηiecsu,ia,cnvdjzc.itEes.ga.n,oatnhearutahuotrhourpvu’sblpisahpeesr c ui uik z C Wun D Z ondeeplearning,becauseuisfromthesoftwareengineering(SE) u U community,visfromtheMLcommunity,andSEcommunitytends tocitepapersondeeplearningfromtheMLcommunity. Finally, Figure2:GraphicalmodelofCPD. weconsiderauserutoformafriendshiplinkwithauserv,dueto theirsimilarcommunitymembershipsπ andπ . E.g.,anauthor u v uisaco-authorwithanotherauthorv,becausetheyarebothfrom DEFINITION 2. Atopicz∈{1,...,|Z|}isa|W|-dimensional theMLcommunity. multinomialdistributionφ overwords,whereeachdimensionφ z z,w Addressingdataheterogeneity.WemodelthefriendshiplinksF istheprobabilityofawordw∈{1,...,|W|}belongingtoz. andthediffusionlinksEdifferently. Conventionally,agoodcom- Then,wedeﬁnethecommunitymembership,aswellasourcom- munityneedstohavelowconductance,whichmeansthefriendship munitycontentproﬁleanddiffusionproﬁle. linksshouldbedenserinsideacommunitythanoutsideacommu- nity. Speciﬁcally,wedeﬁnetheprobabilityofhavingafriendship DEFINITION 3. A user u’s community membership is a |C|- linkbetweentwousersuandvasasigmoidfunction,parameter- dimensional multinomial distribution π , where each dimension u izedbytheircommunitymembershipsimilarity: π istheprobabilityofubelongingtocommunityc,∀c∈{1,...,|C|}. u,c P(F =1)=σ(πˆTπˆ ), (3) uv u v DEFINITION 4. The content proﬁle of community c is a |Z|- dimensional multinomial distribution θc over topics, where each where πˆu = [πˆu,1,...,πˆu,|C|]T is an estimation of πu based on dimensionθ istheprobabilityofcdiscussingtopicz. theaggregationofu’scommunityassignments. Inothewords,we c,z use πˆ and πˆ , instead of π and π , to generate the F ’s in u v u v uv DEFINITION 5. Thediffusionproﬁleofcommunitycisa|C|× Fig.2. Suchadesignismotivatedby[5,6]tosimplifytheinfer- |Z|-dimensionalmatrixηc,whereeachentryηc,c(cid:48)zistheprobabil- ence.σ(x)=1/(1+e−x)isasigmoidfunction.Themoresimilar ityofcdiffusinganothercommunityc(cid:48)ontopicz. πˆ andπˆ are,themorelikelyF exists. Inotherwords,F is u v uv uv large if u and v are from the same communities. This naturally Takecommunityc1inFig.1asanexample. Asc1’suserspublish enforces denser friendship links within a community than across morecontentonz1andz2,theresultingθc1,z1 andθc1,z2 arebig- communities,thusleadingtolowconductance.Incontrastwiththe ger. Besides,asc1’susersoftenretweet/citethemselvesonz1,the friendship links, the inter-community diffusion is not necessarily resultingηc1,c1z1 isbig. AsmotivatedinSect. 1,weformalizea “weak”[12]. Infact,thecommunity-leveldiffusionstrengthsvary jointproﬁlinganddetectionproblemtosolveinthispaper. overtopics,whichbreakstheassumptionofhavingtomaintainthe lowconductancewithinacommunity. Weneedtoresorttoadif- PROBLEM 1. GivenasocialgraphG =(U,D,F,E),thetask ferentmodelingofdiffusionlinks,aswediscussnext. ofjointcommunityproﬁlinganddetectionistoinfer:1)eachuser u’scommunitymembershipπ ,∀u ∈ U; 2)eachcommunityc’s Accommondatingnonconformity. Differentfactorscanaccount u contentproﬁleθ anddiffusionproﬁleη ,∀c∈{1,...,|C|}. for a diffusion decision. Take Twitter as an example; user u is c c likelytoretweetv’stweetd asheri-thtweetd attimetif: 1) vj ui 3.1 ModelDesign thecommunity-leveldiffusionstrengthbetweenc (thecommu- ui Next,weconcretizeourmodeldesignw.r.t. thethreetechnical nity u belongs to when she generates document dui) and cvj on challengesforcommunityproﬁlingasdiscussedinSect.1.Wewill topiczvjisstrong;2)thetopiczvjofdvjistrendingattimet;3)u laterevaluatehowwellweaddresseachchallengeinSect.6.2. hasanindividualpreferencetoretweetfromv.Thesefactorsshow threetypicalperspectivestomakeadiffusiondecision:community Proﬁle-aware generative model. Community detection aims to perspective(ifacommunityismorelikelytoretweetanothercom- infer a community membership assignment π for each user u u munity),contentperspective(ifatopicismorepopularatthetime) basedonthefriendshiplinksF ’s. Communityproﬁlingaimsto uv and user perspective (if a user is more likely to retweet another inferacontentproﬁleθ andadiffusionproﬁleη foreachcom- c c user).Next,wecharacterizethethreetypicalfactors. munitycbasedonitsmemberusers’publishedcontentD ’sand u diffusionlinksEt ’s. Wecanreinforceproﬁlinganddetection,by •Communitydiffusionpreference: weconsiderauserutodiffuse ij lettingthemleverageeachother’sdata. Asaresult,wewishtoin- anotheruserv ontopicz,ifthecommunitiesofuandv areboth ferasetofcommunity-levellatentvariables,includingπu’s,θc’s interestedinzandtheyoftendiffuseeachotheronz. Denotes ∈ andη ’s,togetherfromalltheobservations(D,F,E). {0,1}asanindicatorforadiffusionlinkinEtohappen.Then,the c Since joint proﬁling and detection is an unsupervised task, we probabilityofhavingadiffusions=1fromutovonzis adoptagenerativeframeworkforourCPDmodel.WedesignCPD asagraphicalmodelinFig.2, whereweusecommunitiestoex- p(s=1,z|u,v)=1 (cid:88)(cid:88)p(s=1|c,c(cid:48),z)p(z|c)p(z|c(cid:48))p(c(cid:48)|v) plainalltheuserobservationsonthenetwork. Firstly,weconsider c c(cid:48) aniutysearssuigtnompeunbtlicsuhiaadnodctuhmeecnotmdmuiunoiftytocpoinctzen,tdpureotﬁoleheθrcucio.mEm.gu.-, ∝2 (cid:88)c(cid:88)c(cid:48)ηc,c(cid:48)zθˆc,zπˆu,cθˆc(cid:48),zπˆv,c(cid:48), (4) where at step 1 we expand p(s = 1,z|u,v) by introducing the 4. SCALABLEMODELINFERENCE communitymembershipp(c|u)andp(c(cid:48)|v),thecommunities’in- WedevelopascalableinferencealgorithmforCPD.Weaimto terestsonthetopicp(z|c)andp(z|c(cid:48)),aswellasthetopic-sensitive inferthetopicassignmentandcommunityassignmentlatentvari- communitydiffusionprobabilityp(s = 1|c,c(cid:48),z). Atstep2, we ables {Z,C} from the observations {W,F,E}, where W is the estimatep(s=1|c,c(cid:48),z)withηc,c(cid:48)z,theprobabilityofcretweet- wordsinD. WeusecollapsedGibbssampling[6,15,32]forthe ing/citingc(cid:48) onz. Besides, weestimatep(c|u)withπˆu,c, which inference. Wealsoestimatethevariationalparameters{π,θ,φ} istheempiricalprobabilityofcommunitycbeingassignedtouser andthemodelparameters{ν,η}byvariationalExpectationMaxi- u; similarly we estimate p(c(cid:48)|v) with πˆv,c(cid:48). Finally we estimate mization(EM)[5,6].Welaterparallelizeourinferencealgorithm. p(z|c)withθˆ ,whichistheempiricalprobabilityoftopicz as- c,z signedtothedocumentsfromc;similarlyweestimatep(z|c(cid:48))with 4.1 CollapsedGibbsSampling θˆc(cid:48),z.Denoteθˆ·,z =[θˆ1,z,...,θˆ|C|,z]T andη¯ =vec([η1,...,η|C|]), To derive the Gibbs sampler, we start with computing the col- wherevec(A)concatenatestherowvectorsinamatrixAtoavec- lapsedposteriordistributionofourmodel: tor. Foradiffusionbetweend andd , whichshares thesame ui vj p(W,F,E,C,Z,f,ν,η¯|ρ,α,β) topicz,wedenotec¯ =vec((πˆ πˆT)◦(θˆ θˆT )),where◦isan (6) ij u v ·,z ·,z =p(C|ρ)p(Z|C,α)p(W|Z,β)p(F|C)p(E|C,η,Z,ν,f), element-wiseproduct.ThenEq.(4)becomesc¯Tη¯. ij wherep(F|C)(abbreviatedasp(F)inthefollowing)istheprob- •Topicpopularity:wemodelthepopularityofatopicataspeciﬁc ability for the friendship links F generated by the communities timestamptasthecountoftopiczatt,whichisdenotedasnt. z C; p(E|C,η¯,Z,ν,f) (abbreviated as p(E) in the following) is theprobabilityforthediffusionlinksE generatedbythecommu- • Individual preference: we model user u’s preference to diffuse informationfromuservwithalinearfunctionνTf ,whereν is nities C. We follow [5] to model observed links only in Eq. 6. uv (cid:81) Thus, we deﬁne p(F) = P(F = 1) and p(E) = aexpaamrapmlee;twere,fcuovnsisidaerfetwatourfeeavteucrteosrffoorruu:a1n)dusve.rTpaokpeuTlawriittyte,rwahsiacnh (cid:81)(i,j)∈Ep(Eitj = 1), where(ut,vis)∈tFhe timeusvtamp of the diffusion is deﬁned as the number of u’s followers divided by that of her link(i,j). followees|Followers(u)|;2)useractiveness,whichisdeﬁnedasthe InthegenerativeprocessofCPDmodel,wemodelbothP(Fuv = |Followees(u)| 1) (step 3.c) and p(Et = 1) (steps 3.d) with sigmoid functions numberofu’sretweetsdividedbythatofhertweets |Retweets(u)|. ij |Tweets(u)| σ(·). Bayesianinferencewithsigmoidfunctionisknownashard, Weextractv’sfeaturesandconcatenatethemwithu’sasf . uv because it is analytically inconvenient to construct a Gibbs sam- Inordertosystematicallycombinethethreediffusionfactors,we plerforthesigmoidfunction[28]. Wearemotivatedbythedata introduceasigmoidfunctiontodeﬁnetheprobabilityofdocument augmentationapproach[2,6],whichintroducesPo´lya-Gammaran- d diffusingdocumentd oftopiczattimestamptas: vj ui domvariablestoderiveanexactmixturerepresentationofthesig- moidfunctionforeasierinference.HenceweintroducetwoPo´lya- p(Et =1|u,v,z,t)=σ(c¯Tη¯+nt +νTf ). (5) ij ij z uv Gammavariablesλandδastheaugmentedvariablesforp(F)and p(E)respectively. Formally,arandomvariablexfollowsaPo´lya- Welearntheparametersη¯andν,sothatweknowhowmucheach Gammadistributionx∼PG(a,b)(a>0,b>0),if factorcontributesinthediffusion. Generativeprocess. WesummarizetheCPDmodel’sgenerative x= 2π12 (cid:80)∞k=1 (k−1/2)2g+kb2/(4π2), processbelow.Denote1 asanall-onevectoroflength(cid:96). (cid:96)×1 whereg ∼ Gamma(a,1)isaGammarandomvariable. Ithas k 1. Foreachtopicz = 1,...,|Z|,drawitsworddistributionfrom beenshownin[28],alogisticfunctioncanberepresentedasamix- aDirichletpriorparameterizedbyβ:φ |β ∼Dir(β1 ); tureofGaussiansw.r.t.aPo´lya-Gammadistribution: z |W|×1 2. Foreachcommunityc=1,...,|C|,drawitstopicdistribution 1 1(cid:90) ∞ fromaDirichletpriorparameterizedbyα:θc|α∼Dir(α1|Z|×1); 1+e−w = 2 ψ(w,x)p(x|1,0)dx, (7) 0 3. Foreachuseru=1,...,|U| w−xw2 whereψ(w,x)=e 2 andx∼PG(1,0).Then,forp(Fuv = (a) Drawhercommunitydistributionπu|ρ∼Dir(ρ1|C|×1); 1)asdeﬁnedinEq.3,wecanintroduceaPo´lya-Gammavariable (b) Forthei-thdocumentduiofuseru λuv ∼PG(1,0),suchthatwegetajointprobability i. Drawacommunityassignmentcui|π∼Multi(πu), p(F =1,λ )= 1ψ(πˆTπˆ ,λ )p(λ |1,0). (8) byu’smultinomialcommunitydistributionπ ; uv uv 2 u v uv uv u ii. Drawatopiczui|c,θ∼Multi(θcui),bycui’smulti- Similarly,forp(Eitj = 1)asdeﬁnedinEq.5,wecanintroducea nomialtopicdistributionθc; Po´lya-Gammavariablesδij ∼PG(1,0),suchthatweget iii. Draw each word w |z,φ ∼ Multi(φ ), ∀k = 1,...,|Wui|,byzui’usikmultinomialworddizsutriibution; p(Eitj =1,δij)= 12ψ(c¯Tijη¯+ntz+νTfuv,δij)p(δij|1,0). (9) (c) Foreachfriendshiplinkfromuserutouserv,drawF |π∼ uv Consideringallthefriendshiplinksanddiffusionlinks,wehave Ber(σ(πˆTπˆ ))byaBernollidistribution(Eq.3); (d) For each duiffvusion link Eitj from document dui to docu- p(F,λ)=(cid:89) ψ(πˆTuπˆv,λuv)p(λuv|1,0), (10) mentd attimet,drawEt |C,η,Z,ν,f ∼Ber(σ(c¯Tη¯+ (u,v)∈F ntz+νvTjfuv))byaBernoiljlidistribution(Eq.5); ij p(E,δ)= (cid:89) ψ(c¯Tijη¯+ntz+νTfuv,δij)p(δij|1,0). (11) (i,j)∈E Instep3.b.iii, sinceshorttextoftenhassingletopic[31, 17], we NextweinferZandC,togetherwithλandδ.Speciﬁcally,aug- sampleallwordsind fromthesametopic-worddistributionφ . mented with two Po´lya-Gamma variables λ and δ, the collapsed ui z posteriordistributionofourmodelbecomes: Weefﬁcientlysampleλ byanalternateexponentiallytiltedJa- uv cobidistribution[28]. p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) •Forδ:theconditionaldistributionofδisalsoPo´lya-Gamma, =p(C|ρ)p(Z|C,α)p(W|Z,β)p(F,λ|C)p(E,δ|C,η¯,Z,ν,f) p(δ |W,F,E,C,Z,f,ν,η,δ) (cid:82) (cid:82) ij = P(C|π)P(π|ρ)dπ· p(Z|C,θ)P(θ|α)dθ· (cid:82) π θ −δij(¯cTijη¯+ntz+νTfuv)2 φP(W|Z,φ)P(φ|β)dφ·p(F,λ)·p(E,δ) ∝e 2 p(δij|1,0) (16) = |(cid:81)U| ∆(ncu+ρ) · |(cid:81)C| ∆(nzc+α) · |(cid:81)Z| ∆(nwz+β) ·p(F,λ)·p(E,δ), =PG(1,c¯Tijη¯+ntz+νTfuv). ∆(ρ) ∆(α) ∆(β) u=1 c=1 z=1 4.2 ModelParameterEstimation (12) where∆(x)= (cid:81)dii=m(1x)Γ(xi).BasedonEq.12,wecaninferZ,C, WeusevariationalEMtoiterativelyestimatethevariationalpa- Γ((cid:80)dii=m(1x)xi) rameters{π,θ,φ}andthemodelparameters{ν,η}: λandδonebyoneasfollows. 1. (E-step)UsethesamplesofcollapsedGibbssamplingtoes- •ForZ:theprobabilityofassigningtopicztoduiis timatetheparametersπ,θandφ,givenν,η. p(z =z|C,Z ,W,F,E,f,ν,η¯,λ,δ) 2. (M-step)OptimizeνandηbymaximizingEq.(6),giventhe ui ¬{ui} parametersπ,θ,φestimatedintheE-step. = p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) p(W,F,E,cui=c,C¬{ui},Z¬{ui},f,ν,η¯,λ,δ,ρ,α,β) In the E-step, the Gibbs sampler iteratively draws samples of Z, ∝(cid:81)|C| ∆(nzc+α) ·(cid:81)|Z| ∆(nwz+β) · C, λandδ byEqs. 13–16. Basedonthesamples, weestimate: c=1 ∆(nzc,¬{ui}+α) z=1 ∆(nwz,¬{ui}+β) π = ncu+ρ ,θ = nzc+α andφ = nwz+β .Inthe p(F|λ,C¬{ui})·p(E|δ,C¬{ui},Z¬{ui}) Mu-,sctep,nw(u·e)+ﬁ|Crs|tρestci,mzatenη(cc·,)c+(cid:48)z|Z’s|αby aggrze,wgatingn(zt·h)+e|Wco|mβmunity nz +α (cid:81)|W| (cid:81)nwui(nw +β+i−1) andtopicassignmentsw.r.tallthedocuments,basedonthelastit- = c,¬{ui} · w=1 i=1 z,¬{ui} · (13) n(c·,)¬{ui}+|Z|α (cid:81)nj=(u·1i)(n(z·,)¬{ui}+|W|β+j−1) earllatoitohnerofvsaarimabplleinsgﬁ.xTehde–nthwiseiessteismseantetiνallbyyﬁmttainxgimaizloinggisEticq.r6egwreitsh- (cid:81) ψ(πˆTπˆ ,λ |C )· sionfunction;tosolveit,werandomlysamplethesameamountof v∈Λu u v uv ¬{ui} (cid:81) ψ(c¯Tη¯+nt +νTf ,δ |C ,Z ), non-observeddiffusionlinksasnegativeinstancesforoptimization. j∈Λi ij z uv ij ¬{ui} ¬{ui} As α and ρ are used to sample the πu’s andθc’s, we followthe where Λu = {v|(u,v) ∈ F or (v,u) ∈ F} is user u’s neigh- convention[13]tosettheirvaluesas50dividedbyπu’sdimension bors in F. Λi = {j|(i,j) ∈ E or (j,i) ∈ E} is document andθc’sdimensionrespectively, i.e., α = 50/|Z|, ρ = 50/|C|. i’s neighbors in E. nz and n(·) denote the number of Asβisusedtosampletheworddistributionφz’sandthenumber c,¬{ui} c,¬{ui} ofwordsislarge,wefollow[13]againtosetβ =0.1. times that topic z is assigned to community c and the number of times that any topic is assigned to c, excluding the current doc- 4.3 Scalability ument dui. Similarly, nwz,¬{ui} and n(z·,)¬{ui} are the number of WesummarizeourinferencealgorithminAlg.1. Insteps3–10, timesthatwordw isassignedtotopicz andthenumberoftimes wetakeanE-stepforcollapsedGibbssampling.Insteps11-14,we thatanywordisassignedtoz,excludingd . nw andn(·)arethe takeanM-stepfortrainingthemodelparameters. ui ui ui numberoftimesthatwordw occursinthedocumentd andthe ui Time complexity. In steps 4–6, as we compute the community numberofwordsind . Finally,ψ(πˆTπˆ ,λ |C )denotes ui u v uv ¬{ui} assignmentsandtopicassignmentsforeachdocumentofeachuser, estimating ψ(πˆTuπˆv,λuv) based on C¬{ui} instead of the whole ittakesO(|D|×|C|+|W|×|Z|). Insteps7–8,aswecompute C;ψ(c¯Tijη¯+ntz+νTfuv,δij|C¬{ui},Z¬{ui})denotesestimating πˆTuπˆv foreachfriendshiplink,ittakesO(|C|×|F|). Insteps9– ψ(c¯Tijη¯+ntz +νTfuv,δij)basedonC¬{ui} andZ¬{ui} instead 10,aswecompute(c¯Tijη¯+ntz+νTfuv)foreachdiffusionlink,it ofthewholeCandZ. takes|C|2×|E|. Insteps11–12,asweaggregatethecommunity assignmentsandtopicassignmentsforeachdiffusionlink,ittakes •ForC:theprobabilityofassigningcommunityctouatd is ui O(|E|).Insteps13–14,aswecomputegradientsforνoverallthe p(cui =c|C¬{ui},Z,W,F,E,f,ν,η¯,λ,δ) diffusionlinks,ittakesO(|E|×T2).Intotal,forT1iterations,the = p(W,F,E,C,Z,f,ν,η¯,λ,δ|ρ,α,β) overallcomplexityisO((|D|×|C|+|W|×|Z|+|C|×|F|+ p(C¬{ui},zui=z,Z¬{ui},W,F,E,f,ν,η¯,λ,δ|ρ,α,β) |C|2×|E|+|E|+|E|×T2)×T1).Aswecansee,Alg.1’stime ∝(cid:81)|U| ∆(ncu+ρ) (cid:81)|C| ∆(nzc+α) · complexityislineartothedatasize(i.e.,|D|,|F|,and|E|). u=1 ∆(nc +ρ) c=1 ∆(nz +α) u,¬{ui} c,¬{ui} Parallelization.WeconsidermultithreadparallelizationofAlg.1. p(F|λ,C¬{ui})·p(E|δ,C¬{ui},Z¬{ui}) (14) Weleavemulti-machineparallelizationasfuturework.Inourvari- = ncu,¬{ui}+ρ · nzc,¬{ui}+α ·(cid:81) ψ(πˆTπˆ ,λ |C ) ationalEMalgorithm, weﬁndtheE-steptakesmuchlongertime n(·) +|C|ρ n(·) +|Z|α v∈Λu u v uv ¬{ui} thantheM-step,because:1)theE-step’scollapsedGibbssampling u,¬{ui} c,¬{ui} ·(cid:81) ψ(c¯Tη¯+nt +νTf ,δ |C ,Z ), hastobedoneiterativelyoveralltheobservations,includingdoc- j∈Λi ij z uv ij ¬{ui} ¬{ui} uments (thus words), friendship links and diffusion links; 2) the wherenc andn(·) arethenumberofdocumentsfrom M-step’smodelparameterestimationiscomparativelymucheas- u,¬{ui} u,¬{ui} ier, since optimizing ν is basically solving logistic regression on useruthatareassignedtocommunitycandthenumberofdocu- thediffusionlinks(andthesameamountofnegativelinks)andηis mentsfromuseruexcludingd ,respectively. ui donebysimplyaggregatingtheexistingcommunityandtopicas- •Forλ:theconditionaldistributionofλisPo´lya-Gamma,i.e., signments.Thus,inthispaperwefocusonparallelizingtheE-step. •Segmentingdatatoreduceinter-dependency.RecallinSect.4.1, p(λ |W,F,E,C,Z,f,ν,η,δ) uv thesamplingrequirescomputing:1)anumberofcounters,includ- ∝e−λuv(πˆ2Tuπˆv)2p(λuv|1,0)=PG(1,πˆTuπˆv). (15) ing the community-topic counter nzc, the topic-word counter nwz, Algorithm1ScalableinferenceforCPD document and each link, based on a serial implementation of the Input: UsersU,docsD,friendshiplinksF,diffusionlinksE; samplingalgorithmoverallthedata.Then,basedonthenumberof Output: TopicassignmentsZ,communityassignmentsC,model documentsandlinksauserhas,weestimatetheaverageworkload parametersνandη; ofprocessingthatuser. Finally,wesumuptheaverageworkload ofalltheusersinthei-thdatasegmentaso . 1: Initializeν,η,α,β,ρ; i Weevaluateourinferencealgorithm’sefﬁciencyinSect.6.4. 2: foriter=1:T do 1 3: foreachuseru∈U do 4: foreachdocumentd ∈D do 5. APPLICATIONS ui u 5: SampleatopiclabelzuiaccordingtoEq.13; We concretize how to enable the following three community- 6: SampleacommunitylabelcuiaccordingtoEq.14; aware applications based on ﬁve CPD outputs, including: 1) the 7: foreachfriendshiplink(u,v)∈F do communityassignmentforusersπ ’s;2)thecommunitycontent u,c 8: SampleaugmentedvariableλuvaccordingtoEq.15; proﬁle θc,z’s; 3) the community diffusion proﬁle ηc,c(cid:48)z’s; 4) the 9: foreachdiffusionlink(i,j)∈Edo topicassignmentforwordsφw,z’s;5)theindividualdiffusionpref- 10: Sampleaugmentedvariableδ accordingtoEq.16; erenceparametersν. ij 11: foreachdiffusionlink(i,j)∈Edo Community-awarediffusion.Giveninputofadocumentd pub- vj 12: Updateηcui,cvjzui byaggregatingc(cid:48)uis,cvj’sandzui’s; lishedbyuserv,weoutputtheprobabilitythatanotheruseruwill 13: forsubiter=1:T2do publishadocumentduitoretweetorcitedvj attimestamptas 14: GradientdescentforνoverthediffusionlinksE. p(Et =1|u,v,d ,t)=1 (cid:80) p(Et =1|u,v,z,t)p(z|d ) ij vj z ij vj theuser-communitycounterncu;2)anumberoflinkprobabilities, =2 (cid:80)zσ((cid:80)c(cid:80)c(cid:48)πu,cθc,zηc,c(cid:48)zπv,c(cid:48)θc(cid:48),z+ntz+νTfuv)p((z1|8d)vj). includingthefriendshiponeψ(πˆTuπˆv,λuv)andthediffusionone whereatstep1weexpandp(Eitj =1|u,v,dvj,t)bythetopicsof ψ(c¯Tijη¯+ntz+νTfuv,δij).Amongthesecomputations,bothtopic dvj.Atstep2,wepluginthedeﬁnitionofp(Eitj =1|u,v,z,t)by andcommunityassignmentsareappliedtodocuments(thustheir Eq.5.Aswecansee,Eq.18comprehensivelymodelsthediffusion users), the friendship link probability is applied to users, and the bytakingthecommunityassignmentsπ,thecommunityproﬁlesθ diffusion link probability is applied to two documents (thus their andη,andtheindividualdiffusionpreferenceνintoaccount. users). Therefore,exceptthewordtopicassignment,thevastma- Proﬁle-drivencommunityranking. Giveninputofaqueryq ∈ jorityofcomputationsaredoneonusersanddocuments.Thismo- Wk(∀k≥1),weoutputtherankingofcommunitiesbasedontheir tivatesustosegmentthedatabyusersanddocuments,sothatdif- probabilitiestodiffuseinformationaboutq.Denotetheprobability ferentthreadscanworkondifferentdatasegmentswithlittleinter- ofacommunityctogenerateadiffusionlinks=1ofqueryqas dependency. It may be possible to take words into consideration for the data segment as well, but it is not obvious and we leave p(s=1|c,q)=1 (cid:80) (cid:80) p(s=1|c,c(cid:48),z)p(z|q,c(cid:48))p(c(cid:48)|q) itforfuturework. Consideringthatauseroftenhasmanydocu- z c(cid:48) ments(especiallyinTwitter),wedesigntwoguidelinestosegment ∝2 (cid:80)z(cid:80)c(cid:48)ηc,c(cid:48)zp(z|q,c(cid:48))∝3 (cid:80)z(cid:80)c(cid:48)ηc,c(cid:48)zθc(cid:48)z(cid:81)w∈qφz,w, thedatabyusersanddocuments: 1)wekeepauser’sdocuments (19) inthesamedatasegment,becauseotherwisetherearelikelymany whereatstep1weexpandp(s = 1|c,q)bythecommunitydiffu- conﬂictingupdatesaboutthesameuserfrommultiplethreads; 2) sionproﬁlep(s = 1|c,c(cid:48),z),thetopicassignmentforqinacom- wepreferkeepingthesame-topicdocumentsinthesamedataseg- munityp(z|q,c(cid:48))andtheprobabilitythatqisfromthatcommunity ment,becauseithelpstoreducetheconﬂictingupdatesaboutthe p(c(cid:48)|q). Atstep2wepluginthedeﬁnitionofp(s = 1|c,c(cid:48),z) ∝ same topic from multiple threads. Overall, we ﬁrst run LDA [3] ηc,c(cid:48)z andconsiderqcancomefromanycommunitywithp(c(cid:48)|q) onalltheusers’documentswith|Z|topics; thenwepartitionthe uniformly. At step 3, we estimate the probability p(z|q,c(cid:48)) in a usersinto|Z|segments,basedoneachuser’smostfrequentlyas- similar way as Eq. 13. We skip the details but explain the ratio- signedtopicinherdocuments. Ineachsegment,eachuserhasher nalofthisestimation: p(z|q,c(cid:48))isproportionaltotheprobability documents,relatedfriendshiplinksanddiffusionlinks. ofcommunityc(cid:48)generatingtopicz(i.e.,capturedbyθc(cid:48),z)andthe (cid:81) • Distributing workload to avoid data skewness. We aim to dis- probabilityofqbelongingtotopicz(i.e.,capturedby w∈qφz,w). tributethe|Z|datasegmentstoM threads,suchthattheworkload Proﬁle-drivencommunityvisualization. Wecanvisualizeeach on each thread is balanced. Note that M is set as the number of community’scontentproﬁleanditsdiffusionproﬁle, asFig.1(b) physicalCPUcoresinthiswork. Ourapproachistoﬁrstestimate shows.Inparticular,weareinterestedinthediffusionvisualization, theworkloadofeachdatasegment,andthencastthissegmentallo- as it is new. In our experiments, we visualize how a community cationtaskassolvingMstandard0-1knapsackproblems2.Denote interactswiththeothersintwotypicalsettings: 1)diffusionona tahlletih-ethddaatatasseeggmmeenntts’siswOork=loa(cid:80)daMi=s1oioi∈. DRe+n,othteusatbhienwaroyrkinlodaicdaftoorr tsopecc(cid:48)iﬁucndtoepritco,pwichezr;e2w)ediuffsuesiηocn,c(cid:48)wzitahsttohpeicdiaffgugsrieognastitorenn,gwthhefrroemwce asxi ∈{0,1}.Thenforeachthread,wesolve use(cid:80)zηc,c(cid:48)zasthediffusionstrengthfromctoc(cid:48). max (cid:80)M o x , s.t. (cid:80)M o x ≤ 1 O, (17) i=1 i i i=1 i i M 6. EXPERIMENTS whichtriestoﬁndasubsetofthedatasegmentstohaveascloseto WetestCPDwithtwolarge-scalereal-worlddatasets. Wede- O workloadaspossible. Onecanﬁnetunetheobjectivefunction M sign experiments to: 1) evaluate how well we address each chal- of Eq. 17 in practice to best allocate the data segments for even lengelistedinSect.1;2)evaluateCPD’sperformance,bycompar- workloadamongthethreads.Weestimateeachworkloado asfol- i ingwiththestate-of-the-artbaselinesindifferentapplications. lows.Firstofall,weestimatetheaverageprocessingtimeforeach 6.1 SetUp 2https://en.wikipedia.org/wiki/Knapsack_problem #(user) #(friend.link) #(diff.link) #(doc.) #(word) [17]todetectthecommunities,andfurtheraggregatetheuserob- Twitter 137,325 3,589,811 992,522 39,952,379 2,316,020 servations in each detected community as the proﬁles. After ap- DBLP 916,907 3,063,186 10,210,652 4,121,213 330,334 plyingCRMandCOLD,wegetthecommunityassignmentprob- Table3:Datasetstatistics. abilitiesforeachuserutoeachcommunityc,whichwedenoteas π∗ ’s. Togetaggregatedcontentproﬁle,weﬁrstrunLDA[3]on u,c Data Diffusionfactors Tasks alltheusers’documentswith|Z|topics,andforeachuseru’si-th Methods textfriend diff indiv-commtopic topic comm diff comm documentd ,wegetits|Z|-dimensionalmultinomialtopicdistri- links links idual extract detect predproﬁle butionasθ∗ui . Denotecommunityc’saggregatedcontentproﬁle PMTLM[43] • • • • • du,i WTM[37] • • • • • asθ∗c.Wehave CCORLMD[[1157]] • • •• • •• • •• •• θ∗ =(cid:80)|U| π∗ (cid:80)|Du| θ∗du,i. (20) Ours • • • • • • • • • • c u=1 u,c i=1 |Du| To get aggregated diffusion proﬁle, we aggregate each diffusion Table4:Differenceswithbaselines. linkbetweend andd inEw.r.t. theirusers’communitieson u,i v,j atopicz. Denoteaggregateddiffusionproﬁleη∗ astheproba- c,c(cid:48)z bilityofcommunitycdiffusingcommunityc(cid:48)ontopicz.Thenwe WedoexperimentsonLinuxcomputersequippedwithIntel(R) have 3.50GHzCPUsand16GBRAMs. Wedo10-foldcrossvalidation andreportaveragescoresforallthequantitativeresults. Wealso ηc∗,c(cid:48)z ∝(cid:80)(i,j)∈Eπu∗,cπv∗,c(cid:48)θd∗u,i,zθd∗v,j,z. (21) reportsigniﬁcanttestresultswhenevernecessary. Inall,weobtaintwomorebaselines,whichimplementthestraight- Data sets. We use two public data sets: Twitter [20] and DBLP forward“ﬁrstdetection,thenaggregation”proﬁlingapproach. [34]. TheTwitterdatasetwascollectedinMay2011. TheDBLP • CRM+Agg. It uses CRM [15] to detect communities; then it datasetcontainsthepublicationsindexedbyDBLP3from1936to usesEq.20andEq.21foruseraggregationtogetthecommunity 2010. Wepre-processedthetweetsandthepapertitles,byremov- contentproﬁlesanddiffusionproﬁles,respectively. ingstopwords,stemmingandPOStagging4. Weonlykeptnouns, •COLD+Agg.ItusesCOLD[17]todetectcommunities,thensim- verbsandhashtags.Afterthat,weremovethedocumentswithless ilarlyusesEq.20andEq.21togetcontentanddiffusionproﬁles. thantwowords,andthenremovetheuserswithnodocument. Ta- WecomparewithbothCRM+AggandCOLD+Aggindiffusion ble3summarizesthestatisticsofourdatasetsafterpre-processing. linkpredictionandcommunityranking. Baselines.Wechoosebaselinesbasedonthefollowingguidelines: Evaluation. Sincewejointlyproﬁleanddetectcommunities,we 1)theyarethestateofthearttomodelheterogeneoususerobser- willevaluatethequalityofcommunitydetectionandproﬁles. vationsatthedatalevel;2)theymodeldiffusionpredictionatthe •Detectionquality.Weconsidertwowaystoevaluatetheresulting tasklevel;3)preferablytheymodelcommunity.Finally,wechoose communities:1)howdensetheyare;2)howwelltheycanbeused fourbaselinesbelow,andlistourdifferenceswiththeminTable4. to explain the friendship link observations. For 1), we use con- •PoissonMixed-TopicLinkModel(PMTLM)[43]. Itmodelsthe ductance[18,19]asthemetric. Asourcommunityassignmentis documentnetworkandusesthedocumenttopicassignmenttogen- probabilistic,wefollow[17]toleteachuserbelongtohertopﬁve erate the links. We adapt PMTLM for community detection and communitiesinconductanceevaluation(andalsolatercommunity friendshiplinkpredictioncomparison,byaggregatingthetopicas- rankingevaluation).Thesmallerconductanceis,thebetter.For2), signmentsofeachuser’sdocumentsasthecommunitymembership wefollow[17]todesignalinkpredictiontask,whereweuseEq.3 forthatuser. WealsocomparewithPMTLMondiffusionpredic- topredictwhethertoobserveafriendshiplinkbasedontwousers’ tion,asitalsomodelsthedocumentlinks. communities. Asthereisnopredeﬁnedthresholdforlinkpredic- tion,weuseAUC(AreaUnderthereceiveroperatingcharacteristic •WhomtoMention(WTM)[37].Itmodelstheuserdiffusionlinks Curve) [17, 15] as the metric. Given a ranking of non-observed withusercontentandfriendship.Itdoesnotmodelcommunity.We links,wecalculatetheAUCscoreastheprobabilityofarandomly comparewithWTMondiffusionprediction. chosentruepositivelinkbeingrankedhigherthanarandomlycho- •CommunityRoleModel(CRM)[15]. Itmodelsfriendshiplinks sen true negative link. In the 10-fold cross validation, each time anddiffusionlinksbasedontheuser’scommunityassignmentand weuse10%ofthepositivelinksandsamplethesameamountof roleassignmenttogether. WecomparewithCRMoncommunity negativelinkstocalculateAUC.ThehigherAUCis,thebetter. detection,friendshiplinkpredictionanddiffusionprediction. •Proﬁlequality.Duetolackofgroundtruth,wegenerallyevaluate •COmmunityLevelDiffusion(COLD)[17]. Itmodelsthecontent thecontentanddiffusionproﬁles’qualitythroughtheapplications in Sect. 5. For community-aware diffusion, as there is no prede- and diffusion links based on communities. Thus it is the closest ﬁnedthresholdfordiffusionlinkprediction,weagainuseAUCas worktoours. Butitmodelsneitherfriendshiplinksincommunity the evaluation metric. For proﬁle-driven community ranking, as detection,norindividualfactorandtopicfactorindiffusionpredic- thecommunitiesdetectedbydifferentalgorithmsaredifferent,for tion.WecomparewithCOLDoncommunitydetection,friendship faircomparison,weevaluatethequalityofeachrankedcommunity linkpredictionanddiffusionprediction.AsCOLDhascommunity intermsofitsusers–givenaqueryq, wecheckhowmanyusers diffusionstrength,wealsocompareitoncommunityranking. Inadditiontotheaboveexistingbaselines,wealsodesignsome inthetopK rankedcommunitiesreallyretweet(orcite)aboutq. Thennaturallywecomputeprecisionandrecallforeachcommu- morebaselinestovalidatethatwearebetterthanastraightforward nity in the ranking list. Denote the users who mention q in their communityproﬁlingapproachof“ﬁrstdetectingcommunities,then retweets(orcitationpapertitles)asU∗. Denotetheusersbelong- aggregatingeachcommunity’suserobservations”.Speciﬁcally,we q ingtoanyofthetopK communitiesasU . Theprecisionofthe adopt the two state-of-the-art algorithms, CRM [15] and COLD K topK communitiesforqueryq isP(K,q) = |U∗∩U |/|U |, q K K 3http://dblp.uni-trier.de/ and the recall is R(K,q) = |U∗ ∩U |/|U∗|. We deﬁne mean q K q 4http://nlp.stanford.edu/software/tagger.shtml average precision (MAP) over all the queries as MAP@K = ecnatcudnoC 0000....00007689....51555697820 50 NNOoou r HJsoeitne1tr 0Mo0goedneeliitnyg 150 )CUA( noitciderP kniL 000000......00666767..2684426720 NNOoou rHJso5ei0tnetr Mogoedneeliitnyg100 150 )CUA( noitciderP noisuffiD 000000......00788887..6246888920 NNOouo rJHs5oe0intetr Mogoedneeliitnyg100 150 ecnatcudnoC 000000......34567844444420 50 NNOouo r JHsoeinte1tr 0Mo0goedneeliitnyg 150 Number of Communities Number of Communities Number of Communities Number of Communities (a) Communitydetect.(Twitter) (b)Friendshiplinkpred.(Twitter) (c) Diffusionlinkpred.(Twitter) (d) Communitydetect.(DBLP) )CUA( noitciderP kniL 000000......79678872722720 50 NNOoou r HJsoeint1et0r Mo0goedneeliitnyg 150 )CUA( noitciderP noisuffiD 00000.....756898888820 NNOouo rJHso5ein0tetr Mogoedneeliitnyg100 150 )CUA( noitciderP noisuffiD 0000....7788272720 NNOouo rTIsn5od0pivicidual & T1o0p0ic 150 )CUA( noitciderP noisuffiD 0000....8899161620 50NNOoou rITsnodpivicid1u0a0l & Topic 150 Number of Communities Number of Communities Number of Communities Number of Communities (e) Friendshiplinkpred.(DBLP) (f) Diffusionlinkpred.(DBLP) (g) Diffusionlinkpred.(Twitter) (h) Diffusionlinkpred.(DBLP) Figure3:Studyofourmodeldesign. (cid:80)MmhbiniaegstqAahoe(ndepR(cid:80)rioac@MnKivm=KetA1rooapFdP=giece(ilssii(cid:80)n,,F,gwq1tq)he[/(e3a(cid:80)aKs]dbtoM)eKio=/tpte|t1eAQvorRa.Fn|l(ue@Iaiane,ntKxqdeat)rid/mtads=Kewiqtai)iuond/2ane×|MlQa,liMytvAya|e.u.sAPrsEaFP@tehgfid@KfneeeKam+cclr×tlMeeoiytvMcn,rAeaitwcAelRlylneR(@,tp(@pdKMepeeKrrrﬁApop.lnﬁlReeelx)xTeitithhatyiyeess) )CUA( noitciderP noisuffiD 00000.....456785555520 WCCOO5T0LLMDD+Agg 10CCO0RRurMMs+Agg150 )CUA( noitciderP noisuffiD 000000......56894755555520 PCCMOOLLTDDL5M+0Agg CCO1RRu0rMMs0+Agg 150 of a content proﬁle measures how well it generates the user con- Number of Communities Number of Communities (a) Twitter (b) DBLP tentobservations,andweusethesamedeﬁnitionofperplexityas in[17].Thelowerperplexityis,thebetter. Figure4:Resultsofcommunity-awarediffusion. 6.2 ModelDesign DBLP respectively; the topic factor is able to contribute another Wewanttoevaluatehowwellweaddresseachcommunitypro- 3.6%and10.5%absoluteAUCimprovementoneachdataset. ﬁlingchallengeasintroducedinSect.1. Toachievethisgoal,we In all, we conclude that our model design well addresses the designsomebaselinesbasedonthedegeneratedversionsofCPD, threechallengesincommunityproﬁling. for validating the advantages of our model design. We compare 6.3 ComparisonwithBaselines CPDwiththesebaselines,andevaluatethequalityofdetectedcom- munities and proﬁles through three tasks: community detection, WeevaluateCPDandthebaselinesonvariousapplications. friendshiplinkpredictionanddiffusionlinkprediction. 6.3.1 Community-awareDiffusion • Modeling the inter-dependency with community detection. We Quantitativeanalysis. Inﬁgure4,wesummarizetheresultcom- designabaseline“nojointmodeling”,whereweﬁrstdetectcom- parisonwiththebaselinesintroducedinSect.6.1. PMLTMisnot munitiesonlyfromthefriendshiplinksthroughagenerativemodel applicabletoTwitter,sinceitisdesignedsolelyforcitationnetwork– byEq.3,thenweextracttheproﬁlesthroughagenerativemodelas it predicts a citation based on the similarity between two docu- inCPDexcepthavingthecommunitiesﬁxed.AsshowninFigures ments,butinTwitteratweetanditsretweetarealmostidentical.As 3(a)–3(f),oursisalwaysbetterthan“nojointmodeling”. showninFig.4,ourmodelconsistentlyoutperformsallthebase- •Addressingtheheterogeneityofsocialobservations.Wedesigna lines,thanksto: 1)ourmodelingvariousdiffusionfactorsandhet- baseline“noheterogeneity”,whereweadaptCPDtomodelfriend- erogeneoususerlinks,incontrastwiththebaselinesinTable4;2) shiplinksanddiffusionlinksinthesamewaybyEq. (3),butkeep ourjointdetectionandproﬁling,incontrastwiththetwo“ﬁrstde- theotherpartsofCPDmodelingunchanged. AsshowninFigures tectionthenaggregation”baselines. When|C|=100,weachieve 3(a)-3(f),oursisbetterthan“noheterogeneity”ondiffusionpre- 24.2%–91.6%and5.1%–108.0%relativeAUCimprovementsthan diction,andcomparablewithitoncommunitydetectionandfriend- thebaselinesinTwitterandDBLP,respectively.Theimprovements shiplinkprediction.Thisimplies:1)diffusionlinksandfriendship arestatisticallysigniﬁcantoverthe10-foldcrossvalidationresults, links are different, and diffusion links require more sophisticated withstudent’st-testone-tailedp-valuep<0.01. modeling than friendship links; 2) friendship links and diffusion Casestudy. WeexaminethethreediffusionfactorsinEq.5with linksarecorrelated;diffusionlinksdonotsigniﬁcantlychangethe theDBLPdata. Firstly,inFig.5(a)weplotthenumberofpapers communitystructureoncethefriendshiplinksaregiven. ausercitesw.r.t.heractiveness,andthenumberofcitationsauser •Accommodatethenonconformityofuserbehaviors. Wedesign hasw.r.t.herpopularity.Useractivenessandpopularityaredeﬁned two baselines: 1) “no individual & topic”, where we exclude the inSect.3.1. Generally, themoreactiveauseris(i.e., publishing individualfactorandtopicfactorfromEq.5inCPD;2)“notopic”, morepapers),themorepapersshecites;besides,themorepopular where we exclude only the topic factor from Eq. 5 in CPD. As auseris(i.e.,amoreestablishedresearcher),themorecitationsher showninﬁgures3(g)and3(h),theindividualfactorisabletocon- papersget. Thisobservationsupportsourdesignofmodelingboth tribute4.8%and6.8%absoluteAUCimprovementonTwitterand useractivenessandpopularityastheindividualfactorsindiffusion.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.