Change Is Hard: Adapting Dependency Graph Models For Unified Diagnosis in Wired/Wireless Networks
Lenin Ravindranath†, Paramvir Bahl‡, Ranveer Chandra‡, David A. Maltz‡, Jitendra Padhye‡, Parveen Patel‡
†MIT, ‡Microsoft Research
ABSTRACT
Organizations world-wide are adopting wireless networks at an impressive rate, and a new industry has sprung up to provide tools to manage these networks. Unfortunately, these tools do not integrate cleanly with traditional wired network management tools, leading to unsolved problems and frustration among the IT staff. We explore the problem of unifying wireless and wired network management and show that simple merging of tools and strategies, and/or their trivial extension from one domain to another, does not work. Building on previous research on network service dependency extraction, fault diagnosis, and wireless network management, we introduce MnM, an end-to-end network management system that unifies wired and wireless network management. MnM treats the physical location of end devices as a core component of its management strategy. It also dynamically adapts to the frequent topology changes brought about by end-node mobility. We have a prototype deployment in a large organization that shows that MnM’s root-cause analysis engine outperforms systems that do not take user mobility into account when localizing faults or attributing blame.
Categories and Subject Descriptors: C.4 [Performance of systems]
General Terms: Management, performance, reliability, wireless
Keywords: Wireless, corporate networks, performance
1. INTRODUCTION
Data from IT departments of large corporations and dominant PC manufacturers show that employees prefer to use just one device (e.g., a laptop computer) for all their computing needs [17]. Consequently, many large IT departments are moving towards a future that includes a significantly reduced role for the traditional wired desktop computer [11]. They envision a future where enterprises deploy wireless networks in all corporate campus buildings, and swarms of nomadic users access corporate resources through wireless Access Points (APs). They expect users to frequently change their point of attachment to the corporate network. In this new world, the corporate IT departments need tools to manage and diagnose both wired and wireless parts of their network.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WREN’09, August 21, 2009, Barcelona, Spain.
Copyright 2009 ACM 978-1-60558-443-0/09/08 ...$5.00.
Current enterprise network management and diagnosis systems use separate tools to diagnose wired and wireless networks. In an environment where a large number of users are nomadic, debugging application performance problems using separate tools is both difficult and frustrating [13].
For example, consider Figure 1, which shows the time required to fetch a URL, measured simultaneously from a wired desktop host and a wireless laptop as the laptop was moved between rooms every 5 minutes. Unsurprisingly, both the wired and wireless hosts see significant variation in the response time. Interestingly, however, the variation is sometimes seen by the wireless host only, potentially indicating problems in the wireless connectivity, and sometimes the variation is seen only in the wired host, potentially indicating congestion in the wired network. Sometimes the variation is seen in both, potentially indicating congestion in a server involved in providing the requested URL.
A natural question to ask is: why not diagnose performance problems by using the existing wireless and wired network diagnosis systems separately?
The answer is that a diagnosis system that looks at only the wired network or the wireless network is likely to misinterpret some of the spikes in the response time and blame the wrong network component. In this paper, we show that the quality of diagnosis is better when both wired and wireless aspects of the enterprise network are analyzed jointly.
Three main features distinguish our approach from the recent research on enterprise network diagnosis systems:
Changing network topology: Many recently proposed network fault diagnosis systems such as Sherlock [4] and SMARTS [22] implicitly assume that the fundamental structure of the network is either static or changes slowly. This assumption allows these systems to build Inference Graphs [4] and codebooks [22] to pinpoint the cause of performance problems seen by the users. However, these approaches cannot be used without substantial modifications in an environment where clients frequently change their point of attachment to the corporate network.
Joint Consideration of Wired and Wireless Networks: To diagnose end-to-end performance of networked applications across wired and wireless networks requires re-thinking core aspects of fault diagnosis. For example, geographic location must become a first class object in the analysis for determining if a problem is in the backhaul network, the wireless link, or the data center servers.
Absence of Fixed Observers: Since many problems in wireless networks are location specific, existing wireless network monitoring systems rely on fixed desktops [8] or specialized monitoring hardware [3, 10]. However, in a network consisting primarily of nomadic users, systems like DAIR [8] are impractical, while systems like Jigsaw [10] and Wit [18] are expensive to deploy.
Figure 1: Time to fetch a URL as measured simultaneously from a wired desktop host and a wireless laptop. The laptop was moved between rooms every 5 minutes. (Plot annotations: large server variations; large wireless variations; background wireless variability; spikes of variability in server.)
We have developed an end-to-end network diagnosis system, called MnM, that successfully diagnoses performance of networked services and applications running on nomadic hosts. MnM builds on recent research on network service dependency extraction [4], fault diagnosis, and wireless network monitoring. It treats the physical location of end devices as a core component of its diagnosis strategy. It also dynamically adapts to the frequent topology changes brought about by end-node movement. Our system is implemented entirely in user-level software, and it does not require any specialized monitoring hardware. We have deployed the MnM system on a segment of our organization’s network. Over a period of two weeks, we monitored 27 users and 10 servers. We detected and correctly diagnosed a variety of performance issues, including poor Wi-Fi coverage, congestion in wired networks, and misconfigured DNS entries. As we shall show later in the paper, at least 140 performance problems would have been mis-diagnosed had we not taken an integrated, holistic view of wired and wireless networks. MnM extends the state of the art in enterprise network management by making two important contributions:
1. We identify issues that an enterprise network management system must consider when the end-hosts are nomadic. We show that recently developed systems are not able to cope with these issues. We quantify mistaken diagnoses that occur in systems that do not compensate for user nomadicity, and we argue that location must be treated as a core component in future enterprise network management systems.
2. We present an enterprise network management system that unifies wired and wireless network management, and handles nomadic users. It is easy to deploy, as it requires no special fixed infrastructure for wireless monitoring and automatically initializes its location system. We evaluate its accuracy through both controlled experiments and a 2-week field study.
clients and wireless APs. Unlike our system, their techniques miss out on problems that a mobile client may have because of a performance issue in the wired part of the network.
The DAIR system [8] also detects performance problems faced by users of Wi-Fi networks. DAIR uses corporate desktop computers to monitor the airwaves and, like MnM, location-awareness is a core component of its management strategy. Fundamentally, DAIR relies on the existence of fixed desktop devices to monitor performance of the wireless link. In contrast, MnM assumes a world where every client is mobile. In such an environment, monitoring must be done by mobile clients themselves. This presents several unique challenges, such as bootstrapping, which systems like DAIR cannot handle. Furthermore, DAIR requires the monitoring devices to sniff packets in promiscuous mode, which may not always be possible on battery-constrained mobile clients.
Jigsaw [10] and WIT [18] are Wi-Fi monitoring systems that combine the data from multiple monitors to generate a comprehensive view of network events. Jigsaw uses dedicated, custom-built, multi-radio monitoring nodes and provides a detailed view of low-level network effects such as interference. WIT is able to analyze and detect MAC-level mis-behavior. While useful in investigating why individual locations have poor performance, these tools are not designed for diagnosing end-to-end networked services in a corporate environment.
Commercial systems [2, 3] are available for managing wireless networks, but they do not detect performance issues due to problems in the wired part of the network. Furthermore, systems like DAIR, Jigsaw, WIT, Airtight, etc. do not have visibility into application-level performance problems, whereas, as we will show, MnM does.
Wired Network Management: The Sherlock system [4] manages networked services in enterprise networks by extracting inference graphs and then using these to diagnose performance problems. Software agents running on desktop machines determine the set of services the host depends on, and a centralized inference engine captures the dependencies between the components of the IT infrastructure by merging the views of each client. Sherlock then diagnoses faults by running an inference algorithm on the inference graphs. Sherlock makes a fundamental assumption that dependencies are static or, at most, change slowly. This is not true for applications running on devices used by nomadic users. As we show in Section 3, systems like Sherlock perform poorly when dependencies are dynamic and fast changing. Furthermore, such systems cannot be trivially extended to handle nomadic clients.
Other network management systems, such as Shrink [14] and SCORE [15], have made seminal contributions in diagnosing faults in wide-area networks, but they cannot be easily used for managing nomadic users. Similarly, sophisticated commercial products such as SMARTS [22], OpenView [19], and Tivoli [23] provide powerful tools for managing enterprise wired networks, but fall short when extended to manage mobile clients and Wi-Fi users.
Finally, we note that a longer version of this paper is available as a technical report [5].
2. RELATED WORK
There is substantial prior work in enterprise network management. However, it has focused on managing either wired networks or wireless networks, not both simultaneously. The closest thing to unified management tools are systems that let network managers view the wired and wireless networks simultaneously [13].
Wireless Network Management: Adya et al. [1] built one of the first enterprise wireless network management systems. Their system is similar to ours in that they focus on performance problems faced by Wi-Fi enabled mobile clients. They detect problems by analyzing link data collected by monitoring agents residing on
3. FORMULATING THE PROBLEM
Figure 2 illustrates an enterprise network of the future. Users located on the corporate campus access the enterprise data center servers via APs deployed in campus buildings, and these users move around frequently. Some users may work remotely, and connect to the corporate network via VPN. In this paper we focus primarily on nomadic users who change location but conduct most of their work when stationary. Some other papers refer to these as mo-
Figure 2: Example of the typical enterprise network of the future. Most users access corporate resources from laptop computers connected to wireless networks or from remote locations via VPNs over the Internet.
C using a web server. In this figure, the response time the client C observes when fetching a web page will be affected by the health of the DNS service, the Kerberos service, and the web server itself, since to successfully fetch the web page, C must first use DNS to convert the name of the web site to an IP address, then fetch certificates to access the web site, and finally retrieve the content from the web site. The health of these services, in turn, is affected by the health of the servers that implement the service and the ability of the client C to successfully reach the servers over the network. The health of each network path is affected by the routers on the path.
Nodes in the Inference Graph are conceptually in one of two states: up or down. Root causes that are operating normally and observations indicating normal performance are up. Nodes causing or indicating poor performance are down, even if they have not failed completely but are merely slow in returning answers.
While our example Inference Graph has only a single client and a single observation of a single application, a system-wide Inference Graph is built by combining the graphs for each client application and service. These graphs share the same root cause nodes, but have different observation and service nodes for the combination of each client and application.
The Inference Algorithm: Given the inference graph and the state of the observation nodes, an inference algorithm can infer which root causes are most likely to have failed. This is especially useful in the cases where root causes cannot be directly observed [4, 15]. Many inference algorithms have been developed, but the goal of each is the same: given a set of observations of system performance, good and bad, determine a set of root causes whose failure would best explain that pattern of observations. To cope with the uncertainty in the real world, MnM uses probabilistic inference. Specifically, every root cause has a prior probability, that is, the fraction of time the root cause is typically down. The inference algorithm takes these priors into account when computing which root causes are most likely to be down. The algorithm used in this paper is the same as that used by Sherlock [4].
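To make the role of priors concrete, the following Python sketch ranks root causes by prior times the likelihood of the observed up/down pattern. It is a deliberately simplified, single-fault stand-in and not the Sherlock algorithm [4]; the toy graph, the node names, the prior values, and the NOISE parameter are all hypothetical.

```python
# Simplified single-fault inference over a toy Inference Graph.
# Assumption: exactly one root cause is down; all names and numbers are illustrative.

# Maps each observation to the root causes it depends on (meta-nodes flattened away).
DEPENDS_ON = {
    "C fetches http://foo": {"DNS server", "Kerberos server", "Web server", "Router R1"},
    "D fetches http://foo": {"DNS server", "Kerberos server", "Web server", "Router R2"},
}

# Prior probability that each root cause is down (fraction of time it is typically down).
PRIORS = {
    "DNS server": 0.01,
    "Kerberos server": 0.01,
    "Web server": 0.05,
    "Router R1": 0.02,
    "Router R2": 0.02,
}

NOISE = 0.1  # assumed chance that an observation disagrees with the hypothesis


def score(cause, observations):
    """Unnormalised posterior: prior times likelihood of the up/down pattern."""
    likelihood = 1.0
    for obs, is_down in observations.items():
        predicted_down = cause in DEPENDS_ON[obs]
        likelihood *= (1 - NOISE) if predicted_down == is_down else NOISE
    return PRIORS[cause] * likelihood


def rank_suspects(observations):
    """Return root causes sorted from most to least likely to be down."""
    return sorted(PRIORS, key=lambda c: score(c, observations), reverse=True)


if __name__ == "__main__":
    # Only client C sees slow responses, so a component private to C's path is suspect.
    print(rank_suspects({"C fetches http://foo": True, "D fetches http://foo": False}))
```

When both observations are down, the shared web server, which has the highest prior among the shared components, moves to the top of the list in this toy setup.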
Figure 3: Example Inference Graph. The response time measured for fetching http://foo (dashed outline) is affected by the root causes (shown with dotted outlines).
bile users, and we use the terms interchangeably. We believe MnM is applicable to users in constant motion, but it is out of the scope of this paper.
3.2 Impact of Nomadic Users
3.1 Fault Diagnosis using Inference Graphs
Prior work in fields as diverse as network management [15, 4, 24] and medical diagnosis has shown the advantages of using an Inference Graph to diagnose faults in the presence of noisy observations. However, we have found that nomadic users violate some of the important assumptions on which these systems are based, and, consequently, these systems perform poorly when used to diagnose the problems experienced by nomadic devices.
MnM attempts to leverage the expressiveness of inference graphs while fixing the problems that prevent their use with nomadic systems. We begin with a brief overview of inference graphs. For more details, see [4, 15]. Then, in Section 3.2, we describe the problems caused by nomadic users. Section 4 describes our techniques for applying Inference Graphs to nomadic hosts.
The Inference Graph:
We use the model proposed in Sherlock [4]. An Inference Graph consists of directed edges and three types of nodes: root causes, meta-nodes, and observations. The graph encodes how root causes, which represent components or services that can be faulty, affect the observation nodes, which represent aspects of the system that can be measured. Meta-nodes are the glue that ties together the root causes involved in particular services or network paths.
Figure 3 illustrates an example Inference Graph for a single client
One could ask: would a trivial combination of wireless monitoring methods [8, 10, 18] and wired monitoring methods [4] be able to diagnose the problems experienced by nomadic users? We answer this question by making the following four observations:
3.2.1 Dynamic Inference Graph
A defining characteristic of nomadic users is that they move, changing their location and their point- and method-of-attachment to the network up to several times during a day. As a result, Inference Graphs for nomadic users change frequently and significantly. For example, when a nomadic user connects to the enterprise network via a wireless network, the AP changes as she moves from one location to another. Worse yet, the servers in other parts of the Inference Graph change as well, as the DNS and Kerberos servers that a host uses may change whenever the subnet changes and a new IP address is issued from the DHCP server. Figure 4 illustrates how the Inference Graph for a particular application changed compared to the inference graph of Figure 3 as client C’s point of attachment changed from a wired network to a wireless network at a different location.
Such dynamism inside the network is a problem for current inference systems. Prior work has proposed techniques for learning the Inference Graph via monitoring the packets that hosts send and receive [20, 4]. However, these learning algorithms assume that
3.2.4 Difficulties Identifying Root Causes
Figure 4: Example Inference Graph when a nomadic user connects to the corporate network using an 802.11 wireless network. To ease comparison with Figure 3, nodes affected by mobility are shown with dark backgrounds.
One might argue that running existing wireless and wired diagnostic tools separately can diagnose application-level performance problems for nomadic users. However, low-level wireless performance metrics such as signal strength and packet loss rates have a complex relationship to the performance of higher layers [9]. One cannot simply assign thresholds to translate link-layer measurements into application-level throughputs. For example, using the data collected from our 2-week study presented in Section 6.2, we see that there is no significant correlation between the AP signal strength seen by a client and the end-to-end performance it achieves. Further, there are some dependencies in the wired network that are specific to wireless machines, e.g., APs, the wireless gateway and the wireless authentication servers. It is hard to measure their impact on application performance without unifying wired and wireless performance diagnosis.
the Inference Graph remains unchanged long enough to be learned. For example, Sherlock reports that it takes several hours for the learned Inference Graph to stabilize. Other researchers have shown that users change location frequently [7, 16], so for most cases the Sherlock algorithm would not be able to learn the Inference Graph before it changed.
MnM’s approach is to separate the Inference Graph into the portions which are relatively static and can be learned (e.g., dependencies among servers in the wired data center) and the portions that change frequently. We use the Domain Experts described in Section 4.1.4 to compute these portions as needed.
4. ARCHITECTURE
3.2.2 Importance of Location
Researchers have previously shown that the physical location of a mobile device has a direct impact on the performance of the applications it is running [8, 12]. For example, two users running the same application, connected to the network via the same AP, may experience different performance: one might see short response times from a web server while the other sees long response times, all due to variations in the RF environment around their physical location. If location is not incorporated into the Inference Graph, then the inference algorithm will blame the wrong root cause as it tries to explain the performance problems seen by the host experiencing longer delays. Consequently, MnM treats physical location as a core component of its end-to-end network diagnosis system.
A system that jointly manages wired and wireless networks needs three unique capabilities: an ability to determine the locations of mobile clients without relying on fixed monitoring resources, an ability to frequently update the inference graph, and an ability to determine the performance of different components of the network. In addition to end-to-end observations, MnM also measures the performance of some individual network components, such as the capacity of the wireless link, and includes these in its inference algorithm when diagnosing application-level performance problems. In this section we describe the architecture of MnM and show how these capabilities are incorporated within it.
Figure 5 illustrates MnM’s architecture. MnM consists of two main components: the MnM Agent that runs on each mobile device in the network, and the MnM Inference Engine that accepts data from these agents. The Inference Engine analyzes data from agents to determine the root cause of performance problems, and raises alerts to the network operator. In addition, we have domain experts, whose functionality is split between the agent and the inference engine. The role of domain experts is to modify the inference graph in some special cases. Below we provide more details on each of these components.
Comment about Privacy: This paper focuses on enterprise networks. In such networks the IT department has the authority to require every user to run monitoring software. Therefore, the issues of user consent and privacy are out of scope.
3.2.3 Dynamics of Monitoring and its Limitations
4.1 The MnM Agent
State of the art Wi-Fi network management and diagnosis systems such as Jigsaw [10], WIT [18], and DAIR [8] rely on the existence of fixed infrastructure, either in the form of specialized hardware or always-available desktop computers, to monitor the RF environment. Specialized hardware is expensive to deploy and maintain. Furthermore, the general trend in large IT departments is to replace desktop computers with laptops. Without the support of ‘static’ infrastructure, determining the physical location of a client becomes difficult. Further, the laptops of ordinary users cannot be used to take detailed measurements of their wireless environment because that would require running their Wi-Fi interface cards in promiscuous mode. Promiscuous mode prevents the cards from entering their power save states and thus places an unacceptable strain on the laptops’ batteries and increases the barrier to deployment. Consequently, end-to-end network diagnosis systems must use light-weight self-configuring location determination techniques that do not depend on support from existing infrastructure.
The MnM agent is a light-weight application that runs on users’ laptops. It includes Monitors that gather information about the system, user activity and network connectivity. This data is processed by Domain Experts that encapsulate the special logic required to deal with different problem domains. The Domain Experts generate data for the inference graph and performance observations. The agent sends all this data to the MnM Inference Engine over a transport called the Trickle Integrator that is designed to cope with intermittent and variable connectivity. The MnM Agent does not require any driver modifications in the clients and hence is easy to deploy.
4.1.1 (Agent) Controller
The Controller is the agent’s lightweight workflow engine. It provides a publisher-subscriber service to moderate the interactions between Monitors, Domain Experts, and the Trickle Integrator. All messages between the components in MnM take the form of tuples: a list of fields and their values. The experts and monitors register
Figure 5: Architecture. (Diagram labels: Agent; Inference Engine; Domain Experts (WiFi, RAS, HTTP, ...); Controller; Monitors (System, Calendar, Network, TraceRoute); Trickle Integrator; Local Store; Measurements; Inference Graph; Historical Data; Location Inference; Fault Inference; Fault Suspects.)
triggers with the Controller. Whenever the Controller processes a message matching a trigger, it invokes the associated callback with the message as an argument. The Controller itself generates messages to mark important events, such as agent startup and expiration of a periodic timer. Monitors and Domain Experts are the “pluggable” components. They can be developed independently of one another; only the format of fields and values must be agreed on to ensure proper intra-agent communication.
The agent generates a START message on startup. Then it generates a PERIODICTIMER message every polling interval, which triggers the monitors to generate messages encapsulating their measurements. In addition to the messages generated by the agent, the monitors also register for system-wide events such as network address changes and wireless hand-off events.
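The trigger-and-callback flow described above can be sketched in Python as follows. This is an illustrative assumption about how such a controller might look, not MnM's implementation: tuples are modelled as dictionaries, and the `type` field, the MEASUREMENT message, and the class and method names are hypothetical.

```python
# Minimal publish/subscribe controller; tuples are modelled as dicts of field -> value.
from collections import deque


class Controller:
    def __init__(self):
        self._triggers = []    # list of (predicate, callback) pairs
        self._queue = deque()  # messages waiting to be processed

    def register(self, predicate, callback):
        """Register a trigger: callback runs for every message matching predicate."""
        self._triggers.append((predicate, callback))

    def publish(self, message):
        self._queue.append(message)

    def run(self):
        """Process queued messages; callbacks may publish further messages."""
        while self._queue:
            message = self._queue.popleft()
            for predicate, callback in list(self._triggers):
                if predicate(message):
                    callback(message)


if __name__ == "__main__":
    controller = Controller()
    log = []

    # A hypothetical monitor that publishes a measurement on every timer tick.
    controller.register(
        lambda m: m["type"] == "PERIODICTIMER",
        lambda m: controller.publish({"type": "MEASUREMENT", "battery_charging": True}),
    )
    # A hypothetical expert that consumes measurements.
    controller.register(lambda m: m["type"] == "MEASUREMENT", log.append)

    controller.publish({"type": "START"})
    controller.publish({"type": "PERIODICTIMER"})
    controller.run()
    print(log)
```

Because components only share the message format, a new monitor or expert can be added by registering another trigger without touching existing ones.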
way servers, and ping times to the first hop router. If the Network Monitor detects that the user is connected to the network via a wireless interface, it periodically collects additional information such as the AP the interface is associated with, other APs it can detect and the signal strengths of their beacons. The monitor also generates messages that are specific to the wireless interface. For example, if the wireless client is handed off from one AP to another, it generates a HANDOFF message.
TraceRoute Monitor: This monitor uses traceroute to discover the network path between the client and the other machines to which it is sending packets.
The total amount of data pushed to the Inference Engine for each observation is less than 1 Kbyte, and hence pushing data to the server takes a negligible amount of the user’s network bandwidth. This issue is examined in detail in Section 5.
4.1.2 Trickle Integrator
We designed MnM to handle situations when mobile hosts are unable to reach the Inference Engine. Specifically, MnM includes a module inspired by Coda [21] for dealing with measured data during weakly connected and disconnected operation. Every tuple of data created by a Domain Expert or a Monitor is passed to the Controller, and from there it is placed in a local store. Data from the local store is then pushed to the Inference Engine whenever the client has connectivity. The Trickle Integrator also rate-limits the messages sent by the client to the server, and, if a backlog develops, new messages are given priority over old ones.
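A store-and-forward queue with these properties might look like the Python sketch below. It is an assumption-laden illustration, not the Trickle Integrator itself: the class name, the `max_per_flush` rate limit, and the choice to always send the newest tuple first (rather than only once a backlog has formed) are simplifications introduced here.

```python
# Sketch of a store-and-forward queue that rate-limits uploads and favours new data.
from collections import deque


class TrickleStore:
    def __init__(self, max_per_flush):
        self._store = deque()                # local store; newest tuple on the right
        self._max_per_flush = max_per_flush  # assumed rate limit per flush

    def add(self, tuple_):
        """Place a tuple from a monitor or expert in the local store."""
        self._store.append(tuple_)

    def flush(self, connected, send):
        """Send up to max_per_flush tuples, newest first, if connectivity exists."""
        sent = 0
        while connected and self._store and sent < self._max_per_flush:
            send(self._store.pop())  # pop from the right: newest message first
            sent += 1
        return sent

    def backlog(self):
        return len(self._store)
```

Older tuples stay in the store and drain on later flushes, so nothing is dropped in this sketch; a real deployment would presumably also bound the store size.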
4.1.4 Domain Experts
4.1.3 Monitors
As mentioned in Section 4.1.1, monitors can be developed independently and dynamically added to the MnM Agent on an as-needed basis. In our current implementation, the MnM Agent contains four monitors.
System Monitor: This monitor reports various system properties from the current polling interval. It reports information such as whether the system’s battery is being charged and whether the system is connected to a wired network (e.g., Ethernet). It also reports whether a user is currently active on the system (the system is considered idle if there is no user input for n minutes, where n is currently set to 2).
Calendar Monitor: The Calendar Monitor tracks the time and location of accepted meetings from the user’s enterprise calendar (e.g., Exchange or Lotus server). This information is used to bootstrap the location engine, as we shall describe in Section 4.2.1.
Network Monitor: The Network Monitor reports information about network connectivity. The monitor is triggered by network-change events from the system, such as a network address change. It reports information about active network interfaces including: IP and MAC addresses, gateways, DNS and default gate-
In Sherlock, the authors assume that the Inference Graph is stable, and hence it is learnable via black-box techniques. However, mobility causes changes to the Inference Graph, and even though the changes may be regular and sometimes predictable, they are generally too rapid for black-box techniques to learn the graph. To handle this, we define the concept of a Domain Expert: a module that is responsible for making the appropriate changes to the Inference Graph when triggered by a host changing its connection point or other dependencies.
A typical Domain Expert has code both on the host, as part of the Agent, and on the Inference Engine. Domain Experts respond to triggers such as a change in IP address or an AP handoff event. Upon such changes, the Domain Expert on the client notifies the Domain Expert on the Inference Engine of the triggering event. The Domain Expert on the Inference Engine then updates the Inference Graph appropriately. For example, when an AP handoff event occurs, the WiFi Domain Expert on the agent notifies its counterpart on the Inference Engine. The Inference Engine then updates the Inference Graph to account for the change in topology.
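The engine-side half of such an expert can be pictured with the following Python sketch, which rewrites a client's AP dependency when a handoff event arrives. The data layout (a per-client set of root-cause names), the event fields, and the class names are hypothetical and chosen only for illustration.

```python
# Toy per-client dependency sets updated by a hypothetical WiFi expert on handoff.


class InferenceGraph:
    def __init__(self):
        self.client_deps = {}  # client -> set of root-cause names it depends on

    def replace_dependency(self, client, old, new):
        """Swap one root cause for another in a client's dependency set."""
        deps = self.client_deps.setdefault(client, set())
        deps.discard(old)
        deps.add(new)


class WiFiExpertServerSide:
    """Engine-side half of the expert; reacts to events forwarded by the agent."""

    def __init__(self, graph):
        self.graph = graph

    def on_event(self, event):
        # Only handoff events change the graph in this sketch; others are ignored.
        if event["type"] == "HANDOFF":
            self.graph.replace_dependency(
                event["client"], "AP:" + event["old_ap"], "AP:" + event["new_ap"]
            )
```

The point of the split is that the graph is edited directly from a known event, rather than being relearned from traffic after each move.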
WiFi Expert: The WiFi Expert is responsible for managing the details of how wireless connectivity affects the performance of applications running on a mobile node. It does this by adding new root cause and observation nodes to the Inference Graph in a particular pattern, which we call a graph gadget. Based on reports from the monitors on the client, the expert fills in the correct AP and location information. Figure 6 illustrates the new Inference Graph generated with the help of a WiFi Expert.
Most importantly, for every client whose location can be determined, the WiFi Expert adds a new root cause node that represents the location. There is one location root cause node for each location known to MnM; all the clients predicted to be in that location
engine stores and analyzes the data sent to it from each of the MnM Agents. Using this information and the service-level dependency graph, it generates and updates an Inference Graph that reflects where the mobile clients are located and how they are connected to the network. It uses the Inference Graph to generate a list of probable causes whenever it identifies performance problems, and subsequently raises alerts.
Figure 6: “Gadget” added to the Inference Graph of mobile hosts by the WiFi Expert. New elements shown in grey or with darker lines.
share that node. Associated with location is the a priori probability that the location causes performance problems. MnM determines locations and computes priors as described in the subsections that follow. The expert connects the location root cause to an observation node whose value is tied to measurements of the RTT of pings between the client and the current AP. The RTT provides a degree of direct estimation of current wireless channel quality, while location priors provide historical information about the wireless channel quality at this location.
HTTP Expert: The HTTP Expert monitors the response time of web servers when URLs are fetched, and reports these to the Inference Engine. The Inference Graph uses these as observations about the application’s health. For testing purposes, our HTTP Expert also includes a URL polling robot that can be ordered to fetch particular URLs during experiments.
Network Expert: The Network Expert computes the network topology-related dynamic part of the inference graph whenever a network change event occurs on the client. It is responsible for filling in two types of information. First, it computes the network path to network services by using topology discovery techniques, such as traceroute. Second, it detects changes in location-dependent network services, such as the DNS and Kerberos servers. The Network Expert counterpart on the Inference Engine updates this information in the inference graph.
Service Expert: The Service Expert is a special expert that runs only on the Inference Engine, and has no client counterpart. The Service Expert is responsible for building a static, service-level dependency graph for all networked applications. A service is identified by the service name and the server that is providing that service. For example, a web site is identified by its URL and the web server hosting it. The Service Expert gets the data needed to construct the dependency graph from a variety of sources. For example, systems like [10, 4] use temporal correlation in packet traces to infer dependencies. Some information, such as the topology of the data center, can be extracted from network configuration files. The static dependency graph is combined with dynamic information from other domain experts, such as the Network Expert and the WiFi Expert, to build an inference graph.
Comment: We note that the Domain Expert architecture is a general technique that will be useful for handling other types of domains where the Inference Graph changes faster than it can be learned. An example of this is peer-to-peer systems where the servers being invoked change depending on the query being made.
4.2.1 Location Inference
4.2 The MnM Inference Engine
The MnM Inference Engine is responsible for monitoring the health of the mobile device and the applications running on it. The
The physical location of a wireless client may have a strong impact on its network performance [8]. Thus, management tools designed for wireless networks must include an integrated location estimation system.
A number of techniques [6, 8, 25] have been proposed for estimating the location of clients in a Wi-Fi network. These techniques offer a wide range of tradeoffs between accuracy, measurement overhead, required infrastructure support and the need for detailed profiling of the physical environment. For the purpose of network management, it is generally sufficient to determine the client location at the granularity of one office. However, unlike the scenario described previously [8], we cannot rely on the presence of densely deployed, fixed desktops to serve as monitors. Hence, we have built a location system using the technique described in [25].
Location Profiles: Our system stores a profile for each location of interest. To allow for easy interpretation, we define location in terms of office numbers, rather than (x, y, z) coordinates. The profile for each office consists of a list of APs (i.e., their BSSIDs) that are visible from that location along with the distribution of observed signal strength of each AP. We assume a Gaussian distribution and characterize it with its mean and variance. These profiles are generated automatically, as we will explain later in this section.
Determining Client Location: As part of its observations (e.g., measuring URL response times), the Wi-Fi Monitor running on each client sends the inference engine the list of APs seen by the client, along with their signal strengths. Using the stored profiles, and the Bayesian inference technique described in [25], the location inference module determines the most likely location of the client and persists it with the observation data in a history database. The median error for computed location is about 5 meters (one or two offices). We will present a detailed evaluation of the accuracy of our location system in Section 6.1.
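As a rough illustration of matching an observation against Gaussian profiles, the Python sketch below picks the office whose profile gives the observed signal strengths the highest likelihood. It assumes independent per-AP Gaussians and a uniform prior over offices, which may differ from the technique in [25]; the office names, BSSIDs, profile values, and the penalty for unseen APs are made up.

```python
import math

# Location profiles: office -> {BSSID: (mean signal strength in dBm, variance)}.
# All values are made up for illustration.
PROFILES = {
    "Office 101": {"ap-a": (-45.0, 16.0), "ap-b": (-70.0, 25.0)},
    "Office 205": {"ap-a": (-75.0, 25.0), "ap-b": (-50.0, 16.0)},
}

MISSING_LOG_LIKELIHOOD = -20.0  # assumed penalty when an AP is not in a profile


def gaussian_log_pdf(x, mean, variance):
    """Log density of a Gaussian with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def most_likely_location(observation):
    """Pick the profile with the highest log-likelihood (uniform prior over offices)."""

    def log_likelihood(profile):
        total = 0.0
        for bssid, rssi in observation.items():
            if bssid in profile:
                mean, variance = profile[bssid]
                total += gaussian_log_pdf(rssi, mean, variance)
            else:
                total += MISSING_LOG_LIKELIHOOD
        return total

    return max(PROFILES, key=lambda office: log_likelihood(PROFILES[office]))
```

Working in log space keeps the product of many small densities numerically stable when a client reports a long list of APs.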
Automatic Generation of Profiles: To reduce the effort required to roll out MnM, we automatically generate location profiles by using the information provided by the Calendar Monitor running on each client.
Most corporate environments provide a calendar service that employees use to schedule meetings with each other. For each meeting, the calendar records the identities of invited attendees and the location of the meeting (e.g., a conference room or another employee's office). MnM generates profiles for rooms that appear as meeting locations using the Wi-Fi observations reported by the employees' laptops during the meeting time. To reduce the amount of erroneous information included in the location profile, MnM verifies both that there is activity on the user's laptop during the meeting (i.e., the user has the laptop with them at the meeting) and that the Wi-Fi observations are roughly consistent with those of other attendees (i.e., the user has actually gone to the meeting, rather than remaining in their office).
To generate a profile for a user's office, MnM looks for Wi-Fi observations made during times when the user has no meeting scheduled. Many people plug their laptops into wired Ethernet and/or wall power when they are in their offices, and MnM looks for these clues when selecting observations to construct the office profile.
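Once observations have been attributed to a room by the calendar heuristics, turning them into a profile amounts to estimating a mean and variance per AP. The sketch below illustrates that step only; the filtering heuristics are assumed to have run already, and the `min_samples` cutoff is a hypothetical parameter, not one described in the text.

```python
import statistics
from collections import defaultdict


def build_profile(observations, min_samples=2):
    """Build a {BSSID: (mean, variance)} profile for one office.

    observations: list of dicts mapping BSSID -> signal strength (dBm),
                  all already attributed to the same office.
    APs with fewer than min_samples readings are skipped because a
    sample variance cannot be estimated from a single value.
    """
    samples = defaultdict(list)
    for obs in observations:
        for bssid, rssi in obs.items():
            samples[bssid].append(rssi)
    return {
        bssid: (statistics.mean(values), statistics.variance(values))
        for bssid, values in samples.items()
        if len(values) >= min_samples
    }


# Illustrative readings reported by laptops during one meeting.
meeting_observations = [
    {"ap-a": -44.0, "ap-b": -70.0},
    {"ap-a": -46.0, "ap-b": -72.0},
    {"ap-a": -45.0, "ap-c": -90.0},
]
profile = build_profile(meeting_observations)
print(profile)
```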
We also note that in an environment where APs are deployed densely, it may be sufficient to characterize the location of the client simply by the AP that the client is associated with. This method requires no profiling, but is subject to inaccuracies, since clients sometimes associate with APs that are far away. We evaluate the usage of APs as a stand-in for location in Section 6.2.
4.2.2 Fault Inference
The fault inference module of MnM is responsible for taking the data produced by the agents in the system and determining which root causes are responsible for any problems. The resulting list of fault suspects is given to the network managers for reporting and resolution.
The module consists of two components: the computation of location priors, which is invoked once a day, and the inference module, which is invoked every 3 minutes or whenever there is a significant change in the observations being reported by clients.
Once invoked, the inference module updates the Inference Graph, computes the state of the observation nodes, and then runs the inference algorithm to determine a list of fault suspects.
Computing Priors for Locations: Instead of detailed current measurements, MnM relies on analysis of past experience to compute a prior probability of failure for each location known to the system. These priors are then used by the inference algorithm when determining the root causes responsible for bad observations. Priors can be cheaply computed from information already available in the historical database present on the Inference Engine, and, as shown in our evaluation, they largely eliminate the need for detailed current measurements when diagnosing faults.
Once a day, the Inference Engine computes priors for each location l by retrieving from its history database all response time observations from locations within 6.7 meters of l. (6.7 meters is the median error of our location inference system, so observations labeled as being from those locations could have come from l.) MnM then computes the fraction of those response times that are down and uses this fraction as the prior probability that l is faulty. This simplistic approach implicitly assumes that all down observations are due solely to the location, discounting the effect of the servers and other components that might affect the observations. However, since our approach averages over the response times of many servers contacted from location l over long periods of time, any systematic bias is most likely due to the location. More complicated Bayesian estimation techniques could be used, but our evaluation shows they are unnecessary in our environment.
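The prior computation described above can be sketched as follows. The office coordinates, the record layout, and the choice to return `None` when no nearby observations exist are illustrative assumptions; only the 6.7-meter radius and the "fraction of down observations" rule come from the text.

```python
import math


def location_prior(target, coords, observations, radius_m=6.7):
    """Fraction of 'down' observations among those within radius_m of target.

    coords:       dict mapping location -> (x, y) position in meters
    observations: list of (location, is_down) pairs from the history database
    Returns None when no observation falls within the radius (an assumption;
    the text does not say how such locations are handled).
    """
    tx, ty = coords[target]
    nearby = [
        is_down
        for loc, is_down in observations
        if math.hypot(coords[loc][0] - tx, coords[loc][1] - ty) <= radius_m
    ]
    if not nearby:
        return None
    return sum(nearby) / len(nearby)


# Illustrative floor plan: B is 5 m from A, C is 30 m from A.
coords = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (30.0, 0.0)}
observations = [("A", True), ("A", False), ("B", False), ("B", False), ("C", True)]
prior_a = location_prior("A", coords, observations)
prior_c = location_prior("C", coords, observations)
print(prior_a, prior_c)
```

Pooling observations from neighboring offices smooths over location-inference error at the cost of blurring priors across adjacent rooms.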
Computing the Inference Graph: The Inference Engine controller orchestrates the construction of the Inference Graph by the various Domain Experts through a publish-subscribe system. The basic inference graph is generated by the service expert. Each Domain Expert subscribes to be notified whenever nodes or edges with specified properties are added to or deleted from the graph. Upon receiving such a notification, the Domain Expert makes its own alterations to the graph. This process repeats until no further changes are made to the graph, at which point the graph is ready to use for inference.
The process of altering the Inference Graph is triggered whenever a monitor or expert on a client detects a change. For example, when the HTTP Expert on client C observes the client accessing a web page http://foo.com with response time rt, the HTTP Expert on the Inference Engine will create a new observation node for C accessing foo.com if it does not already exist in the Inference Graph. The addition of this observation node causes the Service Dependency Expert to add nodes and edges reflecting the servers involved in accessing foo.com (e.g., DNS, Kerberos, and foo.com itself). The addition of these nodes causes the Network Expert to fill in additional root causes and edges for the network paths from C to those servers, the DNS servers currently being used by C, etc.
Computing Observations: Before invoking the inference algorithm, the inference module scans all observation nodes in the Inference Graph and invokes the Domain Expert that created each node. The Domain Expert is expected to determine whether the observation node is up or down, and typically does so by retrieving recent measurements for that node and determining whether they are normal or abnormal. For example, the observation node for an HTTP response time returns down if the response time is greater than a threshold based on the normal distribution of response times for that web server, and up otherwise.
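As a rough illustration of the up/down decision for an HTTP observation node, the sketch below flags a node as down when recent response times exceed a threshold derived from the historical distribution. The text does not give the exact rule; the mean-plus-k-standard-deviations threshold, the value `k=2.0`, and the use of the median of recent measurements are all assumptions.

```python
import statistics


def observation_state(recent_ms, history_ms, k=2.0):
    """Classify an observation node as 'up' or 'down'.

    recent_ms:  recent response times for this node, in milliseconds
    history_ms: historical response times for the same web server
    The node is 'down' when the median recent response time exceeds
    mean(history) + k * stdev(history); k is a hypothetical parameter.
    """
    threshold = statistics.mean(history_ms) + k * statistics.stdev(history_ms)
    return "down" if statistics.median(recent_ms) > threshold else "up"


# Illustrative history with mean 100 ms and sample stdev of about 7.9 ms.
history = [100.0, 110.0, 90.0, 105.0, 95.0]
slow_state = observation_state([300.0, 280.0, 310.0], history)
normal_state = observation_state([102.0, 98.0], history)
print(slow_state, normal_state)
```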
Diagnosing Faults: Given an Inference Graph, prior probabilities for locations, and the up and down status of the observations, MnM uses the Ferret inference algorithm described in [4] to compute the root causes that are most likely responsible for the down observations. These root causes are returned as the fault suspect list.
5. IMPLEMENTATION
We have implemented the MnM system shown in Figure 5. The Agent Controller is implemented as a daemon (service) process. The Domain Experts and Monitors are implemented as loadable modules that are loaded and invoked by the Controller. The Inference Engine is implemented as a centralized service. The Inference Engine uses a database to store historical data but keeps the Inference Graph and the current observations in memory for fast access. The Inference Engine can run inferences on live incoming data or on the historical data. Our Inference Engine integrates with the enterprise network management system deployed in our organization and generates alerts through its console whenever it diagnoses a performance problem.
Scalability is a frequent concern with centralized systems. We evaluated two aspects of the scalability of our design: the CPU and network overhead on the client machines, and the performance of the Inference Engine as the number of nodes increases.
The CPU overhead of running the MnM agent on client machines is negligible. Each client machine, on average, generates less than 1000 bytes per minute (0.13 Kbps), which is also negligible.
The traffic from all clients aggregates at the central Inference Engine. Even with 10,000 active clients, the Inference Engine receives less than 1.5 Mbps of traffic. The CPU overhead of our Inference Engine is also small. The authors of Sherlock [4] show that the overhead of inference scales linearly as the number of nodes increases. We observed similar behavior with our system. On a machine with 3 GB of RAM and four 3.2 GHz CPUs, our inference algorithm processes an Inference Graph containing more than 100,000 nodes in less than 5 seconds.
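The traffic figures above follow from simple unit conversion, which the short calculation below reproduces. It takes the stated upper bound of 1000 bytes per minute per client and assumes every one of the 10,000 clients sends at that bound.

```python
BYTES_PER_MINUTE = 1000  # stated per-client upper bound
CLIENTS = 10_000         # number of active clients considered in the text

# Convert bytes per minute to bits per second for one client.
per_client_bps = BYTES_PER_MINUTE * 8 / 60

# Aggregate load at the Inference Engine, in megabits per second.
aggregate_mbps = per_client_bps * CLIENTS / 1_000_000

print(f"{per_client_bps / 1000:.2f} Kbps per client")  # prints "0.13 Kbps per client"
print(f"{aggregate_mbps:.2f} Mbps aggregate")          # prints "1.33 Mbps aggregate"
```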
6. EVALUATION
We evaluated MnM in a large enterprise network, performing two types of experiments. We first conducted controlled experiments with intentionally injected faults to evaluate the accuracy of our system when diagnosing the faults that might occur in an enterprise network with all nomadic users. Once we had confidence the system was performing correctly, we then ran the system for two weeks on the machines of 27 volunteers, creating a data set that we use to analyze the sensitivity of the system and the types of problems found in the network. All the experiments presented in this section were conducted on a live production enterprise network with thousands of computers, so the background traffic is entirely realistic.
Figure 7: CDF of error in predicted location, measured in meters, over 22,000 observations among 96 locations over a period of two weeks. (Plot not reproduced; x-axis: distance error in meters, y-axis: CDF, curves: calendar-based profile and survey-based profile.)
We installed MnM on 42 computers: 27 user laptops, 5 test laptops, and 10 servers. These computers were used normally by their owners in their daily activities. The users represent a variety of corporate users, including programmers, managers, and researchers. Because we are not part of the corporate IT department and had to recruit volunteers, we did not monitor the actual web sites that users visited, out of privacy concerns. Instead, we added an agent to their machines that fetched content from a set of five internal production web sites every three minutes.
Figure 8: Location priors in our building.
6.1 Location Inference Evaluation
As described in Section 4.2.1, the location estimation module infers a location for every record submitted to the Inference Engine, as long as the submitted record contains a wireless fingerprint. Most offices on our floor are approximately 9 square meters (3 m x 3 m) in size. The conference rooms are much larger. The size of the floor is 101 meters by 86 meters, and it has approximately 200 offices.
During the two-week study, the location estimation module inferred locations for over 77,000 records. Of these 77,000 records, 22,000 were manually labeled by the volunteers with their true location (i.e., the office or the conference room the machine was actually in at that time).
Figure 7 shows the CDF of the distance error between the geometric center of each record's true location and its inferred location, using two different sets of profiles. When using profiles generated automatically by our calendar heuristics, as described in Section 4.2.1, the inferred location matches the true location exactly 37% of the time. The median difference is 6.7 meters, which translates to an error of about two offices. We believe that this accuracy is sufficient for our purposes.
The calendar-based profiles will contain some errors, as machines are not always located where the calendar heuristics guess they will be. To estimate the loss in accuracy caused by these mistakes, we conducted a survey of our building by manually placing a laptop in roughly every other office for a fixed period of time and gathering the signal strengths of beacons broadcast by the various APs. We computed profiles from these observations, and then computed the distance error of the records when locations were inferred using these survey-based profiles.
The error is lower when using survey-based profiles, as all observations used to generate the profile are labeled with the correct location. The difference between the two curves measures the loss of accuracy due to mistakes made guessing the machine's location from the users' calendar. Interestingly, the median error with our automatically generated calendar-based profiles is roughly the same as the median error with survey-based profiles. This suggests that calendar-based profiling works well for a large number of locations and records, although more observations labeled with calendar data would be needed to match the accuracy of survey-based profiles across all locations.
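The accuracy metrics used in this evaluation (exact-match rate, median distance error, and points on the error CDF) can be computed from labeled records as sketched below. The room coordinates and records are made-up illustrative values, and the function names are hypothetical.

```python
import math
import statistics


def distance_errors(records, centers):
    """Distance in meters between true and inferred room centers.

    records: list of (true_office, inferred_office) pairs
    centers: dict mapping office -> (x, y) geometric center in meters
    """
    return [math.dist(centers[true], centers[inferred]) for true, inferred in records]


def summarize(errors):
    """Return (fraction of exact matches, median distance error)."""
    exact = sum(1 for e in errors if e == 0.0) / len(errors)
    return exact, statistics.median(errors)


def cdf_at(errors, distance_m):
    """Empirical CDF: fraction of records with error <= distance_m."""
    return sum(1 for e in errors if e <= distance_m) / len(errors)


# Illustrative data: three offices in a row, 3 m apart.
centers = {"101": (0.0, 0.0), "102": (3.0, 0.0), "103": (6.0, 0.0)}
records = [("101", "101"), ("101", "102"), ("102", "103"), ("103", "101")]
errors = distance_errors(records, centers)
exact_rate, median_error = summarize(errors)
print(exact_rate, median_error, cdf_at(errors, 3.0))
```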
6.2 Field Study
In this section we describe the results of our 2-week study of real users using MnM.
Location Priors: Figure 8 shows the prior probability that fetching a URL will take unacceptably long from an office, where the darker the circle, the greater the probability of that location being a problem. There is clear variation in the priors over the building, indicating that location does have a strong effect on the ability of nomadic users to access the company's servers. The middle-left of the building is particularly bad, the middle-top offices are slightly better, and the conference rooms in the middle and the offices to the right are, for the most part, the best. Priors vary from 0.01 in the best areas to almost 0.7 in the worst.
Fault Diagnosis: The Inference Engine was run every 10 minutes during the 2-week study: a total of 1530 times. It diagnosed a fault during 434 of these runs. Unsurprisingly, most faults were concentrated during the working hours, when more laptops are present and network and server usage is highest. We have confidence in the accuracy of the faults diagnosed by the system based on its performance in the controlled experiments.
Figure 9 shows the number of faults of each type that were diagnosed during the study. The bar for "With location priors" represents the results of MnM as we intend it to be used, with location priors taken into account by the inference algorithm. As there can be more than one fault diagnosed during a single run of the inference algorithm, the number of faults discovered totals to more than 434. The most common source of problems was the laptops themselves ("machines"), followed by a server in the data center. Of the 310 faults attributed to a server in the data center, 114 were attributed to a server well known to have problems with intermittent overloads. MnM also correctly identified a DNS misconfiguration on one of the servers. The server's primary DNS was configured to 127.0.0.1 while it was not running a DNS server. This was causing delay in DNS lookup, which ultimately impacted total URL fetch times.
Figure 9: Number of faults diagnosed during 2-week study, broken out by type of fault and location information used. (Plot not reproduced; fault types: Network Element, Internet Path, Network Path, Access Point HandOff, Location, Wireless Access Point, Machine, Server; series: with location priors, location = AP, no location priors; axis: number of occurrences, 0 to 450.)
Importance of location: Location was to blame for 144 problems, 10% of the total, indicating that it is a significant source of errors. During 31 of the 10-minute intervals, all problems seen by users were due solely to the users' location. Based on this data, we expect that MnM would be at least 10% more accurate in its fault diagnoses than a system that does not consider location.
To predict the performance of a system that does not include location but does model wireless components like access points, we configured MnM to use the AP with which each laptop was associated as the "location" of that laptop. As expected, the number of problems attributed to the access points increases. Interestingly, the number of problems attributed to the servers goes down: without the ability to blame specific locations, the system blames too many problems on wireless issues.
Importance of location priors: To evaluate the effect of location priors on fault diagnosis, we ran MnM with locations, but assigned all locations the same prior (labeled "no location priors" in the figure). The system correctly diagnoses location faults as often as MnM does when using accurate priors, but it also blames the machines and servers more than it should. Many locations have only a single machine reporting observations, as they are private offices, and without the historical perspective provided by the prior the system does not have enough independent observations to confidently distinguish between a problem with the location, the user's laptop, or the remote server.
6.3 Controlled Experiments
To evaluate the accuracy of our system in diagnosing problems that arise in client mobility scenarios, we conducted controlled experiments where we deliberately impaired parts of the network to create faults. These experiments were conducted on our production corporate network, so there was normal corporate background traffic and some naturally occurring failures during the experiments. Consequently, the results here give a lower bound on the accuracy of MnM.
Methodology: For the following experiments, all 42 machines polled four enterprise web sites once every 60 seconds. The MnM Agents ran the application experts and monitors described in Section 4.1.4.
Each experiment ran for at least 60 minutes, with the specified fault injected at the beginning of the experiment. The Inference Engine ran once every minute, producing at least 60 sets of fault suspects for each experiment. For these experiments, we required that the Inference Engine return the root cause representing the injected fault with rank one or two before counting it as a successful diagnosis. This is because network managers are unwilling to look beyond the top few root causes. Table 1 presents a summary of the results.
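The success criterion above (the injected fault must appear at rank one or two) can be scored over a sequence of Inference Engine runs as in the sketch below. The suspect lists are invented for illustration, and the function name is hypothetical.

```python
def rank_hit_rate(suspect_lists, injected_fault, max_rank):
    """Fraction of runs in which injected_fault appears within the top max_rank suspects.

    suspect_lists: one ranked list of root causes per Inference Engine run,
                   most likely suspect first
    """
    hits = sum(1 for suspects in suspect_lists if injected_fault in suspects[:max_rank])
    return hits / len(suspect_lists)


# Illustrative output of four one-minute runs with a bad-location fault injected.
runs = [
    ["location", "ap"],
    ["ap", "location"],
    ["location", "machine"],
    ["machine", "server", "location"],
]
success_rate = rank_hit_rate(runs, "location", max_rank=2)  # rank one or two
first_rate = rank_hit_rate(runs, "location", max_rank=1)    # ranked first only
print(success_rate, first_rate)
```

Reporting both rates mirrors Table 1, which lists how often the target root cause is first alongside the other root causes that appear in the top two.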
Problems Due to Bad Location: To measure the accuracy of our Inference Engine in identifying bad locations, we created the following experimental setup. We placed two laptops in a location with poor performance characteristics due to its long distance from an AP, and forced the laptops to associate with that AP. Three other laptops, placed closer to the AP, were also associated with the AP. The experiment tests whether MnM can correctly determine that multiple performance faults observed for clients associated with the same AP do not necessarily imply that the AP is at fault. Instead, MnM must determine the impact of a client's location on its performance. The first row of Table 1 presents a summary of the results. We made two observations during this experiment:
First, when the location module accurately inferred the locations of the two laptops seeing poor performance, the Inference Engine correctly identified the location as the highest-ranked root cause. Second, when the location module did not report the two poorly performing laptops as being at the same location, the Inference Engine reported the location as the second-highest-ranked root cause. The wireless access point was reported as the highest-ranked root cause, as it was a shared dependency between the two laptops in the Inference Graph, whereas each laptop was (incorrectly) connected to a different location root cause.
Problems Due to Bad Access Point: To determine the accuracy of MnM in identifying a poorly performing AP (e.g., one suffering from interference near it), we created the following experimental setup. We connected four laptops from different locations to a specific AP. We reduced the capacity of the AP by introducing a 500 ms delay on all packets traversing it. The experiment tests whether MnM can correctly determine that multiple performance faults observed for clients associated with the same AP do, in some cases, imply that the AP is at fault. As shown in the second row of Table 1, MnM correctly identified the AP as the root cause for all of our observations.
Problems Due to Handoff: Wireless laptops sometimes experience bad performance because their device driver is too aggressive at changing APs in an attempt to achieve better performance.
We set up the following experiment to evaluate MnM's ability to correctly detect problems due to AP handoffs. We forced one laptop to switch between two APs every 30 seconds, causing the performance of the client to suffer. Other clients were associated with the two APs from different locations, and they continued to perform normally. As shown in the third row of Table 1, MnM identified the handoff as the correct root cause for 86% of the observations.
For the remaining 14% of the observations, the AP was identified as the topmost root cause and the handoff was ranked second. This is actually the correct result, as further investigation showed that one of the two APs began experiencing outside interference during the experiment, and hence all clients associated with that AP saw poor performance. This experiment highlights how the Inference Engine is able to quickly identify the right root cause even under rapidly changing conditions.
Simultaneous Diagnosis: To measure how well MnM deals with multiple simultaneous failures, we performed two experiments where we injected multiple faults at the same time.
Target Root Cause   | % the target Root Cause is first | Other Root Causes in top two | Reasons for other root causes
Location            | 55                               | Machine, Server, AP          | Location error; real congestion at the server
AP                  | 100                              | First-hop router             | Few positive observations through the first-hop router
AP Handoff          | 86                               | Location, Machine, AP        | Location error, AP failures
Server              | 100                              | Last-hop router              | Few positive observations for the last-hop router
Simultaneous Faults | 100                              | AP, First-hop router         | Few positive observations for the first-hop router
Table 1: Root cause analysis
For the first experiment, we deliberately delayed the packets entering and leaving the server by 500 ms, and we simultaneously placed two clients at a location with known poor performance. The expected outcome for this experiment is for the server to be the highest-ranked root cause and the location to be the second highest. MnM correctly ranked these two root causes for all the observations.
In the second experiment, we placed two clients in a bad location, and we again delayed packets traversing the AP so that the performance of all clients associated with it suffered (not just the two at the bad location). The inference algorithm performed as expected and correctly ranked the AP as the highest-ranked root cause and the bad location as the second-highest-ranked root cause for all observations.
7. CONCLUSION
This paper highlights the issues that an enterprise network management and diagnosis system must handle when all its users are nomadic. These issues include rapidly changing dependencies, root cause analysis in unified wired and wireless networks, and the impact of physical location on application performance. We present MnM, an end-host based, integrated network monitoring and fault diagnosis system, and we show that taking an integrated approach to wired and wireless monitoring improves the accuracy of fault diagnosis.
8. REFERENCES
[1] A. Adya, P. Bahl, R. Chandra, and L. Qiu. Architecture and Techniques for Diagnosing Faults in IEEE 802.11 Infrastructure Networks. In MOBICOM, 2004.
[2] AirDefense: Wireless LAN Security. http://airdefense.net.
[3] AirTight Networks. http://airtightnetworks.net.
[4] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang. Towards highly reliable enterprise network services via inference of multi-level dependencies. In SIGCOMM, 2007.
[5] P. Bahl, R. Chandra, D. A. Maltz, P. Patel, J. Padhye, and L. Ravindranath. Towards Unified management of Networked Services in Wired and Wireless Enterprise Networks. Technical report, 2008. MSR-TR-2008-18.
[6] P. Bahl and V. N. Padmanabhan. RADAR: An in-building RF-based user location and tracking system. In INFOCOM, 2000.
[7] M. Balazinska and P. Castro. Characterizing mobility and network usage in a corporate wireless local-area network. In MOBISYS, 2003.
[8] R. Chandra, J. Padhye, A. Wolman, and B. Zill. A Location-based Management System for Enterprise Wireless LANs. In NSDI, 2007.
[9] Y.-C. Cheng, M. Afanasyev, P. Verkaik, P. Benko, J. Chiang, A. Snoeren, G. Voelker, and S. Savage. Automated cross-layer diagnosis of enterprise wireless networks. In SIGCOMM, 2007.
[10] Y.-C. Cheng, J. Bellardo, P. Benko, A. Snoeren, G. Voelker, and S. Savage. Jigsaw: Solving the puzzle of enterprise 802.11 analysis. In SIGCOMM, 2006.
[11] Private conversation with Dell lab members.
[12] F. Giroire, J. Chandrashekar, G. Iannaccone, K. Papagiannaki, E. M. Schooler, and N. Taft. The cubicle vs. the coffee shop: Behavioral modes in enterprise end-users. In Proc. of PAM, 2008.
[13] S. Gittlen. "Want to manage your wired/wireless LANs together? Too bad". ComputerWorld, March 2007.
[14] S. Kandula, D. Katabi, and J.-P. Vasseur. Shrink: A Tool for Failure Diagnosis in IP Networks. In Proc. MineNet Workshop at SIGCOMM, 2005.
[15] R. R. Kompella, J. Yates, A. Greenberg, and A. Snoeren. IP Fault Localization Via Risk Modeling. In Proc. of NSDI, May 2005.
[16] D. Kotz and K. Essien. Analysis of a campus-wide wireless network. In MOBICOM, 2002.
[17] M. Lopez. Forrester Research: The State of North American Enterprise Mobility in 2006. December 2006.
[18] R. Mahajan, M. Rodrig, D. Wetherall, and J. Zahorjan. Analyzing MAC-level behavior of wireless networks in the wild. In SIGCOMM, 2006.
[19] HP Openview. http://www.openview.hp.com/.
[20] P. Reynolds, J. L. Wiener, J. C. Mogul, M. K. Aguilera, and A. Vahdat. WAP5: Black-box Performance Debugging for Wide-area Systems. In WWW, May 2006.
[21] M. Satyanarayanan. Mobile information access. IEEE Personal Communications, Feb. 1996.
[22] EMC Smarts Family. http://www.emc.com/products/software/smarts/smartsfamily/.
[23] IBM Tivoli. http://www.ibm.com/software/tivoli/.
[24] S. Yemini, S. Kliger, E. Mozes, Y. Yemini, and D. Ohsie. High Speed and Robust Event Correlation. In IEEE Communications Magazine, 1996.
[25] M. A. Youssef, A. Agrawala, and A. U. Shankar. WLAN location determination via clustering and probability distributions. In IEEE Percom, 2003.