
专题论坛大数据课件.ppt (Special Topic Forum: Big Data Courseware)

  • Uploader (seller): 晟晟文业
  • Document ID: 5217317
  • Upload date: 2023-02-17
  • Format: PPT
  • Pages: 87
  • Size: 10.93 MB

    Keywords: special topic, forum, data, courseware
    Resource description:

    Big Data vs Smart Model: Beauty and the Beast
    Prof. Yike Guo
    Department of Computing, Imperial College London

    Model: Mathematical Representation of a Simplified Physical World
    Modelling is an essential and inseparable part of all scientific activity. A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.
    To understand the world or an object (called a target T), a model M is a simplified mathematical representation of it. A model is the result of abstraction from the observations made, and it is used to give predictions.
    [Diagram labels: Human/Sensor, Human/Machine, Human/Machine]
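The loop described above (observe a target, abstract the observations into a model, use the model to predict) can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the readings, the choice of a straight line as the model family, and the names `fit_line` and `predict` are all assumptions made for the example.

```python
def fit_line(ts, ys):
    """Abstract observations into a model M: least-squares slope and intercept."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def predict(model, t_new):
    """Use the model M to predict the target at an unobserved time."""
    slope, intercept = model
    return slope * t_new + intercept


# Hypothetical readings of a target T, taken at discrete times.
t_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
y_obs = [1.0, 3.1, 4.9, 7.2, 9.0]

model = fit_line(t_obs, y_obs)
prediction = predict(model, 5.0)
print(f"M: y = {model[0]:.2f} * t + {model[1]:.2f}; prediction at t=5: {prediction:.2f}")
```

The model keeps two numbers in place of five readings, which is the sense in which it is a simplified representation of the target.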

    No Model Is Perfect
      • Inherent uncertainty: These targets consist of a set of continuous phenomena (in both time and space), and they typically produce rich signals. Because the target is continuous in both time and space, the signals are in principle infinite. But observations (e.g. sensor readings) are made at discrete points in time and space, so they are incomplete and approximate, which brings the "uncertainty".
      • Overfitting or underfitting: When learning a model from observations, such as a nonlinear regression model, we need to choose parameters such as K. Because the information from observations is partial, it is hard to make a perfect choice of K. Such imperfection causes model error, like underfitting (small K) and overfitting (large K).
      • Simplification: From observations, we project from a multi-dimensional world a simplified model with significantly reduced dimensionality, to focus on the features or properties we are interested in.
    [Figure: Nonlinear regression: K-order polynomial]

      • George Box (statistician): "All models are wrong, but some are useful." Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. - 1980
      • Peter Norvig (Google): "All models are wrong, and increasingly you can succeed without them." - 2008
      • Chris Anderson (Wired): "There is now a better way. Petabytes allow us to say: 'Correlation is enough.' We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot." (The Data Deluge Makes the Scientific Method Obsolete) - 2008

    So, Why Model? The Google Argument
    At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.
    Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually knowing them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

    Model Free Sensor Informatics: Query Driven
    Database table raw-data, filled by the sensor network:
      time   id    temp
      10am   1     20
      10am   2     21
      ...    ...   ...
      10am   7     29
    User workflow:
      1. Extract all readings into a file
      2. Run MATLAB/R/other data processing tools
      3. Write output to a file/back to the database
      4. Write data processing tools to process/aggregate the output (maybe using the DB)
      5. Decide new data to acquire
      Repeat.
    Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. One of the leading vendors, Crossbow, is bundling a query processor with their devices.

    WikiSensing: A Model Free Sensor Informatics System Based on Big Data Architecture

    Model Free Sensing Is Super Inefficient
      • Data misrepresentation without a model
      • Latent information missing without a model
      • High demand for computation/storage without a model
      • Requires too much interoperability between sensors and analytics

    Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!
    Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).
    If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it.
    However, suppose a is the statement "Britain sweeps the boards at the 2012 London Olympics, winning more than 30 Gold Medals!" made before the 28th of July. Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement, depending on their specific knowledge of factors that might affect its likelihood.
    The beliefs in the model were changing daily, based on the performance data available each day. By the 10th of August, most people's belief in this model should have been almost 80%.
    Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as $P(a \mid K)$. Henry's belief in a is different from Marcel's because they are using different Ks. However, even if they were using the same K they might still have different beliefs in a.
    The expression $P(a \mid K)$ thus represents a belief measure. Sometimes, for simplicity, when K remains constant we just write $P(a)$, but you must be aware that this is a simplification.

    Model and Data Interaction: Bayesian Inference
    Bayes' rule: interaction between data and model
    Learning as a sequence of interactions
    $P(\theta \mid Y) = \frac{p(Y \mid \theta)\,p(\theta)}{p(Y)}$

    Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics
      • We need models: a model is the representation of our knowledge so far
      • Data: the observations, which may revise our belief in the models we have
      • Analysis: assessing our belief and updating our models to make them more believable
      • Sensing: acquiring the data needed to update (enrich) models
      • Models are learned from data (observations) by scientists (theoretical abstraction) or by machine (machine learning)
      • Models are hypotheses (when making new observations)
      • Models are knowledge (when belief is established)
    Sensor Informatics:
      • Sensing management: managing the "neediness", i.e. when and where to sense
      • Sensing analytics: managing model updating, i.e. how to enrich models with observations
      • Reasoning: decision making based on the integration of trusted models
    $P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$

    Surprising Event: When an Observation Does Not Fit a Known Model
    When the posterior $P(M \mid D)$ and the prior $P(M)$ differ greatly: surprise!
    How great is "great"? Surprise threshold:
      • Kullback-Leibler divergence: $D_{\mathrm{KL}}\big(P(M \mid D)\,\|\,P(M)\big) = \sum_{M} P(M \mid D)\,\log\frac{P(M \mid D)}{P(M)}$
      • Other methods: significance level, Chebyshev's theorem, ...
    From the model we get C(A, B) (e.g. a multivariate Gaussian distribution):
      • A: 100 mm, B: 50 mm: model consistent
      • A: 100 mm, B: 500 mm: surprise!

    Camera example: Image -> Analog Signal -> Digital Data -> Compressed Data -> Information
    Why sense so much data and then throw it away? Why not sense information directly?

    Using Compressive Sensing Technology to Optimize Observations
    Compressive sensing: take advantage of sparseness to recover under-determined signals from just a small number of measurements.
    Unobserved behavior (behavior not captured by the current model) is typically sparse.
    Reconstruction methods: L1-min, Bayesian CS.
    The sensed data are sufficient when the needed information can be recovered through compressive sensing.
    Matrices involved: the CS matrix, built from the model, and the placement matrix.

    How to Update the Model: Parameter Estimation
    [Figure: contour plot dated DEC 25 2011 21:15:23; contour levels 131.03, 188.294, 245.559, 302.823, 360.088, 417.352, 474.617, 531.881, 589.146, 646.41]
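Maximum-likelihood parameter estimation, the topic named in this slide title, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the talk: the readings are made up, the model is taken to be a Gaussian with known spread `sigma` and unknown mean `theta`, and the names `log_likelihood` and `mle_grid` are hypothetical.

```python
import math


def log_likelihood(theta, data, sigma=1.0):
    """Log of P(D | M, theta) for independent Gaussian readings with mean theta."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - theta) ** 2 / (2 * sigma ** 2)
        for x in data
    )


def mle_grid(data, candidates, sigma=1.0):
    """Return the candidate parameter that maximizes the log-likelihood."""
    return max(candidates, key=lambda theta: log_likelihood(theta, data, sigma))


# Hypothetical temperature readings from one sensor.
readings = [20.1, 21.3, 19.8, 20.6, 20.2]

# Candidate parameter values from 18.00 to 22.00 in steps of 0.01.
candidates = [18.0 + 0.01 * i for i in range(401)]

theta_hat = mle_grid(readings, candidates)
sample_mean = sum(readings) / len(readings)
print(f"grid estimate: {theta_hat:.2f}, sample mean: {sample_mean:.2f}")
```

For this Gaussian model the grid search lands on the sample mean, which is the closed-form maximum-likelihood estimate; the grid is used only to make the "maximize the likelihood" step explicit.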

    [Figure legend: NODAL SOLUTION, STEP=360, SUB=1, TIME=1800, TEMP (AVG), RSYS=0, SMN=131.03, SMX=646.41]
    Estimate the parameters so as to maximize the likelihood of the data given the model.

    Model: An Example in Digital City
    Modelling city life via causality: C(eA, eB) is used to predict the current value at location A when the value at another location B is given.
      • Location: physical/logical locations with causality (through sensory cortex) (city areas A, B)
      • Relationship: topology (geo topology between A and B: diffusion structure)
      • Event: events, which are the dynamics of an observable signal, S = f(E) (e.g. heavy rainfall)
    Ontologies are adopted to represent locations L, relationships R, events E, and signals S.
    Diffusion: an event $e_1 \in E$ in $n_1$ causes another event $e_2 \in E$ in $n_2$ when the two nodes $n_1$, $n_2$ in G are linked.

    Digital City Model: Looking into the Details
    System T = (L, R, E)
    Model M(T) = (G, B)
    Training for causality: use a Bayesian network to represent the conditional independencies between cause and target variables:
      1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM)
      2. Gaussian Process with Bayesian inference
    When the surprise exceeds the surprise threshold: diversity detected -> identify the incorrect causality C(el, ep), which is sparse -> compressive sensing approach.
    New observation: a measurement that could revise the model in model space to maximize the likelihood of the observations.
    [Diagram labels: Focusing on diversity, Placement, Model Updating]

    Model Driven Sensing: No Surprise!
    The dynamics of model update: Surprise -> Sensing -> Model Updating
    The goal of sensing: capturing surprise
    The goal of analysis: revising the model
    A model cannot overfit/underfit: when there is diversity, it can be updated to stay consistent with the universe (target).

    Model Update
    It is Bayesian: $P(M, \theta \mid D) = \frac{P(D \mid M, \theta)\,P(M, \theta)}{P(D)}$
    T: target, M: model, $\theta$: top-down parameter
    When $\theta$ is fixed: $P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$. The variance between posterior and prior is "surprise".
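The update rule and the notion of surprise above can be sketched for a small discrete model space. This is a hedged illustration, not the speaker's implementation: the two candidate models, all probabilities, and the threshold value are invented, and surprise is measured with the Kullback-Leibler divergence, one of the threshold measures named earlier in the slides.

```python
import math


def bayes_update(prior, likelihood):
    """Return P(M|D) = P(D|M) P(M) / P(D) over a discrete set of models."""
    evidence = sum(prior[m] * likelihood[m] for m in prior)  # P(D)
    return {m: prior[m] * likelihood[m] / evidence for m in prior}


def surprise(posterior, prior):
    """Kullback-Leibler divergence KL(posterior || prior), in bits."""
    return sum(p * math.log2(p / prior[m]) for m, p in posterior.items() if p > 0)


# Hypothetical prior beliefs P(M) about the state of a city area.
prior = {"dry": 0.9, "flood": 0.1}

# Hypothetical likelihoods P(D|M) for two different rainfall readings.
expected_reading = {"dry": 0.8, "flood": 0.3}     # fits the trusted model
unexpected_reading = {"dry": 0.01, "flood": 0.7}  # does not fit it

post_expected = bayes_update(prior, expected_reading)
post_unexpected = bayes_update(prior, unexpected_reading)

THRESHOLD_BITS = 1.0  # hypothetical surprise threshold
for name, post in [("expected", post_expected), ("unexpected", post_unexpected)]:
    s = surprise(post, prior)
    print(f"{name}: surprise = {s:.3f} bits, exceeds threshold: {s > THRESHOLD_BITS}")
```

A reading that the trusted model already explains barely moves the belief, so its surprise stays under the threshold; a reading that shifts most of the belief to the other model crosses it and, in the scheme of the talk, would trigger further sensing and a model update.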

    The fixed-$\theta$ case covers bottom-up attention and model update (data assimilation): combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time and its result becomes the forecast in the next analysis cycle.
    When $\theta$ is updated: $P(M, \theta) = P(M \mid \theta)\,P(\theta)$. This case covers top-down attention (alertness) and model update.

    Adaptive Observation: Sensing and Numerical Modelling
    CityGML Ontology -> GIS -> Geometry mesh

    Building an Initial Model and Making Predictions by Simulation
    Setting up boundary conditions, numerical schemas, model parameters, etc.
    Simulation: 24-building case (fine mesh, 600,000 nodes): 20 processors
    Simulation: moving vehicles and scalar dispersions in street canyons

    Using Sensors to Verify the Prediction Results of the Model
    Sensing: acquiring data to get the posterior of the model, in order to validate the model (if consistent) or update it.
    $P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$
    [Diagram labels: Data, sensing, Model, validate, update]

    New WikiSensing: Elastic Sensing Environment for Large Scale Sensor Informatics
      • Elastic sensing theory based on Bayesian inference
      • Big Data architecture for large scale sensory data management
      • Ontology for background knowledge management
      • Model driven adaptive observation support
      • Digital City and digital life applications

    The Architecture of the New WikiSensing System

    Ontology Used to Organise the Complex Knowledge Management
      • Using ontology to represent the targets, signals, sensing methods, measurements, etc.
      • Ontology to support flexible resolution
      • Upper ontology for unified operation
      • OntoSensor

    Conclusion
      • Big data offers a great opportunity for building smart models
      • Big data provides a new methodology for model research
      • New informatics comes from the closely coupled integration of the data and the model worlds
      • Bayesian theory provides a natural foundation for such an integration
      • Sensor Informatics is a good example of such a paradigm
      • A new uniform framework of sensor informatics can be developed based on Bayesian theory, in which the dynamics of data and model capture the essence of building a sensory system
      • We are developing the WikiSensing system to realise this paradigm

    Thank you

    Understanding Big Data

    Haixun Wang

    Data Explosion
      • MB = $10^6$ bytes: a typical book in text format
      • GB = $10^9$ bytes: a one-hour video is about 1 GB; data produced by a biology experiment in one day
      • TB = $10^{12}$ bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing is 20 TB per day (2009)

    The Arecibo Telescope
      • World's largest radio telescope
      • Diameter: 305 m (1,000 ft)
      • Area: 18 acres
      • Location: Arecibo, Puerto Rico
      • http://www.naic.edu
      • The P-ALFA surveys: 800 terabytes in 5 years

    Software Driven Telescope
      • From few, large, expensive, directional dishes to many, small, cheap, omnidirectional antennae
      • A large number of high-speed input streams (2 Gbps per antenna, 25,000 antennae in an area 340 km in diameter)

    Challenge 1: It's the data, stupid!
    [Diagram labels: Data size, Data complexity; Key/value store, Column store, Document store, Graph systems]
      • Big data drives tomorrow's economy.
      • The value of big data lies in its degree of connectedness.
      • Existing systems cannot handle the rich connectedness of big data.

    RDBMS and Rich Relationships
      • Performance of multi-way joins is very poor in RDBMS
      • Managing data of rich connectedness requires multi-way joins in RDBMS

    Trinity
      • A general purpose, distributed, in-memory graph system
      • Online graph query processing
      • Offline graph analytics

    Trinity Performance Highlight
      • Online query processing: visiting 2.2 million users (3-hop neighborhood) on Facebook: =100ms; foundation for graph-based services, e.g., entity search
      • Offline graph analytics: one iteration on a 1 billion node graph: =60sec; foundation for analytics, e.g., social analytics

    People Search Demo

    Multi-way Join vs. Graph Traversal
    [Diagram: RDBMS tables Company, Incident, and Problem linked through ID columns (ID1 to ID4), compared with Trinity]

    Challenge 2: Interpretation of Big Data
      • IBM Watson: runs on 2,880 cores, 15 terabytes of RAM, and 80 kW of power
      • A human brain: runs on a tuna fish sandwich and a glass of water
    [Diagram labels: answering the question; understanding the question; unconstrained natural language; domain specific language; inferencing & reasoning; simple calculation; Human (Turing Test); SIRI; Watson; WolframAlpha; Google/Bing?; SQL; calculator; the Eternal Quest]

    Turning the Web into a Database

    What you see when you look at my homepage:
      Haixun Wang
      Microsoft Research Asia
      Email: haixunw
      Tel: +86-10-58963289
      Tel: +1-914-902-0749
      I joined Microsoft Research Asia in 2009. I was with IBM T.J. Watson Research Center from 2000 to 2009. I received the B.S. and M.S. Degree in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, the Ph.D. Degree in Computer Science from University of California, Los Angeles in June, 2000.

    What a machine sees when it looks at my homepage:
      • A JPEG image (a jpeg file)
      • Text in a big bold font
      • 4 lines of text
      • Another dozen lines of text with two embedded URLs

    Semantic Web?
      • "Number 1 trend in 2008" - Richard MacManus
      • "The infrastructure to power the Semantic Web is already here." - Tim Berners-Lee
      • "Unstructured information will give way to structured information paving the road to intelligent computing." - Alex Iskold

    More data beats better algorithms
    Banko and Brill 2001
    [Figure: English-Spanish translation quality, Microsoft technical texts, 2001 to 2007. Y axis: mean translation quality (1 = incomprehensible, 4 = perfect). Series: Systran, MSRMT, Logos. Annotations: "Improve algorithms, scale system, and add data!"; "Rule-based system with expensive customizations for Microsoft"; "Off-the-shelf rule-based system"]
    From Rick Rashid's talk: It's a data driven world, get over it!

    Probase
    isA(concept, entities) isPr
