XML代价估计方法研究综述课件.ppt

上传人（卖家）：晟晟文业

文档编号：5197472

上传时间：2023-02-16

格式：PPT

页数：45

大小：538KB

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

25 文币

交易提醒：下载本文档，相应价格的文币将全额进入上传人（卖家）的账号。立即下载优惠套餐（点此详情）

【下载声明】
1. 本站全部试题类文档，若标题没写含答案，则无答案；标题注明含答案的文档，主观题也可能无答案。请谨慎下单，一旦售出，不予退换。
2. 本站全部PPT文档均不含视频和音频，PPT中出现的音频或视频标识（或文字）仅表示流程，实际无音频或视频文件。请谨慎下单，一旦售出，不予退换。
3. 本页资料《XML代价估计方法研究综述课件.ppt》由用户（晟晟文业）主动上传，其收益全归该用户。163文库仅提供信息存储空间，仅对该用户上传内容的表现方式做保护处理，对上传内容本身不做任何修改或编辑。若此文所含内容侵犯了您的版权或隐私，请立即通知163文库（点击联系客服），我们立即给予删除！
4. 请根据预览情况，自愿下载本文。本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
5. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007及以上版本和PDF阅读器，压缩文件请下载最新的WinRAR软件解压。

配套讲稿：: 如PPT文件的首页显示word图标，表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
特殊限制：: 部分文档作品中含有的国旗、国徽等图片，仅作为作品整体效果示例展示，禁止商用。设计者仅对作品中独创性部分享有著作权。
关键词：: XML 代价估计方法研究综述课件

资源描述：: 1、XML 代价估计方法研究综述代价估计方法研究综述 XML Group Outlinen研究代价估计的必要性nxml中代价估计研究的难点n背景知识介绍n现有方法分类研究代价估计的必要性n查询优化的基础n不同分支对查询目标的选择率不同，如果选择性低的分支先于选择性高的分支执行，就会有效的减少中间结果从而提供查询效率n结构连接在XML中是一个很重要的操作，连接的顺序的选择需要计算不同连接顺序的代价n及早给用户提供反馈信息xml中代价估计研究的难点nXML数据的不规则性是对传统统计信息方法的重要挑战，具体表现在以下几个方面：n路径依赖性，n如同为name结点，在person下和在city下出现意义就完
2、全不同n结构相关性n嵌套在不同祖先下面的同类结点的个数差别可能很大，如book结点下的author个数是不确定的n值相关性n/purchase/Itemtype=bookprice100nXML的有序性制约了转换规则的灵活性n所有这些问题，都使得在xml中采用传统的代价估计方法不切实际，会带来很大的误差。针对xml数据的特点，我们应该寻求一种新的代价估计方法。Background Knowledge nXML Data ModelnA large,node-labeled tree T(V,E)d0a1a2a3p4p5n6t14k15y131999y16t17k18k19n7b9p8y20t2
3、1k22y24t25k26n10b12p9t23t26200220012000Example XML Document TreeBackground Knowledge nXML Data ModelnA large,node-labeled tree T(V,E)d0a1a2a3p4p5n6t14k15y131999y16t17k18k19n7b9p8y20t21k22y24t25k26n10b12p9t23t26200220012000Example XML DocumentBackground KnowledgenXML Query ModelnA node-labeled query
4、tree TQnEach node of TQ is labeled with a variable name qi in Q nEach edge(qi,qj)of TQ is annotated with an XPath expression path(qi,qj)that describes the specific structural constraints specified in QQuery Fragment：for$b in doc(“bib.xml”)/bib/book$a in$b/author$t in$b/title bibbookauthortitleroot$b
5、$a$tTwig Query Model现有估计方法的分类n路径选择性计算nTwig匹配个数统计n值谓词选择率估计n结构连接代价估计 Overviewdata treedifferent summarization methods based on:pathall m-length pathF/B-stabilityLoreMcHugh et al,VLDB99pruningPathTree,Markov TableAboulnaga et al,VLDB01XSketchPolyzotis et al,VLDB02typeXSketchesPolyzotis et al,SIGMOD02+v
6、alue+twigXSketchesPolyzotis et al,SIGMOD04Region Code2D Model1D ModelStatiXFreire et al,SIGMOD02PH HistogramWu Y EDBT 2002Interva&Position ModelH.Lu SIGMOD 2003路径选择性计算n问题描述 /a/bc/d/eg/e/f/*/*/e/f/a/b/*/*/e/f/c/d/e/g/e/f/a/b/*/*/e/f/c/d/e/g/e/fPath Tree&Markov Table nAshraf Aboulnaga,Alaa R.Alameldee
7、n,and Jeffry F.Naughton.Estimating the selectivity of XML path expressions for Internet scale applications.VLDB 2001 Path TreenConstructionnStart from an original path treenCount of nodes in absolute pathsnAggregate the tree to fit in the limited spacenSibling-*nEstimationnMatch the query against th
8、e path tree,matching*as the last resort.nNeed to decide whether to use total count or average count.An XML document and its path tree A1B2C1D1D1E3Estimation:count(A/B/D)=1 count(A/C/D)=1 Aggregate Path TreeA1C9B13G 10F15H6K12E5D7K11J4I2Aggregate Path TreeA1C9B13G 10F15H6K12E5D7K11J4I2红色结点表示那些不频繁出现的结
9、点，需要删除，从而保证path树能够放入内存Sibling-*SummarizationA1C9B13*F15*K*f=23n=2683把查询和path 树匹配需要决定用总数还是平均数估计方法Estimation:count(/C/H/K)=11 count(/C/*/K)=23 Markov Tablen构造一个表，存储长度=2。n当查询路径长度=m时，可以直接从表中读出结果，否则，用公式计算。n例如，m3，查询为/A/B/C/D，则n当表不能装入内存的时候，进行剪枝操作。f(B/C/D)f(B/C)f(A/B/C/D)f(A/B/C)Markov Table:Exampleabbcddee
10、efda1a/b2b2a/c1c1b/d2d3c/d1e3c/e3f1c/f1abdde12213cf11m=2b2c/e3d3*3/3e3c/*2/2a/b2*/*1/1b/d2suffix-*Q1:a/c/e(/)(/)(/)(/)(*/*)3()(*)f c ef c ef a c ef a cff cfTwig 匹配个数统计n问题描述nTwig Query(basic building block of declarative query languages for XML)FOR$f IN document(“person.xml”)/department/faculty FOR$r
11、 in$f/RA,FOR$t in$f/TADepartmentFacultyRATA$f$r$tXSketch方法n N.Polyzotis,M.Garofalakis.Statistical Synopses for Graph-Structured XML Databases,In Proc.of ACM SIGMOD Conf.2002 nN.Polyzotis and M.Garofalakis.Structure and value synopses for xml data graphs.In Proceedings of 28th VLDB Conf.,2002.nN.Poly
12、zotis,M.Garofalakis,and Yannis Iosnnidis.Selectivity Estimation for XML Twigs.In Proc.of the 20th Intl.Conf.on Data Engineering,2004nN.Polyzotis and M.Garofalakis and Y.Ioannidis.Approximate XML Query Answers.In Proc.ACM SIGMOD 2004d0a1a2a3p4p5n6t14k15y131999 y16t17k18k19n7b9p8y20t21k22y24t25k26n10b
13、12p9t23t26200220012000D(1)A(3)N(3)P(4)B(2)Y(4)K(6)T(6)H(Y)B/FB/FB/FB/FBB/FFFXSketch方法B/F stabilityn Definition：n Let U,V be sets of elements in an XML data Tree T.We say that V is backward-stable(B-stale)with respect to U,iff for each vV there exists a u U such that edge(u,v)is in T.nSimilarly,U is
14、said to be forward-stable(F-stable)with respect to V,iff for each u U there exists a v V such that the edge(u,v)is in G.D(1)A(3)N(3)P(4)B(2)Y(4)K(6)T(6)H(Y)B/FB/FB/FB/FBB/FFFXSketch方法How to estimate the cardinality of xpath query such as A/P/K or A/p?Utilizing edge stability,We can give an accurate
15、match number:Count(A/P/K)=6Count(A/p)=3But what about A/P/T which doesnt conform to backward stability?2 assumptions Path independence Backward-edge uniformityCount(A/P/T)=f(A/P/T)*count(T)f(A/P/T)=f(A/P)*f(P/T|A/P)=f(P/T|A/P)=f(P/T)count(P)/(count(B)+count(P)XSketch方法(处理值谓词)v1v3v5v2v4BBFFSTN(v5)Dep
16、(v5)=v1,v4v1v4Histogram at v5H(1,4)=count(v11v2/v3v44/v5)v3 v5 count(v11v2/v33v44/v5)=H(1,4)*count(v33)/count(v3)A1Edge-Value IndependenceAcross Non-Satble Edgesf(u/v|v)f(u/v)A2Value-Independence Outside Correlation ScopeXSkecth方法（处理twig）nProblem with the former modelnLet us see a simple examplefor
17、t0 in A,t1 in t0/B,t2 in t0/Cra1a2b1c1b2c2ra1a2b1c1b2c21001010010100100 1010RA(2)B(110)C(110)B/FB/FB/F(a)(b)(c)Structural XSketch2000 binding tuples on the first document10100 tuples on the second document XSkecth方法（处理twig）nSinge-path XSketch model dose not store detailed enough information to captu
18、re the correlation patterns between multiple path expressions.nIf node A records a two-dimensional distribution fA(b,c)for the fraction of elements in A that have b children in B and c children in C.CBCCElementsfp10010a10.510100a20.52(0.5100100.510100)CBCCElementsfp100100a10.51010a20.52(0.51001000.5
19、1010)XSkecth方法（处理twig）CKCYElementsfp21p40.25P54(0.2521100.25110.51)5CPCN21210.2521110.511p8，p9Another ExampleEdge-distribution:fP(CY,CK,CP，CN)PYPKAPANStatix方法n Freire J,Jayant R,Ramanath M,Roy P and Simeon J.StatiX:Making XML Count.In:Proc.of ACM SIGMOD 2002,USA.Statix方法nConstructionnevery instance
20、node is assigned a unique id(type id+sequential node id)during parsing and validation,meanwhile gathering statisticsnConstructing the histogramsnHistogramnBuild histogram for each typenMight contain histogram describing the distribution of the parents of a typenValue histogram can be built for types
21、 that are defined in terms of base types(eg.integer)nEstimationnConvert the query into SQL(i.e.,relational join on IDs)nUse standard histogram multiplicationStatiX:ExampleFOR$s in document(“imdb.xml”)/show,$r in$s/reviewWHERE$s/year 1992RETURN$s/title,$rSELECT title,reviewFROM Show,ReviewWHERE Show.
22、year 1992 AND Show.ID=Review.ParIDSTAT Show CARD:5 ID:16 STAT Show_Year VALUE:19902001 BUCKET_NUM:2 BUCKETS:1990,1994):3;1994,2001):2STAT Review CARD:8 ID:3038 PARENT HISTOGRAM Show BUCKET_NUM:3 BUCKETS:1,2):4;2,3):1;3,5):3 STAT Aka CARD:6 ID:6-12 PARENT HISTOGRAM Show_Episode BUCKET_NUM:3 BUCKETS:1
23、,2):1;2,6):0;6,12):7 31/28/5=2.4SummaryQuery/*BranchValueCorrelationSchemaStatiXSimple Twig+ValueNYYYNYPath TreeSimple PathNYNNNNMarkov TableSimple PathNYNNNNXSketchSimple TwigNNYNNNXSketchesSimple Twig+ValueNNYYY(mD methods)N结构连接代价估计n问题描述n给定任意的祖先节点集合A和后代节点集合D，计算A/D结果集合的大小，估计匹配的祖先后代的结果个数 n当同一祖先有多个后代
24、，或者同一后代有多个祖先时，节点对个数可能远远大于祖先或后代的个数。n应用n连接顺序的选择结构连接代价估计nExisted workn2D model:pH histogram Wu Y,Patel J and Jagadish H.V.Estimating Answer Sizes for XML Queries.In:Proc.of the EDBT 2002.nInterval and Positional model W.Wang,H.Jiang,H.Lu,and J.F.Xu.Containment join Size Estimation:Models and Methods.In
25、:Proc.Of ACM SIGMOD 2003 Absolute Region Code(start,end)blah 1234 5678 0000 contactnamephoneblahofficehomemobile123456780000(1,16)(2,4)(3,3)(5,15)(6,8)(7,7)(9,11)(10,10)(12,14)(13,13)234a.start d.start and d.end a.endpH HistogramForbidden Regions For a nodeR1和R2是结点A的ForbiddenRegion 两个结点的(start,end)满
26、足：（1）no overlap （2）nested 不可能有部分重叠的(partiallyoverlap)情况 pH HistogramEstPA=HistP1A1/4HistP2A+HistP2B+HistP2C+HistP2E+1/2(HistP2D+HistP2F)Ancestor-based Join EstimationEstPA=HistP2A1/4HistP1A+HistP1F+HistP2G+HistP1HDescendant-based Join EstimationpH Histogram (1 4)ADDAAAAEAHAHGHHHIHA35 51 3 71021203 0
27、0 0 00000est=3*(12+1+2+0.25*5)=48.75AGIHstartendstartstartendendFor off-diagonal cells:pH for ApH for DInterval&Position ModelsnMap into new problems under 2 newly proposed models.nInterval model and Position modelnHistogram based method that is adaptive to the data:PL histogramnAssumption:1D unifor
28、m of the descendant setnEstimation is robust when correlation is lownSampling based method that captures the correlation:IM Sampling and PM SamplingnAssumption:Height of the XML data tree is a small constantnBoth have theoretical bounds on the accuracy of the estimationInterval ModelnInterval Modeln
29、Map elements to 1D line segments(points)nIMA(A):interval setnIMD(D):point setn|A SJ D|=|interval-point pairs|a1a2a3IMA(A)022d1d2d3d4IMD(D)022a(start,end)a1(2,7)a2(18,21)a3(1,22)I IMMA A(A A)d(start,end)d1(3,4)d2(9,10)d3(11,12)d3(19,20)I IMMD D(D D)interval setpoint set(2,7)a3a1a2d1d2d3d4(1,22)(3,4)(
30、9,10)(11,12)(19,20)(18,21)Position ModelnPosition ModelnUse frequency tables to record coverage and starting point informationnPMA(S):coverage tablenPMD(S):start tablen|A SJ D|=nContainment joinEquijoina(start,end)a1(2,7)a2(18,21)a3(1,22)d(start,end)d1(3,4)d2(9,10)d3(11,12)d4(19,20)VF112232425262728
31、191101111121131141151161171182192202212221P PMMA A(A A)VF102031405060708091100111120130140150160170180191200210220P PMMD D(D D)coverage tablestart table(.)(.)ia i F d i FPoint-Line(PL)-Histograma1a2a3IMA(A)d1d2d3d4IMD(D)nwsswsel1()jalldadwswsj nXnnn nPartition the pace into partitionsnEstimate the o
32、verlapping(interval,point)pairs within each bucket.nAssumptionsnData distribution of A and D are independentnD conforms to uniform distributionSampling in Interval ModelnAdapted from bifocal samplingnIM Algorithm:nChoose m samples from IMD(D)nSum up the m subjoin sizes.nX=Scale up the result by|IMD(D)|/mnStatistical GuaranteenE(X)=X(join size)nwith high probability(by Hoeffding bound)a1a2a3IMA(A)d1d2d3d4IMD(D)2136m=2ProblemnHow to handle parent-child relationship?nHow to count|A/B/C|nHow to use estimation for Query OptimizationQuestion?Then We done!Thank you!

展开阅读全文

163文库所有资源均是用户自行上传分享，仅供网友学习交流，未经上传用户书面授权，请勿作他用。

关于本文

本文标题：XML代价估计方法研究综述课件.ppt
链接地址：https://www.163wenku.com/p-5197472.html

晟晟文业

内容提供者

实名认证

联系作者