Data Mining Courseware: chap6-basic-association-analysis.ppt

Uploaded by (seller): 罗嗣辉

Document ID: 2040925

Upload date: 2022-01-19

Format: PPT

Pages: 82

Size: 2.72 MB

Keywords: data mining courseware, chap6_basic_association_analysis

Resource description:

Slide 1: Data Mining. Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining, by Tan, Steinbach, Kumar
Slide footer: Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Slide 2: Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
- Market-basket transactions: [table not preserved in the text preview]
- Example of association rules:
    {Diaper} → {Beer}
    {Milk, Bread} → {Eggs, Coke}
    {Beer, Bread} → {Milk}
- Implication means co-occurrence, not causality!
Slide 3: Definition: Frequent Itemset
- Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}
    - k-itemset: an itemset that contains k items
- Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold

Slide 4: Definition: Association Rule
- Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics:
    - Support (s): fraction of transactions that contain both X and Y
    - Confidence (c): measures how often items in Y appear in transactions that contain X
- Example, for {Milk, Diaper} → {Beer}:
    $s = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{|T|} = \frac{2}{5} = 0.4$
    $c = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{\sigma(\{\text{Milk, Diaper}\})} = \frac{2}{3} = 0.67$

Slide 5: Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
    - support ≥ minsup threshold
    - confidence ≥ minconf threshold
- Brute-force approach:
    - List all possible association rules
    - Compute the support and confidence for each rule
    - Prune rules that fail the minsup and minconf thresholds
    - Computationally prohibitive!

Slide 6: Mining Association Rules
Example of rules:
    {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
    {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
    {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
    {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
    {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
    {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements

Slide 7: Mining Association Rules
- Two-step approach:
    1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
    2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive

Slide 8: Frequent Itemset Generation
[Figure: itemset lattice over items A to E, from null down to ABCDE]
Given d items, there are $2^d$ possible candidate itemsets.

Slide 9: Frequent Itemset Generation
- Brute-force approach:
    - Each itemset in the lattice is a candidate frequent itemset
    - Count the support of each candidate by scanning the database
    - Match each transaction against every candidate
    - Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates and w the maximum transaction width. Expensive, since $M = 2^d$!

Slide 10: Computational Complexity
- Given d unique items:
    - Total number of itemsets = $2^d$
    - Total number of possible association rules:
      $R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$
    - If d = 6, R = 602 rules

Slide 11: Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
    - Complete search: $M = 2^d$
    - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
    - Reduce size of N as the size of itemset increases
    - Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
    - Use efficient data structures to store the candidates or transactions
    - No need to match every candidate against every transaction
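The definitions of support count, support and confidence, and the rule-count formula for d items, can be checked with a few lines of Python. This is a minimal sketch, not code from the slides. The market-basket table was an image and is missing from the text preview, so the five transactions below are an assumption: they are chosen to reproduce the counts quoted on the slides (for example σ({Milk, Bread, Diaper}) = 2 and s = 0.4, c = 0.67 for {Milk, Diaper} → {Beer}). All function names are illustrative.

```python
from math import comb

# Assumed market-basket table (not present in the text preview).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """c(X -> Y) = sigma(X union Y) / sigma(X); 0.0 if X never occurs."""
    lhs_count = support_count(lhs, transactions)
    if lhs_count == 0:
        return 0.0
    return support_count(set(lhs) | set(rhs), transactions) / lhs_count


def count_rules(d):
    """Number of rules X -> Y with X and Y non-empty and disjoint, over d items."""
    return sum(
        comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d)
    )


print(support({"Milk", "Diaper", "Beer"}, TRANSACTIONS))       # 0.4
print(confidence({"Milk", "Beer"}, {"Diaper"}, TRANSACTIONS))  # 1.0
print(count_rules(6))                                          # 602
```

The double sum in `count_rules` follows the formula term by term, and for d = 6 it agrees with the closed form 3^6 - 2^7 + 1 = 602.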
Slide 12: Reducing Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure:
    $\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$
    - Support of an itemset never exceeds the support of its subsets
    - This is known as the anti-monotone property of support

Slide 13: Illustrating Apriori Principle
[Figure: itemset lattice showing an itemset found to be infrequent and its pruned supersets]

Slide 14: Illustrating Apriori Principle
Minimum support = 3

Items (1-itemsets):
    Item      Count
    Bread     4
    Coke      2
    Milk      4
    Beer      3
    Diaper    4
    Eggs      1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
    Itemset            Count
    {Bread, Milk}      3
    {Bread, Beer}      2
    {Bread, Diaper}    3
    {Milk, Beer}       2
    {Milk, Diaper}     3
    {Beer, Diaper}     3

Triplets (3-itemsets):
    Itemset                  Count
    {Bread, Milk, Diaper}    3
    (count as printed; the Definition: Frequent Itemset slide gives σ({Milk, Bread, Diaper}) = 2 for the same data)

If every subset is considered: 6C1 + 6C2 + 6C3 = 41
With support-based pruning: 6 + 6 + 1 = 13

Slide 15: Apriori Algorithm
- Method:
    - Let k = 1
    - Generate frequent itemsets of length 1
    - Repeat until no new frequent itemsets are identified:
        - Generate length (k+1) candidate itemsets from length k frequent itemsets
        - Prune candidate itemsets containing subsets of length k that are infrequent
        - Count the support of each candidate by scanning the DB
        - Eliminate candidates that are infrequent, leaving only those that are frequent

Slide 16: Reducing Number of Comparisons
- Candidate counting:
    - Scan the database of transactions to determine the support of each candidate itemset
    - To reduce the number of comparisons, store the candidates in a hash structure
        - Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

Slide 17: Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
    {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
- Hash function
- Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
[Figure: hash tree holding the 15 candidates; the hash function sends items 1, 4, 7 to the first branch, 2, 5, 8 to the second and 3, 6, 9 to the third]

Slides 18-20: Association Rule Discovery: Hash Tree
[Figure: candidate hash tree with hash function 1,4,7 / 2,5,8 / 3,6,9; the three slides highlight in turn the branch reached by hashing on 1, 4 or 7, on 2, 5 or 8, and on 3, 6 or 9]

Slide 21: Subset Operation
Given a transaction t, what are the possible subsets of size 3?
[Figure not preserved in the text preview]

Slides 22-24: Subset Operation Using Hash Tree
[Figure: transaction 1 2 3 5 6 is split at the root into 1 + {2 3 5 6}, 2 + {3 5 6} and 3 + {5 6}, then at the next level into 1 2 + {3 5 6}, 1 3 + {5 6} and 1 5 + {6}, following the hash function 1,4,7 / 2,5,8 / 3,6,9]
Match transaction against 11 out of 15 candidates.

Slide 25: Factors Affecting Complexity
- Choice of minimum support threshold
    - Lowering support threshold results in more frequent itemsets
    - This may increase number of candidates and max length of frequent itemsets
- Dimensionality (number of items) of the data set
    - More space is needed to store support count of each item
    - If number of frequent items also increases, both computation and I/O costs may also increase
- Size of database
    - Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
- Average transaction width
    - Transaction width increases with denser data sets
    - This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
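The level-wise method on the Apriori Algorithm slide (generate candidates, prune those with an infrequent subset, count, eliminate) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it counts support with plain subset tests rather than the hash tree described above, and the five transactions are the same assumed market-basket table as before, since the original table is missing from the text preview.

```python
from itertools import combinations

# Assumed market-basket table (not present in the text preview).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    rows = [frozenset(t) for t in transactions]

    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for row in rows:
        for item in row:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Candidate generation: merge frequent k-itemsets into (k+1)-itemsets,
        # then prune any candidate that has an infrequent k-subset.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)

        # Support counting: one pass over the database per level.
        counts = {c: 0 for c in candidates}
        for row in rows:
            for c in candidates:
                if c <= row:
                    counts[c] += 1

        # Elimination: keep only the frequent candidates.
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


for itemset, count in sorted(apriori(TRANSACTIONS, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With the assumed table and a minimum support count of 3, the sketch returns four frequent items and four frequent pairs. The only 3-item candidate that survives pruning, {Bread, Milk, Diaper}, occurs in two of the assumed transactions and is therefore eliminated.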
Slide 26: Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have identical support as their supersets
- [Table: example data set not preserved in the text preview]
- Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$
- Need a compact representation
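To see why a compact representation is needed, the count on this slide can be evaluated directly. The snippet below is only an arithmetic check of the sum as written; the parameter names `multiplier` and `n` are illustrative, and the data set the formula refers to is not in the text preview.

```python
from math import comb


def frequent_itemset_count(multiplier=3, n=10):
    """Evaluate multiplier * sum_{k=1..n} C(n, k), which equals multiplier * (2**n - 1)."""
    return multiplier * sum(comb(n, k) for k in range(1, n + 1))


print(frequent_itemset_count())  # 3069
```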
Slide 27: Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent.
[Figure: itemset lattice with the border between frequent and infrequent itemsets; the maximal itemsets lie just inside the border]

Slide 28: Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset

    Itemset        Support
    {A,B,C}        2
    {A,B,D}        3
    {A,C,D}        2
    {B,C,D}        3
    {A,B,C,D}      2

Slide 29: Maximal vs Closed Itemsets

    TID    Items
    1      ABC
    2      ABCD
    3      BCE
    4      ACDE
    5      DE

[Figure: itemset lattice over A to E with each node annotated by the ids of the transactions that contain it; some itemsets are not supported by any transactions]

Slide 30: Maximal vs Closed Frequent Itemsets
Minimum support = 2
# Closed = 9
# Maximal = 4
[Figure: the same lattice, marking itemsets that are closed and maximal and itemsets that are closed but not maximal]

Slide 31: Maximal vs Closed Itemsets
[Figure not preserved in the text preview]

Slide 32: Alternative Methods for Frequent Itemset Generation
- Traversal of itemset lattice: general-to-specific vs specific-to-general

Slide 33: Alternative Methods for Frequent Itemset Generation
- Traversal of itemset lattice: equivalence classes

Slide 34: Alternative Methods for Frequent Itemset Generation
- Traversal of itemset lattice: breadth-first vs depth-first

Slide 35: Alternative Methods for Frequent Itemset Generation
- Representation of database: horizontal vs vertical data layout
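The closed and maximal counts quoted for the five-transaction example (minimum support = 2, 9 closed, 4 maximal) can be reproduced with a brute-force sketch. The transactions are the ones in the TID table above; the function names are illustrative, and exhaustive enumeration is used only because the example has five items.

```python
from itertools import combinations

# TID table from the Maximal vs Closed Itemsets slide.
TRANSACTIONS = {1: "ABC", 2: "ABCD", 3: "BCE", 4: "ACDE", 5: "DE"}


def frequent_itemsets(transactions, min_count):
    """Enumerate every itemset and keep those with support count >= min_count."""
    rows = [frozenset(items) for items in transactions.values()]
    all_items = sorted(set().union(*rows))
    result = {}
    for k in range(1, len(all_items) + 1):
        for combo in combinations(all_items, k):
            candidate = frozenset(combo)
            count = sum(1 for row in rows if candidate <= row)
            if count >= min_count:
                result[candidate] = count
    return result


def closed_and_maximal(frequent):
    """Split frequent itemsets into closed ones and maximal ones.

    Only frequent immediate supersets need checking: an infrequent superset
    has a lower support than any frequent itemset, so it can never tie.
    """
    closed, maximal = [], []
    for itemset, count in frequent.items():
        supersets = [s for s in frequent
                     if len(s) == len(itemset) + 1 and itemset < s]
        if not supersets:
            maximal.append(itemset)
        if all(frequent[s] < count for s in supersets):
            closed.append(itemset)
    return closed, maximal


frequent = frequent_itemsets(TRANSACTIONS, 2)
closed, maximal = closed_and_maximal(frequent)
print(len(frequent), len(closed), len(maximal))  # 14 9 4
print(sorted("".join(sorted(m)) for m in maximal))  # ['ABC', 'ACD', 'CE', 'DE']
```

Every maximal itemset is also closed, which is why the four maximal itemsets appear among the nine closed ones.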
Slide 36: FP-growth Algorithm
- Use a compressed representation of the database using an FP-tree
- Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

Slide 37: FP-tree Construction

    TID    Items
    1      {A,B}
    2      {B,C,D}
    3      {A,C,D,E}
    4      {A,D,E}
    5      {A,B,C}
    6      {A,B,C,D}
    7      {B,C}
    8      {A,B,C}
    9      {A,B,D}
    10     {B,C,E}

After reading TID=1: null → A:1 → B:1
After reading TID=2: null → A:1 → B:1, and null → B:1 → C:1 → D:1

Slide 38: FP-Tree Construction
[Figure: complete FP-tree for the transaction database above, with a header table (Item, Pointer) for A, B, C, D, E]
Pointers are used to assist frequent itemset generation.

Slide 39: FP-growth
[Figure: the paths of the FP-tree that end in D]
Conditional pattern base for D:
    P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
Recursively apply FP-growth on P.
Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
(by the same pattern base, ABD also occurs twice, in transactions 6 and 9, although the slide does not list it)

Slide 40: Tree Projection
Set enumeration tree:
[Figure: set enumeration tree over items A to E, from null down to ABCDE]
Possible extension: E(A) = {B,C,D,E}
Possible extension: E(ABC) = {D,E}

Slide 41: Tree Projection
- Items are listed in lexicographic order
- Each node P stores the following information:
    - Itemset for node P
    - List of possible lexicographic extensions of P: E(P)
    - Pointer to projected database of its ancestor node
    - Bitvector containing information about which transactions in the projected database contain the itemset

Slide 42: Projected Database
Original database: the ten transactions of the FP-tree example above.
Projected database for node A:

    TID    Items
    1      {B}
    2      {}
    3      {C,D,E}
    4      {D,E}
    5      {B,C}
    6      {B,C,D}
    7      {}
    8      {B,C}
    9      {B,D}
    10     {}

For each transaction T, the projected transaction at node A is $T \cap E(A)$.

Slide 43: ECLAT
- For each item, store a list of transaction ids (tids)

Horizontal data layout:

    TID    Items
    1      A,B,E
    2      B,C,D
    3      C,E
    4      A,C,D
    5      A,B,C,D
    6      A,E
    7      A,B
    8      A,B,C
    9      A,C,D
    10     B

Vertical data layout (TID-lists):

    A: 1, 4, 5, 6, 7, 8, 9
    B: 1, 2, 5, 7, 8, 10
    C: 2, 3, 4, 5, 8, 9
    D: 2, 4, 5, 9
    E: 1, 3, 6

Slide 44: ECLAT
- Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets
    A: 1, 4, 5, 6, 7, 8, 9
    B: 1, 2, 5, 7, 8, 10
    AB: 1, 5, 7, 8
- 3 traversal approaches: top-down, bottom-up and hybrid
- Advantage: very fast support counting
- Disadvantage: intermediate tid-lists may become too large for memory

Slide 45: Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
    - If {A,B,C,D} is a frequent itemset, candidate rules:
      ABC → D, ABD → C, ACD → B, BCD → A,
      A → BCD, B → ACD, C → ABD, D → ABC,
      AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are $2^k - 2$ candidate association rules (ignoring L → ∅ and ∅ → L)

Slide 46: Rule Generation
- How to efficiently generate rules from frequent itemsets?
    - In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
    - But confidence of rules generated from the same itemset has an anti-monotone property
    - E.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
        - Confidence is anti-monotone w.r.t. number of items on the RHS of the rule

Slide 47: Rule Generation for Apriori Algorithm
[Figure: lattice of rules showing a low-confidence rule and the pruned rules beneath it]

Slide 48: Rule Generation for Apriori Algorithm
- Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD → AB, BD → AC) would produce the candidate rule D → ABC
- Prune rule D → ABC if its subset AD → BC [the text preview is truncated here]
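The confidence-based pruning described on the Rule Generation slides can be sketched for a single frequent itemset. This is a hedged illustration rather than the slides' algorithm verbatim: consequents are grown one item at a time, and a larger consequent is tried only if all of its sub-consequents produced rules that met the threshold. The support counts are the ones quoted on the slides for the market-basket example (item and pair counts from the Illustrating Apriori Principle tables, and σ({Milk, Diaper, Beer}) = 2 from s = 2/5). The names `COUNTS` and `generate_rules` are illustrative.

```python
from itertools import combinations

# Support counts quoted on the slides for the market-basket example.
COUNTS = {
    frozenset({"Milk"}): 4,
    frozenset({"Diaper"}): 4,
    frozenset({"Beer"}): 3,
    frozenset({"Milk", "Diaper"}): 3,
    frozenset({"Milk", "Beer"}): 2,
    frozenset({"Diaper", "Beer"}): 3,
    frozenset({"Milk", "Diaper", "Beer"}): 2,
}


def generate_rules(itemset, counts, min_conf):
    """Return (lhs, rhs, confidence) for the rules of one frequent itemset."""
    itemset = frozenset(itemset)
    if len(itemset) < 2:
        return []
    rules = []
    consequents = [frozenset([item]) for item in itemset]
    while consequents:
        kept = set()
        for rhs in consequents:
            lhs = itemset - rhs
            conf = counts[itemset] / counts[lhs]
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
                kept.add(rhs)
        size = len(consequents[0]) + 1
        if size >= len(itemset):
            break  # a larger consequent would leave an empty antecedent
        # Merge surviving consequents; drop a merged consequent if any of its
        # sub-consequents failed, since its confidence cannot be higher.
        consequents = list({
            a | b
            for a, b in combinations(kept, 2)
            if len(a | b) == size
            and all(frozenset(sub) in kept for sub in combinations(a | b, size - 1))
        })
    return rules


for lhs, rhs, conf in generate_rules({"Milk", "Diaper", "Beer"}, COUNTS, 0.6):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

With a minimum confidence of 0.6 the sketch keeps the four rules with c = 0.67 or c = 1.0 from the Mining Association Rules example and never keeps the two rules with c = 0.5.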

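The vertical layout and tid-list intersection from the ECLAT slides can also be sketched briefly. The horizontal table is the one given on the slides. The depth-first traversal below is one possible choice among the traversal approaches the slides mention, and it intersects the tid-list of a prefix with the tid-list of one more item, which yields the same set as intersecting two (k-1)-subsets. Function names are illustrative.

```python
# Horizontal data layout from the ECLAT slide.
HORIZONTAL = {
    1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
    6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B",
}


def to_vertical(horizontal):
    """Convert {tid: items} into {item: set of tids}."""
    tidlists = {}
    for tid, items in horizontal.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def eclat(tidlists, min_count):
    """Depth-first search; returns {tuple of items: support count}."""
    result = {}

    def extend(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            # Support of prefix + item is the size of the intersected tid-list.
            new_tids = tids if prefix_tids is None else prefix_tids & tids
            if len(new_tids) >= min_count:
                itemset = prefix + (item,)
                result[itemset] = len(new_tids)
                extend(itemset, new_tids, candidates[i + 1:])

    extend((), None, sorted(tidlists.items()))
    return result


vertical = to_vertical(HORIZONTAL)
print(sorted(vertical["A"] & vertical["B"]))  # [1, 5, 7, 8]
print(eclat(vertical, 3)[("A", "B")])         # 4
```

Each recursive call holds only the tid-lists of the current branch, but those intermediate lists are the memory cost the slides warn about.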

Link: https://www.163wenku.com/p-2040925.html
