Data mining courseware: chap6-basic-association-analysis.ppt
Data Mining Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar (4/18/2004)

Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
- Example of association rules over market-basket transactions:
  {Diaper} -> {Beer}
  {Milk, Bread} -> {Eggs, Coke}
  {Beer, Bread} -> {Milk}
- Implication means co-occurrence, not causality!

Definition: Frequent Itemset
- Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}. A k-itemset is an itemset that contains k items.
- Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2.
- Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule
- Association rule: an implication expression of the form X -> Y, where X and Y are itemsets. Example: {Milk, Diaper} -> {Beer}.
- Rule evaluation metrics:
  Support (s): fraction of transactions that contain both X and Y.
  Confidence (c): measures how often items in Y appear in transactions that contain X.
- Example, for {Milk, Diaper} -> {Beer}:
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold and confidence ≥ minconf threshold.
- Brute-force approach: list all possible association rules, compute the support and confidence for each rule, and prune rules that fail the minsup and minconf thresholds. Computationally prohibitive!

Mining Association Rules
- Example rules from the itemset {Milk, Diaper, Beer}:
  {Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
  {Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
  {Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
  {Milk} -> {Diaper, Beer} (s=0.4, c=0.5)
- Observations: all the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.

Mining Association Rules: Two-Step Approach
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
- Given d items, there are 2^d possible candidate itemsets.
- (Figure: the itemset lattice for d = 5 items A through E, from the null set up to ABCDE.)
- Brute-force approach: each itemset in the lattice is a candidate frequent itemset. Count the support of each candidate by scanning the database, matching each transaction against every candidate. Complexity O(NMw): expensive, since M = 2^d!

Computational Complexity
- Given d unique items, the total number of itemsets is 2^d, and the total number of possible association rules is

  R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

  If d = 6, R = 602 rules.

Frequent Itemset Generation Strategies
- Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases (used by DHP and vertical-based mining algorithms).
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.
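The rule-count formula can be verified numerically. This sketch compares the brute-force double sum (choose k antecedent items, then j consequent items from the rest) against the closed form R = 3^d - 2^(d+1) + 1:

```python
from math import comb

def rule_count_bruteforce(d):
    """Enumerate antecedent size k and consequent size j directly."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

def rule_count_closed(d):
    """Closed form from the slide: R = 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1

print(rule_count_bruteforce(6), rule_count_closed(6))  # 602 602
```

Both agree for every d, which is why even a modest d = 6 already yields 602 candidate rules for the brute-force approach.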
Reducing the Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support:

  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

Illustrating the Apriori Principle
- (Figure: in the itemset lattice, once an itemset is found to be infrequent, all of its supersets are pruned.)
- With minimum support count = 3 on the market-basket data:
  Items (1-itemsets): Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1
  Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs: {Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3
  Triplets (3-itemsets): {Bread, Milk, Diaper} 3
- If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates. With support-based pruning: 6 + 6 + 1 = 13.

Apriori Algorithm
- Method:
  1. Let k = 1.
  2. Generate frequent itemsets of length 1.
  3. Repeat until no new frequent itemsets are identified:
     - Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
     - Prune candidate itemsets containing subsets of length k that are infrequent.
     - Count the support of each candidate by scanning the DB.
     - Eliminate candidates that are infrequent, leaving only those that are frequent.

Reducing the Number of Comparisons
- Candidate counting: scan the database of transactions to determine the support of each candidate itemset.
- To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets.
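The repeat loop above can be sketched as a compact Apriori in Python, over transactions reconstructed from the slides' item counts (the transaction list and the `apriori` helper are assumptions for illustration). With minimum support count 3, candidate generation mirrors the slides' pruning arithmetic: 6 candidate pairs instead of C(6,2) = 15, and a single candidate triplet instead of C(6,3) = 20:

```python
from itertools import combinations

# Reconstructed market-basket transactions (an assumption).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MINSUP = 3  # minimum support count, as on the slide

def apriori(transactions, minsup):
    """Return {k: set of frequent k-itemsets, as frozensets}."""
    items = sorted({i for t in transactions for i in t})
    # Step 1-2: frequent itemsets of length 1.
    frequent = {1: {frozenset([i]) for i in items
                    if sum(i in t for t in transactions) >= minsup}}
    k = 1
    while frequent[k]:
        # Candidate generation: join frequent k-itemsets whose union has
        # k+1 items, pruning any candidate with an infrequent k-subset.
        candidates = set()
        for a in frequent[k]:
            for b in frequent[k]:
                c = a | b
                if len(c) == k + 1 and all(frozenset(s) in frequent[k]
                                           for s in combinations(c, k)):
                    candidates.add(c)
        # Count support by scanning the DB; keep only frequent candidates.
        frequent[k + 1] = {c for c in candidates
                           if sum(c <= t for t in transactions) >= minsup}
        k += 1
    del frequent[k]  # drop the final empty level
    return frequent

freq = apriori(transactions, MINSUP)
```

With this reconstructed data the triplet candidate {Bread, Milk, Diaper} is generated but happens to fall below the threshold, so the result contains four frequent items and four frequent pairs.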
Generate Hash Tree
- Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}.
- You need a hash function and a max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node).
- The slides use a hash function that sends items 1, 4, 7 to one branch, items 2, 5, 8 to a second, and items 3, 6, 9 to a third.
- (Figures: the candidate hash tree built by hashing first on 1, 4 or 7, then on 2, 5 or 8, then on 3, 6 or 9.)

Subset Operation
- Given a transaction t, what are the possible subsets of size 3?
- Subset operation using the hash tree: hash each size-3 subset of t down the tree and compare it only against the candidates stored in the leaves it reaches.
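A minimal sketch of hash-based candidate matching, flattened to a single hash level rather than the slides' full recursive tree (the one-level simplification and the `h` helper are assumptions for illustration):

```python
from itertools import combinations

# The 15 length-3 candidate itemsets from the slide, as sorted tuples.
candidates = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]

def h(item):
    """Slide hash function: 1,4,7 -> branch 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3

# One-level simplification of the hash tree: bucket each candidate by the
# hash of its first item (the full tree recurses on later items as well).
buckets = {}
for c in candidates:
    buckets.setdefault(h(c[0]), []).append(c)

def matches(transaction):
    """Find candidates contained in t, probing only the relevant bucket."""
    t = sorted(transaction)
    found = set()
    for subset in combinations(t, 3):  # all size-3 subsets of t
        for c in buckets.get(h(subset[0]), []):
            if c == subset:
                found.add(c)
    return found

# The transaction used on the subset-operation slides:
print(sorted(matches({1, 2, 3, 5, 6})))  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

Each subset is compared only against candidates in its own bucket rather than against all 15, which is exactly the saving the hash structure is meant to buy.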