Data Mining Courseware: chap7-extended-association-analysis (1).ppt
Data Mining
Association Rules: Advanced Concepts and Algorithms
Lecture Notes for Chapter 7, Introduction to Data Mining
by Tan, Steinbach, Kumar

Continuous and Categorical Attributes
- Example of an association rule:
  {Number of Pages in [5,10), Browser = Mozilla} -> {Buy = No}
- How can the association analysis formulation be applied to non-asymmetric binary variables?

Handling Categorical Attributes
- Transform each categorical attribute into asymmetric binary variables.
- Introduce a new "item" for each distinct attribute-value pair. Example: replace the Browser Type attribute with the items
  - Browser Type = Internet Explorer
  - Browser Type = Mozilla
  - Browser Type = Netscape
  (A sketch of this transformation follows below.)
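The item-per-value transformation is easy to express directly. Here is a minimal Python sketch; the records and attribute values are invented for illustration:

```python
# Sketch: turn categorical records into transactions of
# "Attribute=value" items, one item per attribute-value pair.
records = [
    {"Browser Type": "Mozilla", "Buy": "No"},
    {"Browser Type": "Internet Explorer", "Buy": "Yes"},
]

transactions = [
    {f"{attr}={val}" for attr, val in record.items()}
    for record in records
]
print(transactions)
# e.g. [{'Browser Type=Mozilla', 'Buy=No'},
#       {'Browser Type=Internet Explorer', 'Buy=Yes'}]
# (set element order may vary)
```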
Handling Categorical Attributes: Potential Issues
- What if the attribute has many possible values?
  - Example: the attribute Country has more than 200 possible values.
  - Many of the attribute values may have very low support.
  - Potential solution: aggregate the low-support attribute values (sketched below, after this list).
- What if the distribution of attribute values is highly skewed?
  - Example: 95% of the visitors have Buy = No.
  - Most of the items will be associated with the (Buy = No) item.
  - Potential solution: drop the highly frequent items.
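The aggregation idea can be applied as a simple preprocessing pass. A minimal Python sketch, with the country list and threshold invented for illustration:

```python
# Sketch: fold attribute values whose support is below a threshold
# into a single aggregate item, keeping frequent values as-is.
from collections import Counter

countries = ["US", "US", "CN", "CN", "DE", "FJ", "MC", "US", "CN", "DE"]
min_support_count = 2

counts = Counter(countries)
items = [
    f"Country={c}" if counts[c] >= min_support_count else "Country=Other"
    for c in countries
]
print(items)  # FJ and MC occur once each, so they become 'Country=Other'
```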
Handling Continuous Attributes
- Different kinds of rules:
  - Age in [21,35) AND Salary in [70k,120k) -> Buy
  - Salary in [70k,120k) AND Buy -> Age: μ = 28, σ = 4
- Different methods:
  - Discretization-based
  - Statistics-based
  - Non-discretization based (minApriori)

Handling Continuous Attributes: Discretization
- Unsupervised (a sketch of the binning schemes appears after these notes):
  - Equal-width binning
  - Equal-depth binning
  - Clustering
- Supervised: choose bin boundaries using the class labels. Example (class counts for each attribute value v; the nine values are grouped into three bins, bin1 to bin3, so that the boundaries respect the class distribution):

  Class       v1   v2   v3   v4   v5   v6   v7   v8   v9
  Anomalous    0    0   20   10   20    0    0    0    0
  Normal     150  100    0    0    0  100  100  150  100

Discretization Issues
- The size of the discretized intervals affects support and confidence:
  - If the intervals are too small, rules may not have enough support.
  - If the intervals are too large, rules may not have enough confidence.
- Potential solution: use all possible intervals, e.g.
  - {Refund = No, (Income = $51,250)} -> {Cheat = No}
  - {Refund = No, (60K ≤ Income < 80K)} -> {Cheat = No}
  - {Refund = No, (0K ≤ Income < 1B)} -> {Cheat = No}
- This raises new issues:
  - Execution time: if an interval contains n values, there are on average O(n^2) possible ranges.
  - Too many rules:
    - {Refund = No, (Income = $51,250)} -> {Cheat = No}
    - {Refund = No, (51K ≤ Income < 52K)} -> {Cheat = No}
    - {Refund = No, (50K ≤ Income < 60K)} -> {Cheat = No}
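To make the two unsupervised binning schemes concrete, here is a minimal Python sketch; the age values and bin count are invented for illustration:

```python
# Sketch: equal-width vs. equal-depth binning of a continuous attribute.
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Assign values to k bins holding roughly equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

ages = [21, 22, 25, 28, 34, 35, 41, 52, 60]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
print(equal_depth_bins(ages, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Note how the two schemes disagree: equal-width binning puts four of the nine ages in the first bin because the value range is skewed, while equal-depth binning forces three values into each bin.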
Approach by Srikant & Agrawal
- Preprocess the data:
  - Discretize each attribute using equi-depth partitioning.
  - Use the partial completeness measure to determine the number of partitions.
  - Merge adjacent intervals as long as their support is less than max-support.
- Apply existing association rule mining algorithms.
- Determine the interesting rules in the output.

Approach by Srikant & Agrawal: Partial Completeness
- Discretization loses information; the partial completeness measure quantifies how much information is lost.
  - C: frequent itemsets obtained by considering all ranges of attribute values.
  - P: frequent itemsets obtained by considering all ranges over the partitions.
- P is K-complete w.r.t. C if P ⊆ C and, for every X in C, there exists a generalization X' in P such that:
  1. support(X') ≤ K × support(X), where K ≥ 1
  2. for every Y ⊆ X, there exists Y' ⊆ X' such that support(Y') ≤ K × support(Y)
- Given K (the partial completeness level), the number of intervals (N) can be determined. (Condition 1 is sketched in code below.)
- (Figure on the slide: an itemset X and its approximated generalization X'.)
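The completeness condition is easier to see operationally. Here is a small Python sketch checking condition 1 for a single itemset; the supports and the generalization are invented for illustration:

```python
# Sketch: condition 1 of K-completeness for one itemset X in C,
# whose generalization X' lives in P. All numbers are illustrative.
support_X = 0.04          # support of X, using all ranges (set C)
support_X_gen = 0.10      # support of its generalization X' (set P)

def satisfies_condition_1(K):
    # X' may be at most K times as frequent as X (K >= 1).
    return support_X_gen <= K * support_X

print(satisfies_condition_1(K=3.0))  # 0.10 <= 0.12 -> True
print(satisfies_condition_1(K=2.0))  # 0.10 <= 0.08 -> False
```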
Interestingness Measure
- Given an itemset Z = {z1, z2, ..., zk} and its generalization Z' = {z1', z2', ..., zk'}:
  - P(Z): support of Z
  - E_Z'(Z): expected support of Z based on Z', computed as

    E_Z'(Z) = P(z1)/P(z1') × P(z2)/P(z2') × ... × P(zk)/P(zk') × P(Z')

- Z is R-interesting w.r.t. Z' if P(Z) ≥ R × E_Z'(Z). (A numeric sketch follows after the next slide's notes.)
- Example rules at different levels of generality:
  - {Refund = No, (Income = $51,250)} -> {Cheat = No}
  - {Refund = No, (51K ≤ Income < 52K)} -> {Cheat = No}
  - {Refund = No, (50K ≤ Income < 60K)} -> {Cheat = No}

Interestingness Measure for Rules
- For a rule S: X -> Y and its generalization S': X' -> Y':
  - P(Y|X): confidence of X -> Y
  - P(Y'|X'): confidence of X' -> Y'
  - E_S'(Y|X): expected confidence of S based on S', computed as

    E_S'(Y|X) = P(y1)/P(y1') × P(y2)/P(y2') × ... × P(yk)/P(yk') × P(Y'|X')

- Rule S is R-interesting w.r.t. its ancestor rule S' if
  - support: P(S) ≥ R × E_S'(S), or
  - confidence: P(Y|X) ≥ R × E_S'(Y|X).
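A small numeric sketch of the support-based check above; all item supports here are invented for illustration:

```python
# Sketch: R-interestingness of itemset Z w.r.t. its generalization Z'.
# Expected support: product over items of P(zi)/P(zi'), times P(Z').
P_items     = [0.30, 0.20]   # P(z1), P(z2)   (illustrative)
P_items_gen = [0.60, 0.50]   # P(z1'), P(z2')
P_Z_gen     = 0.25           # P(Z')

expected = P_Z_gen
for p, p_gen in zip(P_items, P_items_gen):
    expected *= p / p_gen    # each item keeps its share of its generalization

P_Z, R = 0.12, 1.5
print(expected)              # 0.05 (up to float rounding)
print(P_Z >= R * expected)   # 0.12 >= 0.075 -> True: Z is R-interesting
```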
Statistics-based Methods
- Example: {Browser = Mozilla, Buy = Yes} -> Age: μ = 23
- The rule consequent is a continuous variable, characterized by its statistics: mean, median, standard deviation, etc.
- Approach:
  - Withhold the target variable from the rest of the data.
  - Apply existing frequent itemset generation to the rest of the data.
  - For each frequent itemset, compute the descriptive statistics of the corresponding target variable; the itemset becomes a rule by introducing the target variable as the rule consequent.
  - Apply a statistical test to determine the interestingness of the rule.

Statistics-based Methods: Hypothesis Testing
- How do we determine whether an association rule is interesting? Compare the statistics for the segment of the population covered by the rule against the segment not covered by it:
  - A -> B: μ versus A -> B': μ'
- Statistical hypothesis testing:
  - Null hypothesis H0: μ' = μ + Δ
  - Alternative hypothesis H1: μ' > μ + Δ
  - Test statistic: Z = (μ' − μ − Δ) / sqrt(s1²/n1 + s2²/n2)
  - Z has zero mean and variance 1 under the null hypothesis.

Statistics-based Methods: Example
- r: {Browser = Mozilla, Buy = Yes} -> Age: μ = 23
- The rule is interesting if the difference between μ and μ' is greater than 5 years (i.e., Δ = 5).
- For r, suppose n1 = 50 and s1 = 3.5; for r' (the complement), n2 = 250, s2 = 6.5, and μ' = 30. Then

  Z = (30 − 23 − 5) / sqrt(3.5²/50 + 6.5²/250) = 3.11

- For a one-sided test at the 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64. Since Z > 1.64, r is an interesting rule. (This arithmetic is reproduced in code below.)
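The slide's arithmetic can be checked directly; a minimal Python reproduction of the Z computation above:

```python
from math import sqrt

# Reproduce the Z statistic from the example above.
mu, mu_prime, delta = 23, 30, 5   # rule mean, complement mean, margin
n1, s1 = 50, 3.5                  # segment covered by rule r
n2, s2 = 250, 6.5                 # complement segment r'

Z = (mu_prime - mu - delta) / sqrt(s1**2 / n1 + s2**2 / n2)
print(round(Z, 2))  # 3.11
print(Z > 1.64)     # True: reject H0 at 95% (one-sided); r is interesting
```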
Min-Apriori (Han et al.)
- Document-term matrix (word counts per document):

  TID  W1  W2  W3  W4  W5
  D1    2   2   0   0   1
  D2    0   0   1   2   2
  D3    2   3   0   0   0
  D4    0   0   1   0   1
  D5    1   1   1   0   2

- Example: W1 and W2 tend to appear together in the same document.

Min-Apriori: Motivation
- The data contains only continuous attributes of the same "type", e.g., frequencies of words in documents.
- Potential solution: convert the data into a 0/1 matrix and apply existing algorithms, but this loses the word frequency information.
- Discretization does not apply, since users want associations among words, not among ranges of word frequencies.

Min-Apriori: Normalization
- How do we determine the support of a word? If we simply sum up its frequencies, the support count can exceed the total number of documents!
- Normalize the word vectors, e.g., using the L1 norm, so that each word has a support equal to 1.0 (a sketch follows below):

  TID   W1    W2    W3    W4    W5
  D1   0.40  0.33  0.00  0.00  0.17
  D2   0.00  0.00  0.33  1.00  0.33
  D3   0.40  0.50  0.00  0.00  0.00
  D4   0.00  0.00  0.33  0.00  0.17
  D5   0.20  0.17  0.33  0.00  0.33
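A minimal Python sketch of the column-wise L1 normalization above, using the document-term matrix from the slide:

```python
# L1-normalize each word (column) so that its entries sum to 1.0.
matrix = [
    [2, 2, 0, 0, 1],   # D1
    [0, 0, 1, 2, 2],   # D2
    [2, 3, 0, 0, 0],   # D3
    [0, 0, 1, 0, 1],   # D4
    [1, 1, 1, 0, 2],   # D5
]

col_sums = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
normalized = [[row[j] / col_sums[j] for j in range(len(row))]
              for row in matrix]
print([round(v, 2) for v in normalized[0]])  # D1: [0.4, 0.33, 0.0, 0.0, 0.17]
```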
Min-Apriori: New Definition of Support
- sup(C) = Σ_{i ∈ T} min_{j ∈ C} D(i, j), where T is the set of documents (rows) and D(i, j) is the normalized frequency of word j in document i.
- Example, using the normalized matrix above:

  Sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17
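The same computation in a self-contained Python sketch, using the normalized matrix from the previous slide:

```python
# Min-based support: for each document, take the minimum normalized
# frequency over the itemset's words, then sum over documents.
normalized = [
    [0.40, 0.33, 0.00, 0.00, 0.17],  # D1
    [0.00, 0.00, 0.33, 1.00, 0.33],  # D2
    [0.40, 0.50, 0.00, 0.00, 0.00],  # D3
    [0.00, 0.00, 0.33, 0.00, 0.17],  # D4
    [0.20, 0.17, 0.33, 0.00, 0.33],  # D5
]

def min_support(matrix, itemset):
    return sum(min(row[j] for j in itemset) for row in matrix)

print(min_support(normalized, [0, 1, 2]))  # W1, W2, W3 -> 0.17
```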