Data mining courseware: chap7-extended-association-analysis (1).ppt


Data Mining
Association Rules: Advanced Concepts and Algorithms
Lecture Notes for Chapter 7 of Introduction to Data Mining, by Tan, Steinbach, Kumar

Continuous and Categorical Attributes
Example of an association rule: {Number of Pages ∈ [5,10), Browser = Mozilla} → {Buy = No}. How can the association analysis formulation be applied to non-asymmetric binary variables?

Handling Categorical Attributes
- Transform each categorical attribute into asymmetric binary variables: introduce a new "item" for each distinct attribute-value pair. Example: replace the Browser Type attribute with one item per value, such as Browser Type = Internet Explorer and Browser Type = Mozilla (a sketch of this transformation follows below).
- Potential issues:
  - What if an attribute has many possible values? Example: the attribute Country has more than 200 possible values, many of which may have very low support. Potential solution: aggregate the low-support attribute values.
  - What if the distribution of attribute values is highly skewed? Example: 95% of the visitors have Buy = No, so most items will be associated with the (Buy = No) item. Potential solution: drop the highly frequent items.
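As a sketch of the transformation just described, the following Python fragment maps each attribute-value pair of a record to its own asymmetric binary item. The records and attribute names are invented for illustration; only the idea of one item per attribute-value pair comes from the slides.

```python
# Hypothetical records; each becomes a transaction of asymmetric binary items.
records = [
    {"Browser": "Mozilla", "Buy": "No"},
    {"Browser": "Internet Explorer", "Buy": "Yes"},
    {"Browser": "Mozilla", "Buy": "No"},
]

def to_items(record):
    """Map each attribute-value pair to one 'item' such as 'Browser=Mozilla'."""
    return {f"{attr}={value}" for attr, value in record.items()}

transactions = [to_items(r) for r in records]
# transactions[0] == {"Browser=Mozilla", "Buy=No"}
```

The resulting transactions can be fed to any standard frequent-itemset algorithm.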

Handling Continuous Attributes
- Different kinds of rules:
  - Age ∈ [21,35) ∧ Salary ∈ [70k,120k) → Buy
  - Salary ∈ [70k,120k) ∧ Buy → Age: μ = 28, σ = 4
- Different methods: discretization-based, statistics-based, and non-discretization-based (minApriori).

Using discretization:
- Unsupervised: equal-width binning, equal-depth binning, clustering (see the sketch below).
- Supervised: choose bin boundaries from the class distribution. Example (class counts across attribute values v1..v9):

  value:       v1   v2   v3   v4   v5   v6   v7   v8   v9
  Anomalous:    0    0   20   10   20    0    0    0    0
  Normal:     150  100    0    0    0  100  100  150  100

  Boundaries are placed where the class distribution changes (between v2 and v3, and between v5 and v6), yielding three bins.

Discretization Issues
- The size of the discretized intervals affects support and confidence:
  - If the intervals are too small, a rule may not have enough support.
  - If the intervals are too large, a rule may not have enough confidence.
- Potential solution: use all possible intervals, e.g.
  - Refund = No, (Income = $51,250) → Cheat = No
  - Refund = No, (60K ≤ Income < 80K) → Cheat = No
  - Refund = No, (0K ≤ Income < 1B) → Cheat = No
- Execution time: if an interval contains n values, there are on average O(n²) possible ranges.
- Too many rules, many of them redundant:
  - Refund = No, (Income = $51,250) → Cheat = No
  - Refund = No, (51K ≤ Income < 52K) → Cheat = No
  - Refund = No, (50K ≤ Income < 60K) → Cheat = No
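The two unsupervised binning schemes can be sketched with numpy. The salary values and the choice of k = 3 bins below are invented for illustration.

```python
import numpy as np

values = np.array([20, 25, 31, 40, 42, 55, 70, 71, 90, 120])
k = 3  # number of bins

# Equal-width binning: intervals of equal length over the value range.
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal-depth (equal-frequency) binning: intervals holding ~equal counts.
depth_edges = np.quantile(values, np.linspace(0, 1, k + 1))

# Bin index in 0..k-1 for each value (inner edges only are passed).
width_bins = np.digitize(values, width_edges[1:-1])
depth_bins = np.digitize(values, depth_edges[1:-1])
```

Equal-width bins can end up nearly empty on skewed data, which is exactly the support problem noted above; equal-depth bins trade that for unequal interval lengths.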

Approach by Srikant & Agrawal
- Preprocess the data: discretize each attribute using equi-depth partitioning. Use the partial completeness measure to determine the number of partitions, and merge adjacent intervals as long as their support is less than max-support.
- Apply existing association rule mining algorithms.
- Determine the interesting rules in the output.

Discretization loses information; the partial completeness measure quantifies how much is lost.
- C: the frequent itemsets obtained by considering all ranges of attribute values.
- P: the frequent itemsets obtained by considering all ranges over the partitions.
- P is K-complete w.r.t. C if P ⊆ C and, for every X ∈ C, there exists a generalization X' ∈ P such that:
  1. X' is a generalization of X and support(X') ≤ K × support(X), where K ≥ 1;
  2. for every Y ⊆ X there exists Y' ⊆ X' such that support(Y') ≤ K × support(Y).
- Given K (the partial completeness level), one can determine the required number of intervals (N).

Interestingness Measure
- Given an itemset Z = {z1, z2, ..., zk} and its generalization Z' = {z1', z2', ..., zk'}:
  - P(Z): the support of Z.
  - E_Z'(Z): the expected support of Z based on Z',
    E_Z'(Z) = P(z1)/P(z1') × P(z2)/P(z2') × ... × P(zk)/P(zk') × P(Z').
  - Z is R-interesting w.r.t. Z' if P(Z) ≥ R × E_Z'(Z).
- Example rules at different levels of specificity:
  - Refund = No, (Income = $51,250) → Cheat = No
  - Refund = No, (51K ≤ Income < 52K) → Cheat = No
  - Refund = No, (50K ≤ Income < 60K) → Cheat = No

- For a rule S: X → Y and its generalization S': X' → Y':
  - P(Y|X): the confidence of X → Y; P(Y'|X'): the confidence of X' → Y'.
  - E_S'(Y|X): the expected confidence of S based on S',
    E_S'(Y|X) = P(y1)/P(y1') × ... × P(yk)/P(yk') × P(Y'|X').
  - Rule S is R-interesting w.r.t. its ancestor rule S' if support P(S) ≥ R × E_S'(S), or confidence P(Y|X) ≥ R × E_S'(Y|X).
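A minimal sketch of the R-interestingness test on itemset support, assuming the item and generalization supports are already available. The function names and all numbers below are made up for illustration.

```python
def expected_support(p_items, p_items_gen, p_gen):
    """E_Z'(Z) = prod_i P(z_i)/P(z'_i) * P(Z')."""
    e = p_gen
    for p, p_prime in zip(p_items, p_items_gen):
        e *= p / p_prime
    return e

def is_r_interesting(p_z, p_items, p_items_gen, p_gen, R):
    """Report Z only if its support beats R times its expected support."""
    return p_z >= R * expected_support(p_items, p_items_gen, p_gen)

# Made-up numbers: expected support is 0.25 * (0.3/0.6) * (0.2/0.5) = 0.05,
# so with R = 1.1 an observed support of 0.06 counts as interesting.
print(is_r_interesting(p_z=0.06, p_items=[0.3, 0.2], p_items_gen=[0.6, 0.5],
                       p_gen=0.25, R=1.1))  # True
```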

Statistics-based Methods
- Example: Browser = Mozilla ∧ Buy = Yes → Age: μ = 23. The rule consequent is a continuous variable, characterized by its statistics: mean, median, standard deviation, etc.
- Approach:
  - Withhold the target variable from the rest of the data.
  - Apply existing frequent itemset generation on the remaining data.
  - For each frequent itemset, compute the descriptive statistics of the corresponding target variable; the frequent itemset becomes a rule by introducing the target variable as the rule consequent.
  - Apply a statistical test to determine the interestingness of the rule.

- How to determine whether an association rule is interesting? Compare the statistics for the segment of the population covered by the rule against the segment not covered by it:
  A → B: μ (covered) versus A' → B: μ' (not covered)
- Statistical hypothesis testing:
  - Null hypothesis H0: μ' = μ + Δ
  - Alternative hypothesis H1: μ' > μ + Δ
  - Z = (μ' − μ − Δ) / sqrt(s1²/n1 + s2²/n2) has zero mean and variance 1 under the null hypothesis.
- Example: r: Browser = Mozilla ∧ Buy = Yes → Age: μ = 23. The rule is interesting if the difference between μ and μ' is greater than 5 years (i.e., Δ = 5).
  - For r: n1 = 50, s1 = 3.5.
  - For r' (the complement): n2 = 250, s2 = 6.5, μ' = 30.
  - Z = (30 − 23 − 5) / sqrt(3.5²/50 + 6.5²/250) = 3.11.
  - For a 1-sided test at the 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64. Since 3.11 > 1.64, r is an interesting rule.
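The worked example can be checked with a few lines of Python; the figures are taken directly from the slide.

```python
from math import sqrt

# Slide values: covered segment (r) and its complement (r').
mu, mu_prime, delta = 23.0, 30.0, 5.0
n1, s1 = 50, 3.5      # covered by r
n2, s2 = 250, 6.5     # complement of r

Z = (mu_prime - mu - delta) / sqrt(s1**2 / n1 + s2**2 / n2)
print(round(Z, 2))    # 3.11 > 1.64, so r is interesting at the 95% level
```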

Min-Apriori (Han et al.)
- Example document-term matrix:

  TID  W1  W2  W3  W4  W5
  D1    2   2   0   0   1
  D2    0   0   1   2   2
  D3    2   3   0   0   0
  D4    0   0   1   0   1
  D5    1   1   1   0   2

  W1 and W2 tend to appear together in the same documents.
- The data contains only continuous attributes of the same "type", e.g., the frequency of words in a document.
- Potential solution: convert into a 0/1 matrix and apply existing algorithms, but this loses the word frequency information. Discretization does not apply either, since users want associations among words, not among ranges of word counts.
- How to determine the support of a word? If we simply sum up its frequency, the support count can be greater than the total number of documents. Instead, normalize the word vectors, e.g., using the L1 norm, so that each individual word has support equal to 1.0:

  TID   W1    W2    W3    W4    W5
  D1   0.40  0.33  0.00  0.00  0.17
  D2   0.00  0.00  0.33  1.00  0.33
  D3   0.40  0.50  0.00  0.00  0.00
  D4   0.00  0.00  0.33  0.00  0.17
  D5   0.20  0.17  0.33  0.00  0.33

Min-Apriori: New Definition of Support
- sup(C) = Σ_{i ∈ T} min_{j ∈ C} D(i, j): for each document, take the smallest normalized frequency among the words in C, then sum over all documents.
- Example (using the normalized matrix above): sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17.

Anti-monotone Property of Support
- sup(W1) = 0.40 + 0 + 0.40 + 0 + 0.20 = 1.0
- sup(W1, W2) = 0.33 + 0 + 0.40 + 0 + 0.17 = 0.9
- sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17
Support still decreases monotonically as the itemset grows, so Apriori-style pruning remains valid.
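A sketch of the Min-Apriori support computation, using the document-term matrix from the example above; the function and variable names are ours.

```python
# Raw word counts per document, column-wise (W1..W5 over D1..D5).
raw = {
    "W1": [2, 0, 2, 0, 1],
    "W2": [2, 0, 3, 0, 1],
    "W3": [0, 1, 0, 1, 1],
    "W4": [0, 2, 0, 0, 0],
    "W5": [1, 2, 0, 1, 2],
}
n_docs = 5

# L1-normalize each word vector so that every single word has support 1.0.
norm = {w: [v / sum(col) for v in col] for w, col in raw.items()}

def support(itemset):
    """sup(C) = sum over documents i of min over words j in C of D(i, j)."""
    return sum(min(norm[w][i] for w in itemset) for i in range(n_docs))

print(round(support({"W1"}), 2))               # 1.0
print(round(support({"W1", "W2"}), 2))         # 0.9
print(round(support({"W1", "W2", "W3"}), 2))   # 0.17
```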

Multi-level Association Rules
[Figure: two concept hierarchies. Food branches into Bread (Wheat, White) and Milk (Skim, 2%, with brands Foremost and Kemps); Electronics branches into Computers (Desktop, Laptop, Accessory such as Printer and Scanner) and Home (DVD, TV).]

- Why should we incorporate a concept hierarchy?
  - Rules at lower levels may not have enough support to appear in any frequent itemsets.
  - Rules at lower levels of the hierarchy are overly specific: e.g., skim milk → white bread, 2% milk → wheat bread, skim milk → wheat bread, etc., are all indicative of an association between milk and bread.
- How do support and confidence vary as we traverse the concept hierarchy?
  - If X is the parent item of both X1 and X2, then σ(X) ≤ σ(X1) + σ(X2).
  - If σ(X1 ∪ Y1) ≥ minsup, and X is the parent of X1 and Y the parent of Y1, then σ(X ∪ Y1) ≥ minsup, σ(X1 ∪ Y) ≥ minsup, and σ(X ∪ Y) ≥ minsup.
  - If conf(X1 → Y1) ≥ minconf, then conf(X1 → Y) ≥ minconf.

- Approach 1: extend the current association rule formulation by augmenting each transaction with its higher-level items (a sketch follows below).
  Original transaction: {skim milk, wheat bread}. Augmented transaction: {skim milk, wheat bread, milk, bread, food}.
  Issues:
  - Items at higher levels have much higher support counts; if the support threshold is low, there will be too many frequent patterns involving items from the higher levels.
  - Increased dimensionality of the data.
- Approach 2: generate frequent patterns at the highest level first, then generate frequent patterns at the next highest level, and so on.
  Issues:
  - I/O requirements increase dramatically because more passes over the data are needed.
  - May miss some potentially interesting cross-level association patterns.
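Approach 1 above can be sketched directly: walk each item's ancestor chain and add every ancestor to the transaction. The child-to-parent map below mirrors part of the hierarchy figure and is otherwise invented.

```python
# Hypothetical child -> parent map for part of the concept hierarchy.
parent = {
    "skim milk": "milk", "2% milk": "milk", "milk": "food",
    "wheat bread": "bread", "white bread": "bread", "bread": "food",
}

def augment(transaction):
    """Add every ancestor of every item to the transaction."""
    items = set(transaction)
    frontier = list(transaction)
    while frontier:                      # walk up the hierarchy
        item = frontier.pop()
        if item in parent and parent[item] not in items:
            items.add(parent[item])
            frontier.append(parent[item])
    return items

print(augment({"skim milk", "wheat bread"}))
# {'skim milk', 'wheat bread', 'milk', 'bread', 'food'}
```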

Sequence Data
Example sequence database (Object, Timestamp, Events):

  Object  Timestamp  Events
  A       10         2, 3, 5
  A       20         6, 1
  A       23         1
  B       11         4, 5, 6
  B       17         2
  B       21         7, 8, 1, 2
  B       28         1, 6
  C       14         1, 8, 7

Examples of Sequence Data

  Sequence database | Sequence                                  | Element (transaction)                                            | Event (item)
  Customer          | Purchase history of a given customer      | A set of items bought by a customer at time t                    | Books, dairy products, CDs, etc.
  Web data          | Browsing activity of a particular visitor | A collection of files viewed by a visitor after one mouse click  | Home page, index page, contact info, etc.
  Event data        | History of events generated by a sensor   | Events triggered by a sensor at time t                           | Types of alarms generated by sensors
  Genome sequences  | DNA sequence of a particular species      | An element of the DNA sequence                                   | Bases A, T, G, C

[Figure: a sequence drawn as a series of elements (transactions), e.g. <{E1, E2} {E1, E3} {E2} {E3, E4} {E2}>, each element holding one or more events (items).]

Formal Definition of a Sequence
- A sequence is an ordered list of elements (transactions): s = <e1 e2 e3 ...>.
- Each element contains a collection of events (items): ei = {i1, i2, ..., ik}.
- Each element is attributed to a specific time or location.
- The length of a sequence, |s|, is the number of elements in the sequence.
- A k-sequence is a sequence that contains k events (items).

Examples of Sequences
- A Web sequence of page views.
- The sequence of initiating events causing the nuclear accident at 3-mile Island (http:/stellar- ).
- A sequence of books checked out at a library.

Formal Definition of a Subsequence
- A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> (m ≥ n) if there exist integers i1 < i2 < ... < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., an ⊆ b_in.
- The support of a subsequence w is the fraction of data sequences that contain w.
- A sequential pattern is a frequent subsequence, i.e., a subsequence whose support is ≥ minsup.

Sequential Pattern Mining: Definition
- Given: a database of sequences and a user-specified minimum support threshold, minsup.
- Task: find all subsequences with support ≥ minsup.

Sequential Pattern Mining: Challenge
- How many k-subsequences can be extracted from a given n-sequence? Each k-subsequence corresponds to a choice of k of the n events, e.g., for n = 9 and k = 4 (Y marks a chosen event):
  Y _ _ Y Y _ _ _ Y
  Answer: C(n, k) = C(9, 4) = 126.

Sequential Pattern Mining: Example
With minsup = 50% over the sequence database:

  Object  Timestamp  Events
  A       1          1, 2, 4
  A       2          2, 3
  A       3          5
  B       1          1, 2
  B       2          2, 3, 4
  C       1          1, 2
  C       2          2, 3, 4
  C       3          2, 4, 5
  D       1          2
  D       2          3, 4
  D       3          4, 5
  E       1          1, 3
  E       2          2, 4, 5

Examples of frequent subsequences: <{1,2}> (s = 60%), <{2,3}> (s = 60%), <{2,4}> (s = 80%), <{3} {5}> (s = 80%), <{1} {2}> (s = 80%), <{2} {2}> (s = 60%), <{1} {2,3}> (s = 60%), <{2} {2,3}> (s = 60%), <{1,2} {2,3}> (s = 60%).
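The containment and support definitions above can be sketched as follows; checked against the example database, the support of <{1} {2}> indeed comes out at 80%. Function and variable names are ours.

```python
def contains(data_seq, sub_seq):
    """True if sub_seq = <a1..an> is contained in data_seq = <b1..bm>."""
    i = 0
    for element in data_seq:
        if i < len(sub_seq) and sub_seq[i] <= element:  # a_i subset of b_j
            i += 1
    return i == len(sub_seq)

database = [
    [{1, 2, 4}, {2, 3}, {5}],          # A
    [{1, 2}, {2, 3, 4}],               # B
    [{1, 2}, {2, 3, 4}, {2, 4, 5}],    # C
    [{2}, {3, 4}, {4, 5}],             # D
    [{1, 3}, {2, 4, 5}],               # E
]

w = [{1}, {2}]
support = sum(contains(s, w) for s in database) / len(database)
print(support)  # 0.8
```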

Extracting Sequential Patterns
- Given n events i1, i2, i3, ..., in:
  - Candidate 1-subsequences: <{i1}>, <{i2}>, ..., <{in}>.
  - Candidate 2-subsequences: single-element sets <{i1, i2}>, <{i1, i3}>, ..., plus two-element sequences <{i1} {i1}>, <{i1} {i2}>, ..., <{in} {in}>.
  - Candidate 3-subsequences: <{i1, i2, i3}>, ..., <{i1, i2} {i1}>, ..., <{i1} {i1} {i1}>, and so on; the candidate space grows very quickly.

Generalized Sequential Pattern (GSP)
- Step 1: make the first pass over the sequence database D to yield all the 1-element frequent sequences.
- Step 2: repeat until no new frequent sequences are found:
  - Candidate generation: merge pairs of frequent subsequences found in the (k−1)th pass to generate candidate sequences that contain k items.
  - Candidate pruning: prune candidate k-sequences that contain infrequent (k−1)-subsequences.
  - Support counting: make a new pass over the sequence database D to find the support of these candidate sequences.
  - Candidate elimination: eliminate candidate k-sequences whose actual support is less than minsup.
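A minimal sketch of the GSP level-wise loop just described. Sequences are tuples of frozensets; candidate_gen is assumed to implement the merge rule detailed on the next slides (and also to handle pruning), and contains() is the containment test from the earlier sketch.

```python
def gsp(database, minsup, candidate_gen, contains):
    def sup(seq):
        return sum(contains(s, seq) for s in database) / len(database)

    # Step 1: one pass over the database for the frequent 1-sequences.
    events = {e for data_seq in database for elem in data_seq for e in elem}
    level = [s for s in ((frozenset([e]),) for e in events) if sup(s) >= minsup]
    frequent = list(level)

    # Step 2: generate candidates, count support, eliminate, repeat.
    while level:
        candidates = candidate_gen(level)                  # merge + prune
        level = [c for c in candidates if sup(c) >= minsup]
        frequent.extend(level)
    return frequent
```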

Candidate Generation
- Base case (k = 2): merging two frequent 1-sequences <{i1}> and <{i2}> produces two candidate 2-sequences: <{i1} {i2}> and <{i1, i2}>.
- General case (k > 2): a frequent (k−1)-sequence w1 is merged with another frequent (k−1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2.
  - The resulting candidate is the sequence w1 extended with the last event of w2: if the last two events in w2 belong to the same element, the last event in w2 becomes part of the last element in w1; otherwise, the last event in w2 becomes a separate element appended to the end of w1. (A sketch of this merge rule, and worked examples, follow below.)
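A sketch of this merge rule for the general case (k > 2), assuming sequences are represented as tuples of frozensets with events within an element ordered numerically; drop_first, drop_last, and merge are names we introduce.

```python
def drop_first(seq):
    """Remove the first event; drop its element if it becomes empty."""
    head = frozenset(sorted(seq[0])[1:])
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Remove the last event; drop its element if it becomes empty."""
    tail = frozenset(sorted(seq[-1])[:-1])
    return seq[:-1] + ((tail,) if tail else ())

def merge(w1, w2):
    """Merge two (k-1)-sequences into a candidate k-sequence, else None."""
    if drop_first(w1) != drop_last(w2):
        return None
    last = sorted(w2[-1])[-1]            # the last event of w2
    if len(w2[-1]) > 1:                  # last two events of w2 share an element
        return w1[:-1] + (w1[-1] | frozenset([last]),)
    return w1 + (frozenset([last]),)     # separate element appended to w1

w1 = (frozenset({1}), frozenset({2, 3}), frozenset({4}))
w2 = (frozenset({2, 3}), frozenset({4, 5}))
print(merge(w1, w2))   # <{1} {2 3} {4 5}>, printed as tuples of frozensets
```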

Candidate Generation Examples
- Merging w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> produces the candidate sequence <{1} {2 3} {4 5}>, because the last two events in w2 (4 and 5) belong to the same element.
- Merging w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> produces the candidate sequence <{1} {2 3} {4} {5}>, because the last two events in w2 (4 and 5) do not belong to the same element.
- Some merges need not be performed at all: if the candidate they would produce is viable, it can already be obtained by merging w1 with a different frequent (k−1)-sequence, so generating it a second time is unnecessary.

GSP Example
[Figure: GSP Example]

Timing Constraints (I)
[Figure: a pattern such as <{A, B} {C} {D, E}> on a timeline, annotated with the gap and span constraints below.]
- xg: max-gap, the maximum allowed time gap between consecutive elements.
- ng: min-gap, the minimum required time gap between consecutive elements.
- ms: maximum span, the maximum allowed time between the first and the last element of the pattern.
