数据挖掘课件 (Data Mining Courseware): chap4-basic-classification.ppt
Format: PPT, 101 pages, 3.81 MB. Uploaded 2022-01-19.

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Introduction to Data Mining, by Tan, Steinbach, Kumar (4/18/2004)

Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Illustrating Classification Task (figure slide)

Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines
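The training/validation procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not from the slides: the toy records, the threshold rule standing in for a real model, and all function names are our own.

```python
# Build a "model" on a training set, then measure its accuracy on a
# held-out test set. Records are (attribute value, class label) pairs;
# the model is a single learned threshold t: predict "Yes" iff value > t.

training_set = [(60, "No"), (70, "No"), (75, "No"),
                (85, "Yes"), (90, "Yes"), (95, "Yes")]
test_set     = [(72, "No"), (88, "Yes")]

def accuracy(t, records):
    """Fraction of records where the rule (value > t -> "Yes") is right."""
    return sum((v > t) == (y == "Yes") for v, y in records) / len(records)

def fit(records):
    """Pick the midpoint threshold with the best training accuracy."""
    values = sorted(v for v, _ in records)
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(midpoints, key=lambda t: accuracy(t, records))

t = fit(training_set)                 # model is built on the training set only
print(t, accuracy(t, test_set))       # validated on the unseen test set
```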

Example of a Decision Tree

Training data (class attribute: Cheat; Refund and Marital Status are categorical, Taxable Income is continuous):

    Tid  Refund  Marital Status  Taxable Income  Cheat
     1   Yes     Single          125K            No
     2   No      Married         100K            No
     3   No      Single           70K            No
     4   Yes     Married         120K            No
     5   No      Divorced         95K            Yes
     6   No      Married          60K            No
     7   Yes     Divorced        220K            No
     8   No      Single           85K            Yes
     9   No      Married          75K            No
    10   No      Single           90K            Yes

Model: a decision tree whose splitting attributes are Refund, then MarSt, then TaxInc:

    Refund?
      Yes -> NO
      No  -> MarSt?
               Married -> NO
               Single, Divorced -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES

Another Example of a Decision Tree (same training data, splitting on MarSt first):

    MarSt?
      Married -> NO
      Single, Divorced -> Refund?
                            Yes -> NO
                            No  -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task (figure slide)
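The first tree above can be written directly as nested conditionals. A sketch: the predict() helper and its argument names are ours, but the split structure (Refund, then MarSt, then TaxInc at the 80K cutpoint) is the one shown on the slide.

```python
def predict(refund, marital_status, taxable_income):
    """Classify one record with the example tree."""
    if refund == "Yes":
        return "No"                       # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                       # Married -> leaf NO
    # Single or Divorced: test Taxable Income at 80K
    return "Yes" if taxable_income > 80 else "No"

training = [  # (Refund, Marital Status, Taxable Income in K, Cheat)
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]
# This tree fits the training data perfectly:
print(all(predict(r, m, t) == c for r, m, t, c in training))   # True
```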

Apply Model to Test Data
- Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
- Start from the root of the tree.
- Refund = No, so follow the "No" branch to the MarSt node.
- Marital Status = Married, so follow the "Married" branch, which is a leaf.
- Assign Cheat = "No".

Decision Tree Induction
- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- General procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  - If Dt is an empty set, then t is a leaf node labeled by the default class yd.
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

Hunt's Algorithm on the training data above
- Start with a single node labeled Don't Cheat (the majority class).
- Split on Refund: Yes -> Don't Cheat; No -> still mixed.
- Refine the Refund = No branch on Marital Status: Married -> Don't Cheat; Single, Divorced -> still mixed.
- Refine the Single, Divorced branch on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records: how to specify the attribute test condition, and how to determine the best split.
  - Determine when to stop splitting.
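Hunt's procedure can be sketched as a short recursive function. This is an illustrative implementation under our own assumptions: the slides leave the attribute test unspecified, so we choose a binary "attribute == value" test that minimizes weighted Gini, and we pre-discretize Taxable Income into an Income>80K flag so every attribute is categorical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def _split(records, attr, val):
    yes = [(r, y) for r, y in records if r[attr] == val]
    no  = [(r, y) for r, y in records if r[attr] != val]
    return yes, no

def hunt(records, attrs, default="No"):
    """records: list of (attribute-dict, label). Returns a nested-dict tree."""
    if not records:                                  # Dt empty -> default class
        return default
    labels = [y for _, y in records]
    if len(set(labels)) == 1:                        # Dt pure -> leaf labeled y_t
        return labels[0]
    if not attrs:                                    # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    # attribute test: the binary split "attr == val" with least weighted Gini
    candidates = [(a, v) for a in attrs for v in sorted({r[a] for r, _ in records})]
    def quality(av):
        return sum(len(p) / len(records) * gini([y for _, y in p])
                   for p in _split(records, *av) if p)
    a, v = min(candidates, key=quality)
    yes, no = _split(records, a, v)
    return {"test": (a, v),
            "yes": hunt(yes, [x for x in attrs if x != a], default),
            "no":  hunt(no, attrs, default)}

def classify(tree, record):
    while isinstance(tree, dict):
        a, v = tree["test"]
        tree = tree["yes"] if record[a] == v else tree["no"]
    return tree

training_records = [
    ({"Refund": "Yes", "MarSt": "Single",   "Income>80K": True},  "No"),
    ({"Refund": "No",  "MarSt": "Married",  "Income>80K": True},  "No"),
    ({"Refund": "No",  "MarSt": "Single",   "Income>80K": False}, "No"),
    ({"Refund": "Yes", "MarSt": "Married",  "Income>80K": True},  "No"),
    ({"Refund": "No",  "MarSt": "Divorced", "Income>80K": True},  "Yes"),
    ({"Refund": "No",  "MarSt": "Married",  "Income>80K": False}, "No"),
    ({"Refund": "Yes", "MarSt": "Divorced", "Income>80K": True},  "No"),
    ({"Refund": "No",  "MarSt": "Single",   "Income>80K": True},  "Yes"),
    ({"Refund": "No",  "MarSt": "Married",  "Income>80K": False}, "No"),
    ({"Refund": "No",  "MarSt": "Single",   "Income>80K": True},  "Yes"),
]
tree = hunt(training_records, ["Refund", "MarSt", "Income>80K"])
print(all(classify(tree, r) == y for r, y in training_records))   # True
```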

How to Specify the Test Condition?
- Depends on the attribute type: nominal, ordinal, continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
- Binary split: divide the values into two subsets, e.g. {Sports, Luxury} vs. {Family} or {Family, Luxury} vs. {Sports}; need to find the optimal partitioning.

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g. Size -> {Small}, {Medium}, {Large}.
- Binary split: {Small, Medium} vs. {Large} or {Medium, Large} vs. {Small}.
- What about the split {Small, Large} vs. {Medium}? (It does not respect the order of the values.)

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing/percentiles, or clustering).
  - Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. Can be more compute intensive.

Splitting Based on Continuous Attributes (figure slide)

How to Determine the Best Split
- Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity: a non-homogeneous distribution means a high degree of impurity; a homogeneous one means a low degree of impurity.

Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error

How to Find the Best Split
- Before splitting, compute the impurity M0 of the node.
- For each candidate test, compute the impurity of its children and their weighted combination: test A? yields nodes N1, N2 with impurities M1, M2 combined into M12; test B? yields N3, N4 with M3, M4 combined into M34.
- Gain = M0 - M12 vs. M0 - M34: choose the test with the larger gain.

Measure of Impurity: GINI
- Gini index for a given node t, where p(j|t) is the relative frequency of class j at node t:

    GINI(t) = 1 - sum_j [p(j|t)]^2

- Maximum (1 - 1/nc) when records are equally distributed among all nc classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for computing GINI (six records per node):
- C1=0, C2=6: Gini = 1 - 0^2 - 1^2 = 0.000
- C1=1, C2=5: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- C1=2, C2=4: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
- C1=3, C2=3: Gini = 1 - (3/6)^2 - (3/6)^2 = 0.500

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = sum_{i=1..k} (n_i / n) GINI(i)

  where n_i is the number of records at child i and n is the number of records at node p.

Binary Attributes: Computing the GINI Index
- The split creates two partitions; the effect of weighting the partitions is that larger and purer partitions are sought.
- Example: the parent has C1=6, C2=6, Gini = 0.500. Test B? produces N1 (C1=5, C2=2) and N2 (C1=1, C2=4):

    Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
    Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
    Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
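The Gini arithmetic above can be checked in a few lines. gini() takes a list of per-class counts; gini_split() weights each child's Gini by its share of the records, as in the GINI_split formula.

```python
def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; children is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([0, 6]), 3))                   # 0.0
print(round(gini([1, 5]), 3))                   # 0.278
print(round(gini([2, 4]), 3))                   # 0.444
print(round(gini([3, 3]), 3))                   # 0.5
# Binary split of the 6/6 parent into N1=(5,2) and N2=(1,4):
print(round(gini_split([[5, 2], [1, 4]]), 3))   # 0.371
```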

Categorical Attributes: Computing the Gini Index
- For each distinct value, gather the counts for each class in the dataset, and use the count matrix to make decisions.
- Multi-way split (CarType): Family (C1=1, C2=4), Sports (C1=2, C2=1), Luxury (C1=1, C2=1); Gini = 0.393.
- Two-way splits (find the best partition of values):
  - {Sports, Luxury} (C1=3, C2=2) vs. {Family} (C1=1, C2=4): Gini = 0.400
  - {Sports} (C1=2, C2=1) vs. {Family, Luxury} (C1=2, C2=5): Gini = 0.419

Continuous Attributes: Computing the Gini Index
- Use binary decisions based on one value v; the number of possible splitting values equals the number of distinct values.
- Each splitting value v has a count matrix associated with it: the class counts in each of the partitions, A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! (Repetition of work.)

Continuous Attributes: Computing the Gini Index (efficient)
- For efficient computation, for each attribute:
  - Sort the attribute on its values.
  - Linearly scan these values, each time updating the count matrix and computing the Gini index.
  - Choose the split position that has the least Gini index.
- Example (Taxable Income sorted, with Cheat labels No No No Yes Yes Yes No No No No):

    Sorted values:       60    70    75    85    90    95   100   120   125   220
    Split positions:  55    65    72    80    87    92    97   110   122   172   230
    Gini:            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

- The best split is at position 97, with Gini = 0.300.
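The sorted linear scan can be sketched as a single pass that moves one record at a time from the right partition to the left, updating class counts incrementally instead of rescanning the data for every candidate. One assumption of ours: candidate cut points are true midpoints between adjacent distinct values, so the best cut comes out as 97.5 where the slide's table rounds it down to 97.

```python
from collections import Counter

def best_split(values, labels):
    """Return (cut, weighted Gini) minimizing Gini over A <= cut vs. A > cut."""
    def gini(counter, n):
        return 1 - sum((c / n) ** 2 for c in counter.values()) if n else 0.0
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left, right = Counter(), Counter(lab for _, lab in pairs)
    best_cut, best_g = None, float("inf")
    for i in range(n - 1):
        v, lab = pairs[i]
        left[lab] += 1                     # move one record to the left side
        right[lab] -= 1
        if v == pairs[i + 1][0]:
            continue                       # cannot cut between equal values
        cut = (v + pairs[i + 1][0]) / 2
        nl = i + 1
        g = nl / n * gini(left, nl) + (n - nl) / n * gini(right, n - nl)
        if g < best_g:
            best_cut, best_g = cut, g
    return best_cut, best_g

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
cut, g = best_split(incomes, cheat)
print(cut, round(g, 3))   # 97.5 0.3
```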

Alternative Splitting Criteria Based on INFO
- Entropy at a given node t, where p(j|t) is the relative frequency of class j at node t:

    Entropy(t) = - sum_j p(j|t) log2 p(j|t)

- Measures the homogeneity of a node.
  - Maximum (log2 nc) when records are equally distributed among all classes, implying the least information.
  - Minimum (0.0) when all records belong to one class, implying the most information.
- Entropy based computations are similar to the GINI index computations.

Examples for computing Entropy (six records per node):
- C1=0, C2=6: Entropy = -0 log2 0 - 1 log2 1 = 0
- C1=1, C2=5: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
- C1=2, C2=4: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
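The entropy examples above, checked in code (empty classes contribute nothing, matching the convention 0 log2 0 = 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class record counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

print(round(entropy([0, 6]), 2))   # 0.0
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
print(round(entropy([3, 3]), 2))   # 1.0, the log2(nc) maximum for 2 classes
```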

Splitting Based on INFO
- Information Gain: a parent node p is split into k partitions, and n_i is the number of records in partition i:

    GAIN_split = Entropy(p) - sum_{i=1..k} (n_i / n) Entropy(i)

- Measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

- Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = - sum_{i=1..k} (n_i / n) log2(n_i / n)

- Adjusts Information Gain by the entropy of the partitioning (SplitINFO): a higher-entropy partitioning (a large number of small partitions) is penalized.
- Used in C4.5; designed to overcome the disadvantage of Information Gain.
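Information gain and gain ratio as defined above, on a made-up example of our own: both candidate splits below separate an 8-record, 2-class parent perfectly (gain 1.0), but the 4-way split pays a larger SplitINFO, so its gain ratio is halved.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain(parent, children):
    """GAIN_split = Entropy(p) - weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def split_info(children):
    """Entropy of the partitioning itself (how fragmented the split is)."""
    n = sum(sum(c) for c in children)
    return -sum(sum(c) / n * log2(sum(c) / n) for c in children)

def gain_ratio(parent, children):
    return gain(parent, children) / split_info(children)

parent   = [4, 4]                               # 4 of C1, 4 of C2
two_way  = [[4, 0], [0, 4]]                     # one pure partition per class
four_way = [[2, 0], [2, 0], [0, 2], [0, 2]]     # pure, but fragmented
print(gain(parent, two_way), gain(parent, four_way))   # 1.0 1.0
print(gain_ratio(parent, two_way))                     # 1.0
print(gain_ratio(parent, four_way))                    # 0.5
```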

Splitting Criteria Based on Classification Error
- Classification error at a node t:

    Error(t) = 1 - max_i P(i|t)

- Measures the misclassification error made by a node.
  - Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information.
  - Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for Computing Error (six records per node):
- C1=0, C2=6: Error = 1 - max(0, 1) = 1 - 1 = 0
- C1=1, C2=5: Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
- C1=2, C2=4: Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Comparison Among Splitting Criteria
- For a 2-class problem, all three measures are maximal at p = 0.5 and zero at p = 0 and p = 1 (figure slide).

Misclassification Error vs. Gini
- Example: the parent has C1=7, C2=3, Gini = 0.42. Test A? produces N1 (C1=3, C2=0) and N2 (C1=4, C2=3):

    Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
    Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
    Gini(children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

- Gini improves! (The misclassification error, by contrast, stays at 0.3 before and after this split.)
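The error examples and the error-vs-Gini comparison above can be verified in code. The point of the comparison: for the A? split, Gini improves (0.42 to about 0.34) while the weighted misclassification error does not move at all.

```python
def error(counts):
    """Misclassification error of a node from its per-class counts."""
    n = sum(counts)
    return 1 - max(counts) / n

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted(measure, children):
    """Weighted impurity of a split under the given measure."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

print(round(error([1, 5]), 3))   # 0.167, i.e. 1/6
print(round(error([2, 4]), 3))   # 0.333, i.e. 1/3

parent, children = [7, 3], [[3, 0], [4, 3]]
print(round(gini(parent), 2), round(weighted(gini, children), 2))    # 0.42 0.34
print(round(error(parent), 2), round(weighted(error, children), 2))  # 0.3 0.3
```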

Stopping Criteria for Tree Induction
- Stop expanding a node when all its records belong to the same class.
- Stop expanding a node when all its records have similar attribute values.
- Early termination (to be discussed later).

Decision Tree Based Classification
- Advantages: inexpensive to construct, ...

Source: https://www.163wenku.com/p-2040892.html