Data Mining Courseware: chap1-intro.ppt
Data Mining: Introduction
Lecture Notes for Chapter 1
Introduction to Data Mining
by Tan, Steinbach, Kumar
(Slide footer: Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004)

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - purchases at department/grocery stores
  - Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive Pressure is Strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in Hypothesis Formation

Mining Large Data Sets - Motivation
- There is often information "hidden" in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 and number of analysts, 1995 to 1999.]
From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"

What is Data Mining?
- Many Definitions
  - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  - Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is (not) Data Mining?
- What is Data Mining?
  - Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly in Boston area)
  - Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, A,)
- What is not Data Mining?
  - Look up phone number in phone directory
  - Query a Web search engine for information about "Amazon"

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional Techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Diagram labels: Machine Learning/Pattern Recognition, Statistics/AI, Data Mining, Database systems]

Data Mining Tasks
- Prediction Methods
  - Use some variables to predict unknown or future values of other variables.
- Description Methods
  - Find human-interpretable patterns that describe the data.
From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]

Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification Example
Training Set (Refund: categorical; Marital Status: categorical; Taxable Income: continuous; Cheat: class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram labels: Training Set, Learn Classifier, Model, Test Set]

Classification: Application 1
- Direct Marketing
  - Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This buy / don't buy decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      - Type of business, where they stay, how much they earn, etc.
    - Use this information as input attributes to learn a classifier model.
From Berry & Linoff, Data Mining Techniques, 1997

Classification: Application 2
- Fraud Detection
  - Goal: Predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on its account-holder as attributes.
      - When does a customer buy, what does he buy, how often he pays on time, etc.
    - Label past transactions as fraud or fair transactions. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3
- Customer Attrition/Churn
  - Goal: To predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed record of transactions with each of the past and present customers, to find attributes.
      - How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.
From Berry & Linoff, Data Mining Techniques, 1997

Classification: Application 4
- Sky Survey Cataloging
  - Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    - 3000 images with 23,040 x 23,040 pixels per image.
  - Approach:
    - Segment the image.
    - Measure image attributes (features) - 40 of them per object.
    - Model the class based on these features.
    - Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
[Figure labels: Early, Intermediate, Late]
- Data Size: 72 million stars, 20 million galaxies
- Object Catalog: 9 GB
- Image D
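
The workflow on the "Classification: Definition" and "Classification Example" slides (learn a model from labeled records, then assign a class to unseen records) can be sketched in a few lines of Python. This is only an illustrative sketch: the slides do not say which learning algorithm produces the model, so the small greedy decision tree with Gini impurity used here is an assumption, as are all function names and the choice to hold out the last three training records for validation. The records themselves are transcribed from the example slide, with incomes given in thousands.

```python
from collections import Counter

# Training set from the "Classification Example" slide:
# (refund, marital_status, taxable_income_in_K, cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unlabeled records from the same slide (class shown as "?").
UNSEEN = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def candidate_tests(rows):
    """Yield (attribute_index, value) tests for every non-class attribute."""
    for i in range(len(rows[0]) - 1):
        for v in sorted({r[i] for r in rows}):
            yield i, v


def passes(record, test):
    """Numeric attributes use '<= value'; categorical ones use '== value'."""
    i, v = test
    return record[i] <= v if isinstance(v, (int, float)) else record[i] == v


def build_tree(rows):
    """Greedily grow a tree until every leaf holds a single class."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1:
        return labels.most_common(1)[0][0]
    best = None
    for test in candidate_tests(rows):
        left = [r for r in rows if passes(r, test)]
        right = [r for r in rows if not passes(r, test)]
        if not left or not right:
            continue  # a test that separates nothing is useless
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, test, left, right)
    if best is None:  # no usable split: fall back to the majority class
        return labels.most_common(1)[0][0]
    _, test, left, right = best
    return (test, build_tree(left), build_tree(right))


def predict(tree, record):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, tuple):
        test, left, right = tree
        tree = left if passes(record, test) else right
    return tree


def accuracy(tree, rows):
    """Fraction of labeled rows whose predicted class matches the label."""
    return sum(predict(tree, r) == r[-1] for r in rows) / len(rows)


# Hold-out validation: build on the first 7 records, validate on the last 3.
holdout_tree = build_tree(TRAIN[:7])
holdout_accuracy = accuracy(holdout_tree, TRAIN[7:])

# Final model learned from all labeled records, applied to the unseen ones.
tree = build_tree(TRAIN)
predictions = [predict(tree, r) for r in UNSEEN]

print("hold-out accuracy:", holdout_accuracy)
for record, label in zip(UNSEEN, predictions):
    print(record, "->", label)
```

With only ten labeled records, the hold-out accuracy is a very rough estimate; the point of the sketch is the separation between the data used to build the model and the data used to validate it.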