第十九章-聚类分析-Chapter19-Clustering-Analysis-PPT课件.ppt

Uploader (seller): 三亚风情

Document No.: 2772457

Upload date: 2022-05-25

Format: PPT

Pages: 37

Size: 274.50 KB

Chapter 19 Clustering Analysis

Content: similarity coefficient; hierarchical clustering analysis; dynamic clustering analysis; ordered sample clustering analysis.

Discriminant analysis: given samples known with certainty to come from two or more populations, it is a method for deriving a discriminant model that will allocate further individuals to the correct population.

Clustering analysis: a statistical method for grouping objects of a random nature into categories. It is used when there are no a priori hypotheses, and the most appropriate grouping is sought by means of mathematical statistics and the information collected. It has become a method of first choice for exploring large volumes of genetic information.

Both are multivariate statistical methods for studying classification.

Clustering analysis is an exploratory statistical method. According to its aim it falls into two major types. Let m be the number of variables (i.e. indexes) and n the number of cases (i.e. samples):
(1) R-type clustering, also called index clustering: groups the m indexes, with the aim of reducing the dimension of the indexes and choosing typical ones.
(2) Q-type clustering, also called sample clustering: groups the n samples to find what they have in common.

The most important issue for both R-type and Q-type clustering is the definition of similarity, that is, how to quantify it. The first step of clustering is to define a measure of similarity between two indexes or two samples: the similarity coefficient.

1 Similarity coefficient

1. Similarity coefficient of R-type clustering

Suppose there are m variables $X_1, X_2, \dots, X_m$. R-type clustering usually uses the absolute value of the simple correlation coefficient as the similarity coefficient between variables:

$$ r_{ij} = \frac{\left| \sum (X_i - \bar{X}_i)(X_j - \bar{X}_j) \right|}{\sqrt{\sum (X_i - \bar{X}_i)^2 \sum (X_j - \bar{X}_j)^2}} \tag{19-1} $$

The larger this absolute value, the more similar the two variables. Similarly, the Spearman rank correlation coefficient can be used as the similarity coefficient of non-normal variables. When all the variables are qualitative, it is best to use the contingency coefficient.

2. Similarity coefficients commonly used in Q-type clustering

Regard the n cases as n points in an m-dimensional space. The distance between two points can be used as the similarity coefficient: the smaller the distance, the more similar the two samples.

(1) Euclidean distance

$$ d_{ij} = \sqrt{\sum_{k=1}^{m} (X_{ik} - X_{jk})^2} \tag{19-3} $$

(2) Manhattan distance

$$ d_{ij} = \sum_{k=1}^{m} |X_{ik} - X_{jk}| \tag{19-4} $$

(3) Minkowski distance

$$ d_{ij} = \left( \sum_{k=1}^{m} |X_{ik} - X_{jk}|^q \right)^{1/q} \tag{19-5} $$

The absolute (Manhattan) distance is the Minkowski distance with q = 1, and the Euclidean distance is the Minkowski distance with q = 2. The Euclidean distance is intuitive and simple to compute, but it does not take the correlations among variables into account. That is why the Mahalanobis distance was introduced.

(4) Mahalanobis distance: let $S$ denote the sample covariance matrix of the m variables. The distance is computed as

$$ d_{ij} = \mathbf{X}' S^{-1} \mathbf{X} \tag{19-6} $$

where $\mathbf{X} = (X_{i1} - X_{j1},\ X_{i2} - X_{j2},\ \dots,\ X_{im} - X_{jm})'$. When $S$ is the identity matrix, the Mahalanobis distance equals the square of the Euclidean distance.

All four distances apply to quantitative variables. Qualitative and ordinal variables must be quantified before these distances are used.

2 Hierarchical clustering analysis

Hierarchical clustering analysis is one of the most commonly used methods for grouping similar samples or variables. The process is as follows:
1) At the beginning, each sample (or variable) is regarded as a cluster of its own, so each cluster contains only one sample (or variable). Then compute the similarity coefficient matrix among clusters. Its entries are the similarity coefficients between samples (or variables), and it is a symmetric matrix.
2) Merge the two clusters with the maximum similarity coefficient (minimum distance or maximum correlation coefficient) into a new cluster.
Compute the similarity coefficients between the new cluster and the other clusters. Repeat step 2) until all the samples (or variables) are merged into one cluster.

Calculation of the similarity coefficient between clusters

Every step of hierarchical clustering requires the similarity coefficients among clusters. When each of the two clusters contains only one sample or variable, the similarity coefficient between them is simply that of the two samples or variables, computed as in Section 1. When a cluster contains more than one sample or variable, several methods can be used; five are listed below. $G_p$ and $G_q$ denote the two clusters, which contain $n_p$ and $n_q$ samples or variables respectively.

1. Maximum similarity coefficient method

If clusters $G_p$ and $G_q$ contain $n_p$ and $n_q$ samples (or variables), there are altogether $n_p n_q$ similarity coefficients between the two clusters, but only the largest is taken as the similarity coefficient of the two clusters. Note that the minimum distance also means the maximum similarity coefficient.

$$ D_{pq} = \min_{i \in G_p,\, j \in G_q} (d_{ij}) \ \text{(sample clustering)}, \qquad r_{pq} = \max_{i \in G_p,\, j \in G_q} (r_{ij}) \ \text{(index clustering)} \tag{19-7} $$

2. Minimum similarity coefficient method

The similarity coefficient between clusters is calculated as follows:

$$ D_{pq} = \max_{i \in G_p,\, j \in G_q} (d_{ij}) \ \text{(sample clustering)}, \qquad r_{pq} = \min_{i \in G_p,\, j \in G_q} (r_{ij}) \ \text{(index clustering)} \tag{19-8} $$

3. Center of gravity method (sample clustering only)

The center of gravity of a cluster is its mean vector, whose components are the within-cluster means of the indexes. With $\bar{X}_p$ and $\bar{X}_q$ denoting the centers of gravity of $G_p$ and $G_q$, the distance between the clusters is

$$ D_{pq} = d_{\bar{X}_p \bar{X}_q} \tag{19-9} $$

4. Cluster equilibration method (sample clustering only)

Compute the average squared distance over all pairs of samples taken one from each cluster:

$$ D_{pq}^2 = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}^2 \tag{19-10} $$

Cluster equilibration is one of the better methods in hierarchical clustering, because it fully reflects the information of the individuals within a cluster.

5. Sum of squares of deviations method (also called the Ward method; sample clustering only)

It follows the basic idea of analysis of variance: a rational classification makes the sum of squares of deviations within clusters small and that among clusters large. Suppose the samples have been classified into g clusters, including $G_p$ and $G_q$. The sum of squares of deviations of cluster $G_k$, which contains $n_k$ samples, is

$$ L_k = \sum_{i=1}^{n_k} \sum_{j=1}^{m} (X_{ij} - \bar{X}_j)^2 $$

where $\bar{X}_j$ is the mean of $X_j$ within the cluster. The pooled sum of squares of deviations of all g clusters is

$$ L^{g} = \sum_{k=1}^{g} L_k $$

If $G_p$ and $G_q$ are merged, there will be g-1 clusters. The increment of the pooled sum of squares of deviations,

$$ D_{pq}^2 = L^{g-1} - L^{g} $$

is defined as the squared distance between the two clusters. Obviously, when each of the n samples forms a cluster of its own, the pooled sum of squares of deviations is 0.

Example 19-1 Four variables were measured on 3454 adult women: height ($X_1$), length of legs ($X_2$), waistline ($X_3$) and chest circumference ($X_4$). The correlation matrix is:

$$ R^{(0)} = \begin{array}{c|ccc} & X_1 & X_2 & X_3 \\ \hline X_2 & 0.852 & & \\ X_3 & 0.099 & 0.055 & \\ X_4 & 0.234 & 0.174 & 0.732 \end{array} $$

Use hierarchical clustering to cluster the 4 indexes.

This is a case of R-type (index) clustering. We take the simple correlation coefficient as the similarity coefficient, and use the maximum similarity coefficient method to calculate the similarity coefficients among clusters.

The clustering procedure is as follows:
(1) Each index is regarded as a single cluster: $G_1 = \{X_1\}$, $G_2 = \{X_2\}$, $G_3 = \{X_3\}$, $G_4 = \{X_4\}$. There are altogether 4 clusters.
(2) Merge the two clusters with the maximum similarity coefficient into a new cluster. In this case we merge $G_1$ and $G_2$ (similarity coefficient 0.852) as $G_5 = \{X_1, X_2\}$, and calculate the similarity coefficients among $G_5$, $G_3$ and $G_4$:

$$ r_{35} = \operatorname{Max}(r_{13}, r_{23}) = \operatorname{Max}(0.099, 0.055) = 0.099 $$
$$ r_{45} = \operatorname{Max}(r_{14}, r_{24}) = \operatorname{Max}(0.234, 0.174) = 0.234 $$

The similarity matrix among $G_3$, $G_4$ and $G_5$ is:

$$ R^{(1)} = \begin{array}{c|cc} & G_3 & G_4 \\ \hline G_4 & 0.732 & \\ G_5 & 0.099 & 0.234 \end{array} $$

(3) Merge $G_3$ and $G_4$ as $G_6 = \{G_3, G_4\}$, because the similarity coefficient between $G_3$ and $G_4$ is now the largest (0.732). Compute the similarity coefficient between $G_6$ and $G_5$:

$$ r_{56} = \operatorname{Max}(r_{35}, r_{45}) = \operatorname{Max}(0.099, 0.234) = 0.234 $$

(4) Lastly, $G_5$ and $G_6$ are merged into one cluster $G_7 = \{G_5, G_6\}$, which includes all the original indexes.

Draw the hierarchical dendrogram (Picture 19-1) according to the clustering process. As the picture indicates, it is better to classify the indexes into two clusters:
$\{X_1, X_2\}$ and $\{X_3, X_4\}$. That is, the length indexes form one cluster and the circumference indexes form the other.

[Picture 19-1: hierarchical dendrogram of the 4 indexes; leaves are height, length of legs, waistline and chest circumference]

Example 19-2 Table 19-1 lists the mean energy expenditure and sugar expenditure of six athletes in four athletic items. In order to provide a corresponding dietary standard that improves performance, cluster the athletic items using hierarchical clustering.

Table 19-1 Measured values of the 4 athletic items

Athletic item              Energy expenditure X1    Sugar expenditure X2    X1'       X2'
                           (joule/(minute·m2))      (%)
Weight-loaded squat  G1    27.892                   61.42                    1.315     0.688
Pull-up              G2    23.475                   56.83                    0.174     0.088
Push-up              G3    18.924                   45.13                   -1.001    -1.441
Sit-up               G4    20.913                   61.25                   -0.488     0.665

We choose the Euclidean distance (the Minkowski distance with q = 2) in this example, and use the minimum similarity coefficient method to calculate the distances among clusters. To reduce the effect of the units of the variables, the variables should be standardized before analysis:

$$ X_i' = \frac{X_i - \bar{X}_i}{S_i} $$

where $\bar{X}_i$ and $S_i$ are the sample mean and standard deviation of $X_i$. The transformed data are listed in Table 19-1.

The clustering process:
(1) Compute the similarity coefficient matrix (i.e. the distance matrix) of the 4 samples. The distance between the weight-loaded squat and the pull-up is worked out with formula (19-3):

$$ d_{12} = \sqrt{(X'_{11} - X'_{21})^2 + (X'_{12} - X'_{22})^2} = \sqrt{(1.315 - 0.174)^2 + (0.688 - 0.088)^2} = 1.289 $$

Likewise, the distance between the weight-loaded squat and the push-up is:

$$ d_{13} = \sqrt{(X'_{11} - X'_{31})^2 + (X'_{12} - X'_{32})^2} = \sqrt{(1.315 + 1.001)^2 + (0.688 + 1.441)^2} = 3.145 $$

Lastly, work out the distance matrix:

$$ D^{(0)} = \begin{array}{c|ccc} & G_1 & G_2 & G_3 \\ \hline G_2 & 1.289 & & \\ G_3 & 3.145 & 1.928 & \\ G_4 & 1.803 & 0.878 & 2.168 \end{array} $$

(2) The distance between $G_2$ and $G_4$ is the minimum (0.878), so $G_2$ and $G_4$ are merged into a new cluster $G_5 = \{G_2, G_4\}$. Compute the distances between $G_5$ and the other clusters with the minimum similarity coefficient method, formula (19-8):

$$ d_{15} = \operatorname{Max}(d_{12}, d_{14}) = \operatorname{Max}(1.289, 1.803) = 1.803 $$
$$ d_{35} = \operatorname{Max}(d_{23}, d_{34}) = \operatorname{Max}(1.928, 2.168) = 2.168 $$

The distance matrix of $G_1$, $G_3$ and $G_5$ is:

$$ D^{(1)} = \begin{array}{c|cc} & G_1 & G_3 \\ \hline G_3 & 3.145 & \\ G_5 & 1.803 & 2.168 \end{array} $$

(3) Merge $G_1$ and $G_5$ into a new cluster $G_6 = \{G_1, G_5\}$. Compute the distance between $G_6$ and $G_3$:

$$ d_{36} = \operatorname{Max}(d_{13}, d_{35}) = \operatorname{Max}(3.145, 2.168) = 3.145 $$

(4) Lastly, merge $G_3$ and $G_6$ into $G_7 = \{G_3, G_6\}$. All the samples have now been merged into one large cluster.

According to the clustering process, draw the hierarchical dendrogram (Chart 19-2). Judging from the dendrogram and from subject-matter knowledge, the items should be sorted into two clusters: $\{G_1, G_2, G_4\}$ and $\{G_3\}$. Physical energy expenditure in weight-loaded squats, pull-ups and sit-ups is much higher, so an improved dietary standard may be required for those items during training.

Analysis of clustering examples

Different definitions of the similarity coefficient, between samples and between clusters, lead to different clustering results. Subject-matter expertise, as well as the clustering method, is important for interpreting a clustering analysis.

Example 19-3 Twenty-seven petroleum pitch workers and pyro-furnacemen were surveyed about their age, length of service and smoking. In addition, serum P21, serum P53, peripheral blood lymphocyte SCE, the number of chromosomal aberrations and the number of cells with chromosomal aberration were measured in these workers (Table 19-3). (P21 multiple = P21 measured value / mean P21 of the control group.) Cluster the
27 workers using a suitable hierarchical clustering method.

Table 19-3 Results of bio-marker detection and clustering analysis of petroleum pitch workers and pyro-furnacemen

No.  Age  Length of  Smoking         Serum  P21       P53   SCE    Number of    Number of cells  Result of
          service    (cigarettes/d)  P21    multiple               chromosome   with chromosome  clustering
                                                                   aberrations  aberration
 1   46   25          5              2138   1.68      0.35   8.11   4            4               1
 2   35   12         20              3510   2.76      1.43   6.84   3            3               1
 3   52   25         20              2784   2.19      0.54   4.11   3            3               1
 4   32    7         20              2451   1.93      0.47  11.45   9            6               1
 5   38   22          0              3247   2.56      0.80  11.68   5            5               1
 6   51   31         30              3710   2.92      0.37  11.60   2            2               1
 7   40    9         10              3194   2.51      0.40  11.40   5            5               1
 8   34   17         20              4658   3.67      0.46  11.35   3            3               1
 9   50   29          0              5019   3.95      0.47  13.45  10            8               1
10   42   20         20              7482   5.89      0.12  13.11   0            0               2
11   57   30         15              3800   2.99      0.19  10.76   2            2               1
12   36   15         20              2478   1.95      0.25  10.00   0            0               1
13   37   12          0              3827   3.01      0.82  10.50   4            4               1
14   52   32          0              2984   2.35      0.16  11.15   3            3               1
15   52   32         10              3749   2.95      0.72  11.45  11           10               1
16   42   27         30              4941   3.89      0.73  13.80   7            6               1
17   44   27         20              3948   3.11      0.33  13.65  16           14               1
18   40   21          5              3360   2.64      0.37  11.40   0            0               1
19   38   21          5              2936   2.31      0.69  11.40   1            1               1
20   44   27         20              6851   5.39      0.99  12.28   7            6               2
21   43   27          0              3926   3.09      0.47  11.95   0            0               1
22   26   10          3              4381   3.45      0.52  11.80   7            5               1
23   37   18         20              7142   5.62      0.85  11.81   5            5               2
24   28    9         20              2612   2.06      0.37  11.65   1            1               1
25   25    9         30              2638   2.08      0.78  12.25   1            1               1
26   34   14         20              4322   3.40      0.41  15.00   5            5               1
27   50   32         20              2862   2.25      0.69   8.80   2            2               1

This example applies three methods to cluster the data: the minimum similarity coefficient method based on Euclidean distance, the cluster equilibration method, and the sum of squares of deviations method. The results are shown in Chart 19-3, Chart 19-4 and Chart 19-5. All the variables were standardized before analysis.

[Chart 19-3: hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnacemen, minimum similarity coefficient method]

[Chart 19-4: hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnacemen, cluster equilibration method]

[Chart 19-5: hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnacemen, sum of squares of deviations method]

The outcomes of the three clustering methods are not
the same, which shows that different methods perform differently. The differences become more marked as the number of variables grows, so it is better to select informative variables before clustering analysis, such as P21 and P53 in this example. More information can be obtained by reading the clustering charts. Judging from subject-matter knowledge, the outcome of the cluster equilibration method is the more reasonable one, and its classification is entered in the last column of the table. Workers 10, 20 and 23 form one class and all the others form another. The researchers found that workers 10, 20 and 23 are at high risk of cancer. In the chart of the sum of squares of deviations method, workers 10, 20, 23, 8, 16 and 26 are clustered together, which suggests that workers 8, 16 and 26 may be at high risk too.

Dynamic clustering

When there are many samples to classify, hierarchical clustering analysis needs a large amount of space to store the similarity coefficient matrix and is quite inefficient. Moreover, a sample cannot be moved to another class once it has been classified. Because of these shortcomings, statisticians proposed dynamic clustering, which overcomes the inefficiency and adjusts the classification as the clustering proceeds.

The principle of dynamic clustering analysis is: first, select several representative samples, called cohesion points, as the core of each class; second, classify the other samples, and adjust the core of each class until the classification is reasonable.

The most common method of dynamic clustering analysis is k-means, which is efficient and simple in principle, and gives results even when the number of samples is large. However, the number of classes must be specified before the analysis. In some circumstances it is known from subject-matter knowledge, but in other cases it is not.

Ordered sample clustering methods
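
Returning to the dynamic clustering procedure described above (choose cohesion points, classify the samples, adjust the cores until nothing changes), a minimal k-means sketch may make the loop concrete. It is an illustration only, not the slides' own implementation: the function name `kmeans`, the parameter `seed_indices` and the choice of seeds are assumptions, and the data are the standardized values from Table 19-1.

```python
import math

# Standardized values (X1', X2') of the four athletic items in Table 19-1.
POINTS = [
    (1.315, 0.688),    # G1
    (0.174, 0.088),    # G2
    (-1.001, -1.441),  # G3
    (-0.488, 0.665),   # G4
]


def kmeans(points, seed_indices, max_iter=100):
    """Return (labels, centers) for a basic k-means run.

    seed_indices picks the initial cohesion points; this choice is an
    assumption of the sketch and can change the final classes.
    """
    centers = [list(points[i]) for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins the class of the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no sample changed class: stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


labels, centers = kmeans(POINTS, seed_indices=(0, 2))
print(labels)  # [0, 0, 1, 0]: G1, G2 and G4 in one class, G3 in the other
```

With G1 and G3 as cohesion points, the two classes agree with the hierarchical result for the same data. Seeding with G1 and G2 instead (`seed_indices=(0, 1)`) leaves G1 alone in its class, which illustrates how strongly the outcome can depend on the starting points.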

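The merging steps of the four-index example (height, length of legs, waistline, chest circumference) can also be traced in a few lines of code. The sketch below assumes the maximum similarity coefficient rule of formula (19-7) for index clustering and uses the correlation values quoted in that example. The names `R0`, `cluster_similarity` and `hierarchical_merge` are hypothetical helpers, not part of the original material.

```python
from itertools import combinations

# Correlation coefficients of the four indexes, as quoted in the example.
R0 = {
    ("X1", "X2"): 0.852,
    ("X1", "X3"): 0.099,
    ("X2", "X3"): 0.055,
    ("X1", "X4"): 0.234,
    ("X2", "X4"): 0.174,
    ("X3", "X4"): 0.732,
}


def pair_similarity(a, b, table):
    """Look up the symmetric similarity coefficient of two indexes."""
    return table[(a, b)] if (a, b) in table else table[(b, a)]


def cluster_similarity(p, q, table):
    """Formula (19-7), index clustering: the largest r_ij over all pairs."""
    return max(pair_similarity(i, j, table) for i in p for j in q)


def hierarchical_merge(items, table):
    """Merge the two most similar clusters until one cluster remains.

    Returns the merge history as (cluster_p, cluster_q, similarity) tuples.
    """
    clusters = [(item,) for item in items]  # each index starts on its own
    history = []
    while len(clusters) > 1:
        p, q = max(
            combinations(clusters, 2),
            key=lambda pq: cluster_similarity(pq[0], pq[1], table),
        )
        r = cluster_similarity(p, q, table)
        clusters = [c for c in clusters if c not in (p, q)] + [p + q]
        history.append((p, q, r))
    return history


history = hierarchical_merge(["X1", "X2", "X3", "X4"], R0)
for p, q, r in history:
    print(p, q, r)
# ('X1',) ('X2',) 0.852
# ('X3',) ('X4',) 0.732
# ('X1', 'X2') ('X3', 'X4') 0.234
```

The three merge levels (0.852, 0.732, 0.234) are the heights at which the dendrogram joins its branches. Replacing `max` with `min` inside `cluster_similarity` would give the minimum similarity coefficient rule of formula (19-8) for index clustering.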
About this document

Title: 第十九章-聚类分析-Chapter19-Clustering-Analysis-PPT课件.ppt
Link: https://www.163wenku.com/p-2772457.html
