Chapter 19 - Clustering Analysis - PPT courseware.ppt
Chapter 19 Clustering Analysis

Contents
Similarity coefficient
Hierarchical clustering analysis
Dynamic clustering analysis
Ordered sample clustering analysis

Discriminant analysis: when the observed individuals are known with certainty to come from two or more populations, it is a method for deriving a discriminant model that will allocate further individuals to the correct population.

Clustering analysis: a statistical method for grouping objects of a random nature into categories. It is used when there is no a priori hypothesis and the aim is to find the most appropriate grouping by means of mathematical statistics and the information collected. It has become the method of first choice for exploring large volumes of genetic information.

Both are multivariate statistical methods for studying classification. Clustering analysis is an exploratory statistical method. It falls into two major types according to its aim. With m denoting the number of variables (i.e. indexes) and n the number of cases (i.e. samples):

(1) R-type clustering, also called index clustering: groups the m indexes, with the aim of reducing the dimension of the indexes and choosing typical ones.
(2) Q-type clustering, also called sample clustering: groups the n samples in order to find what they have in common.

The most important issue for both R-type and Q-type clustering is the definition of similarity, that is, how to quantify similarity. The first step of clustering is to define a measure of similarity between two indexes or two samples: the similarity coefficient.

1 Similarity coefficient

1. Similarity coefficient for R-type clustering

Suppose there are m variables: $X_1, X_2, \ldots, X_m$. R-type clustering usually uses the absolute value of the simple correlation coefficient to define the similarity coefficient between variables:

$r_{ij} = \dfrac{\left| \sum (X_i - \bar{X}_i)(X_j - \bar{X}_j) \right|}{\sqrt{\sum (X_i - \bar{X}_i)^2 \sum (X_j - \bar{X}_j)^2}}$    (19-1)

The larger the absolute value, the more similar the two variables. Similarly, the Spearman rank correlation coefficient can be used to define the similarity coefficient of non-normal variables. When all the variables are qualitative, it is best to use the contingency coefficient.

2. Similarity coefficients commonly used in Q-type clustering

Regard the n cases as n points in an m-dimensional space. The distance between two points can be used to define the similarity coefficient: the smaller the distance, the more similar the two samples.

(1) Euclidean distance

$d_{ij} = \sqrt{\sum_{k=1}^{m} (X_{ik} - X_{jk})^2}$    (19-3)

(2) Manhattan distance

$d_{ij} = \sum_{k=1}^{m} |X_{ik} - X_{jk}|$    (19-4)

(3) Minkowski distance

$d_{ij} = \left( \sum_{k=1}^{m} |X_{ik} - X_{jk}|^q \right)^{1/q}$    (19-5)

The absolute (Manhattan) distance is the Minkowski distance with q = 1, and the Euclidean distance is the case q = 2. Euclidean distance is intuitive and simple to compute, but it does not take the correlations among variables into account. That is why the Mahalanobis distance was introduced.

(4) Mahalanobis distance

Let $S$ denote the sample covariance matrix of the m variables. The distance is computed as

$d_{ij} = X' S^{-1} X$    (19-6)

where $X = (X_{i1} - X_{j1},\ X_{i2} - X_{j2},\ \ldots,\ X_{im} - X_{jm})'$. When $S$ is the unit matrix, the Mahalanobis distance equals the square of the Euclidean distance.

All four distances apply to quantitative variables. Qualitative and ordinal variables must be quantified before these distances are used.

2 Hierarchical Clustering Analysis

Hierarchical clustering analysis is one of the most commonly used methods for grouping similar samples or variables. The process is as follows:

1) At the beginning, each sample (or variable) is regarded as a cluster of its own, so each cluster contains only one sample (or variable). Then compute the similarity coefficient matrix among the clusters. The matrix consists of the similarity coefficients between samples (or variables) and is symmetric.
2) Merge the two clusters with the greatest similarity (minimum distance or maximum correlation coefficient) into a new cluster.
3) Compute the similarity coefficients between the new cluster and the other clusters. Repeat from step 2) until all the samples (or variables) are merged into one cluster.

Calculating the similarity coefficient between clusters

Each step of hierarchical clustering requires the similarity coefficients between clusters. When each of the two clusters contains only one sample or variable, the similarity coefficient between them equals that between the two samples or the two variables, computed as in section 1. When a cluster contains more than one sample or variable, several methods can be used. Five are listed below. $G_p$ and $G_q$ denote the two clusters, which contain $n_p$ and $n_q$ samples or variables respectively.

1. Maximum similarity coefficient method

With $n_p$ and $n_q$ samples (or variables) in clusters $G_p$ and $G_q$, there are altogether $n_p n_q$ similarity coefficients between the two clusters, but only the maximum is taken as the similarity coefficient of the two clusters. Note that the minimum distance corresponds to the maximum similarity coefficient.

$D_{pq} = \min_{i \in G_p,\, j \in G_q} (d_{ij})$ (sample clustering), $r_{pq} = \max_{i \in G_p,\, j \in G_q} (r_{ij})$ (index clustering)    (19-7)

2. Minimum similarity coefficient method

The similarity coefficient between clusters is calculated as follows:

$D_{pq} = \max_{i \in G_p,\, j \in G_q} (d_{ij})$ (sample clustering), $r_{pq} = \min_{i \in G_p,\, j \in G_q} (r_{ij})$ (index clustering)    (19-8)

3. Center of gravity method (sample clustering only)

The center of gravity of a cluster is the vector of the index means within that cluster, written $\bar{X}_p$ and $\bar{X}_q$ for $G_p$ and $G_q$. The distance between clusters is the distance between their centers of gravity:

$D_{pq} = d_{\bar{X}_p \bar{X}_q}$    (19-9)

4. Cluster equilibration method (sample clustering only)

Take the average of the squared distances over all pairs of samples, one from each cluster:

$D_{pq}^2 = \dfrac{1}{n_p n_q} \sum_{i \in G_p,\, j \in G_q} d_{ij}^2$    (19-10)

Cluster equilibration is one of the better methods in hierarchical clustering, because it makes full use of the information on the individuals within each cluster.

5. Sum of squares of deviations method

Also called Ward's method, and used for sample clustering only. It borrows the basic idea of analysis of variance: a reasonable classification makes the sum of squares of deviations within clusters small and that between clusters large. Suppose that the n samples have been classified into g clusters, including $G_p$ and $G_q$. The sum of squares of deviations of the $n_k$ samples in the k-th cluster is

$L_k = \sum_{i=1}^{n_k} \sum_{j=1}^{m} (X_{ij} - \bar{X}_j)^2$

($\bar{X}_j$ is the mean of $X_j$ within that cluster). The pooled sum of squares of deviations over all g clusters is

$L^g = \sum_{k=1}^{g} L_k$

If $G_p$ and $G_q$ are merged, g-1 clusters remain. The increment of the pooled sum of squares of deviations,

$D_{pq}^2 = L^{g-1} - L^g$

is defined as the squared distance between the two clusters. Obviously, when each of the n samples forms a cluster of its own, the pooled sum of squares of deviations is 0.

Example 19-1 Four variables were measured on 3454 adult women: height (X1), leg length (X2), waistline (X3) and chest circumference (X4). The correlation matrix has been worked out as follows:

R(0)      X1      X2      X3
X2     0.852
X3     0.099   0.055
X4     0.234   0.174   0.732

Use hierarchical clustering to cluster the 4 indexes.

This is a case of R-type (index) clustering. We take the simple correlation coefficient as the similarity coefficient, and use the maximum similarity coefficient method to calculate the similarity coefficient between clusters.

The clustering procedure is as follows:
(1) Each index is regarded as a single cluster: G1 = {X1}, G2 = {X2}, G3 = {X3}, G4 = {X4}. There are 4 clusters altogether.
(2) Merge the two clusters with the maximum similarity coefficient into a new cluster. In this case, G1 and G2 (similarity coefficient 0.852) are merged as G5 = {X1, X2}. Calculate the similarity coefficients among G5, G3 and G4. The similarity matrix ...
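The merging procedure of worked example 19-1 can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: the names `R0`, `cluster_similarity` and `hierarchical_clustering` are hypothetical, and the correlation values are those read from the matrix R(0) of the example. Only the first merge (G1 with G2 at 0.852) is confirmed by the text; the later merges are simply what the maximum similarity coefficient rule yields for these values.

```python
from itertools import combinations

# Correlations between X_i and X_j (i < j), as read from R(0) in example 19-1.
R0 = {
    (1, 2): 0.852,
    (1, 3): 0.099, (2, 3): 0.055,
    (1, 4): 0.234, (2, 4): 0.174, (3, 4): 0.732,
}


def pair_similarity(i, j):
    """Similarity of two variables: absolute value of their correlation."""
    return abs(R0[(min(i, j), max(i, j))])


def cluster_similarity(gp, gq):
    """Maximum similarity coefficient method, index-clustering form of (19-7):
    the largest r_ij with i in G_p and j in G_q."""
    return max(pair_similarity(i, j) for i in gp for j in gq)


def hierarchical_clustering(n_vars):
    """Merge clusters until one remains.

    Returns a list of (p, q, new_label, similarity) tuples, one per merge.
    """
    # Step 1: every variable starts as its own cluster G1..Gn.
    clusters = {k: frozenset([k]) for k in range(1, n_vars + 1)}
    next_label = n_vars + 1
    merges = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the greatest similarity.
        p, q = max(
            combinations(sorted(clusters), 2),
            key=lambda pq: cluster_similarity(clusters[pq[0]], clusters[pq[1]]),
        )
        sim = cluster_similarity(clusters[p], clusters[q])
        # Step 3: replace the two clusters by their union under a new label.
        clusters[next_label] = clusters.pop(p) | clusters.pop(q)
        merges.append((p, q, next_label, sim))
        next_label += 1
    return merges


merges = hierarchical_clustering(4)
for p, q, new, sim in merges:
    print(f"merge G{p} and G{q} into G{new} at similarity {sim:.3f}")
```

With these values the sketch first merges G1 and G2 at 0.852, matching step (2), then G3 and G4 at 0.732, and finally the two remaining clusters at 0.234. Similarities between clusters are recomputed from the original pairwise values on every pass, which is simple but not efficient for large m.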
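Returning to the distance measures of section 1, equations (19-3) to (19-6) can be sketched in plain Python as follows. The function names and the two sample vectors are illustrative assumptions, and the Mahalanobis function assumes the caller supplies the inverse covariance matrix directly rather than estimating it from data.

```python
import math


def euclidean(x, y):
    """Euclidean distance (19-3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (absolute) distance (19-4)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, q):
    """Minkowski distance (19-5); q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def mahalanobis(x, y, s_inv):
    """Quadratic form X' S^{-1} X of (19-6), with X = x - y.

    s_inv is the inverse of the sample covariance matrix, given as a list of
    rows (an assumption of this sketch: the caller has already inverted S).
    """
    diff = [a - b for a, b in zip(x, y)]
    m = len(diff)
    return sum(diff[r] * s_inv[r][c] * diff[c] for r in range(m) for c in range(m))


# Two hypothetical samples measured on m = 3 variables.
xi = [1.0, 2.0, 3.0]
xj = [4.0, 6.0, 3.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

d_euclid = euclidean(xi, xj)
d_manhattan = manhattan(xi, xj)
d_mahal_identity = mahalanobis(xi, xj, identity)
print(d_euclid, d_manhattan, d_mahal_identity)
```

For these two vectors the differences are (-3, -4, 0), so the Euclidean distance is 5 and the Manhattan distance is 7. With the unit matrix in place of the inverse covariance matrix, the Mahalanobis form returns 25, the square of the Euclidean distance, as stated in section 1.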