Chinese corpus session (foreign language learning) courseware.ppt

Uploader (seller): 晟晟文业

Document ID: 5215126

Upload date: 2023-02-17

Format: PPT

Pages: 38

Size: 369.06KB



Resource description:

Making statistical claims
Corpus Linguistics
Richard X

Update on assignments
- Deadline for submission (email submission): TBA
- The Harvard referencing style
- Assignment A
  - Corpus study: introduction; synopsis/overview, critical review of data, method of analysis, conclusion etc; conclusions, bibliography
    - CL2005: http://www.corpus.bham.ac.uk/pclc/index.shtml
    - CL2007: http://www.corpus.bham.ac.uk/conference/proceedings.shtml
    - UCCTS2008: http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2008Proceedings/
    - UCCTS2010: http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings/
  - Corpus tool: introduction; description of the tool, its main features and functions; your critical evaluation of the tool: how well it does the jobs it is supposed to do; user interface, powerfulness, etc; conclusions; bibliography
- Assignment B
  - Introduction; literature review; methodology; results and discussions; conclusions; bibliography
- Option B: a 3,500-word essay, similar to Assignment B

Outline of the session
- Lecture
  - Raw and normalised frequency
  - Descriptive statistics (mean, mode, median, measure of dispersion)
  - Inferential statistics (chi-squared, LL, Fisher's Exact tests)
  - Collocation statistics
- Lab
  - UCREL online LL calculator
  - Xu's LL calculator
  - SPSS

Quantitative analysis
- Corpus analysis is both qualitative and quantitative
- One of the advantages of corpora is that they can readily provide quantitative data which intuitions cannot provide reliably
- "The use of quantification in corpus linguistics typically goes well beyond simple counting" (McEnery and Wilson 2001: 81)
- What can we do with those numbers and counts?

Raw frequency
- The arithmetic count of the number of occurrences of a linguistic feature (a word, a structure etc.)
- The most direct quantitative data provided by a corpus
- Frequency itself does NOT tell you much in terms of the validity of a hypothesis
  - There are 250 instances of the f*k swearword in the spoken BNC, so what?
  - Does this mean that people swear frequently or infrequently when they speak?

Normalized frequency
- ... in relation to what?
- Corpus analysis is inherently comparative
  - There are 250 instances of the swearword in the spoken BNC and 500 instances in the written BNC
  - Do people swear twice as often in writing as in speech?
  - Remember the written BNC is 9 times as large as the spoken BNC
- When comparing corpora of different sizes, we need to normalize the frequencies to a common base (e.g. per million tokens)
  - Normalised freq = raw freq / token number × common base
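A minimal sketch of the normalization formula above, in Python. The function name `normalised_frequency` is illustrative, and the corpus sizes of 10 and 90 million tokens are the rounded figures implied by the slide (the written BNC being 9 times as large as the spoken BNC), not exact token counts.

```python
def normalised_frequency(raw_freq, token_count, common_base=1_000_000):
    """Normalised freq = raw freq / token number * common base."""
    return raw_freq / token_count * common_base


# Rounded corpus sizes (assumption): spoken BNC ~10M tokens, written BNC ~90M tokens.
SPOKEN_TOKENS = 10_000_000
WRITTEN_TOKENS = 90_000_000

spoken_pm = normalised_frequency(250, SPOKEN_TOKENS)
written_pm = normalised_frequency(500, WRITTEN_TOKENS)

print(f"spoken:  {spoken_pm:.2f} per million tokens")
print(f"written: {written_pm:.2f} per million tokens")
print(f"ratio:   {spoken_pm / written_pm:.1f}")
```

Under these assumed sizes the raw counts (250 vs. 500) and the normalized rates point in opposite directions, which is why the comparison has to be made on a common base.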
- The swearword is about 4 times as frequent in speech as in writing (corpus sizes in millions of tokens)
  - Swearword in spoken BNC = 250/10 × 1 = 25 per million tokens
  - Swearword in written BNC = 500/90 × 1 ≈ 6 per million tokens
- ... but is this difference statistically significant?

Normalized frequency
- The size of a sample may affect the level of statistical significance
- Tips for normalizing frequency data
  - The common base for normalization must be comparable to the sizes of the corpora
  - Normalizing the spoken vs. written BNC to a common base of 1000 tokens?
- Warning
  - Results obtained on an irrationally enlarged or reduced common base are distorted

Descriptive statistics
- Frequencies are a type of descriptive statistics
  - Descriptive statistics are used to describe a dataset
- A group of ten students took a test and their scores are as follows: 4, 5, 6, 6, 7, 7, 7, 9, 9, 10
  - How will you report the measure of central tendency of this group of test results using a single score?

The mean
- The mean is the arithmetic average
- The most common measure of central tendency
- Can be calculated by adding all of the scores together and then dividing the sum by the number of scores, which here gives 7
  - 4 + 5 + 6 + 6 + 7 + 7 + 7 + 9 + 9 + 10 = 70; 70/10 = 7
- While the mean is a useful measure, unless we also know how dispersed (i.e. spread out) the scores in a dataset are, the mean can be an uncertain guide

The mode and the median
- The mode is the most common score in a set of scores
  - The mode in our testing example is 7, because this score occurs more frequently than any other score: 4, 5, 6, 6, 7, 7, 7, 9, 9, 10
- The median is the middle score of a set of scores ordered from the lowest to the highest
  - For an odd number of scores, the median is the central score in an ordered list
  - For an even number of scores, the median is the average of the two central scores
  - In the above example the median is 7 (i.e. (7 + 7)/2)
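The three measures of central tendency described above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module on the ten test scores from the example.

```python
import statistics

scores = [4, 5, 6, 6, 7, 7, 7, 9, 9, 10]

# Mean: add all of the scores together and divide by the number of scores.
mean = sum(scores) / len(scores)

# Mode: the score that occurs more often than any other.
mode = statistics.mode(scores)

# Median: with an even number of scores, the average of the two central scores.
median = statistics.median(scores)

print(f"mean={mean}, mode={mode}, median={median}")
```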
Measure of dispersion: range
- The range is a simple way to measure the dispersion of a set of data
  - The difference between the highest and lowest frequencies/scores
  - In our testing example the range is 6 (i.e. highest 10 - lowest 4)
- Only a poor measure of dispersion
  - An unusually high or low score in a dataset may make the range unreasonably large, thus giving a distorted picture of the dataset

Measure of dispersion: variance
- The variance, as used here, is the distance of each score in the dataset from the mean (its deviation)
  - In our test results, the variance of the score 4 is 3 (i.e. 7 - 4), and the variance of the score 9 is 2 (i.e. 9 - 7)
- For the whole dataset, the sum of these differences is always zero
  - Some scores will be above the mean while some will be below the mean
- So it is meaningless to use the summed variance to measure the dispersion of a whole dataset

Measure of dispersion: std dev
- Standard deviation is equal to the square root of the sum of the squared deviation scores divided by the number of scores in a dataset
  $\sigma = \sqrt{\frac{\sum (F - \mu)^2}{N}}$
  - F is a score in a dataset (i.e. any of the ten scores)
  - μ is the mean score (i.e. 7)
  - N is the number of scores under consideration (i.e. 10)
- Std dev in our example of test results is 1.687
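The dispersion measures above can be sketched in Python as follows; the variable names are illustrative. For the ten scores as listed, this computation gives roughly 1.79 when dividing by N (the formula as stated) and roughly 1.89 when dividing by N - 1, rather than the 1.687 quoted on the slide, so that figure may have been computed from a slightly different set of scores.

```python
import math
import statistics

scores = [4, 5, 6, 6, 7, 7, 7, 9, 9, 10]
mean = sum(scores) / len(scores)

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)

# Signed deviations from the mean always sum to zero,
# which is why they cannot measure dispersion on their own.
deviations = [score - mean for score in scores]
sum_of_deviations = sum(deviations)

# Standard deviation as stated on the slide:
# square root of (sum of squared deviations / number of scores).
std_dev_n = math.sqrt(sum(d ** 2 for d in deviations) / len(scores))

# Sample standard deviation (divides by N - 1), shown for comparison.
std_dev_n_minus_1 = statistics.stdev(scores)

print(f"range={score_range}, sum of deviations={sum_of_deviations}")
print(f"std dev (N)={std_dev_n:.3f}, std dev (N-1)={std_dev_n_minus_1:.3f}")
```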
Measure of dispersion: std dev
- For a normally distributed dataset (i.e. where most of the items are clustered towards the centre rather than the lower or higher end of the scale)
  - 68% of the scores lie within one standard deviation of the mean
  - 95% lie within two standard deviations of the mean
  - 99.7% lie within three standard deviations of the mean
- The standard deviation is the most reasonable measure of the dispersion of a dataset
- [Figure: normal distribution (bell-shaped curve)]

Computing std dev with SPSS
- SPSS menu: Analyze > Descriptive Statistics > Descriptives

Descriptive Statistics
                      N    Minimum   Maximum   Mean   Std. Deviation
score                 10   4         10        6.80   1.687
Valid N (listwise)    10

Inferential statistics
- Descriptive statistics are useful in summarizing a dataset
- Inferential statistics are typically used to formulate or test a hypothesis
  - Using statistical measures to test whether or not any differences observed are statistically significant
- Tests of statistical significance
  - chi-square test
  - log-likelihood (LL) test
  - Fisher's Exact test
- Collocation statistics
  - Mutual information (MI)
  - z score

Statistical significance
- In testing a linguistic hypothesis, it would be nice to be 100% sure that the hypothesis can be accepted
- However, one can never be 100% sure in real-life cases
  - There is always the possibility that the differences observed between two corpora are due to chance
  - In our swearword example, it is 4 times as frequent in speech as in writing
  - We need to use a statistical test to help us decide whether this difference is statistically significant
- The level of statistical significance = the level of our confidence in accepting a given hypothesis
  - The closer the likelihood is to 100%, the more confident we can be
  - One must be more than 95% confident that the observed differences have not arisen by chance

Commonly used statistical tests
- Chi-square test
  - Compares the difference between the observed values (e.g. the actual frequencies extracted from corpora) and the expected values (e.g. the frequencies that one would expect if no factor other than chance was affecting the frequencies)
- Log-likelihood test (LL)
  - Similar, but more reliable as LL does not assume that data is normally distributed
  - The preferred test for statistical significance

Commonly used statistical tests
- Interpreting results
  - The greater the difference (absolute value) between the observed values and the expected values, the less likely it is that the difference is due to chance; conversely, the closer the observed values are to the expected values, the more likely it is that the difference has arisen by chance
  - A probability value p close to 0 indicates that a difference is highly significant statistically; a value close to 1 indicates that a difference is almost certainly due to chance
  - By convention, the general practice is that a hypothesis can be accepted only when the level of significance is less than 0.05 (i.e. p < 0.05, or more than 95% confident)

Online LL calculator
- http://ucrel.lancs.ac.uk/llwizard.html
- How to find the probability value p for an LL score of 301.88?

Contingency table
- Degree of freedom (d.f.) = (No. of rows - 1) × (No. of columns - 1) = (2 - 1) × (2 - 1) = 1 × 1 = 1

Critical values
- The chi-square test or LL test score must be greater than 3.84 (1 d.f.) for a difference to be statistically significant
  - Oakes, M. (1998) Statistics for Corpus Linguistics, EUP, p. 266
- In the example of the swearword in the spoken/written BNC, LL = 301.88 for 1 d.f.
  - More than 99.99% confident that the difference is statistically significant
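As a rough illustration of how the LL score of 301.88 and its probability value could be obtained, the sketch below assumes the two-term log-likelihood formula, LL = 2 × Σ observed × ln(observed / expected), which reproduces the quoted score for the swearword counts. The function names and the rounded corpus sizes (10 and 90 million tokens) are assumptions. The p-value uses the fact that, for 1 d.f., the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-term log-likelihood score for one item in two corpora."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Expected frequencies if only corpus size (i.e. chance) mattered.
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    score = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def p_value_1df(score):
    """Upper-tail probability of a chi-square or LL score with 1 d.f."""
    return math.erfc(math.sqrt(score / 2))


# Swearword example: 250 hits in ~10M spoken tokens, 500 in ~90M written tokens.
ll = log_likelihood(250, 10_000_000, 500, 90_000_000)
print(f"LL = {ll:.2f}, p = {p_value_1df(ll):.3g}")

# The critical value 3.84 corresponds to p of about 0.05 for 1 d.f.
print(f"p at 3.84 = {p_value_1df(3.84):.4f}")
```

Any score above 3.84 therefore gives p < 0.05 for a 2 × 2 comparison, and a score as large as 301.88 gives a p-value far below 0.0001.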
Excel LL calculator by Xu
- www.corpus4u.org/attachment.php?attachmentid=560&d=1240826440

SPSS: Left- vs. right-handed
- Define variables
- Data view
- Weight cases (Data > Weight Cases)

SPSS: Left- vs. right-handed
- Cross-tab
- Select variables

SPSS: Left- vs. right-handed
- Critical value (χ²/LL) for 1 d.f. at p < 0.05 (95%): 3.84
- Is there a relationship between gender and left- or right-handedness?
- Any cells with an expected value less than 5?

Fisher's Exact test
- The chi-square or log-likelihood test may not be reliable with very low frequencies
  - When a cell in a contingency table has an expected value less than 5, Fisher's Exact test is more reliable
  - In this case, SPSS computes Fisher's exact significance level automatically when the chi-square test is selected
- SPSS Releases 15 and 16 have removed the Fisher's Exact test module, which can be purchased separately

Fisher's Exact test
- Don't forget to weight cases!

Fisher's Exact test
- Force an FE test

Practice
- Use the UCREL LL calculator, Xu's LL calculator and SPSS to determine if the difference in the frequencies of passives in the CLEC and LOCNESS corpora is statistically significant
  - CLEC: 7,911 instances in 1,070,602 words
  - LOCNESS: 5,465 instances in 324,304 words
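For the Fisher's Exact test slides above, the following sketch shows how the "expected value less than 5" check and a two-sided exact p-value might be computed for a 2 × 2 table without SPSS. The table counts are hypothetical (they are not the gender/handedness data from the slides), the function names are illustrative, and the two-sided p-value uses one common definition: the summed probability of all tables, with the same margins, that are no more likely than the observed one.

```python
from math import comb


def expected_values(table):
    """Expected cell values of a 2 x 2 table: row total * column total / N."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * col / n for col in col_totals] for r in row_totals]


def fisher_exact_two_sided(table):
    """Two-sided Fisher's Exact p-value for a 2 x 2 table."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def probability(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = probability(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(
        p
        for p in map(probability, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


table = [[3, 1], [1, 3]]  # hypothetical counts, for illustration only

needs_fisher = any(e < 5 for row in expected_values(table) for e in row)
p = fisher_exact_two_sided(table)
print(f"expected value below 5: {needs_fisher}, p = {p:.4f}")
```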
Collocation statistics
- Collocation: the habitual or characteristic co-occurrence patterns of words
  - Can be identified using a statistical approach in CL, e.g. Mutual Information (MI), t test, z score
  - Can be computed using tools like SPSS, Wordsmith, AntConc, Xaira
- Only a brief introduction here
  - More discussions of collocation statistics to follow

Mutual information
- Computed by dividing the observed frequency of the co-occurring word in the defined span for the search string (so-called node word), e.g. a 4:4 window, by the expected frequency of the co-occurring word in that span and then taking the logarithm to the base 2 of the result

Mutual information
- A measure of collocational strength
- The higher the MI score, the stronger the link between two items
  - MI score of 3.0 or higher to be taken as evidence that two items are collocates
- The closer to 0 the MI score gets, the more likely it is that the two items co-occur by chance
- A negative MI score indicates that the two items tend to shun each other

The t test
- Computed by subtracting the expected frequency from the observed frequency and then dividing the result by the standard deviation
- A t score of 2 or higher is normally considered to be statistically significant
- The specific probability level can be looked up in a table of t distribution

The z score
- The z score is the number of standard deviations from the mean frequency
- The z test compares the observed frequency with the frequency expected if only chance is affecting the distribution
- A higher z score indicates a greater degree of collocability of an item with the node word
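A small Python sketch of the three collocation measures described above. The observed and expected frequencies are hypothetical numbers chosen for illustration, and the function names are not taken from any tool. The slides do not say how the standard deviation is estimated, so the sketch assumes two approximations that are common in corpus work: sqrt(observed) for the t score and sqrt(expected) for the z score.

```python
import math


def mutual_information(observed, expected):
    """MI: log to the base 2 of observed / expected co-occurrence frequency."""
    return math.log2(observed / expected)


def t_score(observed, expected):
    # Assumption: the standard deviation is approximated by sqrt(observed).
    return (observed - expected) / math.sqrt(observed)


def z_score(observed, expected):
    # Assumption: the standard deviation is approximated by sqrt(expected).
    return (observed - expected) / math.sqrt(expected)


# Hypothetical figures: the collocate occurs 16 times within the span around
# the node word, where chance alone would predict 2 occurrences.
observed, expected = 16, 2.0

mi = mutual_information(observed, expected)
t = t_score(observed, expected)
z = z_score(observed, expected)
print(f"MI = {mi:.2f}, t = {t:.2f}, z = {z:.2f}")
```

With these made-up figures the MI score sits exactly at the 3.0 threshold and the t score is above 2, so both measures would treat the pair as collocates.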

