Distinguishing the number of n-grams B from the vocabulary size V
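This book still contains many errors, and the translator has compounded them. In Chapter 6, on language models, the author defines the various concepts in detail, but the translation of B is poor: it is rendered as "the number of classes of training instances", when it is really the number of model parameters, that is, the number of possible n-grams. A series of errors follows from this conceptual confusion.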
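The first error appears in the translator's note on page 127. The translator noticed the errata list published by Manning and wrote: "The training corpus has 273266 word types, so B should be 273266 ... (translator's note)".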
The B here should really be V; for a bigram model, B = V^2.
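For reference, the additive estimate under discussion can be written in the following general form. The symbols C (the n-gram count), N (the number of training instances) and \lambda are notation assumed here for illustration; ELE corresponds to \lambda = 0.5 and Laplace's law to \lambda = 1.

```latex
P(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n) + \lambda}{N + B\lambda},
\qquad B = V^{n} \text{ when estimating joint } n\text{-gram probabilities}
```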
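The second error is on page 128: "This training corpus has 14585 word types in total. So for the new conditional probability p(not|was), the new estimate is (608 + 0.5)/(9404 + 14589*0.5)". This repeats the same mistake; it should be: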
(608 + 0.5)/(9404 + 14589^2*0.5)
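Of course, this error originates with the original author, and the translator simply failed to notice it. Accordingly, the ELE estimates in Table 6.5 are all wrong, and the original text's conclusion that the estimate is discounted by half is also completely wrong. In short, the author bears the responsibility for the series of errors in Chapter 6, and the translator also failed to point them out.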
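The two candidate denominators for p(not|was) can be compared numerically. This is only a minimal sketch: the function name `ele_estimate` and its parameter names are made up for illustration, and the numbers are the ones quoted above. Note that the erratum for page 205 quoted below argues for B = |V| when the conditional distribution is smoothed directly, so both choices of B are computed side by side.

```python
def ele_estimate(count, context_total, bins, lam=0.5):
    """Additive (Lidstone) estimate; lam=0.5 gives ELE.

    `bins` is B, the number of possible outcomes being smoothed over.
    All names here are hypothetical, chosen for this sketch.
    """
    return (count + lam) / (context_total + bins * lam)


V = 14589            # word-type count as printed in the quoted passage
count_was_not = 608  # count of the bigram "was not"
count_was = 9404     # count of the context word "was"

mle = count_was_not / count_was                          # unsmoothed estimate
p_v = ele_estimate(count_was_not, count_was, V)          # B = V
p_v2 = ele_estimate(count_was_not, count_was, V ** 2)    # B = V^2

print(f"MLE:          {mle:.6f}")
print(f"ELE, B = V:   {p_v:.6f}")
print(f"ELE, B = V^2: {p_v2:.3e}")
```

With B = V the smoothed value is a little over half of the unsmoothed one, while with B = V^2 it falls by several orders of magnitude, so the choice of B decides whether the "discounted by half" conclusion holds.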
http://nlp.stanford.edu/fsnlp/errata.html
page 196, line -13: Change "This will be V^{n-1}" to "This will be V", given the following major clarification: In Section 6.1, the number of 'bins' is used to refer to the number of possible values of the classificatory feature vectors, while (unfortunately) from Section 6.2 on, with this change, the term 'bins' and the letter B is used to refer to the number of values of the target feature. This is V for prediction of the next word, but V^n for predicting the frequency of n-grams. (Thanks to Tibor Kiss <tibor .... linguistics.ruhr-uni-bochum.de>)
page 202-203: While the whole corpus had 400,653 word types, the training corpus had only 273,266 word types. This smaller number should have been used as B in the calculation of a Laplace's law estimate of table 6.4 (whereas actually 400,653 was used). The result of this change is that f_{Lap}(0) = 0.000295, and then 99.96% of the probability mass is given to previously unseen bigrams (!). In such a model, note that we have used a (demonstrably wrong) closed vocabulary assumption, so despite this huge mass being given to unseen bigrams, none is being given to potential bigrams using vocabulary items outside the training set vocabulary (OOV = out of vocabulary items). (Thanks to Steve Renals <s.renals .... dcs.shef.ac.uk> and Gary Cottrell <gary .... cs.ucsd.edu>)
page 205, line 2-3: Correction: here it is said that there are 14589 word types, but the number given elsewhere in the chapter (and the actual number found on rechecking the data file) is 14585. Clarification: Here we directly smooth the conditional distributions, so there are only |V| = 14585 values for the bigram conditional distribution added into the denominator during smoothing, whereas on pp. 202-203, we were estimating bigram probabilities, and there are |V|^2 different bigrams. (Thanks to Hidetosi Sirai <sirai .... sccs.chukyo-u.ac.jp>, Mark Lewellen <lewellen .... erols.com>, and Gary Cottrell <gary .... cs.ucsd.edu>)