Statistical Natural Language Parsing: Reliable Models of Language? |
|
Course URL: | http://videolectures.net/mitworld_fong_parsing/
Lecturer: | Sandiway Fong
Institution: | University of Arizona
Course date: | 2012-02-10
Language: | English
Course description: | The statistical natural language linguist owes much to the University of Pennsylvania's famous Treebank project. But this giant corpus of one million words (actually, 49 thousand sentences from the Wall Street Journal, all carefully labeled for their syntactic and semantic components) is actually both a "blessing and a curse," says Sandiway Fong. This "gold standard" list of parsed sentences, the result of more than a decade of work, has become "the only game in town," according to Fong. Linguists developing natural language algorithms often rely on the complex Penn Treebank to construct and train probabilistic, context-free grammars, and Fong acknowledges the Treebank's revolutionary impact on the field. But he also thinks it's worthwhile to examine how systems that rely on the Penn Treebank actually perform. He has been exploring three basic questions: Do such systems attain cognitively plausible knowledge of language, such as distinguishing between grammatical and ungrammatical components of sentences? How brittle are these systems, so that if you misspell a word or flip one part of the sentence, the system will still "give you back some parse"? Can these systems learn non-natural languages? Fong has unearthed some interesting issues. For instance, two well-known parsing systems couldn't score more than 50% figuring out the right way to pronounce the word "read" in eight sentences that deployed the past and present tenses (e.g., The girls will read the paper; The girls have read the paper). And the two systems didn't get the same sentences wrong. Fong wonders if "reading the Wall Street Journal is not a good way to learn how to pronounce 'read' or 'red.'" Fong also demonstrated that a parsing system's output could hinge on the presence (or absence) of a single example involving the phrase "milk with 4% butterfat," calling into question whether such systems are truly robust.
While Treebank-based parsing systems demonstrably perform well on Treebank-like sentences, one cannot infer that they have necessarily achieved grammatical competence or linguistic stability. We must understand, says Fong, that 40 thousand training samples do not really supply enough parameters to cover, for a computational system, the broad range of linguistic cases that ordinary people pick up nearly effortlessly. "We expect statistical systems to be able to deal with noise. But they are extremely fragile, despite their statistical nature and training over a large data set." |
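The description mentions training probabilistic, context-free grammars from the Treebank. The standard approach is relative-frequency estimation over the productions observed in the bracketed trees. Below is a minimal sketch, assuming toy trees encoded as nested tuples `(label, children...)`; the tree format, function names, and example trees are illustrative assumptions, not the actual Penn Treebank format or any particular parser's code.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) rules from a nested-tuple tree.

    A node is (label, child, ...); a leaf child is a plain string,
    producing a lexical rule such as DT -> the.
    """
    label, *children = tree
    if not children:
        return
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from productions(c)

def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for t in trees:
        for lhs, rhs in productions(t):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# Toy "treebank" of two bracketed sentences.
trees = [
    ("S", ("NP", ("DT", "the"), ("NN", "girl")), ("VP", ("VBD", "read"))),
    ("S", ("NP", ("NNS", "girls")), ("VP", ("VBD", "read"))),
]
pcfg = estimate_pcfg(trees)
print(pcfg[("NP", ("DT", "NN"))])  # 0.5: NP rewrites as DT NN in 1 of 2 NP nodes
```

Fong's point about 40 thousand training samples can be read directly off this estimator: any rule that never occurs in the training trees gets no probability mass at all, however grammatical it may be.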
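The "read" experiment described above works because the pronunciation of "read" is recoverable from its part-of-speech tag (base/present /ri:d/ vs. past or participle /rEd/), which a parser must in turn infer from syntactic context such as the preceding auxiliary. The toy sketch below makes that tag-to-pronunciation mapping concrete; the rule table and function names are illustrative assumptions, not the systems Fong actually tested.

```python
# Penn Treebank verb tags mapped to the pronunciation of "read".
PRONUNCIATION = {
    "VB": "/ri:d/",   # base form: "will read", "to read"
    "VBP": "/ri:d/",  # non-3sg present: "the girls read (every day)"
    "VBD": "/rEd/",   # simple past: "read it yesterday"
    "VBN": "/rEd/",   # past participle: "have read"
}

def pronounce_read(sentence, target="read"):
    """Guess the tag of 'read' from the immediately preceding auxiliary,
    then map the tag to a pronunciation. A toy stand-in for a full parser."""
    aux_to_tag = {"will": "VB", "to": "VB", "have": "VBN", "has": "VBN", "had": "VBN"}
    tokens = [w.lower() for w in sentence.split()]
    i = tokens.index(target)
    tag = aux_to_tag.get(tokens[i - 1], "VBD") if i > 0 else "VBP"
    return PRONUNCIATION[tag]

print(pronounce_read("The girls will read the paper"))  # /ri:d/
print(pronounce_read("The girls have read the paper"))  # /rEd/
```

Even this crude heuristic resolves Fong's two example sentences; his finding is that trained statistical parsers got no better than 50% on the full eight-sentence set, i.e., they failed exactly the contextual disambiguation this mapping depends on.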
Keywords: | statistics; natural language; corpus
Source: | VideoLectures.NET
Last reviewed: | 2019-05-29:lxf
Views: | 53