Software 2.0 and Snorkel: Beyond Hand-Labeled Data
Course URL: http://videolectures.net/kdd2018_re_hand_labeled_data/
Lecturer: Christopher Ré
Institution: Department of Computer Science, Stanford University
Date: 2018-11-23
Language: English
Course description: This talk describes Snorkel, a software system whose goal is to make routine machine learning tasks dramatically easier. Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets for a user’s task. In Snorkel, a user implicitly defines large training sets by writing simple programs that create labeled data, instead of tediously hand-labeling individual data items. In turn, this allows users to incorporate many sources of training data, some of low quality, to build high-quality models. This talk will describe how Snorkel changes the way users program machine learning models. A key technical challenge in Snorkel is combining heuristic training data that may have uneven and unknown quality and an unknown correlation structure. This talk will explain the underlying theory, including methods to learn both the parameters and structure of generative models without labeled data. Additionally, we’ll describe our recent experiences with hackathons, which suggest the Snorkel approach may allow a broader set of users to train machine learning models and do so more easily than previous approaches. Snorkel is being used by scientists in areas including genomics and drug repurposing, by a number of companies involved in various forms of search, and by law enforcement in the fight against human trafficking. Snorkel is open source on GitHub. Technical blog posts and tutorials are available at Snorkel.Stanford.edu.
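The idea of writing simple programs that create labeled data can be illustrated with a small, self-contained sketch. This is not Snorkel's API: the function names, the toy spam task, and the label values are hypothetical and chosen only for illustration. The combiner is a plain majority vote, which is a much simpler stand-in for the generative model the talk describes for weighing sources of unknown quality and correlation.

```python
from collections import Counter
from typing import Callable, List, Optional

# Label values for a toy spam task (assumed for illustration only).
SPAM, HAM = 1, 0
ABSTAIN = None  # a heuristic may decline to vote

LabelingFunction = Callable[[str], Optional[int]]


def lf_contains_link(text: str) -> Optional[int]:
    """Heuristic: messages containing a URL are often spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text: str) -> Optional[int]:
    """Heuristic: the word 'prize' suggests spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_greeting(text: str) -> Optional[int]:
    """Heuristic: short messages that open with a greeting are usually fine."""
    words = text.lower().split()
    if words and len(words) <= 5 and words[0].strip(",.!") in {"hi", "hello", "thanks"}:
        return HAM
    return ABSTAIN


def majority_vote(votes: List[Optional[int]]) -> Optional[int]:
    """Combine noisy votes; abstain when nobody votes or the top labels tie."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_dataset(texts: List[str], lfs: List[LabelingFunction]) -> List[Optional[int]]:
    """Apply every heuristic to every item and combine the votes per item."""
    return [majority_vote([lf(t) for lf in lfs]) for t in texts]


LFS: List[LabelingFunction] = [lf_contains_link, lf_mentions_prize, lf_short_greeting]

if __name__ == "__main__":
    texts = [
        "Win a prize now at http://example.com",
        "hello, see you soon",
        "Meeting moved to 3pm",
    ]
    for text, label in zip(texts, label_dataset(texts, LFS)):
        print(repr(text), "->", label)
```

Majority vote treats every heuristic as equally reliable and independent. The approach described in the talk instead estimates the accuracies and correlation structure of the heuristics without labeled data, which is what lets low-quality sources contribute without dominating.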
Keywords: routine machine learning tasks; large training datasets; machine learning models
Source: VideoLectures.NET
Data collected: 2023-01-30:cyh
Last reviewed: 2023-01-31:cyh
Views: 21