Machine Learning Software in Practice: Quo Vadis?
Course URL: http://videolectures.net/kdd2017_pafka_machine_learning_software/
Lecturer: Szilard Pafka
Institution: Epoch
Date: 2017-10-09
Language: English
Course description: Due to the hype in our industry over the last couple of years, there is a growing mismatch between the software tools machine learning practitioners wish for, what they truly need for their work, what is available (either commercially or open source), and what tool developers and researchers focus on. In this talk we will give a couple of examples of this mismatch. Several surveys and anecdotal evidence show that most practitioners work most of the time (at least in the modeling phase) with datasets that fit in the RAM of a single server, so distributed computing tools are very often overkill. Our benchmarks (available on GitHub [1]) of the most widely used open source tools for binary classification (various implementations of algorithms such as linear methods, random forests, gradient boosted trees and neural networks) on such data show over 10x differences in speed and over 10x differences in RAM usage between tools, with "big data" tools being the most inefficient. Significant performance gains have been obtained by those tools that incorporate various low-level (close to CPU and memory architecture) optimizations. Nevertheless, we will show that even the best tools show degrading performance on multi-socket servers featuring a high number of cores, systems that have become widely accessible more recently. Finally, while most of this talk is about performance, we will also argue that machine learning tools that feature high-level, easy-to-use APIs provide increasing productivity for practitioners and are therefore preferable.
Keywords: machine learning; open source binary classification; software tools
Source: VideoLectures.NET
Data collected: 2022-11-18:chenjy
Last reviewed: 2022-11-18:chenjy
Views: 36