Using R for Scalable Data Science: Single Machines to Hadoop Spark Clusters |
|
Course URL: | http://videolectures.net/kdd2017_hands_on_tutorial_scalable_data/ |
Instructor: | Hang Zhang |
Institution: | Microsoft |
Course date: | 2017-11-14 |
Course language: | English |
Course description: | In this tutorial, we will demonstrate how to create scalable, end-to-end data analysis processes in R on single machines, in-database in SQL Server, and on Hadoop clusters running Spark. We will provide hands-on exercises and code in a public GitHub repository for attendees to adopt in their data science practice. In particular, attendees will see how to build, persist, and consume machine learning models using distributed machine learning functions in R. R is one of the most widely used languages in the data science, statistics, and machine learning (ML) communities. Although open-source R (CRAN library) now has in excess of 10,000 packages and functions for statistics and ML, when it comes to scalable analysis using R, or deployment of trained models into production, many data scientists are blocked or hindered by (a) the limited set of available functions that handle large datasets efficiently, and (b) a lack of knowledge about the appropriate computing environments for scaling R scripts from desktop analysis to elastic and distributed cloud services. In this tutorial, we will discuss how to create end-to-end data science solutions that utilize distributed compute resources. During the tutorial, we will provide presentations, worked examples, and hands-on exercises with sample code. In addition, we will provide a public GitHub code repository that attendees will be able to access and adapt to their own practice. We believe this tutorial will be of strong interest to a large and growing community of data scientists and developers who use R to create and deploy analytical solutions. |
Keywords: | data analysis; data practice; machine learning |
Course source: | VideoLectures.NET |
Data collected: | 2023-07-19:chenxin01 |
Last reviewed: | 2023-07-19:chenxin01 |
Views: | 21 |