Time: 9:30 a.m., May 16, 2018
Venue: Conference Room 503, East Teaching Building 3, Wangjiang Campus
Speaker: Mingjie Tang (唐明洁)
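Speaker bio: Received a bachelor's degree in computer science in 2007, a master's degree in computer science from the Graduate University of the Chinese Academy of Sciences in 2010, a master's degree in computer science from Purdue University (USA) in 2013, and a Ph.D. in computer science from Purdue University in 2016. He previously worked at Microsoft (USA) and IBM Research, and is now a research scientist at the big data company Hortonworks, where he works mainly on research and development for Spark and TensorFlow. During his doctoral studies he published more than 20 papers in conferences and journals including VLDB, TKDE, ICDE, EDBT, SIGSPATIAL, and IEEEIntelj. He received the best paper award at the database conference SISAP201 and the best application paper award at the data mining conference ADMA 2009. Some of his research results have been adopted by the open-source communities PostgreSQL and Spark.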
Abstract: TensorFlow and XGBoost are state-of-the-art platforms for deep learning and machine learning. However, neither of them is well suited to big data processing in real production environments. For example, TensorFlow does not provide OLAP or ETL over big data, which makes it harder to train a deep learning model efficiently on clean and sufficient data. Similarly, although XGBoost performs better than other gradient-boosting implementations, training an XGBoost model is still time-consuming when the data is big. It also usually requires extensive parameter tuning to get a highly accurate model, which creates a strong need to speed up the whole process.
In this talk, we will mainly introduce how Spark can improve TensorFlow and XGBoost in real applications, and demonstrate how these platforms can benefit from big data techniques. More specifically, we first introduce how Spark ML supports automatic parameter tuning and applies transfer learning to enhance real applications such as recommendation systems and image search. Second, we cover the implementation and performance improvements of the GPU-based XGBoost algorithm, summarize model tuning experience and best practices, share insights on how to build a heterogeneous data analytics and machine learning pipeline based on Spark in a GPU-equipped YARN cluster, and show how to push models into production.