Guest Lectures :: Data Science Programming Methods

Guest Lectures

Szilard Pafka, Epoch

Topic: With all the hype about deep learning and “AI”, it is not well (enough) publicized that for structured/tabular data widely encountered in business applications it is actually another machine learning algorithm, the gradient boosting machine/gradient boosted decision trees (GBM/GBDT) that most often achieves the highest accuracy in supervised learning/prediction tasks. In this talk we’ll provide plenty of evidence about the vast superiority of GBMs over deep learning on tabular/business data. Then, we will present some of the major open source GBM implementations such as xgboost, h2o, lightgbm and catboost (all of them available from R and Python), and finally, we will compare their main performance characteristics: training speed, memory footprint, scaling to multiple CPU cores, GPU implementations etc. While deep learning is certainly the best algorithm available for computer vision (and it has also shown some success in a few other rather specialized domains), in most business applications, where the data is most often of a tabular structure, gradient boosted decision trees are vastly superior to deep learning neural networks and should definitely be the algorithm of choice.

Bio: Dr. Szilard Pafka is Chief Scientist at Epoch. Szilard studied Physics in the 90s and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He worked in finance, then in 2006 he moved to become the Chief Scientist of a tech company in Santa Monica, California doing everything data (analysis, modeling, data visualization, machine learning, data infrastructure etc). He was the founder/organizer of several meetups in the Los Angeles area (R, data science etc) and the data science community website datascience.la for more than a decade until he relocated to Texas in 2021. He is the author of a well-known machine learning benchmark on GitHub (1000+ stars), a frequent speaker at conferences (keynote/invited at KDD, R-finance, Crunch, eRum and contributed at useR!, PAW, EARL, H2O World, Data Science Pop-up, Dataworks Summit etc.), and he has developed and taught graduate data science and machine learning courses as a visiting professor at two universities (UCLA in California and CEU in Europe). You can follow him on LinkedIn, Twitter or GitHub.

The lecture took place Friday, October 28, at 1pm Central. The presentation slides are available, and a video recording is available (to anybody with a valid University of Illinois ‘Netid’). An unrestricted YouTube link is also available.

John Mount, WinVector

Topic: Machine learning practice, often called data science, emphasizes empirical tuning of predictive models. When these practitioners run into common problems they propose and promote fixes somewhat different from the statistical canon. I’ll discuss two issues where data science practice differs from statistical inference: co-linear variables and building classifiers for un-balanced models. For co-linear variables the data science practice is often “regularize and ignore”, which I will define before explaining why this fire-and-forget procedure seems to work. This lets us start to explore the consequences of using prediction quality as an exclusive model quality metric. For un-balanced models I argue that the result is the opposite: ignoring the internal probabilistic structure of the problem leads to unnecessarily clumsy workarounds. The goal is to show how to appreciate data science as street fighting statistics.
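As a rough illustration of the “regularize and ignore” idea in the abstract above, here is a minimal Python sketch. It is not taken from the talk. It assumes that “regularize” means adding a ridge (L2) penalty, which is only one possible reading, and it uses simulated data with hypothetical helper names (`simulate`, `fit_two_predictors`). Two nearly identical predictors are fit with and without the penalty on several simulated data sets.

```python
import random

def fit_two_predictors(x1, x2, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for two predictors, no intercept.

    lam = 0 gives ordinary least squares; lam > 0 gives ridge regression.
    """
    a11 = sum(u * u for u in x1) + lam
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    # Cramer's rule for the 2x2 system
    return ((c1 * a22 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)

def simulate(seed, n=200):
    """Simulated data (an assumption for illustration): x2 is a near copy of x1."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [u + rng.gauss(0, 0.001) for u in x1]               # almost co-linear with x1
    y = [u + v + rng.gauss(0, 0.5) for u, v in zip(x1, x2)]  # true coefficients (1, 1)
    return x1, x2, y

def spread(values):
    """Range of a list of numbers, used as a crude stability measure."""
    return max(values) - min(values)

seeds = range(5)
ols = [fit_two_predictors(*simulate(s), lam=0.0) for s in seeds]
ridge = [fit_two_predictors(*simulate(s), lam=1.0) for s in seeds]

for s, o, r in zip(seeds, ols, ridge):
    print(f"seed {s}: OLS b = ({o[0]:.2f}, {o[1]:.2f})   ridge b = ({r[0]:.2f}, {r[1]:.2f})")

print("spread of b1 across seeds, OLS:  ", round(spread([b[0] for b in ols]), 3))
print("spread of b1 across seeds, ridge:", round(spread([b[0] for b in ridge]), 3))
```

In this sketch the sum of the two coefficients, which is what drives predictions when the predictors are near copies of each other, stays close to 2 under both fits, while the individual unpenalized coefficients vary widely from one simulated data set to the next. That is one way to see why judging a model only by prediction quality can hide the co-linearity.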

Bio: Dr. John Mount is a Principal Consultant at Win Vector LLC. John has a Ph.D. in computer science from Carnegie Mellon University, using probabilistic methods to prove convergence rates of Markov chains in optimization and sampling applications. He worked on structural diversity of molecules for biotech applications, and wrote and executed algorithmic trading strategies for Banc of America Securities (a division of Bank of America). He is now concentrating on data science, machine learning, AI and analytics consulting and teaching. His most recent teaching product is a two-week private immersion course on data science for engineers. He is the co-author of the book Practical Data Science with R, published by Manning and now in its 2nd edition. He is the author of a number of packages for data science in both R and Python. You can follow him on LinkedIn (https://www.linkedin.com/in/johnamount/), Twitter (https://twitter.com/winvectorllc), or GitHub (https://github.com/WinVector).

The lecture took place Wednesday, November 9, at 5pm Central. The presentation slides are available, as is the supporting GitHub repo. A video recording is available (to anybody with a valid University of Illinois ‘Netid’). An unrestricted YouTube link is also available.