STAT 447: Data Science Programming Methods is a course in the Department of Statistics at the University of Illinois.
Data Science Programming Methods was first offered in the Spring 2019, Fall 2019 and Fall 2020 terms as STAT 430: Topics in Applied Statistics. Since Fall 2021 it has been offered under its own course number as STAT 447, with further runs in Fall 2022 and Spring 2024. The instructor is Dirk Eddelbuettel, who also designed the course and taught the previous instances (which can still be accessed; see the resources/websites link on the left).
Course lecture slides as well as guest lectures are publicly accessible; see the lectures by topic links on the left.
Note that the website is currently being updated for the Spring 2025 version. If you see any outdated reference to 2024 or prior runs, please let us know at the instructor email.
A 2018 report by the National Academies of Sciences, Engineering, and Medicine stated:
Data science is emerging as a field that is revolutionizing science and industries alike. Work across nearly all domains is becoming more data driven, affecting both the jobs that are available and the skills that are required. As more data and ways of analyzing them become available, more aspects of the economy, society, and daily life will become dependent on data.
This course introduces key concepts for computational literacy in a data science context:
find, grep, sed, awk, …) and build this up to simple shell scripts.
git (and the GitHub site): we introduce version control to manage source code (and other files such as write-ups and documentation) and much more such as social computing, plus a foray into GitHub Actions building on shell scripts.
sqlite and duckdb
This course is fast paced. We cover a considerable amount of material.
Note that the CBTF tests generally require an on-campus presence. For Chicago-based students, an alternate location downtown may be made available, and upon request (and with demonstrated reasons) remote students may be accommodated. Note, however, that the default is for on-line homework and in-person tests.
Statistics and Data Science are focused on making sense of data – and face an ever-increasing
demand for their work. Yet at the same time, data sets increase in size and scope. Proper tooling
is essential to meet these challenges, and as applied work in data analysis is in effect applied
computational work, we will learn the computational tools and programming methods to meet these
data science challenges. Proficiency at the shell, familiarity with git version control,
sufficient understanding of SQL, and of course acquiring actual expertise in R programming are the
goals of this course to prepare students for the coming computational challenges. We offer an
RStudio Server instance along with use of personal computers. Prior programming experience (in R or
another language) will certainly be helpful, but is not a formal requirement for taking the course.
Please see the Lectures by Topics link to the left. Content is often refreshed or added as the course progresses, but you always have prior years (from the most recent, Spring 2024, to prior runs in Fall 2022, Fall 2021, Fall 2020, Fall 2019 as well as Spring 2019) as a complete reference.