Syllabus
Statistics 215A, Fall 2024
Instructor: Professor Bin Yu, binyu at berkeley.edu
Lectures: T/Th: 12:30 pm - 2:00 pm (332 Evans)
Office hours: 409 Evans: Tuesday 10:00-11:00, Wednesday 11:00-12:00.
Discussion: Friday: 9:00 am - 11:00 am (342 Evans)
GSI: Anthony Ozerov (ozerov at berkeley.edu); the GSI will be in charge of the discussion sessions, Ed Discussion (replacing Piazza), and the labs/homework.
GSI Office hours: 444 Evans: Monday 14:00-15:00, Wednesday 12:00-13:00, Thursday 16:00-17:00.
Textbooks:
“Veridical Data Science: The Practice of Responsible Data Analysis and Decision-Making” by Bin Yu and Rebecca Barter (free version at vdsbook.com; print version to be published in Oct. 2024 by MIT Press) (required).
Statistical Models, David Freedman (Cambridge University Press, 2009, 2nd Ed.) (required).
The Elements of Statistical Learning, Trevor Hastie, Rob Tibshirani, Jerome Friedman (Springer, 2016, 2nd Ed.) (recommended).
Comments, Suggestions, Gripes: Talk to the instructor or the GSI before or after lecture, or send an email.
Ed Discussion: Questions and discussion about course material, HWs, and labs can be posted on the Ed Discussion page (accessed through bCourses). The GSI will monitor it regularly to ensure that all questions are answered in a timely manner, but students are encouraged to help their classmates as well. Please think carefully before asking questions specifically about the projects. For example, questions about how to do something specific in R are fine, but questions asking what other people did for their analysis are not. Questions asking for clarification are fine.
bCourses: https://bcourses.berkeley.edu/courses/1537926
Course website: https://stat215a.berkeley.edu/fall-2024/ (with up-to-date schedule)
Grading:
55% assignments (homework and labs)
5% class/discussion participation and reading assignments
15% midterm
25% final project
Assignments: There will be 4 or 5 assignments, given out on Friday in the discussion session and usually due two weeks later (any exceptions will be announced). The assignments require two full weeks of work to complete satisfactorily, so a very early start is essential. The assignments contain homework problems and data analysis labs. For the data labs, each student will produce a report of at most 12 pages presenting a narrative that connects the motivating questions, the analysis conducted, and the conclusions drawn. The labs will be completed in Python (optionally with some parts in R, especially for visualization). The reports will be made using Jupyter Notebooks or pure LaTeX, and the final PDF output should not contain any code whatsoever. Each report will be hosted in a GitHub repository containing both the code and the written report, and an automated script will pull the submissions at 12:00 am on the due date (i.e., midnight of the night before our Friday discussion section). HW submissions will be made to Gradescope. No late assignments will be accepted, for any reason.
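(As one possible route to a code-free PDF, offered as a suggestion rather than a requirement: Jupyter's nbconvert tool supports jupyter nbconvert --to pdf --no-input lab1.ipynb, which renders a notebook to PDF while excluding code cells. Here lab1.ipynb is a placeholder name, and PDF conversion requires a working LaTeX installation. Check with the GSI if you are unsure whether your workflow meets the submission requirements.)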
Readings: A number of readings will be assigned throughout the semester. The assigned readings for each week, with links to the papers, can be found in the calendar on the course website. Each week, students should read at least one of the assigned readings carefully, go through the others as time allows, and be prepared to discuss their takeaways during lecture and lab.
Gradescope: https://www.gradescope.com/courses/833104 (by invite)
Course description:
Overview
Advances in information technology have made it possible to collect huge amounts of data in every walk of life and beyond. These vast amounts of data have enabled scientists, social scientists, government agencies, and companies to ask increasingly complex questions aimed at understanding the physical and human world, informing public policies, and improving productivity.
However, having data alone is not enough. Solving a domain problem or answering a domain question takes a principled statistics and machine learning investigation process, carried out in the context of and in combination with domain knowledge. This process includes problem formulation in context and with a domain goal outside statistics, data collection (ideally through careful experimental design), data cleaning, data visualization, algorithm/model development, validation, post-hoc analysis, and drawing conclusions in context. It rests on adequate domain knowledge, suitable computing platforms, appropriate choices of data cleaning schemes and scalable algorithms/models, and careful weighing of domain knowledge against information from the data to draw meaningful conclusions about the domain problem. That is, with both computation and narratives (domain knowledge) forming its foundation, the statistical and machine learning investigation process is one of rigorous evidence-seeking, aimed at trustworthy data-driven conclusions that are useful for the domain problem and accessible to domain experts.
The most impactful contributions are often made when domain experts (scientists, for example) work together with statisticians and machine learners to brainstorm and ask questions. These domain experts are not only key to formalizing the ideas; they are also integral in generating the data. Engaging with the individuals who collected the data in the first place allows the statistician and machine learner to learn about the context in which the data lives and, subsequently, to conduct an effective analysis capable of answering the question being asked.
Collaborative learning in context This course will demonstrate what it is like to be an applied statistician or data scientist in today's data-rich world. We emphasize the goal of answering questions outside of statistics or machine learning using data and domain knowledge, through working with domain experts. Through lectures, class discussions, data labs, and homework assignments, we illustrate the many steps involved in the iterative process of information extraction and evidence gathering that makes up a statistical investigation. Specifically, students will learn together and critically understand the technical topics of EDA (exploratory data analysis), prediction algorithms (e.g., least squares, random forests), identification of sources of randomness in data, probabilistic models (e.g., linear regression), inference, and interpretation. We ground our class on the concepts of reality, representation of reality, and mental construct to separate current data, algorithms/models, and future data in the context of domain knowledge. We discuss when and how to connect these three concepts in the entire process of data analysis. In particular, the PCS framework (workflow and documentation), based on the three principles of data science - predictability, computability, and stability (PCS) - will be employed as an overarching theme.
The lectures (and labs) will be based on real-data problems, and students will learn useful statistical concepts and methods in the contexts of these problems. The aim is to illustrate how judgment and common sense are crucial to the statistical investigation process. Moreover, we introduce the technical topics through a first-principles approach so that students gain the skills necessary to develop new techniques to solve problems in unfamiliar situations in the future.
The essential elements of applied statistics are captured in Bin’s piece entitled “Data Wisdom” (http://www.odbms.org/2015/04/data-wisdom-for-data-science/). Students are asked to read the piece after the first lecture.
Data lab format and peer-grading The data labs will be done individually, except for one group lab later in the semester. The goal of writing the lab report is not only to gain data analysis experience but also to practice communication. We ask that particular attention be given to the writing of the report, as your peers will be reading it: so that students can learn from one another, the labs will be peer-reviewed. Each student will review 2-3 labs from their peers and provide feedback and a grade based on several criteria, including clarity of writing, validity of analysis, and informativeness of visualizations. The final grade of each lab will be decided by the GSI, who will use the student grades as a guide.
Full commitment to the class is necessary Please be aware that this is a heavy-load class. If you are not sure that you can commit, please audit the class instead, since there are many students on the waitlist. Further, because class discussions are an integral part of the course, registered students are required to attend all classes unless excused by the instructor under justifiable circumstances.
Pre-requisites This class requires knowledge at the level of UC Berkeley's upper-division probability and mathematical statistics courses (Stat 134 and 135). In terms of computing, at a minimum you should be comfortable manipulating files in Unix, writing your own functions, manipulating and cleaning data, and creating and customizing graphics in Python. Ideally, students will already have basic fluency in Python as well as confidence using GitHub. While we will provide a short introduction to these topics in the labs, students who are entirely unfamiliar with these tools will need to put in additional work to meet the standards expected of the course.
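As a rough self-check, the short Python sketch below illustrates the kind of routine data cleaning and plotting the labs assume from the start; the file name and the age column are hypothetical. If code like this feels comfortable, your computing background is likely adequate.

import pandas as pd
import matplotlib.pyplot as plt

def load_and_clean(path):
    # Read a CSV, standardize column names, and drop rows missing 'age'.
    df = pd.read_csv(path)
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df.dropna(subset=["age"])

df = load_and_clean("measurements.csv")  # hypothetical file
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(df["age"], bins=30, color="steelblue", edgecolor="white")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Count")
ax.set_title("Distribution of age")
fig.tight_layout()
fig.savefig("age_histogram.png", dpi=150)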
Tentative list of topics:
Getting to know each other. Overview of the course. PCS (workflow and documentation) as the guiding framework for the course, where PCS stands for predictability, computability and stability. (Aug. 29)
- Aug. 30: Lab 0 assigned
PCS: reality check and stability through appropriate data and algorithm perturbations. Domain question to answer (often about the future), relevant data collection and cleaning, EDA. (Sept. 3, 5)
- Sept. 6: Lab 0 due (midnight); Lab 1 assigned (on ER PECARN data cleaning)
Video guest lecture by Aaron Kornblith (UCSF) on data collection case study 1 (pediatric ER); the ER PECARN data problem will be investigated throughout the semester in the labs. Unsupervised learning (clustering): reality check and stability considerations through appropriate data and algorithm perturbations. (Sept. 10, 12)
Unsupervised learning (PCA, NMF, and auto-encoders) and prediction problems: reality check and stability considerations through appropriate data and algorithm perturbations. (Sept. 17, 19)
- Sept. 20: Lab 1 due (midnight); Lab 1 peer review assigned; Lab 2 assigned (on clustering linguistic data)
Least Squares (LS). 3-way data split: the test set as the best proxy for future data. Cross-validation. Regularized LS: model selection, forward selection, L2boosting. (Sept. 24 (Zoom lecture by Bin), 26 (video lecture by Stark on data collection for election auditing))
- Sept. 27: Lab 1 peer review due (midnight)
Lasso and Ridge. (Oct. 1, 3)
- Oct. 4: Lab 2 due (midnight); Lab 2 peer review assigned; Lab 3 assigned (computing and evaluating the stability of k-means)
Weighted LS (WLS). Binary classification through WLS and the logistic algorithm. Prediction with uncertainty measures. Calibration and evaluation/scrutiny of results. Sources of randomness. Simple random sampling. Density estimation and generative DL. (Oct. 8, 10)
- Oct. 11: Lab 2 peer review due (midnight)
EM, Neyman-Rubin model for A/B testing, and linear regression models (Oct. 15, 17)
- Oct. 18: Lab 3 due (midnight); Lab 3 peer review assigned
Logistic Regression Model and review. (Oct. 22)
Midterm Oct. 24 (in class) (Bin will be traveling)
- Oct. 25: Lab 3 peer review due (midnight); Lab 4 assigned (group project)
Exponential family. GLMs, IRWLS, model checking through calibration. (Oct. 29, 31)
Drawing conclusions from linear regression and logistic regression models through data and algorithm perturbations (or PCS inference). Interpretation of data results. Hypothesis testing, sequential testing, and multiple hypothesis testing. (Nov. 5, 7).
Supervised DL and other advanced topics (RF, kernel methods) (Nov. 12, 14)
- Nov. 15: Lab 4 due (midnight); Lab 4 peer review assigned; Final project assigned
AIC/BIC, e-L2boosting, Lasso theory; tree-based methods (Nov. 19)
- Nov. 22: Lab 4 peer review due (midnight)
Interpretable ML; kernel methods (Nov. 22, lecture + lab)
PCS revisited (Nov. 26)
Last lecture (guest lecture): Tuesday, Dec. 3; no lecture on Thursday. Last lab session: Friday, Dec. 6.
No in-class final exam, but there is a final project.
Final Project Due: Dec. 13 (midnight).