Syllabus

Statistics 215A, Fall 2024

Instructor: Professor Bin Yu, binyu at berkeley.edu
Lectures: T/Th: 12:30 pm - 2:00 pm (332 Evans)
Office hours: 409 Evans: Tuesday 10:00-11:00, Wednesday 11:00-12:00.

Discussion: Friday: 9:00 am - 11:00 am (342 Evans)
GSI: Anthony Ozerov (ozerov at berkeley.edu); the GSI will be in charge of the discussion sessions, Ed Discussion (replacing Piazza), and the labs/homework.
GSI Office hours: Evans 444: Monday 14:00-15:00, Wednesday 12:00-13:00, Thursday 16:00-17:00.
Text books:

Comments, Suggestions, Gripes: Email or talk to the instructor or the GSI before or after lectures.
Ed Discussion: Questions and discussion about course material, HWs, and labs can be posted on the Ed Discussion page (accessed on bCourses). The GSI will regularly monitor this to ensure all questions are answered in a timely manner, but students are encouraged to help their classmates as well. Please think carefully before asking questions specifically about the projects. For example, questions about how to do something specific in R are fine, but questions asking what other people did for their analysis are not. Questions asking for clarification are fine.
bCourses: https://bcourses.berkeley.edu/courses/1537926
Course website: https://stat215a.berkeley.edu/fall-2024/ (with up-to-date schedule)
Grading:

Assignments: There will be 4 or 5 assignments, given out on Friday in the discussion session and usually due two weeks later (any exception will be announced). The assignments require two full weeks of work to complete satisfactorily, so a very early start is essential. The assignments contain homework problems and data analysis labs. For the data labs, each student will produce a report (12 pages maximum) presenting a narrative that connects the motivating questions, the analysis conducted, and the conclusions drawn. The labs will be completed in Python (optionally some parts in R, especially for visualization). The reports will be made using Jupyter Notebooks or pure LaTeX, and the final PDF output should not contain any code whatsoever. Each report will be hosted in a GitHub repository containing both the code and the written report, and an automated script will pull the submissions at 12:00am on the due date (i.e., midnight of the night before our Friday discussion section). HW submissions will be made to Gradescope. No late assignments will be accepted, for any reason.
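One possible way to produce a code-free PDF from a Jupyter Notebook is nbconvert's `--no-input` option, which leaves code cell inputs out of the exported document. The sketch below is only an illustration, not an official course tool: the helper name `build_nbconvert_command` and the file name `lab1_report.ipynb` are hypothetical, and the PDF export assumes that Jupyter and a LaTeX distribution are installed locally.

```python
import shutil
import subprocess
from pathlib import Path


def build_nbconvert_command(notebook: str) -> list[str]:
    """Build a jupyter-nbconvert command that renders a notebook to PDF without code inputs.

    Hypothetical helper for illustration; it only assembles the command.
    """
    return [
        "jupyter", "nbconvert",
        "--to", "pdf",   # PDF export relies on a local LaTeX installation
        "--no-input",    # leave code cell inputs and prompts out of the output
        str(Path(notebook)),
    ]


if __name__ == "__main__":
    cmd = build_nbconvert_command("lab1_report.ipynb")  # hypothetical file name
    print(" ".join(cmd))
    # Attempt the export only when jupyter is on the PATH and the notebook exists.
    if shutil.which("jupyter") and Path(cmd[-1]).exists():
        subprocess.run(cmd, check=True)
```

Cell outputs such as figures and tables are still rendered, so it is worth checking the resulting PDF by eye before the submission deadline.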
Readings: There will be a number of readings assigned throughout the semester. The assigned readings for each week and links to the papers can be found in the calendar on the course website. Students should read at least one paper or reading carefully and go through other readings (as time allows) each week and be prepared to discuss their takeaways during the lecture and lab.
Gradescope: https://www.gradescope.com/courses/833104 (by invite)
Course description:

Overview

Advances in information technology have made it possible to collect huge amounts of data in every walk of life and beyond. These vast amounts of data have enabled scientists, social scientists, government agencies, and companies to ask increasingly complex questions aimed at understanding the physical and human world, informing public policies, and improving productivity.

However, having data alone is not enough. To be able to solve a domain problem or answer a domain question, it takes a principled statistics and machine learning investigation process in the context of and in combination with domain knowledge. This process includes problem formulation in context and with a domain goal outside statistics, data collection (ideally through careful experimental design), data cleaning, data visualization, algorithm/model development, validation, post-hoc analysis, and drawing conclusions in context. It rests on adequate domain knowledge, suitable computing platforms, appropriate choices of data cleaning schemes and scalable algorithms/models, and careful consideration of domain knowledge and information from the data to draw meaningful conclusions about a domain problem. That is, with both computation and narratives (domain knowledge) forming its foundation, the statistical and machine learning investigation process is one of rigorous evidence-seeking to make trustworthy data-driven conclusions that are useful for the domain problem and accessible by the domain experts.

The most impactful contributions are often made when domain experts (scientists, for example) and statisticians and machine learners work together to brainstorm and ask questions. These domain experts are not only key to formalizing the ideas, but they also are integral in generating the data. Engaging with the individuals who collected the data in the first place allows the statistician and machine learner to learn about the context in which the data lives, and subsequently, to conduct an effective analysis capable of answering the question being asked.
Collaborative learning in context

This course will demonstrate what it is like to be an applied statistician or data scientist in today’s data-rich world. We emphasize the goal of answering questions outside of statistics or machine learning using data and domain knowledge through working with domain experts. We illustrate, through lectures, class discussions, data labs, and homework assignments, the many steps involved in the iterative process of information extraction and evidence gathering in a statistical investigation. Specifically, students will learn together and critically understand the technical topics of EDA (exploratory data analysis), prediction algorithms (e.g. least squares, random forests), identification of sources of randomness in data, probabilistic models (e.g. linear regression), inference, and interpretation. We ground our class on the concepts of reality, representation of reality, and mental construct to separate current data, algorithms/models, and future data in the context of domain knowledge. We discuss when and how to connect these three concepts in the entire process of data analysis. In particular, the PCS framework (workflow and documentation), based on the three principles of data science (predictability, computability, and stability), will be employed as an overarching theme.

The lectures (and labs) will be based on real-data problems, and students will learn useful statistical concepts and methods in the contexts of these problems. The aim is to illustrate how judgment and common sense are crucial to the statistical investigation process. Moreover, we introduce the technical topics through a first-principles approach so that students gain the skills necessary to develop new techniques to solve problems in unfamiliar situations in the future.

The essential elements of applied statistics are captured in Bin’s piece entitled “Data Wisdom” (http://www.odbms.org/2015/04/data-wisdom-for-data-science/). Students are asked to read the piece after the first lecture.
Data lab format and peer-grading

The data labs will be done individually, except for one group lab later in the semester. Writing the lab report is meant not only to build data analysis experience but also to serve as an exercise in communication. We ask that particular attention be given to the writing of the report, as your peers will be reading it: so that students can learn from one another, the labs will be peer-reviewed. Each student will review 2-3 labs from their peers and will provide feedback and a grade based on several criteria, including clarity of writing, validity of analysis, and informativeness of visualizations. The final grade of each lab will be decided by the GSI, who will use the student grades as a guide.
Full commitment to the class is necessary

Please be aware that this is a heavy-load class. If you are not sure that you can commit, please audit the class instead since there are many students on the waitlist. Further, because class discussions are an integral part of the course, registered students are required to attend all classes unless permitted by the instructor under justifiable circumstances.
Pre-requisites

In this class, we require knowledge at the level of the upper division mathematical statistics and probability courses (Stat 134 and 135) at UC Berkeley. In terms of computing, at a minimum you should be comfortable manipulating files in Unix, writing your own functions, manipulating and cleaning data, and creating and customizing graphics in Python. Ideally, students will already have a basic fluency in Python as well as confidence using GitHub. While we will be providing a short introduction to these topics in the labs, students who are entirely unfamiliar with these tools will need to put in additional work to ensure that they meet the standards expected of the course.
Tentative list of topics:

Final Project Due: Dec. 13 (midnight).