Course Overview

This course provides hands-on experience developing and deploying foundational machine learning algorithms on real-world datasets for practical applications, including predicting housing prices, document retrieval, product recommendation, and image classification using deep learning. Students will learn the machine learning pipeline end to end, including dataset creation, pre- and post-processing, preparation for machine learning, and training and evaluating multiple models. Students will focus on real-world challenges at each stage of the ML pipeline while handling bias in models and datasets.

Course Instructors

Prof. Angelique Taylor

  • Instructor
  • Office hours: Mondays from 2:15 PM to 3:15 PM, Bloomberg 61X
  • Email: amt298@cornell.edu

Jonathan Segal

  • Teaching Assistant
  • Office hours: Wednesday from 3-4 PM
  • Email: jis62@cornell.edu

Adnan Al Armouti

  • Teaching Assistant
  • Office hours: Friday from 11 AM – 12 PM
  • Email: aa2546@cornell.edu

Marianne Arriola

  • Teaching Assistant
  • Office hours: Monday from 4-5 PM
  • Email: ma2238@cornell.edu

Stella Hong

  • Grader
  • Email: sh2577@cornell.edu

Jacky He

  • Grader
  • Email: ph474@cornell.edu

Yibei Li

  • Grader
  • Email: yl3692@cornell.edu

Course Outcomes

After this course, students will be able to:

  • Prepare datasets for an ML task, and train and evaluate ML models
  • Understand core challenges of dataset creation, including handling missing data and bias, among others
  • Visualize features in datasets to be used for ML tasks 
  • Apply, analyze, and identify key differences in regression, classification, clustering, and deep learning algorithms 
  • Evaluate model quality using appropriate metrics of performance 
  • Build front- and back-end ML pipelines for analysis of ML performance and tools for ML practitioners.
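
As a rough illustration of the first and fifth outcomes (preparing a dataset, training a model, and evaluating it with a metric), the sketch below fits a one-feature linear regression by ordinary least squares and reports root-mean-square error on held-out rows. It uses only the Python standard library rather than the course libraries. The housing numbers and the names `fit_line` and `rmse` are made up for illustration and are not course materials.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: (square feet, price in thousands of dollars).
# None marks a missing price.
raw = [
    (800, 160.0), (1000, 205.0), (1200, None), (1500, 295.0),
    (1800, 365.0), (2000, 398.0), (2400, 482.0),
]

# Dataset preparation: drop rows whose target value is missing.
clean = [(x, y) for x, y in raw if y is not None]

# Hold out the last two rows for evaluation and train on the rest.
train, test = clean[:-2], clean[-2:]

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
predictions = [slope * x + intercept for x, _ in test]
test_rmse = rmse([y for _, y in test], predictions)

print(f"slope={slope:.4f}, intercept={intercept:.2f}, test RMSE={test_rmse:.2f}")
```

Dropping incomplete rows is only one way to handle missing data; imputing a value is another, and which is appropriate depends on the dataset.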

Course Format

Lectures will cover these topics: 

  • Dataset Curation
  • Building an End-to-End ML Pipeline
  • Regression for Predicting Housing Prices 
  • Clustering for Document Retrieval 
  • Classification for Product Recommendation 
  • Deep Learning for Image Search 

Guest lectures will be given by experts working in AI and ML.

Prerequisites

CS 2800 or equivalent, linear algebra, probability, and experience programming with Python, or permission of the instructor.

Book

Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc., 2022.

Readings

  • Ameisen, Emmanuel. Building Machine Learning Powered Applications: Going from Idea to Product. O’Reilly Media, Inc., 2020.
  • Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc., 2022.
  • Jigyasa Grover, Rishabh Misra, Julian McAuley, Laurence Moroney, Mengting Wan (Foreword). “Sculpting Data for ML: The first act of Machine Learning.”

Grading

Course grades are based on homework, class participation, and the final project, weighted as follows:

  • Homeworks – 40%
  • Class participation – 10%
  • Final Project – 50%
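
The weighting above amounts to a weighted sum. The sketch below shows the arithmetic, assuming each component has already been reduced to a single percentage from 0 to 100. The syllabus does not say how individual homework scores are combined or how percentages map to letter grades, so neither is modeled here. The function name `course_score` and the example scores are hypothetical.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {"homework": 0.40, "participation": 0.10, "final_project": 0.50}


def course_score(scores):
    """Combine per-component percentages (0-100) into one weighted percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student: 0.40 * 90 + 0.10 * 100 + 0.50 * 80
example = {"homework": 90.0, "participation": 100.0, "final_project": 80.0}
print(round(course_score(example), 2))
```

Because the final project carries half the weight, a 10-point change there moves the course score by 5 points, compared with 4 points for the same change in the homework average.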

Libraries

  • Scikit-Learn is a free, easy-to-use library that efficiently implements many machine learning algorithms, making it a great entry point for learning ML.
  • Streamlit is an open-source Python framework for quick web application development with no front-end experience required.
  • Python’s main scientific libraries, in particular NumPy, pandas, and Matplotlib.

Frameworks & Tools

  • Google Colab – a browser-based environment for writing and executing Python code, with access to free GPUs, for ML and data analysis. Used for deep learning topics.
  • GitHub – a platform for hosting Git repositories that allows developers to create, store, and manage their code.
  • Git – a distributed version control system that tracks changes in any set of computer files, usually used to coordinate work among programmers who are collaboratively developing source code.
  • Visual Studio Code – a source code editor with support for debugging, syntax highlighting, intelligent code completion, snippets, code refactoring, and embedded Git.
  • pip – a package-management system written in Python, used to install and manage software packages.
  • Anaconda – a distribution of Python for scientific computing that simplifies package management and deployment.
  • Jupyter Notebook – a web-based interactive development environment for notebooks, code, and data.

Attendance

Students are expected to attend lectures, participate in discussions, and complete assignments to be successful in this course. If you miss a lecture due to illness or an emergency, refer to the recorded lectures to review what you missed. Seek help early and often by attending office hours to avoid delays in completing assignments.

If you miss a substantial number of classes due to an ongoing illness, please contact Student Disability Services to arrange accommodations and inform the instructor.

Course Schedule

Week # | Lecture (M) | Lecture (W)
1: Week of 1/20 | No Class | Lecture 1: Introduction to PAML (Homework 0)
2: Week of 1/27 | Lecture 2: Revisit Preliminaries | Lecture 3: Regression for Predicting Housing Prices (Homework 0 DUE, Homework 1 Assigned)
3: Week of 2/3 | Lecture 4: Regression for Predicting Housing Prices | Lecture 5: Regression for Predicting Housing Prices
4: Week of 2/10 | Lecture 6: Regression for Predicting Housing Prices | Lecture 7: Introduction to Classification (Homework 1 DUE, Homework 2 Assigned)
5: Week of 2/17 | No Class – February Break | Lecture 8: Classification for Product Recommendation
6: Week of 2/24 | Lecture 9: Classification for Product Recommendation | Lecture 10: Classification for Product Recommendation
7: Week of 3/3 | Lecture 11: Introduction to Clustering (Homework 2 DUE, Homework 3 Assigned) | Lecture 12: Introduction to Clustering
8: Week of 3/10 | Lecture 13: Clustering for Document Retrieval | Lecture 14: Clustering for Document Retrieval
9: Week of 3/17 | Lecture 15: Introduction to Deep Learning (Homework 3 DUE, Homework 4 Assigned) | Lecture 16: Introduction to Deep Learning
10: Week of 3/24 | Lecture 17: Introduction to Deep Learning | Lecture 18: Final Project (FP) Discussion (FP Proposal Assigned)
11: Week of 3/31 | No Class – Spring Break | No Class – Spring Break
12: Week of 4/7 | Lecture 19: Guest Lecture – Allison Koenecke from Cornell University (Homework 4 DUE) | Lecture 20: Guest Lecture – Nikhil Garg from Cornell Tech (FP Proposal DUE Friday)
13: Week of 4/14 | Lecture 21: Guest Lecture – Emma Pierson from Berkeley | Lecture 22: Guest Lecture – Sarah Dean from Cornell University
14: Week of 4/21 | Lecture 23: FP Midpoint Report (Submit via Gradescope) | Lecture 24: Guest Lecture – Angelina Wang from Princeton University
15: Week of 4/28 | Lecture 25: Guest Lecture – Rajalakshmi Nandakumar from Cornell Tech | Lecture 26: Guest Lecture – Kilian Weinberger from Cornell University
16: Week of 5/5 | Last Day of Instruction; Final Project Presentation DUE | No Class; Final Project Report & Code DUE
17: Week of 5/12 | No Class | No Class

Integrity

This course follows Cornell’s policies on academic integrity as outlined in the Academic Integrity Handbook.

Inclusivity

Students are expected to treat their classmates and course staff with respect. Individuals of all cultural backgrounds, genders, and sexual orientations are welcome here. Students who encounter incidents that violate this expectation are encouraged to inform the instructors so the issues can be addressed in a timely manner (see Cornell’s Computer Science Community Statement of Values of Inclusion).

Late Policy

Students have 6 late days per semester, for homework submissions only (maximum of 2 per assignment). After that, the grade drops one letter grade per day late. No exceptions.
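
The sketch below works through one possible reading of the late-day rule above; it is not an official grading script, so confirm edge cases with the course staff. It assumes each homework can absorb up to 2 late days from whatever remains of the 6-day semester budget, and that every further day late on that homework counts as one letter-grade drop. The function name `apply_late_days` and the example submission history are hypothetical.

```python
SEMESTER_LATE_DAYS = 6          # total late days per semester (from the policy)
MAX_LATE_DAYS_PER_HOMEWORK = 2  # cap per assignment (from the policy)


def apply_late_days(days_late_per_homework):
    """Return one (late_days_used, penalty_days) pair per homework.

    penalty_days is the number of days not covered by late days; under the
    assumed reading, each one costs a letter grade.
    """
    remaining = SEMESTER_LATE_DAYS
    results = []
    for days_late in days_late_per_homework:
        used = min(days_late, MAX_LATE_DAYS_PER_HOMEWORK, remaining)
        remaining -= used
        results.append((used, days_late - used))
    return results


# Hypothetical semester: days late for Homework 0 through Homework 4.
history = [1, 3, 2, 0, 2]
for number, (used, penalty) in enumerate(apply_late_days(history)):
    print(f"Homework {number}: late days used={used}, penalty days={penalty}")
```

In this example the second homework is three days late, so one day falls outside the per-assignment cap, and the last homework finds only one late day left in the budget.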

Students have 1 week after assignments are returned to make a regrade request (no exceptions). To request a regrade, email Prof. Taylor, the TAs, and the graders.

Collaboration Policy and Honor Code

You are expected to work on homework assignments in groups. You are expected to write homework write-ups, code, and reports from scratch, and you must acknowledge in your submission all the students you worked with and their contributions to the project using Peer Assessment. The following are considered to be honor code violations:

  • Looking at the write-up or code of another student outside your team.
  • Showing your write-up or code to another student outside your team.
  • Writing code for homework assignments in such detail that your solution is almost identical to another team’s answer.
  • Uploading your write-up or code to a public repository (e.g., GitHub) so that it can be accessed by other student groups.

When debugging code together, you are only allowed to look at the input-output behavior of each other’s programs (so you should write good test cases!). Remember that even if you did not copy anything but only gave your solution to a student outside your team, you are still violating the honor code, so be careful.
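
One way to compare input-output behavior without exchanging code is to agree on a table of inputs and expected outputs and have each team run it privately against their own implementation. The sketch below shows the idea. The function `min_max_scale`, the test table, and the helper `run_tests` are hypothetical examples, not part of any assignment.

```python
def min_max_scale(values):
    """Hypothetical homework helper: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Shared input/expected-output pairs; these contain no implementation details.
TEST_CASES = [
    ([0, 5, 10], [0.0, 0.5, 1.0]),
    ([3, 3, 3], [0.0, 0.0, 0.0]),
    ([-2, 2], [0.0, 1.0]),
]


def run_tests(fn, tolerance=1e-9):
    """Return (input, expected, actual) for every case where fn disagrees."""
    failures = []
    for inputs, expected in TEST_CASES:
        actual = fn(inputs)
        same_length = len(actual) == len(expected)
        if not same_length or any(
            abs(a - e) > tolerance for a, e in zip(actual, expected)
        ):
            failures.append((inputs, expected, actual))
    return failures


print(run_tests(min_max_scale))
```

Sharing only which cases fail, rather than the code that produced them, keeps the discussion at the level of program behavior.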

Accessibility

We are happy to accommodate all students in terms of accessibility. Please contact the course instructors when you need help. Furthermore, the Office of Student Disability Services has available resources. 

Use of Generative AI

Generative AI (Artificial Intelligence) is now widely available to produce text, images, and other media. Our goal as a community of learners is to explore and understand how these tools may be used to augment human performance. However, keep the following three principles in mind: (1) An AI cannot pass this course; (2) AI contributions must be attributed and true; (3) The use of AI resources must be open and documented. 

Generative Artificial Intelligence (AI) models, including ChatGPT, are prohibited in this course.

Failure to document your use of AI tools, as well as any plagiarism, even inadvertently, from the use of AI tools (such as quotations or information that are not properly attributed) constitutes academic misconduct and may be referred to the Center for Teaching Innovation. You can find more information about using generative AI and access a secure AI tool provided by Cornell at https://teaching.cornell.edu/generative-artificial-intelligence.