Course Description
This course provides hands-on experience developing and deploying foundational machine learning algorithms on real-world datasets for practical applications (e.g., predicting housing prices, document retrieval, and product recommendation among others). Students will learn about the machine learning pipeline end-to-end including dataset creation, pre- and post-processing, annotation, annotation validation, preparation for machine learning, training and testing a model, and evaluation. Students will focus on real-world challenges at each stage of the ML pipeline while handling bias in models and datasets. Lastly, students will analyze the strengths and weaknesses of regression, classification, clustering, and deep learning algorithms.
Course Instructors
Prof. Angelique Taylor
- Instructor
- Office hours: Mondays from 4:00PM-5:00PM, Bloomberg 262
- Email: amt298@cornell.edu
Tauhid Tanjim
- Teaching Assistant
- Email: tt485@cornell.edu
- Office hours: Tuesdays from 2:30PM-3:30PM
Jinzhao Kang
- Grader
- Email: jk2575@cornell.edu
- Office hours: Wednesdays from 3:00PM-4:00PM
Kathryn Gdula
- Grader
- Email: kg435@cornell.edu
- Office hours: Thursdays from 10:00AM-11:00AM
Course Outcomes
- Collect a new dataset and prepare it for a ML task, train a model, and evaluate it
- Apply regression, classification, clustering, and deep learning algorithms to practical applications
- Analyze and identify key differences in regression, classification, clustering, and deep learning algorithms
- Understand core challenges of dataset creation, including missing data, bias, and unlabeled data, among others
- Represent features in datasets to be used for ML tasks
- Evaluate model quality using appropriate metrics of performance
- Build front- and back-end ML pipelines for analysis of ML performance and tools for users.
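As a rough illustration of the first outcomes above (prepare a dataset, train a model, evaluate it), here is a minimal sketch using only the Python standard library. The synthetic housing data, the single square-footage feature, the 80/20 split, and the helper names `fit_line` and `rmse` are all assumptions made for illustration; they are not course materials, and course assignments are expected to use the libraries listed under Frameworks, Libraries, & Tools.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the data and the split are reproducible

# Hypothetical synthetic dataset: price = 50 + 0.2 * square_feet + noise
data = []
for _ in range(200):
    sqft = random.uniform(500, 3000)
    price = 50 + 0.2 * sqft + random.gauss(0, 10)
    data.append((sqft, price))

# Hold out 20% of the examples for testing
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit_line(points):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


def rmse(points, slope, intercept):
    """Root mean squared error of the fitted line on the given points."""
    return mean((y - (slope * x + intercept)) ** 2 for x, y in points) ** 0.5


slope, intercept = fit_line(train)   # train on the training split only
test_rmse = rmse(test, slope, intercept)  # evaluate on held-out data
print(f"slope={slope:.3f} intercept={intercept:.1f} test RMSE={test_rmse:.1f}")
```

Evaluating on the held-out split rather than the training split is the point of the sketch: the reported error then reflects data the model has not seen.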
Course Format
- Dataset Curation
- Building an End-to-End ML Pipeline
- Regression for Predicting Housing Prices
- Clustering for Document Retrieval
- Classification for Product Recommendation
- Deep Learning for Image Search
Lectures are held on Mondays from 1:00PM to 2:15PM ET in Bloomberg 61X and focus on hands-on collaborative coding in teams to build end-to-end ML pipelines.
Guest lectures will be given by experts working in related fields including data curation, AI, ML, and computer vision.
Prerequisites
CS 2800 or equivalent, linear algebra, probability, and experience programming with Python, or permission of the instructor.
Reading
Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc., 2022.
Grading
Final grades are based on homeworks, reading quizzes, class participation, and a final project, weighted as follows:
- Homeworks – 35%
- Reading Quizzes – 15%
- Class participation – 10%
- Final Project – 40%
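The weights above can be combined into a single score as in the following sketch. The weights come from the list above; the 0-100 scale for each component, the function name `weighted_score`, and the example scores are assumptions for illustration. How the course maps a numeric score to a letter grade is not specified here.

```python
# Weights from the grading breakdown (percent of the final grade).
WEIGHTS = {
    "homeworks": 35,
    "reading_quizzes": 15,
    "class_participation": 10,
    "final_project": 40,
}


def weighted_score(scores):
    """Combine per-component scores (assumed 0-100 each) into one 0-100 score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student scores, for illustration only.
example = {
    "homeworks": 90,
    "reading_quizzes": 80,
    "class_participation": 100,
    "final_project": 85,
}
print(weighted_score(example))  # prints 87.5
```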
Summary of Course Topics
- Dataset Curation
- Building an End-to-End ML Pipeline
- Regression for Predicting Housing Prices
- Clustering for Document Retrieval
- Classification for Product Recommendation
- Deep Learning for Image Search
- Ethics in ML
Frameworks, Libraries, & Tools
- Scikit-Learn is a free and easy-to-use library that implements many Machine Learning algorithms efficiently making it a great entry point for learning ML.
- TensorFlow is an end-to-end ML framework created at Google for processing and loading data for ML, building ML models, utilizing pre-trained models, deploying, and implementing large-scale ML applications for production.
- Keras is a high-level Deep Learning API that makes it very simple to train and run neural networks. It can run on top of TensorFlow 2, Theano, or Microsoft Cognitive Toolkit (formerly known as CNTK).
- Streamlit is an open-source framework in Python for quick web application development with no front-end experience required.
- This course uses Python’s main scientific libraries—in particular, NumPy, pandas, and Matplotlib.
Final Projects
In final projects, students will propose a new application that addresses a real-world problem and provides a front- and back-end solution to users for well-justified use cases. Students will select an application area (e.g., healthcare, social media), search for or collect a dataset to address a problem, build an end-to-end ML pipeline, evaluate the algorithms using standard metrics, create visualization tools to analyze ML performance, create a front- and back-end application, and launch and share it.
Course projects will be done in groups of up to 3 students and can fall into one or more of the following categories:
- Application of machine learning to a practical problem of your choice.
- Improvements to machine learning algorithms.
- Comparison of two or more machine learning methods on benchmarks.
- Analysis of machine learning models.
Pick a topic that’s meaningful to you and that excites you. For example, if you do PhD research in robotics, you can do a project related to a research problem that you’re working on. If you’re in Urban Tech, you can work with a city dataset that you find interesting. Students are encouraged to find something on their own. Feel free to talk to the teaching team during office hours.
Students will present their final project and write a technical report consistent with industry standards.
Attendance
Students are expected to attend lectures and participate in discussion as well as assignments to be successful in this course. If you miss a lecture due to an illness or emergency, refer to the recorded lectures to review what you missed.
Seek help early and often to avoid delays in feedback when issues come up while completing assignments.
If you miss a substantial number of classes due to an ongoing illness, please contact Student Disability Services to arrange accommodations and inform the instructor.
Integrity
This course follows Cornell’s policies on academic integrity as outlined in the Academic Integrity Handbook.
Inclusivity
Students are expected to treat their classmates and course staff with respect. Individuals of all cultural backgrounds, genders, and sexual orientations are welcome here. Students who encounter incidents that violate these expectations are encouraged to inform the instructors so the issues can be addressed in a timely manner (see Cornell’s Computer Science Community Statement of Values of Inclusion).
Accessibility
We are happy to accommodate all students in terms of accessibility. Please contact the course instructors when you need help. Furthermore, the Office of Student Disability Services has available resources.
Late Policy
Students have 6 late days for the semester (at most 2 per assignment) to use on assignment submissions, including homework and the final project. After that, the grade drops by one letter grade per day late.
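The sketch below works through one possible reading of this policy: late days cover up to 2 days of lateness on an assignment while the semester budget lasts, and every uncovered day costs one letter grade. That reading, the function name `apply_late_days`, and the example inputs are assumptions for illustration; the teaching team's interpretation governs.

```python
LATE_DAY_BUDGET = 6      # total late days per semester (from the policy)
MAX_PER_ASSIGNMENT = 2   # late days usable on a single assignment (from the policy)


def apply_late_days(days_late, budget_left):
    """Return (late_days_used, letter_grades_dropped, budget_left_after).

    Assumed reading: late days are applied first, capped per assignment and
    by the remaining budget; each remaining late day drops one letter grade.
    """
    if days_late < 0 or budget_left < 0:
        raise ValueError("days_late and budget_left must be non-negative")
    used = min(days_late, MAX_PER_ASSIGNMENT, budget_left)
    dropped = days_late - used
    return used, dropped, budget_left - used


print(apply_late_days(1, LATE_DAY_BUDGET))  # (1, 0, 5): fully covered
print(apply_late_days(3, LATE_DAY_BUDGET))  # (2, 1, 4): third day is uncovered
print(apply_late_days(2, 1))                # (1, 1, 0): budget runs out
```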
Assignments are due on Gradescope, with deadlines as follows:
- Reading Quizzes: Covers course topics in reading assignments and will be made available before being discussed in lecture.
- Homeworks: Coding assignments on course topics, assigned the week they are discussed in lecture and DUE after 2 weeks.
- Final Project Proposal: Propose an FP on a PAML; DUE on April 28th @ 5PM.
- Final Project Midpoint Report: Provide updates on project deliverables on May 3rd during the class session.
- Final Project Presentation: Present FP in class for 15 minutes with a 3-minute Q&A; scheduled on May 15th @ 12PM.
Students have 1 week after assignments are returned to make a regrade request (no exceptions). To request a regrade, send an email to Prof. Taylor, Jinzhao Kang, and Kathryn Gdula.
Collaboration Policy and Honor Code
You are free to form study groups and discuss homeworks and projects. However, you must write up homeworks and code from scratch independently, and you must acknowledge in your submission all the students you discussed with. The following are considered to be honor code violations:
- Looking at the writeup or code of another student.
- Showing your writeup or code to another student.
- Discussing homework problems in such detail that your solution (writeup or code) is almost identical to another student’s answer.
- Uploading your writeup or code to a public repository (e.g., GitHub) so that it can be accessed by other students.
When debugging code together, you are only allowed to look at the input-output behavior of each other’s programs (so you should write good test cases!). It is important to remember that even if you didn’t copy but just gave another student your solution, you are still violating the honor code, so please be careful.