diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 906e9239..34324005 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -22,7 +22,7 @@ jobs: - name: Generate main page run: | cp .assets/template.html _site/index.html - sed 's/\](https:\/\/mlip-cmu.github.io\/s2023\/slides\/\([^\(]*\)\.html) (\[/\](https:\/\/mlip-cmu.github.io\/s2023\/slides\/\1.html) ([md](https:\/\/github.com\/mlip-cmu\/s2023\/blob\/main\/lectures\/\1.md), [pdf](https:\/\/mlip-cmu.github.io\/s2023\/slides\/\1.pdf), \[/' < schedule.md | sed 's/\](https:\/\/mlip-cmu.github.io\/s2023\/slides\/\([^\(]*\)\.html) *|/\](https:\/\/mlip-cmu.github.io\/s2023\/slides\/\1.html) ([md](https:\/\/github.com\/mlip-cmu\/s2023\/blob\/main\/lectures\/\1.md), [pdf](https:\/\/mlip-cmu.github.io\/s2023\/slides\/\1.pdf)) |/' > schedule_.md + sed 's/\](https:\/\/mlip-cmu.github.io\/s2024\/slides\/\([^\(]*\)\.html) (\[/\](https:\/\/mlip-cmu.github.io\/s2024\/slides\/\1.html) ([md](https:\/\/github.com\/mlip-cmu\/s2024\/blob\/main\/lectures\/\1.md), [pdf](https:\/\/mlip-cmu.github.io\/s2024\/slides\/\1.pdf), \[/' < schedule.md | sed 's/\](https:\/\/mlip-cmu.github.io\/s2024\/slides\/\([^\(]*\)\.html) *|/\](https:\/\/mlip-cmu.github.io\/s2024\/slides\/\1.html) ([md](https:\/\/github.com\/mlip-cmu\/s2024\/blob\/main\/lectures\/\1.md), [pdf](https:\/\/mlip-cmu.github.io\/s2024\/slides\/\1.pdf)) |/' > schedule_.md sed -i -e '/^\[Schedule\]/r schedule_.md' README.md npx marked -i README.md >> _site/index.html cat .assets/template_end.html >> _site/index.html diff --git a/README.md b/README.md index 98093772..ca895dcf 100644 --- a/README.md +++ b/README.md @@ -1,20 +1,18 @@ # Machine Learning in Production (17-445/17-645/17-745) / AI Engineering (11-695) -### Spring 2023 +### Spring 2024 -CMU course that covers how to build, deploy, assure, and maintain products with machine-learned models. Includes the entire lifecycle from a prototype ML model to an entire system. Covers also **responsible AI** (safety, security, fairness, explainability) and **MLOps**. The course is crosslisted both as **Machine Learning in Production** and **AI Engineering**. For earlier offerings see websites for [Fall 2019](https://ckaestne.github.io/seai/F2019), [Summer 2020](https://ckaestne.github.io/seai/S2020), [Fall 2020](https://ckaestne.github.io/seai/F2020/), [Spring 2021](https://ckaestne.github.io/seai/S2021/)  [Spring 2022](https://ckaestne.github.io/seai/S2022/), and [Fall 2022](https://ckaestne.github.io/seai/F2022/). This Spring 2023 offering is designed for students with some data science experience (e.g., has taken a machine learning course, has used sklearn) and basic programming skills (e.g., basic Python programming with libraries, can navigate a Unix shell), but will not expect a software engineering background (i.e., experience with testing, requirements, architecture, process, or teams is not required). Going forward we expect to offer this course at least every spring semester and possibly some fall semesters (not summer semesters). +CMU course that covers how to build, deploy, assure, and maintain software products with machine-learned models. Includes the entire lifecycle from a prototype ML model to an entire system deployed in production. Covers also **responsible AI** (safety, security, fairness, explainability) and **MLOps**. The course is crosslisted both as **Machine Learning in Production** and **AI Engineering**. 
For earlier offerings see websites for [Fall 2019](https://ckaestne.github.io/seai/F2019), [Summer 2020](https://ckaestne.github.io/seai/S2020), [Fall 2020](https://ckaestne.github.io/seai/F2020/), [Spring 2021](https://ckaestne.github.io/seai/S2021/), [Spring 2022](https://ckaestne.github.io/seai/S2022/), [Fall 2022](https://ckaestne.github.io/seai/F2022/), and [Spring 2023](https://github.com/mlip-cmu/s2023). This Spring 2024 offering is designed for students with some data science experience (e.g., has taken a machine learning course, has used sklearn) and basic programming skills (e.g., basic Python programming with libraries, can navigate a Unix shell), but will not expect a software engineering background (i.e., experience with testing, requirements, architecture, process, or teams is not required). Going forward we expect to offer this course at least every spring semester and possibly some fall semesters (not summer semesters). --- -**Future offerings: Unfortunately, we will not be able to offer the course in Fall 2023. The next offering will be Spring 2024. In the mean time, there is plenty of material here to self-study here (slides, book, readings, assignments, ...).** - For researchers, educators, or others interested in this topic, we share all course material, including slides and assignments, under a creative commons license on GitHub (https://github.com/mlip-cmu) and have also published a [textbook](https://ckaestne.medium.com/machine-learning-in-production-book-overview-63be62393581) with chapters corresponding to almost every lecture. A while ago we also wrote an article describing the rationale and the initial design of this course: [Teaching Software Engineering for AI-Enabled Systems](https://arxiv.org/abs/2001.06691). Video recordings of the Summer 2020 offering are online on the [course page](https://ckaestne.github.io/seai/S2020/#course-content), though they are a bit outdated by now. We would be happy to see this course or a similar version taught at other universities. See also an [annotated bibliography](https://github.com/ckaestne/seaibib) on research in this field. ## Course Description -This is a course for those who want to build **products** with **machine learning**, not just models and demos. Assume you can learn a model to make predictions, what does it take to turn the model into a product and actually deploy it, have confidence in its quality, and successfully operate and maintain it at scale? +This is a course for those who want to build **software products** with **machine learning**, not just models and demos. We assume that you can train a model or build prompts to make predictions, but what does it take to turn the model into a product and actually deploy it, have confidence in its quality, and successfully operate and maintain it at scale? -The course is designed to establish a working relationship between **software engineers** and **data scientists**: both contribute to building AI-enabled systems but have different expertise and focuses. To work together they need a mutual understanding of their roles, tasks, concerns, and goals and build a working relationship. This course is aimed at **software engineers** who want to build robust and responsible systems meeting the specific challenges of working with AI components and at **data scientists** who want to understand the requirements of the model for production use and want to facilitate getting a prototype model into production; it facilitates communication and collaboration between both roles. 
The course is a good fit for student looking at a career as an **ML engineer**. *The course focuses on all the steps needed to turn a model into a production system in a responsible and reliable manner.* +The course is designed to establish a working relationship between **software engineers** and **data scientists**: both contribute to building ML-enabled systems but have different expertise and focuses. To work together, they need a mutual understanding of their roles, tasks, concerns, and goals, and must build a working relationship. This course is aimed at **software engineers** who want to build robust and responsible products meeting the specific challenges of working with ML components and at **data scientists** who want to understand the requirements of the model for production use and want to facilitate getting a prototype model into production; it facilitates communication and collaboration between both roles. The course is a good fit for students looking at a career as an **ML engineer**. *The course focuses on all the steps needed to turn a model into a production system in a responsible and reliable manner.* ![Course overview](overview.svg) @@ -23,23 +21,23 @@ It covers topics such as: * **How to design for wrong predictions the model may make?** How to assure *safety* and *security* despite possible mistakes? How to design the *user interface* and the entire system to operate in the real world? * **How to reliably deploy and update models in production?** How can we *test* the entire machine learning pipeline? How can *MLOps* tools help to automate and scale the deployment process? How can we *experiment in production* (A/B testing, canary releases)? How do we detect *data quality* issues, *concept drift*, and *feedback loops* in production? * **How to scale production ML systems?** How do we design a system to process huge amounts of training data, telemetry data, and user requests? Should we use stream processing, batch processing, lambda architecture, or data lakes? -* **How to test and debug production ML systems?** How can we *evaluate* the quality of a model’s predictions in production? How can we *test* the entire AI-enabled system, not just the model? What lessons can we learn from *software testing*, *automated test case generation*, *simulation*, and *continuous integration* for testing for production machine learning? -* **Which qualities matter beyond a model’s prediction accuracy?** How can we identify and measure important quality requirements, including *learning and inference latency, operating cost, scalability, explainablity, fairness, privacy, robustness*, and *safety*? Does the application need to be able to *operate offline* and how often do we need to update the models? How do we identify what’s important in a AI-enabled product in a production setting for a business? How do we resolve *conflicts* and *tradeoffs*? +* **How to test and debug production ML systems?** How can we *evaluate* the quality of a model’s predictions in production? How can we *test* the entire ML-enabled system, not just the model? What lessons can we learn from *software testing*, *automated test case generation*, *simulation*, and *continuous integration* for testing production machine learning systems? +* **Which qualities matter beyond a model’s prediction accuracy?** How can we identify and measure important quality requirements, including *learning and inference latency, operating cost, scalability, explainability, fairness, privacy, robustness*, and *safety*? 
Does the application need to be able to *operate offline* and how often do we need to update the models? How do we identify what’s important in an ML-enabled product in a production setting for a business? How do we resolve *conflicts* and *tradeoffs*? * **How to work effectively in interdisciplinary teams?** How can we bring data scientists, software engineers, UI designers, managers, domain experts, big data specialists, operators, legal counsel, and other roles together and develop a *shared understanding* and *team culture*? **Examples and case studies** of ML-driven products we discuss include automated audio transcription; distributed detection of missing children on webcams and instant translation in augmented reality; cancer detection, fall detection, COVID diagnosis, and other smart medical and health services; automated slide layout in PowerPoint; semi-automated college admissions; inventory management; smart playlists and movie recommendations; ad fraud detection; delivery robots and smart driving features; and many others. -An extended group project focuses on building, deploying, evaluating, and maintaining a robust and scalable *movie recommendation service* under somewhat realistic “production” conditions. +An extended group project focuses on building, deploying, evaluating, and maintaining a robust and scalable *movie recommendation service* under somewhat realistic “production” conditions with 1 million users. ### Learning Outcomes After taking this course, among others, students should be able to -* analyze tradeoffs for designing production systems with AI-components, analyzing various qualities beyond accuracy such as operation cost, latency, updateability, and explainability -* plan for mistakes in AI components and implement production-quality systems that are robust to those mistakes +* analyze tradeoffs for designing production systems with ML components, analyzing various qualities beyond accuracy such as operation cost, latency, updateability, and explainability +* plan for mistakes in ML components and implement production-quality systems that are robust to those mistakes * design fault-tolerant and scalable data infrastructure for learning models, serving models, versioning, and experimentation * ensure quality of the entire machine learning pipeline with test automation and other quality assurance techniques, including automated checks for data quality, data drift, feedback loops, and model quality * build systems that can be tested and monitored in production and build robust deployment pipelines -* consider system-level requirements such as safety, security, privacy, fairness, and usability when building complex AI-enabled products +* consider system-level requirements such as safety, security, privacy, fairness, and usability when building complex ML-enabled products * communicate effectively in interdisciplinary teams In addition, students will gain familiarity with production-quality infrastructure tools, including stream processing with Apache Kafka, test automation with Jenkins, monitoring with Prometheus and Grafana, and deployment with Docker and various MLOps tools. @@ -50,27 +48,27 @@ In addition, students will gain familiarity with production-quality infrastructu The course is the same under all course numbers, with the exception of the PhD-level 17-745 which replaces two homework assignments with a mandatory [research project](https://github.com/mlip-cmu/s2023/blob/main/assignments/research_project.md). 
-Open to undergraduate and graduate students meeting the prerequisites. +Open to all undergraduate and graduate students meeting the prerequisites. -### Spring 2023 +### Spring 2024 -Lectures Monday/Wednesday 2-3:20pm, in person, TEP 1403 +Lectures Monday/Wednesday 2-3:20pm, in person, PH 100 -Recitations Friday 10-10:50am in POS 152 (A) and DH 1117 (C) and 12-12:50pm in PH A18A (B) and DH 1117 (D) +Labs Friday 9:30-10:50am in PH 226C (A) and WEH 4709 (B) and 11-12:20pm in PH A22 (C) and WEH 5310 (D) and 2-3:20pm in PH 226C (E) and GHC 4215 (F). -Instructors: [Eunsuk Kang](https://eskang.github.io/) and [Christian Kaestner](https://www.cs.cmu.edu/~ckaestne/) +Instructors: [Claire Le Goues](https://clairelegoues.com) and [Christian Kaestner](https://www.cs.cmu.edu/~ckaestne/) -TAs: Adeep Biswas, Dhanraj Kotian, Hari Prasath John Kennedy, Mukunda Das, Priyank Bhandia, Ritika Dhiman +TAs: tbd ### Coordination -We are happy to answer questions by email, over Slack, over Canvas, meet in person, and will jump on a quick Zoom call if you ask us. We also always arrive 5 to 10 min early to class and stay longer for discussions and questions. If you have questions about assignments and logistics, we prefer that you ask them publicly on Slack. +We are happy to answer questions by email and over Slack, meet in person, and will jump on a quick Zoom call if you ask us. We also always arrive 5 to 10 min early to class and stay longer for discussions and questions. If you have questions about assignments and logistics, we prefer that you ask them publicly on Slack. ## Course content -The general course content has been fairly stable over the last few years, though specific topics and tools are constantly updated with new research and tooling. Our list of learning goals under [Learning Goals](https://github.com/mlip-cmu/s2023/blob/main/learning_goals.md) describes what we aim to cover. Below is a table of a preliminary schedule. This is subject to change and will be updated as the semester progresses, especially to help focus on requested topics or support learning. +The general course content has been fairly stable over the last few years, though specific topics and tools are constantly updated with new research and tooling. Our list of learning goals under [Learning Goals](https://github.com/mlip-cmu/s2024/blob/main/learning_goals.md) describes what we aim to cover. Below is a table of a preliminary schedule. This is subject to change and will be updated as the semester progresses, especially to help focus on requested topics or support learning. [Schedule](https://github.com/mlip-cmu/s2023/blob/main/schedule.md) @@ -78,27 +76,27 @@ The general course content has been fairly stable over the last few years, thoug ## Course Syllabus and Policies -The course uses Canvas and Gradescope for homework submission, grading, discussion, questions, announcements, and supplementary documents; slides will be posted here; Slack is used for communication around homeworks and projects; Github is used to coordinate group work. All public course material (assignments, slides, syllabus) can be found in the course’s [GitHub repository](https://github.com/mlip-cmu/s2023); announcements and all *private* material (e.g., grades, passwords) will be shared through Canvas. 
+The course uses Canvas and Gradescope for homework submission, grading, discussion, questions, announcements, and supplementary documents; slides will be posted here; Slack is used for communication around homeworks and projects; GitHub is used to coordinate group work. All public course material (assignments, slides, syllabus) can be found in the course’s [GitHub repository](https://github.com/mlip-cmu/s2024); announcements and all *private* material (e.g., grades, passwords) will be shared through Canvas. -**Prerequisites:** The course does not have formal prerequesites, but we describe background knowledge that will help you be successful in the course. In a nutshell, we expect basic exposure to machine learning and basic programming skills, but do not require software engineering experience. +**Prerequisites:** The course does not have formal prerequisites, but we describe background knowledge that will help you be successful in the course. In a nutshell, we expect basic exposure to machine learning and basic programming skills, but do not require software engineering experience. *Machine learning (some experience recommended):* We suggest that you have basic familiarity with the process of extracting features, building and evaluating models, and a basic understanding of how and when different kinds of learning techniques work. Familiarity with Python and Jupyter notebooks is helpful. Courses such as 10-301, 10-315, and 05-434 will prepare you well, but project experience or self-learning from books or online courses will likely be sufficient for our purposes. For example, we recommend the book [Hands-On Machine Learning](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019665684604436) to get practical experience in building and evaluating models prior to taking this course. We have set up a *[prerequisite knowledge check](https://forms.gle/JcS61Uao7wHSFQen8)* as a Google Form, where we ask 10 questions on machine learning, which help you assess your background. This is set up as an anonymous and ungraded quiz, where you can compare your knowledge against what we believe is useful for you to be successful in this course (click on *“view score”* after submitting your answer). After submitting your answers, the system will give specific pointers to readings and exercises that may help you fill gaps in background knowledge. -*Programming (basic proficiency required):* The course has a substantial programming component, especially in the first assignment and the team project, so basic programming skills will be needed. If you take the course without programming experience, you will significantly struggle and it may cause conflicts within the group project. We expect that you meet the following criteria: (1) basic fluency in a programming language like Python, (2) ability to install and learn libraries in that language, (3) ability to ssh into a unix machine and perform basic command line operations, and (4) ability to install and learn new tools like Docker. We do not prescribe a programming language, but essentially all student teams decide to work primarily in Python. We will will provide some introductions and examples for essential tools like Git, Docker, Grafana, and Jenkins in recitations, but we expect that you will be able to pick up new tools and libraries on your own. For example, we expect that you will be able, on your own, to learn basic use of a library like [Flask](https://flask.palletsprojects.com/en/2.1.x/) to write a web service. 
Throughout the semester, expect to read lots of documentation and tutorials to learn various libraries and tools on your own. If you are worried whether your technical background is sufficient, we recommend that you look at (or even try) [homework I1](https://github.com/mlip-cmu/s2023/blob/main/assignments/I1_mlproduct.md) before the semester. +*Programming (basic proficiency required):* The course has a substantial programming component, especially in the first assignment and the team project, so basic programming skills will be needed. If you take the course without programming experience, you will significantly struggle and it may cause conflicts within the group project. We expect that you meet the following criteria: (1) basic fluency in a programming language like Python, (2) ability to install and learn libraries in that language, (3) ability to ssh into a unix machine and perform basic command line operations, and (4) ability to install and learn new tools like Docker. We do not prescribe a programming language, but almost all student teams decide to work primarily in Python. We will provide some introductions and examples for essential tools like Git, Docker, Grafana, and Jenkins in labs, but we expect that you will be able to pick up new tools and libraries on your own. For example, we expect that you will be able, on your own, to learn basic use of a library like [Flask](https://flask.palletsprojects.com/en/2.1.x/) to write a web service. Throughout the semester, expect to read lots of documentation and tutorials to learn various libraries and tools on your own. If you are worried whether your technical background is sufficient, we recommend that you look at (or even try) [homework I1](https://github.com/mlip-cmu/s2024/blob/main/assignments/I1_mlproduct.md) before the semester. -*Software engineering (no experience required):* Many students will have some software engineering experience beyond basic programming skills from software engineering courses or from working in larger software teams or on larger software projects, for example experience with requirements engineering, software design, software testing, distributed systems, continuous deployment, or managing teams. No such experience is expected as a prerequisite; we will cover basics for these topics in the course. +*Software engineering (no experience required):* Many students will have some software engineering experience beyond basic programming skills from software engineering courses, from internships, or from working in industry, for example experience with requirements engineering, software design, software testing, distributed systems, continuous deployment, or managing teams. No such experience is expected as a prerequisite; we will cover these topics in the course. Email the instructors if you would like to further talk to us about prerequisites. -**In-person teaching and lecture recordings:** The course will be taught in person and we consider in-class participation as an important part of the learning experience. We will not provide an online option. We will *not* make recordings of lectures or recitations available. +**In-person teaching and lecture recordings:** The course will be taught in person and we consider in-class participation as an important part of the learning experience. We will not provide an online option. We will *not* make recordings of lectures or labs available. We regularly use Slack for in-class activities. 
Please make sure that you have access to Slack on a laptop, tablet, or mobile phone during class. -If you cannot attend class due to a medical issue, family emergency, or other unforeseeable reason, please contact us about possible accommodations. We try to be as flexible as we can, but will handle these cases individually. +If you cannot attend class due to a medical issue, family emergency, interview, or other unforeseeable reason, please contact us about possible accommodations. We try to be as flexible as we can, but will handle these cases individually. -**Grading:** Evaluation will be based on the following distribution: 40% individual assignments, 30% group project, 10% midterm, 10% participation, 10% reading quizzes. No final exam. +**Grading:** Evaluation will be based on the following distribution: 35% individual assignments, 30% group project, 10% midterm, 10% participation, 5% labs, 10% reading quizzes. No final exam. -We strive for providing clear specifications and clear point breakdowns for all homework to set clear expectations and taking the guessing out of homework. We often give you choices to self-direct your learning, deciding what to work on and how to address a problem (e.g., we never prescribe a programming language and often give choices to answer a subset of possible questions). Clear specifications and point breakdowns allow you to intentionally decide to skip parts of assignments with clear upfront consequences. All parts will be graded pass/fail, no partial credit. For opportunities to redo work, see *resubmissions* below. For grading participation and quizzes see below. Some assignments have a small amount of bonus points. +We strive to provide clear specifications and clear point breakdowns for all homework to set clear expectations and take the guessing out of homework. We often give you choices to self-direct your learning, deciding what to work on and how to address a problem (e.g., we never prescribe a programming language and often give choices to answer a subset of possible questions). Clear specifications and point breakdowns allow you to intentionally decide to skip parts of assignments with clear upfront consequences. All parts will be graded pass/fail, no partial credit. For opportunities to redo work, see *resubmissions* below. For grading participation and quizzes see below. Some assignments have a small amount of bonus points. Since we give flexibility to resubmit assignments, we set grade boundaries fairly high. We expect the following grade boundaries: @@ -117,57 +115,66 @@ Since we give flexibility to resubmit assignments, we set grade boundaries fairl We assign participation grades as follows: -* 100%: Participates actively at least once in most lectures +* 100%: Participates actively at least once in most lectures (4 lectures waived, no questions asked) * 90%: Participates actively at least once in two thirds of the lectures * 75%: Participates actively at least once in over half of the lectures * 50%: Participates actively at least once in one quarter of the lectures * 20%: Participates actively at least once in at least 3 lectures. * 0%: No participation in the entire semester. +**Labs:** Labs typically introduce tools and have a task with one or more clear deliverables. Lab assignments are designed to take about 1h of work and can be completed before or during the lab session. The deliverable is graded on a pass/fail basis at any time during that week's lab session by showing your work to the TA. 
Typically showing your work involves showing source code, demoing executions, and (verbally) answering a few questions. The TA may ask a few questions about your implementation to probe that you understand your work. + +We intend labs to be very low stakes – this is your first practical engagement with the material and mistakes are a normal part of the learning process. Deliverables are graded pass/fail on whether they meet the stated expectations. If your solution does not meet the expectations, you can continue working on it during the lab session until it does. + +We encourage collaboration on labs: You can work together with other students both before and during the lab session. While we do not recommend it, you may look at other students’ solutions and reference solutions and even copy them. However, you will have to present and explain your solution to the TA on your own. + **Textbook, reading assignments, and reading quizzes:** We will be using Geoff Hulten's "*Building Intelligent Systems: A Guide to Machine Learning Engineering*" (ISBN: 1484234316) throughout much of the course. The library provides an [electronic copy](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436). In addition, we will provide various additional readings, including blog posts and academic papers, throughout the semester. -We also wrote our own textbook "[Machine Learning in Production](https://ckaestne.medium.com/machine-learning-in-production-book-overview-63be62393581)" that mirrors the lectures closely. The book is available freely online. We will not assign chapters from our own textbook, but we always point to the corresponding chapter for each lecture, which you can use as supplementary reading. +We also wrote our own textbook "[Machine Learning in Production](https://ckaestne.medium.com/machine-learning-in-production-book-overview-63be62393581)" that mirrors the lectures closely. The book is available freely online. We will not assign chapters from our own textbook, but we always point to the corresponding chapter for each lecture, which we suggest as supplementary reading. We will assign readings for most classes and post a corresponding quiz on Canvas that is due before class. Each quiz contains an open-ended question that relates to the reading. Reading quizzes are graded pass/fail for a good-faith effort to engage with the question. -**Teamwork:** Teamwork is an essential part of this course. The course contains a multi-milestone group project to be done in teams of 3-5 students. Teams will be assigned by the instructor. We will help teams throughout the semester and cover some specific content on teamwork as part of the course. Peer rating will be performed for team assignments with regard to *team citizenship* (i.e., being active and cooperative members), following a procedure adapted from [this article](https://www.cs.tufts.edu/~nr/cs257/archive/teaching/barbara-oakley/JSCL-collaboration.pdf), which we will further explain in an early lecture. Use [this site](https://ckaestne.github.io/seai/F2022/assignments/peergrading.html) to preview the expected adjustments for peer ratings. +**Teamwork:** Teamwork is an essential part of this course. The course contains a multi-milestone group project to be done in teams of 3-5 students. Teams will be assigned by the instructor. A TA will be assigned as a mentor to each team. We will help teams throughout the semester and cover some specific content on teamwork as part of the course. 
Peer rating will be performed for team assignments with regard to *team citizenship* (i.e., being active and cooperative members), following a procedure adapted from [this article](https://www.cs.tufts.edu/~nr/cs257/archive/teaching/barbara-oakley/JSCL-collaboration.pdf), which we will further explain in an early lecture. Use [this site](https://ckaestne.github.io/seai/F2022/assignments/peergrading.html) to preview the expected adjustments for peer ratings. The team's mentor will also debrief with the team after every milestone and discuss possible strategies to improve teamwork. **Late work policy and resubmissions:** We understand that students will always have competing deadlines, unusual events, interviews for job searches, and other activities that compete with coursework. We therefore build flexibility and a safety net directly into the rubric. If you need additional accommodations, please contact us. In addition, we expect that the pass/fail grading scheme without partial credit may lead to harsh point deductions for missing small parts of the requirements, so we provide a mechanism to resubmit work to regain lost points. -Every student receives *7 individual tokens* that they can spend throughout the semester in the following ways: +Every student receives *8 individual tokens* that they can spend throughout the semester in the following ways: * For each token a student can submit a homework assignment 1 day late (with 2 tokens a student can submit two homeworks one day late each or a single homework up to two days late). * For *three* tokens a student can improve or redo an individual homework assignment and resubmit. The earlier submission is discarded and the regraded assignment counts toward the final grade. Resubmissions can be made at any time in the semester up to the final project presentation (see schedule). – Note that this technically allows a student to blow the original deadline (receiving 0 points initially) and then resubmit the homework arbitrarily late for three tokens. * For one token a student can submit a reading quiz late (any time before the final presentation) or resubmit a graded reading quiz. +* For one token a student can complete a lab late or redo a lab (any time before the final presentation) by showing the work to a TA during office hours. * Remaining tokens at the end of the semester are counted as one participation day each. If a student runs out of tokens, late individual assignments receive a penalty of 15% per started day. -Every team independently receives *7 team tokens* that they can spend for extensions of any milestone deadline (1 token per day per milestone, except final presentation deadline) or to resubmit any milestone (3 tokens each, resubmitted any time before the final presentation). If a team runs out of tokens, late submissions in group assignments will receive feedback but no credit. +Every team independently receives *8 team tokens* that they can spend for extensions of any milestone deadline (1 token per day per milestone, except final presentation deadline) or to resubmit any milestone (3 tokens each, resubmitted any time before the final presentation). If a team runs out of tokens, late submissions in group assignments will receive feedback but no credit. -In general, late submissions and resubmissions can be done at any point in the semester before the final presentations. 
If submitting any work more than 3 days late, we will assign 0 points initially and you have to use the provided *resubmission form* in Canvas rather submitting to Gradescope. +In general, late submissions and resubmissions can be done at any point in the semester before the final presentations. If you submit any work more than 3 days late, we will assign 0 points initially and you have to use the provided *resubmission form* in Canvas rather than submitting to Gradescope. -Exceptions to this policy will be made at discretion of the instructor in important circumstances, almost always involving a family or medical emergency and an email from your advisor — you can ask your academic advisor or the Dean of Student Affairs requesting the exception on your behalf. Please communicate also with your team about potential timing issues. +Exceptions to this policy will be made at the discretion of the instructor in important circumstances, almost always involving a family or medical emergency and an email from your advisor — you can ask your academic advisor or the Dean of Student Affairs to request the exception on your behalf. Please also communicate with your team about potential timing issues. -**Communication:** We make important announcements through Canvas and may post less important information on Slack. We answer email, Canvas messages, and monitor Slack, which may all be used for clarifying homework assignments and other interactions. We suggest to monitor slack for public questions and interactions with your teams. Email or slack us if you would like to make an appointment. +**Communication:** We make important announcements on Slack; we recommend enabling Slack notifications. We answer email and monitor Slack, both of which may be used for clarifying homework assignments and other interactions. We strongly recommend asking questions publicly on Slack if others might have similar questions. Email or slack us if you would like to make an appointment. -**Auditing:** We welcome students to audit the course as long as the room capacities allow it. Auditing students will have access to all course materials (which is online anyway) and can attend lectures. Unfortunately we won't be able to grade homework submissions of auditing students or assign them to teams in the group project. To have auditing be on your transcript, approach us with the necessary paperwork. To assign a passing auditing grade at the end of the semester, we expect the student to get at least a 90% participation grade (see above) and a 70% score on reading quizzes. +**Auditing:** Due to the high demand for this course, we do *not* allow auditing. If you would like to self-study, all course materials are online. We welcome interested visitors to sit in on lectures as long as the room capacity allows it. -**Time management:** This is a 12-unit course, and it is our intention to manage it so that you spend close to 12 hours a week on the course, on average. 
In general, 3 hours/week will be spent in class, about 1 hour for the labs, 1-2 hours on readings and reading quizzes, and 6-7 hours on assignments. Notice that much homework is done in groups, so please account for the overhead and decreased time flexibility that comes with groupwork. Please give the course staff feedback if the time the course is taking for you differs significantly from our intention. **Writing:** Describing tradeoffs among decisions and communication with stakeholders from other backgrounds are key aspects of this class. Many homework assignments have a component that requires discussing issues in written form or reflecting about experiences. To practice writing skills, the Global Communications Center (GCC) offers one-on-one help for students, along with workshops. The instructors are also happy to provide additional guidance if requested. -**Use of content generation AI tools and external sources:** Given the nature of this course, we are open to using AI tools for completing work. We place no restrictions on the use of content generation tools, such as ChatGPT, GPT3, Co-Pilot, Stable Diffusion. You may also reuse code from external sources, such as StackOverflow or tutorials. In any case, you will be solely responsible for the correctness of the solution. Note that content generation tools often create plausible-looking but incorrect answers, which will not receive credit. You are also responsible for complying with any applicable licenses. If you use content generation tools, we encourage you to share your experience with the course staff or the entire class. +**Use of content generation AI tools and external sources:** Given the nature of this course, we are open to using AI tools for completing work. We place no restrictions on the use of content generation tools, such as ChatGPT, Bard, Co-Pilot, or Stable Diffusion. You may also reuse code from external sources, such as StackOverflow or tutorials. In any case, you will be solely responsible for the correctness of the solution. Note that content generation tools often create plausible-looking but incorrect answers, which will not receive credit. You are also responsible for complying with any applicable licenses. If you use content generation tools, we encourage you to share your experience with the course staff or the entire class. -**Academic honesty and collaboration:** The usual policies apply, especially the University Policy on Academic Integrity. Many parts of the work will be done in groups. We expect that group members collaborate with one another, but that groups work independently from other groups, not exchanging results with other groups. Within groups, we expect that you are honest about your contribution to the group's work. This implies not taking credit for others' work and not covering for team members that have not contributed to the team. Otherwise, our expectations regarding academic honestly and collaboration for group and pair work are the same as for individual work, substituting elevated to the level of "group." +**Academic honesty and collaboration:** The usual policies apply, especially the University Policy on Academic Integrity. Many parts of the work will be done in groups. We expect that group members collaborate with one another, but that groups work independently from other groups, not exchanging results with other groups. Within groups, we expect that you are honest about your contribution to the group's work. 
This implies not taking credit for others' work and not covering for team members who have not contributed to the team. This also applies to in-class discussions, where claiming to have worked with others who did not participate in the discussion is considered an academic honesty violation. Otherwise, our expectations regarding academic honesty and collaboration for group and pair work are the same as for individual work, elevated to the level of the "group." Beyond that, the key guiding principle of academic honesty in this course is: *"You may not copy any part of a solution to a problem that was written by another student (in this or prior iterations of the class), or was developed together with another student, or was delegated to another person. You may not look at another student's solution, even if you have completed your own, nor may you knowingly give your solution to another student or leave your solution where another student can see it.*" Note that this implies that you cannot publicly post your solutions on GitHub (e.g., as part of a portfolio during job applications). While the use of AI content generation tools is okay (see above), using the work from other students is not. Discussing challenges and solution strategies with others at a high level is okay, sharing code or text is not. +You may collaborate with other students on labs, but not on reading quizzes, homeworks, or exams. + We also expect and respect honesty when communicating with the course staff. -Any violation of this policy is cheating. The minimum penalty for cheating will be a zero grade for the whole assignment. Cheating incidents will also be reported through University channels, with possible additional disciplinary action (see the University Policy on Academic Integrity). +Any violation of this policy is cheating. The minimum penalty for cheating will be a zero grade for the whole assignment. Cheating incidents will also be reported through University channels, with possible additional disciplinary action (see the University Policy on Academic Integrity). There is no statute of limitations for violations of the collaboration policy; penalties may be assessed (and referred to the university disciplinary board) after you have completed the course, and some requirements of the collaboration policy (such as restrictions on you posting your solutions) extend beyond your completion of the course. If you have any question about how this policy applies in a particular situation, ask the instructors for clarification. diff --git a/assignments/I1_mlproduct.md b/assignments/I1_mlproduct.md index 4b0f2478..75813d78 100644 --- a/assignments/I1_mlproduct.md +++ b/assignments/I1_mlproduct.md @@ -49,8 +49,8 @@ Commit all your code changes to your GitHub repository, but *do not commit priva Additionally upload a short report to Gradescope by [date see Canvas] with the following content: -* **GitHub link:** Start the document with a link to your last commit on GitHub: On the GitHub webpage, click on the last commit message and copy the URL in the format `https://github.com/[user]/[repo]/commit/[commitid]`. Make sure that the link includes the long ID of the last commit. -* **Technical description (1 page max):** Briefly describe how you implemented the two features. Provide pointers to the relevant parts of the code, ideally as direct links to files or even to specific lines on GitHub. 
We prefer readable links in the PDF rather than hyperlinks behind text (e.g., https://github.com/ckaestne/mlip-s23/blob/main/assignments/I1_mlproduct.md rather than [this](https://github.com/ckaestne/mlip-s23/blob/main/assignments/I1_mlproduct.md)). +* **GitHub link:** Start the document with a link to your last commit on GitHub: On the GitHub webpage, click on the last commit message and copy the URL in the format `https://github.com/cmu-seai/[repo]/commit/[commitid]`. Make sure that the link includes the long ID of the last commit. +* **Technical description (1 page max):** Briefly describe how you implemented the two features. Provide pointers to the relevant parts of the code, ideally as direct links to files or even to specific lines on GitHub. We prefer readable links in the PDF rather than hyperlinks behind text (e.g., https://github.com/mlip-cmu/s2024/blob/main/assignments/I1_mlproduct.md rather than [this](https://github.com/mlip-cmu/s2024/blob/main/assignments/I1_mlproduct.md)). * **User interface design approach (1 page max):** Recommend for each of the two features how the feature should interact with users (automate, prompt, organize, annotate, hybrid) and why. Justify your recommendation, considering forcefulness, frequency, value, and cost. If your implementation differs from the recommended approach, briefly explain how you would change your implementation if you had more time. * **Harms (1 page max):** Discuss what possible harms you can anticipate from using machine learning for the features in the applications (e.g., safety, fairness). Identify at least one harm and discuss potential solutions to mitigate the harm. (You do not need to implement the solutions.) * **Production challenges (1 page max):** Discuss any technical challenges you anticipate if you want to deploy this feature in production (e.g., scalability, operating costs) and how you would change your implementation if you expected millions of users. Identify at least one problem and discuss corresponding potential solutions. (You do not need to implement the solutions.) diff --git a/assignments/I3_architecture.md b/assignments/I3_architecture.md deleted file mode 100644 index 967e72cc..00000000 --- a/assignments/I3_architecture.md +++ /dev/null @@ -1,48 +0,0 @@ -# Individual Assignment 3: Architecture - -(17-445/17-645 Machine Learning in Production; 11-695 AI Engineering) - -## Overview - -In this assignment, we return to the Dashcam scenario from I2 and explore architecture and alternatives of different deployment options. - -Learning goals: -* Reason about qualities relevant to the deployment of an ML component in a system architecture -* Design measures for design qualities and telemetry - -## Tasks - -Return to the scenario description of I2. Carefully read the list of qualities discussed in the scenario description and make sure you understand the concepts of interest here. - -**Task 1: Deployment.** Compare four different design alternatives about how to deploy the system with regard to the eight qualities listed in the scenario. To that end, analyze whether the ML component(s) for recognizing a person in an image should be deployed (a) on the dashcam, (b) on a phone, (c) in the cloud, or (d) some other configuration you describe (e.g., hybrid or edge). Provide a short explanation and an architecture diagram for your fourth design. - -Where possible estimate the impact of the different designs on the eight different qualities listed in the scenario description. 
You may want to do some Internet research about typical characteristics of various hardware and software components (e.g., storage capacity of dashcams, size of typical face recognition models, bandwidth of Bluetooth connections). You do not need to conduct precise measurements or estimate concrete values, but should inform your discussion with an understanding of the qualities in the context of the scenario (e.g., “solution A is better than solution B because of a bottleneck in Bluetooth bandwidth” or “privacy is better in solution C because customer data does not leave the device”). - -After understanding the four different designs, explicitly discuss the tradeoffs involved, which involves discussing the relative relevance of the qualities and the differences in qualities for the different solutions. Recommend one of the solutions. - -**Task 2: Telemetry.** Suggest a design for telemetry to identify how well (a) the system and (b) the ML component(s) are performing in production. Proceed in the typical three steps: Be explicit about what quality measures you use, what data you would collect, and how you would use the collected data to compute the quality measures. In addition, briefly justify your design and why it is appropriate in the context of the scenario. That discussion should cover at least (1) the amount of data transmitted or stored, (2) how it copes with rare events, and (3) whether it can detect both false positives and false negatives. - -## Deliverable - -Submit a report as a single PDF file to Gradescope that covers the following topics in clearly labeled sections (ideally each section starts on a new page): - -1. **Fourth deployment design** (1 page max): Describe a fourth deployment architecture and provide an architecture diagram. -1. **Analysis of deployment alternatives** (4 pages max): For each of the 4 deployment options discuss the 8 qualities listed in I2. We recommend that you start a bullet list with 4 elements (one for each deployment option) for each of the 8 qualities, but tabular or other representations are also possible. Rough estimates or relative ratings with a brief explanation are sufficient as long as they are grounded and realistic in the scenario. -2. **Recommendation and justification of deployment architecture** (1 page max): Recommend a deployment architecture and justify this recommendation in terms of the relative relevance of the qualities and the tradeoffs among quality attributes. -3. **Telemetry** (1 page max): Suggest how telemetry should be selected for a system quality and a model quality and describe how quality would be measured from telemetry data, and briefly justify those decisions. - -Page limits are recommendations and not strictly enforced. You can exceed the page limit if there is a good reason. We prefer precise and concise answers over long and rambling ones. - -## Grading - -The assignment is worth 100 points. For full credit, we expect: - -* [ ] 10 points: Description of a fourth deployment architecture is included. -* [ ] 10 points: An architecture diagram for the fourth deployment architecture is included and matches the description. -* [ ] 20 points: For each of the 4 design alternatives at least 4 quality attributes are analyzed. The analysis is plausible for the scenario. -* [ ] 10 points: For each of the 4 design alternatives all 8 quality attributes are analyzed plausibly. The analysis is plausible for the scenario. 
-* [ ] 10 points: A clear recommendation for one deployment decision is provided and a justification for the decision is provided. -* [ ] 10 points: The justification clearly makes tradeoffs among the discussed qualities and weighs the relative importance of the qualities to come to a conclusion supported by the analysis. -* [ ] 10 points: The telemetry section describes what telemetry data is collected and how. It is plausible in the scenario that this data can be collected. -* [ ] 10 points: The telemetry section contains a description of two quality measures, one for the system and one for the model. The section describes how both the metrics are operationalized with the telemetry data in a way that is clear enough for a third party to independently implement. -* [ ] 10 points: The telemetry section contains a justification for the chosen approach. The justification considers (1) the amount of data transmitted or stored, (2) how telemetry copes with rare events, and (3) whether this form of telemetry can detect both false positives and false negatives. diff --git a/assignments/I4_mlops_tools.md b/assignments/I3_mlops_tools.md similarity index 100% rename from assignments/I4_mlops_tools.md rename to assignments/I3_mlops_tools.md diff --git a/assignments/I4_explainability.md b/assignments/I4_explainability.md new file mode 100644 index 00000000..0601a099 --- /dev/null +++ b/assignments/I4_explainability.md @@ -0,0 +1 @@ +tbd diff --git a/assignments/research_project.md b/assignments/research_project.md index f361b805..8375d6a1 100644 --- a/assignments/research_project.md +++ b/assignments/research_project.md @@ -46,13 +46,13 @@ If you plan to conduct interviews or surveys as part of the project and you plan ## Deliverables -Submit a draft at the milestone deadlines and a paper and a presentation at the final deadline. +Email a draft to the instructors at the milestone deadlines and submit the paper to Gradescope for the final deadline. Present the work during the final presentation slot of the class. The paper should be in a form submittable to a new-idea track, short-paper track, or workshop in the field. It should have at least an introduction motivating the research, one or more clear and motivated research questions, a discussion of the state of the art or related work, and a description of the conducted or planned research, and some results. While we do not enforce a specific page limit or formatting requirements, we would typically expect around 4 pages double-column format, such as for the [ICSE-NIER](https://conf.researchr.org/track/icse-2022/icse-2022-nier---new-ideas-and-emerging-results) track. The presentation should be no longer than 8 minutes. How you structure the presentation is up to you. You do not need to cover everything, but consider how to make this interesting to the audience. It will be presented in the same time slot as the presentations from the group project. -Send drafts, papers and slides as attachments or links per email to the instructors. +Send drafts, papers, and slides as attachments or links by email to the instructors. Submit the final paper to Gradescope. ## Grading diff --git a/exams/README.md b/exams/README.md index db8bef22..f488957f 100644 --- a/exams/README.md +++ b/exams/README.md @@ -10,12 +10,13 @@ Topic-wise, everything covered in class, in the readings, and in recitation is f Midterms from previous semesters are available as practice. We expect the midterm to have a similar format, though topic coverage differs slightly between semesters. 
-* [Practice midterm from Fall 2019](https://github.com/ckaestne/seai/blob/F2019/other_material/practice_midterm.pdf) (corresponds quite well to topics covered this semester) -* [Midterm from Fall 2019](https://github.com/ckaestne/seai/blob/F2019/other_material/midterm.pdf) (we did not cover version control yet) -* [Final from Fall 2019](https://github.com/ckaestne/seai/blob/F2019/other_material/final_exam.pdf) (covers different topics, but provides yet another scenario) -* [Midterm from Summer 2020](https://github.com/ckaestne/seai/blob/S2020/exams/midterm.pdf) (slightly fewer topics covered than this semester) -* [Midterm from Fall 2020](https://github.com/ckaestne/seai/blob/F2020/exams/midterm_f20.pdf) (similar coverage) -* [Midterm from Spring 2021](https://github.com/ckaestne/seai/blob/S2021/exams/) (similar coverage) -* [Midterm from Spring 2022](https://github.com/ckaestne/seai/blob/S2022/exams/) (similar coverage, except data quality and infrastructure quality yet) -* [Midterm from Fall 2022](https://github.com/ckaestne/seai/blob/F2022/exams/) (similar coverage) +* [Practice midterm from Fall 2019](https://github.com/ckaestne/seai/blob/F2019/other_material/practice_midterm.pdf) +* [Midterm from Fall 2019](https://github.com/ckaestne/seai/blob/F2019/other_material/midterm.pdf) +* [Final from Fall 2019](https://github.com/ckaestne/seai/blob/F2019/other_material/final_exam.pdf) +* [Midterm from Summer 2020](https://github.com/ckaestne/seai/blob/S2020/exams/midterm.pdf) +* [Midterm from Fall 2020](https://github.com/ckaestne/seai/blob/F2020/exams/midterm_f20.pdf) +* [Midterm from Spring 2021](https://github.com/ckaestne/seai/blob/S2021/exams/) +* [Midterm from Spring 2022](https://github.com/ckaestne/seai/blob/S2022/exams/) +* [Midterm from Fall 2022](https://github.com/ckaestne/seai/blob/F2022/exams/) +* [Midterm from Spring 2023](https://github.com/mlip-cmu/s2023/tree/main/exams) diff --git a/labs/lab01.md b/labs/lab01.md new file mode 100644 index 00000000..89bb81db --- /dev/null +++ b/labs/lab01.md @@ -0,0 +1,46 @@ +# Lab 1: Calling, Building, and Securing APIs +In homework I1 you will use third-party machine learning APIs and in the group project you will develop your own APIs. In this lab, you will experiment with both: connecting to the Azure Vision API and providing your own API endpoint. +To receive credit for this lab, show your work to the TA during the lab session. + +## Deliverables +- [ ] Create an account and connect to the Azure Vision API +- [ ] Explain to the TA why hard-coding credentials is a bad idea. Commit your code to GitHub without committing your credentials. +- [ ] Run the API endpoint with the starter code and demonstrate that it works with an example invocation (e.g., using curl). + +## Getting started +Clone the starter code from this Git repository: https://github.com/eshetty/mlip-api-lab + +The code implements a Flask web application that receives API requests to analyze an image and returns information about the image, including the text contained within. 
To identify the text, the OCR feature of the Azure Vision API [[documentation](https://westcentralus.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-2/operations/56f91f2e778daf14a499f20d#:~:text=test.jpg%22%7D-,Response%20200,-The%20OCR%20results), [response format](https://westcentralus.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-2/operations/56f91f2e778daf14a499f20d#:~:text=test.jpg%22%7D-,Response%20200,-The%20OCR%20results)] can be used by adjusting the API endpoint and credentials in the code. We use Azure's provided libraries to abstract from low-level protocol details.
+
+Install the dependencies in the `requirements.txt` file with pip or similar. To set up the flask server, just run `python3 app.py`. The system should try to analyze an example image and report the results when you go to http://localhost:3000/
+
+## Connecting to the Azure Vision API
+1. Sign up for a student account for Microsoft Azure: https://azure.microsoft.com/en-us/free/students/ – no credit card required
+
+2. Create an instance of the Computer Vision service and get an API endpoint for your instance of the service.
+
+3. Get a subscription key to authorize your script to call the Computer Vision API.
+
+4. Update the code with the endpoint and key and test it.
+
+## Secure your Credentials
+The starter code hardcodes credentials in the code. This is a bad practice.
+
+Research and discuss best practices, such as never hard-coding credentials, never committing credentials to Git, rotating secrets regularly, encrypting your secrets at rest/in transit if possible, and practicing least-privilege access on machines where your credentials are stored as environment variables or in local files.
+
+Rewrite the code to load credentials from a file or an environment variable and commit the code without the credentials to GitHub.
+
+## Calling your own API
+The starter code comes with a flask server that serves the website at http://localhost:3000/ but also exposes its own API at http://localhost:3000/api/v1/analysis/, accepting a GET request with a JSON object with a single field “uri” pointing to an image to analyze.
+
+Identify how to call your own API with a tool like [curl](https://curl.se/docs/manpage.html) or [Postman](https://www.postman.com).
+
+Optionally extend the API or document it with [Swagger](https://swagger.io).
+
+## Additional resources
+- [Red Hat article on APIs](https://www.redhat.com/en/topics/api/what-are-application-programming-interfaces)
+- [Azure Computer Vision](https://learn.microsoft.com/en-us/python/api/overview/azure/cognitiveservices-vision-computervision-readme?view=azure-python)
+- [API Design Best Practices](https://blog.stoplight.io/crud-api-design?_ga=2.223919515.1813989671.1674077556-1488117179.1674077556)
+- [API Endpoint Best Practices](https://www.telerik.com/blogs/7-tips-building-good-web-api)
+- The file `seai-azure-cv-ocr-api.json` has the structure to test calls to the Azure Vision API with Postman.
+
diff --git a/labs/lab02.md b/labs/lab02.md
new file mode 100644
index 00000000..31695710
--- /dev/null
+++ b/labs/lab02.md
@@ -0,0 +1,59 @@
+# Lab 2: Kafka for Data Streaming
+
+In this lab, you will gain hands-on experience with Apache Kafka, a distributed streaming platform that plays a key role in processing large-scale real-time data. You will establish a connection to a Kafka broker, produce and consume messages, and explore Kafka command-line tools. This lab will prepare you for your group project, where you'll work with Kafka streams.
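As a preview of the producer/consumer pattern this lab exercises, here is a minimal sketch assuming the `kafka-python` package and a broker reachable at `localhost:9092`; the topic name is made up for illustration, the starter notebook defines its own:

```python
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "demo-weather"  # hypothetical topic name

# Producer: write a few messages to the topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for city in ["Pittsburgh", "Tokyo", "Lagos"]:
    producer.send(TOPIC, value=city.encode("utf-8"))
producer.flush()  # ensure buffered messages actually reach the broker

# Consumer: read the topic from the earliest available offset
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.offset, message.value.decode("utf-8"))
```

The offsets printed by the consumer are what allow a disconnected consumer to later resume where it left off.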
+
+To receive credit for this lab, show your work to the TA during recitation.
+
+## Deliverables
+- [ ] Establish a secure SSH tunnel to the Kafka server. Explain Kafka topics and offsets to the TA. How do they ensure message continuity if a consumer is disconnected?
+- [ ] Modify the starter code to implement producer and consumer modes for a Kafka topic.
+- [ ] Use Kafka's CLI tools to manage and monitor Kafka topics and messages.
+
+
+## Getting started
+- Clone the starter code from this [Git repository](https://github.com/tanya-5/mlip-kafka-lab/).
+- The repository includes a Python notebook for the Kafka producer and consumer model.
+- Install the Kafka Python package by running:
+  `python -m pip install kafka-python`
+
+## Connecting to the Kafka server
+1. Use SSH to create a tunnel to the Kafka server:
+   `ssh -L <local_port>:localhost:<remote_port> <username>@<server> -NTf`
+2. Test the Kafka server connection to ensure it's operational.
+
+## Implementing Producer-Consumer Mode
+### 1. Producer Mode: Writes Data to Broker
+Refer to the TODO sections in the script. Edit the bootstrap servers and add 2-3 cities of your choice. Run the code to write to the Kafka stream.
+
+### 2. Consumer Mode: Reads Data from Broker
+
+Use the notebook in this repository as starter code: https://github.com/tanya-5/mlip-kafka-lab/tree/main
+
+Modify the TODO section by filling in appropriate parameters/arguments in the starter code. Verify the output in `Kafka_log.csv`.
+
+Ref: [KafkaProducer Documentation](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html),
+[KafkaConsumer Documentation](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html)
+
+## Using Kafka’s CLI tools
+kcat (previously known as kafkacat) is a command-line tool for interacting with Kafka.
+Install it with your package manager, for example:
+- macOS: `brew install kcat`
+- Ubuntu: `apt-get install kcat`
+- Note for Windows users: Setting up kcat on Windows is complex. Please work in pairs with someone on macOS/Ubuntu during recitation for this deliverable. The purpose is to understand the CLI, which will be helpful in the group project for using Kafka on (Linux-based) virtual machines.
+
+Using the kcat documentation, write a command that connects to the local Kafka broker, specifies a topic, and consumes messages from the earliest offset.
+
+Ref: [kcat usage](https://docs.confluent.io/platform/current/app-development/kafkacat-usage.html),
+[kcat GitHub](https://github.com/edenhill/kcat)
+
+## Optional but Recommended
+For your group project you will be reading movie data from the Kafka stream. Try finding the list of all topics and then read some movielog streams to get an idea of what the data looks like:
+`kcat -b localhost:9092 -L`
+
+## Additional resources
+- [Apache Kafka](https://kafka.apache.org/)
+- [Kafka for Beginners](https://www.cloudkarafka.com/blog/2016-11-30-part1-kafka-for-beginners-what-is-apache-kafka.html)
+- [What is Apache Kafka? - TIBCO](https://www.tibco.com/reference-center/what-is-apache-kafka)
+- [Kafka Introduction Video 1](https://www.youtube.com/watch?v=PzPXRmVHMxI)
+- [Kafka Introduction Video 2](https://www.youtube.com/watch?v=JalUUBKdcA0)
+
diff --git a/labs/lab03.md b/labs/lab03.md
new file mode 100644
index 00000000..aad73cf8
--- /dev/null
+++ b/labs/lab03.md
@@ -0,0 +1,82 @@
+# Lab 3: Git
+
+## Deliverables:
+You will perform three tasks in this exercise.
+
+- [ ] Create and fix a merge conflict
+- [ ] Amend a commit
+- [ ] Create and approve a pull request
+
+It is strongly recommended that you use a git extension for your IDE to complete this lab.
If you are using Visual Studio Code, you can use the [GitLens](https://marketplace.visualstudio.com/items?itemName=eamodio.gitlens) extension.
+
+## Setup
+1. Fork the [PyTorch](https://github.com/pytorch/pytorch) repository to your GitHub account.
+2. Clone the forked repository to your local machine by running the following command in the terminal:
+```
+git clone -n --depth=1 --filter=tree:0 <url-of-your-fork>
+cd pytorch
+git sparse-checkout set --no-cone torch/nn
+git checkout
+```
+3. Open the repository in your IDE.
+
+## Exercise 1: Create and fix a merge conflict
+
+1. Create a new branch called `merge-conflict` from the `main` branch.
+2. Open the `torch/nn/functional.py` file, navigate to the `interpolate` function (line 3856) and change the resizing mode from `nearest` to `bilinear`.
+3. Commit the changes to the `merge-conflict` branch. Make sure you add a meaningful commit message.
+4. Switch back to the `main` branch.
+5. Open the `torch/nn/functional.py` file, navigate to the `interpolate` function (line 3856) and change the resizing mode from `nearest` to `bicubic` and `align_corners` to `True`.
+6. Commit the changes to the `main` branch. Make sure you add a meaningful commit message.
+7. Merge the `merge-conflict` branch into the `main` branch.
+8. Resolve the merge conflict by keeping the resizing mode `bilinear` and `align_corners` `True`.
+9. Commit the changes to the `main` branch. Make sure you add a meaningful commit message.
+
+## Exercise 2: Amend a commit
+
+1. Create a new branch called `amend-commit` from the `main` branch.
+2. In the `torch/nn/functional.py` file, navigate to the `multi_margin_loss` function (line 3566) and change the margin to 1.5 and the reduction mode to `sum`.
+3. Commit the changes to the `amend-commit` branch. Make sure you add a meaningful commit message.
+4. Amend the commit by changing the margin to 2.0.
+5. Commit the changes to the `amend-commit` branch. Make sure you add a meaningful commit message.
+
+## Exercise 3: Create and approve a pull request
+
+**Note: Please ensure that, on GitHub, you create the pull request to the main branch of your forked repository. Under no circumstances should you create a pull request to the original PyTorch repository.**

+
+(make sure you choose *username*/pytorch instead of pytorch/pytorch)
+
+1. Create a new branch called `pull-request` from the `main` branch.
+2. In the `torch/nn/functional.py` file, navigate to the `l1_loss` function (line 3308) and add code to check if the reduction mode is `sum` and raise an exception.
+3. Commit the changes to the `pull-request` branch. Make sure you add a meaningful commit message.
+4. Push the `pull-request` branch to the remote repository.
+5. Create a pull request to merge the `pull-request` branch into the `main` branch.
+6. Approve the pull request.
+7. Merge the `pull-request` branch into the `main` branch.
+
+
+
+## Useful commands
+
+- `git checkout -b <branch>` - creates a new branch and switches to it
+- `git checkout <branch>` - switches to the specified branch
+- `git merge <branch>` - merges the specified branch into the current branch
+- `git status` - shows the status of the current branch
+- `git add <file>` - adds the specified file to the staging area
+- `git commit -m "<commit message>"` - commits the staged changes with the specified commit message
+- `git log` - shows the commit history
+- `git log --oneline` - shows the commit history with each commit on a single line
+- `git log --oneline --graph` - shows the commit history with each commit on a single line and the branches graph
+- `git push origin <branch>` - pushes the specified branch to the remote repository
+- `git pull origin <branch>` - pulls the specified branch from the remote repository
+- `git branch -d <branch>` - deletes the specified branch
+- `git commit --amend` - amends the last commit
+- `git push origin --delete <branch>` - deletes the specified branch from the remote repository
+
+
+## Resources
+- [Git Handbook](https://guides.github.com/introduction/git-handbook/)
+- [Git Cheat Sheet](https://education.github.com/git-cheat-sheet-education.pdf)
+- [Git Documentation](https://git-scm.com/doc)
+- [Git Exercises](https://gitexercises.fracz.com/)
+
diff --git a/lectures/01_introduction/intro.md b/lectures/01_introduction/intro.md
index c5b6ae5e..d61928e9 100644
--- a/lectures/01_introduction/intro.md
+++ b/lectures/01_introduction/intro.md
@@ -1,8 +1,8 @@
---
-author: Eunsuk Kang & Christian Kaestner
+author: Claire Le Goues & Christian Kaestner
title: "MLiP: Motivation, Syllabus, and Introductions"
-semester: Fall 2022
-footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023"
+semester: Spring 2024
+footer: "Machine Learning in Production/AI Engineering • Claire Le Goues & Christian Kaestner, Carnegie Mellon University • Spring 2024"
license: Creative Commons Attribution 4.0 International (CC BY 4.0)
---

@@ -32,12 +32,14 @@ Setup the ability to read/post to Slack during lecture

`¯\_(ツ)_/¯`

-Best guess: Most people will get in, but it may take a few days
+About 120 students waitlisted
+
+Best guess: 40 more people will get in, but it may take a few days

For those joining late:
  * Ask us for recording of missed lectures on Slack
  * Post introduction on Slack (`#intro`) when joining
-  * If joining after Jan 25, automatic 5 day extension for Homework I1
+  * See Canvas for automatic extensions and makeup opportunities for quizzes, labs, and homeworks
  * Automatically excused for participation in missed lectures

@@ -306,18 +308,12 @@

# Syllabus and Class Structure

-17-445/17-645/17-745/11-695, Spring 2023, 12 units
-
-Monday/Wednesdays 2-3:25pm
+17-445/17-645/17-745/11-695, Spring 2024, 12 units

-Recitation Fridays 10:00-10:50am / 12:00-12:50pm
-
-----
+Monday/Wednesdays 2-3:20pm

-## Instructors
+Recitation Fridays 9:30am, 11am, and 2pm

-![Faces of 
instructors](tas.png) - ---- @@ -325,10 +321,10 @@ Recitation Fridays 10:00-10:50am / 12:00-12:50pm * Email us or ping us on Slack (invite link on Canvas) * All announcements through Slack `#announcements` -* Weekly office hours (see Canvas for schedule) +* Weekly office hours, starting next week, schedule on Canvas * Post questions on Slack * Please use `#general` or `#assignments` and post publicly if possible; your classmates will benefit from your Q&A! -* All course materials (slides, assignments, old midterms) available on GitHub and course website: https://mlip-cmu.github.io/s2023/ +* All course materials (slides, assignments, old midterms) available on GitHub and course website: https://mlip-cmu.github.io/s2024/ * Pull requests encouraged! ---- @@ -391,7 +387,7 @@ Both text-based and code-based homework assignments *"Coding warmup assignment"* -[Out now](https://github.com/ckaestne/seai/blob/F2022/assignments/I1_mlproduct.md), due Monday Jan 30 +[Out now](https://github.com/mlip-cmu/s2024/blob/main/assignments/I1_mlproduct.md), due Monday Jan 29 Enhance simple web *application* with ML-based features: Image search and automated captioning @@ -501,9 +497,7 @@ Mostly similar coverage to lecture Not required, use as supplementary reading -Still evolving, feedback appreciated! - -Published [online](https://ckaestne.medium.com/machine-learning-in-production-book-overview-63be62393581) +Published [online](https://ckaestne.medium.com/machine-learning-in-production-book-overview-63be62393581) (and in book form next year) @@ -514,7 +508,7 @@ Published [online](https://ckaestne.medium.com/machine-learning-in-production-bo
-All [assignments](https://github.com/mlip-cmu/s2023/tree/main/assignments) available on GitHub now
+Most [assignments](https://github.com/mlip-cmu/s2024/tree/main/assignments) available on GitHub now

Series of 4 small to medium-sized **individual assignments**:
* Engage with practical challenges

@@ -541,38 +535,54 @@

Design your own research project and write a report

Very open ended: Align with own research interests and existing projects

-See the [project description](https://github.com/mlip-cmu/s2023/blob/main/assignments/research_project.md) and talk to us
+See the [project requirements](https://github.com/mlip-cmu/s2024/blob/main/assignments/research_project.md) and talk to us

First hard milestone: initial description due Feb 27
+
+<!--
+![Timeline](timeline.svg)<!-- .element: class="plain" style="width:100%" -->
+-->

----
-## Recitations
+## Labs

Introducing various tools, e.g., fastAPI (serving), Kafka (stream processing), Jenkins (continuous integration), MLflow (experiment tracking), Docker & Kubernetes (containers), Prometheus & Grafana (monitoring), SHAP (explainability)...

Hands-on exercises, bring a laptop

-Often introducing tools relevant for assignments
+Often introducing tools useful for assignments
+
+About 1h of work, graded pass/fail, low stakes, show work to TA

-First recitation on **this Friday**: Calling, securing, and creating APIs
+First lab on **this Friday**: Calling, securing, and creating APIs
+
+----
+## Lab grading and collaboration
+
+We recommend starting the lab before the recitation, but it can be completed during the recitation
+
+Graded pass/fail by TA on the spot, can retry
+
+*Relaxed collaboration policy:* Can work with others before and during recitation, but you have to present/explain your solution to the TA individually
+
+(Think of recitations as mandatory office hours)

----
## Grading

-* 40% individual assignment
+* 35% individual assignments
* 30% group project with final presentation
* 10% midterm
* 10% participation
* 10% reading quizzes
+* 5% labs
* No final exam (final presentations will take place in that timeslot)

Expected grade cutoffs in syllabus (>82% B, >94 A-, >96% A, >99% A+)

@@ -600,14 +610,14 @@ Opportunities to resubmit work until last day of class
-7 individual tokens per student:
+8 individual tokens per student:
- Submit individual assignment 1 day late for 1 token (after running out of tokens 15% penalty per late day)
- Redo individual assignment for 3 tokens
- Resubmit or submit reading quiz late for 1 token
+- Redo or complete a lab late for 1 token (show in office hours)
- Remaining tokens count toward participation
-- 1 bonus token for attending >66% of recitations

-7 team tokens per team:
+8 team tokens per team:
- Submit milestone 1 day late for 1 token (no late submissions accepted when out of tokens)
- Redo milestone for 3 tokens

@@ -617,9 +627,9 @@

## How to use tokens

* No need to tell us if you plan to submit very late. We will assign 0 and you can resubmit
-* Instructions and form for resubmission on Canvas
+* Instructions and Google form for resubmission on Canvas
* We will automatically use remaining tokens toward participation and quizzes at the end
-* Remaining individual tokens reflected on Canvas, for remaining team tokens ask your TA.
+* Remaining individual tokens reflected on Canvas; for remaining team tokens ask your team mentor.

@@ -629,9 +639,9 @@

Instructor-assigned teams

-Teams stay together for project throughout semester, starting Feb 6
+Teams stay together for project throughout semester, starting Feb 5

-Fill out Catme Team survey before Feb 6 (3pt)
+Fill out Catme Team survey before Feb 5 (3pt)

Some advice in lectures; we'll help with debugging team issues

@@ -651,6 +661,8 @@

In a nutshell: do not copy from other students, do not lie, do not share or publish solutions

In group work, be honest about contributions of team members, do not cover for others

+Collaboration okay on labs, but not quizzes, individual assignments, or exams

If you feel overwhelmed or stressed, please come and talk to us (see syllabus for other support opportunities)

----

@@ -659,7 +671,7 @@

-GPT3, ChatGPT, ...? Reading quizzes, homework submissions, ...?
+GPT-4, ChatGPT, Copilot, ...? Reading quizzes, homework submissions, ...?

----

@@ -669,9 +681,9 @@

This is a course on responsible building of ML products. This includes questions

Feel free to use them and explore whether they are useful. Welcome to share insights/feedback.

-Warning: They are *[bullshit generators](https://aisnakeoil.substack.com/p/chatgpt-is-a-bullshit-generator-but)*! Requires understanding to check answers. We test them ourselves and they often generate bad/wrong answers for reading quizzes.
+Warning: Be aware of hallucinations. Requires understanding to check answers. We test them ourselves and they often generate bad/wrong answers for reading quizzes.
-**You are still responsible for the correctness of what you submit!** +**You are responsible for the correctness of what you submit!** diff --git a/lectures/10_qainproduction/bookingcom2.png b/lectures/02_systems/bookingcom2.png similarity index 100% rename from lectures/10_qainproduction/bookingcom2.png rename to lectures/02_systems/bookingcom2.png diff --git a/lectures/02_systems/systems.md b/lectures/02_systems/systems.md index 7b4fe8b9..3d359f28 100644 --- a/lectures/02_systems/systems.md +++ b/lectures/02_systems/systems.md @@ -1,8 +1,8 @@ --- author: Christian Kaestner title: "MLiP: From Models to Systems" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" +semester: Spring 2024 +footer: "Machine Learning in Production/AI Engineering • Claire Le Goues & Christian Kaestner, Carnegie Mellon University • Spring 2024" license: Creative Commons Attribution 4.0 International (CC BY 4.0) --- @@ -19,6 +19,16 @@ license: Creative Commons Attribution 4.0 International (CC BY 4.0) --- +# Administrativa + +* Still waiting for registrar to add another section +* Follow up on syllabus discussion: + * When not feeling well -- please stay home and get well, and email us for accommodation + * When using generative AI to generate responses (or email/slack messages) -- please ask it to be brief and to the point! + + +---- + # Learning goals * Understand how ML components are a (small or large) part of a larger system @@ -378,6 +388,13 @@ Passi, S., & Sengers, P. (2020). [Making data science systems work](https://jour +---- +## Model vs System Goal? + +![Model and system goals not aligning at booking.com](bookingcom2.png) + + + ---- ## More Accurate Predictions may not be THAT Important @@ -427,7 +444,7 @@ Wagstaff, Kiri. "Machine learning that matters." In Proceedings of the 29 th Int * **MLOps** ~ technical infrastructure automating ML pipelines * sometimes **ML Systems Engineering** -- but often this refers to building distributed and scalable ML and data storage platforms * "AIOps" ~ using AI to make automated decisions in operations; "DataOps" ~ use of agile methods and automation in business data analytics -* My preference: **Production Systems with Machine-Learning Components** +* My preference: **Software Products with Machine-Learning Components**
@@ -466,7 +483,7 @@ Start understanding the **requirements** of the system and its components
* **Organizational objectives:** Innate/overall goals of the organization -* **System goals:** Goals of the software system/feature to be built +* **System goals:** Goals of the software system/product/feature to be built * **User outcomes:** How well the system is serving its users, from the user's perspective * **Model properties:** Quality of the model used in a system, from the model's perspective * @@ -622,8 +639,26 @@ As a group post answer to `#lecture` tagging all group members using template: > User goals: ...
> Model goals: ...
+---- +## Academic Integrity Issue + +* Please do not cover for people not participating in discussion +* Easy to detect discrepancy between # answers and # people in classroom +* Please let's not have to have unpleasant meetings. +---- +## Breakout: Automating Admission Decisions + +What are different types of goals behind automating admissions decisions to a Master's program? + +As a group post answer to `#lecture` tagging all group members using template: +> Organizational goals: ...
+> Leading indicators: ...
+> System goals: ...
+> User goals: ...
+> Model goals: ...
+ diff --git a/lectures/03_requirements/requirements.md b/lectures/03_requirements/requirements.md index 0fbfc960..d8e4d3b0 100644 --- a/lectures/03_requirements/requirements.md +++ b/lectures/03_requirements/requirements.md @@ -1,8 +1,8 @@ --- -author: Christian Kaestner & Eunsuk Kang +author: Claire Le Goues & Christian Kaestner title: "MLiP: Gathering Requirements" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Eunsuk Kang & Christian Kaestner, Carnegie Mellon University • Spring 2023" +semester: Spring 2024 +footer: "Machine Learning in Production/AI Engineering • Claire Le Goues & Christian Kaestner, Carnegie Mellon University • Spring 2024" license: Creative Commons Attribution 4.0 International (CC BY 4.0) --- @@ -34,10 +34,10 @@ failures ---- ## Readings -Required reading: 🗎 Jackson, Michael. "[The world and the machine](https://web.archive.org/web/20170519054102id_/http://mcs.open.ac.uk:80/mj665/icse17kn.pdf)." In Proceedings of the International Conference on Software Engineering. IEEE, 1995. +Required reading: Jackson, Michael. "[The world and the machine](https://web.archive.org/web/20170519054102id_/http://mcs.open.ac.uk:80/mj665/icse17kn.pdf)." In Proceedings of the International Conference on Software Engineering. IEEE, 1995. -Going deeper: 🕮 Van Lamsweerde, Axel. [Requirements engineering: From system goals to UML models to software](https://bookshop.org/books/requirements-engineering-from-system-goals-to-uml-models-to-software-specifications/9780470012703). John Wiley & Sons, 2009. +Going deeper: Van Lamsweerde, Axel. [Requirements engineering: From system goals to UML models to software](https://bookshop.org/books/requirements-engineering-from-system-goals-to-uml-models-to-software-specifications/9780470012703). John Wiley & Sons, 2009. --- # Failures in ML-Based Systems @@ -571,7 +571,7 @@ Slate, 01/2022 -See 🗎 Jackson, Michael. "[The world and the machine](https://web.archive.org/web/20170519054102id_/http://mcs.open.ac.uk:80/mj665/icse17kn.pdf)." In Proceedings of the International Conference on Software Engineering. IEEE, 1995. +See Jackson, Michael. "[The world and the machine](https://web.archive.org/web/20170519054102id_/http://mcs.open.ac.uk:80/mj665/icse17kn.pdf)." In Proceedings of the International Conference on Software Engineering. IEEE, 1995. ---- ## Understanding requirements is hard @@ -589,7 +589,7 @@ See 🗎 Jackson, Michael. "[The world and the machine](https://web.archive.org/ -See also 🗎 Jackson, Michael. "[The world and the machine](https://web.archive.org/web/20170519054102id_/http://mcs.open.ac.uk:80/mj665/icse17kn.pdf)." In Proceedings of the International Conference on Software Engineering. IEEE, 1995. +See also Jackson, Michael. "[The world and the machine](https://web.archive.org/web/20170519054102id_/http://mcs.open.ac.uk:80/mj665/icse17kn.pdf)." In Proceedings of the International Conference on Software Engineering. IEEE, 1995. ---- ## Start with Stakeholders... @@ -779,12 +779,12 @@ Identify stakeholders, interview them, resolve conflicts
-* 🕮 Van Lamsweerde, Axel. Requirements engineering: From system goals to UML models to software. John Wiley & Sons, 2009. -* 🗎 Vogelsang, Andreas, and Markus Borg. "Requirements Engineering for Machine Learning: Perspectives from Data Scientists." In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019. -* 🗎 Rahimi, Mona, Jin LC Guo, Sahar Kokaly, and Marsha Chechik. "Toward Requirements Specification for Machine-Learned Components." In 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), pp. 241-244. IEEE, 2019. -* 🗎 Kulynych, Bogdan, Rebekah Overdorf, Carmela Troncoso, and Seda Gürses. "POTs: protective optimization technologies." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 177-188. 2020. -* 🗎 Wiens, Jenna, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X. Liu, Finale Doshi-Velez, Kenneth Jung et al. "Do no harm: a roadmap for responsible machine learning for health care." Nature medicine 25, no. 9 (2019): 1337-1340. -* 🗎 Bietti, Elettra. "From ethics washing to ethics bashing: a view on tech ethics from within moral philosophy." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 210-219. 2020. -* 🗎 Guizani, Mariam, Lara Letaw, Margaret Burnett, and Anita Sarma. "Gender inclusivity as a quality requirement: Practices and pitfalls." IEEE Software 37, no. 6 (2020). +* Van Lamsweerde, Axel. Requirements engineering: From system goals to UML models to software. John Wiley & Sons, 2009. +* Vogelsang, Andreas, and Markus Borg. "Requirements Engineering for Machine Learning: Perspectives from Data Scientists." In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019. +* Rahimi, Mona, Jin LC Guo, Sahar Kokaly, and Marsha Chechik. "Toward Requirements Specification for Machine-Learned Components." In 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), pp. 241-244. IEEE, 2019. +* Kulynych, Bogdan, Rebekah Overdorf, Carmela Troncoso, and Seda Gürses. "POTs: protective optimization technologies." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 177-188. 2020. +* Wiens, Jenna, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X. Liu, Finale Doshi-Velez, Kenneth Jung et al. "Do no harm: a roadmap for responsible machine learning for health care." Nature medicine 25, no. 9 (2019): 1337-1340. +* Bietti, Elettra. "From ethics washing to ethics bashing: a view on tech ethics from within moral philosophy." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 210-219. 2020. +* Guizani, Mariam, Lara Letaw, Margaret Burnett, and Anita Sarma. "Gender inclusivity as a quality requirement: Practices and pitfalls." IEEE Software 37, no. 6 (2020).
diff --git a/lectures/04_mistakes/mistakes.md b/lectures/04_mistakes/mistakes.md index 42ebf6b4..3c6be2e5 100644 --- a/lectures/04_mistakes/mistakes.md +++ b/lectures/04_mistakes/mistakes.md @@ -1,8 +1,8 @@ --- -author: Eunsuk Kang and Christian Kaestner +author: Claire Le Goues & Christian Kaestner title: "MLiP: Planning for Mistakes" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Eunsuk Kang & Christian Kaestner, Carnegie Mellon University • Spring 2023" +semester: Spring 2024 +footer: "Machine Learning in Production/AI Engineering • Claire Le Goues & Christian Kaestner, Carnegie Mellon University • Spring 2024" license: Creative Commons Attribution 4.0 International (CC BY 4.0) --- @@ -12,6 +12,139 @@ license: Creative Commons Attribution 4.0 International (CC BY 4.0) # Planning for Mistakes +--- +# From last time... +---- +## Requirements elicitation techniques (1) + +* Background study: understand organization, read documents, try to use old system +* Interview different stakeholders + * Ask open ended questions about problems, needs, possible solutions, preferences, concerns... + * Support with visuals, prototypes, ask about tradeoffs + * Use checklists to consider qualities (usability, privacy, latency, ...) + + +**What would you ask in lane keeping software? In fall detection software? In college admissions software?** + +---- +## ML Prototyping: Wizard of Oz + +![Wizard of oz excerpt](wizard.gif) + +Note: In a wizard of oz experiment a human fills in for the ML model that is to be developed. For example a human might write the replies in the chatbot. + +---- +## Requirements elicitation techniques (2) + +* Surveys, groups sessions, workshops: Engage with multiple stakeholders, explore conflicts +* Ethnographic studies: embed with users, passively observe or actively become part +* Requirements taxonomies and checklists: Reusing domain knowledge +* Personas: Shift perspective to explore needs of stakeholders not interviewed + +---- +## Negotiating Requirements + +Many requirements are conflicting/contradictory + +Different stakeholders want different things, have different priorities, preferences, and concerns + +Formal requirements and design methods such as [card sorting](https://en.wikipedia.org/wiki/Card_sorting), [affinity diagramming](https://en.wikipedia.org/wiki/Affinity_diagram), [importance-difficulty matrices](https://spin.atomicobject.com/2018/03/06/design-thinking-difficulty-importance-matrix/) + +Generally: sort through requirements, identify alternatives and conflicts, resolve with priorities and decisions -> single option, compromise, or configuration + + + +---- +## Stakeholder Conflict Examples + +*User wishes vs developer preferences:* free updates vs low complexity + +*Customer wishes vs affected third parties:* privacy preferences vs disclosure + +*Product owner priorities vs regulators:* maximizing revenue vs privacy protections + +**Conflicts in lane keeping software? In fall detection software? 
In college admissions software?**
+
+
+**Who makes the decisions?**
+
+----
+## Requirements documentation
+
+
+![paperwork](../_chapterimg/06_requirements.jpg)
+
+
+----
+## Requirements documentation
+
+Write down requirements
+* what the software *shall* do, what it *shall* not do, what qualities it *shall* have,
+* document decisions and rationale for conflict resolution
+
+Requirements as input to design and quality assurance
+
+Formal requirements documents often seen as bureaucratic; lightweight options (notes, wikis, issues) are common
+
+Systems with higher risk -> consider more formal documentation
+
+----
+## Requirements evaluation (validation!)
+
+![Validation vs verification](validation.svg)
+
+
+
+----
+## Requirements evaluation
+
+Manual inspection (like code review)
+
+Show requirements to stakeholders, ask for misunderstandings, gaps
+
+Show prototype to stakeholders
+
+Checklists to cover important qualities
+
+
+Critically inspect assumptions for completeness and realism
+
+Look for unrealistic ML-related assumptions (no false positives, unbiased representative data)
+
+
+----
+## How much requirements eng. and when?
+
+![Waterfall process picture](waterfall.svg)
+
+----
+## How much requirements eng. and when?
+
+Requirements important in risky systems
+
+Requirements as basis of a contract (outsourcing, assigning blame)
+
+Rarely ever fully complete and stable upfront, anticipate change
+* Stakeholders see problems in prototypes, change their minds
+* Especially ML requires lots of exploration to establish feasibility
+
+Low-risk problems often use lightweight, agile approaches
+
+(We'll return to this later)
+
+----
+# Summary
+
+Requirements state the needs of the stakeholders and are expressed
+over the phenomena in the world
+
+Software/ML models have limited influence over the world
+
+Environmental assumptions play just as important a role in
+establishing requirements
+
+Identify stakeholders, interview them, resolve conflicts
+
---
## Exploring Requirements...

@@ -32,7 +165,7 @@

-Required reading: 🕮 Hulten, Geoff. "Building Intelligent Systems: A
+Required reading: Hulten, Geoff. "Building Intelligent Systems: A
Guide to Machine Learning Engineering." (2018), Chapters 6–7 (Why
creating IE is hard, balancing IE) and 24 (Dealing with mistakes)
----- -## Common excuse: Just software mistake - -
- - ---- ## Common excuse: The problem is just data @@ -286,8 +413,7 @@ Notes: Cancer prediction, sentencing + recidivism, Tesla autopilot, military "ki ![Example of email responses suggested by GMail](email.png) -* Fall detection smartwatch -* Safe browsing +* Fall detection smartwatch? ---- ## Human in the Loop - Examples? @@ -692,14 +818,15 @@ A number of methods:
-* Fault tree: A top-down diagram that displays the relationships -between a system failure (i.e., requirement violation) and its potential causes. - * Identify sequences of events that result in a failure - * Prioritize the contributors leading to the failure - * Inform decisions about how to (re-)design the system +* Fault tree: A diagram that displays relationships +between a system failure (i.e., requirement violation) and potential causes. + * Identify event sequences that can result in failure + * Prioritize contributors leading to a failure + * Inform design decisions * Investigate an accident & identify the root cause * Often used for safety & reliability, but can also be used for other types of requirements (e.g., poor performance, security attacks...) +* (Observation: they're weirdly named!)
@@ -728,7 +855,7 @@ other types of requirements (e.g., poor performance, security attacks...) Event: An occurrence of a fault or an undesirable action * (Intermediate) Event: Explained in terms of other events - * Basic Event: No further development or breakdown; leaf + * Basic Event: No further development or breakdown; leaf (choice!) Gate: Logical relationship between an event & its immediate subevents * AND: All of the sub-events must take place @@ -846,12 +973,25 @@ Solution combines a vision-based system identifying people in the door with pres * Remove basic events with mitigations * Increase the size of cut sets with mitigations - +* Recall: Guardrails ![FTA for trapping people in doors of a train](fta-without-mitigation.svg) ---- +## Guardrails - Examples + +Recall: Thermal fuse in smart toaster + +![Thermal fuse](thermalfuse.png) + + ++ maximum toasting time + extra heat sensor + +---- + + + ![Updated FTA for trapping people in doors of a train](fta-mitigation.svg) @@ -875,15 +1015,17 @@ Possible mitigations? ---- ## FTA: Caveats + In general, building a **complete** tree is impossible * There are probably some faulty events that you missed * "Unknown unknowns" + * Events can always be decomposed; detail level is a choice. Domain knowledge is crucial for improving coverage * Talk to domain experts; augment your tree as you learn more FTA is still very valuable for risk reduction! - * Forces you to think about & explicitly document possible failure scenarios + * Forces you to think about, document possible failure scenarios * A good starting basis for designing mitigations diff --git a/lectures/04_mistakes/validation.svg b/lectures/04_mistakes/validation.svg new file mode 100644 index 00000000..ec0ec642 --- /dev/null +++ b/lectures/04_mistakes/validation.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/lectures/15_process/waterfall.svg b/lectures/04_mistakes/waterfall.svg similarity index 100% rename from lectures/15_process/waterfall.svg rename to lectures/04_mistakes/waterfall.svg diff --git a/lectures/25_summary/wizard.gif b/lectures/04_mistakes/wizard.gif similarity index 100% rename from lectures/25_summary/wizard.gif rename to lectures/04_mistakes/wizard.gif diff --git a/lectures/05_modelaccuracy/modelquality1.md b/lectures/05_modelaccuracy/modelquality1.md index 10c9c6f1..ed8aca78 100644 --- a/lectures/05_modelaccuracy/modelquality1.md +++ b/lectures/05_modelaccuracy/modelquality1.md @@ -1,8 +1,8 @@ --- -author: Christian Kaestner and Eunsuk Kang +author: Christian Kaestner and Claire Le Goues title: "MLiP: Model Correctness and Accuracy" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" +semester: Spring 2024 +footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Claire Le Goues, Carnegie Mellon University • Spring 2024" license: Creative Commons Attribution 4.0 International (CC BY 4.0) --- @@ -153,8 +153,6 @@ More on system vs model goals and other model qualities later **Model:** $\overline{X} \rightarrow Y$ -**Training/validation/test data:** sets of $(\overline{X}, Y)$ pairs indicating desired outcomes for select inputs - **Performance:** In machine learning, "performance" typically refers to accuracy: "this model performs better" = it produces more accurate results @@ -617,18 +615,6 @@ As a group, post your answer to `#lecture` tagging all group members.
----- -## Correlation vs Causation - - -![causation1](causation1.png) - -![causation2](causation2.png) - - - -https://www.tylervigen.com/spurious-correlations - ---- @@ -647,67 +633,6 @@ https://www.tylervigen.com/spurious-correlations ----- -## Risks of Metrics as Incentives - -Metrics-driven incentives can: - * Extinguish intrinsic motivation - * Diminish performance - * Encourage cheating, shortcuts, and unethical behavior - * Become addictive - * Foster short-term thinking - -Often, different stakeholders have different incentives - -**Make sure data scientists and software engineers share goals and success measures** - ----- -## Example: University Rankings - - - -![US News](us-news.jpg) - - - -* Originally: Opinion-based polls, but complaints by schools on subjectivity -* Data-driven model: Rank colleges in terms of "educational excellence" -* Input: SAT scores, student-teacher ratios, acceptance rates, -retention rates, campus facilities, alumni donations, etc., - - ----- -## Example: University Rankings - - - -![US News](us-news.jpg) - - - -* Can the ranking-based metric be misused or cause unintended side effects? - - - - - - -For more, see Weapons of Math Destruction by Cathy O'Neil - - -Notes: - -* Example 1 - * Schools optimize metrics for higher ranking (add new classrooms, nicer - facilities) - * Tuition increases, but is not part of the model! - * Higher ranked schools become more expensive - * Advantage to students from wealthy families -* Example 2 - * A university founded in early 2010's - * Math department ranked by US News as top 10 worldwide - * Top international faculty paid \$\$ as a visitor; asked to add affiliation - * Increase in publication citations => skyrocket ranking! @@ -1178,7 +1103,8 @@ Note: The curve is the real trend, red points are training data, green points ar Example: Kaggle competition on detecting distracted drivers -![Driver Picture 1](driver_phone1.png) ![Driver Picture 2](driver_phone2.png) +![Driver Picture 1](driver_phone1.png) +![Driver Picture 2](driver_phone2.png) Relation of datapoints may not be in the data (e.g., driver) diff --git a/lectures/06_teamwork/teams.md b/lectures/06_teamwork/teams.md index afc45bb7..46bfbee3 100644 --- a/lectures/06_teamwork/teams.md +++ b/lectures/06_teamwork/teams.md @@ -1,8 +1,8 @@ --- -author: Christian Kaestner and Eunsuk Kang +author: Christian Kaestner and Claire Le Goues title: "MLiP: Working with Interdisciplinary (Student) Teams" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" +semester: Spring 2024 +footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Claire Le Goues, Carnegie Mellon University • Spring 2024" license: Creative Commons Attribution 4.0 International (CC BY 4.0) --- @@ -31,7 +31,7 @@ license: Creative Commons Attribution 4.0 International (CC BY 4.0) * Say hi, introduce yourself: Name? SE or ML background? Favorite movie? Fun fact? * Find time for first team meeting in next few days * Agree on primary communication until team meeting -* Pick a movie-related team name, post team name and tag all group members on slack in `#social` +* Pick a movie-related team name (use a language model if needed), post team name and tag all group members on slack in `#social` --- ## Teamwork is crosscutting... 
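To make the leakage concern concrete, here is a sketch of splitting by driver so the same driver never appears on both sides of the split (assuming scikit-learn and pandas; the metadata file and column names are hypothetical):

```python
# Sketch: split by driver so near-identical frames of the same driver
# cannot end up in both training and test data (which would inflate
# measured accuracy). Columns "driver_id", "image_path", "label" and the
# metadata file are hypothetical, not from the actual competition.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("driver_images.csv")  # hypothetical metadata file

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["driver_id"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
# sanity check: no driver appears in both sets
assert set(train["driver_id"]).isdisjoint(set(test["driver_id"]))
```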
@@ -390,7 +390,7 @@ Based on research and years of own experience ---- -## Breakout: Navigating Team Issues +## Breakout: Premortem Pick one or two of the scenarios (or another one team member faced in the past) and openly discuss proactive/reactive solutions @@ -580,4 +580,4 @@ Adjusting grades based on survey and communication with course staff * Classic work on team dysfunctions: Lencioni, Patrick. “The five dysfunctions of a team: A Leadership Fable.” Jossey-Bass (2002). * Oakley, Barbara, Richard M. Felder, Rebecca Brent, and Imad Elhajj. "[Turning student groups into effective teams](https://norcalbiostat.github.io/MATH456/notes/Effective-Teams.pdf)." Journal of student centered learning 2, no. 1 (2004): 9-34. -
\ No newline at end of file + diff --git a/lectures/07_modeltesting/capabilities1.png b/lectures/07_modeltesting/capabilities1.png deleted file mode 100644 index 0fd08926..00000000 Binary files a/lectures/07_modeltesting/capabilities1.png and /dev/null differ diff --git a/lectures/07_modeltesting/capabilities2.png b/lectures/07_modeltesting/capabilities2.png deleted file mode 100644 index 5d8ae5e4..00000000 Binary files a/lectures/07_modeltesting/capabilities2.png and /dev/null differ diff --git a/lectures/07_modeltesting/checklist.jpg b/lectures/07_modeltesting/checklist.jpg deleted file mode 100644 index 64d7b725..00000000 Binary files a/lectures/07_modeltesting/checklist.jpg and /dev/null differ diff --git a/lectures/07_modeltesting/ci.png b/lectures/07_modeltesting/ci.png deleted file mode 100644 index e686e50f..00000000 Binary files a/lectures/07_modeltesting/ci.png and /dev/null differ diff --git a/lectures/07_modeltesting/coverage.png b/lectures/07_modeltesting/coverage.png deleted file mode 100644 index 35f64927..00000000 Binary files a/lectures/07_modeltesting/coverage.png and /dev/null differ diff --git a/lectures/07_modeltesting/easeml.png b/lectures/07_modeltesting/easeml.png deleted file mode 100644 index 19bc1a62..00000000 Binary files a/lectures/07_modeltesting/easeml.png and /dev/null differ diff --git a/lectures/07_modeltesting/googlehome.jpg b/lectures/07_modeltesting/googlehome.jpg deleted file mode 100644 index 9c7660d2..00000000 Binary files a/lectures/07_modeltesting/googlehome.jpg and /dev/null differ diff --git a/lectures/07_modeltesting/imgcaptioning.png b/lectures/07_modeltesting/imgcaptioning.png deleted file mode 100644 index 9de8d250..00000000 Binary files a/lectures/07_modeltesting/imgcaptioning.png and /dev/null differ diff --git a/lectures/07_modeltesting/inputpartitioning.png b/lectures/07_modeltesting/inputpartitioning.png deleted file mode 100644 index e10dfcb8..00000000 Binary files a/lectures/07_modeltesting/inputpartitioning.png and /dev/null differ diff --git a/lectures/07_modeltesting/inputpartitioning2.png b/lectures/07_modeltesting/inputpartitioning2.png deleted file mode 100644 index b2a8f1ea..00000000 Binary files a/lectures/07_modeltesting/inputpartitioning2.png and /dev/null differ diff --git a/lectures/07_modeltesting/mlflow-web-ui.png b/lectures/07_modeltesting/mlflow-web-ui.png deleted file mode 100644 index 82e3e39a..00000000 Binary files a/lectures/07_modeltesting/mlflow-web-ui.png and /dev/null differ diff --git a/lectures/07_modeltesting/mlvalidation.png b/lectures/07_modeltesting/mlvalidation.png deleted file mode 100644 index e536d91f..00000000 Binary files a/lectures/07_modeltesting/mlvalidation.png and /dev/null differ diff --git a/lectures/07_modeltesting/modelquality2.md b/lectures/07_modeltesting/modelquality2.md deleted file mode 100644 index 0e59d2d0..00000000 --- a/lectures/07_modeltesting/modelquality2.md +++ /dev/null @@ -1,1339 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Model Testing beyond Accuracy" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 ---- - -
- -## Machine Learning in Production - - -# Model Testing beyond Accuracy - -

(Slicing, Capabilities, Invariants, Simulation, ...)

- - - - ---- -## More model-level QA... - - -![Overview of course content](../_assets/overview.svg) - - ----- - -# Learning Goals - -* Curate validation datasets for assessing model quality, covering subpopulations and capabilities as needed -* Explain the oracle problem and how it challenges testing of software and models -* Use invariants to check partial model properties with automated testing -* Select and deploy automated infrastructure to evaluate and monitor model quality - ---- -# Model Quality - - -**First Part:** Measuring Prediction Accuracy -* the data scientist's perspective - -**Second Part:** What is Correctness Anyway? -* the role and lack of specifications, validation vs verification - -**Third Part:** Learning from Software Testing 🠔 -* unit testing, test case curation, invariants, simulation (next lecture) - -**Later:** Testing in Production -* monitoring, A/B testing, canary releases (in 2 weeks) - - - ----- - -![Xkcd commit 1838](xkcd1838.png) - - -[XKCD 1838](https://xkcd.com/1838/), cc-by-nc 2.5 Randall Munroe - - - - - - - - - - - ---- -# Curating Validation Data & Input Slicing - -![Fruit slices](slices.jpg) - - ----- -## Breakout Discussion - -
- -Write a few tests for the following program: - -```scala -def nextDate(year: Int, month: Int, day: Int) = ... -``` - -A test may look like: -```java -assert nextDate(2021, 2, 8) == (2021, 2, 9); -``` - -**As a group, discuss how you select tests. Discuss how many tests you need to feel confident.** - -Post answer to `#lecture` tagging group members in Slack using template: -> Selection strategy: ...
-> Test quantity: ...
- -
- ----- -## Defining Software Testing - -* Program *p* with specification *s* -* Test consists of - - Controlled environment - - Test call, test inputs - - Expected behavior/output (oracle) - -```java -assertEquals(4, add(2, 2)); -assertEquals(??, factorPrime(15485863)); -``` - -Testing is complete but unsound: -Cannot guarantee the absence of bugs - - ----- -## How to Create Test Cases? - -```scala -def nextDate(year: Int, month: Int, day: Int) = ... -``` - - - - -Note: Can focus on specification (and concepts in the domain, such as -leap days and month lengths) or can focus on implementation - -Will not randomly sample from distribution of all days - ----- -## Software Test Case Design - -
- -**Opportunistic/exploratory testing:** Add some unit tests, without much planning - -**Specification-based testing** ("black box"): Derive test cases from specifications - - Boundary value analysis - - Equivalence classes - - Combinatorial testing - - Random testing - -**Structural testing** ("white box"): Derive test cases to cover implementation paths - - Line coverage, branch coverage - - Control-flow, data-flow testing, MCDC, ... - -Test execution usually automated, but can be manual too; automated generation from specifications or code possible - -
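As a concrete illustration of deriving tests from the specification (see the techniques on the next slides), a sketch assuming a hypothetical python port `next_date` of the scala function above:

```python
# Sketch: specification-based tests for a hypothetical python port
# next_date(year, month, day) -> (year, month, day); the cases target
# month/year boundaries and the leap-year rules of the specification.
def test_next_date():
    assert next_date(2021, 2, 8) == (2021, 2, 9)     # ordinary day
    assert next_date(2021, 1, 31) == (2021, 2, 1)    # month boundary
    assert next_date(2021, 12, 31) == (2022, 1, 1)   # year boundary
    assert next_date(2020, 2, 28) == (2020, 2, 29)   # leap year (div. by 4)
    assert next_date(1900, 2, 28) == (1900, 3, 1)    # not leap (div. by 100)
    assert next_date(2000, 2, 28) == (2000, 2, 29)   # leap (div. by 400)
```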
- ----- -## Example: Boundary Value Testing - -Analyze the specification, not the implementation! - -**Key Insight:** Errors often occur at the boundaries of a variable value - -For each variable select (1) minimum, (2) min+1, (3) medium, (4) max-1, and (5) maximum; possibly also invalid values min-1, max+1 - -Example: `nextDate(2015, 6, 13) = (2015, 6, 14)` - - **Boundaries?** - ----- -## Example: Equivalence classes - -**Idea:** Typically many values behave similarly, but some groups of values are different - -Equivalence classes derived from specifications (e.g., cases, input ranges, error conditions, fault models) - -Example `nextDate(2015, 6, 13)` - - leap years, month with 28/30/31 days, days 1-28, 29, 30, 31 - -Pick 1 value from each group, combine groups from all variables - ----- -## Exercise - -```scala -/** Compute the price of a bus ride: - * - Children under 2 ride for free, children under 18 and - * senior citizen over 65 pay half, all others pay the - * full fare of $3. - * - On weekdays, between 7am and 9am and between 4pm and - * 7pm a peak surcharge of $1.5 is added. - * - Short trips under 5min during off-peak time are free.*/ -def busTicketPrice(age: Int, - datetime: LocalDateTime, - rideTime: Int) -``` - -*suggest test cases based on boundary value analysis and equivalence class testing* - - ----- -## Selecting Validation Data for Model Quality? - - - - - ----- -## Validation Data Representative? - -* Validation data should reflect usage data -* Be aware of data drift (face recognition during pandemic, new patterns in credit card fraud detection) -* "*Out of distribution*" predictions often low quality (it may even be worth to detect out of distribution data in production, more later) - -*(note, similar to requirements validation: did we hear all/representative stakeholders)* - - - - ----- -## Not All Inputs are Equal - -![Google Home](googlehome.jpg) - - -"Call mom" -"What's the weather tomorrow?" -"Add asafetida to my shopping list" - ----- -## Not All Inputs are Equal - -> There Is a Racial Divide in Speech-Recognition Systems, Researchers Say: -> Technology from Amazon, Apple, Google, IBM and Microsoft misidentified 35 percent of words from people who were black. White people fared much better. -- [NYTimes March 2020](https://www.nytimes.com/2020/03/23/technology/speech-recognition-bias-apple-amazon-google.html) - ----- -
- ----- -## Not All Inputs are Equal - -> some random mistakes vs rare but biased mistakes? - -* A system to detect when somebody is at the door that never works for people under 5ft (1.52m) -* A spam filter that deletes alerts from banks - - -**Consider separate evaluations for important subpopulations; monitor mistakes in production** - - - ----- -## Identify Important Inputs - -Curate Validation Data for Specific Problems and Subpopulations: -* *Regression testing:* Validation dataset for important inputs ("call mom") -- expect very high accuracy -- closest equivalent to **unit tests** -* *Uniformness/fairness testing:* Separate validation dataset for different subpopulations (e.g., accents) -- expect comparable accuracy -* *Setting goals:* Validation datasets for challenging cases or stretch goals -- accept lower accuracy - -Derive from requirements, experts, user feedback, expected problems etc. Think *specification-based testing*. - - ----- -## Important Input Groups for Cancer Prognosis? - - - - ----- -## Input Partitioning - -* Guide testing by identifying groups and analyzing accuracy of subgroups - * Often for fairness: gender, country, age groups, ... - * Possibly based on business requirements or cost of mistakes -* Slice test data by population criteria, also evaluate interactions -* Identifies problems and plan mitigations, e.g., enhance with more data for subgroup or reduce confidence - - - - -Good reading: Barash, Guy, Eitan Farchi, Ilan Jayaraman, Orna Raz, Rachel Tzoref-Brill, and Marcel Zalmanovici. "Bridging the gap between ML solutions and their business requirements using feature interactions." In Proc. Symposium on the Foundations of Software Engineering, pp. 1048-1058. 2019. - ----- -## Input Partitioning Example - -
- - -![Input partitioning example](inputpartitioning2.png) - -Input divided by movie age. Notice low accuracy, but also low support (i.e., little validation data), for old movies. - -![Input partitioning example](inputpartitioning.png) - -Input divided by genre, rating, and length. Accuracy differs, but also amount of test data used ("support") differs, highlighting low confidence areas. - - - -
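A sketch of how such a slice analysis can be computed (pandas and scikit-learn; the predictions file and column names are hypothetical):

```python
# Sketch: accuracy and support per slice of the test data, assuming a
# dataframe with hypothetical columns "genre", "label", "prediction".
import pandas as pd
from sklearn.metrics import accuracy_score

test_df = pd.read_csv("test_predictions.csv")  # hypothetical file

for genre, slice_df in test_df.groupby("genre"):
    acc = accuracy_score(slice_df["label"], slice_df["prediction"])
    # report support too: low-support slices give low-confidence estimates
    print(f"{genre}: accuracy={acc:.2f}, support={len(slice_df)}")
```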
- - - -Source: Barash, Guy, et al. "Bridging the gap between ML solutions and their business requirements using feature interactions." In Proc. FSE, 2019. - ----- -## Input Partitioning Discussion - -**How to slice evaluation data for cancer prognosis?** - - - - ----- -## Example: Model Impr. at Apple (Overton) - -![Overton system](overton.png) - - - - - -Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. "[Overton: A Data System for Monitoring and Improving Machine-Learned Products](https://arxiv.org/abs/1909.05372)." arXiv preprint arXiv:1909.05372 (2019). - - ----- -## Example: Model Improvement at Apple (Overton) - -* Focus engineers on creating training and validation data, not on model search (AutoML) -* Flexible infrastructure to slice telemetry data to identify underperforming subpopulations -> focus on creating better training data (better, more labels, in semi-supervised learning setting) - - - - -Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. "[Overton: A Data System for Monitoring and Improving Machine-Learned Products](https://arxiv.org/abs/1909.05372)." arXiv preprint arXiv:1909.05372 (2019). - - - - - - - ---- -# Testing Model Capabilities - - -![Checklist](checklist.jpg) - - - - - -Further reading: Christian Kaestner. [Rediscovering Unit Testing: Testing Capabilities of ML Models](https://towardsdatascience.com/rediscovering-unit-testing-testing-capabilities-of-ml-models-b008c778ca81). Toward Data Science, 2021. - - - ----- -## Testing Capabilities - -
- -Are there "concepts" or "capabilities" the model should learn? - -Example capabilities of sentiment analysis: -* Handle *negation* -* Robustness to *typos* -* Ignore synonyms and abbreviations -* Person and location names are irrelevant -* Ignore gender -* ... - -For each capability create specific test set (multiple examples) - -
- - - -Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "[Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~wtshuang/static/papers/2020-acl-checklist.pdf)." In Proceedings ACL, p. 4902–4912. (2020). - - ----- -## Testing Capabilities - -![Examples of Capabilities from Checklist Paper](capabilities1.png) - - - - -From: Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "[Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~wtshuang/static/papers/2020-acl-checklist.pdf)." In Proceedings ACL, p. 4902–4912. (2020). - - ----- -## Testing Capabilities - -![Examples of Capabilities from Checklist Paper](capabilities2.png) - - - - -From: Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "[Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~wtshuang/static/papers/2020-acl-checklist.pdf)." In Proceedings ACL, p. 4902–4912. (2020). - ----- -## Examples of Capabilities - -**What could be capabilities of the cancer classifier?** - -![radiology](radiology.jpg) - ----- -## Capabilities vs Specifications vs Slicing - - - ----- -## Capabilities vs Specifications vs Slicing - -Capabilities are partial specifications of expected behavior (not expected to always hold) - -Some capabilities correspond to slices of existing test data, for others we may need to create new data - ----- -## Recall: Is it fair to expect generalization beyond training distribution? - - -![](radiology-distribution.png) - - -*Shall a cancer detector generalize to other hospitals? Shall image captioning generalize to describing pictures of star formations?* - -Note: We wouldn't test a first year elementary school student on high-school math. This would be "out of the training distribution" - ----- -## Recall: Shortcut Learning - -![Shortcut learning illustration from paper below](shortcutlearning.png) - - - -Figure from: Geirhos, Robert, et al. "[Shortcut learning in deep neural networks](https://arxiv.org/abs/2004.07780)." Nature Machine Intelligence 2, no. 11 (2020): 665-673. - ----- -## More Shortcut Learning :) - -![Cows with different backgrounds](shortcutlearning-cows.png) - - -Figure from Beery, Sara, Grant Van Horn, and Pietro Perona. “Recognition in terra incognita.” In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473. 2018. - ----- -## Generalization beyond Training Distribution? - -
- -* Typically training and validation data from same distribution (i.i.d. assumption!) -* Many models can achieve similar accuracy -* Models that learn "right" abstractions possibly indistinguishable from models that use shortcuts - - see tank detection example - - Can we guide the model towards "right" abstractions? -* Some models generalize better to other distributions not used in training - - e.g., cancer images from other hospitals, from other populations - - Drift and attacks, ... - -
- - - -See discussion in D'Amour, Alexander, et al. "[Underspecification presents challenges for credibility in modern machine learning](https://arxiv.org/abs/2011.03395)." arXiv preprint arXiv:2011.03395 (2020). - ----- -## Hypothesis: Testing Capabilities may help with Generalization - -* Capabilities are "partial specifications", given beyond training data -* Encode domain knowledge of the problem - * Capabilities are inherently domain specific - * Curate capability-specific test data for a problem -* Testing for capabilities helps to distinguish models that use intended abstractions -* May help find models that generalize better - - - - - -See discussion in D'Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen et al. "[Underspecification presents challenges for credibility in modern machine learning](https://arxiv.org/abs/2011.03395)." arXiv preprint arXiv:2011.03395 (2020). - ----- - -## Strategies for identifying capabilities - -* Analyze common mistakes (e.g., classify past mistakes in cancer prognosis) -* Use existing knowledge about the problem (e.g., linguistics theories) -* Observe humans (e.g., how do radiologists look for cancer) -* Derive from requirements (e.g., fairness) -* Causal discovery from observational data? - - - -Further reading: Christian Kaestner. [Rediscovering Unit Testing: Testing Capabilities of ML Models](https://towardsdatascience.com/rediscovering-unit-testing-testing-capabilities-of-ml-models-b008c778ca81). Toward Data Science, 2021. - - - ----- -## Examples of Capabilities - -**What could be capabilities of image captioning system?** - -![Image captioning task](imgcaptioning.png) - - - ----- -## Generating Test Data for Capabilities - -**Idea 1: Domain-specific generators** - -Testing *negation* in sentiment analysis with template:
-`I {NEGATION} {POS_VERB} the {THING}.`
-
-Testing texture vs shape priority with artificially generated images:
-![Texture vs shape example](texturevsshape.png)
-
-
-Figure from Geirhos, Robert, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” In Proc. International Conference on Learning Representations (ICLR), (2019).
-
-----
-## Generating Test Data for Capabilities
-
-**Idea 2: Mutating existing inputs**
-
-Testing *synonyms* in sentiment analysis by replacing words with synonyms, keeping the label
-
-Testing *robustness to noise and distraction* by adding `and false is not true` or random URLs to the text
-
-![Examples of Capabilities from Checklist Paper](capabilities1.png)
-
-
-Figure from: Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "[Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~wtshuang/static/papers/2020-acl-checklist.pdf)." In Proceedings ACL, p. 4902–4912. (2020).
-
-----
-## Generating Test Data for Capabilities
-
-**Idea 3: Crowd-sourcing test creation**
-
-Testing *sarcasm* in sentiment analysis: Ask humans to minimally change text to flip sentiment with sarcasm
-
-Testing *background* in object detection: Ask humans to take pictures of specific objects with unusual backgrounds
-
-![Example of modifications to text](sarcasm.png)
-
-
-Figure from: Kaushik, Divyansh, Eduard Hovy, and Zachary C. Lipton. “Learning the difference that makes a difference with counterfactually-augmented data.” In Proc. International Conference on Learning Representations (ICLR), (2020).
-
-----
-## Generating Test Data for Capabilities
-
-**Idea 4: Slicing test data**
-
-Testing *negation* in sentiment analysis by finding sentences containing 'not'
-
-![Overton system](overton.png)
-
-
-Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. "[Overton: A Data System for Monitoring and Improving Machine-Learned Products](https://arxiv.org/abs/1909.05372)." arXiv preprint arXiv:1909.05372 (2019).
-
-----
-## Examples of Capabilities
-
-**How to generate test data for capabilities of the cancer classifier?**
-
-![radiology](radiology.jpg)
-
-----
-## Testing vs Training Capabilities
-
-* Dual insight for testing and training
-* Strategies for curating test data can also help select training data
-* Generate capability-specific training data to guide training (data augmentation)
-
-
-Further reading on using domain knowledge during training: Von Rueden, Laura, Sebastian Mayer, Jochen Garcke, Christian Bauckhage, and Jannis Schuecker. "Informed machine learning – towards a taxonomy of explicit integration of knowledge into machine learning." arXiv preprint (2019).
-
-----
-## Preliminary Summary: Specification-Based Testing Techniques as Inspiration
-
-* Boundary value analysis
-* Partition testing & equivalence classes
-* Combinatorial testing
-* Decision tables
-
-Use to identify datasets for **subpopulations** and **capabilities**, not individual tests.
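-
-To make the generator idea concrete: a template like the negation example above can be expanded by enumerating combinations of placeholder values, similar to combinatorial testing over equivalence classes (a sketch; the word lists are illustrative):
-
-```python
-# Sketch: expanding the template "I {NEGATION} {POS_VERB} the {THING}."
-# into labeled test inputs; the word lists are illustrative placeholders.
-from itertools import product
-
-NEGATION = ["didn't", "don't"]
-POS_VERB = ["like", "love", "enjoy"]
-THING = ["food", "movie", "service"]
-
-negation_examples = [
-    (f"I {neg} {verb} the {thing}.", "negative")
-    for neg, verb, thing in product(NEGATION, POS_VERB, THING)
-]
-print(len(negation_examples))  # 2 * 3 * 3 = 18 labeled inputs
-```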
-
-----
-## On Terminology
-
-* Test data curation is an emerging concept for testing ML components
-* No consistent terminology
-  - "Testing capabilities" in the CheckList paper
-  - "Stress testing" in some others (but stress testing has a very different meaning in software testing: robustness to overload)
-* Software engineering concepts translate, but names not adopted in the ML community
-  - specification-based testing, black-box testing
-  - equivalence class testing, boundary-value analysis
-
-![Random letters](../_assets/onterminology.jpg)
-
----
-# Automated (Random) Testing and Invariants
-
-(if it wasn't for that darn oracle problem)
-
-![Random dice throw](random.jpg)
-
-----
-## Random Test Input Generation is Easy
-
-```java
-@Test
-void testNextDate() {
-  nextDate(488867101, 1448338253, -997372169);
-  nextDate(2105943235, 1952752454, 302127018);
-  nextDate(1710531330, -127789508, 1325394033);
-  nextDate(-1512900479, -439066240, 889256112);
-  nextDate(1853057333, 1794684858, 1709074700);
-  nextDate(-1421091610, 151976321, 1490975862);
-  nextDate(-2002947810, 680830113, -1482415172);
-  nextDate(-1907427993, 1003016151, -2120265967);
-}
-```
-
-**But is it useful?**
-
-----
-## Cancer in Random Image?
-
-![](white-noise.jpg)
-
-----
-## Randomly Generating "Realistic" Inputs is Possible
-
-```java
-@Test
-void testNextDate() {
-  nextDate(2010, 8, 20);
-  nextDate(2024, 7, 15);
-  nextDate(2011, 10, 27);
-  nextDate(2024, 5, 4);
-  nextDate(2013, 8, 27);
-  nextDate(2010, 2, 30);
-}
-```
-
-**But how do we know whether the computation is correct?**
-
-----
-## Automated Model Validation Data Generation?
-
-```java
-@Test
-void testCancerPrediction() {
-  cancerModel.predict(generateRandomImage());
-  cancerModel.predict(generateRandomImage());
-  cancerModel.predict(generateRandomImage());
-}
-```
-
-* **Realistic inputs?**
-* **But how do we get labels?**
-
-----
-## The Oracle Problem
-
-*How do we know the expected output of a test?*
-
-```java
-assertEquals(??, factorPrime(15485863));
-```
-
-----
-## Test Case Generation & The Oracle Problem
-
- -* Manually construct input-output pairs (does not scale, cannot automate) -* Comparison against gold standard (e.g., alternative implementation, executable specification) -* Checking of global properties only -- crashes, buffer overflows, code injections -* Manually written assertions -- partial specifications checked at runtime - -
-
-![Solving the Oracle Problem with Gold Standard or Assertions](oracle.svg)
-
-----
-## Manually constructing outputs
-
-```java
-@Test
-void testNextDate() {
-  assert nextDate(2010, 8, 20) == (2010, 8, 21);
-  assert nextDate(2024, 7, 15) == (2024, 7, 16);
-  assert nextDate(2010, 2, 30) throws InvalidInputException;
-}
-```
-
-```java
-@Test
-void testCancerPrediction() {
-  assert cancerModel.predict(loadImage("random1.jpg")) == true;
-  assert cancerModel.predict(loadImage("random2.jpg")) == true;
-  assert cancerModel.predict(loadImage("random3.jpg")) == false;
-}
-```
-
-*(tedious, labor intensive; possibly crowd-sourced)*
-
-----
-## Compare against reference implementation
-
-**assuming we have a correct implementation**
-
-```java
-@Test
-void testNextDate() {
-  assert nextDate(2010, 8, 20) == referenceLib.nextDate(2010, 8, 20);
-  assert nextDate(2024, 7, 15) == referenceLib.nextDate(2024, 7, 15);
-  assert nextDate(2010, 2, 30) == referenceLib.nextDate(2010, 2, 30);
-}
-```
-
-```java
-@Test
-void testCancerPrediction() {
-  assert cancerModel.predict(loadImage("random1.jpg")) == ???;
-}
-```
-
-*(usually no reference implementation for ML problems)*
-
-----
-## Checking global specifications
-
-**Ensure no computation crashes**
-
-```java
-@Test
-void testNextDate() {
-  nextDate(2010, 8, 20);
-  nextDate(2024, 7, 15);
-  nextDate(2010, 2, 30);
-}
-```
-
-```java
-@Test
-void testCancerPrediction() {
-  cancerModel.predict(generateRandomImage());
-  cancerModel.predict(generateRandomImage());
-  cancerModel.predict(generateRandomImage());
-}
-```
-
-*(we usually do not fear crashing bugs in ML models)*
-
-----
-## Invariants as partial specification
-
-```java
-class Stack {
-  int size = 0;
-  int MAX_SIZE = 100;
-  String[] data = new String[MAX_SIZE];
-  // class invariant checked before and after every method
-  private void check() {
-    assert(size>=0 && size<=MAX_SIZE);
-  }
-  public void push(String v) {
-    check();
-    if (size < MAX_SIZE) {
-      data[size] = v;
-      size++;
-    }
-    check();
-  }
-}
-```
-
-----
-
-Code:
-```java
-void foo(a, b, c) {
-  int x=0, y=0, z=0;
-  if (a) x=-2;
-  if (b<5) {
-    if (!a && c) y=1;
-    z=2;
-  }
-  assert(x+y+z!=3)
-}
-```
-
-Paths:
-* $a\wedge (b<5)$: x=-2, y=0, z=2
-* $a\wedge\neg (b<5)$: x=-2, y=0, z=0
-* $\neg a\wedge (b<5)\wedge(\neg a\wedge c)$: x=0, y=1, z=2 (assertion violated: x+y+z=3)
-* $\neg a\wedge (b<5)\wedge\neg (\neg a\wedge c)$: x=0, y=0, z=2
-* $\neg a\wedge\neg (b<5)$: x=0, y=0, z=0
-
-Note: example source: http://web.cs.iastate.edu/~weile/cs641/9.SymbolicExecution.pdf
-
-----
-## Generating Inputs for ML Problems
-
-* Completely random data generation (uniform sampling from each feature's domain)
-* Using knowledge about feature distributions (sample from each feature's distribution)
-* Knowledge about dependencies among features and whole population distribution (e.g., model with probabilistic programming language)
-* Mutate from existing inputs (e.g., small random modifications to selected features)
-* Generate "fake data" with Generative Adversarial Networks
-
-----
-## ML Models = Untestable Software?
-
-
-```java
-@Test
-void testCancerPrediction() {
-  cancerModel.predict(generateRandomImage());
-}
-```
-
-* Manually construct input-output pairs (does not scale, cannot automate)
-  - **too expensive at scale**
-* Comparison against gold standard (e.g., alternative implementation, executable specification)
-  - **no specification, usually no other "correct" model**
-  - comparing different techniques useful? (see ensemble learning)
-  - semi-supervised learning as approximation?
-* Checking of global properties only -- crashes, buffer overflows, code injections -- **??**
-* Manually written assertions -- partial specifications checked at runtime -- **??**
-
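-
-One pragmatic stand-in for a gold standard is differential testing against a second model trained with a different technique; disagreements are not failures, but candidates for manual labeling and review (a sketch under that assumption; both model interfaces are hypothetical):
-
-```python
-# Sketch: differential testing of two models; inputs on which they
-# disagree are flagged for human review rather than judged automatically.
-def find_disagreements(model_a, model_b, inputs):
-    return [x for x in inputs
-            if model_a.predict(x) != model_b.predict(x)]
-```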
-
-
-----
-## Invariants in Machine Learned Models (Metamorphic Testing)
-
-Exploit relationships between inputs
-
-* If two inputs differ only in **X**, the output should be the same
-* If inputs differ in **Y**, the output should be flipped
-* If inputs differ only in feature **F**, the prediction for the input with higher F should be higher
-* ...
-
-----
-## Invariants in Machine Learned Models?
-
-----
-## Some Capabilities are Invariants
-
-**Some capability tests can be expressed as invariants and automatically encoded as transformations to existing test data**
-
-* Negation should flip the sentiment analysis result
-* Typos should not affect the sentiment analysis result
-* Changes to locations or names should not affect sentiment analysis results
-
-![Examples of NLP capability tests](capabilities1.png)
-
-
-From: Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "[Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~wtshuang/static/papers/2020-acl-checklist.pdf)." In Proceedings ACL, p. 4902–4912. (2020).
-
-----
-## Examples of Invariants
-
- - -* Credit rating should not depend on gender: - - $\forall x. f(x[\text{gender} \leftarrow \text{male}]) = f(x[\text{gender} \leftarrow \text{female}])$ -* Synonyms should not change the sentiment of text: - - $\forall x. f(x) = f(\texttt{replace}(x, \text{"is not", "isn't"}))$ -* Negation should swap meaning: - - $\forall x \in \text{"X is Y"}. f(x) = 1-f(\texttt{replace}(x, \text{" is ", " is not "}))$ -* Robustness around training data: - - $\forall x \in \text{training data}. \forall y \in \text{mutate}(x, \delta). f(x) = f(y)$ -* Low credit scores should never get a loan (sufficient conditions for classification, "anchors"): - - $\forall x. x.\text{score} < 649 \Rightarrow \neg f(x)$ - -Identifying invariants requires domain knowledge of the problem! - -
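-
-Such invariants are directly executable over any pool of unlabeled inputs. For example, the gender invariant above might be checked like this (a minimal sketch; the `model.predict` interface and the dictionary encoding of inputs are assumptions):
-
-```python
-# Sketch: checking f(x[gender <- male]) == f(x[gender <- female])
-# on a sample of inputs; no labels are needed.
-def check_gender_invariant(model, applicants):
-    violations = []
-    for x in applicants:
-        male = dict(x, gender="male")
-        female = dict(x, gender="female")
-        if model.predict(male) != model.predict(female):
-            violations.append(x)
-    return violations
-```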
-
-----
-## Metamorphic Testing
-
-Formal description of relationships among inputs and outputs (*Metamorphic Relations*)
-
-In general, for a model $f$ and inputs $x$, define two functions to transform inputs and outputs, $g\_I$ and $g\_O$, such that:
-
-$\forall x. f(g\_I(x)) = g\_O(f(x))$
-
-e.g., $g\_I(x)= \texttt{replace}(x, \text{" is ", " is not "})$ and $g\_O(x)=\neg x$
-
-----
-## On Testing with Invariants/Assertions
-
-* Defining good metamorphic relations requires knowledge of the problem domain
-* Good metamorphic relations focus on parts of the system
-* Invariants usually cover only one aspect of correctness -- maybe capabilities
-* Invariants and near-invariants can be mined automatically from sample data (see *specification mining* and *anchors*)
-
-Further reading:
-* Segura, Sergio, Gordon Fraser, Ana B. Sanchez, and Antonio Ruiz-Cortés. "[A survey on metamorphic testing](https://core.ac.uk/download/pdf/74235918.pdf)." IEEE Transactions on Software Engineering 42, no. 9 (2016): 805-824.
-* Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "[Anchors: High-precision model-agnostic explanations](https://sameersingh.org/files/papers/anchors-aaai18.pdf)." In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
-
-----
-## Invariant Checking aligns with Requirements Validation
-
-![Machine Learning Validation vs Verification](mlvalidation.png)
-
-----
-## Approaches for Checking Invariants
-
-* Generating test data (random, distributions) is usually easy
-* Transformations of existing test data
-* Adversarial learning: for many techniques, gradient-based search for invariant violations -- roughly analogous to symbolic execution in SE
-* Early work on formally verifying invariants for certain models (e.g., small deep neural networks)
-
-Further readings:
-Singh, Gagandeep, Timon Gehr, Markus Püschel, and Martin Vechev. "[An abstract domain for certifying neural networks](https://dl.acm.org/doi/pdf/10.1145/3290354)." Proceedings of the ACM on Programming Languages 3, no. POPL (2019): 1-30.
-
-----
-## Using Invariant Violations
-
-* Are invariants strict?
-  * A single violation in random inputs is usually not meaningful
-  * In capability testing, average accuracy on realistic data is needed
-  * Maybe strict requirements for fairness or robustness?
-* Do invariant violations matter if the input data is not representative?
-
----
-# Simulation-Based Testing
-
-![Driving a simulator](simulationdriving.jpg)
-
-----
-## One More Thing: Simulation-Based Testing
-
-In some cases it is easy to go from outputs to inputs:
-
-```java
-assertEquals(??, factorPrime(15485862));
-```
-
-```java
-randomNumbers = [2, 3, 7, 7, 52673]
-assertEquals(randomNumbers,
-    factorPrime(multiply(randomNumbers)));
-```
-
-**Similar idea in machine-learning problems?**
-
-----
-## One More Thing: Simulation-Based Testing
-
- - - -* Derive input-output pairs from simulation, esp. in vision systems -* Example: Vision for self-driving cars: - * Render scene -> add noise -> recognize -> compare recognized result with simulator state -* Quality depends on quality of simulator: - * examples: render picture/video, synthesize speech, ... - * Less suitable where input-output relationship unknown, e.g., cancer prognosis, housing price prediction - -![Simulation is the inverse of prediction](simulationbased-testing.svg) - - - - -
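-
-A minimal sketch of this loop, assuming hypothetical `render_scene` and `detect_lanes` functions: the simulator state serves as the oracle, so no manual labels are needed.
-
-```python
-# Sketch: simulation-based oracle -- derive the input from a known
-# expected output, then check whether the model recovers it.
-import random
-
-def lane_detection_error_rate(render_scene, detect_lanes, n=100):
-    failures = 0
-    for _ in range(n):
-        true_lanes = random.randint(1, 4)       # simulator state (oracle)
-        image = render_scene(lanes=true_lanes)  # render scene, incl. noise
-        if detect_lanes(image) != true_lanes:   # compare with state
-            failures += 1
-    return failures / n
-```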
-
-
-
-Further readings: Zhang, Mengshi, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In Proc. ASE. 2018.
-
-----
-## Preliminary Summary: Invariants and Generation
-
-* Generating sample inputs is easy, but knowing corresponding outputs is not (oracle problem)
-* Crashing bugs are not a concern
-* Invariants + generated data can check capabilities or properties (metamorphic testing)
-  - Inputs can be generated realistically or to find violations (adversarial learning)
-* If inputs can be computed from outputs, tests can be automated (simulation-based testing)
-
-----
-## On Terminology
-
-**Metamorphic testing** is an academic software engineering term that is not common in the ML literature; it generalizes many concepts that are regularly reinvented
-
-Much of the security, safety, and robustness literature in ML focuses on invariants
-
-![Random letters](../_assets/onterminology.jpg)
-
----
-# Other Testing Concepts
-
-----
-## Test Coverage
-
-![](coverage.png)
-
-----
-## Example: Structural testing
-
-```java
-int divide(int A, int B) {
-  if (A==0)
-    return 0;
-  if (B==0)
-    return -1;
-  return A / B;
-}
-```
-
-*minimum set of test cases to cover all lines? all decisions? all paths?*
-
-![](coverage.png)
-
-----
-## Defining Structural Testing ("white box")
-
-* Test case creation is driven by the implementation, not the specification
-* Typically aiming to increase coverage of lines, decisions, etc.
-* Automated test generation often driven by maximizing coverage (for finding crashing bugs)
-
-----
-## Whitebox Analysis in ML
-
-* Several coverage metrics have been proposed
-  - All paths of a decision tree?
-  - All neurons activated at least once in a DNN? (several papers on "neuron coverage")
-  - Linear regression models??
-* Often create artificial inputs, not realistic for the distribution
-* Unclear whether those are useful
-* Adversarial learning techniques usually more efficient at finding invariant violations
-
-----
-## Regression Testing
-
-* Whenever a bug is detected and fixed, add a test case
-* Make sure the bug is not reintroduced later
-* Execute test suite after changes to detect regressions
-  - Ideally automatically with continuous integration tools
-* Maps well to curating test sets for important populations in ML
-
-----
-## Mutation Analysis
-
-* Start with program and passing test suite
-* Automatically insert small modifications ("mutants") in the source code
-  - `a+b` -> `a-b`
-  - `a<b` -> `a<=b`
-  - ...
-* Can the test suite detect the modifications ("kill the mutant")?
-* Better test suites detect more modifications ("mutation score")
-
-```java
-int divide(int A, int B) {
-  if (A==0)     // A!=0, A<0, B==0
-    return 0;   // 1, -1
-  if (B==0)     // B!=0, B==1
-    return -1;  // 0, -2
-  return A / B; // A*B, A+B
-}
-assert(1, divide(1,1));
-assert(0, divide(0,1));
-assert(-1, divide(1,0));
-```
-
-----
-## Mutation Analysis
-
-* Some papers exist, but the strategy is unclear
-* Mutating model parameters? Mutating hyperparameters? Mutating inputs?
-* What counts as killing a mutant if we don't have specifications?
-* Still unclear application...
-
----
-# Continuous Integration for Model Quality
-
-[![Uber's internal dashboard](uber-dashboard.png)](https://eng.uber.com/michelangelo/)
-
-----
-## Continuous Integration
-
-![](ci.png)
-
-----
-## Continuous Integration for Model Quality?
-
-----
-## Continuous Integration for Model Quality
-
-
-* Testing script
-  * Existing model: Automatically evaluate model on a labeled test set; multiple separate evaluation sets possible, e.g., for slicing, regressions
-  * Training model: Automatically train and evaluate model, possibly using cross-validation; many ML libraries provide built-in support
-  * Report accuracy, recall, etc. in console output or log files
-  * May deploy learning and evaluation tasks to cloud services
-  * Optionally: Fail test below bound (e.g., accuracy < .9; accuracy < last accuracy)
-* Version control test data, model, and test scripts, ideally also learning data and learning code (feature extraction, modeling, ...)
-* Continuous integration tool can trigger test script and parse output, plot for comparisons (e.g., similar to performance tests)
-* Optionally: Continuous deployment to production server
-
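-
-A minimal evaluation script that a CI tool could run might look like this (a sketch; the file paths and the accuracy bound are illustrative):
-
-```python
-# Sketch: CI evaluation script -- loads model and labeled evaluation data,
-# prints accuracy for the CI tool to parse, fails the build below a bound.
-import pickle
-import sys
-from sklearn.metrics import accuracy_score
-
-def main():
-    with open("model.pkl", "rb") as f:
-        model = pickle.load(f)
-    with open("eval_data.pkl", "rb") as f:
-        X, y = pickle.load(f)
-    acc = accuracy_score(y, model.predict(X))
-    print(f"accuracy={acc:.3f}")  # console output parsed/plotted by CI tool
-    if acc < 0.9:                 # illustrative bound
-        sys.exit(1)               # nonzero exit marks the build as failed
-
-if __name__ == "__main__":
-    main()
-```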
-
-----
-## Dashboards for Model Evaluation Results
-
-[![Uber's internal dashboard](uber-dashboard.png)](https://eng.uber.com/michelangelo/)
-
-Jeremy Hermann and Mike Del Balso. [Meet Michelangelo: Uber’s Machine Learning Platform](https://eng.uber.com/michelangelo/). Blog, 2017
-
-----
-## Specialized CI Systems
-
-![Ease.ml/ci](easeml.png)
-
-Renggli et al., [Continuous Integration of Machine Learning Models with ease.ml/ci: Towards a Rigorous Yet Practical Treatment](http://www.sysml.cc/doc/2019/162.pdf), SysML 2019
-
-----
-## Dashboards for Comparing Models
-
-![MLflow UI](mlflow-web-ui.png)
-
-Matei Zaharia. [Introducing MLflow: an Open Source Machine Learning Platform](https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html), 2018
-
----
-# Summary
-
- -Curating test data - - Analyzing specifications, capabilities - - Not all inputs are equal: Identify important inputs (inspiration from specification-based testing) - - Slice data for evaluation - - Identifying capabilities and generating relevant tests - -Automated random testing - - Feasible with invariants (e.g. metamorphic relations) - - Sometimes possible with simulation - -Automate the test execution with continuous integration - -
- ---- -# Further readings - -
- - -* Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "[Semantically equivalent adversarial rules for debugging NLP models](https://www.aclweb.org/anthology/P18-1079.pdf)." In Proc. ACL, pp. 856-865. 2018. -* Barash, Guy, Eitan Farchi, Ilan Jayaraman, Orna Raz, Rachel Tzoref-Brill, and Marcel Zalmanovici. "[Bridging the gap between ML solutions and their business requirements using feature interactions](https://dl.acm.org/doi/abs/10.1145/3338906.3340442)." In Proc. FSE, pp. 1048-1058. 2019. -* Ashmore, Rob, Radu Calinescu, and Colin Paterson. "[Assuring the machine learning lifecycle: Desiderata, methods, and challenges](https://arxiv.org/abs/1905.04223)." arXiv preprint arXiv:1905.04223. 2019. -* Christian Kaestner. [Rediscovering Unit Testing: Testing Capabilities of ML Models](https://towardsdatascience.com/rediscovering-unit-testing-testing-capabilities-of-ml-models-b008c778ca81). Toward Data Science, 2021. -* D'Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen et al. "[Underspecification presents challenges for credibility in modern machine learning](https://arxiv.org/abs/2011.03395)." *arXiv preprint arXiv:2011.03395* (2020). -* Segura, Sergio, Gordon Fraser, Ana B. Sanchez, and Antonio Ruiz-Cortés. "[A survey on metamorphic testing](https://core.ac.uk/download/pdf/74235918.pdf)." IEEE Transactions on software engineering 42, no. 9 (2016): 805-824. - -
- diff --git a/lectures/07_modeltesting/oracle.svg b/lectures/07_modeltesting/oracle.svg deleted file mode 100644 index 5d0d3832..00000000 --- a/lectures/07_modeltesting/oracle.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/07_modeltesting/overton.png b/lectures/07_modeltesting/overton.png deleted file mode 100644 index a3ccce38..00000000 Binary files a/lectures/07_modeltesting/overton.png and /dev/null differ diff --git a/lectures/07_modeltesting/radiology-distribution.png b/lectures/07_modeltesting/radiology-distribution.png deleted file mode 100644 index ad7f5375..00000000 Binary files a/lectures/07_modeltesting/radiology-distribution.png and /dev/null differ diff --git a/lectures/07_modeltesting/radiology.jpg b/lectures/07_modeltesting/radiology.jpg deleted file mode 100644 index 5bc31795..00000000 Binary files a/lectures/07_modeltesting/radiology.jpg and /dev/null differ diff --git a/lectures/07_modeltesting/random.jpg b/lectures/07_modeltesting/random.jpg deleted file mode 100644 index 5ec8eded..00000000 Binary files a/lectures/07_modeltesting/random.jpg and /dev/null differ diff --git a/lectures/07_modeltesting/sarcasm.png b/lectures/07_modeltesting/sarcasm.png deleted file mode 100644 index eb58e9c8..00000000 Binary files a/lectures/07_modeltesting/sarcasm.png and /dev/null differ diff --git a/lectures/07_modeltesting/shortcutlearning-cows.png b/lectures/07_modeltesting/shortcutlearning-cows.png deleted file mode 100644 index 4ce89b21..00000000 Binary files a/lectures/07_modeltesting/shortcutlearning-cows.png and /dev/null differ diff --git a/lectures/07_modeltesting/shortcutlearning.png b/lectures/07_modeltesting/shortcutlearning.png deleted file mode 100644 index c522e6b4..00000000 Binary files a/lectures/07_modeltesting/shortcutlearning.png and /dev/null differ diff --git a/lectures/07_modeltesting/simulationbased-testing.svg b/lectures/07_modeltesting/simulationbased-testing.svg deleted file mode 100644 index 5db26eb8..00000000 --- a/lectures/07_modeltesting/simulationbased-testing.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/07_modeltesting/simulationdriving.jpg b/lectures/07_modeltesting/simulationdriving.jpg deleted file mode 100644 index 6716946f..00000000 Binary files a/lectures/07_modeltesting/simulationdriving.jpg and /dev/null differ diff --git a/lectures/07_modeltesting/slices.jpg b/lectures/07_modeltesting/slices.jpg deleted file mode 100644 index e4ac6a08..00000000 Binary files a/lectures/07_modeltesting/slices.jpg and /dev/null differ diff --git a/lectures/07_modeltesting/texturevsshape.png b/lectures/07_modeltesting/texturevsshape.png deleted file mode 100644 index d0da33f7..00000000 Binary files a/lectures/07_modeltesting/texturevsshape.png and /dev/null differ diff --git a/lectures/07_modeltesting/uber-dashboard.png b/lectures/07_modeltesting/uber-dashboard.png deleted file mode 100644 index 381ea6c5..00000000 Binary files a/lectures/07_modeltesting/uber-dashboard.png and /dev/null differ diff --git a/lectures/07_modeltesting/white-noise.jpg b/lectures/07_modeltesting/white-noise.jpg deleted file mode 100644 index 97d15bd7..00000000 Binary files a/lectures/07_modeltesting/white-noise.jpg and /dev/null differ diff --git a/lectures/07_modeltesting/xkcd1838.png b/lectures/07_modeltesting/xkcd1838.png deleted file mode 100644 index 38c4c1e5..00000000 Binary files a/lectures/07_modeltesting/xkcd1838.png and /dev/null differ diff --git a/lectures/08_architecture/adversarial.png 
b/lectures/08_architecture/adversarial.png deleted file mode 100644 index 497992d8..00000000 Binary files a/lectures/08_architecture/adversarial.png and /dev/null differ diff --git a/lectures/08_architecture/architectures.png b/lectures/08_architecture/architectures.png deleted file mode 100644 index 85d8b873..00000000 Binary files a/lectures/08_architecture/architectures.png and /dev/null differ diff --git a/lectures/08_architecture/credit-card.jpg b/lectures/08_architecture/credit-card.jpg deleted file mode 100644 index d843a542..00000000 Binary files a/lectures/08_architecture/credit-card.jpg and /dev/null differ diff --git a/lectures/08_architecture/decisiontreeexample-full.png b/lectures/08_architecture/decisiontreeexample-full.png deleted file mode 100644 index 7cc0bc0e..00000000 Binary files a/lectures/08_architecture/decisiontreeexample-full.png and /dev/null differ diff --git a/lectures/08_architecture/decisiontreeexample.png b/lectures/08_architecture/decisiontreeexample.png deleted file mode 100644 index 58c7973e..00000000 Binary files a/lectures/08_architecture/decisiontreeexample.png and /dev/null differ diff --git a/lectures/08_architecture/design-space.svg b/lectures/08_architecture/design-space.svg deleted file mode 100644 index f945ca6e..00000000 --- a/lectures/08_architecture/design-space.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/design.png b/lectures/08_architecture/design.png deleted file mode 100644 index 7b73afb6..00000000 Binary files a/lectures/08_architecture/design.png and /dev/null differ diff --git a/lectures/08_architecture/fashion_mnist.png b/lectures/08_architecture/fashion_mnist.png deleted file mode 100644 index 213b1e1f..00000000 Binary files a/lectures/08_architecture/fashion_mnist.png and /dev/null differ diff --git a/lectures/08_architecture/gizzard.png b/lectures/08_architecture/gizzard.png deleted file mode 100644 index af77c1cb..00000000 Binary files a/lectures/08_architecture/gizzard.png and /dev/null differ diff --git a/lectures/08_architecture/information-hiding.png b/lectures/08_architecture/information-hiding.png deleted file mode 100644 index afbf8eac..00000000 Binary files a/lectures/08_architecture/information-hiding.png and /dev/null differ diff --git a/lectures/08_architecture/lane-detect.jpg b/lectures/08_architecture/lane-detect.jpg deleted file mode 100644 index 32df2433..00000000 Binary files a/lectures/08_architecture/lane-detect.jpg and /dev/null differ diff --git a/lectures/08_architecture/lane.jpg b/lectures/08_architecture/lane.jpg deleted file mode 100644 index b4b189fa..00000000 Binary files a/lectures/08_architecture/lane.jpg and /dev/null differ diff --git a/lectures/08_architecture/linear-regression.png b/lectures/08_architecture/linear-regression.png deleted file mode 100644 index 5f9defe8..00000000 Binary files a/lectures/08_architecture/linear-regression.png and /dev/null differ diff --git a/lectures/08_architecture/ml-methods-poll.jpg b/lectures/08_architecture/ml-methods-poll.jpg deleted file mode 100644 index 94414ff8..00000000 Binary files a/lectures/08_architecture/ml-methods-poll.jpg and /dev/null differ diff --git a/lectures/08_architecture/mlperceptron.svg b/lectures/08_architecture/mlperceptron.svg deleted file mode 100644 index 69feea0c..00000000 --- a/lectures/08_architecture/mlperceptron.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/netflix-leaderboard.png 
b/lectures/08_architecture/netflix-leaderboard.png deleted file mode 100644 index fd264669..00000000 Binary files a/lectures/08_architecture/netflix-leaderboard.png and /dev/null differ diff --git a/lectures/08_architecture/neur_logic.svg b/lectures/08_architecture/neur_logic.svg deleted file mode 100644 index dbc62145..00000000 --- a/lectures/08_architecture/neur_logic.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/neural-network.png b/lectures/08_architecture/neural-network.png deleted file mode 100644 index 00cc3d8b..00000000 Binary files a/lectures/08_architecture/neural-network.png and /dev/null differ diff --git a/lectures/08_architecture/nfps.png b/lectures/08_architecture/nfps.png deleted file mode 100644 index 88dfb163..00000000 Binary files a/lectures/08_architecture/nfps.png and /dev/null differ diff --git a/lectures/08_architecture/not-dl.jpg b/lectures/08_architecture/not-dl.jpg deleted file mode 100644 index 52d0e97f..00000000 Binary files a/lectures/08_architecture/not-dl.jpg and /dev/null differ diff --git a/lectures/08_architecture/pareto-front.svg b/lectures/08_architecture/pareto-front.svg deleted file mode 100644 index 74179042..00000000 --- a/lectures/08_architecture/pareto-front.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/parts.png b/lectures/08_architecture/parts.png deleted file mode 100644 index dae64d96..00000000 Binary files a/lectures/08_architecture/parts.png and /dev/null differ diff --git a/lectures/08_architecture/perceptron.svg b/lectures/08_architecture/perceptron.svg deleted file mode 100644 index a31ed101..00000000 --- a/lectures/08_architecture/perceptron.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/radiology-scan.jpg b/lectures/08_architecture/radiology-scan.jpg deleted file mode 100644 index 9ea3a9c4..00000000 Binary files a/lectures/08_architecture/radiology-scan.jpg and /dev/null differ diff --git a/lectures/08_architecture/random-forest.png b/lectures/08_architecture/random-forest.png deleted file mode 100644 index f4e43a6d..00000000 Binary files a/lectures/08_architecture/random-forest.png and /dev/null differ diff --git a/lectures/08_architecture/req-arch-impl.svg b/lectures/08_architecture/req-arch-impl.svg deleted file mode 100644 index 34bea3ff..00000000 --- a/lectures/08_architecture/req-arch-impl.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/smartkeyboard.jpg b/lectures/08_architecture/smartkeyboard.jpg deleted file mode 100644 index bc8d59f8..00000000 Binary files a/lectures/08_architecture/smartkeyboard.jpg and /dev/null differ diff --git a/lectures/08_architecture/spotify.png b/lectures/08_architecture/spotify.png deleted file mode 100644 index b58b9ddf..00000000 Binary files a/lectures/08_architecture/spotify.png and /dev/null differ diff --git a/lectures/08_architecture/system.svg b/lectures/08_architecture/system.svg deleted file mode 100644 index 9d3cfe66..00000000 --- a/lectures/08_architecture/system.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/temi.png b/lectures/08_architecture/temi.png deleted file mode 100644 index 29ce2dd5..00000000 Binary files a/lectures/08_architecture/temi.png and /dev/null differ diff --git a/lectures/08_architecture/tradeoffs.md b/lectures/08_architecture/tradeoffs.md deleted file mode 100644 index 7305586b..00000000 --- 
a/lectures/08_architecture/tradeoffs.md +++ /dev/null @@ -1,1093 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Toward Architecture and Design" -semester: Spring 2023 -footer: "17-645 Machine Learning in Production • Christian Kaestner, -Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
-
-## Machine Learning in Production
-
-# Toward Architecture and Design
-
----
-## After requirements...
-
-![Overview of course content](../_assets/overview.svg)
-
-----
-## Learning Goals
-
-* Describe the role of architecture and design between requirements and implementation
-* Identify the different ML components and organize and prioritize their quality concerns for a given project
-* Explain the key ideas behind decision trees and random forests and analyze consequences for various qualities
-* Demonstrate an understanding of the key ideas of deep learning and how it drives qualities
-* Plan and execute an evaluation of the qualities of alternative AI components for a given purpose
-
-----
-## Readings
-
-Required reading: Hulten, Geoff. "Building Intelligent Systems: A Guide to Machine Learning Engineering." (2018), Chapters 17 and 18
-
-Recommended reading: Siebert, Julien, Lisa Joeckel, Jens Heidrich, Koji Nakamichi, Kyoko Ohashi, Isao Namba, Rieko Yamamoto, and Mikio Aoyama. “Towards Guidelines for Assessing Qualities of Machine Learning Systems.” In International Conference on the Quality of Information and Communications Technology, pp. 17–31. Springer, Cham, 2020.
-
----
-# Recall: ML is a Component in a System in an Environment
-
-![Temi Screenshot](temi.png)
-
-----
-
-![Simple architecture diagram of transcription service](transcriptionarchitecture2.svg)
-
-* **ML components** for transcription model, pipeline to train the model, monitoring infrastructure...
-* **Non-ML components** for data storage, user interface, payment processing, ...
-* User requirements and assumptions
-* System quality vs model quality
-* System requirements vs model requirements
-
-----
-## Recall: Systems Thinking
-
-![](system.svg)
-
-> A system is a set of inter-related components that work together in a particular environment to perform whatever functions are required to achieve the system's objective -- Donella Meadows
-
----
-# Thinking like a Software Architect
-
-![Architecture between requirements and implementation](req-arch-impl.svg)
-
-----
-## So far: Requirements
-
-* Identify goals for the system, define success metrics
-* Understand requirements, specifications, and assumptions
-* Consider risks, plan for mitigations to mistakes
-* Approaching component requirements: Understand quality requirements and constraints for models and learning algorithms
-
-----
-## From Requirements to Implementations...
-
-We know what to build, but how? How do we meet the quality goals?
-
-![Architecture between requirements and implementation](req-arch-impl.svg)
-
-**Software architecture:** Key design decisions, made early in the development, focusing on key product qualities
-
-Architectural decisions are hard to change later
-
-----
-## Software Architecture
-
-> The software architecture of a program or computing system is the **structure or structures** of the system, which comprise **software elements**, the ***externally visible properties*** of those elements, and the relationships among them.
-> -- [Kazman et al. 2012](https://www.oreilly.com/library/view/software-architecture-in/9780132942799/?ar)
-
-----
-## Architecture Decisions: Examples
-
-* What are the major components in the system? What does each component do?
-* Where do the components live? Monolithic vs microservices?
-* How do components communicate with each other? Synchronous vs asynchronous calls?
-* What API does each component publish? Who can access this API?
-* Where does the ML inference happen? Client-side or server-side?
-* Where is the telemetry data collected from the users stored?
-* How large should the user database be? Centralized vs decentralized?
-* ...and many others
-
-----
-## Software Architecture
-
-> Architecture represents the set of **significant** **design** decisions that shape the form and the function of a system, where **significant** is measured by cost of change.
-> -- [Grady Booch, 2006]
-
-----
-## How much Architecture/Design?
-
-![Design](design.png)
-
-Software Engineering Theme: *Think before you code*
-
-Like requirements: Slower initially, but upfront investment can prevent problems later and save overall costs
-
--> Focus on the most important qualities early, but leave flexibility
-
-----
-## Quality Requirements Drive Architecture Design
-
-Driven by requirements, identify the most important qualities
-
-Examples:
-* Development cost, operational cost, time to release
-* Scalability, availability, response time, throughput
-* Security, safety, usability, fairness
-* Ease of modifications and updates
-* ML: Accuracy, ability to collect data, training latency
-
-----
-## Architecture Design Involves Quality Trade-offs
-
-![Monolithic vs microservice](architectures.png)
-
-**Q. What are the quality trade-offs between the two?**
-
-[Image source](https://medium.com/javanlabs/micro-services-versus-monolithic-architecture-what-are-they-e17ddc8d3910)
-
-----
-
-## Why Architecture? ([Kazman et al. 2012](https://www.oreilly.com/library/view/software-architecture-in/9780132942799/?ar))
-
- -Represents earliest design decisions. - -Aids in **communication** with stakeholders: Shows them “how” at a level they can understand, raising questions about whether it meets their needs - -Defines **constraints** on implementation: Design decisions form “load-bearing walls” of application - -Dictates **organizational structure**: Teams work on different components - -Inhibits or enables **quality attributes**: Similar to design patterns - -Supports **predicting** cost, quality, and schedule: Typically by predicting information for each component - -Aids in software **evolution**: Reason about cost, design, and effect of changes - -
-
-----
-
-## Case Study: Twitter
-
-![twitter](twitter.png)
-
-Note: Source and additional reading: Raffi. [New Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html) Twitter Blog, 2013
-
-----
-
-## Twitter - Caching Architecture
-
-![twitter](twitter-caching.png)
-
-Notes:
-
-* Running one of the world’s largest Ruby on Rails installations
-* 200 engineers
-* Monolithic: managing raw database, memcache, rendering the site, and presenting the public APIs in one codebase
-* Increasingly difficult to understand the system; organizationally challenging to manage and parallelize engineering teams
-* Reached the limit of throughput on our storage systems (MySQL); read and write hot spots throughout our databases
-* Throwing machines at the problem; low throughput per machine (CPU + RAM limit, network not saturated)
-* Optimization corner: trading off code readability vs performance
-
-----
-
-## Twitter's Redesign Goals
-
-
-* **Performance**
-  * Improve median latency; lower outliers
-  * Reduce number of machines 10x
-* **Reliability**
-  * Isolate failures
-* **Maintainability**
-  * *"We wanted cleaner boundaries with “related” logic being in one place"*: encapsulation and modularity at the systems level (vs class/package level)
-* **Modifiability**
-  * Quicker release of new features: *"run small and empowered engineering teams that could make local decisions and ship user-facing changes, independent of other teams"*
-
-
-
-
-Raffi. [New Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html) Twitter Blog, 2013
-
-----
-## Twitter: Redesign Decisions
-
-* Ruby on Rails -> JVM/Scala
-* Monolith -> Microservices
-* RPC framework with monitoring, connection pooling, failover strategies, load balancing, ... built in
-* New storage solution, temporal clustering, "roughly sortable ids"
-* Data driven decision making
-
-![Gizzard](gizzard.png)
-
-----
-
-## Twitter Case Study: Key Insights
-
-Architectural decisions affect entire systems, not only individual modules
-
-Abstract: different abstractions for different scenarios
-
-Reason about quality attributes early
-
-Make architectural decisions explicit
-
-Question: **Did the original architect make poor decisions?**
-
----
-# Decomposition, Interfaces, and Responsibility Assignment
-
-![Exploded parts diagram of a complex device](parts.png)
-
-----
-## System Decomposition
-
-![Simple architecture diagram of transcription service](transcriptionarchitecture2.svg)
-
-Identify components and their responsibilities
-
-Establishes interfaces and team boundaries
-
-----
-## Information Hiding
-
-![Information hiding](information-hiding.png)
-
-Hide design decisions that are likely to change from clients
-
-**Q. Examples? What are the benefits of information hiding?**
-
-----
-## Information Hiding
-
-Decomposition enables scaling teams
-* Each team works on a component
-* Coordinate on *interfaces*, but implementations remain hidden
-
-**Interface descriptions are crucial**
-* Who is responsible for what
-* Component requirements (specifications), behavioral and quality
-* Especially consider nonlocal qualities: e.g., safety, privacy
-
-Challenges: Interfaces rarely fully specified, source of conflicts, changing requirements
-
-----
-## Each system is different...
-
-![Temi](temi.png)
-
-----
-## Each system is different...
-
-![Spotify](spotify.png)
-
-----
-## Each system is different...
-
-![Autonomous vehicle](lane.jpg)
-
-----
-## Each system is different...
-
-![Smart keyboard](smartkeyboard.jpg)
-
-----
-## System Decomposition
-
-
-Each system is different, identify important components
-
-Examples:
-* Personalized music recommendations: microservice deployment in cloud, logging of user activity, nightly batch processing for inference, regular model updates, regular experimentation, easy fallback
-* Transcription service: irregular user interactions, large model, expensive inference, inference latency not critical, rare model updates
-* Autonomous vehicle: on-board hardware sets limits, real-time needs, safety critical, updates necessary, limited experimentation in practice, not always online
-* Smart keyboard: privacy focused, small model, federated learning on user device, limited telemetry
-
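-
-For instance, the model inference service in such a decomposition might expose only a narrow, stable API that hides the model internals from other components (a sketch using Flask; the model file and input format are assumptions):
-
-```python
-# Sketch: model inference service as a separate component with a small API.
-import pickle
-from flask import Flask, request, jsonify
-
-app = Flask(__name__)
-with open("model.pkl", "rb") as f:   # hypothetical serialized model
-    model = pickle.load(f)
-
-@app.route("/predict", methods=["POST"])
-def predict():
-    features = request.get_json()["features"]   # assumed input format
-    prediction = model.predict([features])[0]
-    return jsonify({"prediction": float(prediction)})  # assumes numeric output
-
-if __name__ == "__main__":
-    app.run(port=8080)
-```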
- - ----- -## Common Components in ML-based Systems - -* **Model inference service**: Uses model to make predictions for input data -* **ML pipeline**: Infrastructure to train/update the model -* **Monitoring**: Observe model and system -* **Data sources**: Manual/crowdsourcing/logs/telemetry/... -* **Data management**: Storage and processing of data, often at scale -* **Feature store**: Reusable feature engineering code, cached feature computations - ----- -## Common System-Wide Design Challenges - -Separating concerns, understanding interdependencies -* e.g., anticipating/breaking feedback loops, conflicting needs of components - -Facilitating experimentation, updates with confidence - -Separating training and inference; closing the loop -* e.g., collecting telemetry to learn from user interactions - -Learn, serve, and observe at scale or with resource limits -* e.g., cloud deployment, embedded devices - - - - - - ---- -# Scoping Relevant Qualities of ML Components - -From System Quality Requirements to Component Quality Specifications - - ----- -## AI = DL? - -![not-dl](not-dl.jpg) - ----- -## ML Algorithms Today - -![ml-methods-poll](ml-methods-poll.jpg) - ----- -## Design Decision: ML Model Selection - -How do I decide which ML algorithm to use for my project? - -Criteria: Quality Attributes & Constraints - ----- -## Recall: Quality Attributes - -
- -Measurable or testable properties of a system that are used to indicate how well it satisfies its goals - -Examples - * Performance - * Features - * Reliability - * Conformance - * Durability - * Serviceability - * Aesthetics - * Perceived quality - * and many others - -
-
-
-
-Reference:
-Garvin, David A., [What Does Product Quality Really Mean](http://oqrm.org/English/What_does_product_quality_really_means.pdf). Sloan management review 25 (1984).
-
-----
-## Accuracy is not Everything
-
-Beyond prediction accuracy, what qualities may be relevant for an ML component?
-
-Note: Collect qualities on whiteboard
-
-----
-## Qualities of Interest?
-
-Scenario: ML component for transcribing audio files
-
-![Temi Screenshot](temi.png)
-
-Note: Which of the previously discussed qualities are relevant?
-Which additional qualities may be relevant here?
-
-Cost per transaction; how much does it cost to transcribe? How much do we make?
-
-----
-## Qualities of Interest?
-
-Scenario: Component for detecting lane markings in a vehicle
-
-![Lane detection](lane-detect.jpg)
-
-Note: Which of the previously discussed qualities are relevant?
-Which additional qualities may be relevant here?
-
-Realtime use
-
-----
-## Qualities of Interest?
-
-Scenario: Component for detecting credit card fraud, as a service for banks
-
-![Credit card](credit-card.jpg)
-
-Note: Very high volume of transactions, low cost per transaction, frequent updates
-
-Incrementality
-
-----
-## Common ML Qualities to Consider
-
-* Accuracy
-* Correctness guarantees? Probabilistic guarantees (--> symbolic AI)
-* How many features?
-* How much data needed? Data quality important?
-* Incremental training possible?
-* Training time, memory need, model size -- depending on training data volume and feature size
-* Inference time, energy efficiency, resources needed, scalability
-* Interpretability, explainability
-* Robustness, reproducibility, stability
-* Security, privacy, fairness
-
-----
-![Table of NFPs and their relationship to different components](nfps.png)
-
-From: Habibullah, Khan Mohammad, Gregory Gay, and Jennifer Horkoff. "[Non-Functional Requirements for Machine Learning: An Exploration of System Scope and Interest](https://arxiv.org/abs/2203.11063)." arXiv preprint arXiv:2203.11063 (2022).
-
-----
-## Preview: Interpretability/Explainability
-
- -*"Why did the model predict X?"* - -**Explaining predictions + Validating Models + Debugging** - -``` -IF age between 18–20 and sex is male THEN predict arrest -ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest -ELSE IF more than three priors THEN predict arrest -ELSE predict no arrest -``` - -* Some models inherently simpler to understand -* Some tools may provide post-hoc explanations -* Explanations may be more or less truthful -* How to measure interpretability? - -
-
-----
-## Preview: Robustness
-
-![Adversarial Example](adversarial.png)
-
-* Small input modifications may change output
-* Small training data modifications may change predictions
-* How to measure robustness?
-
-Image source: [OpenAI blog](https://openai.com/blog/adversarial-example-research/)
-
-----
-## Preview: Fairness
-
-*Does the model perform differently for different populations?*
-
-```
-IF age between 18–20 and sex is male THEN predict arrest
-ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest
-ELSE IF more than three priors THEN predict arrest
-ELSE predict no arrest
-```
-
-* Many different notions of fairness
-* Often caused by bias in training data
-* Enforce invariants in model or apply corrections outside model
-* Important consideration during requirements elicitation!
-
-----
-## Recall: Measuring Qualities
-
-
-* Define a metric: Define units of interest
-  - e.g., requests per second, max memory per inference, average training time in seconds for 1 million datasets
-* Collect data
-* Operationalize metric: Define measurement protocol
-  - e.g., conduct experiment: train model with fixed dataset, report median training time across 5 runs, file size, average accuracy with leave-one-out cross-validation after hyperparameter tuning
-  - e.g., ask 10 humans to independently label evaluation data, report reduction in error from the ML model over human predictions
-* Describe all relevant factors: Inputs/experimental units used, configuration decisions and tuning, hardware used, protocol for manual steps
-
-**On terminology:** *metric/measure* refer to a method or standard format for measuring something; *operationalization* is identifying and implementing a method to measure some factor
-
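-
-For example, "inference latency" might be operationalized as the median wall-clock time per input over repeated runs on a fixed input set (a sketch; the model interface is hypothetical):
-
-```python
-# Sketch: operationalizing an inference-latency metric as median
-# seconds per input across repeated runs on fixed inputs.
-import statistics
-import time
-
-def median_inference_latency(model, inputs, repetitions=5):
-    times = []
-    for _ in range(repetitions):
-        start = time.perf_counter()
-        for x in inputs:
-            model.predict(x)
-        times.append((time.perf_counter() - start) / len(inputs))
-    return statistics.median(times)  # median reduces noise from outliers
-```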
- - ----- -## On terminology - -Data scientists seem to speak of *model properties* when referring to accuracy, inference time, fairness, etc - * ... but they also use this term for whether a *learning technique* can learn non-linear relationships or whether the learning algorithm is monotonic - -Software engineering wording would usually be *quality attribute*, *quality requirement*, *quality specification* - or *non-functional requirement* - -![Random letters](../_assets/onterminology.jpg) - - ---- -# Common ML Algorithms and their Qualities - ----- -## Linear Regression: Qualities - -![linear-regression](linear-regression.png) - -* Tasks: Regression -* Qualities: __Advantages__: ?? __Drawbacks__: ?? - -Notes: -* Easy to interpret, low training cost, small model size -* Can't capture non-linear relationships well - ----- -## Decision Trees - - -![Decision tree](decisiontreeexample-full.png) - - - ----- -## Building Decision Trees - -![Decision tree](decisiontreeexample.png) - - - - - -* Identify all possible decisions -* Select the decision that best splits the dataset into distinct - outcomes (typically via entropy or similar measure) -* Repeatedly further split subsets, until stopping criteria reached - - - ----- -## Decision Trees: Qualities - -![Decision tree](decisiontreeexample.png) - - -* Tasks: Classification & regression -* Qualities: __Advantages__: ?? __Drawbacks__: ?? - -Notes: -* Easy to interpret (up to a size); can capture non-linearity; can do well with - little data -* High risk of overfitting; possibly very large tree size -* Obvious ones: fairly small model size, low inference cost, -no obvious incremental training; easy to interpret locally and -even globally if shallow; easy to understand decision boundaries - - - - - - - - ----- -## Random Forests - -![Random forest](random-forest.png) - - -* Train multiple trees on subsets of data or subsets of decisions. -* Return average prediction of multiple trees. -* Qualities: __Advantages__: ?? __Drawbacks__: ?? - -Note: Increased training time and model size, -less prone to overfitting, more difficult to interpret - - - ----- - -# Neural Networks - -![Xkcd commit 2173](xkcd2173.png) - - -[XKCD 2173](https://xkcd.com/2173/), cc-by-nc 2.5 Randall Munroe - -Note: Artificial neural networks are inspired by how biological neural networks work ("groups of chemically connected or functionally associated neurons" with synapses forming connections) - -From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal, via https://en.wikipedia.org/wiki/Neural_circuit#/media/File:Cajal_actx_inter.jpg - ----- -## Artificial Neural Networks - -Simulating biological neural networks of neurons (nodes) and synapses (connections), popularized in 60s and 70s - -Basic building blocks: Artificial neurons, with $n$ inputs and one output; output is activated if at least $m$ inputs are active - -![Simple computations with artificial neuron](neur_logic.svg) - - -(assuming at least two activated inputs needed to activate output) - ----- -## Threshold Logic Unit / Perceptron - -computing weighted sum of inputs + step function - -$z = w_1 x_1 + w_2 x_2 + ... 
+ w_n x_n = \mathbf{x}^T \mathbf{w}$ - -e.g., step: `$\phi$(z) = if (z<0) 0 else 1` - -![Perceptron](perceptron.svg) - - ----- - - - -![Perceptron](perceptron.svg) - - - - -$o_1 = \phi(b_{1} + w_{1,1} x_1 + w_{1,2} x_2)$ -$o_2 = \phi(b_{2} + w_{2,1} x_1 + w_{2,2} x_2)$ -$o_3 = \phi(b_{3} + w_{3,1} x_1 + w_{3,2} x_2)$ - - - -**** -$f_{\mathbf{W},\mathbf{b}}(\mathbf{X})=\phi(\mathbf{W} \cdot \mathbf{X}+\mathbf{b})$ - -($\mathbf{W}$ and $\mathbf{b}$ are parameters of the model) - ----- -## Multiple Layers - -![Multi Layer Perceptron](mlperceptron.svg) - - -Note: Layers are fully connected here, but layers may have different numbers of neurons - ----- -$f_{\mathbf{W}_h,\mathbf{b}_h,\mathbf{W}_o,\mathbf{b}_o}(\mathbf{X})=\phi( \mathbf{W}_o \cdot \phi(\mathbf{W}_h \cdot \mathbf{X}+\mathbf{b}_h)+\mathbf{b}_o)$ - -![Multi Layer Perceptron](mlperceptron.svg) - - -(matrix multiplications interleaved with step function) - ----- -## Learning Model Parameters (Backpropagation) - -
- -Intuition: -- Initialize all weights with random values -- Compute prediction, remembering all intermediate activations -- If predicted output has an error (measured with a loss function), - + Compute how much each connection contributed to the error on output layer - + Repeat computation on each lower layer - + Tweak weights a little toward the correct output (gradient descent) -- Continue training until weights stabilize - -Works efficiently only for certain $\phi$, typically logistic function: $\phi(z)=1/(1+exp(-z))$ or ReLU: $\phi(z)=max(0,z)$. - -
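-
-A minimal numpy sketch of this training loop for a tiny two-layer network learning XOR (sigmoid activations; layer sizes, learning rate, and iteration count are arbitrary choices):
-
-```python
-# Sketch: backpropagation + gradient descent for a 2-layer network on XOR.
-import numpy as np
-
-rng = np.random.default_rng(0)
-X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
-y = np.array([[0], [1], [1], [0]], dtype=float)
-
-W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # random initial weights
-W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
-sigmoid = lambda z: 1 / (1 + np.exp(-z))
-
-for _ in range(5000):
-    h = sigmoid(X @ W1 + b1)             # forward pass, keep activations
-    out = sigmoid(h @ W2 + b2)
-    d_out = (out - y) * out * (1 - out)  # error contribution at output layer
-    d_h = (d_out @ W2.T) * h * (1 - h)   # repeat computation on lower layer
-    W2 -= 0.5 * (h.T @ d_out)            # tweak weights toward correct output
-    b2 -= 0.5 * d_out.sum(axis=0)
-    W1 -= 0.5 * (X.T @ d_h)
-    b1 -= 0.5 * d_h.sum(axis=0)
-
-print(out.round(2))  # should approach [0, 1, 1, 0]
-```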
- ----- -## Deep Learning - -More layers - -Layers with different numbers of neurons - -Different kinds of connections, e.g., - - Fully connected (feed forward) - - Not fully connected (eg. convolutional networks) - - Keeping state (eg. recurrent neural networks) - - Skipping layers - - -See Chapter 10 in Géron, Aurélien. ”[Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019662775504436)”, 2nd Edition (2019) or any other book on deep learning - - -Note: Essentially the same with more layers and different kinds of architectures. - - ----- -## Deep Learning - -![neural-network](neural-network.png) - -* Tasks: Classification & regression -* Qualities: __Advantages__: ?? __Drawbacks__: ?? - -Notes: -* High accuracy; can capture a wide range of problems (linear & non-linear) -* Difficult to interpret; high training costs (time & amount of -data required, hyperparameter tuning) - - ----- -## Example Scenario - -*MNIST Fashion dataset of 70k 28x28 grayscale pixel images, 10 output classes* - -![MNIST Fashion](fashion_mnist.png) - - ----- -## Example Scenario - -* MNIST Fashion dataset of 70k 28x28 grayscale pixel images, 10 output classes -* 28x28 = 784 inputs in input layers (each 0..255) -* Example model with 3 layers, 300, 100, and 10 neurons - -```python -model = keras.models.Sequential([ - keras.layers.Flatten(input_shape=[28, 28]), - keras.layers.Dense(300, activation="relu"), - keras.layers.Dense(100, activation="relu"), - keras.layers.Dense(10, activation="softmax") -]) -``` - -**How many parameters does this model have?** - ----- -## Example Scenario - -```python -model = keras.models.Sequential([ - keras.layers.Flatten(input_shape=[28, 28]), - # 784*300+300 = 235500 parameter - keras.layers.Dense(300, activation="relu"), - # 300*100+100 = 30100 parameters - keras.layers.Dense(100, activation="relu"), - # 100*10+10 = 1010 parameters - keras.layers.Dense(10, activation="softmax") -]) -``` - -Total of 266,610 parameters in this small example! (Assuming float types, that's 1 MB) - ----- -## Network Size - -
- -* 50 Layer ResNet network -- classifying 224x224 images into 1000 categories - * 26 million weights, computes 16 million activations during inference, 168 MB to store weights as floats -* Google in 2012(!): 1TB-1PB of training data, 1 billion to 1 trillion parameters -* OpenAI’s GPT-2 (2019) -- text generation - - 48 layers, 1.5 billion weights (~12 GB to store weights) - - released model reduced to 117 million weights - - trained on 7-8 GPUs for 1 month with 40GB of internet text from 8 million web pages -* OpenAI’s GPT-3 (2020): 96 layers, 175 billion weights, 700 GB in memory, $4.6M in approximate compute cost for training -
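To double-check the parameter count and memory estimate for the small MNIST example above, a back-of-envelope sketch (our own arithmetic, not lecture code); the same weights-times-bytes arithmetic scales up to the networks listed here:

```python
# Parameters of a dense layer = inputs * neurons (weights) + neurons (biases).
layers = [(784, 300), (300, 100), (100, 10)]   # MNIST example above
params = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(params)                   # 266610
print(params * 4 / 1e6, "MB")   # ~1.07 MB at 4 bytes per float32 weight
```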
- -Notes: https://lambdalabs.com/blog/demystifying-gpt-3/ - ----- -## Cost & Energy Consumption - -
- - - -| Consumption | CO2 (lbs) | -| - | - | -| Air travel, 1 passenger, NY↔SF | 1984 | -| Human life, avg, 1 year | 11,023 | -| American life, avg, 1 year | 36,156 | -| Car, avg incl. fuel, 1 lifetime | 126,000 | - - - -| Training one model (GPU) | CO2 (lbs) | -| - | - | -| NLP pipeline (parsing, SRL) | 39 | -| w/ tuning & experimentation | 78,468 | -| Transformer (big) | 192 | -| w/ neural architecture search | 626,155 | - - - -
Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "[Energy and Policy Considerations for Deep Learning in NLP](https://arxiv.org/pdf/1906.02243.pdf)." In Proc. ACL, pp. 3645-3650. 2019.

----
## Cost & Energy Consumption

| Model | Hardware | Hours | CO2 (lbs) | Cloud cost in USD |
| - | - | - | - | - |
| Transformer | P100x8 | 84 | 192 | 289–981 |
| ELMo | P100x3 | 336 | 262 | 433–1472 |
| BERT | V100x64 | 79 | 1438 | 3751–13K |
| NAS | P100x8 | 274,120 | 626,155 | 943K–3.2M |
| GPT-2 | TPUv3x32 | 168 | — | 13K–43K |
| GPT-3 | | | — | 4.6M |

Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "[Energy and Policy Considerations for Deep Learning in NLP](https://arxiv.org/pdf/1906.02243.pdf)." In Proc. ACL, pp. 3645-3650. 2019.

---
# Constraints and Tradeoffs

![Pareto Front Example](pareto-front.svg)

----
## Design Decision: ML Model Selection

How do I decide which ML algorithm to use for my project?

Criteria: Quality Attributes & Constraints

----
## Constraints

Constraints define the space of attributes for valid design solutions

![constraints](design-space.svg)

Note: Design space exploration: The space of all possible designs (dotted rectangle) is reduced by several constraints on qualities of the system, leaving only a subset of designs for further consideration (highlighted center area).

----
## Types of Constraints

**Problem constraints**: Minimum required QAs for an acceptable product

**Project constraints**: Deadline, project budget, available personnel/skills

**Design constraints**:
* Type of ML task required (regression/classification)
* Available data
* Limits on computing resources, max. inference cost/time

----
## Constraints: Cancer Prognosis?

![Radiology scans](radiology-scan.jpg)

----
## Constraints: Music Recommendations?

![Spotify](spotify.png)

----
## Trade-offs between ML Algorithms

If there are multiple ML algorithms that satisfy the given constraints, which one do we select?

Different ML qualities may conflict with each other; this requires making a __trade-off__ between these qualities

Among the qualities of interest, which one(s) do we care the most about?
* And which ML algorithm is most suitable for achieving those qualities?
* (Similar to requirements conflicts)

----
## Multi-Objective Optimization

* Determine optimal solutions given multiple, possibly **conflicting** objectives
* **Dominated** solution: A solution that is inferior to others in every way
* **Pareto frontier**: A set of non-dominated solutions
* Consider trade-offs among Pareto optimal solutions (see the sketch after the further readings)

![Pareto Front Example](pareto-front.svg)

Note: Tradeoffs among multiple design solutions along two dimensions (cost and error). Gray solutions are all dominated by others that are better both in terms of cost and error (e.g., solution D has worse error and worse cost than solution A). The remaining black solutions are each better than another solution on one dimension but worse on another — they are all Pareto optimal, and which solution to pick depends on the relative importance of the dimensions.

----
## Trade-offs: Cost vs Accuracy

![Netflix prize leaderboard](netflix-leaderboard.png)

_"We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment."_

Amatriain & Basilico. [Netflix Recommendations: Beyond the 5 stars](https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429), Netflix Technology Blog (2012)

----
## Trade-offs: Accuracy vs Interpretability

![Illustrated interpretability accuracy tradeoff](tradeoffs.png)

**Q. Examples where one is more important than the other?**

Bloom & Brink. [Overcoming the Barriers to Production-Ready Machine Learning Workflows](https://conferences.oreilly.com/strata/strata2014/public/schedule/detail/32314), Presentation at O'Reilly Strata Conference (2014).

----
## Breakout: Qualities & ML Algorithms

Consider two scenarios:
1. Credit card fraud detection
2. Pedestrian detection in a sidewalk robot

As a group, post to `#lecture` tagging all group members:
> * Qualities of interest: ??
> * Constraints: ??
> * ML algorithm(s) to use: ??

---
# Summary

Software architecture focuses on early key design decisions, centered on key qualities

Between requirements and implementation

Decomposing the system into components, many ML components

Many qualities of interest; define metrics and operationalize

Constraints and tradeoff analysis for selecting ML techniques in production ML settings

----
## Further Readings
- -* Bass, Len, Paul Clements, and Rick Kazman. Software architecture in practice. Addison-Wesley Professional, 3rd edition, 2012. -* Yokoyama, Haruki. “Machine learning system architectural pattern for improving operational stability.” In 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), pp. 267–274. IEEE, 2019. -* Serban, Alex, and Joost Visser. “An Empirical Study of Software Architecture for Machine Learning.” In Proceedings of the International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022. -* Lakshmanan, Valliappa, Sara Robinson, and Michael Munn. Machine learning design patterns. O’Reilly Media, 2020. -* Lewis, Grace A., Ipek Ozkaya, and Xiwei Xu. “Software Architecture Challenges for ML Systems.” In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 634–638. IEEE, 2021. -* Vogelsang, Andreas, and Markus Borg. “Requirements Engineering for Machine Learning: Perspectives from Data Scientists.” In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019. -* Habibullah, Khan Mohammad, Gregory Gay, and Jennifer Horkoff. "[Non-Functional Requirements for Machine Learning: An Exploration of System Scope and Interest](https://arxiv.org/abs/2203.11063)." arXiv preprint arXiv:2203.11063 (2022). - -
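Appended here as a small illustration of the dominance check from the multi-objective optimization slide above (a self-contained sketch with made-up candidate solutions, not part of the original deck):

```python
# Each candidate: (cost, error); lower is better on both dimensions.
candidates = {"A": (2, 9), "B": (5, 5), "C": (9, 2), "D": (3, 10)}

def dominated(a, b):
    """True if candidate a is at least as bad as b on both dimensions
    and strictly worse on at least one (so a can be discarded)."""
    return all(x >= y for x, y in zip(a, b)) and a != b

pareto = {name: v for name, v in candidates.items()
          if not any(dominated(v, w) for w in candidates.values())}
print(pareto)  # A, B, C survive; D is dominated (worse cost and error than A)
```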
diff --git a/lectures/08_architecture/tradeoffs.png b/lectures/08_architecture/tradeoffs.png deleted file mode 100644 index 5e2609c4..00000000 Binary files a/lectures/08_architecture/tradeoffs.png and /dev/null differ diff --git a/lectures/08_architecture/transcriptionarchitecture2.svg b/lectures/08_architecture/transcriptionarchitecture2.svg deleted file mode 100644 index 212a40f7..00000000 --- a/lectures/08_architecture/transcriptionarchitecture2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/08_architecture/twitter-caching.png b/lectures/08_architecture/twitter-caching.png deleted file mode 100644 index 9f08e32c..00000000 Binary files a/lectures/08_architecture/twitter-caching.png and /dev/null differ diff --git a/lectures/08_architecture/twitter.png b/lectures/08_architecture/twitter.png deleted file mode 100644 index d5bc26d4..00000000 Binary files a/lectures/08_architecture/twitter.png and /dev/null differ diff --git a/lectures/08_architecture/xkcd2173.png b/lectures/08_architecture/xkcd2173.png deleted file mode 100644 index 99fa0fc3..00000000 Binary files a/lectures/08_architecture/xkcd2173.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/2phase-prediction.svg b/lectures/09_deploying_a_model/2phase-prediction.svg deleted file mode 100644 index f9b92a94..00000000 --- a/lectures/09_deploying_a_model/2phase-prediction.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/3tier-with-ml.svg b/lectures/09_deploying_a_model/3tier-with-ml.svg deleted file mode 100644 index ccc2e2c2..00000000 --- a/lectures/09_deploying_a_model/3tier-with-ml.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/apollo.png b/lectures/09_deploying_a_model/apollo.png deleted file mode 100644 index 03609231..00000000 Binary files a/lectures/09_deploying_a_model/apollo.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/ar-architecture.svg b/lectures/09_deploying_a_model/ar-architecture.svg deleted file mode 100644 index 57759c9b..00000000 --- a/lectures/09_deploying_a_model/ar-architecture.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/arch-diagram-example.svg b/lectures/09_deploying_a_model/arch-diagram-example.svg deleted file mode 100644 index ff959025..00000000 --- a/lectures/09_deploying_a_model/arch-diagram-example.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/deployment.md b/lectures/09_deploying_a_model/deployment.md deleted file mode 100644 index cf4adccd..00000000 --- a/lectures/09_deploying_a_model/deployment.md +++ /dev/null @@ -1,899 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Deploying a Model" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - -# Deploying a Model - - - ---- -## Deeper into architecture and design... - -![Overview of course content](../_assets/overview.svg) - - - - ----- - -## Learning Goals - -
- -* Understand important quality considerations when deploying ML components -* Follow a design process to explicitly reason about alternative designs and their quality tradeoffs -* Gather data to make informed decisions about what ML technique to use and where and how to deploy it -* Understand the power of design patterns for codifying design knowledge -* -* Create architectural models to reason about relevant characteristics -* Critique the decision of where an AI model lives (e.g., cloud vs edge vs hybrid), considering the relevant tradeoffs -* Deploy models locally and to the cloud -* Document model inference services - -
----
## Readings

Required reading:
* 🕮 Hulten, Geoff. "[Building Intelligent Systems: A Guide to Machine Learning Engineering.](https://www.buildingintelligentsystems.com/)" Apress, 2018, Chapter 13 (Where Intelligence Lives).
* 📰 Daniel Smith. "[Exploring Development Patterns in Data Science](https://www.theorylane.com/2017/10/20/some-development-patterns-in-data-science/)." TheoryLane Blog Post. 2017.

Recommended reading:
* 🕮 Rick Kazman, Paul Clements, and Len Bass. [Software Architecture in Practice.](https://www.oreilly.com/library/view/software-architecture-in/9780132942799/?ar) Addison-Wesley Professional, 2012, Chapter 1

---
# Deploying a Model is Easy

----
## Deploying a Model is Easy

Model inference component as function/library

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
model = … # learn model or load serialized model ...
def infer(feature1, feature2):
  return model.predict(np.array([[feature1, feature2]]))
```

----
## Deploying a Model is Easy

Model inference component as a service

```python
from flask import Flask, jsonify, request
app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = '/tmp/uploads'
detector_model = … # load model…

# inference API that returns JSON with classes
# found in an image
@app.route('/get_objects', methods=['POST'])
def pred():
  uploaded_img = request.files["images"]
  converted_img = … # feature encoding of uploaded img
  result = detector_model(converted_img)
  return jsonify({"response":
    result['detection_class_entities']})
```

----
## Deploying a Model is Easy

Packaging a model inference service in a container

```docker
FROM python:3.8-buster
RUN pip install uwsgi==2.0.20
RUN pip install tensorflow==2.7.0
RUN pip install flask==2.0.2
RUN pip install gunicorn==20.1.0
COPY models/model.pf /model/
COPY ./serve.py /app/main.py
WORKDIR ./app
EXPOSE 4040
CMD ["gunicorn", "-b 0.0.0.0:4040", "main:app"]
```

----
## Deploying a Model is Easy

Model inference component as a service in the cloud

* Package in container or other infrastructure
* Deploy in cloud infrastructure
* Auto-scaling with demand ("*Stateless Serving Functions Pattern*")
* MLOps infrastructure to automate all of this (more on this later)
  * [BentoML](https://github.com/bentoml/BentoML) (low-code service creation, deployment, model registry),
  * [Cortex](https://github.com/cortexlabs/cortex) (automated deployment and scaling of models on AWS),
  * [TFX model serving](https://www.tensorflow.org/tfx/guide/serving) (TensorFlow gRPC services)
  * [Seldon Core](https://www.seldon.io/tech/products/core/) (no-code model service and many additional services for monitoring and operations on Kubernetes)

----
## But is it really easy?

Offline use?

Deployment at scale?

Hardware needs and operating cost?

Frequent updates?

Integration of the model into a system?

Meeting system requirements?
- -**Every system is different!** - ----- -## Every System is Different - -Personalized music recommendations for Spotify - -Transcription service startup - -Self-driving car - -Smart keyboard for mobile device - ----- -## Inference is a Component within a System - -![Transcription service architecture example](transcriptionarchitecture2.svg) - - - - - - ---- -# Recall: Thinking like a Software Architect - -![Architecture between requirements and implementation](req-arch-impl.svg) - - - ----- -## Recall: Systems Thinking - -![](system.svg) - - -> A system is a set of inter-related components that work together in a particular environment to perform whatever functions are required to achieve the system's objective -- Donella Meadows - - - ---- - -# Architectural Modeling and Reasoning ----- -![](pgh.jpg) -Notes: Map of Pittsburgh. Abstraction for navigation with cars. ----- -![](pgh-cycling.jpg) -Notes: Cycling map of Pittsburgh. Abstraction for navigation with bikes and walking. ----- -![](pgh-firezones.png) -Notes: Fire zones of Pittsburgh. Various use cases, e.g., for city planners. ----- -## Analysis-Specific Abstractions - -All maps were abstractions of the same real-world construct - -All maps were created with different goals in mind - - Different relevant abstractions - - Different reasoning opportunities - -Architectural models are specific system abstractions, for reasoning about specific qualities - -No uniform notation - ----- - -## What can we reason about? - -![](lan-boundary.png) - - ----- - -## What can we reason about? - -![](gfs.png) - - - -Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "[The Google file system.](https://ai.google/research/pubs/pub51.pdf)" ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003. - -Notes: Scalability through redundancy and replication; reliability wrt to single points of failure; performance on edges; cost - ----- -## What can we reason about? - -![Apollo Self-Driving Car Architecture](apollo.png) - - - -Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE, 2020. - ----- - -## Suggestions for Graphical Notations - -Use notation suitable for analysis - -Document meaning of boxes and edges in legend - -Graphical or textual both okay; whiteboard sketches often sufficient - -Formal notations available - - - - - - - - - - - - - - - ---- - -# Case Study: Augmented Reality Translation - - -![Seoul Street Signs](seoul.jpg) - - - -Notes: Image: https://pixabay.com/photos/nightlife-republic-of-korea-jongno-2162772/ - ----- -## Case Study: Augmented Reality Translation -![Google Translate](googletranslate.png) - - ----- -## Case Study: Augmented Reality Translation -![Google Glasses](googleglasses.jpg) - - -Notes: Consider you want to implement an instant translation service similar toGoogle translate, but run it on embedded hardware in glasses as an augmented reality service. ----- -## System Qualities of Interest? - - - - ---- -# Design Decision: Selecting ML Algorithms - -What ML algorithms to use and why? Tradeoffs? - -![](googletranslate.png) - - - -Notes: Relate back to previous lecture about AI technique tradeoffs, including for example -Accuracy -Capabilities (e.g. classification, recommendation, clustering…) -Amount of training data needed -Inference latency -Learning latency; incremental learning? -Model size -Explainable? Robust? - ---- -# Design Decision: Where Should the Model Live? 
- -(Deployment Architecture) - ----- -## Where Should the Models Live? - -![AR Translation Architecture Sketch](ar-architecture.svg) - - -Cloud? Phone? Glasses? - -What qualities are relevant for the decision? - -Notes: Trigger initial discussion - - ----- -## Considerations - -* How much data is needed as input for the model? -* How much output data is produced by the model? -* How fast/energy consuming is model execution? -* What latency is needed for the application? -* How big is the model? How often does it need to be updated? -* Cost of operating the model? (distribution + execution) -* Opportunities for telemetry? -* What happens if users are offline? - ----- -## Breakout: Latency and Bandwidth Analysis - - -1. Estimate latency and bandwidth requirements between components -2. Discuss tradeoffs among different deployment models - - -![AR Translation Architecture Sketch](ar-architecture.svg) - - - -As a group, post in `#lecture` tagging group members: -* Recommended deployment for OCR (with justification): -* Recommended deployment for Translation (with justification): - - - -Notes: Identify at least OCR and Translation service as two AI components in a larger system. Discuss which system components are worth modeling (e.g., rendering, database, support forum). Discuss how to get good estimates for latency and bandwidth. - -Some data: -200ms latency is noticable as speech pause; -20ms is perceivable as video delay, 10ms as haptic delay; -5ms referenced as cybersickness threshold for virtual reality -20ms latency might be acceptable - -bluetooth latency around 40ms to 200ms - -bluetooth bandwidth up to 3mbit, wifi 54mbit, video stream depending on quality 4 to 10mbit for low to medium quality - -google glasses had 5 megapixel camera, 640x360 pixel screen, 1 or 2gb ram, 16gb storage - - ----- - -![Example of an architectural diagram](arch-diagram-example.svg) - - - ----- -## From the Reading: When would one use the following designs? - -* Static intelligence in the product -* Client-side intelligence (user-facing devices) -* Server-centric intelligence -* Back-end cached intelligence -* Hybrid models -* -* Consider: Offline use, inference latency, model updates, application updates, operating cost, scalability, protecting intellectual property - - -Notes: -From the reading: -* Static intelligence in the product - - difficult to update - - good execution latency - - cheap operation - - offline operation - - no telemetry to evaluate and improve -* Client-side intelligence - - updates costly/slow, out of sync problems - - complexity in clients - - offline operation, low execution latency -* Server-centric intelligence - - latency in model execution (remote calls) - - easy to update and experiment - - operation cost - - no offline operation -* Back-end cached intelligence - - precomputed common results - - fast execution, partial offline - - saves bandwidth, complicated updates -* Hybrid models - - ----- -## Where Should Feature Encoding Happen? - -![Feature Encoding](featureencoding.svg) - - -*Should feature encoding happen server-side or client-side? Tradeoffs?* - -Note: When thinking of model inference as a component within a system, feature encoding can happen with the model-inference component or can be the responsibility of the client. That is, the client either provides the raw inputs (e.g., image files; dotted box in the figure above) to the inference service or the client is responsible for computing features and provides the feature vector to the inference service (dashed box). 
Feature encoding and model inference could even be two separate services that are called by the client in sequence. Which alternative is preferable is a design decision that may depend on a number of factors, for example, whether and how the feature vectors are stored in the system, how expensive computing the feature encoding is, how often feature encoding changes, how many models use the same feature encoding, and so forth. For instance, in our stock photo example, having feature encoding be part of the inference service is convenient for clients and makes it easy to update the model without changing clients, but we would have to send the entire image over the network instead of just the much smaller feature vector for the reduced 300 x 300 pixels.

----
## Reusing Feature Engineering Code

![Feature encoding shared between training and inference](shared-feature-encoding.svg)

Avoid *training–serving skew*

----
## The Feature Store Pattern

* Central place to store, version, and describe feature engineering code
* Can be reused across projects
* Possible caching of expensive features

Many open-source and commercial offerings, e.g., Feast, Tecton, AWS SageMaker Feature Store

----
## Tecton Feature Store

----
## More Considerations for Deployment Decisions

Coupling of ML pipeline parts

Coupling with other parts of the system

Ability for different developers and analysts to collaborate

Support online experiments

Ability to monitor

----
## Real-Time Serving; Many Models

![Apollo Self-Driving Car Architecture](apollo.png)

Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE. 2020.

----
## Infrastructure Planning (Facebook Example)

![Example of Facebook's Machine Learning Flow and Infrastructure](facebook-flow.png)

Hazelwood, Kim, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy et al. "Applied machine learning at Facebook: A datacenter infrastructure perspective." In Int'l Symp. High Performance Computer Architecture. IEEE, 2018.

----
## Capacity Planning (Facebook Example)

| Services | Relative Capacity | Compute | Memory |
|--|--|--|--|
| News Feed | 100x | Dual-Socket CPU | High |
| Facer (face recognition) | 10x | Single-Socket CPU | Low |
| Lumos (image understanding) | 10x | Single-Socket CPU | Low |
| Search | 10x | Dual-Socket CPU | High |
| Lang. Translation | 1x | Dual-Socket CPU | High |
| Sigma (anomaly and spam detection) | 1x | Dual-Socket CPU | High |

* Trillions of inferences per day, in real time
* Preference for cheap single-CPU machines where possible
* Different latency requirements, some "nice to have" predictions
* Some models run on mobile devices to improve latency and reduce communication cost

Hazelwood, et al. "Applied machine learning at Facebook: A datacenter infrastructure perspective." In Int'l Symp. High Performance Computer Architecture. IEEE, 2018.

----
## Operational Robustness

Redundancy for availability?

Load balancer for scalability?

Can mistakes be isolated?
  - Local error handling?
  - Telemetry to isolate errors to component?

Logging and log analysis for what qualities?

---
# Preview: Telemetry Design

----
## Telemetry Design

How to evaluate system performance and mistakes in production?

![](googletranslate.png)

Notes: Discuss strategies to determine accuracy in production. What kind of telemetry needs to be collected?

----
## The Right and Right Amount of Telemetry
Purpose:
  - Monitor operation
  - Monitor mistakes (e.g., accuracy)
  - Improve models over time (e.g., detect new features)

Challenges:
  - too much data vs. not enough data
  - hard to measure, poor proxy measures
  - rare events
  - cost
  - privacy

**Interacts with deployment decisions** (see the logging sketch below)
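One common way to keep telemetry volume manageable is deterministic sampling. The following is a minimal sketch (helper name, fields, and sampling scheme are our own, hypothetical choices) of logging a fixed share of predictions so they can later be joined with user feedback:

```python
import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)

def log_prediction_telemetry(request_id: str, features, prediction, rate=0.01):
    """Hypothetical helper: log a deterministic sample (default 1%) of
    predictions, so telemetry volume stays bounded."""
    # hash-based bucketing: the same request id always falls in the same bucket
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 10_000
    if bucket < rate * 10_000:
        logging.info(json.dumps({
            "id": request_id,          # to join with later user feedback
            "features": features,      # consider summarizing to save space
            "prediction": prediction,
            "model_version": "1.2.0",  # isolate telemetry per model version
        }))

log_prediction_telemetry("req-123", {"len": 17}, "cat", rate=1.0)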
- ----- -## Telemetry Tradeoffs - -What data to collect? How much? When? - -Estimate data volume and possible bottlenecks in system. - -![](googletranslate.png) - - -Notes: Discuss alternatives and their tradeoffs. Draw models as suitable. - -Some data for context: -Full-screen png screenshot on Pixel 2 phone (1080x1920) is about 2mb (2 megapixel); Google glasses had a 5 megapixel camera and a 640x360 pixel screen, 16gb of storage, 2gb of RAM. Cellar cost are about $10/GB. - - - - - ---- -# Integrating Models into a System - ----- -## Recall: Inference is a Component within a System - -![Transcription service architecture example](transcriptionarchitecture2.svg) - - ----- -## Separating Models and Business Logic - -![3-tier architecture integrating ML](3tier-with-ml.svg) - - - -Based on: Yokoyama, Haruki. "Machine learning system architectural pattern for improving operational stability." In Int'l Conf. Software Architecture Companion, pp. 267-274. IEEE, 2019. - ----- -## Separating Models and Business Logic - -Clearly divide responsibilities - -Allows largely independent and parallel work, assuming stable interfaces - -Plan location of non-ML safeguards and other processing logic - - - ----- -## Composing Models: Ensemble and metamodels - -![Ensemble models](ensemble.svg) - - ----- -## Composing Models: Decomposing the problem, sequential - -![](sequential-model-composition.svg) - - ----- -## Composing Models: Cascade/two-phase prediction - -![](2phase-prediction.svg) - - - - - - - - - - ---- -# Documenting Model Inference Interfaces - - - ----- -## Why Documentation - -Model inference between teams: - * Data scientists developing the model - * Other data scientists using the model, evolving the model - * Software engineers integrating the model as a component - * Operators managing model deployment - -Will this model work for my problem? - -What problems to anticipate? - ----- -## Classic API Documentation - - -```java -/** - * compute deductions based on provided adjusted - * gross income and expenses in customer data. - * - * see tax code 26 U.S. Code A.1.B, PART VI - */ -float computeDeductions(float agi, Expenses expenses); -``` - - - ----- -## What to document for models? - - - ----- -## Documenting Input/Output Types for Inference Components - -```js -{ - "mid": string, - "languageCode": string, - "name": string, - "score": number, - "boundingPoly": { - object (BoundingPoly) - } -} -``` -From Google’s public [object detection API](https://cloud.google.com/vision/docs/object-localizer). - ----- -## Documentation beyond I/O Types - -Intended use cases, model capabilities and limitations - -Supported target distribution (vs preconditions) - -Accuracy (various measures), incl. slices, fairness - -Latency, throughput, availability (service level agreements) - -Model qualities such as explainability, robustness, calibration - -Ethical considerations (fairness, safety, security, privacy) - - -**Example for OCR model? How would you describe these?** - ----- -## Model Cards - -* Proposal and template for documentation from Google - * Intended use, out-of-scope use - * Training and evaluation data - * Considered demographic factors - * Accuracy evaluations - * Ethical considerations -* 1-2 page summary -* Focused on fairness -* Widely discussed, but not frequently adopted - - -Mitchell, Margaret, et al. "[Model cards for model reporting](https://arxiv.org/abs/1810.03993)." In *Proceedings of the Conference on Fairness, Accountability, and Transparency*, 2019. 
- ----- -![Model card example](modelcard.png) - - - -Example from Model Cards paper - ----- -![Model card screenshot from Google](modelcard2.png) - - - -From: https://modelcards.withgoogle.com/object-detection - ----- -## FactSheets - -
Proposal and template for documentation from IBM, intended to communicate expected qualities and assurances

Longer list of criteria, including
  * Service intention, intended use
  * Technical description
  * Target distribution
  * Own and third-party evaluation results
  * Safety and fairness considerations, explainability
  * Preparation for drift and evolution
  * Security, lineage, and versioning

Arnold, Matthew, et al. "[FactSheets: Increasing trust in AI services through supplier's declarations of conformity](https://arxiv.org/pdf/1808.07261.pdf)." *IBM Journal of Research and Development* 63, no. 4/5 (2019): 6-1.

----
## Recall: Correctness vs Fit

Without a clear specification, a model is difficult to document

Need documentation to allow evaluation for *fit*

Description of the *target distribution* is a key challenge

---
# Design Patterns for AI-Enabled Systems

(no standardization, *yet*)

----
## Design Patterns are Codified Design Knowledge

Vocabulary of design problems and solutions

![Observer pattern](observer.png)

Example: The *Observer* object-oriented design pattern describes a solution for how objects can be notified when another object changes, without strongly coupling these objects to each other

----
## Common System Structures

Client-server architecture

Multi-tier architecture

Service-oriented architecture and microservices

Event-based architecture

Data-flow architecture

----
## Multi-Tier Architecture

![3-tier architecture integrating ML](3tier-with-ml.svg)

Based on: Yokoyama, Haruki. "Machine learning system architectural pattern for improving operational stability." In Int'l Conf. Software Architecture Companion, pp. 267-274. IEEE, 2019.

----
## Microservices

![Microservice illustration](microservice.svg)

(more later)

----
## Patterns for ML-Enabled Systems

* Stateless/serverless serving function pattern
* Feature-store pattern
* Batched/precomputed serving pattern
* Two-phase prediction pattern (see the sketch after the summary)
* Batch serving pattern
* Decouple-training-from-serving pattern

----
## Anti-Patterns

* Big Ass Script Architecture
* Dead Experimental Code Paths
* Glue code
* Multiple Language Smell
* Pipeline Jungles
* Plain-Old Datatype Smell
* Undeclared Consumers

See also: 🗎 Washizaki, Hironori, Hiromu Uchida, Foutse Khomh, and Yann-Gaël Guéhéneuc. "[Machine Learning Architecture and Design Patterns](http://www.washi.cs.waseda.ac.jp/wp-content/uploads/2019/12/IEEE_Software_19__ML_Patterns.pdf)." Draft, 2019; 🗎 Sculley, et al. "[Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)." In NeurIPS, 2015.

---
# Summary
- -Model deployment seems easy, but involves many design decisions - * What models to use? - * Where to deploy? - * How to design feature encoding and feature engineering? - * How to compose with other components? - * How to document? - * How to collect telemetry? - -Problem-specific modeling and analysis: Gather estimates, consider design alternatives, make tradeoffs explicit - -Codifying design knowledge as patterns - -
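To illustrate the two-phase prediction pattern mentioned above, a minimal sketch (stub models and threshold are our own, hypothetical): a cheap on-device model answers confident cases locally, and only hard cases pay for the expensive remote model.

```python
import random

# Stub models for illustration; a real system would use a small on-device
# model and a large server-side model behind an inference API.
def small_model(features):
    return {"label": "cat", "confidence": random.random()}

def large_model_api(features):
    return "cat"  # stands in for an expensive remote call

def predict(features, threshold=0.9):
    # Phase 1: the cheap, fast model handles the easy cases locally.
    p = small_model(features)
    if p["confidence"] >= threshold:
        return p["label"]
    # Phase 2: fall back to the expensive model for uncertain cases.
    return large_model_api(features)

print(predict({"pixels": "..."}))
```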
- ----- -## Further Readings -
- -* 🕮 Lakshmanan, Valliappa, Sara Robinson, and Michael Munn. Machine learning design patterns. O’Reilly Media, 2020. -* 🗎 Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model cards for model reporting.” In Proceedings of the conference on fairness, accountability, and transparency, pp. 220–229. 2019. -* 🗎 Arnold, Matthew, Rachel KE Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilović, Ravi Nair, Karthikeyan Natesan Ramamurthy, Darrell Reimer, Alexandra Olteanu, David Piorkowski, Jason Tsay, and Kush R. Varshney. “FactSheets: Increasing trust in AI services through supplier’s declarations of conformity.” IBM Journal of Research and Development 63, no. 4/5 (2019): 6–1. -* 🗎 Yokoyama, Haruki. “Machine learning system architectural pattern for improving operational stability.” In 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), pp. 267–274. IEEE, 2019. - -
\ No newline at end of file diff --git a/lectures/09_deploying_a_model/ensemble.svg b/lectures/09_deploying_a_model/ensemble.svg deleted file mode 100644 index 7be898f2..00000000 --- a/lectures/09_deploying_a_model/ensemble.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/facebook-flow.png b/lectures/09_deploying_a_model/facebook-flow.png deleted file mode 100644 index 49989236..00000000 Binary files a/lectures/09_deploying_a_model/facebook-flow.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/featureencoding.svg b/lectures/09_deploying_a_model/featureencoding.svg deleted file mode 100644 index 46fe79ba..00000000 --- a/lectures/09_deploying_a_model/featureencoding.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/gfs.png b/lectures/09_deploying_a_model/gfs.png deleted file mode 100644 index b60e059d..00000000 Binary files a/lectures/09_deploying_a_model/gfs.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/googleglasses.jpg b/lectures/09_deploying_a_model/googleglasses.jpg deleted file mode 100644 index 73e19a59..00000000 Binary files a/lectures/09_deploying_a_model/googleglasses.jpg and /dev/null differ diff --git a/lectures/09_deploying_a_model/googletranslate.png b/lectures/09_deploying_a_model/googletranslate.png deleted file mode 100644 index 0d653aa6..00000000 Binary files a/lectures/09_deploying_a_model/googletranslate.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/lan-boundary.png b/lectures/09_deploying_a_model/lan-boundary.png deleted file mode 100644 index 6f3abcd8..00000000 Binary files a/lectures/09_deploying_a_model/lan-boundary.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/microservice.svg b/lectures/09_deploying_a_model/microservice.svg deleted file mode 100644 index 09cdf95d..00000000 --- a/lectures/09_deploying_a_model/microservice.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/modelcard.png b/lectures/09_deploying_a_model/modelcard.png deleted file mode 100644 index f390c256..00000000 Binary files a/lectures/09_deploying_a_model/modelcard.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/modelcard2.png b/lectures/09_deploying_a_model/modelcard2.png deleted file mode 100644 index d9ff48c5..00000000 Binary files a/lectures/09_deploying_a_model/modelcard2.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/observer.png b/lectures/09_deploying_a_model/observer.png deleted file mode 100644 index 44b26048..00000000 Binary files a/lectures/09_deploying_a_model/observer.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/pgh-cycling.jpg b/lectures/09_deploying_a_model/pgh-cycling.jpg deleted file mode 100644 index 7f44fe0a..00000000 Binary files a/lectures/09_deploying_a_model/pgh-cycling.jpg and /dev/null differ diff --git a/lectures/09_deploying_a_model/pgh-firezones.png b/lectures/09_deploying_a_model/pgh-firezones.png deleted file mode 100644 index 7ff1cd2a..00000000 Binary files a/lectures/09_deploying_a_model/pgh-firezones.png and /dev/null differ diff --git a/lectures/09_deploying_a_model/pgh.jpg b/lectures/09_deploying_a_model/pgh.jpg deleted file mode 100644 index 4286fb9e..00000000 Binary files a/lectures/09_deploying_a_model/pgh.jpg and /dev/null differ diff --git a/lectures/09_deploying_a_model/req-arch-impl.svg b/lectures/09_deploying_a_model/req-arch-impl.svg deleted file 
mode 100644 index 34bea3ff..00000000 --- a/lectures/09_deploying_a_model/req-arch-impl.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/seoul.jpg b/lectures/09_deploying_a_model/seoul.jpg deleted file mode 100644 index 7ef85dd7..00000000 Binary files a/lectures/09_deploying_a_model/seoul.jpg and /dev/null differ diff --git a/lectures/09_deploying_a_model/sequential-model-composition.svg b/lectures/09_deploying_a_model/sequential-model-composition.svg deleted file mode 100644 index 3fce8495..00000000 --- a/lectures/09_deploying_a_model/sequential-model-composition.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/shared-feature-encoding.svg b/lectures/09_deploying_a_model/shared-feature-encoding.svg deleted file mode 100644 index ea221aeb..00000000 --- a/lectures/09_deploying_a_model/shared-feature-encoding.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/system.svg b/lectures/09_deploying_a_model/system.svg deleted file mode 100644 index 9d3cfe66..00000000 --- a/lectures/09_deploying_a_model/system.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/09_deploying_a_model/transcriptionarchitecture2.svg b/lectures/09_deploying_a_model/transcriptionarchitecture2.svg deleted file mode 100644 index 212a40f7..00000000 --- a/lectures/09_deploying_a_model/transcriptionarchitecture2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/10_qainproduction/ab-button.png b/lectures/10_qainproduction/ab-button.png deleted file mode 100644 index 94ec0068..00000000 Binary files a/lectures/10_qainproduction/ab-button.png and /dev/null differ diff --git a/lectures/10_qainproduction/ab-groove.jpg b/lectures/10_qainproduction/ab-groove.jpg deleted file mode 100644 index f5da8094..00000000 Binary files a/lectures/10_qainproduction/ab-groove.jpg and /dev/null differ diff --git a/lectures/10_qainproduction/alexa.png b/lectures/10_qainproduction/alexa.png deleted file mode 100644 index 9adf5327..00000000 Binary files a/lectures/10_qainproduction/alexa.png and /dev/null differ diff --git a/lectures/10_qainproduction/amazon.png b/lectures/10_qainproduction/amazon.png deleted file mode 100644 index 16de6074..00000000 Binary files a/lectures/10_qainproduction/amazon.png and /dev/null differ diff --git a/lectures/10_qainproduction/bookingcom.png b/lectures/10_qainproduction/bookingcom.png deleted file mode 100644 index 9a77b15a..00000000 Binary files a/lectures/10_qainproduction/bookingcom.png and /dev/null differ diff --git a/lectures/10_qainproduction/canary.jpg b/lectures/10_qainproduction/canary.jpg deleted file mode 100644 index 79b6948a..00000000 Binary files a/lectures/10_qainproduction/canary.jpg and /dev/null differ diff --git a/lectures/10_qainproduction/confint.png b/lectures/10_qainproduction/confint.png deleted file mode 100644 index adc08b63..00000000 Binary files a/lectures/10_qainproduction/confint.png and /dev/null differ diff --git a/lectures/10_qainproduction/datarobot.png b/lectures/10_qainproduction/datarobot.png deleted file mode 100644 index a9a634b4..00000000 Binary files a/lectures/10_qainproduction/datarobot.png and /dev/null differ diff --git a/lectures/10_qainproduction/drift.jpg b/lectures/10_qainproduction/drift.jpg deleted file mode 100644 index ff35da56..00000000 Binary files a/lectures/10_qainproduction/drift.jpg and /dev/null differ diff --git 
a/lectures/10_qainproduction/flightforcast.jpg b/lectures/10_qainproduction/flightforcast.jpg deleted file mode 100644 index 74101165..00000000 Binary files a/lectures/10_qainproduction/flightforcast.jpg and /dev/null differ diff --git a/lectures/10_qainproduction/flywheel.png b/lectures/10_qainproduction/flywheel.png deleted file mode 100644 index 1bfeed11..00000000 Binary files a/lectures/10_qainproduction/flywheel.png and /dev/null differ diff --git a/lectures/10_qainproduction/grafana.png b/lectures/10_qainproduction/grafana.png deleted file mode 100644 index 8bc0a0f7..00000000 Binary files a/lectures/10_qainproduction/grafana.png and /dev/null differ diff --git a/lectures/10_qainproduction/grafanadashboard.png b/lectures/10_qainproduction/grafanadashboard.png deleted file mode 100644 index 3ab72059..00000000 Binary files a/lectures/10_qainproduction/grafanadashboard.png and /dev/null differ diff --git a/lectures/10_qainproduction/kohavi-bing-search.jpg b/lectures/10_qainproduction/kohavi-bing-search.jpg deleted file mode 100644 index 4400b526..00000000 Binary files a/lectures/10_qainproduction/kohavi-bing-search.jpg and /dev/null differ diff --git a/lectures/10_qainproduction/mturk.jpg b/lectures/10_qainproduction/mturk.jpg deleted file mode 100644 index 46519d7b..00000000 Binary files a/lectures/10_qainproduction/mturk.jpg and /dev/null differ diff --git a/lectures/10_qainproduction/perfcomp.png b/lectures/10_qainproduction/perfcomp.png deleted file mode 100644 index 92faaf31..00000000 Binary files a/lectures/10_qainproduction/perfcomp.png and /dev/null differ diff --git a/lectures/10_qainproduction/prometheusarchitecture.png b/lectures/10_qainproduction/prometheusarchitecture.png deleted file mode 100644 index 1610bc02..00000000 Binary files a/lectures/10_qainproduction/prometheusarchitecture.png and /dev/null differ diff --git a/lectures/10_qainproduction/qainproduction.md b/lectures/10_qainproduction/qainproduction.md deleted file mode 100644 index 4aaac7e9..00000000 --- a/lectures/10_qainproduction/qainproduction.md +++ /dev/null @@ -1,886 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Testing in Production" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - -# Testing in Production - - ---- - -
- - ---- -## Back to QA... - -![Overview of course content](../_assets/overview.svg) - - - ----- -## Learning Goals - -* Design telemetry for evaluation in practice -* Understand the rationale for beta tests and chaos experiments -* Plan and execute experiments (chaos, A/B, shadow releases, ...) in production -* Conduct and evaluate multiple concurrent A/B tests in a system -* Perform canary releases -* Examine experimental results with statistical rigor -* Support data scientists with monitoring platforms providing insights from production data - ----- -## Readings - - -Required Reading: -* 🕮 Hulten, Geoff. "[Building Intelligent Systems: A Guide to Machine Learning Engineering.](https://www.buildingintelligentsystems.com/)" Apress, 2018, Chapters 14 and 15 (Intelligence Management and Intelligent Telemetry). - -Suggested Readings: -* Alec Warner and Štěpán Davidovič. "[Canary Releases](https://landing.google.com/sre/workbook/chapters/canarying-releases/)." in [The Site Reliability Workbook](https://landing.google.com/sre/books/), O'Reilly 2018 -* Kohavi, Ron, Diane Tang, and Ya Xu. "[Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing](https://bookshop.org/books/trustworthy-online-controlled-experiments-a-practical-guide-to-a-b-testing/9781108724265)." Cambridge University Press, 2020. - - - ---- -# From Unit Tests to Testing in Production - -*(in traditional software systems)* - ----- -## Unit Test, Integration Tests, System Tests - -![Testing levels](testinglevels.png) - -Note: Testing before release. Manual or automated. - ----- -## Beta Testing - -![Windows 95 beta release](windowsbeta.jpg) - - -Note: Early release to select users, asking them to send feedback or report issues. No telemetry in early days. - ----- -## Crash Telemetry - -![Windows 95 Crash Report](wincrashreport_windows_xp.png) - -Note: With internet availability, send crash reports home to identify problems "in production". Most ML-based systems are online in some form and allow telemetry. - ----- -## A/B Testing - -![A/B test example](ab-groove.jpg) - -Notes: Usage observable online, telemetry allows testing in production. Picture source: https://www.designforfounders.com/ab-testing-examples/ - ----- -## Chaos Experiments - - -[![Simian Army logo by Netflix](simianarmy.jpg)](https://en.wikipedia.org/wiki/Chaos_engineering) - - -Note: Deliberate introduction of faults in production to test robustness. - - - - - - - ---- -# Model Assessment in Production - -Ultimate held-out evaluation data: Unseen real user data - ----- -## Limitations of Offline Model Evaluation - -Training and test data drawn from the same population -* **i.i.d.: independent and identically distributed** -* leakage and overfitting problems quite common - -Is the population representative of production data? - -If not or only partially or not anymore: Does the model generalize beyond training data? - - ----- -## Identify Feedback Mechanism in Production - -Live observation in the running system - -Potentially on subpopulation (A/B testing) - -Need telemetry to evaluate quality -- challenges: -- Gather feedback without being intrusive (i.e., labeling outcomes), without harming user experience -- Manage amount of data -- Isolating feedback for specific ML component + version - ----- -## Discuss how to collect feedback - -* Was the house price predicted correctly? -* Did the profanity filter remove the right blog comments? -* Was there cancer in the image? -* Was a Spotify playlist good? -* Was the ranking of search results good? 
* Was the weather prediction good?
* Was the translation correct?
* Did the self-driving car brake at the right moment? Did it detect the pedestrians?

Notes: More:
* SmartHome: Does it automatically turn off the lights/lock the doors/close the window at the right time?
* Profanity filter: Does it block the right blog comments?
* News website: Does it pick the headline alternative that attracts a user's attention most?
* Autonomous vehicles: Does it detect pedestrians in the street?

----

![Skype feedback dialog](skype1.jpg)

![Skype report problem button](skype2.jpg)

Notes:
Expect only sparse feedback, and expect negative feedback over-proportionally

----
![Flight cost forecast](flightforcast.jpg)

Notes: Can just wait 7 days to see the actual outcome for all predictions
----
![Temi Transcription Service Editor](temi.png)

Notes: Clever UI design allows users to edit transcripts. The UI already highlights low-confidence words, directing user attention to likely mistakes.

----
## Manually Label Production Samples

Similar to labeling learning and testing data, have human annotators

![Amazon mechanical turk](mturk.jpg)

----
## Summary: Telemetry Strategies

* Wait and see
* Ask users
* Manual/crowd-source labeling, shadow execution
* Allow users to complain
* Observe user reaction

----
## Breakout: Design Telemetry in Production

Discuss how to collect telemetry (wait and see, ask users, manual/crowd-source labeling, shadow execution, allow users to complain, observe user reaction)

Scenarios:
* Front-left: Amazon: Shopping app feature that detects the shoe brand from photos
* Front-right: Google: Tagging uploaded photos with friends' names
* Back-left: Spotify: Recommended personalized playlists
* Back-right: Wordpress: Profanity filter to moderate blog posts

(no need to post in slack yet)

----
## Measuring Model Quality with Telemetry

* Usual 3 steps: (1) metric, (2) data collection (telemetry), (3) operationalization
* Telemetry can provide insights for correctness
  - sometimes very accurate labels for real unseen data
  - sometimes only mistakes
  - sometimes delayed
  - often just samples
  - often just weak proxies for correctness
* Often sufficient to *approximate* precision/recall or other model-quality measures (see the sketch below)
* Mismatch to the (static) evaluation set may indicate stale or unrepresentative data
* Trend analysis can provide insights even for inaccurate proxy measures
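As a toy illustration of approximating model quality from telemetry (entirely made-up data and column names), one might join logged predictions with later-arriving user feedback and watch the trend rather than the absolute, biased value:

```python
import pandas as pd

# Hypothetical telemetry: logged predictions joined with labels that
# arrive later (e.g., user corrections), to approximate daily accuracy.
preds = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "day": ["Mon", "Mon", "Tue", "Tue"],
    "predicted": ["cat", "dog", "cat", "cat"],
})
feedback = pd.DataFrame({   # sparse: only some predictions ever get labels
    "id": [1, 3, 4],
    "actual": ["cat", "dog", "cat"],
})
joined = preds.merge(feedback, on="id")   # only the labeled samples
daily_acc = (joined.predicted == joined.actual).groupby(joined.day).mean()
print(daily_acc)   # the trend matters more than the absolute value
```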
- ----- -## Breakout: Design Telemetry in Production - -
- -Discuss how to collect telemetry, the metric to monitor, and how to operationalize - -Scenarios: -* Front-left: Amazon: Shopping app detects the shoe brand from photos -* Front-right: Google: Tagging uploaded photos with friends' names -* Back-left: Spotify: Recommended personalized playlists -* Back-right: Wordpress: Profanity filter to moderate blog posts - -As a group post to `#lecture` and tag team members: -> * Quality metric: -> * Data to collect: -> * Operationalization: - -
- ----- -## Monitoring Model Quality in Production - -* Monitor model quality together with other quality attributes (e.g., uptime, response time, load) -* Set up automatic alerts when model quality drops -* Watch for jumps after releases - - roll back after negative jump -* Watch for slow degradation - - Stale models, data drift, feedback loops, adversaries -* Debug common or important problems - - Monitor characteristics of requests - - Mistakes uniform across populations? - - Challenging problems -> refine training, add regression tests - ----- -![Grafana screenshot from Movie Recommendation Service](grafana.png) - ----- -## Prometheus and Grafana - -[![Prometheus Architecture](prometheusarchitecture.png)](https://prometheus.io/docs/introduction/overview/) - - ----- -![Grafana Dashboard](grafanadashboard.png) - - ----- -## Many commercial solutions - -[![DataRobot MLOps](datarobot.png)](https://www.datarobot.com/platform/mlops/) - - - -e.g. https://www.datarobot.com/platform/mlops/ - -Many pointers: Ori Cohen "[Monitor! Stop Being A Blind Data-Scientist.](https://towardsdatascience.com/monitor-stop-being-a-blind-data-scientist-ac915286075f)" Blog 2019 - - ----- -## Detecting Drift - -![Drift](drift.jpg) - - -Image source: Joel Thomas and Clemens Mewald. [Productionizing Machine Learning: From Deployment to Drift Detection](https://databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html). Databricks Blog, 2019 - ----- -## Engineering Challenges for Telemetry -![Amazon news story](alexa.png) - ----- -## Engineering Challenges for Telemetry -* Data volume and operating cost - - e.g., record "all AR live translations"? - - reduce data through sampling - - reduce data through summarization (e.g., extracted features rather than raw data; extraction client vs server side) -* Adaptive targeting -* Biased sampling -* Rare events -* Privacy -* Offline deployments? - ----- -## Breakout: Engineering Challenges in Telemetry - -Discuss: Cost, privacy, rare events, bias - -Scenarios: -* Front-left: Amazon: Shopping app feature that detects the shoe brand from photos -* Front-right: Google: Tagging uploaded photos with friends' names -* Back-left: Spotify: Recommended personalized playlists -* Back-right: Wordpress: Profanity filter to moderate blog posts - - -(can update slack, but not needed) - - ---- -# Telemetry for Training: The ML Flywheel - ----- - -![The ML Flywheel](flywheel.png) - - - - - graphic by [CBInsights](https://www.cbinsights.com/research/team-blog/data-network-effects/) - - ---- -# Revisiting Model Quality vs System Goals - ----- -## Model Quality vs System Goals - -Telemetry can approximate model accuracy - -Telemetry can directly measure system qualities, leading indicators, user outcomes -- define measures for "key performance indicators" -- clicks, buys, signups, engagement time, ratings -- operationalize with telemetry - ----- -## Model Quality vs System Quality - -![Booking.com homepage](bookingcom.png) - - - -Bernardi, Lucas, et al. "150 successful machine learning models: 6 lessons learned at Booking.com." In Proc. Int'l Conf. Knowledge Discovery & Data Mining, 2019. - ----- -## Possible causes of model vs system conflict? - -![Model accuracy does not need to correlate with business metric](bookingcom2.png) - - - - - - -Bernardi, Lucas, et al. "150 successful machine learning models: 6 lessons learned at Booking.com." In Proc. Int'l Conf. Knowledge Discovery & Data Mining, 2019. 
Note: hypothesized
* model value saturated, little more value to be expected
* segment saturation: only very few users benefit from further improvement
* over-optimization on proxy metrics, not real target metrics
* uncanny valley effect from "creepy AIs"

----
## Breakout: Design Telemetry in Production

Discuss: What key performance indicator of the *system* to collect?

Scenarios:
* Front-left: Amazon: Shopping app feature that detects the shoe brand from photos
* Front-right: Google: Tagging uploaded photos with friends' names
* Back-left: Spotify: Recommended personalized playlists
* Back-right: Wordpress: Profanity filter to moderate blog posts

(can update slack, but not needed)

---
# Experimenting in Production

* A/B experiments
* Shadow releases / traffic teeing (see the sketch below)
* Blue/green deployment
* Canary releases
* Chaos experiments

----
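As a preview of the shadow-release idea, a minimal sketch (with hypothetical stub models): the new model sees the same production traffic, but only the old model's predictions reach users; disagreements are logged as telemetry for later analysis.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Stub models for illustration only.
old_model = lambda x: "cat" if x["fur"] else "fish"
new_model = lambda x: "cat" if x["fur"] and x["legs"] == 4 else "fish"

def handle_request(features):
    prediction = old_model(features)   # the answer users actually see
    shadow = new_model(features)       # shadow prediction, never shown
    if shadow != prediction:           # telemetry: where do the models differ?
        logging.info("disagreement on %s: %s vs %s",
                     features, prediction, shadow)
    return prediction

print(handle_request({"fur": True, "legs": 2}))
```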
---
# A/B Experiments
----
## What if...?

* ... we had plenty of subjects for experiments
* ... we could randomly assign them to treatment and control groups without them knowing
* ... we could analyze small individual changes and keep everything else constant

▶ Ideal conditions for controlled experiments

![Amazon.com front page](amazon.png)

----
## A/B Testing for Usability

* In the running system, random users are shown a modified version
* Outcomes (e.g., sales, time on site) compared among groups

![A/B test example](ab-groove.jpg)

Notes: Picture source: https://www.designforfounders.com/ab-testing-examples/

----

![A/B experiment at Bing](kohavi-bing-search.jpg)

## Bing Experiment

* Experiment: Ad display at Bing
* Suggestion prioritized low
* Not implemented for 6 months
* Ran A/B test in production
* Within 2h, a *revenue-too-high* alarm triggered, suggesting a serious bug (e.g., double billing)
* Revenue increased by 12% - $100M annually in the US
* Did not hurt user-experience metrics
- -
- -From: Kohavi, Ron, Diane Tang, and Ya Xu. "[Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing](https://bookshop.org/books/trustworthy-online-controlled-experiments-a-practical-guide-to-a-b-testing/9781108724265)." 2020. - -
----
## A/B Experiment for ML Components?

* New product recommendation algorithm for a web store?
* New language model in an audio transcription service?
* New (offline) model to detect falls on a smart watch?

----
## Experiment Size

With enough subjects (users), we can run many, many experiments

Even very small experiments become feasible

Toward causal inference

![A/B test example of a single button's color](ab-button.png)

----

## Implementing A/B Testing

Implement alternative versions of the system
* using feature flags (decisions in implementation)
* separate deployments (decision in router/load balancer)

Map users to treatment group (see the assignment sketch below)
* Randomly from distribution
* Static user-group mapping
* Online service (e.g., [launchdarkly](https://launchdarkly.com/), [split](https://www.split.io/))

Monitor outcomes *per group*
* Telemetry, sales, time on site, server load, crash rate

----
## Feature Flags (Boolean flags)

```java
if (features.enabled(userId, "one_click_checkout")) {
    // new one-click checkout function
} else {
    // old checkout functionality
}
```

* Good practices: tracked explicitly, documented, keep them localized and independent
* External mapping of flags to customers, who should see what configuration
  * e.g., 1% of users sees `one_click_checkout`, but always the same users; or 50% of beta-users, 90% of developers, and 0.1% of all users

```scala
def isEnabled(user: User): Boolean = (hash(user.id) % 100) < 10
```
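To make the hash-based mapping concrete, here is a small Python sketch (function name and salting scheme are our own, hypothetical): assignment is stable per user and, thanks to the per-experiment salt, independent across experiments.

```python
import hashlib

def assign_group(user_id: str, experiment: str, treatment_share=0.1) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    The same user always lands in the same group for a given experiment,
    and different experiments are salted so assignments are independent.
    """
    h = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if (h % 1000) / 1000 < treatment_share else "control"

# Telemetry is then recorded *per group*, e.g.:
# log_metric(experiment, assign_group(uid, experiment), "time_on_site", 193)
print(assign_group("user-42", "one_click_checkout"))
```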
----
![split.io screenshot](splitio.png)

---
# Confidence in A/B Experiments

(statistical tests)

----

## Comparing Averages

**Group A**

*classic personalized content recommendation model*

2158 Users

average 3:13 min time on site

**Group B**

*updated personalized content recommendation model*

10 Users

average 3:24 min time on site

----
## Comparing Distributions

![Two distributions, 10000 samples each from a normal distribution](twodist.png)

----
## Different effect size, same deviations

![](twodist.png)

![](twodisteffect.png)

----
## Same effect size, different deviations

![](twodist.png)

![](twodistnoise.png)

Less noise --> easier to recognize the effect

----

## Dependent vs. independent measurements

Pairwise (dependent) measurements
* Before/after comparison
* With same benchmark + environment
* e.g., new operating system/disk drive faster

Independent measurements
* Repeated measurements
* Input data regenerated for each measurement

----
## Significance level
* Statistical chance of an error
* Define before executing the experiment
  * use commonly accepted values
  * based on cost of a wrong decision
* Common:
  * 0.05 significant
  * 0.01 very significant
* Statistically significant result ⇏ proof
* Statistically significant result ⇏ important result
* Covers only alpha error (more later)

----

## Intuition: Error Model
* 1 random error, influence +/- 1
* Real mean: 10
* Measurements: 9 (50%) and 11 (50%)
*
* 2 random errors, each +/- 1
* Measurements: 8 (25%), 10 (50%), and 12 (25%)
*
* 3 random errors, each +/- 1
* Measurements: 7 (12.5%), 9 (37.5%), 11 (37.5%), and 13 (12.5%)

----
## Normal Distribution
![Normal distribution](normaldist.png)

(CC 4.0 [D Wells](https://commons.wikimedia.org/wiki/File:Standard_Normal_Distribution.png))
----
## Confidence Intervals
![](confint.png)
----
## Comparison with Confidence Intervals
![](perfcomp.png)

Source: Andy Georges, et al. 2007. [Statistically rigorous java performance evaluation](https://dri.es/files/oopsla07-georges.pdf). In Proc. Conference on Object-Oriented Programming Systems and Applications.
----
# t-test

```r
> t.test(x, y, conf.level=0.9)

        Welch Two Sample t-test

t = 1.9988, df = 95.801, p-value = 0.04846
alternative hypothesis: true difference in means is
not equal to 0
90 percent confidence interval:
 0.3464147 3.7520619
sample estimates:
mean of x mean of y
 51.42307  49.37383

> # paired t-test:
> t.test(x-y, conf.level=0.9)
```
----
![t-test in an A/B testing dashboard](testexample.png)

Source: https://conversionsciences.com/ab-testing-statistics/
----
![t-test in an A/B testing dashboard](testexample2.png)

Source: https://cognetik.com/why-you-should-build-an-ab-test-dashboard/
----
## How many samples needed?

**Too few?**

**Too many?**

---
# A/B Testing Automation

* Experiment configuration through DSLs/scripts
* Queue experiments
* Stop experiments when confident in results
* Stop experiments resulting in bad outcomes (crashes, very low sales)
* Automated reporting, dashboards

Further readings:
* Tang, Diane, et al. [Overlapping experiment infrastructure: More, better, faster experimentation](https://ai.google/research/pubs/pub36500.pdf). Proc. KDD, 2010. (Google)
* Bakshy, Eytan et al. [Designing and deploying online field experiments](https://arxiv.org/pdf/1409.3174). Proc. WWW, 2014. (Facebook)
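A hedged sketch of what such an automated stopping rule might look like (illustrative thresholds and names, not a particular platform's API): run a Welch's t-test on the per-group metric, like R's `t.test` above, and decide whether to continue, stop, or abort.

```python
from scipy import stats

def check_experiment(control: list[float], treatment: list[float],
                     alpha: float = 0.01, harm_threshold: float = 0.9) -> str:
    """Decide whether to continue, stop, or abort a running A/B experiment."""
    # Welch's t-test (unequal variances), as in the R example above
    _, p_value = stats.ttest_ind(control, treatment, equal_var=False)
    if p_value < alpha:
        return "stop: groups differ significantly, report results"
    # guardrail: abort early if the treatment looks clearly harmful
    if sum(treatment) / len(treatment) < harm_threshold * (sum(control) / len(control)):
        return "abort: treatment appears harmful, roll back"
    return "continue: collect more data"
```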
----
## DSL for scripting A/B tests at Facebook
```java
button_color = uniformChoice(
    choices=['#3c539a', '#5f9647', '#b33316'],
    unit=cookieid);

button_text = weightedChoice(
    choices=['Sign up', 'Join now'],
    weights=[0.8, 0.2],
    unit=cookieid);

if (country == 'US') {
    has_translate = bernoulliTrial(p=0.2, unit=userid);
} else {
    has_translate = bernoulliTrial(p=0.05, unit=userid);
}
```

Further readings: Bakshy, Eytan et al. [Designing and deploying online field experiments](https://arxiv.org/pdf/1409.3174). Proc. WWW, 2014. (Facebook)

----
## Concurrent A/B testing

Multiple experiments at the same time
* Independent experiments on different populations -- interactions not explored
* Multi-factorial designs, well understood but typically too complex, e.g., not all combinations valid or interesting
* Grouping in sets of experiments (layers)

Further readings:
* Tang, Diane, et al. [Overlapping experiment infrastructure: More, better, faster experimentation](https://ai.google/research/pubs/pub36500.pdf). Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
* Bakshy, Eytan, Dean Eckles, and Michael S. Bernstein. [Designing and deploying online field experiments](https://arxiv.org/pdf/1409.3174). Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014.

---
# Other Experiments in Production

Shadow releases / traffic teeing

Blue/green deployment

Canary releases

Chaos experiments

----
## Shadow releases / traffic teeing

Run both models in parallel

Use predictions of old model in production

Compare differences between model predictions

If possible, compare against ground truth labels/telemetry

**Examples?**

----
## Blue/green deployment

Provision the service with both the old and the new model (e.g., as separate services)

Support immediate switch with load balancer

Allows undoing a release rapidly

**Advantages/disadvantages?**

----
## Canary Releases

Release new version to small percentage of population (like A/B testing)

Automatically roll back if quality measures degrade

Automatically and incrementally increase deployment to 100% otherwise

![Canary bird](canary.jpg)

----
## Chaos Experiments

[![Simian Army logo by Netflix](simianarmy.jpg)](https://en.wikipedia.org/wiki/Chaos_engineering)

----
## Chaos Experiments for ML Components?

Note: Artificially reduce model quality, add delays, insert bias, etc., to test monitoring and alerting infrastructure

----
## Advice for Experimenting in Production

Minimize *blast radius* (canary releases, A/B tests, chaos experiments)

Automate experiments and deployments

Allow for quick rollback of poor models (continuous delivery, containers, load balancers, versioning)

Make decisions with confidence, compare distributions

Monitor, monitor, monitor

---
# Bonus: Monitoring without Ground Truth

----
## Invariants/Assertions to Assure with Telemetry
* Consistency between multiple sources
  * e.g., multiple models agree, multiple sensors agree
  * e.g., text and image agree
* Physical domain knowledge
  * e.g., cars in video shall not flicker
  * e.g., earthquakes should appear in sensors grouped by geography
* Domain knowledge about unlikely events
  * e.g., unlikely to have 3 cars in same location
* Stability
  * e.g., object detection should not change with video noise
* Input conforms to schema (e.g., boolean features)
* And all invariants from model quality lecture, including capabilities
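In the spirit of the model assertions cited below, such an invariant can be checked directly on production telemetry. A small hedged sketch (per-frame sets of tracked object ids are an assumed telemetry format; all names are illustrative):

```python
import logging

def check_no_flicker(ids_by_frame: list, max_flicker_rate: float = 0.1) -> None:
    """Flag videos where detected objects disappear between adjacent frames."""
    transitions = max(len(ids_by_frame) - 1, 1)
    # a car vanishing for a single frame is physically implausible
    flickers = sum(len(prev - cur) for prev, cur in zip(ids_by_frame, ids_by_frame[1:]))
    if flickers / transitions > max_flicker_rate:
        logging.warning("Object flicker above threshold; possible model or pipeline issue")
```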
Kang, Daniel, et al. "Model Assertions for Monitoring and Improving ML Models." Proc. MLSys 2020.

---
# Summary

Production data is the ultimate unseen validation data

Both for model quality and system quality

Telemetry is key and challenging (design problem and opportunity)

Monitoring and dashboards

Many forms of experimentation and release (A/B testing, shadow releases, canary releases, chaos experiments, ...) to minimize the "blast radius";
gain confidence in results with statistical tests

----

## Further Readings
* On canary releases: Alec Warner and Štěpán Davidovič. “[Canary Releases](https://landing.google.com/sre/workbook/chapters/canarying-releases/).” In [The Site Reliability Workbook](https://landing.google.com/sre/books/), O’Reilly 2018
* Everything on A/B testing: Kohavi, Ron, Diane Tang, and Ya Xu. [*Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing*](https://bookshop.org/books/trustworthy-online-controlled-experiments-a-practical-guide-to-a-b-testing/9781108724265). Cambridge University Press, 2020.
* A/B testing critiques: Josh Constine. [The Morality Of A/B Testing](https://techcrunch.com/2014/06/29/ethics-in-a-data-driven-world/). Blog 2014; the [Center of Humane Technology](https://www.humanetech.com/); and the Netflix documentary [The Social Dilemma](https://en.wikipedia.org/wiki/The_Social_Dilemma)
* Ori Cohen. “[Monitor! Stop Being A Blind Data-Scientist.](https://towardsdatascience.com/monitor-stop-being-a-blind-data-scientist-ac915286075f)” Blog 2019
* Jens Meinicke, Chu-Pan Wong, Bogdan Vasilescu, and Christian Kästner. [Exploring Differences and Commonalities between Feature Flags and Configuration Options](https://www.cs.cmu.edu/~ckaestne/pdf/icseseip20.pdf). In Proc. International Conference on Software Engineering (ICSE-SEIP), pages 233–242, May 2020.
\ No newline at end of file
diff --git a/lectures/10_qainproduction/simianarmy.jpg b/lectures/10_qainproduction/simianarmy.jpg deleted file mode 100644 index 8a1f2bbe..00000000 Binary files a/lectures/10_qainproduction/simianarmy.jpg and /dev/null differ
diff --git a/lectures/10_qainproduction/skype1.jpg b/lectures/10_qainproduction/skype1.jpg deleted file mode 100644 index a5dd482d..00000000 Binary files a/lectures/10_qainproduction/skype1.jpg and /dev/null differ
diff --git a/lectures/10_qainproduction/skype2.jpg b/lectures/10_qainproduction/skype2.jpg deleted file mode 100644 index 76110b4b..00000000 Binary files a/lectures/10_qainproduction/skype2.jpg and /dev/null differ
diff --git a/lectures/10_qainproduction/splitio.png b/lectures/10_qainproduction/splitio.png deleted file mode 100644 index be09513a..00000000 Binary files a/lectures/10_qainproduction/splitio.png and /dev/null differ
diff --git a/lectures/10_qainproduction/temi.png b/lectures/10_qainproduction/temi.png deleted file mode 100644 index 29ce2dd5..00000000 Binary files a/lectures/10_qainproduction/temi.png and /dev/null differ
diff --git a/lectures/10_qainproduction/testexample.png b/lectures/10_qainproduction/testexample.png deleted file mode 100644 index 031680a1..00000000 Binary files a/lectures/10_qainproduction/testexample.png and /dev/null differ
diff --git a/lectures/10_qainproduction/testexample2.png b/lectures/10_qainproduction/testexample2.png deleted file mode 100644 index d9e3c26e..00000000 Binary files a/lectures/10_qainproduction/testexample2.png and /dev/null differ
diff --git a/lectures/10_qainproduction/testinglevels.png b/lectures/10_qainproduction/testinglevels.png deleted file mode 100644 index dee6a0ea..00000000 Binary files a/lectures/10_qainproduction/testinglevels.png and /dev/null differ
diff --git a/lectures/10_qainproduction/twodist.png b/lectures/10_qainproduction/twodist.png deleted file mode 100644 index 824de1b6..00000000 Binary files a/lectures/10_qainproduction/twodist.png and /dev/null differ
diff --git a/lectures/10_qainproduction/twodisteffect.png b/lectures/10_qainproduction/twodisteffect.png deleted file mode 100644 index af55637d..00000000 Binary files a/lectures/10_qainproduction/twodisteffect.png and /dev/null differ
diff --git a/lectures/10_qainproduction/twodistnoise.png b/lectures/10_qainproduction/twodistnoise.png deleted file mode 100644 index f67ba07b..00000000 Binary files a/lectures/10_qainproduction/twodistnoise.png and /dev/null differ
diff --git a/lectures/10_qainproduction/wincrashreport_windows_xp.png b/lectures/10_qainproduction/wincrashreport_windows_xp.png deleted file mode 100644 index e6e28968..00000000 Binary files a/lectures/10_qainproduction/wincrashreport_windows_xp.png and /dev/null differ
diff --git a/lectures/10_qainproduction/windowsbeta.jpg b/lectures/10_qainproduction/windowsbeta.jpg deleted file mode 100644 index 32ca1457..00000000 Binary files a/lectures/10_qainproduction/windowsbeta.jpg and /dev/null differ
diff --git a/lectures/11_dataquality/Accuracy_and_Precision.svg b/lectures/11_dataquality/Accuracy_and_Precision.svg deleted file mode 100644 index a1f66057..00000000 --- a/lectures/11_dataquality/Accuracy_and_Precision.svg +++ /dev/null @@ -1,2957 +0,0 @@
(Deleted SVG figure, inline markup omitted: four-panel "Accuracy and Precision" dartboard illustration by Tijmen Stam, GFDL; panels cross accuracy yes/no with precision yes/no, each showing a probability-density curve over "Value" relative to a "Reference value".)
diff --git a/lectures/11_dataquality/amazon-hiring.png b/lectures/11_dataquality/amazon-hiring.png deleted file mode 100644 index 94822f89..00000000 Binary files a/lectures/11_dataquality/amazon-hiring.png and /dev/null differ
diff --git a/lectures/11_dataquality/data-explosion.png b/lectures/11_dataquality/data-explosion.png deleted file mode 100644 index f03b202f..00000000 Binary files a/lectures/11_dataquality/data-explosion.png and /dev/null differ
diff --git a/lectures/11_dataquality/datacascades.png b/lectures/11_dataquality/datacascades.png deleted file mode 100644 index d3b31f9e..00000000 Binary files a/lectures/11_dataquality/datacascades.png and /dev/null differ
diff --git a/lectures/11_dataquality/dataquality.md b/lectures/11_dataquality/dataquality.md deleted file mode 100644 index 396565e0..00000000 --- a/lectures/11_dataquality/dataquality.md +++ /dev/null @@ -1,1115 +0,0 @@
---
author: Eunsuk Kang and Christian Kaestner
title: "MLiP: Data Quality"
semester: Spring 2023
footer: "Machine Learning in Production/AI Engineering • Eunsuk Kang & Christian Kaestner, Carnegie Mellon University • Spring 2023"
license: Creative Commons Attribution 4.0 International (CC BY 4.0)
---
- -## Machine Learning in Production - - -# Data Quality - - - - ---- -# Midterm - -One week from today, here - -Questions based on shared scenario, apply concepts - -Past midterms [online](https://github.com/mlip-cmu/s2023/tree/main/exams), similar style - -All lectures and readings in scope, focus on concepts with opportunity to practice (e.g., recitations, homeworks, in-class exercises) - -Closed book, but 6 sheets of notes (sorry, no ChatGPT) - - ---- -## More Quality Assurance... - -![Overview of course content](../_assets/overview.svg) - - - - ----- -## Readings - -Required reading: -* Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “[Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://dl.acm.org/doi/abs/10.1145/3411764.3445518). In Proc. Conference on Human Factors in Computing Systems (pp. 1-15). - - - -Recommended reading: -* Schelter, S., et al. [Automating large-scale data quality verification](http://www.vldb.org/pvldb/vol11/p1781-schelter.pdf). Proceedings of the VLDB Endowment, 11(12), pp.1781-1794. - - - ----- - -## Learning Goals - -* Distinguish precision and accuracy; understanding the better models vs more data tradeoffs -* Use schema languages to enforce data schemas -* Design and implement automated quality assurance steps that check data schema conformance and distributions -* Devise infrastructure for detecting data drift and schema violations -* Consider data quality as part of a system; design an organization that values data quality - - ---- -# Poor Data Quality has Consequences - -(often delayed, hard-to-fix consequences) - ----- - -![Data explosion](data-explosion.png) - - - -Image source: https://medium.com/@melodyucros/ladyboss-heres-why-you-should-study-big-data-721b04b8a0ca - ----- - -![Oprah data](everybody-data.jpeg) - - ----- -## GIGO: Garbage in, garbage out - -![GIGO](gigo.jpg) - - - -Image source: https://monkeylearn.com/blog/data-cleaning-python - ----- -## Example: Systematic bias in labeling - -Poor data quality leads to poor models - -Often not detectable in offline evaluation - **Q. why not**? - -Causes problems in production - now difficult to correct - -![Newspaper report on canceled amazon hiring project](amazon-hiring.png) - ----- -## Delayed Fixes increase Repair Cost - -![Cost of bug repair depending on when the bug was introduced and fixed](defectcost.jpg) - - ----- -## Data Cascades - -![Data cascades figure](datacascades.png) - -Detection almost always delayed! Expensive rework. -Difficult to detect in offline evaluation. - - -Sambasivan, N., et al. (2021, May). “[Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://dl.acm.org/doi/abs/10.1145/3411764.3445518). In Proc. CHI (pp. 1-15). - - - ---- - -# Data-Quality Challenges - ----- - -> Data cleaning and repairing account for about 60% of the work of data scientists. - - -**Own experience?** - - - -Quote: Gil Press. “[Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/).” Forbes Magazine, 2016. - - ----- -## Case Study: Inventory Management - -![Shelves in a warehouse](warehouse.jpg) - - -Goal: Train an ML model to predict future sales; make decisions about what to (re)stock/when/how many... 
- ----- -## Data Comes from Many Sources - -Manually entered - -Generated through actions in IT systems - -Logging information, traces of user interactions - -Sensor data - -Crowdsourced - - ----- -## Many Data Sources - -
(Diagram: data sources of varying reliability -- Twitter, SalesTrends, AdNetworks, VendorSales, ProductData, Marketing, Expired/Lost/Theft, PastSales -- all feeding into the Inventory ML component)
- -*sources of different reliability and quality* - - ----- -## Inventory Database - -
- -Product Database: - -| ID | Name | Weight | Description | Size | Vendor | -| - | - | - | - | - | - | -| ... | ... | ... | ... | ... | ... | - -Stock: - -| ProductID | Location | Quantity | -| - | - | - | -| ... | ... | ... | - -Sales history: - -| UserID | ProductId | DateTime | Quantity | Price | -| - | - | - | - |- | -| ... | ... | ... |... |... | - -
- ----- -## *Raw Data* is an Oxymoron - -![shipment receipt form](shipment-delivery-receipt.jpg) - - - - - -Recommended Reading: Gitelman, Lisa, Virginia Jackson, Daniel Rosenberg, Travis D. Williams, Kevin R. Brine, Mary Poovey, Matthew Stanley et al. "[Data bite man: The work of sustaining a long-term study](https://ieeexplore.ieee.org/abstract/document/6462156)." In "Raw Data" Is an Oxymoron, (2013), MIT Press: 147-166. - ----- -## What makes good quality data? - -**Accuracy:** The data was recorded correctly. - -**Completeness:** All relevant data was recorded. - -**Uniqueness:** The entries are recorded once. - -**Consistency:** The data agrees with itself. - -**Timeliness:** The data is kept up to date. - ----- -## Data is noisy - -Unreliable sensors or data entry - -Wrong results and computations, crashes - -Duplicate data, near-duplicate data - -Out of order data - -Data format invalid - -**Examples in inventory system?** - ----- -## Data changes - -System objective changes over time - -Software components are upgraded or replaced - -Prediction models change - -Quality of supplied data changes - -User behavior changes - -Assumptions about the environment no longer hold - -**Examples in inventory system?** - ----- -## Users may deliberately change data - -Users react to model output; causes data shift (more later) - -Users try to game/deceive the model - -**Examples in inventory system?** - ----- -## Accuracy vs Precision - - - -Accuracy: Reported values (on average) represent real value - -Precision: Repeated measurements yield the same result - -Accurate, but imprecise: **Q. How to deal with this issue?** - -Inaccurate, but precise: ? - - - - -![Accuracy-vs-precision visualized](Accuracy_and_Precision.svg) - - - - - -(CC-BY-4.0 by [Arbeck](https://commons.wikimedia.org/wiki/File:Accuracy_and_Precision.svg)) - - ----- - -## Accuracy and Precision Problems in Warehouse Data? - -![Shelves in a warehouse](warehouse.jpg) - - - ----- -## Data Quality and Machine Learning - -More data -> better models (up to a point, diminishing effects) - -Noisy data (imprecise) -> less confident models, more data needed - * some ML techniques are more or less robust to noise (more on robustness in a later lecture) - -Inaccurate data -> misleading models, biased models - --> Need the "right" data - --> Invest in data quality, not just quantity - - - - - - - ---- - -# Data Schema - -Ensuring basic consistency about shape and types - - ----- -## Dirty Data: Example - -![Dirty data](dirty-data-example.jpg) - - -*Problems with this data?* - - - ----- -## Data Quality Problems - -![Quality Problems Taxonomy](qualityproblems.png) - - -* Schema-level: Generic, domain-independent issues in data -* Instance-level: Application- and domain-specific - - - -Source: Rahm, Erhard, and Hong Hai Do. [Data cleaning: Problems and current approaches](http://dc-pubs.dbs.uni-leipzig.de/files/Rahm2000DataCleaningProblemsand.pdf). IEEE Data Eng. Bull. 23.4 (2000): 3-13. - - ----- -## Data Schema - -Define the expected format of data - * expected fields and their types - * expected ranges for values - * constraints among values (within and across sources) - -Data can be automatically checked against schema - -Protects against change; explicit interface between components - - ----- -## Schema Problems: Uniqueness, data format, integrity, ... 
- -* Illegal attribute values: `bdate=30.13.70` -* Violated attribute dependencies: `age=22, bdate=12.02.70` -* Uniqueness violation: `(name=”John Smith”, SSN=”123456”), (name=”Peter Miller”, SSN=”123456”)` -* Referential integrity violation: `emp=(name=”John Smith”, deptno=127)` if department 127 not defined - - - -Further readings: Rahm, Erhard, and Hong Hai Do. [Data cleaning: Problems and current approaches](http://dc-pubs.dbs.uni-leipzig.de/files/Rahm2000DataCleaningProblemsand.pdf). IEEE Data Eng. Bull. 23.4 (2000): 3-13. - - - - - ----- -## Schema in Relational Databases - -```sql -CREATE TABLE employees ( - emp_no INT NOT NULL, - birth_date DATE NOT NULL, - name VARCHAR(30) NOT NULL, - PRIMARY KEY (emp_no)); -CREATE TABLE departments ( - dept_no CHAR(4) NOT NULL, - dept_name VARCHAR(40) NOT NULL, - PRIMARY KEY (dept_no), UNIQUE KEY (dept_name)); -CREATE TABLE dept_manager ( - dept_no CHAR(4) NOT NULL, - emp_no INT NOT NULL, - FOREIGN KEY (emp_no) REFERENCES employees (emp_no), - FOREIGN KEY (dept_no) REFERENCES departments (dept_no), - PRIMARY KEY (emp_no,dept_no)); -``` - - ----- -## Which Problems are Schema Problems? - -![Dirty data](dirty-data-example.jpg) - - - ----- -## What Happens When New Data Violates Schema? - - - - ----- -## Modern Databases: Schema-Less - -![NoSQL](noSQL.jpeg) - - - -Image source: https://www.kdnuggets.com/2021/05/nosql-know-it-all-compendium.html - ----- -## Schema-Less Data Exchange - -* CSV files -* Key-value stores (JSon, XML, Nosql databases) -* Message brokers -* REST API calls -* R/Pandas Dataframes - -``` -2022-10-06T01:31:18,230550,GET /rate/narc+2002=4 -2022-10-06T01:31:19,332644,GET /rate/i+am+love+2009=4 -``` - -```json -{"user_id":5,"age":26,"occupation":"scientist","gender":"M"} -``` - ----- -## Schema-Less Data Exchange - -**Q. Benefits? Drawbacks?** - ----- -## Schema Library: Apache Avro - -```json -{ "type": "record", - "namespace": "com.example", - "name": "Customer", - "fields": [{ - "name": "first_name", - "type": "string", - "doc": "First Name of Customer" - }, - { - "name": "age", - "type": "int", - "doc": "Age at the time of registration" - } - ] -} -``` - ----- -## Schema Library: Apache Avro - -
- - - -Schema specification in JSON format - -Serialization and deserialization with automated checking - -Native support in Kafka - - - -Benefits - * Serialization in space efficient format - * APIs for most languages (ORM-like) - * Versioning constraints on schemas - -Drawbacks - * Reading/writing overhead - * Binary data format, extra tools needed for reading - * Requires external schema and maintenance - * Learning overhead - - - -
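For illustration, validating a record against the Customer schema above, here with the third-party `fastavro` library (shown as one plausible choice; check its documentation for the exact API):

```python
from fastavro import parse_schema
from fastavro.validation import validate

schema = parse_schema({
    "type": "record", "namespace": "com.example", "name": "Customer",
    "fields": [{"name": "first_name", "type": "string"},
               {"name": "age", "type": "int"}],
})

validate({"first_name": "Ada", "age": 36}, schema)    # passes
validate({"first_name": "Ada", "age": "36"}, schema)  # raises ValidationError (wrong type)
```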
- -Notes: Further readings eg https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321, https://www.confluent.io/blog/avro-kafka-data/, https://avro.apache.org/docs/current/ - ----- -## Many Schema Libraries/Formats - -Examples -* Avro -* XML Schema -* Protobuf -* Thrift -* Parquet -* ORC - ----- -## Discussion: Data Schema Constraints for Inventory System? - -
- -Product Database: - -| ID | Name | Weight | Description | Size | Vendor | -| - | - | - | - | - | - | -| ... | ... | ... | ... | ... | ... | - -Stock: - -| ProductID | Location | Quantity | -| - | - | - | -| ... | ... | ... | - -Sales history: - -| UserID | ProductId | DateTime | Quantity | Price | -| - | - | - | - |- | -| ... | ... | ... |... |... | - -
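One possible starting point for the discussion (a hedged sketch, with the column names assumed from the tables above): encode candidate constraints as executable checks, e.g., for the Stock table:

```python
import pandas as pd

def check_stock(stock: pd.DataFrame, products: pd.DataFrame) -> list:
    """Constraint checks for the Stock table (column names assumed from above)."""
    problems = []
    if stock["Quantity"].lt(0).any():
        problems.append("negative stock quantity")
    if not stock["ProductID"].isin(products["ID"]).all():
        problems.append("stock references unknown product (referential integrity)")
    if stock.duplicated(subset=["ProductID", "Location"]).any():
        problems.append("duplicate (ProductID, Location) rows (uniqueness)")
    return problems
```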
- ----- -## Summary: Schema - -Basic structure and type definition of data - -Well supported in databases and many tools - -*Very low bar of data quality* - - - - - - - - - ---- -# Instance-Level Problems - -Application- and domain-specific data issues - ----- -## Dirty Data: Example - -![Dirty data](dirty-data-example.jpg) - - -*Problems with the data beyond schema problems?* - - ----- -## Instance-Level Problems - - -* Missing values: `phone=9999-999999` -* Misspellings: `city=Pittsburg` -* Misfielded values: `city=USA` -* Duplicate records: `name=John Smith, name=J. Smith` -* Wrong reference: `emp=(name=”John Smith”, deptno=127)` if department 127 defined but wrong - -**Q. How can we detect and fix these problems?** - - - - -Further readings: Rahm, Erhard, and Hong Hai Do. [Data cleaning: Problems and current approaches](http://dc-pubs.dbs.uni-leipzig.de/files/Rahm2000DataCleaningProblemsand.pdf). IEEE Data Eng. Bull. 23.4 (2000): 3-13. - - ----- -## Discussion: Instance-Level Problems? - -![Shelves in a warehouse](warehouse.jpg) - - - ----- -## Data Cleaning Overview - -Data analysis / Error detection - * Usually focused on specific kind of problems, e.g., duplication, typos, missing values, distribution shift - * Detection in input data vs detection in later stages (more context) - -Error repair - * Repair data vs repair rules, one at a time or holistic - * Data transformation or mapping - * Automated vs human guided - ----- -## Error Detection Examples - -Illegal values: min, max, variance, deviations, cardinality - -Misspelling: sorting + manual inspection, dictionary lookup - -Missing values: null values, default values - -Duplication: sorting, edit distance, normalization - ----- -## Error Detection: Example - -![Dirty data](dirty-data-example.jpg) - - -*Can we (automatically) detect instance-level problems? Which problems are domain-specific?* - - ----- -## Example Tool: Great Expectations - -```python -expect_column_values_to_be_between( - column="passenger_count", - min_value=1, - max_value=6 -) -``` - -Supports schema validation and custom instance-level checks. - - -https://greatexpectations.io/ - - ----- -## Example Tool: Great Expectations - -![Great expectations screenshot](greatexpectations.png) - - - - -https://greatexpectations.io/ - - ----- -## Data Quality Rules - -Invariants on data that must hold - -Typically about relationships of multiple attributes or data sources, eg. - - ZIP code and city name should correspond - - User ID should refer to existing user - - SSN should be unique - - For two people in the same state, the person with the lower income should not have the higher tax rate - -Classic integrity constraints in databases or conditional constraints - -*Rules can be used to reject data or repair it* - ----- -## ML for Detecting Inconsistencies - -![Data Inconsistency Examples](errors_chicago.jpg) - - - -Image source: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “[HoloClean - Weakly Supervised Data Repairing](https://dawn.cs.stanford.edu/2017/05/12/holoclean/).” Blog, 2017. - ----- -## Example: HoloClean - -![HoloClean](holoclean.jpg) - - -
- -* User provides rules as integrity constraints (e.g., "two entries with the same -name can't have different city") -* Detect violations of the rules in the data; also detect statistical outliers -* Automatically generate repair candidates (with probabilities) -
- - -Image source: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “[HoloClean - Weakly Supervised Data Repairing](https://dawn.cs.stanford.edu/2017/05/12/holoclean/).” Blog, 2017. - ----- -## Discovery of Data Quality Rules - -
- - - -Rules directly taken from external databases - * e.g. zip code directory - -Given clean data, - * several algorithms that find functional relationships ($X\Rightarrow Y$) among columns - * algorithms that find conditional relationships (if $Z$ then $X\Rightarrow Y$) - * algorithms that find denial constraints ($X$ and $Y$ cannot co-occur in a row) - - -Given mostly clean data (probabilistic view), - * algorithms to find likely rules (e.g., association rule mining) - * outlier and anomaly detection - -Given labeled dirty data or user feedback, - * supervised and active learning to learn and revise rules - * supervised learning to learn repairs (e.g., spell checking) - -
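For instance, a candidate functional relationship $X\Rightarrow Y$ can be checked directly on a dataframe (a small sketch; the function name is made up):

```python
import pandas as pd

def fd_violations(df: pd.DataFrame, lhs: str, rhs: str) -> pd.DataFrame:
    """Rows violating a functional dependency lhs -> rhs (e.g., zipcode -> city)."""
    # count distinct rhs values per lhs value; more than one indicates a violation
    counts = df.groupby(lhs)[rhs].nunique()
    violating_keys = counts[counts > 1].index
    return df[df[lhs].isin(violating_keys)]
```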
- - - -Further reading: Ilyas, Ihab F., and Xu Chu. [Data cleaning](https://dl.acm.org/doi/book/10.1145/3310205). Morgan & Claypool, 2019. - ----- -## Excursion: Association rule mining - -
- -* Sale 1: Bread, Milk -* Sale 2: Bread, Diaper, Beer, Eggs -* Sale 3: Milk, Diaper, Beer, Coke -* Sale 4: Bread, Milk, Diaper, Beer -* Sale 5: Bread, Milk, Diaper, Coke - -Rules -* {Diaper, Beer} -> Milk (40% support, 66% confidence) -* Milk -> {Diaper, Beer} (40% support, 50% confidence) -* {Diaper, Beer} -> Bread (40% support, 66% confidence) - -*(also useful tool for exploratory data analysis)* - -
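Support and confidence are easy to recompute from the sales above; a small sketch that reproduces the numbers in the rules:

```python
sales = [{"Bread", "Milk"},
         {"Bread", "Diaper", "Beer", "Eggs"},
         {"Milk", "Diaper", "Beer", "Coke"},
         {"Bread", "Milk", "Diaper", "Beer"},
         {"Bread", "Milk", "Diaper", "Coke"}]

def support(itemset: set) -> float:
    # fraction of sales containing all items of the itemset
    return sum(itemset <= sale for sale in sales) / len(sales)

def confidence(lhs: set, rhs: set) -> float:
    # how often the rule lhs -> rhs holds when lhs is present
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Beer", "Milk"}))       # 0.4  -> 40% support
print(confidence({"Diaper", "Beer"}, {"Milk"}))  # 0.67 -> 66% confidence
```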
- - -Further readings: Standard algorithms and many variations, see [Wikipedia](https://en.wikipedia.org/wiki/Association_rule_learning) - - ----- -## Discussion: Data Quality Rules - -![Shelves in a warehouse](warehouse.jpg) - - - - - - - - - - - - - - - ---- - -# Drift - -*Why does my model begin to perform poorly over time?* - - - - - - - - - - - ----- -## Types of Drift - -![types of drift](drifts.png) - - - -Gama et al., *A survey on concept drift adaptation*. ACM Computing Surveys Vol. 46, Issue 4 (2014) - ----- - -## Drift & Model Decay - -
- -**Concept drift** (or concept shift) - * properties to predict change over time (e.g., what is credit card fraud) - * model has not learned the relevant concepts - * over time: different expected outputs for same inputs - -**Data drift** (or covariate shift, virtual drift, distribution shift, or population drift) - * characteristics of input data changes (e.g., customers with face masks) - * input data differs from training data - * over time: predictions less confident, further from training data - -**Upstream data changes** - * external changes in data pipeline (e.g., format changes in weather service) - * model interprets input data incorrectly - * over time: abrupt changes due to faulty inputs - -**How do we fix these drifts?** - -
-Notes: - * fix1: retrain with new training data or relabeled old training data - * fix2: retrain with new data - * fix3: fix pipeline, retrain entirely - ----- -## On Terminology - -Concept and data drift are separate concepts - -In practice and literature not always clearly distinguished - -Colloquially encompasses all forms of model degradations and environment changes - -Define term for target audience - - -![Random letters](../_assets/onterminology.jpg) - ----- -## Breakout: Drift in the Inventory System - -*What kind of drift might be expected?* - -As a group, tagging members, write plausible examples in `#lecture`: - -> * Concept Drift: -> * Data Drift: -> * Upstream data changes: - - -![Shelves in a warehouse](warehouse.jpg) - - - - - - ----- -## Watch for Degradation in Prediction Accuracy - -![Model Drift](model_drift.jpg) - - - -Image source: Joel Thomas and Clemens Mewald. [Productionizing Machine Learning: From Deployment to Drift Detection](https://databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html). Databricks Blog, 2019 - - - ----- -## Indicators of Concept Drift - -*How to detect concept drift in production?* - - - ----- -## Indicators of Concept Drift - -Model degradations observed with telemetry - -Telemetry indicates different outputs over time for similar inputs - -Relabeling training data changes labels - -Interpretable ML models indicate rules that no longer fit - -*(many papers on this topic, typically on statistical detection)* - ----- -## Indicators of Data Drift - -*How to detect data drift in production?* - - - ----- -## Indicators of Data Drift - -Model degradations observed with telemetry - -Distance between input distribution and training distribution increases - -Average confidence of model predictions declines - -Relabeling of training data retains stable labels - - ----- -## Detecting Data Drift - -* Compare distributions over time (e.g., t-test) -* Detect both sudden jumps and gradual changes -* Distributions can be manually specified or learned (see invariant detection) - - -![Two distributions](twodist.png) - -![Time series with confidence intervals](timeseries.png) - - - ----- -## Data Distribution Analysis - -Plot distributions of features (histograms, density plots, kernel density estimation) - - Identify which features drift - -Define distance function between inputs and identify distance to closest training data (e.g., energy distance, see also kNN) - -Anomaly detection and "out of distribution" detection - -Compare distribution of output labels - ----- -## Data Distribution Analysis Example - -https://rpubs.com/ablythe/520912 - - ----- -## Microsoft Azure Data Drift Dashboard - -![Dashboard](drift-ui-expanded.png) - - -Image source and further readings: [Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-datasets?tabs=python) - - ----- -## Dealing with Drift - -Regularly retrain model on recent data - - Use evaluation in production to detect decaying model performance - -Involve humans when increasing inconsistencies detected - - Monitoring thresholds, automation - -Monitoring, monitoring, monitoring! 
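As one concrete monitoring building block, a sketch of a distribution comparison per feature (using a two-sample Kolmogorov-Smirnov test, a common alternative to the t-test mentioned above; thresholds and names are illustrative):

```python
from scipy.stats import ks_2samp

def drifted(train_values, production_values, alpha: float = 0.01) -> bool:
    """Compare a feature's production distribution against the training data."""
    # two-sample Kolmogorov-Smirnov test: small p-value -> distributions differ
    _, p_value = ks_2samp(train_values, production_values)
    return p_value < alpha

# hypothetical use: alert when order quantities shift relative to training data
# if drifted(train["Quantity"], last_week["Quantity"]): alert_team()
```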
- - - ----- -## Breakout: Drift in the Inventory System - -*What kind of monitoring for previously listed drift in Inventory scenario?* - - -![Shelves in a warehouse](warehouse.jpg) - - - - - - - - - - - - - ---- -# Data Quality is a System-Wide Concern - -![](system.svg) - - ----- - -> "Everyone wants to do the model work, not the data work" - - -Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “[Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://dl.acm.org/doi/abs/10.1145/3411764.3445518). In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15). - ----- -## Data flows across components - -![Transcription service example with labeling tool, user interface, database, pipeline etc](transcriptionarchitecture2.svg) - - - ----- -## Data Quality is a System-Wide Concern - -Data flows across components, e.g., from user interface into database to crowd-sourced labeling team into ML pipeline - -Documentation at the interfaces is important - -Humans interacting with the system -* Entering data, labeling data -* Observed with sensors/telemetry -* Incentives, power structures, recognition - -Organizational practices -* Value, attention, and resources given to data quality - ----- -## Data Quality Documentation - -
- -Teams rarely document expectations of data quantity or quality - -Data quality tests are rare, but some teams adopt defensive monitoring -* Local tests about assumed structure and distribution of data -* Identify drift early and reach out to producing teams - -Several ideas for documenting distributions, including [Datasheets](https://dl.acm.org/doi/fullHtml/10.1145/3458723) and [Dataset Nutrition Label](https://arxiv.org/abs/1805.03677) -* Mostly focused on static datasets, describing origin, consideration, labeling procedure, and distributions; [Example](https://dl.acm.org/doi/10.1145/3458723#sec-supp) - -
- - -🗎 Gebru, Timnit, et al. "[Datasheets for datasets](https://dl.acm.org/doi/fullHtml/10.1145/3458723)." Communications of the ACM 64, no. 12 (2021).
🗎 Nahar, Nadia, et al. “[Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process](https://arxiv.org/abs/2110.10234).” In Proc. ICSE, 2022.

----
## Common Data Cascades
**Physical world brittleness**
* Idealized data, ignoring realities and change of real-world data
* Static data, one-time-learning mindset, no planning for evolution

**Inadequate domain expertise**
* Not understanding the data and its context
* Involving experts only late, for troubleshooting

**Conflicting reward systems**
* Missing incentives for data quality
* Not recognizing the importance of data quality, discarding it as a technicality
* Missing data literacy with partners

**Poor (cross-org.) documentation**
* Conflicts at team/organization boundary
* Undetected drift
- - -Sambasivan, N., et al. (2021). “[Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://dl.acm.org/doi/abs/10.1145/3411764.3445518). In Proc. Conference on Human Factors in Computing Systems. - - - - ----- -## Discussion: Possible Data Cascades? - -* Interacting with physical world brittleness -* Inadequate domain expertise -* Conflicting reward systems -* Poor (cross-organizational) documentation - -![Shelves in a warehouse](warehouse.jpg) - - - - - ----- -## Ethics and Politics of Data - -> Raw data is an oxymoron - - - ----- -## Incentives for Data Quality? Valuing Data Work? - - - - - - - ---- -# Summary - -* Data from many sources, often inaccurate, imprecise, inconsistent, incomplete, ... -- many different forms of data quality problems -* Many mechanisms for enforcing consistency and cleaning - * Data schema ensures format consistency - * Data quality rules ensure invariants across data points -* Concept and data drift are key challenges -- monitor -* Data quality is a system-level concern - * Data quality at the interface between components - * Documentation and monitoring often poor - * Involves organizational structures, incentives, ethics, ... - ----- -## Further Readings - -
- -* Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F. and Grafberger, A., 2018. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp.1781-1794. -* Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. "Data validation for machine learning." Proceedings of Machine Learning and Systems 1 (2019): 334-347. -* Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. “Data Management Challenges in Production Machine Learning.” In Proceedings of the 2017 ACM International Conference on Management of Data, 1723–26. ACM. -* Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean - Weakly Supervised Data Repairing.” Blog, 2017. -* Ilyas, Ihab F., and Xu Chu. Data cleaning. Morgan & Claypool, 2019. -* Moreno-Torres, Jose G., Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. "A unifying view on dataset shift in classification." Pattern recognition 45, no. 1 (2012): 521-530. -* Vogelsang, Andreas, and Markus Borg. "Requirements Engineering for Machine Learning: Perspectives from Data Scientists." In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019. -* Humbatova, Nargiz, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. "Taxonomy of real faults in deep learning systems." In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110-1121. 2020. - - -
diff --git a/lectures/11_dataquality/defectcost.jpg b/lectures/11_dataquality/defectcost.jpg deleted file mode 100644 index f6dc5588..00000000 Binary files a/lectures/11_dataquality/defectcost.jpg and /dev/null differ diff --git a/lectures/11_dataquality/dirty-data-example.jpg b/lectures/11_dataquality/dirty-data-example.jpg deleted file mode 100644 index 56f03660..00000000 Binary files a/lectures/11_dataquality/dirty-data-example.jpg and /dev/null differ diff --git a/lectures/11_dataquality/drift-ui-expanded.png b/lectures/11_dataquality/drift-ui-expanded.png deleted file mode 100644 index 8fc3c03d..00000000 Binary files a/lectures/11_dataquality/drift-ui-expanded.png and /dev/null differ diff --git a/lectures/11_dataquality/drifts.png b/lectures/11_dataquality/drifts.png deleted file mode 100644 index cb188dc0..00000000 Binary files a/lectures/11_dataquality/drifts.png and /dev/null differ diff --git a/lectures/11_dataquality/errors_chicago.jpg b/lectures/11_dataquality/errors_chicago.jpg deleted file mode 100644 index 5b377982..00000000 Binary files a/lectures/11_dataquality/errors_chicago.jpg and /dev/null differ diff --git a/lectures/11_dataquality/everybody-data.jpeg b/lectures/11_dataquality/everybody-data.jpeg deleted file mode 100644 index 192bffab..00000000 Binary files a/lectures/11_dataquality/everybody-data.jpeg and /dev/null differ diff --git a/lectures/11_dataquality/gigo.jpg b/lectures/11_dataquality/gigo.jpg deleted file mode 100644 index 10bb955c..00000000 Binary files a/lectures/11_dataquality/gigo.jpg and /dev/null differ diff --git a/lectures/11_dataquality/greatexpectations.png b/lectures/11_dataquality/greatexpectations.png deleted file mode 100644 index ea9a8811..00000000 Binary files a/lectures/11_dataquality/greatexpectations.png and /dev/null differ diff --git a/lectures/11_dataquality/holoclean.jpg b/lectures/11_dataquality/holoclean.jpg deleted file mode 100644 index 7ea1c8f4..00000000 Binary files a/lectures/11_dataquality/holoclean.jpg and /dev/null differ diff --git a/lectures/11_dataquality/model_drift.jpg b/lectures/11_dataquality/model_drift.jpg deleted file mode 100644 index 47857a0e..00000000 Binary files a/lectures/11_dataquality/model_drift.jpg and /dev/null differ diff --git a/lectures/11_dataquality/noSQL.jpeg b/lectures/11_dataquality/noSQL.jpeg deleted file mode 100644 index 7d5ce6b5..00000000 Binary files a/lectures/11_dataquality/noSQL.jpeg and /dev/null differ diff --git a/lectures/11_dataquality/qualityproblems.png b/lectures/11_dataquality/qualityproblems.png deleted file mode 100644 index e3e7e9a0..00000000 Binary files a/lectures/11_dataquality/qualityproblems.png and /dev/null differ diff --git a/lectures/11_dataquality/shipment-delivery-receipt.jpg b/lectures/11_dataquality/shipment-delivery-receipt.jpg deleted file mode 100644 index 6ff1940a..00000000 Binary files a/lectures/11_dataquality/shipment-delivery-receipt.jpg and /dev/null differ diff --git a/lectures/11_dataquality/system.svg b/lectures/11_dataquality/system.svg deleted file mode 100644 index 9d3cfe66..00000000 --- a/lectures/11_dataquality/system.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/11_dataquality/timeseries.png b/lectures/11_dataquality/timeseries.png deleted file mode 100644 index e9d5d4bb..00000000 Binary files a/lectures/11_dataquality/timeseries.png and /dev/null differ diff --git a/lectures/11_dataquality/transcriptionarchitecture2.svg b/lectures/11_dataquality/transcriptionarchitecture2.svg deleted file mode 100644 index 
212a40f7..00000000 --- a/lectures/11_dataquality/transcriptionarchitecture2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/11_dataquality/twodist.png b/lectures/11_dataquality/twodist.png deleted file mode 100644 index 824de1b6..00000000 Binary files a/lectures/11_dataquality/twodist.png and /dev/null differ diff --git a/lectures/11_dataquality/warehouse.jpg b/lectures/11_dataquality/warehouse.jpg deleted file mode 100644 index 762addb8..00000000 Binary files a/lectures/11_dataquality/warehouse.jpg and /dev/null differ diff --git a/lectures/12_pipelinequality/ci.png b/lectures/12_pipelinequality/ci.png deleted file mode 100644 index e686e50f..00000000 Binary files a/lectures/12_pipelinequality/ci.png and /dev/null differ diff --git a/lectures/12_pipelinequality/client-code-backend.svg b/lectures/12_pipelinequality/client-code-backend.svg deleted file mode 100644 index a6f7980c..00000000 --- a/lectures/12_pipelinequality/client-code-backend.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/12_pipelinequality/coverage.png b/lectures/12_pipelinequality/coverage.png deleted file mode 100644 index 35f64927..00000000 Binary files a/lectures/12_pipelinequality/coverage.png and /dev/null differ diff --git a/lectures/12_pipelinequality/driver-code-backend.svg b/lectures/12_pipelinequality/driver-code-backend.svg deleted file mode 100644 index 0e4ed85b..00000000 --- a/lectures/12_pipelinequality/driver-code-backend.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/12_pipelinequality/driver-code-stub.svg b/lectures/12_pipelinequality/driver-code-stub.svg deleted file mode 100644 index 7d3b444c..00000000 --- a/lectures/12_pipelinequality/driver-code-stub.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/12_pipelinequality/driver-stubs-interface.svg b/lectures/12_pipelinequality/driver-stubs-interface.svg deleted file mode 100644 index 60e25ba3..00000000 --- a/lectures/12_pipelinequality/driver-stubs-interface.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/12_pipelinequality/manualtesting.jpg b/lectures/12_pipelinequality/manualtesting.jpg deleted file mode 100644 index 02379513..00000000 Binary files a/lectures/12_pipelinequality/manualtesting.jpg and /dev/null differ diff --git a/lectures/12_pipelinequality/mlflow-web-ui.png b/lectures/12_pipelinequality/mlflow-web-ui.png deleted file mode 100644 index 82e3e39a..00000000 Binary files a/lectures/12_pipelinequality/mlflow-web-ui.png and /dev/null differ diff --git a/lectures/12_pipelinequality/mltestingandmonitoring.png b/lectures/12_pipelinequality/mltestingandmonitoring.png deleted file mode 100644 index 1b00ab01..00000000 Binary files a/lectures/12_pipelinequality/mltestingandmonitoring.png and /dev/null differ diff --git a/lectures/12_pipelinequality/notebook-example.png b/lectures/12_pipelinequality/notebook-example.png deleted file mode 100644 index 2b614ce0..00000000 Binary files a/lectures/12_pipelinequality/notebook-example.png and /dev/null differ diff --git a/lectures/12_pipelinequality/notebookinproduction.png b/lectures/12_pipelinequality/notebookinproduction.png deleted file mode 100644 index fe12e5aa..00000000 Binary files a/lectures/12_pipelinequality/notebookinproduction.png and /dev/null differ diff --git a/lectures/12_pipelinequality/pipeline-connections.svg b/lectures/12_pipelinequality/pipeline-connections.svg deleted file mode 100644 index 9fe37f55..00000000 
--- a/lectures/12_pipelinequality/pipeline-connections.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/12_pipelinequality/pipeline.svg b/lectures/12_pipelinequality/pipeline.svg deleted file mode 100644 index 5195af76..00000000 --- a/lectures/12_pipelinequality/pipeline.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/12_pipelinequality/pipelinequality.md b/lectures/12_pipelinequality/pipelinequality.md deleted file mode 100644 index 10de0c9a..00000000 --- a/lectures/12_pipelinequality/pipelinequality.md +++ /dev/null @@ -1,1520 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Automating and Testing ML Pipelines" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
## Machine Learning in Production

# Automating and Testing ML Pipelines

---
## Infrastructure Quality...

![Overview of course content](../_assets/overview.svg)

----
## Readings

Required reading: Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017)

Recommended readings:
* O'Leary, Katie, and Makoto Uchida. "[Common problems with Creating Machine Learning Pipelines from Existing Code](https://research.google/pubs/pub48984.pdf)." Proc. Conference on Machine Learning and Systems (MLSys) (2020).

----

# Learning Goals

* Decompose an ML pipeline into testable functions
* Implement and automate tests for all parts of the ML pipeline
* Understand testing opportunities beyond functional correctness
* Describe the different testing levels and testing opportunities at each level
* Automate test execution with continuous integration

---
# ML Pipelines

![Pipeline](pipeline.svg)

All steps to create (and deploy) the model

----
## Common ML Pipeline

![Notebook snippet](notebook-example.png)

Note:
Computational notebook

Containing all code, often also dead experimental code

----
## Notebooks as Production Pipeline?

[![How-to blog post on notebooks in production](notebookinproduction.png)](https://tanzu.vmware.com/content/blog/how-data-scientists-can-tame-jupyter-notebooks-for-use-in-production-systems)

Parameterize and use `nbconvert`?

----
## Real Pipelines can be Complex

![Connections between the pipeline and other components](pipeline-connections.svg)

----
## Real Pipelines can be Complex

Large amounts of data

Distributed data storage

Distributed processing and learning

Special hardware needs

Fault tolerance

Humans in the loop

----
## Possible Mistakes in ML Pipelines

![Pipeline](pipeline.svg)

Danger of "silent" mistakes in many phases

**Examples?**

----
## Possible Mistakes in ML Pipelines

Danger of "silent" mistakes in many phases:

* Dropped data after format changes
* Failure to push updated model into production
* Incorrect feature extraction
* Use of stale dataset, wrong data source
* Data source no longer available (e.g., web API)
* Telemetry server overloaded
* Negative feedback (telemetry) no longer sent from app
* Use of old model learning code, stale hyperparameters
* Data format changes between ML pipeline steps

----
## Pipeline Thinking

After exploration and prototyping, build a robust pipeline

One-off model creation -> repeatable, automatable process

Enables updates, supports experimentation

Explicit interfaces with other parts of the system (data sources, labeling infrastructure, training infrastructure, deployment, ...)
- -**Design for change** - - ----- -## Building Robust Pipeline Automation - -* Support experimentation and evolution - * Automate - * Design for change - * Design for observability - * Testing the pipeline for robustness -* Thinking in pipelines, not models -* Integrating the Pipeline with other Components - - - - - - - - - - - - - - - - - - - - ---- -# Pipeline Testability and Modularity - - - ----- -## Pipelines are Code - -From experimental notebook code to production code - -Each stage as a function or module - -Well tested in isolation and together - -Robust to changes in inputs (automatically adapt or crash, no silent mistakes) - -Use good engineering practices (version control, documentation, testing, naming, code review) - - - ----- -## Sequential Data Science Code in Notebooks - -
```python
# typical data science code from a notebook
import pandas as pd
from scipy.stats import boxcox
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv', parse_dates=True)

# data cleaning
# ...

# feature engineering
df['month'] = pd.to_datetime(df['datetime']).dt.month
df['dayofweek'] = pd.to_datetime(df['datetime']).dt.dayofweek
df['delivery_count'] = boxcox(df['delivery_count'], 0.4)
df.drop(['datetime'], axis=1, inplace=True)

dummies = pd.get_dummies(df, columns=['month', 'weather', 'dayofweek'])
dummies = dummies.drop(['month_1', 'hour_0', 'weather_1'], axis=1)

X = dummies.drop(['delivery_count'], axis=1)
y = pd.Series(df['delivery_count'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# training and evaluation
lr = LinearRegression()
lr.fit(X_train, y_train)

print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))
```
**How to test??**

----
## Pipeline restructured into separate functions
```python
def encode_day_of_week(df):
    if 'datetime' not in df.columns: raise ValueError("Column datetime missing")
    if df.datetime.dtype != 'object': raise ValueError("Invalid type for column datetime")
    df['dayofweek'] = pd.to_datetime(df['datetime']).dt.day_name()
    df = pd.get_dummies(df, columns=['dayofweek'])
    return df

# ...

def prepare_data(df):
    df = clean_data(df)

    df = encode_day_of_week(df)
    df = encode_month(df)
    df = encode_weather(df)
    df.drop(['datetime'], axis=1, inplace=True)
    return (df.drop(['delivery_count'], axis=1),
            encode_count(pd.Series(df['delivery_count'])))

def learn(X, y):
    lr = LinearRegression()
    lr.fit(X, y)
    return lr

def pipeline():
    train = pd.read_csv('train.csv', parse_dates=True)
    test = pd.read_csv('test.csv', parse_dates=True)
    X_train, y_train = prepare_data(train)
    X_test, y_test = prepare_data(test)
    model = learn(X_train, y_train)
    # evaluate is a helper defined elsewhere (named to avoid Python's builtin eval)
    accuracy = evaluate(model, X_test, y_test)
    return model, accuracy
```
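Once restructured like this, even the input checks become testable. A hedged pytest sketch for the error path (the `pipeline` module name and test name are assumed):

```python
import pandas as pd
import pytest

from pipeline import encode_day_of_week  # hypothetical module name

def test_rejects_missing_datetime_column():
    # no silent mistakes: malformed input should fail loudly, not be ignored
    df = pd.DataFrame({'delivery_count': [1, 2, 3]})
    with pytest.raises(ValueError):
        encode_day_of_week(df)
```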
----
## Orchestrating Functions

```python
def pipeline():
    train = pd.read_csv('train.csv', parse_dates=True)
    test = pd.read_csv('test.csv', parse_dates=True)
    X_train, y_train = prepare_data(train)
    X_test, y_test = prepare_data(test)
    model = learn(X_train, y_train)
    accuracy = evaluate(model, X_test, y_test)
    return model, accuracy
```

Dataflow frameworks like [Luigi](https://github.com/spotify/luigi), [DVC](https://dvc.org/), [Airflow](https://airflow.apache.org/), [d6tflow](https://github.com/d6t/d6tflow), and [Ploomber](https://ploomber.io/) support distribution, fault tolerance, monitoring, ...

Hosted versions like [DataBricks](https://databricks.com/) and [AWS SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/)

----
## Test the Modules

```python
def encode_day_of_week(df):
    if 'datetime' not in df.columns: raise ValueError("Column datetime missing")
    if df.datetime.dtype != 'object': raise ValueError("Invalid type for column datetime")
    df['dayofweek'] = pd.to_datetime(df['datetime']).dt.day_name()
    df = pd.get_dummies(df, columns=['dayofweek'])
    return df
```

```python
def test_day_of_week_encoding():
    df = pd.DataFrame({'datetime': ['2020-01-01','2020-01-02','2020-01-08'], 'delivery_count': [1, 2, 3]})
    encoded = encode_day_of_week(df)
    assert "dayofweek_Wednesday" in encoded.columns
    assert (encoded["dayofweek_Wednesday"] == [1, 0, 1]).all()

# more tests...
```

----
## Subtle Bugs in Data Wrangling Code

```python
df['Join_year'] = df.Joined.dropna().map(
    lambda x: x.split(',')[1].split(' ')[1])
```
```python
df.loc[idx_nan_age,'Age'].loc[idx_nan_age] =
    df['Title'].loc[idx_nan_age].map(map_means)
```
```python
df["Weight"].astype(str).astype(int)
```

----
## Subtle Bugs in Data Wrangling Code (continued)

```python
df['Reviws'] = df['Reviews'].apply(int)
```
```python
df["Release Clause"] =
    df["Release Clause"].replace(regex=['k'], value='000')
df["Release Clause"] =
    df["Release Clause"].astype(str).astype(float)
```

Notes:

1 attempting to remove na values from column, not table

2 loc[] called twice, resulting in assignment to temporary column only

3 astype() is not an in-place operation

4 typo in column name

5&6 modeling problem (k vs K)

----
## Modularity fosters Testability

Breaking code into functions/modules

Supports reuse, separate development, and testing

Can test individual parts

---
# Excursion: Test Automation

----
## From Manual Testing to Continuous Integration

![Manual Testing](manualtesting.jpg)

![Continuous Integration](ci.png)

----
## Anatomy of a Unit Test

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class AdjacencyListTest {
    @Test
    public void testSanityTest(){
        // set up
        Graph g1 = new AdjacencyListGraph(10);
        Vertex s1 = new Vertex("A");
        Vertex s2 = new Vertex("B");
        // check expected results (oracle)
        assertEquals(true, g1.addVertex(s1));
        assertEquals(true, g1.addVertex(s2));
        assertEquals(true, g1.addEdge(s1, s2));
        assertEquals(s2, g1.getNeighbors(s1)[0]);
    }

    // use abstraction, e.g.
common setups - private int helperMethod… -} -``` - ----- -## Ingredients to a Test - -Specification - -Controlled environment - -Test inputs (calls and parameters) - -Expected outputs/behavior (oracle) - - ----- -## Unit Testing Pitfalls - - -Working code, failing tests - -"Works on my machine" - -Tests break frequently - -**How to avoid?** - - ----- -## Testable Code - -Think about testing when writing code - -Unit testing encourages you to write testable code - -Separate parts of the code to make them independently testable - -Abstract functionality behind interface, make it replaceable - -Bonus: Test-Driven Development is a design and development method in which you *always* write tests *before* writing code - - - - ----- -## Build systems & Continuous Integration - -Automate all build, analysis, test, and deployment steps from a command line call - -Ensure all dependencies and configurations are defined - -Ideally reproducible and incremental - -Distribute work for large jobs - -Track results - -**Key CI benefit: Tests are regularly executed, part of process** - ----- -![Continuous Integration example](ci.png) - - - ----- -## Tracking Build Quality - -Track quality indicators over time, e.g., -* Build time -* Coverage -* Static analysis warnings -* Performance results -* Model quality measures -* Number of TODOs in source code - - - ----- -## Coverage - -![](coverage.png) - - - ----- -[![Jenkins Dashboard with Metrics](https://blog.octo.com/wp-content/uploads/2012/08/screenshot-dashboard-jenkins1.png)](https://blog.octo.com/en/jenkins-quality-dashboard-ios-development/) - - - -Source: https://blog.octo.com/en/jenkins-quality-dashboard-ios-development/ - - ----- -## Tracking Model Qualities - -Many tools: MLFlow, ModelDB, Neptune, TensorBoard, Weights & Biases, Comet.ml, ... - -![MLFlow interface](mlflow-web-ui.png) - ----- -## ModelDB Example - -```python -from verta import Client -client = Client("http://localhost:3000") - -proj = client.set_project("My first ModelDB project") -expt = client.set_experiment("Default Experiment") - -# log a training run -run = client.set_experiment_run("First Run") -run.log_hyperparameters({"regularization" : 0.5}) -model1 = # ... model training code goes here -run.log_metric('accuracy', accuracy(model1, validationData)) -``` - - - - - - - - ---- -# Testing Maturity - - - - -Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017) - - ----- - -![](mltestingandmonitoring.png) - - -Source: Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017) - - - ----- -## Data Tests - -1. Feature expectations are captured in a schema. -2. All features are beneficial. -3. No feature’s cost is too much. -4. Features adhere to meta-level requirements. -5. The data pipeline has appropriate privacy controls. -6. New features can be added quickly. -7. All input feature code is tested. - - - -Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017) - ----- -## Tests for Model Development - -1. 
Model specs are reviewed and submitted. -2. Offline and online metrics correlate. -3. All hyperparameters have been tuned. -4. The impact of model staleness is known. -5. A simpler model is not better. -6. Model quality is sufficient on important data slices. -7. The model is tested for considerations of inclusion. - - - -Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017) - ----- -## ML Infrastructure Tests - -1. Training is reproducible. -2. Model specs are unit tested. -3. The ML pipeline is Integration tested. -4. Model quality is validated before serving. -5. The model is debuggable. -6. Models are canaried before serving. -7. Serving models can be rolled back. - - - - -Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017) - ----- -## Monitoring Tests - -1. Dependency changes result in notification. -2. Data invariants hold for inputs. -3. Training and serving are not skewed. -4. Models are not too stale. -5. Models are numerically stable. -6. Computing performance has not regressed. -7. Prediction quality has not regressed. - - - - -Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017) - - ----- - -## Case Study: Covid-19 Detection - - - -(from S20 midterm; assume cloud or hybrid deployment) ----- -## Breakout Groups - -* In the Smartphone Covid Detection scenario -* Discuss in groups: - * Back left: data tests - * Back right: model dev. tests - * Front right: infrastructure tests - * Front left: monitoring tests -* For 8 min, discuss some of the listed point in the context of the Covid-detection scenario: what would you do? -* In `#lecture`, tagging group members, suggest what top 2 tests to implement and how - - - - ---- -# Minimizing and Stubbing Dependencies - - - - ----- -## How to unit test component with dependency on other code? - - - ----- -## How to Test Parts of a System? 
- - -![Client-code-backend](client-code-backend.svg) - -```python -# original implementation hardcodes external API -def clean_gender(df): - def clean(row): - if pd.isnull(row['gender']): - row['gender'] = gender_api_client.predict(row['firstname'], row['lastname'], row['location']) - return row - return df.apply(clean, axis=1) -``` - - ----- -## Automating Test Execution - -![Test driver-code-backend](driver-code-backend.svg) - - -```python -def test_do_not_overwrite_gender(): - df = pd.DataFrame({'firstname': ['John', 'Jane', 'Jim'], - 'lastname': ['Doe', 'Doe', 'Doe'], - 'location': ['Pittsburgh, PA', 'Rome, Italy', 'Paris, PA '], - 'gender': [np.nan, 'F', np.nan]}) - out = clean_gender(df, model_stub) - assert(out['gender'] ==['M', 'F', 'M']).all() -``` - ----- -## Decoupling from Dependencies - - -```java -def clean_gender(df, model): - def clean(row): - if pd.isnull(row['gender']): - row['gender'] = model(row['firstname'], - row['lastname'], - row['location']) - return row - return df.apply(clean, axis=1) -``` - -Replace concrete API with an interface that caller can parameterize - ----- -## Stubbing the Dependency - -![Test driver-code-stub](driver-code-stub.svg) - -```python -def test_do_not_overwrite_gender(): - def model_stub(first, last, location): - return 'M' - - df = pd.DataFrame({'firstname': ['John', 'Jane', 'Jim'], 'lastname': ['Doe', 'Doe', 'Doe'], 'location': ['Pittsburgh, PA', 'Rome, Italy', 'Paris, PA '], 'gender': [np.nan, 'F', np.nan]}) - out = clean_gender(df, model_stub) - assert(out['gender'] ==['M', 'F', 'M']).all() -``` - - - ----- -## General Testing Strategy: Decoupling Code Under Test - -![Test driver-code-stub](driver-stubs-interface.svg) - - -(Mocking frameworks provide infrastructure for expressing such tests compactly.) - - - - - ---- -# Testing Error Handling / Infrastructure Robustness - ----- - - - ----- -## General Error Handling Strategies - -Avoid silent errors - -Recover locally if possible, propagate error if necessary -- fail entire task if needed - -Explicitly handle exceptional conditions and mistakes - -Test correct error handling - -If logging only, is anybody analyzing log files? 
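-
-To make "avoid silent errors" concrete, a small illustrative sketch (all names hypothetical):
-
-```python
-import logging
-import pandas as pd
-
-logger = logging.getLogger(__name__)
-
-# Anti-pattern: the error is swallowed; downstream code fails much later
-def load_features_bad(path):
-    try:
-        return pd.read_csv(path)
-    except Exception:
-        return None
-
-# Better: log the issue, then propagate an explicit error to fail the task
-def load_features(path):
-    try:
-        return pd.read_csv(path)
-    except FileNotFoundError:
-        logger.error("Feature file missing: %s", path)
-        raise
-```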
- - ----- -## Test for Expected Exceptions - - -```python -def test_invalid_day_of_week_data(): - df = pd.DataFrame({'datetime_us': ['01/01/2020'], - 'delivery_count': [1]}) - with pytest.raises(ValueError): - encode_day_of_week(df) -``` - - ----- -## Test for Expected Exceptions - - -```python -def test_learning_fails_with_missing_data(): - df = pd.DataFrame({}) - with pytest.raises(NoDataError): - learn(df) -``` - - ----- -## Test Recovery Mechanisms with Stub - -Use stubs to inject artificial faults - -```python -## testing retry mechanism -from retry.api import retry_call -import pytest - -# stub of a network connection, sometimes failing -class FailedConnection(Connection): - remaining_failures = 0 - def __init__(self, failures): - self.remaining_failures = failures - def get(self, url): - print(self.remaining_failures) - self.remaining_failures -= 1 - if self.remaining_failures >= 0: - raise TimeoutError('fail') - return "success" - -# function to be tested, with recovery mechanism -def get_data(connection, value): - def get(): return connection.get('https://replicate.npmjs.com/registry/'+value) - return retry_call(get, - exceptions = TimeoutError, tries=3, delay=0.1, backoff=2) - -# 3 tests for no problem, recoverable problem, and not recoverable -def test_no_problem_case(): - connection = FailedConnection(0) - assert get_data(connection, '') == 'success' - -def test_successful_recovery(): - connection = FailedConnection(2) - assert get_data(connection, '') == 'success' - -def test_exception_if_unable_to_recover(): - connection = FailedConnection(10) - with pytest.raises(TimeoutError): - get_data(connection, '') -``` - ----- -## Test Error Handling throughout Pipeline - -Is invalid data rejected / repaired? - -Are missing data updates raising errors? - -Are unavailable APIs triggering errors? - -Are failing deployments reported? - ----- -## Log Error Occurrence - -Even when reported or mitigated, log the issue - -Allows later analysis of frequency and patterns - -Monitoring systems can raise alarms for anomalies - - ----- -## Example: Error Logging - -```python -from prometheus_client import Counter -connection_timeout_counter = Counter( - 'connection_retry_total', - 'Retry attempts on failed connections') - -class RetryLogger(): - def warning(self, fmt, error, delay): - connection_timeout_counter.inc() - -retry_logger = RetryLogger() - -def get_data(connection, value): - def get(): return connection.get('https://replicate.npmjs.com/registry/'+value) - return retry_call(get, - exceptions = TimeoutError, tries=3, delay=0.1, backoff=2, - logger = retry_logger) -``` - - - ----- -## Test Monitoring - -* Inject/simulate faulty behavior -* Mock out notification service used by monitoring -* Assert notification - -```java -class MyNotificationService extends NotificationService { - public boolean receivedNotification = false; - public void sendNotification(String msg) { - receivedNotification = true; } -} -@Test void test() { - Server s = getServer(); - MyNotificationService n = new MyNotificationService(); - Monitor m = new Monitor(s, n); - s.stop(); - s.request(); s.request(); - wait(); - assert(n.receivedNotification); -} -``` - ----- -## Test Monitoring in Production - -Like fire drills (manual tests may be okay!) 
-
-Manual tests in production, repeat regularly
-
-Actually take down service or trigger wrong signal to monitor
-
-----
-## Chaos Testing
-
-![Chaos Monkey](simiamarmy.jpg)
-
-
-
-
-http://principlesofchaos.org
-
-Notes: Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Pioneered at Netflix
-
-----
-## Chaos Testing Argument
-
-* Distributed systems are simply too complex to comprehensively predict
-  * experiment to learn how it behaves in the presence of faults
-* Base corrective actions on experimental results because they reflect real risks and actual events
-* Experimentation != testing -- Observe behavior rather than expect specific results
-* Simulate real-world problems in production (e.g., take down server, inject latency)
-* *Minimize blast radius:* Contain experiment scope
-
-----
-## Netflix's Simian Army
-
-
- -* Chaos Monkey: randomly disable production instances -* Latency Monkey: induces artificial delays in our RESTful client-server communication layer -* Conformity Monkey: finds instances that don’t adhere to best-practices and shuts them down -* Doctor Monkey: monitors external signs of health to detect unhealthy instances -* Janitor Monkey: ensures cloud environment is running free of clutter and waste -* Security Monkey: finds security violations or vulnerabilities, and terminates the offending instances -* 10–18 Monkey: detects problems in instances serving customers in multiple geographic regions -* Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. - -
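-
-A tiny sketch of the latency-injection idea (a hypothetical wrapper in the spirit of Latency Monkey, not Netflix's actual implementation):
-
-```python
-import random
-import time
-
-# Wrap a client call so a fraction of requests are artificially delayed,
-# letting us observe how the system copes with a slow dependency.
-def inject_latency(fn, probability=0.1, delay_s=2.0):
-    def wrapped(*args, **kwargs):
-        if random.random() < probability:
-            time.sleep(delay_s)
-        return fn(*args, **kwargs)
-    return wrapped
-```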
- - ----- -## Chaos Toolkit - -* Infrastructure for chaos experiments -* Driver for various infrastructure and failure cases -* Domain specific language for experiment definitions - -```js -{ - "version": "1.0.0", - "title": "What is the impact of an expired certificate on our application chain?", - "description": "If a certificate expires, we should gracefully deal with the issue.", - "tags": ["tls"], - "steady-state-hypothesis": { - "title": "Application responds", - "probes": [ - { - "type": "probe", - "name": "the-astre-service-must-be-running", - "tolerance": true, - "provider": { - "type": "python", - "module": "os.path", - "func": "exists", - "arguments": { - "path": "astre.pid" - } - } - }, - { - "type": "probe", - "name": "the-sunset-service-must-be-running", - "tolerance": true, - "provider": { - "type": "python", - "module": "os.path", - "func": "exists", - "arguments": { - "path": "sunset.pid" - } - } - }, - { - "type": "probe", - "name": "we-can-request-sunset", - "tolerance": 200, - "provider": { - "type": "http", - "timeout": 3, - "verify_tls": false, - "url": "https://localhost:8443/city/Paris" - } - } - ] - }, - "method": [ - { - "type": "action", - "name": "swap-to-expired-cert", - "provider": { - "type": "process", - "path": "cp", - "arguments": "expired-cert.pem cert.pem" - } - }, - { - "type": "probe", - "name": "read-tls-cert-expiry-date", - "provider": { - "type": "process", - "path": "openssl", - "arguments": "x509 -enddate -noout -in cert.pem" - } - }, - { - "type": "action", - "name": "restart-astre-service-to-pick-up-certificate", - "provider": { - "type": "process", - "path": "pkill", - "arguments": "--echo -HUP -F astre.pid" - } - }, - { - "type": "action", - "name": "restart-sunset-service-to-pick-up-certificate", - "provider": { - "type": "process", - "path": "pkill", - "arguments": "--echo -HUP -F sunset.pid" - }, - "pauses": { - "after": 1 - } - } - ], - "rollbacks": [ - { - "type": "action", - "name": "swap-to-vald-cert", - "provider": { - "type": "process", - "path": "cp", - "arguments": "valid-cert.pem cert.pem" - } - }, - { - "ref": "restart-astre-service-to-pick-up-certificate" - }, - { - "ref": "restart-sunset-service-to-pick-up-certificate" - } - ] -} -``` - - - -http://principlesofchaos.org, https://github.com/chaostoolkit, https://github.com/Netflix/SimianArmy - ----- -## Chaos Experiments for ML Infrastructure? - - - - -Note: Fault injection in production for testing in production. Requires monitoring and explicit experiments. - - - - - - - ---- -# Where to Focus Testing? - - - ----- -## Testing in ML Pipelines - -Usually assume ML libraries already tested (pandas, sklearn, etc) - -Focus on custom code -- data quality checks -- data wrangling (feature engineering) -- training setup -- interaction with other components - -Consider tests of latency, throughput, memory, ... - ----- -## Testing Data Quality Checks - -Test correct detection of problems - -```python -def test_invalid_day_of_week_data(): - ... -``` - -Test correct error handling or repair of detected problems - -```python -def test_fill_missing_gender(): - ... -def test_exception_for_missing_data(): - ... -``` - ----- -## Test Data Wrangling Code - -```python -num = data.Size.replace(r'[kM]+$', '', regex=True). - astype(float) -factor = data.Size.str.extract(r'[\d\.]+([KM]+)', - expand =False) -factor = factor.replace(['k','M'], [10**3, 10**6]).fillna(1) -data['Size'] = num*factor.astype(int) -``` -```python -data["Size"]= data["Size"]. 
- replace(regex =['k'], value='000') -data["Size"]= data["Size"]. - replace(regex =['M'], value='000000') -data["Size"]= data["Size"].astype(str). astype(float) -``` - -Note: both attempts are broken: - -* Variant A, returns 10 for “10k” -* Variant B, returns 100.5000000 for “100.5M” - ----- -## Test Model Training Setup? - -Execute training with small sample data - -Ensure shape of model and data as expected (e.g., tensor dimensions) - ----- -## Test Interactions with Other Components - -Test error handling for detecting connection/data problems -* loading training data -* feature server -* uploading serialized model -* A/B testing infrastructure - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ---- -# Integration and system tests - -![Testing levels](unit-integration-system-testing.svg) - - -Notes: - -Software is developed in units that are later assembled. Accordingly we can distinguish different levels of testing. - -Unit Testing - A unit is the "smallest" piece of software that a developer creates. It is typically the work of one programmer and is stored in a single file. Different programming languages have different units: In C++ and Java the unit is the class; in C the unit is the function; in less structured languages like Basic and COBOL the unit may be the entire program. - -Integration Testing - In integration we assemble units together into subsystems and finally into systems. It is possible for units to function perfectly in isolation but to fail when integrated. For example because they share an area of the computer memory or because the order of invocation of the different methods is not the one anticipated by the different programmers or because there is a mismatch in the data types. Etc. - -System Testing - A system consists of all of the software (and possibly hardware, user manuals, training materials, etc.) that make up the product delivered to the customer. System testing focuses on defects that arise at this highest level of integration. Typically system testing includes many types of testing: functionality, usability, security, internationalization and localization, reliability and availability, capacity, performance, backup and recovery, portability, and many more. - -Acceptance Testing - Acceptance testing is defined as that testing, which when completed successfully, will result in the customer accepting the software and giving us their money. From the customer's point of view, they would generally like the most exhaustive acceptance testing possible (equivalent to the level of system testing). From the vendor's point of view, we would generally like the minimum level of testing possible that would result in money changing hands. -Typical strategic questions that should be addressed before acceptance testing are: Who defines the level of the acceptance testing? Who creates the test scripts? Who executes the tests? What is the pass/fail criteria for the acceptance test? When and how do we get paid? 
- - ----- -## Integration and system tests - -Test larger units of behavior - -Often based on use cases or user stories -- customer perspective - -```java -@Test void gameTest() { - Poker game = new Poker(); - Player p = new Player(); - Player q = new Player(); - game.shuffle(seed) - game.add(p); - game.add(q); - game.deal(); - p.bet(100); - q.bet(100); - p.call(); - q.fold(); - assert(game.winner() == p); -} - -``` - - - ----- -## Integration tests - -Test combined behavior of multiple functions - -```java -def test_cleaning_with_feature_eng() { - d = load_test_data(); - cd = clean(d); - f = feature3.encode(cd); - assert(no_missing_values(f["m"])); - assert(max(f["m"]) <= 1.0); -} - -``` - - - ----- -## Test Integration of Components - -```javascript -// making predictions with an ensemble of models -function predict_price(data, models, timeoutms) { - // send asynchronous REST requests all models - const requests = models.map(model => rpc(model, data, {timeout: timeoutms}).then(parseResult).catch(e => -1)) - // collect all answers and return average if at least two models succeeded - return Promise.all(requests).then(predictions => { - const success = predictions.filter(v => v >= 0) - if (success.length < 2) throw new Error("Too many models failed") - return success.reduce((a, b) => a + b, 0) / success.length - }) -} - -// test ensemble of models -const timeout = 500, M1 = "http://localhost:3000/predict", ... -beforeAll(() => { - // launch model 1 API at address M1 - // launch model 2 API at address M2 - // launch model API with timeout at address M3 -} -afterAll(() => { /* shut down all model APIs */ } - -test("success despite timeout", async () => { - const start = performance.now(); - const val = await predict_price(input, [M1, M2, M3], timeout) - expect(performance.now() - start).toBeLessThan(2 * timeout) - expect(val).toBeGreaterThan(0) -}) - -test("fail on too many timeouts", async () => { - const start = performance.now(); - const val = await predict_price(input, [M1, M3, M3], timeout) - expect(performance.now() - start).toBeLessThan(2 * timeout) - expect(val).toThrow() -}) -``` - ----- -## End-To-End Test of Entire Pipeline - -```python -def test_pipeline(): - train = pd.read_csv('pipelinetest_training.csv', parse_dates=True) - test = pd.read_csv('pipelinetest_test.csv', parse_dates=True) - X_train, y_train = prepare_data(train) - X_test, y_test = prepare_data(test) - model = learn(X_train, y_train) - accuracy = eval(model, X_test, y_test) - assert accuracy > 0.9 -``` - - ----- -## System Testing from a User Perspective - -Test the product as a whole, not just components - -Click through user interface, achieve task (often manually performed) - -Derived from requirements (use cases, user stories) - -Testing in production - ----- -## The V-Model of Testing - -![V-Model](vmodel.svg) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ---- -# Code Review and Static Analysis - ----- -## Code Review - -Manual inspection of code -- Looking for problems and possible improvements -- Possibly following checklists -- Individually or as group - -Modern code review: Incremental review at checking -- Review individual changes before merging -- Pull requests on GitHub -- Not very effective at finding bugs, but many other benefits: knowledge transfer, code imporvement, shared code ownership, improving testing - ----- -![Code Review on GitHub](review_github.png) - - ----- -## Subtle Bugs in Data Wrangling Code - -```python 
-df['Join_year'] = df.Joined.dropna().map( - lambda x: x.split(',')[1].split(' ')[1]) -``` -```python -df.loc[idx_nan_age,'Age'].loc[idx_nan_age] = - df['Title'].loc[idx_nan_age].map(map_means) -``` -```python -df["Weight"].astype(str).astype(int) -``` -```python -df['Reviws'] = df['Reviews'].apply(int) -``` - -Notes: We did code review earlier together - ----- -## Static Analysis, Code Linting - -Automatic detection of problematic patterns based on code structure - -```java -if (user.jobTitle = "manager") { - ... -} -``` - -```javascript -function fn() { - x = 1; - return x; - x = 3; -} -``` - - ----- -## Static Analysis for Data Science Code - -* Lots of research -* Style issues in Python -* Shape analysis of tensors in deep learning -* Analysis of flow of datasets to detect data leakage -* ... - - -Examples: -* Yang, Chenyang, et al.. "Data Leakage in Notebooks: Static Detection and Better Processes." Proc. ASE (2022). -* Lagouvardos, S. et al. (2020). Static analysis of shape in TensorFlow programs. In Proc. ECOOP. -* Wang, Jiawei, et al. "Better code, better sharing: on the need of analyzing jupyter notebooks." In Proc. ICSE-NIER. 2020. - - ----- -## Process Integration: Static Analysis Warnings during Code Review - -![Static analysis warnings during code review](staticanalysis_codereview.png) - - - - -Sadowski, Caitlin, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. "Lessons from building static analysis tools at google." Communications of the ACM 61, no. 4 (2018): 58-66. - -Note: Social engineering to force developers to pay attention. Also possible with integration in pull requests on GitHub. - - - ----- -## Bonus: Data Linter at Google - - -**Miscoding** - * Number, date, time as string - * Enum as real - * Tokenizable string (long strings, all unique) - * Zip code as number - - -**Outliers and scaling** - * Unnormalized feature (varies widely) - * Tailed distributions - * Uncommon sign - -**Packaging** - * Duplicate rows - * Empty/missing data - - -Further readings: Hynes, Nick, D. Sculley, and Michael Terry. [The data linter: Lightweight, automated sanity checking for ML data sets](http://learningsys.org/nips17/assets/papers/paper_19.pdf). NIPS MLSys Workshop. 2017. - - - - - - - - ---- -# Summary - -* Beyond model and data quality: Quality of the infrastructure matters, danger of silent mistakes -* Automate pipelines to foster testing, evolution, and experimentation -* Many SE techniques for test automation, testing robustness, test adequacy, testing in production useful for infrastructure quality - ----- -## Further Readings - -
- -* 🗎 O'Leary, Katie, and Makoto Uchida. "[Common problems with Creating Machine Learning Pipelines from Existing Code](https://research.google/pubs/pub48984.pdf)." Proc. Third Conference on Machine Learning and Systems (MLSys) (2020). -* 🗎 Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Proceedings of IEEE Big Data (2017) -* 📰 Zinkevich, Martin. [Rules of Machine Learning: Best Practices for ML Engineering](https://developers.google.com/machine-learning/guides/rules-of-ml/). Google Blog Post, 2017 -* 🗎 Serban, Alex, Koen van der Blom, Holger Hoos, and Joost Visser. "[Adoption and Effects of Software Engineering Best Practices in Machine Learning](https://arxiv.org/pdf/2007.14130)." In Proc. ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (2020). - -
- diff --git a/lectures/12_pipelinequality/review_github.png b/lectures/12_pipelinequality/review_github.png deleted file mode 100644 index 523142ce..00000000 Binary files a/lectures/12_pipelinequality/review_github.png and /dev/null differ diff --git a/lectures/12_pipelinequality/simiamarmy.jpg b/lectures/12_pipelinequality/simiamarmy.jpg deleted file mode 100644 index 2db32a78..00000000 Binary files a/lectures/12_pipelinequality/simiamarmy.jpg and /dev/null differ diff --git a/lectures/12_pipelinequality/staticanalysis_codereview.png b/lectures/12_pipelinequality/staticanalysis_codereview.png deleted file mode 100644 index 98616bee..00000000 Binary files a/lectures/12_pipelinequality/staticanalysis_codereview.png and /dev/null differ diff --git a/lectures/12_pipelinequality/unit-integration-system-testing.svg b/lectures/12_pipelinequality/unit-integration-system-testing.svg deleted file mode 100644 index daf0d8dc..00000000 --- a/lectures/12_pipelinequality/unit-integration-system-testing.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/12_pipelinequality/vmodel.svg b/lectures/12_pipelinequality/vmodel.svg deleted file mode 100644 index b5d6207e..00000000 --- a/lectures/12_pipelinequality/vmodel.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/apigateway.svg b/lectures/13_dataatscale/apigateway.svg deleted file mode 100644 index 25a35378..00000000 --- a/lectures/13_dataatscale/apigateway.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/bluescreen.png b/lectures/13_dataatscale/bluescreen.png deleted file mode 100644 index 1fd3f77d..00000000 Binary files a/lectures/13_dataatscale/bluescreen.png and /dev/null differ diff --git a/lectures/13_dataatscale/dataatscale.md b/lectures/13_dataatscale/dataatscale.md deleted file mode 100644 index c1bf1077..00000000 --- a/lectures/13_dataatscale/dataatscale.md +++ /dev/null @@ -1,1041 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Scaling Data Storage and Data Processing" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - -# Scaling Data Storage and Data Processing - - ---- -## Design and operations - -![Overview of course content](../_assets/overview.svg) - - - ----- -## Readings - -Required reading: 🕮 Nathan Marz. Big Data: Principles and best practices of scalable realtime data systems. Simon and Schuster, 2015. Chapter 1: A new paradigm for Big Data - -Suggested watching: Molham Aref. [Business Systems with Machine Learning](https://www.youtube.com/watch?v=_bvrzYOA8dY). Guest lecture, 2020. - -Suggested reading: Martin Kleppmann. [Designing Data-Intensive Applications](https://dataintensive.net/). OReilly. 2017. - ----- - -# Learning Goals - -* Organize different data management solutions and their tradeoffs -* Understand the scalability challenges involved in large-scale machine learning and specifically deep learning -* Explain the tradeoffs between batch processing and stream processing and the lambda architecture -* Recommend and justify a design and corresponding technologies for a given system - ---- -# Case Study - -![Google Photos Screenshot](gphotos.png) - - -Notes: -* Discuss possible architecture and when to predict (and update) -* in may 2017: 500M users, uploading 1.2billion photos per day (14k/sec) -* in Jun 2019 1 billion users - ----- - -## Adding capacity - - - -*Stories of catastrophic success?* - ---- - -# Data Management and Processing in ML-Enabled Systems - ----- -## Kinds of Data - -* Training data -* Input data -* Telemetry data -* (Models) - -*all potentially with huge total volumes and high throughput* - -*need strategies for storage and processing* - ----- -## Data Management and Processing in ML-Enabled Systems - -Store, clean, and update training data - -Learning process reads training data, writes model - -Prediction task (inference) on demand or precomputed - -Individual requests (low/high volume) or large datasets? - -*Often both learning and inference data heavy, high volume tasks* - ----- -## Scaling Computations - - -Efficent Algorithms - -Faster Machines - -More Machines - - ----- -## Distributed Everything - -Distributed data cleaning - -Distributed feature extraction - -Distributed learning - -Distributed large prediction tasks - -Incremental predictions - -Distributed logging and telemetry - - - ----- -## Reliability and Scalability Challenges in AI-Enabled Systems? - - - - - ----- -## Distributed Systems and AI-Enabled Systems - -* Learning tasks can take substantial resources -* Datasets too large to fit on single machine -* Nontrivial inference time, many many users -* Large amounts of telemetry -* Experimentation at scale -* Models in safety critical parts -* Mobile computing, edge computing, cyber-physical systems - ----- -## Reminder: T-Shaped People - -![T-shaped people illustration](tshaped.png) - - - -Go deeper with: Martin Kleppmann. [Designing Data-Intensive Applications](https://dataintensive.net/). OReilly. 2017. - ---- -# Excursion: Distributed Deep Learning with the Parameter Server Architecture - - -Li, Mu, et al. "[Scaling distributed machine learning with the parameter server](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf)." OSDI, 2014. 
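-
-The following slides walk through this architecture; as a preview, a highly simplified single-process sketch of the push/pull pattern (all names hypothetical; a real parameter server shards parameters and workers across many machines):
-
-```python
-import numpy as np
-
-# Toy parameter server: holds the shared weights, applies pushed gradients
-class ParameterServer:
-    def __init__(self, dim):
-        self.w = np.zeros(dim)
-    def pull(self):
-        return self.w.copy()
-    def push(self, gradient, lr=0.01):
-        self.w -= lr * gradient
-
-# Worker step: pull weights, compute a gradient on the local data shard, push
-def worker_step(server, X_shard, y_shard):
-    w = server.pull()
-    residual = X_shard @ w - y_shard       # linear-model error on this shard
-    gradient = X_shard.T @ residual / len(y_shard)
-    server.push(gradient)
-```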
- ----- -## Recall: Backpropagation - -![Multi Layer Perceptron](mlperceptron.svg) - - ----- -## Training at Scale is Challenging - -Already 2012 at Google: 1TB-1PB of training data, $10^9-10^{12}$ parameters - -Need distributed training; learning is often a sequential problem - -Just exchanging model parameters requires substantial network bandwidth - -Fault tolerance essential (like batch processing), add/remove nodes - -Tradeoff between convergence rate and system efficiency - - -Li, Mu, et al. "[Scaling distributed machine learning with the parameter server](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf)." OSDI, 2014. - ----- -## Distributed Gradient Descent - -[![Parameter Server](parameterserver.png)](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf) - - ----- -## Parameter Server Architecture - -[![Parameter Server](parameterserver2.png)](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf) - - -Note: -Multiple parameter servers that each only contain a subset of the parameters, and multiple workers that each require only a subset of each - -Ship only relevant subsets of mathematical vectors and matrices, batch communication - -Resolve conflicts when multiple updates need to be integrated (sequential, eventually, bounded delay) - -Run more than one learning algorithm simulaneously - ----- -## SysML Conference - - -Increasing interest in the systems aspects of machine learning - -e.g., building large scale and robust learning infrastructure - -https://mlsys.org/ - - - - - - - - - - - - - ---- -# Data Storage Basics - -Relational vs document storage - -1:n and n:m relations - -Storage and retrieval, indexes - -Query languages and optimization - ----- -## Relational Data Models - -
- -**Photos:** - -|photo_id|user_id|path|upload_date|size|camera_id|camera_setting| -|-|-|-|-|-|-|-| -|133422131|54351|/st/u211/1U6uFl47Fy.jpg|2021-12-03T09:18:32.124Z|5.7|663|ƒ/1.8; 1/120; 4.44mm; ISO271| -|133422132|13221| /st/u11b/MFxlL1FY8V.jpg |2021-12-03T09:18:32.129Z|3.1|1844|ƒ/2, 1/15, 3.64mm, ISO1250| -|133422133|54351|/st/x81/ITzhcSmv9s.jpg|2021-12-03T09:18:32.131Z|4.8|663|ƒ/1.8; 1/120; 4.44mm; ISO48| - - - -**Users:** - - -| user_id |account_name|photos_total|last_login| -|-|-|-|-| -|54351| ckaestne | 5124 | 2021-12-08T12:27:48.497Z | -|13221| eva.burk |3|2021-12-21T01:51:54.713Z| - - - -**Cameras:** - - -| camera_id |manufacturer|print_name| -|-|-|-| -|663| Google | Google Pixel 5 | -|1844|Motorola|Motorola MotoG3| - - - -```sql -select p.photo_id, p.path, u.photos_total -from photos p, users u -where u.user_id=p.user_id and u.account_name = "ckaestne" -``` - -
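-
-A self-contained way to try the join query above, using Python's built-in sqlite3 module (schema and table contents abbreviated):
-
-```python
-import sqlite3
-
-# Minimal in-memory version of the photos/users schema
-conn = sqlite3.connect(":memory:")
-conn.executescript("""
-    CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, user_id INTEGER, path TEXT);
-    CREATE TABLE users  (user_id  INTEGER PRIMARY KEY, account_name TEXT, photos_total INTEGER);
-    INSERT INTO users  VALUES (54351, 'ckaestne', 5124), (13221, 'eva.burk', 3);
-    INSERT INTO photos VALUES (133422131, 54351, '/st/u211/1U6uFl47Fy.jpg');
-""")
-rows = conn.execute("""
-    SELECT p.photo_id, p.path, u.photos_total
-    FROM photos p JOIN users u ON u.user_id = p.user_id
-    WHERE u.account_name = 'ckaestne'
-""").fetchall()
-print(rows)  # [(133422131, '/st/u211/1U6uFl47Fy.jpg', 5124)]
-```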
- ----- - -## Document Data Models - - -```js -{ - "_id": 133422131, - "path": "/st/u211/1U6uFl47Fy.jpg", - "upload_date": "2021-12-03T09:18:32.124Z", - "user": { - "account_name": "ckaestne", - "account_id": "a/54351" - }, - "size": "5.7", - "camera": { - "manufacturer": "Google", - "print_name": "Google Pixel 5", - "settings": "ƒ/1.8; 1/120; 4.44mm; ISO271" - } -} - -``` - -```js -db.getCollection('photos').find( { "user.account_name": "ckaestne"}) -``` - ----- -## Log files, unstructured data - -```text -02:49:12 127.0.0.1 GET /img13.jpg 200 -02:49:35 127.0.0.1 GET /img27.jpg 200 -03:52:36 127.0.0.1 GET /main.css 200 -04:17:03 127.0.0.1 GET /img13.jpg 200 -05:04:54 127.0.0.1 GET /img34.jpg 200 -05:38:07 127.0.0.1 GET /img27.jpg 200 -05:44:24 127.0.0.1 GET /img13.jpg 200 -06:08:19 127.0.0.1 GET /img13.jpg 200 -``` - - ----- -## Tradeoffs - - - ----- -## Data Encoding - -Plain text (csv, logs) - -Semi-structured, schema-free (JSON, XML) - -Schema-based encoding (relational, Avro, ...) - -Compact encodings (protobuffer, ...) - ---- -# Distributed Data Storage - ----- -## Replication vs Partitioning - - - ----- -## Partitioning - - -Divide data: - -* *Horizontal partitioning:* Different rows in different tables; e.g., movies by decade, hashing often used -* *Vertical partitioning:* Different columns in different tables; e.g., movie title vs. all actors - -**Tradeoffs?** - - - -![Horizontal partitioning](horizonalpartition.svg) - - - - ----- -## Replication with Leaders and Followers - -![Leader-follower replication](leaderfollowerreplication.svg) - - - ----- -## Replication Strategies: Leaders and Followers - -Write to leader, propagated synchronously or async. - -Read from any follower - -Elect new leader on leader outage; catchup on follower outage - -Built in model of many databases (MySQL, MongoDB, ...) - -**Benefits and Drawbacks?** - - ----- -## Recall: Google File System - -![](gfs.png) - - - -Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "[The Google file system.](https://ai.google/research/pubs/pub51.pdf)" ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003. - - ----- -## Multi-Leader Replication - -Scale write access, add redundancy - -Requires coordination among leaders -* Resolution of write conflicts - -Offline leaders (e.g. apps), collaborative editing - - - ----- -## Leaderless Replication - -Client writes to multiple replica, propagate from there - -Read from multiple replica (quorum required) -* Repair on reads, background repair process - -Versioning of entries (clock problem) - -*e.g. Amazon Dynamo, Cassandra, Voldemort* - ----- -## Transactions - -Multiple operations conducted as one, all or nothing - -Avoids problems such as -* dirty reads -* dirty writes - -Various strategies, including locking and optimistic+rollback - -Overhead in distributed setting - ---- -# Data Processing (Overview) - -* Services (online) - * Responding to client requests as they come in - * Evaluate: Response time -* Batch processing (offline) - * Computations run on large amounts of data - * Takes minutes to days; typically scheduled periodically - * Evaluate: Throughput -* Stream processing (near real time) - * Processes input events, not responding to requests - * Shortly after events are issued - ---- -# Microservices - ----- -## Microservices - -![Audible example](microservice.svg) - - - -Figure based on Christopher Meiklejohn. 
[Dynamic Reduction: Optimizing Service-level Fault Injection Testing With Service Encapsulation](http://christophermeiklejohn.com/filibuster/2021/10/14/filibuster-4.html). Blog Post 2021 - ----- -## Microservices - -
- -Independent, cohesive services - * Each specialized for one task - * Each with own data storage - * Each independently scalable through multiple instances + load balancer - -Remote procedure calls - -Different teams can work on different services independently (even in different languages) - -But: Substantial complexity from distributed system nature: various network failures, - latency from remote calls, ... - -*Avoid microservice complexity unless really needed for scalability* - -
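-
-A minimal sketch of one such service (Flask assumed only for illustration; `load_model` is a hypothetical helper):
-
-```python
-from flask import Flask, request, jsonify
-
-app = Flask(__name__)
-model = load_model("recommendation_model.pkl")  # hypothetical helper
-
-# One small, independently deployable and scalable prediction service
-@app.route("/predict", methods=["POST"])
-def predict():
-    features = request.get_json()
-    return jsonify({"prediction": model.predict([features])[0]})
-```
-
-Another service would invoke it as a remote procedure call, e.g., `requests.post("http://recommendation-service/predict", json=features)`, accepting that such calls can time out or fail.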
- ----- -## API Gateway Pattern - -Central entry point, authentication, routing, updates, ... - -![API Gateway illustration](apigateway.svg) - - - - ---- -# Batch Processing - ----- -## Large Jobs - -* Analyzing TB of data, typically distributed storage -* Filtering, sorting, aggregating -* Producing reports, models, ... - -```sh -cat /var/log/nginx/access.log | - awk '{print $7}' | - sort | - uniq -c | - sort -r -n | - head -n 5 -``` ----- -[![Map Reduce example](mapreduce.svg)](mapreduce.svg) - - ----- -## Distributed Batch Processing - -Process data locally at storage - -Aggregate results as needed - -Separate pluming from job logic - -*MapReduce* as common framework - - ----- -## MapReduce -- Functional Programming Style - -Similar to shell commands: Immutable inputs, new outputs, avoid side effects - -Jobs can be repeated (e.g., on crashes) - -Easy rollback - -Multiple jobs in parallel (e.g., experimentation) - ----- -## Machine Learning and MapReduce - - - -Notes: Useful for big learning jobs, but also for feature extraction - ----- -## Dataflow Engines (Spark, Tez, Flink, ...) - -Single job, rather than subjobs - -More flexible than just map and reduce - -Multiple stages with explicit dataflow between them - -Often in-memory data - -Pluming and distribution logic separated - ----- -## Key Design Principle: Data Locality - -> Moving Computation is Cheaper than Moving Data -- [Hadoop Documentation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#aMoving_Computation_is_Cheaper_than_Moving_Data) - -Data often large and distributed, code small - -Avoid transfering large amounts of data - -Perform computation where data is stored (distributed) - -Transfer only results as needed - -*"The map reduce way"* - - - ---- -# Stream Processing - -Event-based systems, message passing style, publish subscribe - ----- -## Stream Processing (e.g., Kafka) -![Stream example](stream.svg) - - ----- -## Messaging Systems - -Multiple producers send messages to topic - -Multiple consumers can read messages - --> Decoupling of producers and consumers - -Message buffering if producers faster than consumers - -Typically some persistency to recover from failures - -Messages removed after consumption or after timeout - -Various error handling strategies (acknowledgements, redelivery, ...) - ----- -## Common Designs - -Like shell programs: Read from stream, produce output in other stream. -> loose coupling - - -![](stream-dataflow.svg) - - ----- -## Stream Queries - -Processing one event at a time independently - -vs incremental analysis over all messages up to that point - -vs floating window analysis across recent messages - -Works well with probabilistic analyses - ----- -## Consumers - -Multiple consumers share topic for scaling and load balancing - -Multiple consumers read same message for different work - -Partitioning possible - ----- -## Design Questions - -Message loss important? (at-least-once processing) - -Can messages be processed repeatedly (at-most-once processing) - -Is the message order important? - -Are messages still needed after they are consumed? - ----- -## Stream Processing and AI-enabled Systems? 
- - - -Notes: Process data as it arrives, prepare data for learning tasks, -use models to annotate data, analytics - ----- -## Event Sourcing - -* Append only databases -* Record edit events, never mutate data -* Compute current state from all past events, can reconstruct old state -* For efficiency, take state snapshots -* *Similar to traditional database logs, but persistent* - -```text -addPhoto(id=133422131, user=54351, path="/st/u211/1U6uFl47Fy.jpg", date="2021-12-03T09:18:32.124Z") -updatePhotoData(id=133422131, user=54351, title="Sunset") -replacePhoto(id=133422131, user=54351, path="/st/x594/vipxBMFlLF.jpg", operation="/filter/palma") -deletePhoto(id=133422131, user=54351) -``` - ----- -## Benefits of Immutability (Event Sourcing) - -
- -* All history is stored, recoverable -* Versioning easy by storing id of latest record -* Can compute multiple views -* Compare *git* - -> *On a shopping website, a customer may add an item to their cart and then -remove it again. Although the second event cancels out the first event [...], it may be useful to know for analytics purposes that the -customer was considering a particular item but then decided against it. Perhaps they -will choose to buy it in the future, or perhaps they found a substitute. This information is recorded in an event log, but would be lost in a database [...].* - -
-
-
-Source: Greg Young. [CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs). Code on the Beach 2014 via Martin Kleppmann. Designing Data-Intensive Applications. OReilly. 2017.
-
-----
-## Drawbacks of Immutable Data
-
-
-
-Notes:
-* Storage overhead, extra complexity of deriving state
-* Frequent changes may create massive data overhead
-* Some sensitive data may need to be deleted (e.g., privacy, security)
-
----
-# The Lambda Architecture
-
-----
-## 3 Layer Storage Architecture
-
-
-* Batch layer: best accuracy, all data, recompute periodically
-* Speed layer: stream processing, incremental updates, possibly approximated
-* Serving layer: provide results of batch and speed layers to clients
-
-Assumes append-only data
-
-Supports tasks with widely varying latency
-
-Balance latency, throughput and fault tolerance
-
-----
-## Lambda Architecture and Machine Learning
-
-![Lambda Architecture](lambda.svg)
-
-
-
-* Learn accurate model in batch job
-* Learn incremental model in stream processor
-
-----
-## Data Lake
-
-Trend to store all events in raw form (no consistent schema)
-
-May be useful later
-
-Data storage is comparably cheap
-
-
-
-----
-## Data Lake
-
-Trend to store all events in raw form (no consistent schema)
-
-May be useful later
-
-Data storage is comparably cheap
-
-Bet: *Yet unknown future value of data is greater than storage costs*
-
-----
-## Reasoning about Dataflows
-
-Many data sources, many outputs, many copies
-
-Which data is derived from what other data and how?
-
-Is it reproducible? Are old versions archived?
-
-How do you get the right data to the right place in the right format?
-
-**Plan and document data flows**
-
-----
-
-![](stream-dataflow.svg)
-
-
-----
-
-[![Lots of data storage systems](etleverywhere.png)](https://youtu.be/_bvrzYOA8dY?t=1452)
-
-
-Molham Aref "[Business Systems with Machine Learning](https://www.youtube.com/watch?v=_bvrzYOA8dY)"
-
----
-# Breakout: Vimeo Videos
-
-As a group, discuss and post in `#lecture`, tagging group members:
-* How to distribute storage:
-* How to design a scalable copyright-protection solution:
-* How to design scalable analytics (views, ratings, ...):
-
-[![Vimeo page](vimeo.png)](https://vimeo.com/about)
-
---- 
-# Excursion: ETL Tools
-
-Extract, transform, load
-
-**The data engineer's toolbox**
-
-----
-## Data Warehousing (OLAP)
-
-Large denormalized databases with materialized views for large scale reporting queries
-* e.g. sales database, queries for sales trends by region
-
-Read-only except for batch updates: Data from OLTP systems loaded periodically, e.g.
over night - - -![Data warehouse](datawarehouse.jpg) - -Note: Image source: https://commons.wikimedia.org/wiki/File:Data_Warehouse_Feeding_Data_Mart.jpg - ----- -## ETL: Extract, Transform, Load - -* Transfer data between data sources, often OLTP -> OLAP system -* Many tools and pipelines - - Extract data from multiple sources (logs, JSON, databases), snapshotting - - Transform: cleaning, (de)normalization, transcoding, sorting, joining - - Loading in batches into database, staging -* Automation, parallelization, reporting, data quality checking, monitoring, profiling, recovery -* Many commercial tools - - -Examples of tools in [several](https://www.softwaretestinghelp.com/best-etl-tools/) [lists](https://www.scrapehero.com/best-data-management-etl-tools/) - ----- -[![XPlenty Web Page Screenshot](xplenty.png)](https://www.xplenty.com/) - ----- - -[![ETL everywhere](etleverywhere.png)](https://youtu.be/_bvrzYOA8dY?t=1452) - - -Molham Aref "[Business Systems with Machine Learning](https://www.youtube.com/watch?v=_bvrzYOA8dY)" - - - ---- -# Complexity of Distributed Systems - ----- -![Stop Fail](bluescreen.png) - - ----- -## Common Distributed System Issues - -* Systems may crash -* Messages take time -* Messages may get lost -* Messages may arrive out of order -* Messages may arrive multiple times -* Messages may get manipulated along the way -* Bandwidth limits -* Coordination overhead -* Network partition -* ... - ----- -## Types of failure behaviors - -* Fail-stop -* Other halting failures -* Communication failures - * Send/receive omissions - * Network partitions - * Message corruption -* Data corruption -* Performance failures - * High packet loss rate - * Low throughput, High latency -* Byzantine failures - ----- -## Common Assumptions about Failures - -* Behavior of others is fail-stop -* Network is reliable -* Network is semi-reliable but asynchronous -* Network is lossy but messages are not corrupt -* Network failures are transitive -* Failures are independent -* Local data is not corrupt -* Failures are reliably detectable -* Failures are unreliably detectable - ----- -## Strategies to Handle Failures - -* Timeouts, retry, backup services -* Detect crashed machines (ping/echo, heartbeat) -* Redundant + first/voting -* Transactions -* -* Do lost messages matter? -* Effect of resending message? - ----- -## Test Error Handling - -* Recall: Testing with stubs -* Recall: Chaos experiments - - - - - - - ---- -# Performance Planning and Analysis - ----- -## Performance Planning and Analysis - -Ideally architectural planning upfront - * Identify key components and their interactions - * Estimate performance parameters - * Simulate system behavior (e.g., queuing theory) - -Existing system: Analyze performance bottlenecks - * Profiling of individual components - * Performance testing (stress testing, load testing, etc) - * Performance monitoring of distributed systems - ----- -## Performance Analysis - -What is the average waiting? - -How many customers are waiting on average? - -How long is the average service time? - -What are the chances of one or more servers being idle? - -What is the average utilization of the servers? - --> Early analysis of different designs for bottlenecks - --> Capacity planning - ----- -## Queuing Theory - -
-
-Queuing theory deals with the analysis of lines where customers wait to receive a service
-* Waiting at Quiznos
-* Waiting to check in at an airport
-* Kept on hold at a call center
-* Streaming video over the net
-* Requesting a web service
-
-A queue is formed when requests for services outpace the ability of the server(s) to service them immediately
-  * Requests arrive faster than they can be processed (unstable queue)
-  * Requests do not arrive faster than they can be processed but their processing is delayed by some time (stable queue)
-
-Queues exist because infinite capacity is infinitely expensive and excessive capacity is excessively expensive
-
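-
-For the simplest model (M/M/1: Poisson arrivals at rate λ, one server with exponential service at rate μ), standard textbook formulas answer the questions on the earlier Performance Analysis slide directly; a small sketch (standard queuing results, not from the lecture):
-
-```python
-# Standard M/M/1 results: arrival rate lam, service rate mu (requires lam < mu)
-def mm1_metrics(lam, mu):
-    assert lam < mu, "queue is unstable if arrivals outpace service"
-    rho = lam / mu                      # server utilization
-    avg_in_system = rho / (1 - rho)     # avg. number of requests in the system
-    avg_time_in_system = 1 / (mu - lam) # avg. time in system (wait + service)
-    avg_wait = rho / (mu - lam)         # avg. waiting time in the queue
-    return {"utilization": rho, "avg_in_system": avg_in_system,
-            "avg_time_in_system": avg_time_in_system, "avg_wait": avg_wait}
-
-# e.g., 80 requests/s arriving at a server that handles 100/s:
-print(mm1_metrics(80, 100))
-# utilization 0.8, ~4 requests in system, 50 ms in system, 40 ms of it waiting
-```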
- ----- -## Queuing Theory - -![Simple Queues](queuingth.png) - - ----- -## Analysis Steps (roughly) - -Identify system abstraction to analyze (typically architectural level, e.g. services, but also protocols, datastructures and components, parallel processes, networks) - -Model connections and dependencies - -Estimate latency and capacity per component (measurement and testing, prior systems, estimates, …) - -Run simulation/analysis to gather performance curves - -Evaluate sensitivity of simulation/analysis to various parameters (‘what-if questions’) - ----- -## Simulation (e.g., JMT) - -![JMT screenshot](jmt1.png) - - - - -G.Serazzi Ed. Performance Evaluation Modelling with JMT: learning by examples. Politecnico di Milano - DEI, TR 2008.09, 366 pp., June 2008 - ----- -## Profiling - -Mostly used during development phase in single components - -![VisualVM profiler](profiler.jpg) - - ----- -## Performance Testing - -* Load testing: Assure handling of maximum expected load -* Scalability testing: Test with increasing load -* Soak/spike testing: Overload application for some time, observe stability -* Stress testing: Overwhelm system resources, test graceful failure + recovery -* -* Observe (1) latency, (2) throughput, (3) resource use -* All automateable; tools like JMeter - ----- -## Performance Monitoring of Distr. Systems - -[![](distprofiler.png)](distprofiler.png) - - - -Source: https://blog.appdynamics.com/tag/fiserv/ - ----- -## Performance Monitoring of Distributed Systems - -* Instrumentation of (Service) APIs -* Load of various servers -* Typically measures: latency, traffic, errors, saturation -* -* Monitoring long-term trends -* Alerting -* Automated releases/rollbacks -* Canary testing and A/B testing - - - - - ---- - -# Summary - -* Large amounts of data (training, inference, telemetry, models) -* Distributed storage and computation for scalability -* Common design patterns (e.g., batch processing, stream processing, lambda architecture) -* Design considerations: mutable vs immutable data -* Distributed computing also in machine learning -* Lots of tooling for data extraction, transformation, processing -* Many challenges through distribution: failures, debugging, performance, ... - - -Recommended reading: Martin Kleppmann. [Designing Data-Intensive Applications](https://dataintensive.net/). OReilly. 2017. - - - - ----- - -## Further Readings - -
- -* Molham Aref "[Business Systems with Machine Learning](https://www.youtube.com/watch?v=_bvrzYOA8dY)" Invited Talk 2020 -* Sawadogo, Pegdwendé, and Jérôme Darmont. "[On data lake architectures and metadata management](https://hal.archives-ouvertes.fr/hal-03114365/)." Journal of Intelligent Information Systems 56, no. 1 (2021): 97-120. -* Warren, James, and Nathan Marz. [Big Data: Principles and best practices of scalable realtime data systems](https://bookshop.org/books/big-data-principles-and-best-practices-of-scalable-realtime-data-systems/9781617290343). Manning, 2015. -* Smith, Jeffrey. [Machine Learning Systems: Designs that Scale](https://bookshop.org/books/machine-learning-systems-designs-that-scale/9781617293337). Manning, 2018. -* Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. “[Data Management Challenges in Production Machine Learning](https://dl.acm.org/doi/pdf/10.1145/3035918.3054782).” In Proceedings of the 2017 ACM International Conference on Management of Data, 1723–26. ACM. - -
\ No newline at end of file diff --git a/lectures/13_dataatscale/datawarehouse.jpg b/lectures/13_dataatscale/datawarehouse.jpg deleted file mode 100644 index 946ffcb4..00000000 Binary files a/lectures/13_dataatscale/datawarehouse.jpg and /dev/null differ diff --git a/lectures/13_dataatscale/distprofiler.png b/lectures/13_dataatscale/distprofiler.png deleted file mode 100644 index 66a0dfab..00000000 Binary files a/lectures/13_dataatscale/distprofiler.png and /dev/null differ diff --git a/lectures/13_dataatscale/etleverywhere.png b/lectures/13_dataatscale/etleverywhere.png deleted file mode 100644 index 96234cc8..00000000 Binary files a/lectures/13_dataatscale/etleverywhere.png and /dev/null differ diff --git a/lectures/13_dataatscale/gfs.png b/lectures/13_dataatscale/gfs.png deleted file mode 100644 index b60e059d..00000000 Binary files a/lectures/13_dataatscale/gfs.png and /dev/null differ diff --git a/lectures/13_dataatscale/gphotos.png b/lectures/13_dataatscale/gphotos.png deleted file mode 100644 index 585309d3..00000000 Binary files a/lectures/13_dataatscale/gphotos.png and /dev/null differ diff --git a/lectures/13_dataatscale/horizonalpartition.svg b/lectures/13_dataatscale/horizonalpartition.svg deleted file mode 100644 index 78172554..00000000 --- a/lectures/13_dataatscale/horizonalpartition.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/jmt1.png b/lectures/13_dataatscale/jmt1.png deleted file mode 100644 index 185010b2..00000000 Binary files a/lectures/13_dataatscale/jmt1.png and /dev/null differ diff --git a/lectures/13_dataatscale/lambda.svg b/lectures/13_dataatscale/lambda.svg deleted file mode 100644 index 241033f1..00000000 --- a/lectures/13_dataatscale/lambda.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/leaderfollowerreplication.svg b/lectures/13_dataatscale/leaderfollowerreplication.svg deleted file mode 100644 index 86704fa8..00000000 --- a/lectures/13_dataatscale/leaderfollowerreplication.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/mapreduce.svg b/lectures/13_dataatscale/mapreduce.svg deleted file mode 100644 index 230a66d2..00000000 --- a/lectures/13_dataatscale/mapreduce.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/microservice.svg b/lectures/13_dataatscale/microservice.svg deleted file mode 100644 index 09cdf95d..00000000 --- a/lectures/13_dataatscale/microservice.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/mlperceptron.svg b/lectures/13_dataatscale/mlperceptron.svg deleted file mode 100644 index 69feea0c..00000000 --- a/lectures/13_dataatscale/mlperceptron.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/parameterserver.png b/lectures/13_dataatscale/parameterserver.png deleted file mode 100644 index 2cc17a72..00000000 Binary files a/lectures/13_dataatscale/parameterserver.png and /dev/null differ diff --git a/lectures/13_dataatscale/parameterserver2.png b/lectures/13_dataatscale/parameterserver2.png deleted file mode 100644 index 98c77f1c..00000000 Binary files a/lectures/13_dataatscale/parameterserver2.png and /dev/null differ diff --git a/lectures/13_dataatscale/profiler.jpg b/lectures/13_dataatscale/profiler.jpg deleted file mode 100644 index ca87d36a..00000000 Binary files a/lectures/13_dataatscale/profiler.jpg and /dev/null differ diff --git 
a/lectures/13_dataatscale/queuingth.png b/lectures/13_dataatscale/queuingth.png deleted file mode 100644 index 4125ec1e..00000000 Binary files a/lectures/13_dataatscale/queuingth.png and /dev/null differ diff --git a/lectures/13_dataatscale/stream-dataflow.svg b/lectures/13_dataatscale/stream-dataflow.svg deleted file mode 100644 index 19f22a93..00000000 --- a/lectures/13_dataatscale/stream-dataflow.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/stream.svg b/lectures/13_dataatscale/stream.svg deleted file mode 100644 index 601451b2..00000000 --- a/lectures/13_dataatscale/stream.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/13_dataatscale/tshaped.png b/lectures/13_dataatscale/tshaped.png deleted file mode 100644 index e4b6d35b..00000000 Binary files a/lectures/13_dataatscale/tshaped.png and /dev/null differ diff --git a/lectures/13_dataatscale/vimeo.png b/lectures/13_dataatscale/vimeo.png deleted file mode 100644 index 88a8b401..00000000 Binary files a/lectures/13_dataatscale/vimeo.png and /dev/null differ diff --git a/lectures/13_dataatscale/xplenty.png b/lectures/13_dataatscale/xplenty.png deleted file mode 100644 index 2b30d2fd..00000000 Binary files a/lectures/13_dataatscale/xplenty.png and /dev/null differ diff --git a/lectures/14_operations/Kubernetes.png b/lectures/14_operations/Kubernetes.png deleted file mode 100644 index a446a7a5..00000000 Binary files a/lectures/14_operations/Kubernetes.png and /dev/null differ diff --git a/lectures/14_operations/classicreleasepipeline.png b/lectures/14_operations/classicreleasepipeline.png deleted file mode 100644 index 49cbd071..00000000 Binary files a/lectures/14_operations/classicreleasepipeline.png and /dev/null differ diff --git a/lectures/14_operations/continuous_delivery.gif b/lectures/14_operations/continuous_delivery.gif deleted file mode 100644 index 30a22de7..00000000 Binary files a/lectures/14_operations/continuous_delivery.gif and /dev/null differ diff --git a/lectures/14_operations/devops.png b/lectures/14_operations/devops.png deleted file mode 100644 index 3abb34d2..00000000 Binary files a/lectures/14_operations/devops.png and /dev/null differ diff --git a/lectures/14_operations/devops_meme.jpg b/lectures/14_operations/devops_meme.jpg deleted file mode 100644 index ac3b1911..00000000 Binary files a/lectures/14_operations/devops_meme.jpg and /dev/null differ diff --git a/lectures/14_operations/devops_tools.jpg b/lectures/14_operations/devops_tools.jpg deleted file mode 100644 index 4140b6d4..00000000 Binary files a/lectures/14_operations/devops_tools.jpg and /dev/null differ diff --git a/lectures/14_operations/docker_logo.png b/lectures/14_operations/docker_logo.png deleted file mode 100644 index c08509fe..00000000 Binary files a/lectures/14_operations/docker_logo.png and /dev/null differ diff --git a/lectures/14_operations/facebookpipeline.png b/lectures/14_operations/facebookpipeline.png deleted file mode 100644 index 3415f13e..00000000 Binary files a/lectures/14_operations/facebookpipeline.png and /dev/null differ diff --git a/lectures/14_operations/lfai-landscape.png b/lectures/14_operations/lfai-landscape.png deleted file mode 100644 index f7fbfbaf..00000000 Binary files a/lectures/14_operations/lfai-landscape.png and /dev/null differ diff --git a/lectures/14_operations/operations.md b/lectures/14_operations/operations.md deleted file mode 100644 index 6423af45..00000000 --- a/lectures/14_operations/operations.md +++ /dev/null @@ -1,826 
+0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Planning for Operations" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - - -# Planning for Operations - - ---- -## Operations - -![Overview of course content](../_assets/overview.svg) - - - - ----- -## Readings - -Required reading: Shankar, Shreya, Rolando Garcia, Joseph M. Hellerstein, and Aditya G. Parameswaran. "[Operationalizing machine learning: An interview study](https://arxiv.org/abs/2209.09125)." arXiv preprint arXiv:2209.09125 (2022). - -Recommended readings: -* O'Leary, Katie, and Makoto Uchida. "[Common problems with Creating Machine Learning Pipelines from Existing Code](https://research.google/pubs/pub48984.pdf)." Proc. Conference on Machine Learning and Systems (MLSys) (2020). - ----- - -# Learning Goals - - -* Deploy a service for models using container infrastructure -* Automate common configuration management tasks -* Devise a monitoring strategy and suggest suitable components for implementing it -* Diagnose common operations problems -* Understand the typical concerns and concepts of MLOps - - ---- -## Running Example: Blogging Platform with Spam Filter - - -![Screenshot from Substack](substack.png) - - ---- -# "Operations" - ----- -## Operations - - - -Provision and monitor the system in production, respond to problems - -Avoid downtime, scale with users, manage operating costs - -Heavy focus on infrastructure - -Traditionally sysadmin and hardware skills - - - -![SRE Book Cover](srebook.jpg) - - - - ----- -## Service Level Objectives - -Quality requirements in operations, such as -* maximum latency -* minimum system throughput -* targeted availability/error rate -* time to deploy an update -* durability for storage - -Each with typical measures - -For the system as a whole or individual services - ----- -## Example Service Level Objectives? - -![Screenshot from Substack](substack.png) - - - ----- -## Operators on a Team - -Operators cannot work in isolation - -Rely on developers for software quality and performance - -Negotiate service level agreements and budget (e.g., 99.9% vs 99.99% availability) - -Risk management role (not risk avoidance) - ----- -## Operations and ML - -ML has distinct workloads and hardware requirements - -Deep learning often pushes scale boundaries - -Regular updates or learning in production - - ----- -## Common Themes - -Observability is essential - -Release management and automated deployments - -Infrastructure as code and virtualization - -Scaling deployments - -Incident response planning - - - - ---- -# Dev vs. Ops - -![](devops_meme.jpg) - ----- -## Common Release Problems? - - - ----- -## Common Release Problems? - -![Screenshot from Substack](substack.png) - ----- -## Common Release Problems (Examples) - -* Missing dependencies -* Different compiler versions or library versions -* Different local utilities (e.g. unix grep vs mac grep) -* Database problems -* OS differences -* Too slow in real settings -* Difficult to roll back changes -* Source from many different repositories -* Obscure hardware? Cloud? Enough memory? - ----- - -## Developers - - -* Coding -* Testing, static analysis, reviews -* Continuous integration -* Bug tracking -* Running local tests and scalability experiments -* ... - - -## Operations - -* Allocating hardware resources -* Managing OS updates -* Monitoring performance -* Monitoring crashes -* Managing load spikes, … -* Tuning database performance -* Running distributed at scale -* Rolling back releases -* ... 
-
-
-
-QA responsibilities in both roles
-
-----
-
-## Quality Assurance does not stop in Dev
-
-* Ensuring product builds correctly (e.g., reproducible builds)
-* Ensuring scalability under real-world loads
-* Supporting environment constraints from real systems (hardware, software, OS)
-* Efficiency with given infrastructure
-* Monitoring (server, database, Dr. Watson, etc.)
-* Bottlenecks, crash-prone components, … (possibly thousands of crash reports per day/minute)
-
-
---
-# DevOps
-![DevOps Cycle](devops.png)
-
-----
-## Key ideas and principles
-
-* Better coordination between developers and operations (collaborative)
-* Key goal: Reduce friction bringing changes from development into production
-* Consider the *entire tool chain* from development into production (holistic)
-* Documentation and versioning of all dependencies and configurations ("configuration as code")
-* Heavy automation, e.g., continuous delivery, monitoring
-* Small iterations, incremental and continuous releases
-*
-* Buzz word!
-----
-![DevOps Cycle](devops.png)
-
-----
-## Common Practices
-
-All configurations in version control
-
-Test and deploy in containers
-
-Automated testing, testing, testing, ...
-
-Monitoring, orchestration, and automated actions in practice
-
-Microservice architectures
-
-Release frequently
-
-----
-## Heavy tooling and automation
-
-[![DevOps tooling overview](devops_tools.jpg)](devops_tools.jpg)
-
-----
-## Heavy tooling and automation -- Examples
-
-* Infrastructure as code — Ansible, Terraform, Puppet, Chef
-* CI/CD — Jenkins, TeamCity, GitLab, Shippable, Bamboo, Azure DevOps
-* Test automation — Selenium, Cucumber, Apache JMeter
-* Containerization — Docker, Rocket, Unik
-* Orchestration — Kubernetes, Swarm, Mesos
-* Software deployment — Elastic Beanstalk, Octopus, Vamp
-* Measurement — Datadog, DynaTrace, Kibana, NewRelic, ServiceNow
-
-
---
-# Continuous Delivery
-
-----
-## Manual Release Pipelines
-
-![Classic Release Pipeline](classicreleasepipeline.png)
-
-Source: https://www.slideshare.net/jmcgarr/continuous-delivery-at-netflix-and-beyond
-
-----
-
-## Continuous Integr.
-
-* Automate tests after commit
-* Independent test infrastructure
-
-## Continuous Delivery
-
-* Full automation from commit to deployable container
-* Heavy focus on testing, reproducibility, and rapid feedback; creates transparency
-
-## Continuous Deployment
-
-* Full automation from commit to deployment
-* Empower developers, quick to production
-* Encourage experimentation and fast incremental changes
-* Commonly integrated with monitoring and canary releases
-
-----
-## Automate Everything
-
-![CD vs CD](continuous_delivery.gif)
-----
-## Example: Facebook Tests for Mobile Apps
-
-* Unit tests (white box)
-* Static analysis (null pointer warnings, memory leaks, ...)
-* Build tests (compilation succeeds)
-* Snapshot tests (screenshot comparison, pixel by pixel)
-* Integration tests (black box, in simulators)
-* Performance tests (resource usage)
-* Capacity and conformance tests (custom)
-
-
-Further readings: Rossi, Chuck, Elisa Shibley, Shi Su, Kent Beck, Tony Savor, and Michael Stumm. [Continuous deployment of mobile software at facebook (showcase)](https://research.fb.com/wp-content/uploads/2017/02/fse-rossi.pdf). In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 12-23. ACM, 2016.
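-
-----
-## Example: A Minimal CI Workflow (Sketch)
-
-To make the automation concrete, a minimal sketch of a continuous integration workflow in GitHub Actions syntax; the file name, project layout, and commands are hypothetical, not taken from a specific project:
-
-```yml
-# .github/workflows/ci.yml (hypothetical file name and layout)
-name: CI
-on: [push, pull_request]
-jobs:
-  test:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4              # fetch the commit under test
-      - uses: actions/setup-python@v5
-        with:
-          python-version: "3.10"
-      - run: pip install -r requirements.txt   # install pinned dependencies
-      - run: python -m pytest tests/           # fail the build on any test failure
-```
-
-Continuous delivery would extend such a workflow with a job that builds and publishes a versioned container image; continuous deployment would additionally roll that image out to production automatically.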
- ----- -## Release Challenges for Mobile Apps - -* Large downloads -* Download time at user discretion -* Different versions in production -* Pull support for old releases? -* -* Server side releases silent and quick, consistent -* -* -> App as container, most content + layout from server - ----- -## Real-world pipelines are complex - -[![Facebook's Release Pipeline](facebookpipeline.png)](facebookpipeline.png) - - - - - - - - - - ---- - -# Containers and Configuration Management ----- -## Containers - - -* Lightweight virtual machine -* Contains entire runnable software, incl. all dependencies and configurations -* Used in development and production -* Sub-second launch time -* Explicit control over shared disks and network connections - - -![Docker logo](docker_logo.png) - - ----- -## Docker Example - -```docker -FROM ubuntu:latest -MAINTAINER ... -RUN apt-get update -y -RUN apt-get install -y python-pip python-dev build-essential -COPY . /app -WORKDIR /app -RUN pip install -r requirements.txt -ENTRYPOINT ["python"] -CMD ["app.py"] -``` - - -Source: http://containertutorials.com/docker-compose/flask-simple-app.html - ----- -## Common configuration management questions - -What runs where? - -How are machines connected? - -What (environment) parameters does software X require? - -How to update dependency X everywhere? - -How to scale service X? - ----- -## Ansible Examples - -* Software provisioning, configuration mgmt., and deployment tool -* Apply scripts to many servers - - -```ini -[webservers] -web1.company.org -web2.company.org -web3.company.org - -[dbservers] -db1.company.org -db2.company.org - -[replication_servers] -... -``` - -```yml -# This role deploys the mongod processes and sets up the replication set. -- name: create data directory for mongodb - file: path={{ mongodb_datadir_prefix }}/mongo-{{ inventory_hostname }} state=directory owner=mongod group=mongod - delegate_to: '{{ item }}' - with_items: groups.replication_servers - -- name: create log directory for mongodb - file: path=/var/log/mongo state=directory owner=mongod group=mongod - -- name: Create the mongodb startup file - template: src=mongod.j2 dest=/etc/init.d/mongod-{{ inventory_hostname }} mode=0655 - delegate_to: '{{ item }}' - with_items: groups.replication_servers - - -- name: Create the mongodb configuration file - template: src=mongod.conf.j2 dest=/etc/mongod-{{ inventory_hostname }}.conf - delegate_to: '{{ item }}' - with_items: groups.replication_servers - -- name: Copy the keyfile for authentication - copy: src=secret dest={{ mongodb_datadir_prefix }}/secret owner=mongod group=mongod mode=0400 - -- name: Start the mongodb service - command: creates=/var/lock/subsys/mongod-{{ inventory_hostname }} /etc/init.d/mongod-{{ inventory_hostname }} start - delegate_to: '{{ item }}' - with_items: groups.replication_servers - -- name: Create the file to initialize the mongod replica set - template: src=repset_init.j2 dest=/tmp/repset_init.js - -- name: Pause for a while - pause: seconds=20 - -- name: Initialize the replication set - shell: /usr/bin/mongo --port "{{ mongod_port }}" /tmp/repset_init.js -``` - - ----- -## Puppet Example - -Declarative specification, can be applied to many machines - -```puppet -$doc_root = "/var/www/example" - -exec { 'apt-get update': - command => '/usr/bin/apt-get update' -} - -package { 'apache2': - ensure => "installed", - require => Exec['apt-get update'] -} - -file { $doc_root: - ensure => "directory", - owner => "www-data", - group => "www-data", - mode => 644 -} - -file { 
"$doc_root/index.html": - ensure => "present", - source => "puppet:///modules/main/index.html", - require => File[$doc_root] -} - -file { "/etc/apache2/sites-available/000-default.conf": - ensure => "present", - content => template("main/vhost.erb"), - notify => Service['apache2'], - require => Package['apache2'] -} - -service { 'apache2': - ensure => running, - enable => true -} -``` - -Note: source: https://www.digitalocean.com/community/tutorials/configuration-management-101-writing-puppet-manifests - ----- -## Container Orchestration with Kubernetes - -Manages which container to deploy to which machine - -Launches and kills containers depending on load - -Manage updates and routing - -Automated restart, replacement, replication, scaling - -Kubernetis master controls many nodes - -*Substantial complexity and learning curve* - ----- - -![Kubernetes](Kubernetes.png) - - - -CC BY-SA 4.0 [Khtan66](https://en.wikipedia.org/wiki/Kubernetes#/media/File:Kubernetes.png) ----- -## Monitoring - -* Monitor server health -* Monitor service health -* Monitor telemetry (see past lecture) -* Collect and analyze measures or log files -* Dashboards and triggering automated decisions -* -* Many tools, e.g., Grafana as dashboard, Prometheus for metrics, Loki + ElasticSearch for logs -* Push and pull models - ----- - -![Hawkular Dashboard](https://www.hawkular.org/img/hawkular-apm/components.png) - - -https://www.hawkular.org/hawkular-apm/ - - - - ---- -## The DevOps Mindset - -* Consider the entire process and tool chain holistically -* Automation, automation, automation -* Elastic infrastructure -* Document, test, and version everything -* Iterate and release frequently -* Emphasize observability -* Shared goals and responsibilities - - - - - ---- -![MLOps](https://ml-ops.org/img/mlops-loop-banner.jpg) - - - -https://ml-ops.org/ - ----- -## On Terminology - -* Many vague buzzwords, often not clearly defined -* **MLOps:** Collaboration and communication between data scientists and operators, e.g., - - Automate model deployment - - Model training and versioning infrastructure - - Model deployment and monitoring -* **AIOps:** Using AI/ML to make operations decision, e.g. in a data center -* **DataOps:** Data analytics, often business setting and reporting - - Infrastructure to collect data (ETL) and support reporting - - Combines agile, DevOps, Lean Manufacturing ideas - - -![Random letters](../_assets/onterminology.jpg) - ----- -## MLOps Overview - -Integrate ML artifacts into software release process, unify process (i.e., DevOps extension) - -Automated data and model validation (continuous deployment) - - -Continuous deployment for ML models: from experimenting in notebooks to quick feedback in production - -Versioning of models and datasets (more later) - -Monitoring in production (discussed earlier) - - - -Further reading: [MLOps principles -](https://ml-ops.org/content/mlops-principles.html) - ----- -## Tooling Landscape LF AI - -[![LF AI Landscape](lfai-landscape.png)](https://landscape.lfai.foundation/) - - -Linux Foundation AI Initiative - - ----- -## MLOps Goals and Principles - -Like DevOps: Automation, testing, holistic, observability, teamwork - -Supporting frequent experimentation, rapid prototyping, and constant iteration - -3V: Velocity, Validation, Versioning - - - ----- -## MLOps Tools -- Examples - -* Model registry, versioning and metadata: MLFlow, Neptune, ModelDB, WandB, ... 
-----
-## MLOps Tools -- Examples
-
-* Model registry, versioning and metadata: MLflow, Neptune, ModelDB, WandB, ...
-* Model monitoring: Fiddler, Hydrosphere
-* Data pipeline automation and workflows: DVC, Kubeflow, Airflow
-* Model packaging and deployment: BentoML, Cortex
-* Distributed learning and deployment: Dask, Ray, ...
-* Feature store: Feast, Tecton
-* Integrated platforms: SageMaker, Valohai, ...
-* Data validation: Cerberus, Great Expectations, ...
-
-Long list: https://github.com/kelvins/awesome-mlops
-
-
-----
-## MLOps Common Goals
-
-Enable experimentation with data and models, small incremental changes; hide complexity from data scientists
-
-Automate (nuanced) model validation (like CI) and integrate with testing in production (monitoring)
-
-Dynamic view of constantly evolving training and test data; invest in data validation
-
-Version data, models; track experiment results
-
-
-----
-## Recall: DevOps Mindset
-
-* Consider the entire process and tool chain holistically
-* Automation, automation, automation
-* Elastic infrastructure
-* Document, test, and version everything
-* Iterate and release frequently
-* Emphasize observability
-* Shared goals and responsibilities
-
-----
-## Breakout: MLOps Goals
-
-For the blog spam filter scenario, consider DevOps and MLOps infrastructure (CI, CD, containers, config. mgmt, monitoring, model registry, pipeline automation, feature store, data validation, ...)
-
-As a group, tagging group members, post to `#lecture`:
-> * Which DevOps or MLOps goals to prioritize?
-> * Which tools to try?
-
-
---
-# Incident Response Planning
-
-----
-## Mistakes will Happen. Be Prepared
-
-Even with careful anticipation and mitigation, mistakes will happen
-
-Anticipated or not
-
-ML as an unreliable component raises risks
-
-
-Design mitigations help avoid anticipated mistakes
-
-Incident response plan prepares for unanticipated or unmitigated mistakes
-
-----
-## Incident Response Plan
-
-* Provide contact channel for problem reports
-* Have expert on call
-* Design process for anticipated problems, e.g., rollback, reboot, takedown
-* Prepare for recovery
-* Proactively collect telemetry
-* Investigate incidents
-* Plan public communication (responsibilities)
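-
-----
-## Example: Automated Alerting (Sketch)
-
-Proactively collected telemetry only helps if someone is notified in time. A minimal sketch of an automated page for the on-call expert, written as a Prometheus alerting rule (the metric names and thresholds are hypothetical):
-
-```yml
-# alert-rules.yml (hypothetical metrics for the spam filter service)
-groups:
-  - name: spam-filter
-    rules:
-      - alert: SpamFilterHighErrorRate
-        # fraction of failing requests over the last 5 minutes
-        expr: rate(spam_filter_errors_total[5m]) / rate(spam_filter_requests_total[5m]) > 0.05
-        for: 10m               # fire only if sustained for 10 minutes
-        labels:
-          severity: page
-        annotations:
-          summary: "Spam filter error rate above 5% for 10 minutes"
-```
-
-Such a rule connects the monitoring infrastructure to the incident response plan: the contact channel and the on-call rotation decide who receives the page.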
-----
-## Incident Resp. Plan for Blog's Spam Filter?
-
-![Screenshot from Substack](substack.png)
-
-
---
-# Excursion: Organizational Culture
-
-![Book Cover: Organizational Culture and Leadership](orgculture.jpg)
-
-----
-## Organizational Culture
-
-*“this is how we always did things”*
-
-Implicit and explicit assumptions and rules guiding behavior
-
-Often grounded in history, very difficult to change
-
-Examples:
-* Move fast and break things
-* Privacy first
-* Development opportunities for all employees
-
-----
-![Org chart comic](orgchart.png)
-
-
-Source: Bonkers World
-
-----
-## Organizational Culture
-
-![Screenshot from Substack](substack.png)
-
-----
-## Levels of Organizational Culture
-
-
-Artifacts -- What we see
-* Behaviors, systems, processes, policies
-
-Espoused Values -- What we say
-* Ideals, goals, values, aspirations
-
-Basic assumptions -- What we believe
-* Underlying assumptions, "old ways of doing things", unconsciously taken for granted
-
-
-Iceberg model: Only artifacts and espoused values are visible; practices are driven by invisible basic assumptions
-
-----
-## Culture Change
-
-Changing organizational culture is very difficult
-
-Top down: espoused values, management buy-in, incentives
-
-Bottom up: activism, show value, spread
-
-
-**Examples of success or failure stories?**
-
-----
-## MLOps Culture
-
-Dev with Ops instead of Dev vs Ops
-
-A culture of collaboration, joint goals, joint responsibilities
-
-Artifacts: Joint tools, processes
-
-Underlying assumptions: Devs provide production-ready code; Ops focus on value; automation is good; observability is important, ...
-
-----
-## Resistance to DevOps Culture?
-
-From "us vs them" to blameless culture -- How?
-
-Introduction of new tools and processes -- Disruptive? Costly? Competing with current tasks? Who wants to write tests?
-
-Future benefits from rapid feedback and telemetry -- Unrealistic?
-
-Automation and shifting responsibilities -- Hiring freeze and layoffs?
-
-Past experience with poor adoption -- All costs, no benefits? Compliance only?
-
-
-----
-## Successful DevOps Adoption
-
-Need supportive management; typically driven by advocacy of individuals, convincing colleagues and management
-
-Education to generate buy-in
-
-Experts and consultants can help with the initial costly transition
-
-Demonstrate benefits on a small project, promote afterward
-
-Focus on key bottlenecks over perfect adoption (e.g., prioritize experimentation, test automation, rapid feedback with telemetry)
-
-
-Luz, Welder Pinheiro, Gustavo Pinto, and Rodrigo Bonifácio. “[Adopting DevOps in the real world: A theory, a model, and a case study](http://gustavopinto.org/lost+found/jss2019.pdf).” Journal of Systems and Software 157 (2019): 110384.
-
---
-# Summary
-
-* Plan for change, plan for operations
-* Operations requirements: service level objectives
-* DevOps integrates development and operations tasks with joint goals and tools
-  * Heavy automation
-  * Continuous integration and continuous delivery
-  * Containers and configuration management
-  * Monitoring
-* MLOps extends this to operating pipelines and deploying models
-* Organizational culture is slow and difficult to change
-
-----
-## Further Reading
-
- -* Shankar, Shreya, Rolando Garcia, Joseph M. Hellerstein, and Aditya G. Parameswaran. "[Operationalizing machine learning: An interview study](https://arxiv.org/abs/2209.09125)." arXiv preprint arXiv:2209.09125 (2022). -* https://ml-ops.org/ -* Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. [Site reliability engineering: How Google runs production systems](https://sre.google/sre-book/table-of-contents/). O’Reilly, 2016. -* Kim, Gene, Jez Humble, Patrick Debois, John Willis, and Nicole Forsgren. [The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations](https://bookshop.org/books/the-devops-handbook-how-to-create-world-class-agility-reliability-security-in-technology-organizations/9781950508402). IT Revolution, 2nd ed, 2021. -* Treveil, Mark, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki, and Lynn Heidmann. [Introducing MLOps: How to Scale Machine Learning in the Enterprise](https://bookshop.org/books/introducing-mlops-how-to-scale-machine-learning-in-the-enterprise/9781492083290). O’Reilly, 2020. -* Luz, Welder Pinheiro, Gustavo Pinto, and Rodrigo Bonifácio. “[Adopting DevOps in the real world: A theory, a model, and a case study](http://gustavopinto.org/lost+found/jss2019.pdf).” Journal of Systems and Software 157 (2019): 110384. -* Schein, Edgar H. *Organizational culture and leadership*. 5th ed. John Wiley & Sons, 2016. -
\ No newline at end of file diff --git a/lectures/14_operations/orgchart.png b/lectures/14_operations/orgchart.png deleted file mode 100644 index 6df71aa3..00000000 Binary files a/lectures/14_operations/orgchart.png and /dev/null differ diff --git a/lectures/14_operations/orgculture.jpg b/lectures/14_operations/orgculture.jpg deleted file mode 100644 index 71b95468..00000000 Binary files a/lectures/14_operations/orgculture.jpg and /dev/null differ diff --git a/lectures/14_operations/srebook.jpg b/lectures/14_operations/srebook.jpg deleted file mode 100644 index cc6e1f24..00000000 Binary files a/lectures/14_operations/srebook.jpg and /dev/null differ diff --git a/lectures/14_operations/substack.png b/lectures/14_operations/substack.png deleted file mode 100644 index 276871f3..00000000 Binary files a/lectures/14_operations/substack.png and /dev/null differ diff --git a/lectures/15_process/accuracy-improvements.png b/lectures/15_process/accuracy-improvements.png deleted file mode 100644 index 455cb820..00000000 Binary files a/lectures/15_process/accuracy-improvements.png and /dev/null differ diff --git a/lectures/15_process/combinedprocess1.png b/lectures/15_process/combinedprocess1.png deleted file mode 100644 index e240816d..00000000 Binary files a/lectures/15_process/combinedprocess1.png and /dev/null differ diff --git a/lectures/15_process/combinedprocess2.png b/lectures/15_process/combinedprocess2.png deleted file mode 100644 index 61872865..00000000 Binary files a/lectures/15_process/combinedprocess2.png and /dev/null differ diff --git a/lectures/15_process/combinedprocess5.svg b/lectures/15_process/combinedprocess5.svg deleted file mode 100644 index 25207d46..00000000 --- a/lectures/15_process/combinedprocess5.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/15_process/components.png b/lectures/15_process/components.png deleted file mode 100644 index ebe25114..00000000 Binary files a/lectures/15_process/components.png and /dev/null differ diff --git a/lectures/15_process/crispdm.png b/lectures/15_process/crispdm.png deleted file mode 100644 index bf0e4547..00000000 Binary files a/lectures/15_process/crispdm.png and /dev/null differ diff --git a/lectures/15_process/data-science-process.jpg b/lectures/15_process/data-science-process.jpg deleted file mode 100644 index aca8d96f..00000000 Binary files a/lectures/15_process/data-science-process.jpg and /dev/null differ diff --git a/lectures/15_process/debt.png b/lectures/15_process/debt.png deleted file mode 100644 index 47dbe67a..00000000 Binary files a/lectures/15_process/debt.png and /dev/null differ diff --git a/lectures/15_process/defectcost.jpg b/lectures/15_process/defectcost.jpg deleted file mode 100644 index f6dc5588..00000000 Binary files a/lectures/15_process/defectcost.jpg and /dev/null differ diff --git a/lectures/15_process/developers-processes.jpeg b/lectures/15_process/developers-processes.jpeg deleted file mode 100644 index 3628a947..00000000 Binary files a/lectures/15_process/developers-processes.jpeg and /dev/null differ diff --git a/lectures/15_process/dodprocess.jpg b/lectures/15_process/dodprocess.jpg deleted file mode 100644 index 08860d39..00000000 Binary files a/lectures/15_process/dodprocess.jpg and /dev/null differ diff --git a/lectures/15_process/facebook1.jpeg b/lectures/15_process/facebook1.jpeg deleted file mode 100644 index cf395b73..00000000 Binary files a/lectures/15_process/facebook1.jpeg and /dev/null differ diff --git a/lectures/15_process/facebook2.jpeg 
b/lectures/15_process/facebook2.jpeg deleted file mode 100644 index 5245a62f..00000000 Binary files a/lectures/15_process/facebook2.jpeg and /dev/null differ diff --git a/lectures/15_process/healthcare.gov-crash.png b/lectures/15_process/healthcare.gov-crash.png deleted file mode 100644 index f382e769..00000000 Binary files a/lectures/15_process/healthcare.gov-crash.png and /dev/null differ diff --git a/lectures/15_process/notebook-example.png b/lectures/15_process/notebook-example.png deleted file mode 100644 index 2b614ce0..00000000 Binary files a/lectures/15_process/notebook-example.png and /dev/null differ diff --git a/lectures/15_process/process.md b/lectures/15_process/process.md deleted file mode 100644 index f31a2cd3..00000000 --- a/lectures/15_process/process.md +++ /dev/null @@ -1,801 +0,0 @@ ---- -author: Christian Kaestner -title: "MLiP: Process and Technical Debt" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - -# Process and Technical Debt - - - - ---- -## Process... - -![Overview of course content](../_assets/overview.svg) - - - ----- - -## Readings - -
- -Required Reading: -* Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. "[Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)." In Advances in neural information processing systems, pp. 2503-2511. 2015. - -Suggested Readings: -* Fowler and Highsmith. [The Agile Manifesto](http://agilemanifesto.org/) -* Steve McConnell. Software project survival guide. Chapter 3 -* Kruchten, Philippe, Robert L. Nord, and Ipek Ozkaya. "[Technical debt: From metaphor to theory and practice](https://resources.sei.cmu.edu/asset_files/WhitePaper/2012_019_001_58818.pdf)." IEEE Software 29, no. 6 (2012): 18-21. - -
- ----- - -## Learning Goals - -
- - -* Overview of common data science workflows (e.g., CRISP-DM) - * Importance of iteration and experimentation - * Role of computational notebooks in supporting data science workflows -* Overview of software engineering processes and lifecycles: costs and benefits of process, common process models, role of iteration and experimentation -* Contrasting data science and software engineering processes, goals and conflicts -* Integrating data science and software engineering workflows in process model for engineering AI-enabled systems with ML and non-ML components; contrasting different kinds of AI-enabled systems with data science trajectories -* Overview of technical debt as metaphor for process management; common sources of technical debt in AI-enabled systems - -
-
---
-## Case Study: Real-Estate Website
-
-![Zillow front page](zillow_main.png)
-
-----
-## ML Component: Predicting Real Estate Value
-
-Given a large database of house sales and statistical/demographic data from public records, predict the sales price of a house.
-
-
-$f(size, rooms, tax, neighborhood, ...) \rightarrow price$
-
-
-![Zillow estimates](zillow.png)
-
-----
-## What's your process?
-
-**Q. What steps would you take to build this component?**
-
-----
-## Exploratory Questions
-
-* What exactly are we trying to model and predict?
-* What types of data do we need?
-* What type of model works best for this problem?
-* What are the right metrics to evaluate the model performance?
-* What is the user actually interested in seeing?
-* Will this product actually help with the organizational goals?
-* ...
-
---
-# Data Science: Iteration and Exploration
-
-----
-## Data Science is Iterative and Exploratory
-
-![Data Science Workflow](data-science-process.jpg)
-
-
-Source: Guo. "[Data Science Workflow: Overview and Challenges](https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext)." Blog@CACM, Oct 2013
-
-----
-## Data Science is Iterative and Exploratory
-
-![CRISP-DM](crispdm.png)
-
-
-Martínez-Plumed et al. "[CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories](https://research-information.bris.ac.uk/files/220614618/TKDE_Data_Science_Trajectories_PF.pdf)." IEEE Transactions on Knowledge and Data Engineering (2019).
-
-----
-## Data Science is Iterative and Exploratory
-
-[![Data Science Lifecycle](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/media/overview/tdsp-lifecycle2.png)](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/media/overview/tdsp-lifecycle2.png)
-
-
-Microsoft Azure Team, "[What is the Team Data Science Process?](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview)" Microsoft Doc., Jan 2020
-
-----
-## Data Science is Iterative and Exploratory
-
-[![Experimental results showing incremental accuracy improvement](accuracy-improvements.png)](accuracy-improvements.png)
-
-
-Source: Patel, Kayur, James Fogarty, James A. Landay, and Beverly Harrison. "[Investigating statistical machine learning as a tool for software development](http://www.kayur.org/papers/chi2008.pdf)." In Proc. CHI, 2008.
-
-Notes:
-This figure shows the result from a controlled experiment in which participants had 2 sessions of 2h each to build a model. Whenever the participants evaluated a model in the process, the accuracy was recorded. These plots show the accuracy improvements over time, showing how data scientists make incremental improvements through frequent iteration.
-
-----
-## Data Science is Iterative and Exploratory
-
-
-Science mindset: start with a rough goal, no clear specification, unclear whether possible
-
-Heuristics and experience to guide the process
-
-Trial and error, refine iteratively, hypothesis testing
-
-Go back to data collection and cleaning if needed, revise goals
-
-----
-## Share Experience?
-
-
-----
-## Different Trajectories
-
-![CRISP-DM](crispdm.png)
-
-
-Martínez-Plumed et al. "[CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories](https://research-information.bris.ac.uk/files/220614618/TKDE_Data_Science_Trajectories_PF.pdf)." IEEE Transactions on Knowledge and Data Engineering (2019).
- ----- -## Different Trajectories - -![Example Trajectories](trajectories.png) - - - -From: Martínez-Plumed et al. "[CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories](https://research-information.bris.ac.uk/files/220614618/TKDE_Data_Science_Trajectories_PF.pdf)." IEEE Transactions on Knowledge and Data Engineering (2019). - -Notes: - -* A product to recommend trips connecting tourist attractions in a town may be based on location tracking data collected by navigation and mapping apps. To build such a project, one might start with a concrete goal in mind and explore whether enough user location history data is available or can be acquired. One would then go through traditional data preparation and modeling stages before exploring how to best present the results to users. -* An insurance company tries to improve their model to score the risk of drivers based on their behavior and sensors in their cars. Here an existing product is to be refined and a better understanding of the business case is needed before diving into the data exploration and modeling. The team might spend significant time in exploring new data sources that may provide new insights and may debate the cost and benefits of this data or data gathering strategy (e.g., installing sensors in customer cars). -* A credit card company may want to sell data about what kind of products different people (nationalities) tend to buy at different times and days in different locations to other companies (retailers, restaurants). They may explore existing data without yet knowing what kind of data may be of interest to what kind of customers. They may actively search for interesting narratives in the data, posing questions such as “Ever wondered when the French buy their food?” or “Which places the Germans flock to on their holidays?” in promotional material. - - ----- -## Computational Notebooks - - - -
- -* Origins in "literate programming", interleaving text and code, treating programs as literature (Knuth'84) -* First notebook in Wolfram Mathematica 1.0 in 1988 -* Document with text and code cells, showing execution results under cells -* Code of cells is executed, per cell, in a kernel -* Many notebook implementations and supported languages, Python + Jupyter currently most popular - -
-
-
-
-![Notebook example](notebook-example.png)
-
-
-
-Notes:
-* See also https://en.wikipedia.org/wiki/Literate_programming
-* Demo with public notebook, e.g., https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb
-
-----
-## Notebooks Support Iteration and Exploration
-
-Quick feedback, similar to REPL
-
-Visual feedback including figures and tables
-
-Incremental computation: reexecuting individual cells
-
-Quick and easy: copy-paste, no abstraction needed
-
-Easy to share: document includes text, code, and results
-
-----
-## Brief Discussion: Notebook Limitations and Drawbacks?
-
-
-
---
-
-# Software Engineering Process
-
-
-----
-
-## Software Process
-
-> “The set of activities and associated results that produce a software product”
-
-> A structured, systematic way of carrying out these activities
-
-**Q. Examples?**
-
-Notes:
-
-Writing down all requirements
-Require approval for all changes to requirements
-Use version control for all changes
-Track all reported bugs
-Review requirements and code
-Break down development into smaller tasks and schedule and monitor them
-Planning and conducting quality assurance
-Have daily status meetings
-Use Docker containers to push code between developers and operations
-
-----
-## Developers dislike processes
-
-![developers](developers-processes.jpeg)
-
-----
-![facebook1](facebook1.jpeg)
-
-----
-
-![facebook2](facebook2.jpeg)
-----
-## Developers' view of processes
-
-![DOD Acquisition Process Chart](dodprocess.jpg)
-
-Notes: Complicated processes like these are often what people associate with "process". Software process is needed, but does not need to be complicated.
-
-----
-## What developers want
-
-![full](process1.png)
-
-Notes: Visualization following McConnell, Steve. Software project survival guide. Pearson Education, 1998.
-
-----
-## What developers want
-
-![full](process2.png)
-
-Notes: Idea: spend most of the time on coding, accept a little rework
-
-----
-## What developers think of processes
-
-![full](process3.png)
-
-Notes: Negative view of process: pure overhead, reduces productive work, limits creativity
-
-----
-## What eventually happens anyway
-
-![full](process4.png)
-
-Notes: Real experience if little attention is paid to process: increasingly complicated, increasing rework; attempts to rescue by introducing process
-
-----
-## Survival Mode
-
-Missed deadlines -> "solo development mode" to meet own deadlines
-
-Ignore integration work
-
-Stop interacting with testers, technical writers, managers, ...
-
--> Results in further project delays, added costs, poor product quality...
-
-
-McConnell, Steve. Software project survival guide. Pearson Education, 1998.
-
-----
-## Example of Process Problems?
-
-
-Notes:
-Collect examples of what could go wrong:
-
-Change Control: Mid-project informal agreement to changes suggested by customer or manager. Project scope expands 25-50%
-Quality Assurance: Late detection of requirements and design issues. Test-debug-reimplement cycle limits development of new features. Release with known defects.
-Defect Tracking: Bug reports collected informally, forgotten
-System Integration: Integration of independently developed components at the very end of the project. Interfaces out of sync.
-Source Code Control: Accidentally overwritten changes, lost work.
-Scheduling: When project is behind, developers are asked weekly for new estimates.
-
-
-----
-## Example: Healthcare.gov
-
-![healthcare.gov](healthcare.gov-crash.png)
-
-* Launched Oct 2013; high demand (5x expected) causes site crash
-* UI incomplete (e.g., missing drop-down menu); missing/incomplete insurance data; log-in system also crashed for IT technicians
-* On 1st day, 6 users managed to register
-* Initial budget: 93.7M USD; Final cost: 1.7B USD
-
-----
-## Example: Healthcare.gov
-
-* Lack of experience: _"...and project managers had little knowledge on the amount of work required and typical product development processes"_
-* Lack of leadership: _"...no formal division of responsibilities in place...a lack of communication when key decisions were made"_
-* Schedule pressure: _"...employees were pressured to launch on time regardless of completion or the amount (and results) of testing"_
-
-[The Failed Launch Of www.HealthCare.gov](https://d3.harvard.edu/platform-rctom/submission/the-failed-launch-of-www-healthcare-gov/)
-
-
-----
-*Hypothesis: Process increases flexibility and efficiency + Upfront investment for later greater returns*
-
-![](process5.png)
-
-Notes: ideal setting of little process investment upfront
-
-----
-![Chart showing that the longer bugs remain undetected, the more expensive they are to fix](defectcost.jpg)
-
-Notes: Empirically well established rule: Bugs are increasingly expensive to fix the larger the distance between the phase where they are created vs where they are corrected.
-
-
---
-# Software Process Models
-
-----
-
-## Ad-hoc Processes
-
-1. Discuss the software that needs to be written
-2. Write some code
-3. Test the code to identify the defects
-4. Debug to find causes of defects
-5. Fix the defects
-6. If not done, return to step 1
-
-----
-## Waterfall Model
-
-![Waterfall model](waterfall.svg)
-
-Understand requirements, plan & design before coding, test & deploy
-
-Notes: Although dated, the key idea is still essential -- think and plan before implementing. Not all requirements and design can be made upfront, but planning is usually helpful.
-
-----
-## Problems with Waterfall?
-
-
-----
-## Risk First: Spiral Model
-
-![Spiral model](spiral_model.svg)
-
-Incremental prototypes, starting with most risky components
-
-----
-## Constant iteration: Agile
-
-![Scrum Process](scrum.svg)
-
-* Constant interactions with customers, constant replanning
-* Scrum: Break into _sprints_; daily meetings, sprint reviews, planning
-
-(Image CC BY-SA 4.0, Lakeworks)
-
-----
-## Selecting Process Models
-
-Individually, vote in Slack:
-[1] Ad-hoc
-[2] Waterfall
-[3] Spiral
-[4] Agile
-
-and write a short justification in `#lecture`
-
-![Zillow](zillow_main.png)
-
-
---
-# Data Science vs Software Engineering
-
-----
-## Discussion: Iteration in Notebook vs Agile?
-
-[![Experimental results showing incremental accuracy improvement](accuracy-improvements.png)](accuracy-improvements.png)
-
-![Scrum Process](scrum.svg)
-
-(CC BY-SA 4.0, Lakeworks)
-
-Note: There is similarity in that there is an iterative process, but the idea is different and the process model seems mostly orthogonal to iteration in data science. The spiral model prioritizes risk, especially when it is not clear whether a model is feasible. One can do similar things in model development, seeing whether it is feasible with the data at hand at all and building an early prototype, but it is not clear that an initial okay model can be improved incrementally into a great one later.
-Agile can work with vague and changing requirements, but that again seems to be a rather orthogonal concern. Requirements on the product are not so much unclear or changing (the goal is often clear), but it's not clear whether and how a model can solve it.
-
-----
-## Poor Software Engineering Practices in Notebooks?
-
-
-![Notebook](notebook-example.png)
-
-*
-* Little abstraction
-* Global state
-* No testing
-* Heavy copy and paste
-* Little documentation
-* Poor version control
-* Out of order execution
-* Poor development features (vs IDE)
-
-
-----
-## Understanding Data Scientist Workflows
-
-Instead of blindly recommending "SE best practices", understand the context
-
-Documentation and testing are not a priority in the exploratory phase
-
-Help with transitioning into practice:
-* From notebooks to pipelines
-* Support maintenance and iteration once deployed
-* Provide infrastructure and tools
-
-----
-## Data Science Practices by Software Eng.
-
- -* Many software engineers get involved in data science without explicit training -* Copying from public examples, little reading of documentation -* Lack of data visualization/exploration/understanding, no focus on data quality -* Strong preference for code editors, non-GUI tools -* Improve model by adding more data or changing models, rarely feature engineering or debugging -* Lack of awareness about overfitting/bias problems, single focus on accuracy, no monitoring -* More system thinking about the product and its needs - -
- - - -Yang, Qian, Jina Suh, Nan-Chen Chen, and Gonzalo Ramos. "[Grounding interactive machine learning tool design in how non-experts actually build models](http://www.audentia-gestion.fr/MICROSOFT/Machine_Teaching_DIS_18.pdf)." In *Proceedings of the 2018 Designing Interactive Systems Conference*, pp. 573-584. 2018. - ----- - - - - - Data - Scientists - Software - Engineers - - - ---- -# Integrated Process for AI-Enabled Systems - ----- -![Components](components.png) - - - -Figure from Dogru, Ali H., and Murat M. Tanik. “A process model for component-oriented software engineering.” IEEE Software 20, no. 2 (2003): 34–41. - ----- -![Combined process](combinedprocess1.png) - - ----- -## Recall: ML models are system components - -![Architecture example](transcriptionarchitecture2.svg) - - ----- -![Combined process](combinedprocess1.png) - - ----- -![Combined process](combinedprocess2.png) - - ----- -![Combined process](combinedprocess5.svg) - - ----- -## Process for AI-Enabled Systems - -
-
-
-* Integrate Software Engineering and Data Science processes
-* Establish system-level requirements (e.g., user needs, safety, fairness)
-* Inform data science modeling with system requirements (e.g., privacy, fairness)
-* Try risky parts first (these most likely include the ML components; ~spiral)
-* Incrementally develop prototypes, incorporate user feedback (~agile)
-* Provide flexibility to iterate and improve
-* Design the system around the characteristics of the AI components (e.g., UI design, safeguards)
-* Plan for testing throughout the process and in production
-* Manage the project with an understanding of both software engineering and data science workflows
-*
-* __No existing "best practices" or workflow models__
-
-
-----
-## Trajectories
-
-Not every project follows the same development process, e.g.,
-* Small ML addition: Product first, add ML feature later
-* Research only: Explore feasibility before thinking about a product
-* Data science first: Model as central component of potential product, build system around it
-
-Different focus on system requirements, qualities, and upfront planning
-
-Manage interdisciplinary teams and different expectations
-
-
---
-# Technical debt
-
-
-[![](debt.png)](https://www.monkeyuser.com/2018/tech-debt/)
-
-
-----
-## Technical Debt Metaphor
-
-Analogy to financial debt
-+ Make a decision for an immediate benefit (e.g., release now)
-+ Accept a later cost (loss of productivity, higher maintenance and operating cost, rework)
-+ Debt accumulates and can suffocate a project
-
-Ideally, a deliberate decision (short-term tactical or long-term strategic)
-
-Ideally, track debt and plan for paying it down later
-
-**Q. Examples?**
-
-----
-![Technical Debt Quadrant](techDebtQuadrant.png)
-
-Source: Martin Fowler 2009, https://martinfowler.com/bliki/TechnicalDebtQuadrant.html
-
-----
-## Technical Debt: Examples
-
-Prudent & deliberate: Skip using a CI platform
-* Reason for debt: Short deadline; test the product viability with alpha users using a prototype
-* Debt payback: Refactoring effort to integrate the system into CI
-
-Reckless & inadvertent: Forget to encrypt user credentials in DB
-* Reason for debt: Lack of in-house security expertise
-* Debt payback: Security vulnerabilities & fallout from an attack (loss of data); effort to retrofit security into the system
-
-----
-## Breakout: Technical Debt from ML
-
-As a group in `#lecture`, tagging members: Post two plausible examples of technical debt in the housing price prediction system:
- 1. Deliberate, prudent:
- 2. Reckless, inadvertent:
-
-![Zillow](zillow_main.png)
-
-
-Sculley, David, et al. [Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf). Advances in Neural Information Processing Systems. 2015.
-
-----
-
-## Technical Debt through Notebooks?
-
-> Jupyter Notebooks are a gift from God to those who work with data. They allow us to do quick experiments with Julia, Python, R, and more -- [John Paul Ada](https://towardsdatascience.com/no-hassle-machine-learning-experiments-with-azure-notebooks-e1a22e8782c3)
-
-![](notebook-example.png)
-
-
-Notes: Discuss benefits and drawbacks of Jupyter-style notebooks
-
-----
-
-## ML and Technical Debt
-
-**Often reckless and inadvertent in inexperienced teams**
-
-ML can seem like an easy addition, but it may cause long-term costs
-
-Needs to be maintained, evolved, and debugged
-
-Goals may change, environment may change, some changes are subtle
-
-----
-## Example problems: ML and Technical Debt
-
-- Systems and models are tangled, and changing one has cascading effects on the other
-- Untested, brittle infrastructure; manual deployment
-- Unstable data dependencies, replication crisis
-- Data drift and feedback loops
-- Magic constants and dead experimental code paths
-
-
-Further reading: Sculley, David, et al. [Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf). Advances in Neural Information Processing Systems. 2015.
-
-
-----
-## Controlling Technical Debt from ML Components
-
-
-
-----
-## Controlling Technical Debt from ML Components
-
- -* Avoid AI when not needed -* Understand and document requirements, design for mistakes -* Build reliable and maintainable pipelines, infrastructure, good engineering practices -* Test infrastructure, system testing, testing and monitoring in production -* Test and monitor data quality -* Understand and model data dependencies, feedback loops, ... -* Document design intent and system architecture -* Strong interdisciplinary teams with joint responsibilities -* Document and track technical debt -* ... - -
-
-----
-![Zillow](zillow_main.png)
-
-
---
-# Summary
-
-Data scientists and software engineers follow different processes
-
-ML projects need to consider the process needs of both
-
-Iteration and upfront planning are both important; process models codify good practices
-
-Deliberate technical debt can be good, but too much debt can suffocate a project
-
-Easy to accumulate (reckless) technical debt with machine learning
-
---
-## Further Reading
-
- -* 🗎 Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. "[Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)." In Advances in neural information processing systems, pp. 2503-2511. 2015. -* 🗎 Studer, Stefan, Thanh Binh Bui, Christian Drescher, Alexander Hanuschkin, Ludwig Winkler, Steven Peters, and Klaus-Robert Mueller. "[Towards CRISP-ML (Q): A Machine Learning Process Model with Quality Assurance Methodology](https://arxiv.org/abs/2003.05155)." arXiv preprint arXiv:2003.05155 (2020). -* 🗎 Martínez-Plumed, Fernando, et al. "[CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories](https://research-information.bris.ac.uk/files/220614618/TKDE_Data_Science_Trajectories_PF.pdf)." IEEE Transactions on Knowledge and Data Engineering (2019). -* 📰 Kaestner, Christian. [On the process for building software with ML components](https://ckaestne.medium.com/on-the-process-for-building-software-with-ml-components-c54bdb86db24). Blog Post, 2020 - -
- - ----- -## Further Reading 2 - -
- -* 🗎 Patel, Kayur, James Fogarty, James A. Landay, and Beverly Harrison. "[Investigating statistical machine learning as a tool for software development](http://www.kayur.org/papers/chi2008.pdf)." In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 667-676. 2008. -* 🗎 Yang, Qian, Jina Suh, Nan-Chen Chen, and Gonzalo Ramos. "[Grounding interactive machine learning tool design in how non-experts actually build models](http://www.audentia-gestion.fr/MICROSOFT/Machine_Teaching_DIS_18.pdf)." In *Proceedings of the 2018 Designing Interactive Systems Conference*, pp. 573-584. 2018. -* 📰 Fowler and Highsmith. [The Agile Manifesto](http://agilemanifesto.org/) -* 🕮 Steve McConnell. Software project survival guide. Chapter 3 -* 🕮 Pfleeger and Atlee. Software Engineering: Theory and Practice. Chapter 2 -* 🗎 Kruchten, Philippe, Robert L. Nord, and Ipek Ozkaya. "[Technical debt: From metaphor to theory and practice](https://resources.sei.cmu.edu/asset_files/WhitePaper/2012_019_001_58818.pdf)." IEEE Software 29, no. 6 (2012): 18-21. -
diff --git a/lectures/15_process/process1.png b/lectures/15_process/process1.png deleted file mode 100644 index 17088cd0..00000000 Binary files a/lectures/15_process/process1.png and /dev/null differ
diff --git a/lectures/15_process/process2.png b/lectures/15_process/process2.png deleted file mode 100644 index 93281149..00000000 Binary files a/lectures/15_process/process2.png and /dev/null differ
diff --git a/lectures/15_process/process3.png b/lectures/15_process/process3.png deleted file mode 100644 index 47aa2950..00000000 Binary files a/lectures/15_process/process3.png and /dev/null differ
diff --git a/lectures/15_process/process4.png b/lectures/15_process/process4.png deleted file mode 100644 index 4113b23d..00000000 Binary files a/lectures/15_process/process4.png and /dev/null differ
diff --git a/lectures/15_process/process5.png b/lectures/15_process/process5.png deleted file mode 100644 index df824f31..00000000 Binary files a/lectures/15_process/process5.png and /dev/null differ
diff --git a/lectures/15_process/scrum.svg b/lectures/15_process/scrum.svg deleted file mode 100644 index a8149ac1..00000000 --- a/lectures/15_process/scrum.svg +++ /dev/null @@ -1,383 +0,0 @@
-[SVG markup elided in extraction; deleted Scrum process diagram with text labels: "Product Backlog", "Sprint Backlog", "Sprint", "24 h", "30 days", "Working increment of the software"]
diff --git a/lectures/15_process/spiral_model.svg b/lectures/15_process/spiral_model.svg deleted file mode 100644 index 14ea98b8..00000000 --- a/lectures/15_process/spiral_model.svg +++ /dev/null @@ -1,434 +0,0 @@
-[SVG markup elided in extraction; deleted spiral model diagram with quadrants "1. Determine objectives", "2. Identify and resolve risks", "3. Development and Test", "4. Plan the next iteration", axes "Progress" and "Cumulative cost", and labels "Requirements plan", "Concept of operation", "Concept of requirements", "Prototype 1", "Prototype 2", "Operational prototype", "Requirements", "Draft", "Detailed design", "Code", "Integration", "Test", "Implementation", "Release", "Test plan", "Verification & Validation", "Development plan", "Review"]
diff --git a/lectures/15_process/techDebtQuadrant.png b/lectures/15_process/techDebtQuadrant.png deleted file mode 100644 index d298c812..00000000 Binary files a/lectures/15_process/techDebtQuadrant.png and /dev/null differ
diff --git a/lectures/15_process/trajectories.png b/lectures/15_process/trajectories.png deleted file mode 100644 index 3c2d0847..00000000 Binary files a/lectures/15_process/trajectories.png and /dev/null differ
diff --git a/lectures/15_process/transcriptionarchitecture2.svg b/lectures/15_process/transcriptionarchitecture2.svg deleted file mode 100644 index 212a40f7..00000000 --- a/lectures/15_process/transcriptionarchitecture2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file
diff --git a/lectures/15_process/zillow.png b/lectures/15_process/zillow.png deleted file mode 100644 index d3522d64..00000000 Binary files a/lectures/15_process/zillow.png and /dev/null differ
diff --git a/lectures/15_process/zillow_main.png b/lectures/15_process/zillow_main.png deleted file mode 100644 index 67a83fea..00000000 Binary files a/lectures/15_process/zillow_main.png and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/Martin_Shkreli_2016.jpg b/lectures/16_intro_ethics_fairness/Martin_Shkreli_2016.jpg deleted file mode 100644 index 72ba7ffa..00000000 Binary files a/lectures/16_intro_ethics_fairness/Martin_Shkreli_2016.jpg and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/Social-media-driving.png b/lectures/16_intro_ethics_fairness/Social-media-driving.png deleted file mode 100644 index 7456657d..00000000 Binary files a/lectures/16_intro_ethics_fairness/Social-media-driving.png and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/amazon-hiring.png b/lectures/16_intro_ethics_fairness/amazon-hiring.png deleted file mode 100644 index 94822f89..00000000 Binary files a/lectures/16_intro_ethics_fairness/amazon-hiring.png and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/bing-translate-bias.png b/lectures/16_intro_ethics_fairness/bing-translate-bias.png deleted file mode 100644 index cc09f011..00000000 Binary files a/lectures/16_intro_ethics_fairness/bing-translate-bias.png and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/ceo.png b/lectures/16_intro_ethics_fairness/ceo.png deleted file mode 100644 index edccbe99..00000000 Binary files a/lectures/16_intro_ethics_fairness/ceo.png and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/college-admission.jpg b/lectures/16_intro_ethics_fairness/college-admission.jpg deleted file mode 100644 index 44b03d35..00000000 Binary files a/lectures/16_intro_ethics_fairness/college-admission.jpg and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/crime-map.jpg b/lectures/16_intro_ethics_fairness/crime-map.jpg deleted file mode 100644 index b5c6b5ab..00000000 Binary files a/lectures/16_intro_ethics_fairness/crime-map.jpg and /dev/null differ
diff --git a/lectures/16_intro_ethics_fairness/dont-be-evil.png b/lectures/16_intro_ethics_fairness/dont-be-evil.png deleted file mode 100644 index 8e01d9d8..00000000 Binary files a/lectures/16_intro_ethics_fairness/dont-be-evil.png and /dev/null differ
diff
--git a/lectures/16_intro_ethics_fairness/eej2.jpeg b/lectures/16_intro_ethics_fairness/eej2.jpeg deleted file mode 100644 index 2354a176..00000000 Binary files a/lectures/16_intro_ethics_fairness/eej2.jpeg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/amazon-hiring.png b/lectures/16_intro_ethics_fairness/examples/amazon-hiring.png deleted file mode 100644 index 94822f89..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/amazon-hiring.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/crime-map.jpg b/lectures/16_intro_ethics_fairness/examples/crime-map.jpg deleted file mode 100644 index b5c6b5ab..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/crime-map.jpg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/freelancing.png b/lectures/16_intro_ethics_fairness/examples/freelancing.png deleted file mode 100644 index b6ac942b..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/freelancing.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/gender-detection.png b/lectures/16_intro_ethics_fairness/examples/gender-detection.png deleted file mode 100644 index d02b1d8c..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/gender-detection.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/online-ad.png b/lectures/16_intro_ethics_fairness/examples/online-ad.png deleted file mode 100644 index 933c5627..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/online-ad.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/recidivism-bias.jpeg b/lectures/16_intro_ethics_fairness/examples/recidivism-bias.jpeg deleted file mode 100644 index 00b9cb86..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/recidivism-bias.jpeg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/recidivism-propublica.png b/lectures/16_intro_ethics_fairness/examples/recidivism-propublica.png deleted file mode 100644 index 5e871c64..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/recidivism-propublica.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/shirley-card.jpg b/lectures/16_intro_ethics_fairness/examples/shirley-card.jpg deleted file mode 100644 index d0e1fc47..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/shirley-card.jpg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/shirley-card.png b/lectures/16_intro_ethics_fairness/examples/shirley-card.png deleted file mode 100644 index 48e700a4..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/shirley-card.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/examples/xing-bias.jpeg b/lectures/16_intro_ethics_fairness/examples/xing-bias.jpeg deleted file mode 100644 index 9debb5c4..00000000 Binary files a/lectures/16_intro_ethics_fairness/examples/xing-bias.jpeg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/fake-news.jpg b/lectures/16_intro_ethics_fairness/fake-news.jpg deleted file mode 100644 index 45b46957..00000000 Binary files a/lectures/16_intro_ethics_fairness/fake-news.jpg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/feedbackloop.svg b/lectures/16_intro_ethics_fairness/feedbackloop.svg deleted file mode 100644 index 66334f7c..00000000 --- a/lectures/16_intro_ethics_fairness/feedbackloop.svg +++ /dev/null @@ -1 
+0,0 @@ - \ No newline at end of file diff --git a/lectures/16_intro_ethics_fairness/gender-bias.png b/lectures/16_intro_ethics_fairness/gender-bias.png deleted file mode 100644 index 7e875de5..00000000 Binary files a/lectures/16_intro_ethics_fairness/gender-bias.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/gender-detection.png b/lectures/16_intro_ethics_fairness/gender-detection.png deleted file mode 100644 index d02b1d8c..00000000 Binary files a/lectures/16_intro_ethics_fairness/gender-detection.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/google-translate-bias.png b/lectures/16_intro_ethics_fairness/google-translate-bias.png deleted file mode 100644 index a9d54cf8..00000000 Binary files a/lectures/16_intro_ethics_fairness/google-translate-bias.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/harms-table.png b/lectures/16_intro_ethics_fairness/harms-table.png deleted file mode 100644 index 4074d948..00000000 Binary files a/lectures/16_intro_ethics_fairness/harms-table.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/infinitescroll.png b/lectures/16_intro_ethics_fairness/infinitescroll.png deleted file mode 100644 index a4b9a66c..00000000 Binary files a/lectures/16_intro_ethics_fairness/infinitescroll.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/intro-ethics-fairness.md b/lectures/16_intro_ethics_fairness/intro-ethics-fairness.md deleted file mode 100644 index cf3dbd4a..00000000 --- a/lectures/16_intro_ethics_fairness/intro-ethics-fairness.md +++ /dev/null @@ -1,645 +0,0 @@ ---- -author: Eunsuk Kang & Christian Kaestner -title: "MLiP: Intro to Ethics and Fairness" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - -# Responsible ML Engineering - -(Intro to Ethics and Fairness) - - - - - - - ---- -
- - ----- -## Changing directions... - -![Overview of course content](../_assets/overview.svg) - - - - ----- -# Readings - -R. Caplan, J. Donovan, L. Hanson, J. -Matthews. "Algorithmic Accountability: A Primer", Data & Society -(2018). - - ----- -# Learning Goals - -* Review the importance of ethical considerations in designing AI-enabled systems -* Recall basic strategies to reason about ethical challenges -* Diagnose potential ethical issues in a given system -* Understand the types of harm that can be caused by ML -* Understand the sources of bias in ML - ---- -# Overview - -Many interrelated issues: -* Ethics -* Fairness -* Justice -* Discrimination -* Safety -* Privacy -* Security -* Transparency -* Accountability - -*Each is a deep and nuanced research topic. We focus on a survey of some key issues.* - - - - - - ----- - - - -![Martin Shkreli](Martin_Shkreli_2016.jpg) - - -
- -*In 2015, Shkreli received widespread criticism [...] obtained the manufacturing license for the antiparasitic drug Daraprim and raised its price from USD 13.5 to 750 per pill [...] referred to by the media as "the most hated man in America" and "Pharma Bro".* -- [Wikipedia](https://en.wikipedia.org/wiki/Martin_Shkreli) - -"*I could have raised it higher and made more profits for our shareholders. Which is my primary duty.*" -- Martin Shkreli - -
- - -Note: Image source: https://en.wikipedia.org/wiki/Martin_Shkreli#/media/File:Martin_Shkreli_2016.jpg - - ----- -## Terminology - -**Legal** = in accordance with societal laws - - systematic body of rules governing society; set through government - - punishment for violation - -**Ethical** = following moral principles of tradition, group, or individual - - branch of philosophy, the study of standards of human conduct - - professional ethics = rules codified by professional organization - - not legally binding, no enforcement beyond "shame" - - high ethical standards may yield long-term benefits through image and staff loyalty - -![Random letters](../_assets/onterminology.jpg) - - ----- -## With a few lines of code... - -Developers have substantial power in shaping products - -Small design decisions can have substantial impact (safety, security, -discrimination, ...) -- not always deliberate - -Our view: We have both **legal & ethical** responsibilities to anticipate mistakes, -think through their consequences, and build in mitigations! - - - - - - - - - - - ----- -## Example: Social Media - -![zuckerberg](mark-zuckerberg.png) - - -*What is the (real) organizational objective of the company?* - ----- -## Optimizing for Organizational Objective - - - -How do we maximize user engagement? Examples: - - Infinite scroll: Encourage non-stop, continual use - - Personal recommendations: Suggest news feed to increase engagement - - Push notifications: Notify disengaged users to return to the app - - - -![Infinite Scroll](infinitescroll.png) - - - ----- -## Addiction - -![social-media-driving](Social-media-driving.png) - - -* 210M people worldwide addicted to social media -* 71% of Americans sleep next to a mobile device -* ~1000 people injured **per day** due to distracted driving (USA) - - - -https://www.flurry.com/blog/mobile-addicts-multiply-across-the-globe/; -https://www.cdc.gov/motorvehiclesafety/Distracted_Driving/index.html - ----- -## Mental Health - -![teen-suicide-rate](teen-suicide-rate.png) - - -* 35% of US teenagers with low social-emotional well-being have been bullied on social media. -* 70% of teens feel excluded when using social media. - - -https://leftronic.com/social-media-addiction-statistics - ----- -## Disinformation & Polarization - -![fake-news](fake-news.jpg) - - ----- -## Discrimination - -[![twitter-cropping](twitter-cropping.png)](https://twitter.com/bascule/status/1307440596668182528) - - ----- -## Who's to blame? - -![dont-be-evil](dont-be-evil.png) - - -*Are these companies intentionally trying to cause harm? If not, - what are the root causes of the problem?* - - ----- -## Liability? - -> THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. - -Note: Software companies have usually gotten away with claiming no liability for their products - ----- -## Some Challenges -
- - - -*Misalignment between organizational goals & societal values* - * Financial incentives often dominate other goals ("grow or die") - -*Hardly any regulation* - * Little legal consequences for causing negative impact (with some exceptions) - * Poor understanding of socio-technical systems by policy makers - - - -*Engineering challenges, at system- & ML-level* - * Difficult to clearly define or measure ethical values - * Difficult to anticipate all possible usage contexts - * Difficult to anticipate impact of feedback loops - * Difficult to prevent malicious actors from abusing the system - * Difficult to interpret output of ML and make ethical decisions - - - -**These problems have existed before, but they are being - rapidly exacerbated by the widespread use of ML** - -
- - - ----- -## Responsible Engineering Matters - -Engineers have substantial power in shaping products and outcomes - -Serious individual and societal harms possible from (a) negligence and (b) malicious designs -* Safety, mental health, weapons -* Security, privacy -* Manipulation, addiction, surveillance, polarization -* Job loss, deskilling -* Discrimination - ----- -## Buzzword or real progress? - -![Microsoft responsible AI principles](responsibleai.png) - - ----- -## Responsible Engineering in this Course - - -Key areas of concern -* Fairness -* Safety -* Security and privacy -* Transparency and accountability - -Technical infrastructure concepts -* Interpretability and explainability -* Versioning, provenance, reproducibility - - - - ---- -# Fairness - ----- -## Legally protected classes (US) -
- -- Race ([Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964)) -- Religion ([Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964)) -- National origin ([Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964)) -- Sex, sexual orientation, and gender identity ([Equal Pay Act of 1963](https://en.wikipedia.org/wiki/Equal_Pay_Act_of_1963), [Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964), and [Bostock v. Clayton](https://en.wikipedia.org/wiki/Bostock_v._Clayton_County)) -- Age (40 and over, [Age Discrimination in Employment Act of 1967](https://en.wikipedia.org/wiki/Age_Discrimination_in_Employment_Act_of_1967)) -- Pregnancy ([Pregnancy Discrimination Act of 1978](https://en.wikipedia.org/wiki/Pregnancy_Discrimination_Act)) -- Familial status (preference for or against having children, [Civil Rights Act of 1968](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1968)) -- Disability status ([Rehabilitation Act of 1973](https://en.wikipedia.org/wiki/Rehabilitation_Act_of_1973); [Americans with Disabilities Act of 1990](https://en.wikipedia.org/wiki/Americans_with_Disabilities_Act_of_1990)) -- Veteran status ([Vietnam Era Veterans’ Readjustment Assistance Act of 1974](https://en.wikipedia.org/wiki/Vietnam_Era_Veterans'_Readjustment_Assistance_Act); [Uniformed Services Employment and Reemployment Rights Act of 1994](https://en.wikipedia.org/wiki/Uniformed_Services_Employment_and_Re-employment_Rights_Act_of_1994)) -- Genetic information ([Genetic Information Nondiscrimination Act of 2008](https://en.wikipedia.org/wiki/Genetic_Information_Nondiscrimination_Act)) - -
- - - -https://en.wikipedia.org/wiki/Protected_group ----- -## Regulated domains (US) - -* Credit (Equal Credit Opportunity Act) -* Education (Civil Rights Act of 1964; Education Amendments of 1972) -* Employment (Civil Rights Act of 1964) -* Housing (Fair Housing Act) -* ‘Public Accommodation’ (Civil Rights Act of 1964) - -Extends to marketing and advertising; not limited to final decision - - -Barocas, Solon and Moritz Hardt. "[Fairness in machine learning](https://mrtz.org/nips17/#/)." NIPS Tutorial 1 (2017). - - ----- -## What is fair? - -> Fairness discourse asks questions about how to treat people and whether treating different groups of people differently is ethical. If two groups of people are systematically treated differently, this is often considered unfair. - ----- -## Dividing a Pie? - - - - -* Equal slices for everybody -* Bigger slices for active bakers -* Bigger slices for inexperienced/new members (e.g., children) -* Bigger slices for hungry people -* More pie for everybody, bake more - -*(Not everybody contributed equally during baking, not everybody is equally hungry)* - - - - -![Pie](../_chapterimg/16_fairness.jpg) - - - - ----- -## Preview: Equality vs Equity vs Justice - -![Contrasting equality, equity, and justice](eej2.jpeg) - - ----- -## Types of Harm on Society - -__Harms of allocation__: Withhold opportunities or resources - -__Harms of representation__: Reinforce stereotypes, subordination along - the lines of identity - - - -Kate Crawford. “The Trouble With Bias”, NeurIPS Keynote (2017). - ----- -## Harms of Allocation - -* Withhold opportunities or resources -* Poor quality of service, degraded user experience for certain groups - -![](gender-detection.png) - - - - -_Gender Shades: Intersectional Accuracy Disparities in -Commercial Gender Classification_, Buolamwini & Gebru, ACM FAT* (2018). - ----- -## Harms of Representation - -* Over/under-representation of certain groups in organizations -* Reinforcement of stereotypes - -![](online-ad.png) - - - - -_Discrimination in Online Ad Delivery_, Latanya Sweeney, SSRN (2013). - ----- -## Identifying harms - -![](harms-table.png) - - -* Multiple types of harms can be caused by a product! -* Think about your system objectives & identify potential harms. - - - -_Challenges of incorporating algorithmic fairness into practice_, FAT* Tutorial (2019). - ----- -## Not all discrimination is harmful - -![](gender-bias.png) - - -* Loan lending: Gender discrimination is illegal. -* Medical diagnosis: Gender-specific diagnosis may be desirable. -* The problem is _unjustified_ differentiation; i.e., discriminating on factors that should not matter -* Discrimination is a __domain-specific__ concept (i.e., world vs machine) - ----- -## Role of Requirements Engineering - -* Identify system goals -* Identify legal constraints -* Identify stakeholders and fairness concerns -* Analyze risks with regard to discrimination and fairness -* Analyze possible feedback loops (world vs machine) -* Negotiate tradeoffs with stakeholders -* Set requirements/constraints for data and model -* Plan mitigations in the system (beyond the model) -* Design incident response plan -* Set expectations for offline and online assurance and monitoring - - - - - - - - - - - - - - - - - - ---- -# Sources of Bias - ----- -## Where does the bias come from? - -![](google-translate-bias.png) - - - - -_Semantics derived automatically from language corpora contain -human-like biases_, Caliskan et al., Science (2017). - ----- -## Where does the bias come from? 
- -![](bing-translate-bias.png) - - ----- -## Sources of Bias - -* Historical bias -* Tainted examples -* Skewed sample -* Limited features -* Sample size disparity -* Proxies - - - -_Big Data's Disparate Impact_, Barocas & Selbst, California Law Review (2016). - ----- -## Historical Bias - -*Data reflects past biases, not intended outcomes* - -![Image search for "CEO"](ceo.png) - - -*Should the algorithm reflect the reality?* - -Note: "An example of this type of bias can be found in a 2018 image search -result where searching for women CEOs ultimately resulted in fewer female CEO images due -to the fact that only 5% of Fortune 500 CEOs were women—which would cause the search -results to be biased towards male CEOs. These search results were of course reflecting -the reality, but whether or not the search algorithms should reflect this reality is an issue worth -considering." - ----- -## Correcting Historical Bias? - - -> "Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. " -- Cathy O'Neil in [Weapons of Math Destruction](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991016462699704436) - -> "Through user studies, the [image search] team learned that many users -were uncomfortable with the idea of the company “manipulating” search results, viewing this behavior as unethical." -- observation from interviews by Ken Holstein - - - ----- -## Tainted Labels - -*Bias in dataset labels assigned (directly or indirectly) by humans* - -![](amazon-hiring.png) - - -Example: Hiring decision dataset -- labels assigned by (possibly biased) experts or derived from past (possibly biased) hiring decisions - ----- -## Skewed Sample - -*Bias in how and what data is collected* - -![](crime-map.jpg) - - -Crime prediction: Where to analyze crime? What is considered crime? Actually a random/representative sample? - -Recall: Raw data is an oxymoron - ----- -## Limited Features - -*Features that are less informative/reliable for certain subpopulations* - -![](performance-review.jpg) - - -* Graduate admissions: Letters of recommendation equally reliable for international applicants? -* Employee performance review: "Leave of absence" acceptable feature if parental leave is gender skewed? - - -Note: -Decisions may be based on features that are predictive and accurate for a large part of the target distribution, but not so for some other parts of the distribution. -For example, a system ranking applications for graduate school admissions may heavily rely on letters of recommendation and be well calibrated for applicants who can request letters from mentors familiar with the culture and jargon of such letters in the US, but may work poorly for international applicants from countries where such letters are not common or where such letters express support with different jargon. To reduce bias, we should carefully review all features and analyze whether they may be less predictive for certain subpopulations. - ----- -## Sample Size Disparity - -*Limited training data for some subpopulations* - -![](shirley-card.jpg) - - -* Biased sampling process: "Shirley Card" used for Kodak color calibration, using mostly Caucasian models -* Small subpopulations: Sikhs small minority in US (0.2%) barely represented in a random sample - ----- -## Sample Size Disparity - -Without intervention: -* Models biased toward populations more represented in target distribution (e.g., Caucasian skin tones) -* ...
biased towards populations that are easier to sample (e.g., people self-selecting to post to Instagram) -* ... may ignore small minority populations as noise - -Typically requires deliberate sampling strategy, intentional oversampling - - ----- -## Proxies - -*Features correlate with protected attribute, remain after removal* - - -![](neighborhoods.png) - - -* Example: Neighborhood as a proxy for race -* Extracurricular activities as proxy for gender and social class (e.g., “cheerleading”, “peer-mentor for ...”, “sailing team”, “classical music”) - - ----- -## Feedback Loops reinforce Bias - -![Feedback loop](feedbackloop.svg) - - - - -> "Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. " -- Cathy O'Neil in [Weapons of Math Destruction](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991016462699704436) - ----- -## Breakout: College Admission - -![](college-admission.jpg) - - -Scenario: Evaluate applications & identify students likely to succeed - -Features: GPA, GRE/SAT, gender, race, undergrad institute, alumni - connections, household income, hometown, transcript, etc. - ----- -## Breakout: College Admission - -Scenario: Evaluate applications & identify students who are -likely to succeed - -Features: GPA, GRE/SAT, gender, race, undergrad institute, alumni - connections, household income, hometown, transcript, etc. - -As a group, post to `#lecture` tagging members: - * **Possible harms:** Allocation of resources? Quality of service? Stereotyping? Denigration? Over-/Under-representation? - * **Sources of bias:** Skewed sample? Tainted labels? Historical bias? Limited features? - Sample size disparity? Proxies? - ---- -# Next lectures - -1. Measuring and Improving Fairness at the Model Level - -2. Fairness is a System-Wide Concern - - - ---- -# Summary - - -* Many interrelated issues: ethics, fairness, justice, safety, security, ... -* Both legal & ethical dimensions -* Challenges with developing ethical systems / developing systems responsibly -* Large potential for damage: Harm of allocation & harm of representation -* Sources of bias in ML: Skewed sample, tainted labels, limited features, sample size disparity, proxies - ----- -## Further Readings -
- -- 🕮 O’Neil, Cathy. [Weapons of math destruction: How big data increases inequality and threatens democracy](https://bookshop.org/books/weapons-of-math-destruction-how-big-data-increases-inequality-and-threatens-democracy/9780553418835). Crown Publishing, 2017. -- 🗎 Barocas, Solon, and Andrew D. Selbst. “[Big data’s disparate impact](http://www.californialawreview.org/wp-content/uploads/2016/06/2Barocas-Selbst.pdf).” Calif. L. Rev. 104 (2016): 671. -- 🗎 Mehrabi, Ninareh, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “[A survey on bias and fairness in machine learning](https://arxiv.org/abs/1908.09635).” ACM Computing Surveys (CSUR) 54, no. 6 (2021): 1–35. -- 🗎 Bietti, Elettra. “[From ethics washing to ethics bashing: a view on tech ethics from within moral philosophy](https://dl.acm.org/doi/pdf/10.1145/3351095.3372860).” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 210–219. 2020. - - -
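The proxy problem flagged in this deck can also be probed mechanically: after removing a protected attribute, the remaining features may still predict it. Below is a minimal audit sketch, assuming a pandas dataframe with numerically encoded features and a `race` column; all names are illustrative, not from the slides.

```python
# Hedged sketch: flag features that may act as proxies for a protected
# attribute by measuring how much information they carry about it.
# Assumes features are already numerically encoded; names are hypothetical.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def flag_proxy_candidates(df: pd.DataFrame, protected: str, threshold: float = 0.05) -> dict:
    """Return features whose mutual information with the protected
    attribute exceeds a task-specific threshold; review these manually."""
    features = df.drop(columns=[protected])
    scores = mutual_info_classif(features, df[protected])
    return {name: score for name, score in zip(features.columns, scores) if score > threshold}

# e.g., neighborhood may encode race even after the race column is dropped:
# print(flag_proxy_candidates(applications, protected="race"))
```

A high score does not by itself prove harmful discrimination (recall: discrimination is a domain-specific concept); it only marks features that deserve scrutiny.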
diff --git a/lectures/16_intro_ethics_fairness/mark-zuckerberg.png b/lectures/16_intro_ethics_fairness/mark-zuckerberg.png deleted file mode 100644 index 05cb03dd..00000000 Binary files a/lectures/16_intro_ethics_fairness/mark-zuckerberg.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/neighborhoods.png b/lectures/16_intro_ethics_fairness/neighborhoods.png deleted file mode 100644 index fe0a0d6e..00000000 Binary files a/lectures/16_intro_ethics_fairness/neighborhoods.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/online-ad.png b/lectures/16_intro_ethics_fairness/online-ad.png deleted file mode 100644 index 933c5627..00000000 Binary files a/lectures/16_intro_ethics_fairness/online-ad.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/performance-review.jpg b/lectures/16_intro_ethics_fairness/performance-review.jpg deleted file mode 100644 index 34323bd4..00000000 Binary files a/lectures/16_intro_ethics_fairness/performance-review.jpg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/responsibleai.png b/lectures/16_intro_ethics_fairness/responsibleai.png deleted file mode 100644 index 1b3ec1f5..00000000 Binary files a/lectures/16_intro_ethics_fairness/responsibleai.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/shirley-card.jpg b/lectures/16_intro_ethics_fairness/shirley-card.jpg deleted file mode 100644 index d0e1fc47..00000000 Binary files a/lectures/16_intro_ethics_fairness/shirley-card.jpg and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/teen-suicide-rate.png b/lectures/16_intro_ethics_fairness/teen-suicide-rate.png deleted file mode 100644 index 0e04315e..00000000 Binary files a/lectures/16_intro_ethics_fairness/teen-suicide-rate.png and /dev/null differ diff --git a/lectures/16_intro_ethics_fairness/twitter-cropping.png b/lectures/16_intro_ethics_fairness/twitter-cropping.png deleted file mode 100644 index c5abdf0b..00000000 Binary files a/lectures/16_intro_ethics_fairness/twitter-cropping.png and /dev/null differ diff --git a/lectures/17_fairness_measures/appraisal.png b/lectures/17_fairness_measures/appraisal.png deleted file mode 100644 index cd497f63..00000000 Binary files a/lectures/17_fairness_measures/appraisal.png and /dev/null differ diff --git a/lectures/17_fairness_measures/cancer-stats.png b/lectures/17_fairness_measures/cancer-stats.png deleted file mode 100644 index 6a0f1ad5..00000000 Binary files a/lectures/17_fairness_measures/cancer-stats.png and /dev/null differ diff --git a/lectures/17_fairness_measures/confusion-matrix.jpg b/lectures/17_fairness_measures/confusion-matrix.jpg deleted file mode 100644 index 5f743cd5..00000000 Binary files a/lectures/17_fairness_measures/confusion-matrix.jpg and /dev/null differ diff --git a/lectures/17_fairness_measures/eej2.jpeg b/lectures/17_fairness_measures/eej2.jpeg deleted file mode 100644 index 2354a176..00000000 Binary files a/lectures/17_fairness_measures/eej2.jpeg and /dev/null differ diff --git a/lectures/17_fairness_measures/fairness-papers.jpeg b/lectures/17_fairness_measures/fairness-papers.jpeg deleted file mode 100644 index ee144f5d..00000000 Binary files a/lectures/17_fairness_measures/fairness-papers.jpeg and /dev/null differ diff --git a/lectures/17_fairness_measures/gender-bias.png b/lectures/17_fairness_measures/gender-bias.png deleted file mode 100644 index 7e875de5..00000000 Binary files a/lectures/17_fairness_measures/gender-bias.png and /dev/null differ diff --git 
a/lectures/17_fairness_measures/justice.jpeg b/lectures/17_fairness_measures/justice.jpeg deleted file mode 100644 index e3193d49..00000000 Binary files a/lectures/17_fairness_measures/justice.jpeg and /dev/null differ diff --git a/lectures/17_fairness_measures/manymetrics.png b/lectures/17_fairness_measures/manymetrics.png deleted file mode 100644 index 4764e070..00000000 Binary files a/lectures/17_fairness_measures/manymetrics.png and /dev/null differ diff --git a/lectures/17_fairness_measures/model_fairness.md b/lectures/17_fairness_measures/model_fairness.md deleted file mode 100644 index 05074c45..00000000 --- a/lectures/17_fairness_measures/model_fairness.md +++ /dev/null @@ -1,552 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Measuring Fairness" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - - -# Measuring Fairness - - ---- -## Diving into Fairness... - -![Overview of course content](../_assets/overview.svg) - - - - ----- -## Reading - -
- -Required: -- Nina Grgic-Hlaca, Elissa M. Redmiles, Krishna P. Gummadi, and Adrian Weller. -[Human Perceptions of Fairness in Algorithmic Decision Making: -A Case Study of Criminal Risk Prediction](https://dl.acm.org/doi/pdf/10.1145/3178876.3186138) -In WWW, 2018. - -Recommended: -- Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane. [Big Data and Social Science: Data Science Methods and Tools for Research and Practice](https://textbook.coleridgeinitiative.org/). Chapter 11, 2nd ed, 2020 -- Solon Barocas and Moritz Hardt and Arvind Narayanan. [Fairness and Machine Learning](http://www.fairmlbook.org). 2019 (incomplete book) -- Pessach, Dana, and Erez Shmueli. "[A Review on Fairness in Machine Learning](https://dl.acm.org/doi/full/10.1145/3494672)." ACM Computing Surveys (CSUR) 55, no. 3 (2022): 1-44. - -
- --- -# Learning Goals - -* Understand different definitions of fairness -* Discuss methods for measuring fairness -* Outline interventions to improve fairness at the model level - ---- -## Real change, or lip service? - -![tiktok](tiktok.jpg) - - - -https://www.nytimes.com/2023/03/23/business/tiktok-screen-time.html - ---- -# Fairness: Definitions - -How do we measure the fairness of an ML model? - ----- -### Fairness is still an actively studied & disputed concept! - -![](fairness-papers.jpeg) - - -Source: Moritz Hardt, https://fairmlclass.github.io/ - ----- -## Fairness: Definitions - -* Anti-classification (fairness through blindness) -* Group fairness (independence) -* Equalized odds (separation) -* ...and numerous others and variations! - - ---- -# Running Example: Mortgage Applications - -* Large loans repaid over long periods, large loss on default -* Home ownership is key path to build generational wealth -* Past decisions often discriminatory (redlining) -* Replace biased human decisions with an objective and more accurate ML model - - income, other debt, home value - - past debt and payment behavior (credit score) - ----- -## Recall: What is fair? - -> Fairness discourse asks questions about how to treat people and whether treating different groups of people differently is ethical. If two groups of people are systematically treated differently, this is often considered unfair. - ----- -## Recall: What is fair? - -![Contrasting equality, equity, and justice](eej2.jpeg) - - ----- -## What is fair in mortgage applications? - -1. Distribute loans equally across all groups of protected attribute(s) - (e.g., ethnicity) -2. Prioritize those who are more likely to pay back (e.g., higher - income, good credit history) - - - ----- -## Redlining - - - -![Redlining](redlining.jpeg) - - - - -Withhold services (e.g., mortgage, education, retail) from people in neighborhoods -deemed "risky" - -Map of Philadelphia, 1936, Home Owners' Loan Corp. (HOLC) -* Classification based on estimated "riskiness" of loans - - ----- -## Past bias, different starting positions - -![Severe median income and worth disparities between white and black households](mortgage.png) - - -Source: Federal Reserve’s [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm) - ---- -# Anti-classification - -* __Anti-classification (fairness through blindness)__ -* Group fairness (independence) -* Equalized odds (separation) -* ...and numerous others and variations! - ----- -## Anti-Classification - - -![](justice.jpeg) - - -* Also called _fairness through blindness_ or _fairness through unawareness_ -* Ignore certain sensitive attributes when making a decision -* Example: Remove gender and race from mortgage model - ----- -## Anti-Classification: Example - -![appraisal](appraisal.png) - - -"After Ms. Horton removed all signs of Blackness, a second appraisal valued a Jacksonville home owned by her and her husband, Alex Horton, at 40 percent higher." - - -https://www.nytimes.com/2022/03/21/realestate/remote-home-appraisals-racial-bias.html - ----- -## Anti-Classification - -![](justice.jpeg) - - -*Easy to implement, but any limitations?* - ----- -## Recall: Proxies - -*Features correlate with protected attributes* - -![](neighborhoods.png) - ----- -## Recall: Not all discrimination is harmful - -![](gender-bias.png) - - -* Loan lending: Gender and racial discrimination is illegal. -* Medical diagnosis: Gender/race-specific diagnosis may be desirable. -* Discrimination is a __domain-specific__ concept!
- ---- -## Anti-Classification - -![](justice.jpeg) - - -* Ignore certain sensitive attributes when making a decision -* Advantage: Easy to implement and test -* Limitations - * Sensitive attributes may be correlated with other features - * Some ML tasks need sensitive attributes (e.g., medical diagnosis) - ----- -## Ensuring Anti-Classification - -How to train models that are fair w.r.t. anti-classification? - - - ----- -## Ensuring Anti-Classification - -How to train models that are fair w.r.t. anti-classification? - ---> Simply remove features for protected attributes from training and inference data - ---> Null/randomize protected attribute during inference - -*(does not account for correlated attributes, is not required to)* - ----- -## Testing Anti-Classification - -How do we test that a classifier achieves anti-classification? - - - ----- -## Testing Anti-Classification - -Straightforward invariant for classifier $f$ and protected attribute $p$: - -$\forall x. f(x[p\leftarrow 0]) = f(x[p\leftarrow 1])$ - -*(does not account for correlated attributes, is not required to)* - -Test with *any* test data, e.g., purely random data or existing test data - -Any single inconsistency shows that the protected attribute was used. Can also report percentage of inconsistencies. - - -See for example: Galhotra, Sainyam, Yuriy Brun, and Alexandra Meliou. "[Fairness testing: testing software for discrimination](http://people.cs.umass.edu/brun/pubs/pubs/Galhotra17fse.pdf)." In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 498-510. 2017. -
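The invariant above translates directly into an executable check. A minimal sketch, assuming a trained scikit-learn-style classifier and a binary protected attribute stored as a dataframe column (the `model` object and column names are assumptions, not from the slides):

```python
# Sketch of the anti-classification test: flipping the protected attribute
# must never change a prediction. Any nonzero rate is a violation.
import numpy as np
import pandas as pd

def anticlassification_violation_rate(model, X: pd.DataFrame, protected: str) -> float:
    """Fraction of rows whose prediction changes when the protected
    attribute is forced to 0 vs. forced to 1."""
    y0 = model.predict(X.assign(**{protected: 0}))
    y1 = model.predict(X.assign(**{protected: 1}))
    return float(np.mean(y0 != y1))

# No labels are needed, so any data works, even random inputs:
# assert anticlassification_violation_rate(model, X_test, "gender") == 0.0
```

As the slide notes, this deliberately ignores correlated proxy features; it only checks direct use of the attribute.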
 ----- -## Anti-Classification Discussion - -*Testing of anti-classification barely needed, because easy to ensure by construction during training or inference!* - -Anti-classification is a good starting point to think about protected attributes - -Useful baseline for comparison - -Easy to implement, but only effective if (1) no proxies among features -and (2) protected attributes add no predictive power - ---- -# Group fairness - -* Anti-classification (fairness through blindness) -* __Group fairness (independence)__ -* Equalized odds (separation) -* ...and numerous others and variations! - ----- -## Group fairness - -Key idea: Compare outcomes across two groups -* Similar rates of accepted loans across racial/gender groups? -* Similar chance of being hired/promoted between gender groups? -* Similar rates of (predicted) recidivism across racial groups? - -Outcomes matter, not accuracy! - ----- -## Disparate impact vs. disparate treatment - -Disparate treatment: Practices or rules that treat a certain protected -group(s) differently from others -* e.g., Apply different mortgage rules for people from different backgrounds - -Disparate impact: Neutral rules, but outcome is worse for - one or more protected groups -* Same rules are applied, but certain groups have a harder time - obtaining mortgage in a particular neighborhood - ----- -## Group fairness in discrimination law - -Relates to *disparate impact* and the four-fifths rule - -Can sue organizations for discrimination if they -* mostly reject job applications from one minority group (identified by protected classes) and hire mostly from another -* reject most loans from one minority group and more frequently accept applicants from another - - ----- -## Notations - -* $X$: Feature set (e.g., age, race, education, region, income, etc.) -* $A \in X$: Sensitive attribute (e.g., gender) -* $R$: Regression score (e.g., predicted likelihood of on-time loan payment) -* $Y'$: Classifier output - * $Y' = 1$ if and only if $R > T$ for some threshold $T$ - * e.g., Grant the loan ($Y' = 1$) if the likelihood of paying back > 80% -* $Y$: Target variable being predicted ($Y = 1$ if the person actually - pays back on time) - -[Setting classification thresholds: Loan lending example](https://research.google.com/bigpicture/attacking-discrimination-in-ml) - - - ----- -## Group Fairness - -$P[Y' = 1 | A = a] = P[Y' = 1 | A = b]$ - -* Also called _independence_ or _demographic parity_ -* Mathematically, $Y' \perp A$ - * Prediction ($Y'$) must be independent of the sensitive attribute ($A$) -* Examples: - * The predicted rate of recidivism is the same across all races - * Both women and men have equal probability of being promoted - * i.e., P[promote = 1 | gender = M] = P[promote = 1 | gender = F] - ----- -## Group Fairness Limitations - -What are limitations of group fairness? - - - ----- -## Group Fairness Limitations - -* Ignores possible correlation between $Y$ and $A$ - * Rules out perfect predictor $Y' = Y$ when $Y$ & $A$ are correlated! -* Permits abuse and laziness: Can be satisfied by randomly assigning a positive outcome ($Y' = 1$) to protected groups - * e.g., Randomly promote people (regardless of their - job performance) to match the rate across all groups - ----- -## Adjusting Thresholds for Group Fairness - -Select different classification thresholds ($t_0$, $t_1$) for different groups (A = 0, A = 1) to achieve group fairness, such that -$P[R > t_0 | A = 0] = P[R > t_1 | A = 1]$ - -
- -Example: Mortgage application - * R: Likelihood of paying back the loan on time - * Suppose: With a uniform threshold (i.e., grant the loan if R > 0.8), group fairness is not achieved - * P[R > 0.8 | A = 0] = 0.4, P[R > 0.8 | A = 1] = 0.7 - * Adjust thresholds to achieve group fairness - * P[R > 0.6 | A = 0] = P[R > 0.8 | A = 1] -* Wouldn't group A = 1 argue it's unfair? When does this type of adjustment make sense? - -
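Finding the adjusted threshold is essentially a quantile lookup over each group's score distribution. A minimal sketch with synthetic stand-in scores (the Beta distributions and numbers are made up for illustration, not the slide's data):

```python
# Sketch: choose a group-specific threshold t0 so that
# P[R > t0 | A = 0] matches P[R > 0.8 | A = 1].
import numpy as np

def matching_threshold(scores: np.ndarray, target_rate: float) -> float:
    """Threshold t such that the fraction of scores above t ~= target_rate."""
    return float(np.quantile(scores, 1.0 - target_rate))

rng = np.random.default_rng(0)
scores_g0 = rng.beta(2, 4, 10_000)  # synthetic score distribution, group A = 0
scores_g1 = rng.beta(4, 2, 10_000)  # synthetic score distribution, group A = 1

target = float(np.mean(scores_g1 > 0.8))    # P[R > 0.8 | A = 1]
t0 = matching_threshold(scores_g0, target)  # lowered threshold for A = 0
print(f"P[R > {t0:.2f} | A = 0] = {np.mean(scores_g0 > t0):.2f} (target {target:.2f})")
```

Whether such an adjustment is justified is exactly the stakeholder question raised above, not a technical one.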
- ---- -## Testing Group Fairness - -*How would you test whether a classifier achieves group fairness?* - - - ----- -## Testing Group Fairness - -Collect realistic, representative data (not randomly generated!) -* Use existing validation/test data -* Monitor production data -* (Somehow) generate realistic test data, e.g., from probability distribution of population - -Separately measure the rate of positive predictions -* e.g., P[promoted = 1 | gender = M], P[promoted = 1 | gender = F] = ? - -Report issue if the rates differ beyond some threshold $\epsilon$ across -groups - - ---- -# Equalized odds - -* Anti-classification (fairness through blindness) -* Group fairness (independence) -* **Equalized odds (separation)** -* ...and numerous others and variations! - ----- -## Equalized odds - -Key idea: Focus on accuracy (not outcomes) across two groups - -* Similar default rates on accepted loans across racial/gender groups? -* Similar rate of "bad hires" and "missed stars" between gender groups? -* Similar accuracy of predicted recidivism vs actual recidivism across racial groups? - -Accuracy matters, not outcomes! - ----- -## Equalized odds in discrimination law - -Relates to *disparate treatment* - -Typically, lawsuits claim that protected attributes (e.g., race, gender) were used in decisions even though they were irrelevant -* e.g., fired over complaint because of being Latino, whereas other White employees were not fired with similar complaints - -Must prove that the defendant had *intention* to discriminate -* Often difficult: Relying on shifting justifications, inconsistent application of rules, or explicit remarks overheard or documented - ----- -## Equalized odds - -$P[Y'=1∣Y=0,A=a] = P[Y'=1∣Y=0,A=b]$ -$P[Y'=0∣Y=1,A=a] = P[Y'=0∣Y=1,A=b]$ - -Statistical property of *separation*: $Y' \perp A | Y$ - * Prediction must be independent of the sensitive attribute - _conditional_ on the target variable - ----- -## Review: Confusion Matrix - -![](confusion-matrix.jpg) - - -Can we explain separation in terms of model errors? -* $P[Y'=1∣Y=0,A=a] = P[Y'=1∣Y=0,A=b]$ -* $P[Y'=0∣Y=1,A=a] = P[Y'=0∣Y=1,A=b]$ - ----- -## Separation - -$P[Y'=1∣Y=0,A=a] = P[Y'=1∣Y=0,A=b]$ (FPR parity) - -$P[Y'=0∣Y=1,A=a] = P[Y'=0∣Y=1,A=b]$ (FNR parity) - -* $Y' \perp A | Y$: Prediction must be independent of the sensitive attribute - _conditional_ on the target variable -* i.e., All groups are susceptible to the same false positive/negative rates -* Example: Y': Promotion decision, A: Gender of applicant; Y: Actual job performance - ----- -## Testing Separation - -Requires realistic representative test data (telemetry or representative test data, not random) - -Separately measure false positive and false negative rates - * e.g., for FNR, compare P[promoted = 0 | female, good employee] vs P[promoted = 0 | male, good employee] - - -*How is this different from testing group fairness?* - ---- -# Breakout: Cancer Prognosis - -![](cancer-stats.png) - -In groups, post to `#lecture` tagging members: - -* Does the model meet anti-classification fairness w.r.t. gender? -* Does the model meet group fairness? -* Does the model meet equalized odds? -* Is the model fair enough to use?
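For exercises like this breakout, the group fairness and separation checks reduce to comparing a few conditional rates on representative data. A minimal sketch (binary labels and predictions as numpy arrays; all names are illustrative, not from the slides):

```python
# Sketch: per-group positive rate (group fairness) and per-group
# FPR/FNR (equalized odds) from labels, predictions, and group membership.
import numpy as np

def fairness_report(y_true: np.ndarray, y_pred: np.ndarray, group: np.ndarray) -> None:
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        pos_rate = p.mean()                                               # compare for group fairness
        fpr = p[t == 0].mean() if (t == 0).any() else float("nan")        # false positive rate
        fnr = (1 - p[t == 1]).mean() if (t == 1).any() else float("nan")  # false negative rate
        print(f"group {g}: P[Y'=1]={pos_rate:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")

# fairness_report(y_test, model.predict(X_test), X_test["gender"].to_numpy())
```

Rates that differ beyond a pre-agreed threshold ε across groups would flag a violation of the corresponding criterion.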
- - --- -# Other fairness measures - -* Anti-classification (fairness through blindness) -* Group fairness (independence) -* Equalized odds (separation) -* **...and numerous others and variations!** - ----- - -![](manymetrics.png) - ----- -## Many measures - -Many measures proposed - -Some specialized for tasks (e.g., ranking, NLP) - -Some consider downstream utility of various outcomes - -Most are similar to the three discussed -* Comparing different measures in the error matrix (e.g., false positive rate, lift) - - - - - - - ---- -# Outlook: Building Fair ML-Based Products - -**Next lecture:** Fairness is a *system-wide* concern - -* Identifying and negotiating fairness requirements -* Fairness beyond model predictions (product design, mitigations, data collection) -* Fairness in process and teamwork, barriers and responsibilities -* Documenting fairness at the interface -* Monitoring -* Promoting best practices - - - - - - ---- -# Summary - -* Three definitions of fairness: Anti-classification, group fairness, equalized odds -* Tradeoffs between fairness criteria - * What is the goal? - * Key: how to deal with unequal starting positions -* Improving fairness of a model - * In all *pipeline stages*: data collection, data cleaning, training, inference, evaluation - ----- -## Further Readings - -- 🕮 Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane. [Big Data and Social Science: Data Science Methods and Tools for Research and Practice](https://textbook.coleridgeinitiative.org/). Chapter 11, 2nd ed, 2020 -- 🕮 Solon Barocas and Moritz Hardt and Arvind Narayanan. [Fairness and Machine Learning](http://www.fairmlbook.org). 2019 (incomplete book) -- 🗎 Pessach, Dana, and Erez Shmueli. "[A Review on Fairness in Machine Learning](https://dl.acm.org/doi/full/10.1145/3494672)." ACM Computing Surveys (CSUR) 55, no. 3 (2022): 1-44.
- - - diff --git a/lectures/17_fairness_measures/mortgage.png b/lectures/17_fairness_measures/mortgage.png deleted file mode 100644 index 8a774d3a..00000000 Binary files a/lectures/17_fairness_measures/mortgage.png and /dev/null differ diff --git a/lectures/17_fairness_measures/neighborhoods.png b/lectures/17_fairness_measures/neighborhoods.png deleted file mode 100644 index fe0a0d6e..00000000 Binary files a/lectures/17_fairness_measures/neighborhoods.png and /dev/null differ diff --git a/lectures/17_fairness_measures/redlining.jpeg b/lectures/17_fairness_measures/redlining.jpeg deleted file mode 100644 index fe9439e6..00000000 Binary files a/lectures/17_fairness_measures/redlining.jpeg and /dev/null differ diff --git a/lectures/17_fairness_measures/tiktok.jpg b/lectures/17_fairness_measures/tiktok.jpg deleted file mode 100644 index db32ab93..00000000 Binary files a/lectures/17_fairness_measures/tiktok.jpg and /dev/null differ diff --git a/lectures/18_system_fairness/aequitas-report.png b/lectures/18_system_fairness/aequitas-report.png deleted file mode 100644 index d11913d5..00000000 Binary files a/lectures/18_system_fairness/aequitas-report.png and /dev/null differ diff --git a/lectures/18_system_fairness/aequitas.png b/lectures/18_system_fairness/aequitas.png deleted file mode 100644 index 2281bb6f..00000000 Binary files a/lectures/18_system_fairness/aequitas.png and /dev/null differ diff --git a/lectures/18_system_fairness/apes.png b/lectures/18_system_fairness/apes.png deleted file mode 100644 index 8e0a6f16..00000000 Binary files a/lectures/18_system_fairness/apes.png and /dev/null differ diff --git a/lectures/18_system_fairness/atm.gif b/lectures/18_system_fairness/atm.gif deleted file mode 100644 index c966f66c..00000000 Binary files a/lectures/18_system_fairness/atm.gif and /dev/null differ diff --git a/lectures/18_system_fairness/blood-pressure-monitor.jpg b/lectures/18_system_fairness/blood-pressure-monitor.jpg deleted file mode 100644 index 2ec3a2fa..00000000 Binary files a/lectures/18_system_fairness/blood-pressure-monitor.jpg and /dev/null differ diff --git a/lectures/18_system_fairness/bongo.gif b/lectures/18_system_fairness/bongo.gif deleted file mode 100644 index 598a9abf..00000000 Binary files a/lectures/18_system_fairness/bongo.gif and /dev/null differ diff --git a/lectures/18_system_fairness/ceo.png b/lectures/18_system_fairness/ceo.png deleted file mode 100644 index edccbe99..00000000 Binary files a/lectures/18_system_fairness/ceo.png and /dev/null differ diff --git a/lectures/18_system_fairness/college-admission.jpg b/lectures/18_system_fairness/college-admission.jpg deleted file mode 100644 index 44b03d35..00000000 Binary files a/lectures/18_system_fairness/college-admission.jpg and /dev/null differ diff --git a/lectures/18_system_fairness/compas-metrics.png b/lectures/18_system_fairness/compas-metrics.png deleted file mode 100644 index 99cb488d..00000000 Binary files a/lectures/18_system_fairness/compas-metrics.png and /dev/null differ diff --git a/lectures/18_system_fairness/component.svg b/lectures/18_system_fairness/component.svg deleted file mode 100644 index 9e488f32..00000000 --- a/lectures/18_system_fairness/component.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/18_system_fairness/data-bias-stage.png b/lectures/18_system_fairness/data-bias-stage.png deleted file mode 100644 index 0c317711..00000000 Binary files a/lectures/18_system_fairness/data-bias-stage.png and /dev/null differ diff --git 
a/lectures/18_system_fairness/datasheet.png b/lectures/18_system_fairness/datasheet.png deleted file mode 100644 index 6292f6d4..00000000 Binary files a/lectures/18_system_fairness/datasheet.png and /dev/null differ diff --git a/lectures/18_system_fairness/eej1.jpg b/lectures/18_system_fairness/eej1.jpg deleted file mode 100644 index 4b38ccf0..00000000 Binary files a/lectures/18_system_fairness/eej1.jpg and /dev/null differ diff --git a/lectures/18_system_fairness/eej2.jpeg b/lectures/18_system_fairness/eej2.jpeg deleted file mode 100644 index 2354a176..00000000 Binary files a/lectures/18_system_fairness/eej2.jpeg and /dev/null differ diff --git a/lectures/18_system_fairness/facial-dataset.png b/lectures/18_system_fairness/facial-dataset.png deleted file mode 100644 index 2e9a7a76..00000000 Binary files a/lectures/18_system_fairness/facial-dataset.png and /dev/null differ diff --git a/lectures/18_system_fairness/fairness-accuracy.jpeg b/lectures/18_system_fairness/fairness-accuracy.jpeg deleted file mode 100644 index 7e6dccc9..00000000 Binary files a/lectures/18_system_fairness/fairness-accuracy.jpeg and /dev/null differ diff --git a/lectures/18_system_fairness/fairness-lifecycle.jpg b/lectures/18_system_fairness/fairness-lifecycle.jpg deleted file mode 100644 index 4ea3d720..00000000 Binary files a/lectures/18_system_fairness/fairness-lifecycle.jpg and /dev/null differ diff --git a/lectures/18_system_fairness/fairness-longterm.png b/lectures/18_system_fairness/fairness-longterm.png deleted file mode 100644 index f95bca5f..00000000 Binary files a/lectures/18_system_fairness/fairness-longterm.png and /dev/null differ diff --git a/lectures/18_system_fairness/fairness_tree.png b/lectures/18_system_fairness/fairness_tree.png deleted file mode 100644 index 74e3c01d..00000000 Binary files a/lectures/18_system_fairness/fairness_tree.png and /dev/null differ diff --git a/lectures/18_system_fairness/feedbackloop.svg b/lectures/18_system_fairness/feedbackloop.svg deleted file mode 100644 index 66334f7c..00000000 --- a/lectures/18_system_fairness/feedbackloop.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/18_system_fairness/freelancing.png b/lectures/18_system_fairness/freelancing.png deleted file mode 100644 index b6ac942b..00000000 Binary files a/lectures/18_system_fairness/freelancing.png and /dev/null differ diff --git a/lectures/18_system_fairness/hiring.png b/lectures/18_system_fairness/hiring.png deleted file mode 100644 index 2c9a71ec..00000000 Binary files a/lectures/18_system_fairness/hiring.png and /dev/null differ diff --git a/lectures/18_system_fairness/loanprofit.png b/lectures/18_system_fairness/loanprofit.png deleted file mode 100644 index 50180f67..00000000 Binary files a/lectures/18_system_fairness/loanprofit.png and /dev/null differ diff --git a/lectures/18_system_fairness/model_drift.jpg b/lectures/18_system_fairness/model_drift.jpg deleted file mode 100644 index 47857a0e..00000000 Binary files a/lectures/18_system_fairness/model_drift.jpg and /dev/null differ diff --git a/lectures/18_system_fairness/modelcards.png b/lectures/18_system_fairness/modelcards.png deleted file mode 100644 index e54afe1c..00000000 Binary files a/lectures/18_system_fairness/modelcards.png and /dev/null differ diff --git a/lectures/18_system_fairness/recidivism-propublica.png b/lectures/18_system_fairness/recidivism-propublica.png deleted file mode 100644 index d5af3318..00000000 Binary files a/lectures/18_system_fairness/recidivism-propublica.png and /dev/null differ 
diff --git a/lectures/18_system_fairness/system_fairness.md b/lectures/18_system_fairness/system_fairness.md deleted file mode 100644 index c714d549..00000000 --- a/lectures/18_system_fairness/system_fairness.md +++ /dev/null @@ -1,1189 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Improving Fairness" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - - -# Building Fair Products - - --- -## From Fairness Concepts to Fair Products - -![Overview of course content](../_assets/overview.svg) - - - - ----- -## Reading - -Required reading: -* Holstein, Kenneth, Jennifer Wortman Vaughan, Hal -Daumé III, Miro Dudik, and Hanna -Wallach. "[Improving fairness in machine learning systems: What do industry practitioners need?](http://users.umiacs.umd.edu/~hal/docs/daume19fairness.pdf)" -In Proceedings of the 2019 CHI Conference on Human Factors in -Computing Systems, pp. 1-16. 2019. - -Recommended reading: -* 🗎 Metcalf, Jacob, and Emanuel Moss. "[Owning ethics: Corporate logics, silicon valley, and the institutionalization of ethics](https://datasociety.net/wp-content/uploads/2019/09/Owning-Ethics-PDF-version-2.pdf)." *Social Research: An International Quarterly* 86, no. 2 (2019): 449-476. - ----- -## Learning Goals - -* Understand the role of requirements engineering in selecting ML -fairness criteria -* Understand the process of constructing datasets for fairness -* Document models and datasets to communicate fairness concerns -* Consider the potential impact of feedback loops on AI-based systems - and need for continuous monitoring -* Consider achieving fairness in AI-based systems as an activity throughout the entire development cycle - ---- -## A few words about I4 - -* Pick a tool & write a blog post about it - * Must have an engineering aspect for building ML **systems** - * Out of scope: Purely model-centric tools (e.g., better ML libraries) -* Use case in the context of movie recommendation, but no need to be -about your specific system -* If the tool is from the previous semester, discuss different -features/capabilities -* Can also compare different tools (strengths & limitations) -* Think of it as a learning experience! Pick a new tool that you haven't used before - - ---- -## Today: Fairness as a System Quality - -Fairness can be measured for a model - -... but we really care whether the system, as it interacts with the environment, is fair/safe/secure - -... does the system cause harm? - -![System thinking](component.svg) - - ----- -## Fair ML Pipeline Process - -Fairness must be considered throughout the entire lifecycle! - -![](fairness-lifecycle.jpg) - - - - -_Fairness-aware Machine Learning_, Bennett et al., WSDM Tutorial (2019). - - ----- -## Fairness Problems are System-Wide Challenges - -* **Requirements engineering challenges:** How to identify fairness concerns, fairness metric, design data collection and labeling -* **Human-computer-interaction design challenges:** How to present results to users, fairly collect data from users, design mitigations -* **Quality assurance challenges:** Evaluate the entire system for fairness, continuously assure in production -* **Process integration challenges:** Incorporate fairness work in development process -* **Education and documentation challenges:** Create awareness, foster interdisciplinary collaboration - - ---- -# Understanding System-Level Goals for Fairness - -i.e., Requirements engineering - - ----- -## Recall: Fairness metrics - -* Anti-classification (fairness through blindness) -* Group fairness (independence) -* Equalized odds (separation) -* ...and numerous others and variations! - -**But which one makes most sense for my product?** - ----- -## Identifying Fairness Goals is a Requirements Engineering Problem -
- -* What is the goal of the system? What benefits does it provide and to - whom? - -* Who are the stakeholders of the system? What are the stakeholders’ views or expectations on fairness and where do they conflict? Are we trying to achieve fairness based on equality or equity? - -* What subpopulations (including minority groups) may be using or be affected by the system? What types of harms can the system cause with discrimination? - -* Does fairness undermine any other goals of the system (e.g., accuracy, profits, time to release)? - -* Are there legal anti-discrimination requirements to consider? Are - there societal expectations about ethics w.r.t. to this product? What is the activist position? - -* ... - - -
- - ----- -## 1. Identify Protected Attributes - -Against which groups might we discriminate? What attributes identify them directly or indirectly? - -Requires understanding of target population and subpopulations - -Use anti-discrimination law as starting point, but do not end there -* Socio-economic status? Body height? Weight? Hair style? Eye color? Sports team preferences? -* Protected attributes for non-humans? Animals, inanimate objects? - -Involve stakeholders, consult lawyers, read research, ask experts, ... - - ----- -## Protected attributes are not always obvious - -![ATM](atm.gif) - - -**Q. Other examples?** - ----- -## 2. Analyze Potential Harms - -Anticipate harms from unfair decisions -* Harms of allocation, harms of representation? -* How do biased model predictions contribute to system behavior? - -Consider how automation can amplify harm - -Overcome blind spots within teams -* Systematically consider consequences of bias -* Consider safety engineering techniques (e.g., FTA) -* Assemble diverse teams, use personas, crowdsource audits - - ----- -## Example: Judgment Call Game - - - -Card "Game" by Microsoft Research - -Participants write "Product reviews" from different perspectives -* encourage thinking about consequences -* enforce persona-like role taking - - - -![Photo of Judgment Call Game cards](../_chapterimg/17_fairnessgame.jpg) - - - - ----- -## Example: Judgment Call Game - - -![user-review1](user-review1.png) - - -![user-review2](user-review2.png) - - - - - -[Judgment Call the Game: Using Value Sensitive Design and Design -Fiction to Surface Ethical Concerns Related to Technology](https://dl.acm.org/doi/10.1145/3322276.3323697) - ----- -## 3. Negotiate Fairness Goals/Measures - -* Negotiate with stakeholders to determine fairness requirement for -the product: What is the suitable notion of fairness for the -product? Equality or equity? -* Map the requirements to model-level (model) specifications: Anti-classification? Group fairness? Equalized odds? -* Negotiation can be challenging! - * Conflicts with other system goals (accuracy, profits...) - * Conflicts among different beliefs, values, political views, etc., - * Will often need to accept some (perceived) unfairness - ----- -## Recall: What is fair? - -> Fairness discourse asks questions about how to treat people and whether treating different groups of people differently is ethical. If two groups of people are systematically treated differently, this is often considered unfair. - - ----- -## Intuitive Justice - -Research on what post people perceive as fair/just (psychology) - -When rewards depend on inputs and participants can chose contributions: Most people find it fair to split rewards proportional to inputs -* *Which fairness measure does this relate to?* - -Most people agree that for a decision to be fair, personal characteristics that do not influence the reward, such as gender or age, should not be considered when dividing the rewards. 
-* *Which fairness measure does this relate to?* - ----- -## Key issue: Unequal starting positions - -Not everybody starts on an equal footing -- individual and group differences -* Some differences are inherent, e.g., younger people have (on average) less experience -* Some differences come from past behavior/decisions, e.g., whether to attend college -* Some past decisions and opportunities are influenced by past injustices, e.g., redlining creating generational wealth differences - -Individual and group differences are not always clearly attributable, e.g., the nature-vs-nurture discussion - ----- -## Unequal starting positions - -
- -Fair or not? Should we account for unequal starting positions? -* Tom is lazier than Bob. He should get less pie. -* People in Egypt have on average a much longer work week (53h) than people in Germany (35h). They have less time to bake and should get more pie. -* Disabled people are always exhausted quickly. They should get less pie, because they contribute less. -* Men are on average more violent than women. This should be reflected in recidivism prediction. -* Employees with a PhD should earn higher wages than those with a bachelor's degree, because they decided to invest in more schooling. -* Students from poor neighborhoods should receive extra resources at school, because they get less help at home. -* Poverty is a moral failing. Poor people are less deserving of pie. - -
- ---- -## Dealing with unequal starting positions - -Equality (minimize disparate treatment): -* Treat everybody equally, regardless of starting position -* Focus on meritocracy, strive for fair opportunities -* Equalized-odds-style fairness; equality of opportunity - -Equity (minimize disparate impact): -* Compensate for different starting positions -* Lift disadvantaged group, affirmative action -* Strive for similar outcomes (distributive justice) -* Group-fairness-style fairness; equality of outcomes - ----- -## Equality vs Equity - -![Contrasting equality, equity, and justice](eej2.jpeg) - - ----- -## Equality vs Equity - -![Contrasting equality, equity, and justice](eej1.jpg) - - ----- -## Justice - -Aspirational third option that avoids a choice between equality and equity - -Fundamentally removes the initial imbalance or removes the need for a decision - -Typically rethinks the entire societal system in which the imbalance existed, beyond the scope of the ML product - ----- -## Choosing Equality vs Equity - -Each rooted in a long history in law and philosophy - -Typically incompatible, cannot achieve both - -Designers need to decide - -Problem dependent and goal dependent - -Which differences are associated with merit and which with systemic disadvantages of certain groups? Can we agree on the degree to which a group is disadvantaged? - - ----- -## Punitive vs Assistive Decisions - -* If the decision is **punitive** in nature: - * Harm is caused when a group is given an unwarranted penalty - * e.g., decide whom to deny bail based on risk of recidivism - * Heuristic: Use a fairness metric (equalized odds) based on false positive rates -* If the decision is **assistive** in nature: - * Harm is caused when a group in need is denied assistance - * e.g., decide who should receive a loan or a food subsidy - * Heuristic: Use a fairness metric based on false negative rates - ----- -## Fairness Tree - -![](fairness_tree.png) - - - - -Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane. [Big Data and Social Science: Data Science Methods and Tools for Research and Practice](https://textbook.coleridgeinitiative.org/). Chapter 11, 2nd ed, 2020 - - ----- -## Trade-offs in Fairness vs Accuracy - - - -![](fairness-accuracy.jpeg) - - - - - -Fairness imposes constraints, limiting what models can be learned - -**But:** Arguably, unfair predictions are not desirable! - -Determine how much compromise in accuracy or fairness is acceptable to - your stakeholders - - - - - -[Fairness Constraints: Mechanisms for Fair Classification.](https://proceedings.mlr.press/v54/zafar17a.html) Zafar et -al. (AISTATS 2017). - ----- -## Fairness, Accuracy, and Profits - -![](loanprofit.png) - - - -Interactive visualization: https://research.google.com/bigpicture/attacking-discrimination-in-ml/ - ----- -## Fairness, Accuracy, and Profits - -Fairness can conflict with accuracy goals - -Fairness can conflict with organizational goals (profits, usability) - -Fairer products may attract more customers - -Unfair products may receive bad press and reputation damage - -Improving fairness through better data can benefit everybody - - - ----- -## Discussion: Fairness Goal for Mortgage Applications? - - - ----- -## Discussion: Fairness Goal for Mortgage Applications? 
- -Disparate impact considerations seem to prevail -- group fairness - -Need to justify strong differences in outcomes - -Can also sue over disparate treatment if the bank indicates that a protected attribute was the reason for a decision - - - - - - - - - - - ----- -## Breakout: Fairness Goal for College Admission? - -![](college-admission.jpg) - - -Post as a group in #lecture: -* What kind of harm can be caused? -* Fairness goal: Equality or equity? -* Model: Anti-classification, group fairness, or equalized odds (with FPR/FNR)? - ----- -## Discussion: Fairness Goal for College Admission? - -Very limited scope of *affirmative action*: Contentious topic, -subject of multiple legal cases, banned in many states -* Supporters: Promote representation, counteract historical bias -* Opponents: Discriminate against certain racial groups - -Most forms of group fairness are likely illegal - -In practice: Anti-classification - - ----- -## Discussion: Fairness Goal for Hiring Decisions? - -![](hiring.png) - - -* What kind of harm can be caused? -* What do we want to achieve? Equality or equity? -* Anti-classification, group fairness, or equalized odds (FPR/FNR)? - ----- -## Law: "Four-fifths rule" (or "80% rule") - - -* Group fairness with a threshold: $\frac{P[R = 1 | A = a]}{P[R = 1 | A = b]} \geq 0.8$ -* Selection rate for a protected group (e.g., $A = a$) below 80% of the highest rate => selection procedure considered as having "adverse impact" -* Guideline adopted by federal agencies (Department of Justice, Equal Employment Opportunity Commission, etc.) in 1978 -* If violated, must justify business necessity (i.e., the selection procedure is essential to the safe & efficient operation) -* Example: 50% of male applicants hired vs 20% of female applicants (0.2/0.5 = 0.4 < 0.8) -- Is there a business justification for hiring men at a higher rate? - - ----- -## Recidivism Revisited - -![](recidivism-propublica.png) - - -* COMPAS system, developed by Northpointe: Used by judges in - sentencing decisions across multiple states (incl. PA) - - - - -[ProPublica article](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) - - ----- -## Which fairness definition? - -![](compas-metrics.png) - - -* ProPublica: COMPAS violates equalized odds w/ FPR & FNR -* Northpointe: COMPAS is fair because it has similar FDRs - * FDR = FP / (FP + TP) = 1 - Precision; FPR = FP / (FP + TN) -* __Q. So is COMPAS both fair & unfair at the same time?__ - - -[Figure from Big Data and Social Science, Ch. 11](https://textbook.coleridgeinitiative.org/chap-bias.html#ref-angwin2016b) - - ----- -## Fairness Definitions: Pitfalls - - -![](bongo.gif) - -* "Impossibility Theorem": Can't satisfy multiple fairness criteria at once -* Easy to pick some definition & claim that the model is fair - * But does a "fair" model really help reduce harm in the long term? -* Instead of just focusing on building a "fair" model, can we understand & - address the root causes of bias? - - - -A. Chouldechova. [Fair prediction with disparate impact: A study of bias in recidivism prediction instruments](https://arxiv.org/pdf/1703.00056.pdf)
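To make the competing criteria concrete, here is a minimal sketch (made-up confusion-matrix counts per group, not real COMPAS data; pandas assumed) that computes the four-fifths-rule ratio together with the per-group FPR that ProPublica emphasized and the per-group FDR that Northpointe emphasized; with differing base rates, equalizing one typically breaks the other:

```python
# Hedged sketch: fairness metrics from per-group confusion-matrix counts.
# All numbers are made up for illustration; this is not real COMPAS data.
import pandas as pd

counts = pd.DataFrame(
    {"tp": [300, 120], "fp": [200, 40], "tn": [400, 230], "fn": [100, 110]},
    index=["group_a", "group_b"])

# Four-fifths rule: lowest selection rate should be >= 80% of the highest
selection_rate = (counts.tp + counts.fp) / counts.sum(axis=1)
print("4/5 ratio:", round(selection_rate.min() / selection_rate.max(), 2))

fpr = counts.fp / (counts.fp + counts.tn)  # equalized-odds view (ProPublica)
fdr = counts.fp / (counts.fp + counts.tp)  # predictive-parity view (Northpointe)
print(pd.DataFrame({"FPR": fpr, "FDR": fdr}).round(2))
```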
 - - - - - - - ---- -# Dataset Construction for Fairness - - ----- -## Flexibility in Data Collection - -* Data science education often assumes data as given -* In industry, we often have control over data collection and curation (65%) -* Most address fairness issues by collecting more data (73%) - * Carefully review data collection procedures, sampling bias, and how trustworthy labels are - * **Often a high-leverage point to improve fairness!** - - - - -[Challenges of incorporating algorithmic fairness into practice](https://www.youtube.com/watch?v=UicKZv93SOY), -FAT* Tutorial, 2019 ([slides](https://bit.ly/2UaOmTG)) - ----- -## Data Bias - -![](data-bias-stage.png) - - -* Bias can be introduced at any stage of the data pipeline! - - - -Bennett et al., [Fairness-aware Machine Learning](https://sites.google.com/view/wsdm19-fairness-tutorial), WSDM Tutorial (2019). - - ----- -## Types of Data Bias - -* __Population bias__ -* __Historical bias__ -* __Behavioral bias__ -* Content production bias -* Linking bias -* Temporal bias - - - -_Social Data: Biases, Methodological Pitfalls, and Ethical -Boundaries_, Olteanu et al., Frontiers in Big Data (2019). - ----- -## Population Bias - -![](facial-dataset.png) - - -* Differences in demographics between the dataset and the target population -* May result in degraded services for certain groups - - -Merler, Ratha, Feris, and Smith. [Diversity in Faces](https://arxiv.org/abs/1901.10436) - ----- -## Historical Bias - -![Image search for "CEO"](ceo.png) - - -* Dataset matches the reality, but certain groups are under- or -over-represented due to historical reasons - ----- -## Behavioral Bias - -![](freelancing.png) - - -* Differences in user behavior across platforms or social contexts -* Example: Freelancing platforms (Fiverr vs TaskRabbit) - * Bias against certain minority groups on different platforms - - - -_Bias in Online Freelance Marketplaces_, Hannak et al., CSCW (2017). - ----- -## Fairness-Aware Data Collection - -* Address population bias - - * Does the dataset reflect the demographics in the target - population? - * If not, collect more data to achieve this -* Address under- & over-representation issues - - * Ensure a sufficient amount of data for all groups to avoid being - treated as "outliers" by ML - * Also avoid over-representation of certain groups (e.g., - remove historical data) - - -_Fairness-aware Machine Learning_, Bennett et al., WSDM Tutorial (2019). - ----- -## Fairness-Aware Data Collection - -* Data augmentation: Synthesize data for minority groups to reduce under-representation - - * Observed: "He is a doctor" -> synthesize "She is a doctor" -* Model auditing for better data collection - - * Evaluate accuracy across different groups - * Collect more data for groups with the highest error rates - - -_Fairness-aware Machine Learning_, Bennett et al., WSDM Tutorial (2019). - ----- -## Example Audit Tool: Aequitas - -![](aequitas.png) - ----- -## Example Audit Tool: Aequitas - -![](aequitas-report.png) - - - - - - - - - - - - - - - - ----- -## Documentation for Fairness: Data Sheets - -![](datasheet.png) - - -* Common practice in the electronics industry, medicine -* Purpose, provenance, creation, __composition__, distribution - * "Does the dataset relate to people?" - * "Does the dataset identify any subpopulations (e.g., by age)?" - - - -_Datasheets for Datasets_, Gebru et al. (2019). https://arxiv.org/abs/1803.09010
 - ----- -## Model Cards - -See also: https://modelcards.withgoogle.com/about - -![Model Card Example](modelcards.png) - - - - -Mitchell, Margaret, et al. "[Model cards for model reporting](https://www.seas.upenn.edu/~cis399/files/lecture/l22/reading2.pdf)." In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220-229. 2019. - - ----- -## Dataset Exploration - -![](what-if-tool.png) - - -[Google What-If Tool](https://pair-code.github.io/what-if-tool/demos/compas.html) - - - - - - - - - - - - - - - - - - - - ---- -# Anticipate Feedback Loops - ----- -## Feedback Loops - -![Feedback loop](feedbackloop.svg) - - ----- -## Feedback Loops in Mortgage Applications? - - - ----- -## Feedback Loops go through the Environment - -![](component.svg) - - - - ----- -## Analyze the World vs the Machine - -![world vs machine](worldvsmachine.svg) - - -*State and check assumptions!* - - ----- -## Analyze the World vs the Machine - -How do outputs effect change in the real world, and how does this (indirectly) influence inputs? - -Can we decouple inputs from outputs? Can telemetry be trusted? - -Interventions through system (re)design: -* Focus data collection on less influenced inputs -* Compensate for bias from feedback loops in the ML pipeline -* Do not build the system in the first place - - ----- -## Long-term Impact of ML - -* ML systems make multiple decisions over time, influencing the -behaviors of populations in the real world -* *But* most models are built & optimized assuming that the world is -static -* Difficult to estimate the impact of ML over time - * Need to reason about the system dynamics (world vs machine) - * e.g., what's the effect of a mortgage lending policy on a population? - ----- -## Long-term Impact & Fairness - - - -Deploying an ML model with a fairness criterion does NOT guarantee - improvement in equality/equity over time - -Even if a model appears to promote fairness in the -short term, it may result in harm over the long term - - - -![](fairness-longterm.png) - - - - - -[Fairness is not static: deeper understanding of long term fairness via simulation studies](https://dl.acm.org/doi/abs/10.1145/3351095.3372878), -in FAT* 2020. - - ----- -## Prepare for Feedback Loops - -We will likely not anticipate all feedback loops... - -... but we can anticipate that unknown feedback loops exist - --> Monitoring! - - - - ---- -# Monitoring - - ----- -## Monitoring & Auditing - -* Operationalize the fairness measure in production with telemetry -* Continuously monitor for: - - - Mismatch between training data, test data, and instances encountered in deployment - - Data shifts: May suggest a need to adjust fairness metrics/thresholds - - User reports & complaints: Log and audit system decisions - perceived to be unfair by users -* Invite diverse stakeholders to audit the system for biases - - ----- -## Monitoring & Auditing - -![](model_drift.jpg) - - -* Continuously monitor the fairness metric (e.g., error rates for -different sub-populations) -* Re-train the model with recent data or adjust classification thresholds - if needed - - ----- -## Preparing for Problems - -Prepare an *incident response plan* for fairness issues -* What can be shut down/reverted on short notice? -* Who does what? -* Who talks to the press? To affected parties? What do they need to know? - -Provide users with a path to *appeal decisions* -* Provide a feedback mechanism to complain about unfairness -* Human review? Human override? 
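A minimal sketch of what operationalizing such monitoring could look like, assuming production telemetry lands in a pandas DataFrame with hypothetical columns `group`, `prediction`, and `outcome`, and a made-up alerting threshold:

```python
# Hedged sketch: per-group false-positive-rate monitoring from production telemetry.
import pandas as pd

# Hypothetical telemetry: one row per production decision (made-up data)
telemetry = pd.DataFrame({
    "group":      ["a", "a", "a", "b", "b", "b"],
    "prediction": [1, 0, 1, 1, 1, 0],
    "outcome":    [0, 0, 1, 0, 0, 1],
})

def group_false_positive_rates(df: pd.DataFrame) -> pd.Series:
    negatives = df[df.outcome == 0]                       # ground-truth negatives
    return negatives.groupby("group").prediction.mean()   # share wrongly predicted positive

fpr = group_false_positive_rates(telemetry)
if fpr.max() - fpr.min() > 0.05:  # made-up alerting threshold
    print("ALERT: FPR gap across groups:", fpr.to_dict())
```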
- - - - - ---- -# Fairness beyond the Model - ----- -## Bias Mitigation through System Design - - - -Examples of mitigations around the model? - ----- -## 1. Avoid Unnecessary Distinctions - - -![Healthcare worker applying blood pressure monitor](blood-pressure-monitor.jpg) - -*Image captioning gender biased?* - - ----- -## 1. Avoid Unnecessary Distinctions - - -![Healthcare worker applying blood pressure monitor](blood-pressure-monitor.jpg) - - -"Doctor/nurse applying blood pressure monitor" -> "Healthcare worker applying blood pressure monitor" - - ----- -## 1. Avoid Unnecessary Distinctions - -Is the distinction actually necessary? Is there a more general class to unify them? - -Aligns with the notion of *justice* to remove the problem from the system - - ----- -## 2. Suppress Potentially Problematic Outputs - -![Twitter post of user complaining about misclassification of friends as Gorilla](apes.png) - - -*How to fix?* - - ----- -## 2. Suppress Potentially Problematic Outputs - -Anticipate problems or react to reports - -Postprocessing, filtering, safeguards -* Suppress entire output classes -* Hardcoded rules or other models (e.g., toxicity detection) - -May degrade system quality for some use cases - -See mitigating mistakes generally - ----- -## 3. Design Fail-Soft Strategy - -Example: Plagiarism detector - - - -**A: Cheating detected! This incident has been reported.** - - - -**B: This answer seems too perfect. Would you like another exercise?** - - - - - -HCI principle: Fail-soft interfaces avoid calling users out directly; they communicate in a friendly, constructive way that lets users save face - -Especially relevant if the system is unreliable or biased - - ----- -## 4. Keep Humans in the Loop - - -![Temi.com screenshot](temi.png) - - -TV subtitles: Humans check transcripts, especially with heavy dialects - ----- -## 4. Keep Humans in the Loop - -Recall: Automate vs prompt vs augment - -Involve humans to correct for mistakes and bias - -But: models are often introduced to avoid bias in human decisions - -But: keeping humans engaged and alert is a challenging interaction-design problem; human monitors may be biased too, making things worse - -**Does a human have a fair chance to detect and correct bias?** Enough information? Enough context? Enough time? Unbiased human decision? - ----- -## Predictive Policing Example - -> "officers expressed skepticism -about the software and during ride alongs showed no intention of using it" - -> "the officer discounted the software since it showed what he already -knew, while he ignored those predictions that he did not understand" - -Does the system just lend credibility to a biased human process? - - -Lally, Nick. "[“It makes almost no difference which algorithm you use”: on the modularity of predictive policing](http://www.nicklally.com/wp-content/uploads/2016/09/lallyModularityPP.pdf)." Urban Geography (2021): 1-19. - - - - - ---- -# Process Integration - ----- -## Fairness in Practice Today - -Lots of attention in academia and the media - -Lofty statements by big companies, mostly aspirational - -Strong push by a few invested engineers (internal activists) - -Some dedicated teams, mostly in Big Tech, mostly research focused - -Little institutional support, no broad practices - ----- -## Barriers to Fairness Work - - - - ----- -## Barriers to Fairness Work - -1. 
Rarely an organizational priority, mostly reactive (media pressure, regulators) - * Limited resources for proactive work - * Fairness work rarely required as a deliverable; low priority, ignorable - * No accountability for actually completing fairness work, unclear responsibilities - - -*What to do?* - ----- -## Barriers to Fairness Work - -2. Fairness work seen as ambiguous and too complicated for available resources (esp. outside Big Tech) - * Academic discussions and metrics too removed from real problems - * Fairness research evolves too fast - * Media attention keeps shifting, cannot keep up - * Too political - -*What to do?* - - ----- -## Barriers to Fairness Work - -3. Most fairness work done by volunteers outside official job functions - * Rarely rewarded in performance evaluations, promotions - * Activists seen as troublemakers - * Reliance on personal networks among interested parties - -*What to do?* - - ----- -## Barriers to Fairness Work - -4. Impact of fairness work difficult to quantify, making it hard to justify resource investment - * Does it improve sales? Did it avoid a PR disaster? Missing counterfactuals - * Fairness rarely monitored over time - * Fairness rarely a key performance indicator of the product - * Fairness requires a long-term perspective (feedback loops, rare disasters), but organizations focus on short-term goals - -*What to do?* - - ----- -## Barriers to Fairness Work - -5. Technical challenges - * Data privacy policies restrict data access for fairness analysis - * Bureaucracy - * Distinguishing unimportant user complaints from systemic bias issues, debugging bias issues - -6. Fairness concerns are project-specific, hard to transfer actionable insights and tools across teams - -*What to do?* - - ----- -## Improving Process Integration -- Aspirations - -Integrate proactive practices in development processes -- at both the model and system level! - -Move from individuals to institutional processes distributing the work - -Hold the entire organization accountable for taking fairness seriously - -*How?* - - - - ----- -## Improving Process Integration -- Examples - -1. Mandatory discussion of discrimination risks, protected attributes, and fairness goals in *requirements documents* -2. Required fairness reporting in addition to accuracy in automated *model evaluation* -3. Required internal/external fairness audit before *release* -4. Required fairness monitoring, oversight infrastructure in *operation* - ----- -## Improving Process Integration -- Examples - -5. Instituting fairness measures as *key performance indicators* of products -6. Assign clear responsibilities for who does what -7. Identify measurable fairness improvements, recognize them in performance evaluations - -*How to avoid pushback against bureaucracy?* - ----- -## Effect Culture Change - -Buy-in from management is crucial - -Show that fairness work is taken seriously through action (funding, hiring, audits, policies), not just lofty mission statements - -Reported success strategies: -1. Frame fairness work as financially profitable, avoiding rework and reputation cost -2. Demonstrate concrete, quantified evidence of the benefits of fairness work -3. Continuous internal activism and education initiatives -4. 
External pressure from customers and regulators - - ----- -## Assigning Responsibilities - -Hire/educate T-shaped professionals - -Have dedicated fairness expert(s) consulting with teams, performing/guiding audits, etc. - -Not everybody will be a fairness expert, but ensure base-level awareness of when to seek help - - ----- -## Aspirations - -
- -> "They imagined that organizational leadership would understand, support, and engage deeply with responsible AI concerns, which would be contextualized within their organizational context. Responsible AI would be prioritized as part of the high-level organizational mission and then translated into actionable goals down at the individual levels through established processes. Respondents wanted the spread of information to go through well-established channels so that people know where to look and how to share information." - -
- - -From Rakova, Bogdana, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. "Where responsible AI meets reality: Practitioner perspectives on enablers for shifting organizational practices." Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW1 (2021): 1-23. - ----- -## Burnout is a Real Danger - -Unsupported fairness work is frustrating and often ineffective - -> “However famous the company is, it’s not worth being in a work situation where you don’t feel like your entire company, or at least a significant part of your company, is trying to do this with you. Your job is not to be paid lots of money to point out problems. Your job is to help them make their product better. And if you don’t believe in the product, then don’t work there.” -- Rumman Chowdhury via [Melissa Heikkilä](https://www.technologyreview.com/2022/11/01/1062474/how-to-survive-as-an-ai-ethicist/) - - - - - - - ---- -# Best Practices - ----- -## Best Practices - -**Best practices are emerging and evolving** - -Start early, be proactive - -Scrutinize data collection and labeling - -Invest in requirements engineering and design - -Invest in education - -Assign clear responsibilities, demonstrate leadership buy-in - ----- -## Many Tutorials, Checklists, Recommendations - -Tutorials (fairness notions, sources of bias, process recom.): -* [Fairness in Machine Learning](https://vimeo.com/248490141), [Fairness-Aware Machine Learning in Practice](https://sites.google.com/view/fairness-tutorial) -* [Challenges of Incorporating Algorithmic Fairness into Industry Practice](https://www.microsoft.com/en-us/research/video/fat-2019-translation-tutorial-challenges-of-incorporating-algorithmic-fairness-into-industry-practice/) - -Checklist: -* Microsoft’s [AI Fairness Checklist](https://www.microsoft.com/en-us/research/project/ai-fairness-checklist/): concrete questions, concrete steps throughout all stages, including deployment and monitoring - - - - - - - - - - - - - ---- -# Summary - -* Requirements engineering for fair ML systems - * Identify potential harms, protected attributes - * Negotiate conflicting fairness goals, tradeoffs - * Consider societal implications -* Apply fair data collection practices -* Anticipate feedback loops -* Operationalize & monitor for fairness metrics -* Design fair systems beyond the model, mitigate bias outside the model -* Integrate fairness work in process and culture - - ----- -## Further Readings - -
- -- 🗎 Rakova, Bogdana, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. "[Where responsible AI meets reality: Practitioner perspectives on enablers for shifting organizational practices](https://arxiv.org/abs/2006.12358)." *Proceedings of the ACM on Human-Computer Interaction* 5, no. CSCW1 (2021): 1-23. -- 🗎 Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. "[Model cards for model reporting](https://arxiv.org/abs/1810.03993)." In *Proceedings of the conference on fairness, accountability, and transparency*, pp. 220-229. 2019. -- 🗎 Boyd, Karen L. "[Datasheets for Datasets help ML Engineers Notice and Understand Ethical Issues in Training Data](http://karenboyd.org/Datasheets_Help_CSCW.pdf)." Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW2 (2021): 1-27. -- 🗎 Bietti, Elettra. "[From ethics washing to ethics bashing: a view on tech ethics from within moral philosophy](https://dl.acm.org/doi/pdf/10.1145/3351095.3372860)." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 210-219. 2020. -- 🗎 Madaio, Michael A., Luke Stark, Jennifer Wortman Vaughan, and Hanna Wallach. "[Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI](http://www.jennwv.com/papers/checklists.pdf)." In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2020. -- 🗎 Hopkins, Aspen, and Serena Booth. "[Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development](http://www.slbooth.com/papers/AIES-2021_Hopkins_and_Booth.pdf)." In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’21) (2021). -- 🗎 Metcalf, Jacob, and Emanuel Moss. "[Owning ethics: Corporate logics, silicon valley, and the institutionalization of ethics](https://datasociety.net/wp-content/uploads/2019/09/Owning-Ethics-PDF-version-2.pdf)." *Social Research: An International Quarterly* 86, no. 2 (2019): 449-476. - -
diff --git a/lectures/18_system_fairness/temi.png b/lectures/18_system_fairness/temi.png deleted file mode 100644 index 29ce2dd5..00000000 Binary files a/lectures/18_system_fairness/temi.png and /dev/null differ diff --git a/lectures/18_system_fairness/user-review1.png b/lectures/18_system_fairness/user-review1.png deleted file mode 100644 index 291a81c6..00000000 Binary files a/lectures/18_system_fairness/user-review1.png and /dev/null differ diff --git a/lectures/18_system_fairness/user-review2.png b/lectures/18_system_fairness/user-review2.png deleted file mode 100644 index 80fadfdb..00000000 Binary files a/lectures/18_system_fairness/user-review2.png and /dev/null differ diff --git a/lectures/18_system_fairness/what-if-tool.png b/lectures/18_system_fairness/what-if-tool.png deleted file mode 100644 index af3da32f..00000000 Binary files a/lectures/18_system_fairness/what-if-tool.png and /dev/null differ diff --git a/lectures/18_system_fairness/worldvsmachine.svg b/lectures/18_system_fairness/worldvsmachine.svg deleted file mode 100644 index d1ecb609..00000000 --- a/lectures/18_system_fairness/worldvsmachine.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/19_explainability/accuracy_explainability.png b/lectures/19_explainability/accuracy_explainability.png deleted file mode 100644 index a4a6c35b..00000000 Binary files a/lectures/19_explainability/accuracy_explainability.png and /dev/null differ diff --git a/lectures/19_explainability/adversarialexample.png b/lectures/19_explainability/adversarialexample.png deleted file mode 100644 index ba5abb74..00000000 Binary files a/lectures/19_explainability/adversarialexample.png and /dev/null differ diff --git a/lectures/19_explainability/badges.png b/lectures/19_explainability/badges.png deleted file mode 100644 index 59f50d73..00000000 Binary files a/lectures/19_explainability/badges.png and /dev/null differ diff --git a/lectures/19_explainability/cancerdialog.png b/lectures/19_explainability/cancerdialog.png deleted file mode 100644 index 63619a7d..00000000 Binary files a/lectures/19_explainability/cancerdialog.png and /dev/null differ diff --git a/lectures/19_explainability/cancerpred.png b/lectures/19_explainability/cancerpred.png deleted file mode 100644 index 5ebc2552..00000000 Binary files a/lectures/19_explainability/cancerpred.png and /dev/null differ diff --git a/lectures/19_explainability/cat.png b/lectures/19_explainability/cat.png deleted file mode 100644 index 6743b764..00000000 Binary files a/lectures/19_explainability/cat.png and /dev/null differ diff --git a/lectures/19_explainability/cheyneylibrary.jpeg b/lectures/19_explainability/cheyneylibrary.jpeg deleted file mode 100644 index ad0113b0..00000000 Binary files a/lectures/19_explainability/cheyneylibrary.jpeg and /dev/null differ diff --git a/lectures/19_explainability/compas_screenshot.png b/lectures/19_explainability/compas_screenshot.png deleted file mode 100644 index 79cb7e83..00000000 Binary files a/lectures/19_explainability/compas_screenshot.png and /dev/null differ diff --git a/lectures/19_explainability/conceptbottleneck.png b/lectures/19_explainability/conceptbottleneck.png deleted file mode 100644 index f493139c..00000000 Binary files a/lectures/19_explainability/conceptbottleneck.png and /dev/null differ diff --git a/lectures/19_explainability/emeter.png b/lectures/19_explainability/emeter.png deleted file mode 100644 index 11b2ef77..00000000 Binary files a/lectures/19_explainability/emeter.png and /dev/null differ diff --git 
a/lectures/19_explainability/expl_confidence.png b/lectures/19_explainability/expl_confidence.png deleted file mode 100644 index a107a3e8..00000000 Binary files a/lectures/19_explainability/expl_confidence.png and /dev/null differ diff --git a/lectures/19_explainability/explainability.md b/lectures/19_explainability/explainability.md deleted file mode 100644 index 9a2b4d0b..00000000 --- a/lectures/19_explainability/explainability.md +++ /dev/null @@ -1,1461 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Explainability and Interpretability" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - - -# Explainability and Interpretability - - ---- -## Explainability as Building Block in Responsible Engineering - -![Overview of course content](../_assets/overview.svg) - - - - ----- -## "Readings" - -Required (one of): -* 🎧 Data Skeptic Podcast Episode “[Black Boxes are not Required](https://dataskeptic.com/blog/episodes/2020/black-boxes-are-not-required)” with Cynthia Rudin (32min) -* 🗎 Rudin, Cynthia. "[Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead](https://arxiv.org/abs/1811.10154)." Nature Machine Intelligence 1, no. 5 (2019): 206-215. - -Recommended supplementary reading: -* 🕮 Christoph Molnar. "[Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/)." 2019 - ----- -# Learning Goals - -* Understand the importance of and use cases for interpretability -* Explain the tradeoffs between inherently interpretable models and post-hoc explanations -* Measure the interpretability of a model -* Select and apply techniques to debug/provide explanations for data, models, and model predictions -* Evaluate when to use interpretable models rather than post-hoc explanations - ---- -# Motivating Examples - - ----- - -![Turtle recognized as gun](gun.png) - - - ----- - -![Adversarial examples](adversarialexample.png) - - - -Image: Gong, Yuan, and Christian Poellabauer. "[An overview of vulnerabilities of voice controlled systems](https://arxiv.org/pdf/1803.09156.pdf)." arXiv preprint arXiv:1803.09156 (2018). - ----- -## Detecting Anomalous Commits - -[![Reported commit](nodejs-unusual-commit.png)](nodejs-unusual-commit.png) - - - -Goyal, Raman, Gabriel Ferreira, Christian Kästner, and James Herbsleb. "[Identifying unusual commits on GitHub](https://www.cs.cmu.edu/~ckaestne/pdf/jsep17.pdf)." Journal of Software: Evolution and Process 30, no. 1 (2018): e1893. - ----- -## Is this recidivism model fair? - -```fortran -IF age between 18–20 and sex is male THEN - predict arrest -ELSE IF age between 21–23 and 2–3 prior offenses THEN - predict arrest -ELSE IF more than three priors THEN - predict arrest -ELSE - predict no arrest -``` - - - -Rudin, Cynthia. "[Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead](https://arxiv.org/abs/1811.10154)." Nature Machine Intelligence 1, no. 5 (2019): 206-215. - ----- -## How to interpret the results? - -![Screenshot of the COMPAS tool](compas_screenshot.png) - - - -Image source (CC BY-NC-ND 4.0): Christin, Angèle. (2017). Algorithms in practice: Comparing web journalism and criminal justice. Big Data & Society. 4. - ----- -## How to consider seriousness of the crime? - -![Recidivism scoring systems](recidivism_scoring.png) - - - -Rudin, Cynthia, and Berk Ustun. "[Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice](https://users.cs.duke.edu/~cynthia/docs/WagnerPrizeCurrent.pdf)." Interfaces 48, no. 5 (2018): 449-466. - ----- -## What factors go into predicting stroke risk? - -![Scoring system for stroke risk](scoring.png) - - - -Rudin, Cynthia, and Berk Ustun. "[Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice](https://users.cs.duke.edu/~cynthia/docs/WagnerPrizeCurrent.pdf)." Interfaces 48, no. 5 (2018): 449-466. - ----- -## Is there an actual problem? How to find out? - -
- ----- - -
- ---- -![News headline: Stanford algorithm for vaccine priority controversy](stanford.png) - - ----- -![The "algorithm" used at Stanford](stanfordalgorithm.png) - ----- -## Explaining Decisions - -Cat? Dog? Lion? -- Confidence? Why? - -![Cat](cat.png) - - ----- -## What's happening here? - -![Perceptron](mlperceptron.svg) - - - ----- -## Explaining Decisions - -[![Slack Notifications Decision Tree](slacknotifications.jpg)](slacknotifications.jpg) - - ----- -## Explainability in ML - -Explain how the model made a decision - - Rules, cutoffs, reasoning? - - What are the relevant factors? - - Why those rules/cutoffs? - -Challenging because models are too complex and based on data - - Can we understand the rules? - - Can we understand why these rules? - - - - - ---- -# Why Explainability? - ----- -## Why Explainability? - - - ----- -## Debugging - - - -* Why did the system make a wrong prediction in this case? -* What does it actually learn? -* What data makes it better? -* How reliable/robust is it? -* How much does the second model rely on the outputs of the first? -* Understanding edge cases - - - -![Turtle recognized as gun](gun.png) - - - -**Debugging is the most common use in practice** (Bhatt et al. "Explainable machine learning in deployment." In Proc. FAccT. 2020.) - - ----- -## Auditing - -* Understand safety implications -* Ensure predictions use objective criteria and reasonable rules -* Inspect fairness properties -* Reason about biases and feedback loops -* Validate "learned specifications/requirements" with stakeholders - -```fortran -IF age between 18–20 and sex is male THEN predict arrest -ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest -ELSE IF more than three priors THEN predict arrest -ELSE predict no arrest -``` - ----- -## Trust - - - -More likely to accept a prediction if it is clear how it is made, e.g., - * Model reasoning matches intuition; reasoning meets fairness criteria - * Features are difficult to manipulate - * Confidence that the model generalizes beyond the target distribution - - - -![Trust model](trust.png) - - - - -Conceptual model of trust: R. C. Mayer, J. H. Davis, and F. D. Schoorman. An integrative model of organizational trust. Academy of Management Review, 20(3):709–734, July 1995. - - - ----- -## Actionable Insights to Improve Outcomes - -> "What can I do to get the loan?" - -> "How can I change my message to get more attention on Twitter?" - -> "Why is my message considered as spam?" - ----- -## Regulation / Legal Requirements - - -> The EU General Data Protection Regulation extends the automated decision-making rights [...] to provide a legally disputed form of a **right to an explanation**: "[the data subject should have] the right ... to obtain an explanation of the decision reached" - - - -> The US Equal Credit Opportunity Act requires notifying applicants of action taken with specific reasons: "The statement of reasons for adverse action required by paragraph (a)(2)(i) of this section must be specific and indicate the principal reason(s) for the adverse action." - - - -See also https://en.wikipedia.org/wiki/Right_to_explanation - ----- -## Curiosity, learning, discovery, science - -![Statistical modeling from Badges Paper](badges.png) - ----- -## Curiosity, learning, discovery, science - -![News article about using machine learning for detecting smells](robotnose.png) - - - ----- -## Settings where Interpretability is *not* Important? - - - -Notes: -* Model has no significant impact (e.g., exploration, hobby) -* Problem is well studied? 
e.g., optical character recognition -* Security by obscurity? -- avoid gaming - - - - ----- -## Exercise: Debugging a Model - -Consider the following debugging challenges. In groups, discuss how you would debug the problem. In 3 min, report back to the class. - - -*Algorithm bad at recognizing some signs in some conditions:* -![Stop Sign with Bounding Box](stopsign.jpg) - -*Graduate appl. system seems to rank applicants from HBCUs low:* -![Cheyney University founded in 1837 is the oldest HBCU](cheyneylibrary.jpeg) - - - - -Left Image: CC BY-SA 4.0, Adrian Rosebrock - - - - - - - - - ---- -# Defining Interpretability - - -Christoph Molnar. "[Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/)." 2019 ----- -## Interpretability Definitions - -Two common approaches: - -> Interpretability is the degree to which a human can understand the cause of a decision - -> Interpretability is the degree to which a human can consistently predict the model’s result. - -(No mathematical definition) - - -**How would you measure interpretability?** ----- -## Explanation - -Understanding a single prediction for a given input - -> Your loan application has been *declined*. If your *savings account* had had more than $100 your loan application would be *accepted*. - -Answer **why** questions, such as - * Why was the loan rejected? (justification) - * Why did the treatment not work for the patient? (debugging) - * Why is turnover higher among women? (general science question) - -**How would you measure explanation quality?** - ----- -## Intrinsic interpretability vs Post-hoc explanation? - - -Models simple enough to understand -(e.g., short decision trees, sparse linear models) - -![Scoring system](scoring.png) - -Explanation of an opaque model, local or global - -> Your loan application has been *declined*. If your *savings account* had more than $100 your loan application would be *accepted*. - - - ----- -## On Terminology - -Rudin's terminology and this lecture: - - Interpretable models: Intrinsically interpretable models - - Explainability: Post-hoc explanations - -Interpretability: property of a model - -Explainability: ability to explain the workings/predictions of a model - -Explanation: justification of a single prediction - -Transparency: The user is aware that a model is used / how it works - -*These terms are often used inconsistently or interchangeably* - - - -![Random letters](../_assets/onterminology.jpg) - - - - - - - - ---- -# Understanding a Model - -Levels of explanations: - -* **Understanding a model** -* Explaining a prediction -* Understanding the data - ----- -## Inherently Interpretable: Sparse Linear Models - -$f(x) = \alpha + \beta_1 x_1 + ... + \beta_n x_n$ - -Truthful explanations, easy to understand for humans - -Easy to derive contrastive explanations and feature importance - -Requires feature selection/regularization to reduce to a few important features (e.g., Lasso); possibly restricting possible parameter values
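A minimal sketch of this idea with scikit-learn (synthetic data; the alpha value is arbitrary): a strong L1 penalty zeroes out most coefficients, and the few that remain are the model's explanation.

```python
# Hedged sketch: sparse linear model via L1 regularization (sklearn)
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)
model = Lasso(alpha=5.0).fit(X, y)   # stronger alpha -> fewer nonzero coefficients

# Only a handful of coefficients survive; these form the interpretable model
for i, coef in enumerate(model.coef_):
    if abs(coef) > 1e-6:
        print(f"feature_{i}: {coef:+.1f}")
```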
 - ----- -## Score card: Sparse linear model with "round" coefficients - -![Scoring card](scoring.png) - - ----- -## Inherently Interpretable: Shallow Decision Trees - -Easy to interpret up to a certain size - -Possible to derive counterfactuals and feature importance - -Unstable with small changes to training data - - -```fortran -IF age between 18–20 and sex is male THEN predict arrest -ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest -ELSE IF more than three priors THEN predict arrest -ELSE predict no arrest -``` - ----- -## Not all Linear Models and Decision Trees are Inherently Interpretable - -* Models can be very big, with many parameters (factors, decisions) -* Nonlinear interactions possibly hard to grasp -* Tool support can help (views) -* Random forests, ensembles no longer easily interpretable - 
- -``` -173554.681081086 * root + 318523.818532818 * heuristicUnit + -103411.870761673 * eq + -24600.5000000002 * heuristicVsids + --11816.7857142856 * heuristicVmtf + -33557.8961038976 * heuristic + -95375.3513513509 * heuristicUnit * satPreproYes + -3990.79729729646 * transExt * satPreproYes + -136928.416666666 * eq * heuristicUnit + 12309.4990990994 * eq * satPreproYes + -33925.0833333346 * eq * heuristic + -643.428571428088 * backprop * heuristicVsids + -11876.2857142853 * backprop * -heuristicUnit + 1620.24242424222 * eq * backprop + -7205.2500000002 * eq * heuristicBerkmin + -2 * Num1 * Num2 + 10 * Num3 * Num4 -``` - -
- -Notes: Example of a performance influence model from http://www.fosd.de/SPLConqueror/ -- not the worst in terms of interpretability, but certainly not small or well formatted or easy to approach. - - ----- -## Inherently Interpretable: Decision Rules - -*if-then rules mined from data* - -easy to interpret if there are few, simple rules - - -see [association rule mining](https://en.wikipedia.org/wiki/Association_rule_mining): -```text -{Diaper, Beer} -> Milk (40% support, 66% confidence) -Milk -> {Diaper, Beer} (40% support, 50% confidence) -{Diaper, Beer} -> Bread (40% support, 66% confidence) -``` - - ----- -## Research in Inherently Interpretable Models - -Several approaches to learn sparse constrained models (e.g., fit score cards, simple if-then-else rules) - -Often heavy emphasis on feature engineering and domain-specificity - -Possibly computationally expensive - - - -Rudin, Cynthia. "[Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead](https://arxiv.org/abs/1811.10154)." Nature Machine Intelligence 1, no. 5 (2019): 206-215. - - - - - - - - - - - - - - - - - - - - - - ----- -## Post-Hoc Model Explanation: Global Surrogates - -1. Select dataset X (previous training set or new dataset from same distribution) -2. Collect model predictions for every value: $y_i=f(x_i)$ -3. Train *inherently interpretable* model $g$ on (X,Y) -4. Interpret surrogate model $g$ - - -Can measure how well $g$ fits $f$ with common model quality measures, typically $R^2$ - -**Advantages? Disadvantages?** - -Notes: -Flexible, intuitive, easy approach, easy to compare quality of surrogate model with validation data ($R^2$). -But: Insights not based on the real model; unclear how well a good surrogate model needs to fit the original model; surrogate may not be equally good for all subsets of the data; illusion of interpretability. -Why not use the surrogate model to begin with? - - ----- -## Advantages and Disadvantages of Surrogates? - - - - ----- -## Advantages and Disadvantages of Surrogates? - -* short, contrastive explanations possible -* useful for debugging -* easy to use; works on lots of different problems -* explanations may use different features than the original model -* -* explanation not necessarily truthful -* explanations may be unstable -* likely not sufficient for compliance scenarios - - ----- -## Post-Hoc Model Explanation: Feature Importance - - -![FI example](featureimportance.png) - - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - ----- -## Feature Importance - -* Permute a feature's values in validation data -> hide it for prediction -* Measure influence on accuracy -* -> This evaluates a feature's influence without retraining the model -* -* Highly compressed, *global* insights -* Effect for feature + interactions -* Can only be computed on labeled data, depends on model accuracy, randomness from permutation -* May produce unrealistic inputs when correlations exist - -(Can be evaluated both on training and validation data) - - -Note: Training vs validation is not an obvious answer and both cases can be made, see Molnar's book. Feature importance on the training data indicates which features the model has learned to use for predictions. - - - - - - - ----- -## Post-Hoc Model Explanation: Partial Dependence Plot (PDP) - - -![PDP Example](pdp.png) - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - -Note: bike rental data in DC
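Both techniques are available in scikit-learn's `inspection` module; a minimal sketch on synthetic data (model choice and feature index are arbitrary):

```python
# Hedged sketch: permutation feature importance and a partial dependence plot (sklearn)
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Permutation importance: accuracy drop when one feature's values are shuffled
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)            # one global score per feature

# Partial dependence: marginal effect of feature 0 on the prediction (needs matplotlib)
PartialDependenceDisplay.from_estimator(model, X, features=[0])
```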
 ----- -## Partial Dependence Plot - -* Computes the marginal effect of a feature on the predicted outcome -* Identifies the relationship between feature and outcome (linear, monotonous, complex, ...) -* -* Intuitive, easy interpretation -* Assumes no correlation among features - - - ----- -## Partial Dependence Plot for Interactions - - -![PDP Example](pdp2.png) - - - -Probability of cancer; source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - - ----- -## Concept Bottleneck Models - - - -Hybrid/partially interpretable model - -Force models to learn features, not final predictions. Use an inherently interpretable model on those features - -Requires labeling features in training data - - - -![Illustration from the paper how CNN are used to extract features](conceptbottleneck.png) - - - - - -Koh, Pang Wei, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. "[Concept bottleneck models](http://proceedings.mlr.press/v119/koh20a/koh20a.pdf)." In Proc. ICML, 2020. - ----- -## Summary: Understanding a Model - -Understanding of the whole model, not individual predictions! - -Some models are inherently interpretable: -* Sparse linear models -* Shallow decision trees - -Ex-post explanations for opaque models: -* Global surrogate models -* Feature importance, partial dependence plots -* Many more in the literature - - - - - - ---- -# Explaining a Prediction - - -Levels of explanations: - -* Understanding a model -* **Explaining a prediction** -* Understanding the data - - ----- -## Understanding Predictions from Inherently Interpretable Models is Easy - -Derive key influence factors or decisions from model parameters - -Derive contrastive counterfactuals from models - -**Example:** Predict arrest for an 18-year-old male with one prior: - -```fortran -IF age between 18–20 and sex is male THEN predict arrest -ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest -ELSE IF more than three priors THEN predict arrest -ELSE predict no arrest -``` - - ----- -## Posthoc Prediction Explanation: Feature Influences - -*Which features were most influential for a specific prediction?* - - -![Lime Example](lime2.png) - - - - -Source: https://github.com/marcotcr/lime - ----- -## Feature Influences in Images - -![Lime Example](lime_cat.png) - - - -Source: https://github.com/marcotcr/lime - ----- -## Feature Importance vs Feature Influence - - - -Feature importance is global for the entire model (all predictions) - -![FI example](featureimportance.png) - - - - -Feature influence is for a single prediction - -![Lime Example](lime_cat.png) - - - ----- -## Feature Infl. with Local Surrogates (LIME) - -*Create an inherently interpretable model (e.g., a sparse linear model) for the area around a prediction* - -Lime approach: -* Create random samples in the area around the data point of interest -* Collect model predictions with $f$ for each sample -* Learn surrogate model $g$, weighing samples by distance -* Interpret surrogate model $g$ - - -Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "["Why should I trust you?" Explaining the predictions of any classifier](http://dust.ess.uci.edu/ppr/ppr_RSG16.pdf)." In Proc. International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144. 2016. - ----- - -![Lime Example](lime1.png) - - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/)." 2019 - -Note: The model distinguishes the blue from the gray area. The surrogate model learns only a white line for the nearest decision boundary, which may be good enough for local explanations.
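A minimal sketch of this recipe with the `lime` package (synthetic data; feature and class names are made up):

```python
# Hedged sketch: local surrogate explanation with the lime package
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=[f"f{i}" for i in range(4)],
                                 class_names=["no", "yes"])
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(exp.as_list())   # per-feature weights of the local surrogate around X[0]
```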
"[Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/)." 2019 - -Note: Model distinguishes blue from gray area. Surrogate model learns only a while line for the nearest decision boundary, which may be good enough for local explanations. - - ----- -## LIME Example -![Lime Example](lime_wolf.png) - - - -Source: Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "["Why should I trust you?" Explaining the predictions of any classifier](http://dust.ess.uci.edu/ppr/ppr_RSG16.pdf)." In Proc. KDD. 2016. - ----- -## Advantages and Disadvantages of Local Surrogates? - - - - ----- -## Posthoc Prediction Explanation: Shapley Values / SHAP - -
- -* Game-theoretic foundation for local explanations (1953) -* Explains the contribution of each feature, over predictions with different feature subsets - - *"The Shapley value is the average marginal contribution of a feature value across all possible coalitions"* -* Solid theory ensures fair mapping of influence to features -* Requires heavy computation, usually only approximations feasible -* Explanations contain all features (i.e., not sparse) - -**Currently the most common local method used in practice** (see the sketch below) - - - - - -Lundberg, Scott M., and Su-In Lee. "[A unified approach to interpreting model predictions](https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf)." In Advances in Neural Information Processing Systems, pp. 4765-4774. 2017.
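A minimal sketch with the `shap` package (synthetic data; `TreeExplainer` is the fast variant for tree ensembles):

```python
# Hedged sketch: SHAP values for one prediction (shap package, synthetic data)
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)       # exact, fast variant for tree ensembles
shap_values = explainer.shap_values(X[:1])  # additive per-feature contributions
print(shap_values)                          # sums (with the base value) to the prediction
```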
 ----- -## Counterfactual Explanations - -*if X had not occurred, Y would not have happened* - -> Your loan application has been *declined*. If your *savings account* had had more than $100 your loan application would be *accepted*. - - --> The smallest change to feature values that results in a given output - - - ----- -## Multiple Counterfactuals -
- - - -Often long or multiple explanations - -> Your loan application has been *declined*. If your *savings account* ... - -> Your loan application has been *declined*. If you lived in ... - -Report all or select the "best" (e.g., shortest, most actionable, likely values) - - -*(Rashomon effect)* - -![Rashomon](rashomon.jpg) - - - - -
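A minimal sketch of the naive search idea (synthetic data; real tools use smarter search strategies): randomly perturb the instance, keep perturbations that flip the prediction, and report the closest one.

```python
# Hedged sketch: naive counterfactual search by random perturbation
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

x = X[0]                                    # instance to explain
rng = np.random.default_rng(0)
candidates = x + rng.normal(0, 1.0, size=(5000, x.size))  # random perturbations
flipped = candidates[model.predict(candidates) != model.predict([x])[0]]

# Report the closest counterfactual; many others exist (Rashomon effect)
best = flipped[np.argmin(np.linalg.norm(flipped - x, axis=1))]
print("change needed:", best - x)
```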
- ---- -## Searching for Counterfactuals? - - - ----- -## Searching for Counterfactuals - -Random search (with growing distance) possible, but inefficient - -Many search heuristics, e.g., hill climbing or Nelder–Mead; may use the gradient of the model if available - -Can incorporate distance in the loss function - -$$L(x,x^\prime,y^\prime,\lambda)=\lambda\cdot(\hat{f}(x^\prime)-y^\prime)^2+d(x,x^\prime)$$ - - -(similar to finding adversarial examples) - - ----- -![Adversarial examples](adversarialexample.png) - - ----- -## Discussion: Counterfactuals - -* Easy interpretation; can report both the alternative instance and the required change -* No access to model or data required, easy to implement -* -* Often many possible explanations (Rashomon effect), requires selection/ranking -* May require changes to many features, not all feasible -* May not find a counterfactual within a given distance -* Large search spaces, especially with high-cardinality categorical features - ----- -## Actionable Counterfactuals - -*Example: Denied loan application* - -* Customer wants feedback on how to get the loan approved -* Some suggestions are more actionable than others, e.g., - * Easier to change income than gender - * Cannot change the past, but can wait -* In the distance function, not all features may be weighted equally - ----- -## Similarity - - - -* k-Nearest Neighbors inherently interpretable (assuming an intuitive distance function) -* Attempts to build inherently interpretable image classification models based on similarity of fragments - - - -![Paper screenshot from "this looks like that paper"](thislookslikethat.png) - - - - - -Chen, Chaofan, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K. Su. "This looks like that: deep learning for interpretable image recognition." In NeurIPS (2019). - - ----- -## Summary: Understanding a Prediction - -Understanding a single prediction, not the model as a whole - -Explaining influences, providing counterfactuals and sufficient conditions, showing similar instances - -Easy on inherently interpretable models - -Ex-post explanations for opaque models: -* Feature influences (LIME, SHAP, attention maps) -* Searching for counterfactuals -* Similarity, k-NN - - - - - - - - - - - - - - - - - - - - - - - - - - ---- -# Understanding the Data - - -Levels of explanations: - -* Understanding a model -* Explaining a prediction -* **Understanding the data** - - - ----- -## Prototypes and Criticisms - -* A *prototype* is a data instance that is representative of all the data -* A *criticism* is a data instance not well represented by the prototypes - -![Example](prototype-dogs.png) - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - ----- -## Example: Prototypes and Criticisms? - -![Example](prototypes_without.png) - - - ----- -## Example: Prototypes and Criticisms - -![Example](prototypes.png) - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - ----- -## Example: Prototypes and Criticisms - -![Example](prototype-digits.png) - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - -Note: The number of digits is different in each set since the search was conducted globally, not per group. 
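A minimal sketch of the clustering idea behind such selections (k-means used as a simple stand-in for the k-medoids and MMD-critic methods covered next; synthetic data): prototypes are the real instances nearest each cluster center, criticisms the instances farthest from every prototype.

```python
# Hedged sketch: prototypes as instances nearest cluster centers, criticisms as
# instances farthest from any prototype (k-means stand-in for k-medoids/MMD-critic)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Prototype: the real instance closest to each cluster center
proto_idx = [int(np.argmin(np.linalg.norm(X - c, axis=1))) for c in km.cluster_centers_]

# Criticisms: the instances farthest from their nearest prototype
dist_to_proto = np.min([np.linalg.norm(X - X[i], axis=1) for i in proto_idx], axis=0)
crit_idx = np.argsort(dist_to_proto)[-3:]
print("prototypes:", proto_idx, "criticisms:", crit_idx.tolist())
```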
- - ---- -## Methods: Prototypes and Criticisms - -Clustering of data (a la k-means) - * k-medoids returns actual instances as centers for each cluster - * MMD-critic identifies both prototypes and criticisms - * see book for details - -Identify globally or per class - ----- -## Discussion: Prototypes and Criticisms - -* Easy to inspect data, useful for debugging outliers -* Generalizes to different kinds of data and problems -* Easy-to-implement algorithm -* -* Need to choose the number of prototypes and criticisms up front -* Uses all features, not just features important for prediction - - - ----- -## Influential Instance - -**Data debugging:** *What data most influenced the training?* - -![Example](influentialinstance.png) - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - ----- -## Influential Instances - -**Data debugging:** *What data most influenced the training? Is the model skewed by a few outliers?* - -Approach: -* Given training data with $n$ instances... -* ... train model $f$ with all $n$ instances -* ... train model $g$ with $n-1$ instances -* If $f$ and $g$ differ significantly, the omitted instance was influential - - The difference can be measured, e.g., in accuracy or in the difference in parameters - -Note: Instead of understanding a single model, comparing multiple models trained on different data - - ----- -## Influential Instances Discussion - -Retraining for every data point is simple but expensive (see the sketch below) - -For some classes of models, the influence of data points can be computed without retraining (e.g., logistic regression), see book for details - -Hard to generalize to taking out multiple instances together - -Useful model-agnostic debugging tool for models and data - - - -Christoph Molnar. "[Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/)." 2019 - ----- -## Three Concepts - -**Feature importance:** How much does the model rely on a feature, across all predictions? - -**Feature influence:** How much does a specific prediction rely on a feature? - -**Influential instance:** How much does the model rely on a single training data instance? - ----- -## Summary: Understanding the Data - -Understand the characteristics of the data used to train the model - -Many data exploration and data debugging techniques: -* Criticisms and prototypes -* Influential instances -* many others... - - - - - - - - - - -
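A minimal sketch of the leave-one-out approach referenced above (synthetic data, logistic regression; influence measured as parameter change):

```python
# Hedged sketch: influential instances via leave-one-out retraining (synthetic data)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
full = LogisticRegression().fit(X, y)

influence = []
for i in range(len(X)):                      # expensive: one retraining per instance
    mask = np.arange(len(X)) != i
    g = LogisticRegression().fit(X[mask], y[mask])
    influence.append(np.linalg.norm(full.coef_ - g.coef_))  # parameter change

print("most influential instance:", int(np.argmax(influence)))
```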
---
## Breakout: Debugging with Explanations

In groups, discuss which explainability approaches may help and why. Tagging group members, write to `#lecture`.

*Algorithm bad at recognizing some signs in some conditions:*
![Stop Sign with Bounding Box](stopsign.jpg)

*Graduate application system seems to rank applicants from HBCUs low:*
![Cheyney University, founded in 1837, is the oldest HBCU](cheyneylibrary.jpeg)

Left Image: CC BY-SA 4.0, Adrian Rosebrock
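----
## Sketch: Probing the Sign Detector with a Local Surrogate

For the stop-sign breakout, one way to start debugging is a local surrogate explanation. A sketch with the `lime` package; `img` (an HxWx3 numpy array) and `classifier_fn` (a batch-of-images-to-class-probabilities function) are hypothetical and assumed to exist:

```python
# Sketch: highlight which image regions drive a (mis)prediction.
# Assumes `img` and `classifier_fn` exist; requires the lime package.
import numpy as np
from lime import lime_image

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    img.astype(np.double),   # the misclassified stop-sign image
    classifier_fn,           # model under test
    top_labels=3,
    num_samples=1000)        # perturbed samples to fit the surrogate

# superpixels that most support the top predicted label
image, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5)
```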
---
# Explanations and User Interaction Design

[People + AI Guidebook](https://pair.withgoogle.com/research/), Google

----
## How to Present Explanations?

![Explanatory debugging example](expldebugging.png)

Kulesza, Todd, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. "Principles of Explanatory Debugging to Personalize Interactive Machine Learning." In Proc. IUI, 2015.

----
![Positive example](https://pair.withgoogle.com/assets/ET1_aim-for.png)

![Negative example](https://pair.withgoogle.com/assets/ET1_avoid.png)

Tell the user when a lack of data might mean they’ll need to use their own judgment. Don’t be afraid to admit when a lack of data could affect the quality of the AI recommendations.

Source: [People + AI Guidebook](https://pair.withgoogle.com/research/), Google

----
![Positive example](https://pair.withgoogle.com/assets/ET3_aim-for.png)

![Negative example](https://pair.withgoogle.com/assets/ET3_avoid.png)

Give the user details about why a prediction was made in a high-stakes scenario. Here, the user is exercising after an injury and needs confidence in the app’s recommendation.

Source: [People + AI Guidebook](https://pair.withgoogle.com/research/), Google

----
![Explanations wrt Confidence](expl_confidence.png)

**Example each?**

Source: [People + AI Guidebook](https://pair.withgoogle.com/research/), Google

---
# Beyond "Just" Explaining the Model

Cai, Carrie J., Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. ""Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 1-24.

----
## Setting: Cancer Imaging -- What explanations do radiologists want?

![](cancerpred.png)

* *Past attempts were often not successful at bringing tools into production. Radiologists do not trust them. Why?*
* [Wizard of Oz study](https://en.wikipedia.org/wiki/Wizard_of_Oz_experiment) to elicit requirements

----
![Wizard of Oz](wizardofoz.jpg)

----
![Shown predictions in prostate cancer study](cancerdialog.png)

----
## Radiologists' Questions

* How does it perform compared to human experts?
* "What is difficult for the AI to know? Where is it too sensitive? What criteria is it good at recognizing or not good at recognizing?"
* What data (volume, types, diversity) was the model trained on?
* "Does the AI have access to information that I don’t have? Does it have access to ancillary studies?" Is all used data shown in the UI?
* What kind of things is the AI looking for? What is it capable of learning? ("Maybe light and dark? Maybe colors? Maybe shapes, lines?", "Does it take into consideration the relationship between gland and stroma? Nuclear relationship?")
* "Does it have a bias a certain way?" (compared to colleagues)

----
## Radiologists' Questions

* Capabilities and limitations: performance, strengths, limitations; e.g., how does it handle well-known edge cases
* Functionality: What data is used for predictions, how much context, how the data is used
* Medical point of view: calibration, how liberal/conservative when grading cancer severity
* Design objectives: Designed for few false positives or few false negatives? Tuned to compensate for human error?
* Other considerations: legal liability, impact on workflow, cost of use

Further details: [Paper, Table 1](https://dl.acm.org/doi/pdf/10.1145/3359206)

----
## Insights

AI literacy is important for trust

Be transparent about the data used

Describe training data and capabilities

Give a mental model, examples, human-relatable test cases

Communicate the AI’s point of view and design goal

Cai, Carrie J., Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. ""Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 1-24.

---
# The Dark Side of Explanations

----
## Many explanations are wrong

Approximations of black-box models, often unstable

Explanations are necessarily partial and social

Often multiple explanations possible (Rashomon effect)

Possible to use inherently interpretable models instead?

When an explanation is desired/required: What quality is needed/acceptable?

----
## Explanations Foster Trust

Users are less likely to question the model when explanations are provided
* Even if explanations are unreliable
* Even if explanations are nonsensical/incomprehensible

**Danger of overtrust and intentional manipulation**

Stumpf, Simone, Adrian Bussone, and Dympna O’Sullivan. "Explanations considered harmful? User interactions with machine learning systems." In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 2016.

----
![Paper screenshot of experiment user interface](emeter.png)

Springer, Aaron, Victoria Hollis, and Steve Whittaker. "Dice in the black box: User experiences with an inscrutable algorithm." In 2017 AAAI Spring Symposium Series. 2017.

----
![3 conditions of the experiment with different explanation designs](explanationexperimentgame.png)

(a) Rationale, (b) Stating the prediction, (c) Numerical internal values

Observation: Both experts and non-experts overtrust numerical explanations, even when inscrutable.

Ehsan, Upol, Samir Passi, Q. Vera Liao, Larry Chan, I. Lee, Michael Muller, and Mark O. Riedl. "The who in explainable AI: How AI background shapes perceptions of AI explanations." arXiv preprint arXiv:2107.13509 (2021).

---
# "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead."

----
## Accuracy vs Explainability Conflict?

![Accuracy vs Explainability Sketch](accuracy_explainability.png)

Graphic from the DARPA XAI BAA (Explainable Artificial Intelligence)

----
## Faithfulness of Ex-Post Explanations

----
## CORELS’ model for recidivism risk prediction

```fortran
IF age between 18–20 and sex is male THEN predict arrest
ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest
ELSE IF more than three priors THEN predict arrest
ELSE predict no arrest
```

Simple, interpretable model with accuracy comparable to the proprietary COMPAS model

Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1.5 (2019): 206-215. ([Preprint](https://arxiv.org/abs/1811.10154))
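----
## Sketch: Learning a Small Interpretable Model

To make the point concrete: a sparse, shallow model can often be learned directly with standard tooling (a sketch with a depth-limited decision tree on synthetic stand-in features; this is not the CORELS algorithm):

```python
# Sketch: learn a model whose entire decision logic is inspectable,
# instead of explaining a black box after the fact. Feature names are
# hypothetical stand-ins for the recidivism example; not CORELS.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

tree = DecisionTreeClassifier(max_depth=3)  # depth limit enforces interpretability
tree.fit(X, y)

# the entire decision logic fits on one slide:
print(export_text(tree, feature_names=[
    "age", "priors", "sex", "employment", "prior_violence"]))
```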
----
## "Stop explaining ..."

Hypotheses:
* It is a myth that there is necessarily a trade-off between accuracy and interpretability (when meaningful features are available)
* Explainable ML methods provide explanations that are not faithful to what the original model computes
* Explanations often do not make sense, or do not provide enough detail to understand what the black box is doing
* Black box models are often not compatible with situations where information outside the database needs to be combined with a risk assessment
* Black box models with explanations can lead to an overly complicated decision pathway that is ripe for human error
- - - -Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1.5 (2019): 206-215. ([Preprint](https://arxiv.org/abs/1811.10154)) - ----- -## Prefer Interpretable Models over Post-Hoc Explanations - - -* Interpretable models provide faithful explanations - * post-hoc explanations may provide limited insights or illusion of understanding - * interpretable models can be audited -* Inherently interpretable models in many cases have similar accuracy -* Larger focus on feature engineering, more effort, but insights into when and *why* the model works -* Less research on interpretable models and some methods computationally expensive - ----- -## ProPublica Controversy - -![ProPublica Article](recidivism-propublica.png) - - -Notes: "ProPublica’s linear model was not truly an -“explanation” for COMPAS, and they should not have concluded that their explanation model uses the same -important features as the black box it was approximating." ----- -## ProPublica Controversy - - -```fortran -IF age between 18–20 and sex is male THEN - predict arrest -ELSE IF age between 21–23 and 2–3 prior offenses THEN - predict arrest -ELSE IF more than three priors THEN - predict arrest -ELSE - predict no arrest -``` - - - -Rudin, Cynthia. "[Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead](https://arxiv.org/abs/1811.10154)." Nature Machine Intelligence 1, no. 5 (2019): 206-215. - - ----- -## Drawbacks of Interpretable Models - -Intellectual property protection harder - - may need to sell model, not license as service - - who owns the models and who is responsible for their mistakes? - -Gaming possible; "security by obscurity" not a defense - -Expensive to build (feature engineering effort, debugging, computational costs) - -Limited to fewer factors, may discover fewer patterns, lower accuracy - - - - ---- -# Summary - -
- -* Interpretability useful for many scenarios: user feedback, debugging, fairness audits, science, ... -* Defining and measuring interpretability - * Explaining the model - * Explaining predictions - * Understanding the data -* Inherently interpretable models: sparse regressions, shallow decision trees -* Providing ex-post explanations of opaque models: global and local surrogates, dependence plots and feature importance, anchors, counterfactual explanations, criticisms, and influential instances -* Consider implications on user interface design -* Gaming and manipulation with explanations - -
- ----- -## Further Readings - -
- - -* Christoph Molnar. “[Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/).” 2019 -* Google PAIR. [People + AI Guidebook](https://pair.withgoogle.com/guidebook/). 2019. -* Cai, Carrie J., Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. “[”Hello AI”: Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making](https://dl.acm.org/doi/abs/10.1145/3359206).” Proceedings of the ACM on Human-computer Interaction 3, no. CSCW (2019): 1–24. -* Kulesza, Todd, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. “[Principles of explanatory debugging to personalize interactive machine learning](https://core.ac.uk/download/pdf/190821828.pdf).” In Proceedings of the 20th International Conference on Intelligent User Interfaces, pp. 126–137. 2015. -* Amershi, Saleema, Max Chickering, Steven M. Drucker, Bongshin Lee, Patrice Simard, and Jina Suh. “[Modeltracker: Redesigning performance analysis tools for machine learning](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.697.1689&rep=rep1&type=pdf).” In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 337–346. 2015. - -
\ No newline at end of file diff --git a/lectures/19_explainability/explanationexperimentgame.png b/lectures/19_explainability/explanationexperimentgame.png deleted file mode 100644 index 83cf10c8..00000000 Binary files a/lectures/19_explainability/explanationexperimentgame.png and /dev/null differ diff --git a/lectures/19_explainability/expldebugging.png b/lectures/19_explainability/expldebugging.png deleted file mode 100644 index 99492e84..00000000 Binary files a/lectures/19_explainability/expldebugging.png and /dev/null differ diff --git a/lectures/19_explainability/featureimportance.png b/lectures/19_explainability/featureimportance.png deleted file mode 100644 index cf9c0620..00000000 Binary files a/lectures/19_explainability/featureimportance.png and /dev/null differ diff --git a/lectures/19_explainability/gun.png b/lectures/19_explainability/gun.png deleted file mode 100644 index d14bdf4c..00000000 Binary files a/lectures/19_explainability/gun.png and /dev/null differ diff --git a/lectures/19_explainability/influentialinstance.png b/lectures/19_explainability/influentialinstance.png deleted file mode 100644 index 9da24ed7..00000000 Binary files a/lectures/19_explainability/influentialinstance.png and /dev/null differ diff --git a/lectures/19_explainability/lime1.png b/lectures/19_explainability/lime1.png deleted file mode 100644 index ee185d18..00000000 Binary files a/lectures/19_explainability/lime1.png and /dev/null differ diff --git a/lectures/19_explainability/lime2.png b/lectures/19_explainability/lime2.png deleted file mode 100644 index 63bc05a5..00000000 Binary files a/lectures/19_explainability/lime2.png and /dev/null differ diff --git a/lectures/19_explainability/lime_cat.png b/lectures/19_explainability/lime_cat.png deleted file mode 100644 index 17a7391b..00000000 Binary files a/lectures/19_explainability/lime_cat.png and /dev/null differ diff --git a/lectures/19_explainability/lime_wolf.png b/lectures/19_explainability/lime_wolf.png deleted file mode 100644 index fe2426fe..00000000 Binary files a/lectures/19_explainability/lime_wolf.png and /dev/null differ diff --git a/lectures/19_explainability/mlperceptron.svg b/lectures/19_explainability/mlperceptron.svg deleted file mode 100644 index 69feea0c..00000000 --- a/lectures/19_explainability/mlperceptron.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/19_explainability/nodejs-unusual-commit.png b/lectures/19_explainability/nodejs-unusual-commit.png deleted file mode 100644 index af195477..00000000 Binary files a/lectures/19_explainability/nodejs-unusual-commit.png and /dev/null differ diff --git a/lectures/19_explainability/pdp.png b/lectures/19_explainability/pdp.png deleted file mode 100644 index 67ba194c..00000000 Binary files a/lectures/19_explainability/pdp.png and /dev/null differ diff --git a/lectures/19_explainability/pdp2.png b/lectures/19_explainability/pdp2.png deleted file mode 100644 index 716e9d52..00000000 Binary files a/lectures/19_explainability/pdp2.png and /dev/null differ diff --git a/lectures/19_explainability/prototype-digits.png b/lectures/19_explainability/prototype-digits.png deleted file mode 100644 index 0dce4668..00000000 Binary files a/lectures/19_explainability/prototype-digits.png and /dev/null differ diff --git a/lectures/19_explainability/prototype-dogs.png b/lectures/19_explainability/prototype-dogs.png deleted file mode 100644 index a8fdb02c..00000000 Binary files a/lectures/19_explainability/prototype-dogs.png and /dev/null differ diff --git 
a/lectures/19_explainability/prototypes.png b/lectures/19_explainability/prototypes.png deleted file mode 100644 index 492dbcdf..00000000 Binary files a/lectures/19_explainability/prototypes.png and /dev/null differ diff --git a/lectures/19_explainability/prototypes_without.png b/lectures/19_explainability/prototypes_without.png deleted file mode 100644 index e14fbf8f..00000000 Binary files a/lectures/19_explainability/prototypes_without.png and /dev/null differ diff --git a/lectures/19_explainability/rashomon.jpg b/lectures/19_explainability/rashomon.jpg deleted file mode 100644 index b8494521..00000000 Binary files a/lectures/19_explainability/rashomon.jpg and /dev/null differ diff --git a/lectures/19_explainability/recidivism-propublica.png b/lectures/19_explainability/recidivism-propublica.png deleted file mode 100644 index 5e871c64..00000000 Binary files a/lectures/19_explainability/recidivism-propublica.png and /dev/null differ diff --git a/lectures/19_explainability/recidivism_scoring.png b/lectures/19_explainability/recidivism_scoring.png deleted file mode 100644 index 82d5d818..00000000 Binary files a/lectures/19_explainability/recidivism_scoring.png and /dev/null differ diff --git a/lectures/19_explainability/robotnose.png b/lectures/19_explainability/robotnose.png deleted file mode 100644 index 55d025fd..00000000 Binary files a/lectures/19_explainability/robotnose.png and /dev/null differ diff --git a/lectures/19_explainability/scoring.png b/lectures/19_explainability/scoring.png deleted file mode 100644 index 37528717..00000000 Binary files a/lectures/19_explainability/scoring.png and /dev/null differ diff --git a/lectures/19_explainability/slacknotifications.jpg b/lectures/19_explainability/slacknotifications.jpg deleted file mode 100644 index 93543229..00000000 Binary files a/lectures/19_explainability/slacknotifications.jpg and /dev/null differ diff --git a/lectures/19_explainability/stanford.png b/lectures/19_explainability/stanford.png deleted file mode 100644 index 549cd647..00000000 Binary files a/lectures/19_explainability/stanford.png and /dev/null differ diff --git a/lectures/19_explainability/stanfordalgorithm.png b/lectures/19_explainability/stanfordalgorithm.png deleted file mode 100644 index a7b13bc1..00000000 Binary files a/lectures/19_explainability/stanfordalgorithm.png and /dev/null differ diff --git a/lectures/19_explainability/stopsign.jpg b/lectures/19_explainability/stopsign.jpg deleted file mode 100644 index cb76fe96..00000000 Binary files a/lectures/19_explainability/stopsign.jpg and /dev/null differ diff --git a/lectures/19_explainability/thislookslikethat.png b/lectures/19_explainability/thislookslikethat.png deleted file mode 100644 index b8455993..00000000 Binary files a/lectures/19_explainability/thislookslikethat.png and /dev/null differ diff --git a/lectures/19_explainability/trust.png b/lectures/19_explainability/trust.png deleted file mode 100644 index 3b4da1e0..00000000 Binary files a/lectures/19_explainability/trust.png and /dev/null differ diff --git a/lectures/19_explainability/wizardofoz.jpg b/lectures/19_explainability/wizardofoz.jpg deleted file mode 100644 index d3c3aa21..00000000 Binary files a/lectures/19_explainability/wizardofoz.jpg and /dev/null differ diff --git a/lectures/20_transparency/airegulation.png b/lectures/20_transparency/airegulation.png deleted file mode 100644 index 4bde739b..00000000 Binary files a/lectures/20_transparency/airegulation.png and /dev/null differ diff --git a/lectures/20_transparency/bigdog.png 
b/lectures/20_transparency/bigdog.png deleted file mode 100644 index 35653be8..00000000 Binary files a/lectures/20_transparency/bigdog.png and /dev/null differ diff --git a/lectures/20_transparency/course-aligned.jpg b/lectures/20_transparency/course-aligned.jpg deleted file mode 100644 index 0054a2f6..00000000 Binary files a/lectures/20_transparency/course-aligned.jpg and /dev/null differ diff --git a/lectures/20_transparency/course-unaligned.jpg b/lectures/20_transparency/course-unaligned.jpg deleted file mode 100644 index ed0c840b..00000000 Binary files a/lectures/20_transparency/course-unaligned.jpg and /dev/null differ diff --git a/lectures/20_transparency/facebook.png b/lectures/20_transparency/facebook.png deleted file mode 100644 index fd9bd21c..00000000 Binary files a/lectures/20_transparency/facebook.png and /dev/null differ diff --git a/lectures/20_transparency/faceswap.png b/lectures/20_transparency/faceswap.png deleted file mode 100644 index 8fc09c0b..00000000 Binary files a/lectures/20_transparency/faceswap.png and /dev/null differ diff --git a/lectures/20_transparency/illusionofcontrol.png b/lectures/20_transparency/illusionofcontrol.png deleted file mode 100644 index c22159f1..00000000 Binary files a/lectures/20_transparency/illusionofcontrol.png and /dev/null differ diff --git a/lectures/20_transparency/npr_facialrecognition.png b/lectures/20_transparency/npr_facialrecognition.png deleted file mode 100644 index eb31a7f0..00000000 Binary files a/lectures/20_transparency/npr_facialrecognition.png and /dev/null differ diff --git a/lectures/20_transparency/responsibleai.png b/lectures/20_transparency/responsibleai.png deleted file mode 100644 index 1b3ec1f5..00000000 Binary files a/lectures/20_transparency/responsibleai.png and /dev/null differ diff --git a/lectures/20_transparency/stackoverflow.png b/lectures/20_transparency/stackoverflow.png deleted file mode 100644 index 9db6ea59..00000000 Binary files a/lectures/20_transparency/stackoverflow.png and /dev/null differ diff --git a/lectures/20_transparency/surveillance.png b/lectures/20_transparency/surveillance.png deleted file mode 100644 index 6cbdceec..00000000 Binary files a/lectures/20_transparency/surveillance.png and /dev/null differ diff --git a/lectures/20_transparency/teen-suicide-rate.png b/lectures/20_transparency/teen-suicide-rate.png deleted file mode 100644 index 0e04315e..00000000 Binary files a/lectures/20_transparency/teen-suicide-rate.png and /dev/null differ diff --git a/lectures/20_transparency/transparency.md b/lectures/20_transparency/transparency.md deleted file mode 100644 index 650b8706..00000000 --- a/lectures/20_transparency/transparency.md +++ /dev/null @@ -1,462 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Transparency and Accountability" -semester: Fall 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
## Machine Learning in Production

# Transparency and Accountability

---
## More Explainability, Policy, and Politics

![Overview of course content](../_assets/overview.svg)

----
## Readings

Required reading:
* Google PAIR. People + AI Guidebook. Chapter: [Explainability and Trust](https://pair.withgoogle.com/chapter/explainability-trust/). 2019.

Recommended reading:
* Metcalf, Jacob, and Emanuel Moss. "[Owning ethics: Corporate logics, silicon valley, and the institutionalization of ethics](https://datasociety.net/wp-content/uploads/2019/09/Owning-Ethics-PDF-version-2.pdf)." *Social Research: An International Quarterly* 86, no. 2 (2019): 449-476.

---
# Learning Goals

* Explain key concepts of transparency and trust
* Discuss whether and when transparency can be abused to game the system
* Design a system to include human oversight
* Understand common concepts and discussions of accountability/culpability
* Critique regulation and self-regulation approaches in ethical machine learning

---
# Transparency

Transparency: users know that the algorithm exists / users know how the algorithm works

----
## Case Study: Facebook's Feed Curation

![Facebook with and without filtering](facebook.png)

Eslami, Motahhare, et al. [I always assumed that I wasn't really that close to [her]: Reasoning about Invisible Algorithms in News Feeds](http://eslamim2.web.engr.illinois.edu/publications/Eslami_Algorithms_CHI15.pdf). In Proc. CHI, 2015.

----
## Case Study: Facebook's Feed Curation

* 62% of interviewees were not aware of the curation algorithm
* Surprise and anger when learning about curation

> "Participants were most upset when close friends and family were not shown in their feeds [...] participants often attributed missing stories to their friends’ decisions to exclude them rather than to the Facebook News Feed algorithm."

* Learning about the algorithm did not change satisfaction levels
* More active engagement, more feeling of control

Eslami, Motahhare, et al. [I always assumed that I wasn't really that close to [her]: Reasoning about Invisible Algorithms in News Feeds](http://eslamim2.web.engr.illinois.edu/publications/Eslami_Algorithms_CHI15.pdf). In Proc. CHI, 2015.

----
## The Dark Side of Transparency

* Users may feel influence and control, even with placebo controls
* Companies give vague, generic explanations to appease regulators

![Sensemaking in study on how humans interpret machine filters and controls they have over it](illusionofcontrol.png)

Vaccaro, Kristen, Dylan Huang, Motahhare Eslami, Christian Sandvig, Kevin Hamilton, and Karrie Karahalios. "The illusion of control: Placebo effects of control settings." In Proc CHI, 2018.

----
## Appropriate Level of Algorithmic Transparency

IP/Trade Secrets/Fairness/Perceptions/Ethics?

How to design? How much control to give?

---
# Gaming/Attacking the Model with Explanations?

*Does providing an explanation allow customers to 'hack' the system?*

* Loan applications?
* Apple FaceID?
* Recidivism?
* Auto grading?
* Cancer diagnosis?
* Spam detection?

----
## Gaming the Model with Explanations?

![Course assessment does not align with learning goals, leading to shallow learning](course-unaligned.jpg)

----
## Constructive Alignment in Teaching

![Course assessment does align with learning goals, leading to better learning](course-aligned.jpg)

see also Claus Brabrand. [Teaching Teaching & Understanding Understanding](https://www.youtube.com/watch?v=w6rx-GBBwVg&t=148s). Youtube 2009

----
## Gaming the Model with Explanations?

* A model prone to gaming uses weak proxy features
* Protection requires making the model hard to observe (e.g., expensive to query predictions)
* Protecting models is akin to "security by obscurity"
* *Good models rely on hard facts that relate causally to the outcome <- hard to game*

```haskell
IF age between 18–20 and sex is male THEN predict arrest
ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest
ELSE IF more than three priors THEN predict arrest
ELSE predict no arrest
```

---
# Human Oversight and Appeals

----
## Human Oversight and Appeals

* Unavoidable that ML models will make mistakes
* Users knowing about the model may not be comforting
* Inability to appeal a decision can be deeply frustrating
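----
## Sketch: Routing Low-Confidence Decisions to Human Review

One common design for keeping humans in the loop (a sketch; the threshold, queue, and function names are illustrative design choices, not a specific framework):

```python
# Sketch: act automatically only when the model is confident;
# escalate borderline cases to a human review queue that can override.
review_queue = []              # stand-in for a real ticketing system
CONFIDENCE_THRESHOLD = 0.9     # illustrative value, needs tuning

def decide(case_id: str, score: float) -> str:
    if score >= CONFIDENCE_THRESHOLD:
        return "approve"
    if score <= 1 - CONFIDENCE_THRESHOLD:
        return "deny"
    review_queue.append((case_id, score))  # human reviews, can override
    return "pending human review"

print(decide("loan-1", 0.97), decide("loan-2", 0.55))
```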
- ----- -## Capacity to keep humans in the loop? - -ML used because human decisions as a bottleneck - -ML used because human decisions biased and inconsistent - -**Do we have the capacity to handle complaints/appeals?** - -**Wouldn't reintroducing humans bring back biases and inconsistencies?** - ----- -## Designing Human Oversight - -Consider the entire system and consequences of mistakes - -Deliberately design mitigation strategies for handling mistakes - -Consider keeping humans in the loop, balancing harms and costs - * Provide pathways to appeal/complain? Respond to complains? - * Review mechanisms? Can humans override tool decision? - * Tracking telemetry, investigating common mistakes? - * Audit model and decision process rather than appeal individual outcomes? - - ---- -# Accountability and Culpability - -*Who is held accountable if things go wrong?* - ----- -## On Terminology - -* accountability, responsibility, liability, and culpability all overlap in common use -* often about assigning *blame* -- responsible for fixing or liable for paying for damages -* liability, culpability have *legal* connotation -* responsibility tends to describe *ethical* aspirations -* accountability often defined as oversight relationship, where actor is accountable to some "forum" that can impose penalties -* see also legal vs ethical earlier - -![Random letters](../_assets/onterminology.jpg) - ----- -## On Terminology - -Academic definition of accountability: - -> A relationship between an **actor** and a **forum**, -in which the actor has an obligation to explain -and to justify his or her conduct, the forum can -pose questions and pass judgement, and the -actor **may face consequences**. - -That is accountability implies some oversight with ability to penalize - - - -Wieringa, Maranke. "[What to account for when accounting for algorithms: a systematic literature review on algorithmic accountability](https://dl.acm.org/doi/abs/10.1145/3351095.3372833)." In *Proceedings of the Conference on Fairness, Accountability, and Transparency*, pp. 1-18. 2020. - -![Random letters](../_assets/onterminology.jpg) - - ----- -## Who is responsible? - -![teen-suicide-rate](teen-suicide-rate.png) - - ----- -## Who is responsible? - -![News headline: How US surveillance technology is propping up authoritarian regimes](surveillance.png) - ----- -## Who is responsible? - -![Weapons robot](bigdog.png) - ----- -## Who is responsible? - -[![Faceswap github webpage](faceswap.png)](https://github.com/deepfakes/faceswap) - ----- -## Faceswap's README "FaceSwap has ethical uses" - -
- -> [...] as is so often the way with new technology emerging on the internet, it was immediately used to create inappropriate content. - -> [...] it was the first AI code that anyone could download, run and learn by experimentation without having a Ph.D. in math, computer theory, psychology, and more. Before "deepfakes" these techniques were like black magic, only practiced by those who could understand all of the inner workings as described in esoteric and endlessly complicated books and papers. - -> [...] the release of this code opened up a fantastic learning opportunity. - -> Are there some out there doing horrible things with similar software? Yes. And because of this, the developers have been following strict ethical standards. Many of us don't even use it to create videos, we just tinker with the code to see what it does. [...] - -> FaceSwap is not for creating inappropriate content. -> FaceSwap is not for changing faces without consent or with the intent of hiding its use. -> FaceSwap is not for any illicit, unethical, or questionable purposes. [...] - -
----
> THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Note: Software engineers got (mostly) away with declaring not to be liable

----
## Easy to Blame "The Algorithm" / "The Data" / "Software"

> "Just a bug, things happen, nothing we could have done"

- But the system was designed by humans
- But humans did not anticipate possible mistakes, did not design to mitigate mistakes
- But humans made decisions about what quality was good enough
- But humans designed/ignored the development process
- But humans gave/sold poor-quality software to other humans
- But humans used the software without understanding it
- ...

----
![Stack Overflow survey on who is responsible](stackoverflow.png)

Results from the [2018 StackOverflow Survey](https://insights.stackoverflow.com/survey/2018/#technology-and-society)

----
## What to do?

* Responsible organizations embed risk analysis, quality control, and ethical considerations into their process
* Establish and communicate policies defining responsibilities
* Work from aspirations toward culture change: baseline awareness + experts
* Document tradeoffs and decisions (e.g., datasheets, model cards)
* Continuous learning
* Consider controlling/restricting how the software may be used, or whether it should be built at all
* And... follow the law
* Get started with existing guidelines, e.g., in [AI Ethics Guidelines](https://algorithmwatch.org/en/ai-ethics-guidelines-global-inventory/)

---
# (Self-)Regulation and Policy

----
![Self-regulation of tech companies on facial recognition](npr_facialrecognition.png)

----
![Responsible AI website from Microsoft](responsibleai.png)

----
## Policy Discussion and Framing

* Corporate pitch: "Responsible AI" ([Microsoft](https://www.microsoft.com/en-us/ai/responsible-ai), [Google](https://ai.google/responsibilities/responsible-ai-practices/), [Accenture](https://www.accenture.com/_acnmedia/pdf-92/accenture-afs-responsible-ai.pdf))
* Counterpoint: Ochigame ["The Invention of 'Ethical AI': How Big Tech Manipulates Academia to Avoid Regulation"](https://theintercept.com/2019/12/20/mit-ethical-ai-artificial-intelligence/), The Intercept 2019
  * "*The discourse of “ethical AI” was aligned strategically with a Silicon Valley effort seeking to avoid legally enforceable restrictions of controversial technologies.*"

**Self-regulation vs government regulation? Assuring safety vs fostering innovation?**
- - ----- -# "Wishful Worries" - -We are distracted with worries about fairness and safety of hypothetical systems - -Most systems fail because they didn't work in the first place; don't actually solve a problem or address impossible tasks - -Wouldn't help even if they solved the given problem (e.g., predictive policing?) - - - - -Raji, Inioluwa Deborah, I. Elizabeth Kumar, Aaron Horowitz, and Andrew Selbst. "The fallacy of AI functionality." In 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 959-972. 2022. - - ----- -[![Forbes Article: This Is The Year Of AI Regulations](airegulation.png)](https://www.forbes.com/sites/cognitiveworld/2020/03/01/this-is-the-year-of-ai-regulations/#1ea2a84d7a81) - - ----- -## “Accelerating America’s Leadership in Artificial Intelligence” - -> “the policy of the United States Government [is] to sustain and enhance the scientific, technological, and economic leadership position of the United States in AI.” -- [White House Executive Order Feb. 2019](https://www.whitehouse.gov/articles/accelerating-americas-leadership-in-artificial-intelligence/) - -Tone: "When in doubt, the government should not regulate AI." - -Note: -* 3. Setting AI Governance Standards: "*foster public trust in AI systems by establishing guidance for AI development. [...] help Federal regulatory agencies develop and maintain approaches for the safe and trustworthy creation and adoption of new AI technologies. [...] NIST to lead the development of appropriate technical standards for reliable, robust, trustworthy, secure, portable, and interoperable AI systems.*" - ----- -## Jan 13 2020 Draft Rules for Private Sector AI - -
- -* *Public Trust in AI*: Overarching theme: reliable, robust, trustworthy AI -* *Public participation:* public oversight in AI regulation -* *Scientific Integrity and Information Quality:* science-backed regulation -* *Risk Assessment and Management:* risk-based regulation -* *Benefits and Costs:* regulation costs may not outweigh benefits -* *Flexibility:* accommodate rapid growth and change -* *Disclosure and Transparency:* context-based transparency regulation -* *Safety and Security:* private sector resilience - - -[Draft: Guidance for Regulation of Artificial Intelligence Applications](https://www.whitehouse.gov/wp-content/uploads/2020/01/Draft-OMB-Memo-on-Regulation-of-AI-1-7-19.pdf) - -
----- -## Other Regulations - -* *China:* policy ensures state control of Chinese companies and over valuable data, including storage of data on Chinese users within the country and mandatory national standards for AI -* *EU:* Ethics Guidelines for Trustworthy Artificial Intelligence; Policy and investment recommendations for trustworthy Artificial Intelligence; draft regulatory framework for high-risk AI applications, including procedures for testing, record-keeping, certification, ... -* *UK:* Guidance on responsible design and implementation of AI systems and data ethics - - - -Source: https://en.wikipedia.org/wiki/Regulation_of_artificial_intelligence - - ----- -## Call for Transparent and Audited Models - -
- -> "no black box should be deployed -when there exists an interpretable model with the same level of performance" - -For high-stakes decisions -* ... with government involvement (recidivism, policing, city planning, ...) -* ... in medicine -* ... with discrimination concerns (hiring, loans, housing, ...) -* ... that influence society and discourse? (algorithmic content amplifications, targeted advertisement, ...) - -*Regulate possible conflict: Intellectual property vs public welfare* - -
- - - -Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1.5 (2019): 206-215. ([Preprint](https://arxiv.org/abs/1811.10154)) - - - - ----- -## Criticism: Ethics Washing, Ethics Bashing, Regulatory Capture - - - - - ---- -# Summary - -* Transparency goes beyond explaining predictions -* Plan for mistakes and human oversight -* Accountability and culpability are hard to capture, little regulation -* Be a responsible engineer, adopt a culture of responsibility -* Regulations may be coming - ----- -## Further Readings - -
- -* Jacovi, Alon, Ana Marasović, Tim Miller, and Yoav Goldberg. [Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI](https://arxiv.org/abs/2010.07487). In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 624–635. 2021. -* Eslami, Motahhare, Aimee Rickman, Kristen Vaccaro, Amirhossein Aleyasen, Andy Vuong, Karrie Karahalios, Kevin Hamilton, and Christian Sandvig. [I always assumed that I wasn’t really that close to her: Reasoning about Invisible Algorithms in News Feeds](http://social.cs.uiuc.edu/papers/pdfs/Eslami_Algorithms_CHI15.pdf). In Proceedings of the 33rd annual ACM conference on human factors in computing systems, pp. 153–162. ACM, 2015. -* Rakova, Bogdana, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. “[Where responsible AI meets reality: Practitioner perspectives on enablers for shifting organizational practices](https://arxiv.org/abs/2006.12358).” Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW1 (2021): 1–23. -* Greene, Daniel, Anna Lauren Hoffmann, and Luke Stark. "[Better, nicer, clearer, fairer: A critical assessment of the movement for ethical artificial intelligence and machine learning](https://core.ac.uk/download/pdf/211327327.pdf)." In *Proceedings of the 52nd Hawaii International Conference on System Sciences* (2019). -* Metcalf, Jacob, and Emanuel Moss. "[Owning ethics: Corporate logics, silicon valley, and the institutionalization of ethics](https://datasociety.net/wp-content/uploads/2019/09/Owning-Ethics-PDF-version-2.pdf)." *Social Research: An International Quarterly* 86, no. 2 (2019): 449-476. -* Raji, Inioluwa Deborah, I. Elizabeth Kumar, Aaron Horowitz, and Andrew Selbst. "[The fallacy of AI functionality](https://dl.acm.org/doi/abs/10.1145/3531146.3533158)." In 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 959-972. 2022. - - -
\ No newline at end of file diff --git a/lectures/21_provenance/2phase-prediction.svg b/lectures/21_provenance/2phase-prediction.svg deleted file mode 100644 index f9b92a94..00000000 --- a/lectures/21_provenance/2phase-prediction.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/21_provenance/apollo.png b/lectures/21_provenance/apollo.png deleted file mode 100644 index 03609231..00000000 Binary files a/lectures/21_provenance/apollo.png and /dev/null differ diff --git a/lectures/21_provenance/creditcard-provenance.svg b/lectures/21_provenance/creditcard-provenance.svg deleted file mode 100644 index f7ee80d3..00000000 --- a/lectures/21_provenance/creditcard-provenance.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/21_provenance/ensemble.svg b/lectures/21_provenance/ensemble.svg deleted file mode 100644 index 7be898f2..00000000 --- a/lectures/21_provenance/ensemble.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/21_provenance/imgcaptioningml-blame.png b/lectures/21_provenance/imgcaptioningml-blame.png deleted file mode 100644 index cab2ea9c..00000000 Binary files a/lectures/21_provenance/imgcaptioningml-blame.png and /dev/null differ diff --git a/lectures/21_provenance/imgcaptioningml-decomposed.png b/lectures/21_provenance/imgcaptioningml-decomposed.png deleted file mode 100644 index 08ad0a58..00000000 Binary files a/lectures/21_provenance/imgcaptioningml-decomposed.png and /dev/null differ diff --git a/lectures/21_provenance/imgcaptioningml-nonmonotonic.png b/lectures/21_provenance/imgcaptioningml-nonmonotonic.png deleted file mode 100644 index 73e32cee..00000000 Binary files a/lectures/21_provenance/imgcaptioningml-nonmonotonic.png and /dev/null differ diff --git a/lectures/21_provenance/memgen-provenance.svg b/lectures/21_provenance/memgen-provenance.svg deleted file mode 100644 index e16cbc9d..00000000 --- a/lectures/21_provenance/memgen-provenance.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/21_provenance/mlflow-web-ui.png b/lectures/21_provenance/mlflow-web-ui.png deleted file mode 100644 index 82e3e39a..00000000 Binary files a/lectures/21_provenance/mlflow-web-ui.png and /dev/null differ diff --git a/lectures/21_provenance/overrides.svg b/lectures/21_provenance/overrides.svg deleted file mode 100644 index 04c7c08e..00000000 --- a/lectures/21_provenance/overrides.svg +++ /dev/null @@ -1 +0,0 @@ -
[deleted SVG diagram; text labels: input, blocklist, no, model, guardrail, yes]
diff --git a/lectures/21_provenance/partitioncontext.svg b/lectures/21_provenance/partitioncontext.svg deleted file mode 100644 index 34c8a3c2..00000000 --- a/lectures/21_provenance/partitioncontext.svg +++ /dev/null @@ -1 +0,0 @@ -
[deleted SVG diagram; text labels: input, pick model, model1, model2, model3, yes/no]
diff --git a/lectures/21_provenance/pipeline-versioning.svg b/lectures/21_provenance/pipeline-versioning.svg
deleted file mode 100644
index b5c09b30..00000000
--- a/lectures/21_provenance/pipeline-versioning.svg
+++ /dev/null
@@ -1 +0,0 @@
\ No newline at end of file
diff --git a/lectures/21_provenance/provenance.md b/lectures/21_provenance/provenance.md
deleted file mode 100644
index 2b3da0c3..00000000
--- a/lectures/21_provenance/provenance.md
+++ /dev/null
@@ -1,765 +0,0 @@
---
author: Christian Kaestner & Eunsuk Kang
title: "MLiP: Versioning, Provenance, and Reproducibility"
semester: Fall 2023
footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023"
license: Creative Commons Attribution 4.0 International (CC BY 4.0)
---
## Machine Learning in Production

# Versioning, Provenance, and Reproducibility

---
## More Foundational Technology for Responsible Engineering

![Overview of course content](../_assets/overview.svg)

----
## Readings

Required readings:

* 🕮 Hulten, Geoff. "[Building Intelligent Systems: A Guide to Machine Learning Engineering.](https://www.buildingintelligentsystems.com/)" Apress, 2018, Chapter 21 (Organizing Intelligence).
* 🗎 Halevy, Alon, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. [Goods: Organizing Google's Datasets](http://research.google.com/pubs/archive/45390.pdf). In Proceedings of the 2016 International Conference on Management of Data, pp. 795-806. ACM, 2016.

---
# Learning Goals

* Judge the importance of data provenance, reproducibility, and explainability for a given system
* Create documentation for data dependencies and provenance in a given system
* Propose versioning strategies for data and models
* Design and test systems for reproducibility

---
# Case Study: Credit Scoring
- ----- -![Example of dataflows between 4 sources and 3 models in credit card application scenario](creditcard-provenance.svg) - - - ----- - -## Debugging? - -What went wrong? Where? How to fix? - - - ----- - -## Debugging Questions beyond Interpretability - -* Can we reproduce the problem? -* What were the inputs to the model? -* Which exact model version was used? -* What data was the model trained with? -* What pipeline code was the model trained with? -* Where does the data come from? How was it processed/extracted? -* Were other models involved? Which version? Based on which data? -* What parts of the input are responsible for the (wrong) answer? How can we fix the model? - - - ----- -## Model Chaining: Automatic meme generator - -![Meme generator chaining 2 models](memgen-provenance.svg) - - -*Version all models involved!* - - -Example adapted from Jon Peck. [Chaining machine learning models in production with Algorithmia](https://algorithmia.com/blog/chaining-machine-learning-models-in-production-with-algorithmia). Algorithmia blog, 2019 - ----- -## Complex Model Composition: ML Models for Feature Extraction - - -![Architecture of Apollo](apollo.png) - - - -Image: Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE. 2020. - - -Note: see also Zong, W., Zhang, C., Wang, Z., Zhu, J., & Chen, Q. (2018). [Architecture design and implementation of an autonomous vehicle](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8340798). IEEE access, 6, 21956-21970. - - ----- - -## Breakout Discussion: Movie Predictions - -
> Assume you are receiving complaints that a child gets many recommendations about R-rated movies

In a group, discuss how you could address this in your own system and post to `#lecture`, tagging team members:

* How could you identify the problematic recommendation(s)?
* How could you identify the model that caused the prediction?
* How could you identify the training code and data that learned the model?
* How could you identify what training data or infrastructure code "caused" the recommendations?

K. G. Orphanides. [Children's YouTube is still churning out blood, suicide and cannibalism](https://www.wired.co.uk/article/youtube-for-kids-videos-problems-algorithm-recommend). Wired UK, 2018;
Kristie Bertucci. [16 NSFW Movies Streaming on Netflix](https://www.gadgetreview.com/16-nsfw-movies-streaming-on-netflix). Gadget Reviews, 2020
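----
## Sketch: Logging Predictions with Model Versions

One building block that makes the breakout questions answerable is logging every prediction together with the model version (a sketch; field names and file format are illustrative choices, not a prescribed API):

```python
# Sketch: append-only prediction log, so any recommendation can later
# be traced back to a model version and its inputs. Names illustrative.
import hashlib, json, time

MODEL_VERSION = "recommender-v2.3.1"   # hypothetical version identifier

def log_prediction(user_id: str, features: dict, prediction: list) -> None:
    record = {
        "timestamp": time.time(),
        "model_version": MODEL_VERSION,
        "input_hash": hashlib.sha256(
            json.dumps(features, sort_keys=True).encode()).hexdigest(),
        "user_id": user_id,
        "prediction": prediction,
    }
    with open("predictions.log", "a") as f:   # append only, never mutate
        f.write(json.dumps(record) + "\n")

log_prediction("user-42", {"age_group": "child"}, ["movie-17", "movie-99"])
```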
---
# Provenance Tracking

*Historical record of data and its origin*

----
## Data Provenance

* Track the origin of all data
  - Collected where?
  - Modified by whom, when, why?
  - Extracted from what other data or model or algorithm?
* ML models are often based on data derived from many sources through many steps, including other models

![Example of dataflows between 4 sources and 3 models in credit card application scenario](creditcard-provenance.svg)

----
## Excursion: Provenance Tracking in Databases

Whenever a value is changed, record:
 - who changed it
 - time of change
 - history of previous values
 - possibly also a justification of why

Embedded as a feature in some databases or implemented in business logic

Possibly signing with cryptographic methods

----
## Tracking Data Lineage

Document all data sources

Identify all model dependencies and flows

Ideally model all data and processing code

Avoid "visibility debt"

(Advanced: Use infrastructure to automatically capture/infer dependencies and flows as in [Goods](http://research.google.com/pubs/archive/45390.pdf))

----
## Feature Provenance

How are features extracted from raw data?
 - during training
 - during inference

Has feature extraction changed since the model was trained?

Recommendation: Modularize and version feature extraction code (a sketch follows at the end of this section)

**Example?**

----
## Advanced Practice: Feature Store

Stores feature extraction code as functions, versioned

Catalog features to encourage reuse

Compute and cache features centrally

Use the same features in training and inference code

Advanced: Immutable features -- never change existing features, just add new ones (e.g., creditscore, creditscore2, creditscore3)

----
## Model Provenance

How was the model trained?

What data? What library? What hyperparameters? What code?

Ensemble of multiple models?

----
![Example of dataflows between 4 sources and 3 models in credit card application scenario](creditcard-provenance.svg)

----
## In Real Systems: Tracking Provenance Across Multiple Models

![Meme generator chaining 2 models](memgen-provenance.svg)

Example adapted from Jon Peck. [Chaining machine learning models in production with Algorithmia](https://algorithmia.com/blog/chaining-machine-learning-models-in-production-with-algorithmia). Algorithmia blog, 2019

----
## Complex Model Composition: ML Models for Feature Extraction

![Architecture of Apollo](apollo.png)

Image: Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE. 2020.

----
## Summary: Provenance

Data provenance

Feature provenance

Model provenance

---
# Practical Data and Model Versioning

----
## How to Version Large Datasets?

(movie ratings, movie metadata, user data?)
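----
## Sketch: Immutable, Versioned Feature Functions

Before turning to dataset versioning: the feature-store idea above can be sketched as registered, never-edited feature functions (the registry and feature names are illustrative stand-ins for a real feature-store catalog):

```python
# Sketch: existing feature versions are never changed; revised logic
# gets a new name (creditscore2, creditscore3, ...), as on the
# feature-store slide. The registry is illustrative, not a product API.
FEATURES = {}

def feature(name):
    def register(fn):
        assert name not in FEATURES, "features are immutable"
        FEATURES[name] = fn
        return fn
    return register

@feature("creditscore")
def creditscore(raw):        # original definition, frozen forever
    return raw["payments_on_time"] / max(raw["payments_total"], 1)

@feature("creditscore2")
def creditscore2(raw):       # revised logic under a new, versioned name
    return FEATURES["creditscore"](raw) * (1 - 0.1 * raw.get("defaults", 0))

row = {"payments_on_time": 18, "payments_total": 20, "defaults": 1}
print(FEATURES["creditscore"](row), FEATURES["creditscore2"](row))
```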
- ----- -## Recall: Event Sourcing - -* Append only databases -* Record edit events, never mutate data -* Compute current state from all past events, can reconstruct old state -* For efficiency, take state snapshots -* Similar to traditional database logs - -```text -createUser(id=5, name="Christian", dpt="SCS") -updateUser(id=5, dpt="ISR") -deleteUser(id=5) -``` - ----- -## Versioning Strategies for Datasets - -1. Store copies of entire datasets (like Git), identify by checksum -2. Store deltas between datasets (like Mercurial) -3. Offsets in append-only database (like Kafka), identify by offset -4. History of individual database records (e.g. S3 bucket versions) - - some databases specifically track provenance (who has changed what entry when and how) - - specialized data science tools eg [Hangar](https://github.com/tensorwerk/hangar-py) for tensor data -5. Version pipeline to recreate derived datasets ("views", different formats) - - e.g. version data before or after cleaning? - - ----- -## Aside: Git Internals - -![Git internal model](https://git-scm.com/book/en/v2/images/data-model-4.png) - - - -Scott Chacon and Ben Straub. [Pro Git](https://git-scm.com/book/en/v2/Git-Internals-Git-References). 2014 - ----- -## Versioning Models - - - ----- -## Versioning Models - -Usually no meaningful delta/compression, version as binary objects - -Any system to track versions of blobs - ----- -## Versioning Pipelines - -![](pipeline-versioning.svg) - - -Associate model version with pipeline code version, data version, and hyperparameters! - ----- -## Versioning Dependencies - -Pipelines depend on many frameworks and libraries - -Ensure reproducible builds - - Declare versioned dependencies from stable repository (e.g. requirements.txt + pip) - - Avoid floating versions - - Optionally: commit all dependencies to repository ("vendoring") - -Optionally: Version entire environment (e.g. Docker container) - - -Test build/pipeline on independent machine (container, CI server, ...) - - - ----- -## ML Versioning Tools (MLOps) - -Tracking data, pipeline, and model versions - -Modeling pipelines: inputs and outputs and their versions - - automatically tracks how data is used and transformed - -Often tracking also metadata about versions - - Accuracy - - Training time - - ... - - ----- -## Example: DVC - -```sh -dvc add images -dvc run -d images -o model.p cnn.py -dvc remote add myrepo s3://mybucket -dvc push -``` - -* Tracks models and datasets, built on Git -* Splits learning into steps, incrementalization -* Orchestrates learning in cloud resources - - -https://dvc.org/ - ----- -## DVC Example - -```yaml -stages: - features: - cmd: jupyter nbconvert --execute featurize.ipynb - deps: - - data/clean - params: - - levels.no - outs: - - features - metrics: - - performance.json - training: - desc: Train model with Python - cmd: - - pip install -r requirements.txt - - python train.py --out ${model_file} - deps: - - requirements.txt - - train.py - - features - outs: - - ${model_file}: - desc: My model description - plots: - - logs.csv: - x: epoch - x_label: Epoch - meta: 'For deployment' - # User metadata and comments are supported -``` - - - ----- -## Experiment Tracking - -Log information within pipelines: hyperparameters used, evaluation results, and model files - -![MLflow UI](mlflow-web-ui.png) - - -Many tools: MLflow, ModelDB, Neptune, TensorBoard, Weights & Biases, Comet.ml, ... - -Note: Image from -Matei Zaharia. 
[Introducing MLflow: an Open Source Machine Learning Platform](https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html), 2018

----
## ModelDB Example

```python
from verta import Client
client = Client("http://localhost:3000")

proj = client.set_project("My first ModelDB project")
expt = client.set_experiment("Default Experiment")

# log the first run
run = client.set_experiment_run("First Run")
run.log_hyperparameters({"regularization": 0.5})
run.log_dataset_version("training_and_testing_data", dataset_version)
model1 = ...  # model training code goes here
run.log_metric('accuracy', accuracy(model1, validationData))
run.log_model(model1)

# log the second run
run = client.set_experiment_run("Second Run")
run.log_hyperparameters({"regularization": 0.8})
run.log_dataset_version("training_and_testing_data", dataset_version)
model2 = ...  # model training code goes here
run.log_metric('accuracy', accuracy(model2, validationData))
run.log_model(model2)
```

----
## Google's Goods

Automatically derive data dependencies from system log files

Track metadata for each table

No manual tracking/dependency declarations needed

Requires homogeneous infrastructure

Similar systems for tracking inside databases, MapReduce, Spark, etc.

----
## From Model Versioning to Deployment

Decide which model version to run where
 - automated deployment and rollback (cf. canary releases)
 - Kubernetes, Cortex, BentoML, ...

Track which prediction has been performed with which model version (logging)

----
## Logging and Audit Traces

**Key goal:** If a customer complains about an interaction, can we reproduce the prediction with the right model? Can we debug the model's pipeline and data? Can we reproduce the model?

* Version everything
* Record every model evaluation with model version
* Append only, backed up

```
<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
```

----
## Logging for Composed Models

*Ensure all predictions are logged*

![Meme generator chaining 2 models](memgen-provenance.svg)

----
## Breakout Discussion: Movie Predictions (Revisited)

> Assume you are receiving complaints that a child gets mostly recommendations about R-rated movies

Discuss again, updating the previous post in `#lecture`:

* How would you identify the model that caused the prediction?
* How would you identify the code and dependencies that trained the model?
* How would you identify the training data used for that model?

---
# Reproducibility

----
## On Terminology

**Replicability:** ability to reproduce results exactly
* Ensures everything is clear and documented
* All data and infrastructure shared; requires determinism

**Reproducibility:** the ability of an experiment to be repeated with minor differences, achieving a consistent expected result
* In science, reproducing results is important to gain confidence
* many different forms distinguished: conceptual, close, direct, exact, independent, literal, nonexperimental, partial, retest, ...

Juristo, Natalia, and Omar S. Gómez. "[Replication of software engineering experiments](https://www.researchgate.net/profile/Omar_S_Gomez/publication/221051163_Replication_of_Software_Engineering_Experiments/links/5483c83c0cf25dbd59eb1038/Replication-of-Software-Engineering-Experiments.pdf)." In Empirical Software Engineering and Verification, pp. 60-88. Springer, Berlin, Heidelberg, 2010.
- -![Random letters](../_assets/onterminology.jpg) - - ----- -## "Reproducibility" of Notebooks -
2019 study of 1.4M notebooks on GitHub:
- 21% had unexecuted cells
- 36% executed cells out of order
- 14% declare dependencies
- success rate for installing dependencies <40% (version issues, missing files)
- notebook execution failed with an exception in >40% (often ImportError, NameError, FileNotFoundError)
- only 24% finished execution without problems; of those, 75% produced different results

2020 study of 936 executable notebooks:
- 40% produce different results due to nondeterminism (randomness without seed)
- 12% due to time and date
- 51% due to plots (different library version, API misuse)
- 2% due to external inputs (e.g., Weather API)
- 27% due to the execution environment (e.g., Python package versions)

🗎 Pimentel, João Felipe, et al. "A large-scale study about quality and reproducibility of Jupyter notebooks." In Proc. MSR, 2019. and
🗎 Wang, Jiawei, Tzu-Yang Kuo, Li Li, and Andreas Zeller. "Assessing and restoring reproducibility of Jupyter notebooks." In Proc. ASE, 2020.
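----
## Sketch: Fixing Seeds for Determinism

Unseeded randomness was the leading cause of irreproducible notebooks in the studies above; pinning seeds is a cheap partial fix (a sketch; ML frameworks typically need their own seeding call on top of this):

```python
# Sketch: pin the cheap sources of nondeterminism up front.
import random
import numpy as np

SEED = 42  # any fixed, documented value

def set_seeds(seed: int = SEED) -> None:
    random.seed(seed)      # Python standard library RNG
    np.random.seed(seed)   # NumPy global RNG
    # frameworks need their own call, e.g., torch.manual_seed(seed)

set_seeds(); a = np.random.rand(3)
set_seeds(); b = np.random.rand(3)
assert (a == b).all()      # same seed -> same "random" numbers
```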
- - -🗎 Pimentel, João Felipe, et al. "A large-scale study about quality and reproducibility of Jupyter notebooks." In Proc. MSR, 2019. and -🗎 Wang, Jiawei, Tzu-Yang Kuo, Li Li, and Andreas Zeller. "Assessing and restoring reproducibility of Jupyter notebooks." In Proc. ASE, 2020. - ----- -## Practical Reproducibility - -Ability to generate the same research results or predictions - -Recreate model from data - -Requires versioning of data and pipeline (incl. hyperparameters and dependencies) - - ----- -## Nondeterminism - -* Model inference almost always deterministic for a given model -* Many machine learning algorithms are nondeterministic - - Nondeterminism in neural networks initialized from random initial weights - - Nondeterminism from distributed computing, random forests - - Determinism in linear regression and decision trees -* Many notebooks and pipelines contain nondeterminism - - Depend on time or snapshot of online data (e.g., stream) - - Missing random seed initialization - - Memory addresses appearing in outputs (e.g., for figure objects) -* Different library versions installed on the machine - - ----- -## Recommendations for Reproducibility - -* Version pipeline and data (see above) -* Document each step - - document intention and assumptions of the process (not just results) - - e.g., document why data is cleaned a certain way - - e.g., document why certain parameters chosen -* Ensure determinism of pipeline steps (-> test) -* Modularize and test the pipeline -* Containerize infrastructure -- see MLOps - - - - - - - - - - - - ---- -# Summary - -Provenance is important for debugging and accountability - -Data provenance, feature provenance, model provenance - -Reproducibility vs replicability - -*Version everything!* - - Strategies for data versioning at scale - - Version the entire pipeline and dependencies - - Adopt a pipeline view, modularize, automate - - Containers and MLOps, many tools - ----- -## Further Readings - -* Sugimura, Peter, and Florian Hartl. “Building a Reproducible Machine Learning Pipeline.” *arXiv preprint arXiv:1810.04570* (2018). -* Chattopadhyay, Souti, Ishita Prasad, Austin Z. Henley, Anita Sarma, and Titus Barik. “[What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities](https://web.eecs.utk.edu/~azh/pubs/Chattopadhyay2020CHI_NotebookPainpoints.pdf).” In Proceedings of the CHI Conference on Human Factors in Computing Systems, 2020. -* Sculley, D., et al. “[Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf).” In Advances in neural information processing systems, pp. 2503–2511. 2015. - - - - - - - - - - ---- - -# Bonus: Debugging and Fixing Models - - - -See also Hulten. Building Intelligent Systems. Chapter 21 - -See also Nushi, Besmira, Ece Kamar, Eric Horvitz, and Donald Kossmann. "[On human intellect and machine failures: troubleshooting integrative machine learning systems](http://erichorvitz.com/human_repair_AI_pipeline.pdf)." In *Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence*, pp. 1017-1025. 2017. - - ----- -## Recall: Composing Models: Ensemble and metamodels - -![Ensemble models](ensemble.svg) - - ----- -## Recall: Composing Models: Decomposing the problem, sequential - -![](sequential-model-composition.svg) - - ----- -## Recall: Composing Models: Cascade/two-phase prediction - -![](2phase-prediction.svg) - - - - ----- -## Decomposing the Image Captioning Problem? 
- -![Image of a snowboarder](snowboarder.png) - -Note: Uses insights into how humans reason: Captions contain important objects in the image and their relations. Captions follow typical language/grammatical structure - ----- -## State of the Art Decomposition (in 2015) - -![Captioning example](imgcaptioningml-decomposed.png) - - - -Example and image from: Nushi, Besmira, Ece Kamar, Eric Horvitz, and Donald Kossmann. "[On human intellect and machine failures: troubleshooting integrative machine learning systems](http://erichorvitz.com/human_repair_AI_pipeline.pdf)." In Proc. AAAI. 2017. - - ----- -## Blame assignment? - -![blame assignment problem](imgcaptioningml-blame.png) - - - -Example and image from: Nushi, Besmira, Ece Kamar, Eric Horvitz, and Donald Kossmann. "[On human intellect and machine failures: troubleshooting integrative machine learning systems](http://erichorvitz.com/human_repair_AI_pipeline.pdf)." In Proc. AAAI. 2017. - ----- -## Nonmonotonic errors - -![example of nonmonotonic error](imgcaptioningml-nonmonotonic.png) - - - -Example and image from: Nushi, Besmira, Ece Kamar, Eric Horvitz, and Donald Kossmann. "[On human intellect and machine failures: troubleshooting integrative machine learning systems](http://erichorvitz.com/human_repair_AI_pipeline.pdf)." In Proc. AAAI. 2017. - - - ----- - -## Chasing Bugs - -* Update, clean, add, remove data -* Change modeling parameters -* Add regression tests -* Fixing one problem may lead to others, recognizable only later - ----- - -## Partitioning Contexts - - -* Separate models for different subpopulations -* Potentially used to address fairness issues -* ML approaches typically partition internally already - - -![](partitioncontext.svg) - - - ----- - -## Overrides - -* Hardcoded heuristics (usually created and maintained by humans) for special cases -* Blocklists, guardrails -* Potential never-ending attempt to fix special cases - - -![](overrides.svg) - - - - - ----- -## Ideas? 
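As one possible seed for the discussion, a minimal sketch of the override/guardrail pattern from the previous slide; the `model` object and blocklist terms are hypothetical placeholders, not from the lecture:

```python
# Hypothetical guardrail wrapper around a trained model; the blocklist
# and the model's predict() interface are illustrative assumptions.
BLOCKLIST = {"bad-term-1", "bad-term-2"}

def predict_with_overrides(model, text: str) -> str:
    # hardcoded heuristic: refuse inputs that match the blocklist
    if any(term in text.lower() for term in BLOCKLIST):
        return "(blocked by input guardrail)"
    prediction = model.predict(text)
    # guardrail on the output side as well
    if any(term in prediction.lower() for term in BLOCKLIST):
        return "(blocked by output guardrail)"
    return prediction
```

Such wrappers are easy to add but, as noted above, risk becoming a never-ending list of special cases that humans must maintain.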
- - diff --git a/lectures/21_provenance/sequential-model-composition.svg b/lectures/21_provenance/sequential-model-composition.svg deleted file mode 100644 index 3fce8495..00000000 --- a/lectures/21_provenance/sequential-model-composition.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/21_provenance/snowboarder.png b/lectures/21_provenance/snowboarder.png deleted file mode 100644 index 8cd60b68..00000000 Binary files a/lectures/21_provenance/snowboarder.png and /dev/null differ diff --git a/lectures/22_security/30aisec.png b/lectures/22_security/30aisec.png deleted file mode 100644 index 1d2d2024..00000000 Binary files a/lectures/22_security/30aisec.png and /dev/null differ diff --git a/lectures/22_security/admission-hack.png b/lectures/22_security/admission-hack.png deleted file mode 100644 index 9134e0a2..00000000 Binary files a/lectures/22_security/admission-hack.png and /dev/null differ diff --git a/lectures/22_security/admission-threat-model.jpg b/lectures/22_security/admission-threat-model.jpg deleted file mode 100644 index a79a2e12..00000000 Binary files a/lectures/22_security/admission-threat-model.jpg and /dev/null differ diff --git a/lectures/22_security/amazon-gdpr.png b/lectures/22_security/amazon-gdpr.png deleted file mode 100644 index 9b6c431f..00000000 Binary files a/lectures/22_security/amazon-gdpr.png and /dev/null differ diff --git a/lectures/22_security/anonymization.png b/lectures/22_security/anonymization.png deleted file mode 100644 index 0c0859e9..00000000 Binary files a/lectures/22_security/anonymization.png and /dev/null differ diff --git a/lectures/22_security/arms-race.jpg b/lectures/22_security/arms-race.jpg deleted file mode 100644 index dc0b97e4..00000000 Binary files a/lectures/22_security/arms-race.jpg and /dev/null differ diff --git a/lectures/22_security/art-of-war.png b/lectures/22_security/art-of-war.png deleted file mode 100644 index 6747efe5..00000000 Binary files a/lectures/22_security/art-of-war.png and /dev/null differ diff --git a/lectures/22_security/big-tech.png b/lectures/22_security/big-tech.png deleted file mode 100644 index 141267fc..00000000 Binary files a/lectures/22_security/big-tech.png and /dev/null differ diff --git a/lectures/22_security/bing.png b/lectures/22_security/bing.png deleted file mode 100644 index 8d5708d9..00000000 Binary files a/lectures/22_security/bing.png and /dev/null differ diff --git a/lectures/22_security/cambridge-analytica.jpg b/lectures/22_security/cambridge-analytica.jpg deleted file mode 100644 index 8e1fddd8..00000000 Binary files a/lectures/22_security/cambridge-analytica.jpg and /dev/null differ diff --git a/lectures/22_security/can-bus.png b/lectures/22_security/can-bus.png deleted file mode 100644 index 31d84e34..00000000 Binary files a/lectures/22_security/can-bus.png and /dev/null differ diff --git a/lectures/22_security/cia-triad.png b/lectures/22_security/cia-triad.png deleted file mode 100644 index b64e8e61..00000000 Binary files a/lectures/22_security/cia-triad.png and /dev/null differ diff --git a/lectures/22_security/colonial-pipeline.jpg b/lectures/22_security/colonial-pipeline.jpg deleted file mode 100644 index 6f2b9ae8..00000000 Binary files a/lectures/22_security/colonial-pipeline.jpg and /dev/null differ diff --git a/lectures/22_security/component-design1.png b/lectures/22_security/component-design1.png deleted file mode 100644 index 56183b99..00000000 Binary files a/lectures/22_security/component-design1.png and /dev/null differ diff --git 
a/lectures/22_security/component-design2.png b/lectures/22_security/component-design2.png deleted file mode 100644 index 57c21cba..00000000 Binary files a/lectures/22_security/component-design2.png and /dev/null differ diff --git a/lectures/22_security/covid-tracing.png b/lectures/22_security/covid-tracing.png deleted file mode 100644 index 0f7a2a15..00000000 Binary files a/lectures/22_security/covid-tracing.png and /dev/null differ diff --git a/lectures/22_security/dashcam-architecture.jpg b/lectures/22_security/dashcam-architecture.jpg deleted file mode 100644 index 5274e8ae..00000000 Binary files a/lectures/22_security/dashcam-architecture.jpg and /dev/null differ diff --git a/lectures/22_security/data-lake.png b/lectures/22_security/data-lake.png deleted file mode 100644 index 49ce0215..00000000 Binary files a/lectures/22_security/data-lake.png and /dev/null differ diff --git a/lectures/22_security/data-sanitization.png b/lectures/22_security/data-sanitization.png deleted file mode 100644 index 406f2f9b..00000000 Binary files a/lectures/22_security/data-sanitization.png and /dev/null differ diff --git a/lectures/22_security/decisionboundary.png b/lectures/22_security/decisionboundary.png deleted file mode 100644 index ba126585..00000000 Binary files a/lectures/22_security/decisionboundary.png and /dev/null differ diff --git a/lectures/22_security/differential-privacy-example.png b/lectures/22_security/differential-privacy-example.png deleted file mode 100644 index 95530bb8..00000000 Binary files a/lectures/22_security/differential-privacy-example.png and /dev/null differ diff --git a/lectures/22_security/equifax.png b/lectures/22_security/equifax.png deleted file mode 100644 index 04c2b7ee..00000000 Binary files a/lectures/22_security/equifax.png and /dev/null differ diff --git a/lectures/22_security/evasion-attack.png b/lectures/22_security/evasion-attack.png deleted file mode 100644 index 4b596c38..00000000 Binary files a/lectures/22_security/evasion-attack.png and /dev/null differ diff --git a/lectures/22_security/federated-learning.png b/lectures/22_security/federated-learning.png deleted file mode 100644 index a50bba13..00000000 Binary files a/lectures/22_security/federated-learning.png and /dev/null differ diff --git a/lectures/22_security/florida-water-hack.jpg b/lectures/22_security/florida-water-hack.jpg deleted file mode 100644 index 6571447c..00000000 Binary files a/lectures/22_security/florida-water-hack.jpg and /dev/null differ diff --git a/lectures/22_security/gate.png b/lectures/22_security/gate.png deleted file mode 100644 index 094aca6f..00000000 Binary files a/lectures/22_security/gate.png and /dev/null differ diff --git a/lectures/22_security/genericarchitecture.png b/lectures/22_security/genericarchitecture.png deleted file mode 100644 index 72191952..00000000 Binary files a/lectures/22_security/genericarchitecture.png and /dev/null differ diff --git a/lectures/22_security/healthcare.jpg b/lectures/22_security/healthcare.jpg deleted file mode 100644 index d8ef644e..00000000 Binary files a/lectures/22_security/healthcare.jpg and /dev/null differ diff --git a/lectures/22_security/hospital-ransomware.jpg b/lectures/22_security/hospital-ransomware.jpg deleted file mode 100644 index 4605dce9..00000000 Binary files a/lectures/22_security/hospital-ransomware.jpg and /dev/null differ diff --git a/lectures/22_security/iphone-unlock.jpg b/lectures/22_security/iphone-unlock.jpg deleted file mode 100644 index ba729832..00000000 Binary files 
a/lectures/22_security/iphone-unlock.jpg and /dev/null differ diff --git a/lectures/22_security/jp-morgan.jpg b/lectures/22_security/jp-morgan.jpg deleted file mode 100644 index a207aa8f..00000000 Binary files a/lectures/22_security/jp-morgan.jpg and /dev/null differ diff --git a/lectures/22_security/model-inversion-image.png b/lectures/22_security/model-inversion-image.png deleted file mode 100644 index 71f95a78..00000000 Binary files a/lectures/22_security/model-inversion-image.png and /dev/null differ diff --git a/lectures/22_security/monolithic1.png b/lectures/22_security/monolithic1.png deleted file mode 100644 index d37e4fa1..00000000 Binary files a/lectures/22_security/monolithic1.png and /dev/null differ diff --git a/lectures/22_security/monolithic2.png b/lectures/22_security/monolithic2.png deleted file mode 100644 index 8f8ffeda..00000000 Binary files a/lectures/22_security/monolithic2.png and /dev/null differ diff --git a/lectures/22_security/security-phone.jpg b/lectures/22_security/security-phone.jpg deleted file mode 100644 index b7cd2461..00000000 Binary files a/lectures/22_security/security-phone.jpg and /dev/null differ diff --git a/lectures/22_security/security.md b/lectures/22_security/security.md deleted file mode 100644 index a48281b9..00000000 --- a/lectures/22_security/security.md +++ /dev/null @@ -1,1058 +0,0 @@ ---- -author: Eunsuk Kang & Christian Kaestner -title: "MLiP: Security and Privacy" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - -# Security and Privacy - - - - - - - ---- -## More responsible engineering... - -![Overview of course content](../_assets/overview.svg) - - - - ----- -## Readings - -* _Building Intelligent Systems: A Guide to Machine Learning Engineering_, G. Hulten (2018), Chapter 25: Adversaries and Abuse. -* _The Top 10 Risks of Machine Learning Security_, G. McGraw et al., IEEE Computer (2020). - ----- -## Learning Goals - -* Explain key concerns in security (in general and with regard to ML models) -* Identify security requirements with threat modeling -* Analyze a system with regard to attacker goals, attack surface, attacker capabilities -* Describe common attacks against ML models, including poisoning and evasion attacks -* Understand design opportunities to address security threats at the system level -* Apply key design principles for secure system design - ---- -# Security – Why do we care? - ----- -![](jp-morgan.jpg) - - -Source: ABC news, Oct 12, 2014 - ----- -![](colonial-pipeline.jpg) - - -Colonial Pipeline attack, 2021 - ----- -![](florida-water-hack.jpg) - - -Source: Wired, Feb 8, 2021 - ----- -![](hospital-ransomware.jpg) - - -Source: Wall Street Journal, Sept 30, 2021 - ----- -## Security: Why do we care? - -Security is expensive -- Additional development cost; need security expertise in your team/organization -- Annoys and interferes with the user's work (e.g., two-factor authentication) -- Not really regulated/enforced by law -- Often retroactively added after an incident, to avoid embarrassment, lawsuits, - fines (sometimes) - ----- -## Security: Why do we care? - -But an increasingly wide range of harms is caused by security attacks -- Not just data leaks anymore -- Can cause **safety** failures; physical, environmental, mental harms -- Viewpoint: We can't all be security experts, but: - - should be aware of possible consequences of no/little security - - understand basic principles; avoid common pitfalls - - know how to apply best practices - - know how to talk to security experts - - Recall: T-shaped people! - ---- -# Security – A (Very Brief) Overview - ----- -## Elements of Security - -Security requirements (also called "policies") -* What does it mean for my system to be secure? - -Threat model -* What are the attacker's goals, capabilities, and incentives? - -Attack surface -* Which parts of the system are exposed to the attacker? - -Defense mechanisms (mitigations) -* How do we prevent an attacker from compromising a security req.? - ----- -## Security Requirements - -![](cia-triad.png) - - -*What do we mean by "secure"?* - ----- -## Security Requirements - -Common security requirements: "CIA triad" of information security - -__Confidentiality__: Sensitive data must be accessed by authorized users only - -__Integrity__: Sensitive data must be modifiable by authorized users only - -__Availability__: Critical services must be available when needed by clients - ----- -## Example: College Admission System - -![](admission-hack.png) - ----- -## Confidentiality, integrity, or availability? - -* Applications to the program can only be viewed by staff and faculty -in the department. -* The application site should be able to handle requests on the -day of the application deadline. -* Application decisions are recorded only by the faculty and staff. -* The acceptance notices can only be sent out by the program director. 
- ----- -## Other Security Requirements - -**Authentication:** Users are who they say they are - -**Non-repudiation:** Certain changes/actions in the system can be traced to who was responsible for it - -**Authorization:** Only users with the right permissions can access a resource/perform an action - - ---- -# ML-Specific Threats - ----- -## What's new/special about ML? - - - - ----- -## Where to worry about security? - -![](genericarchitecture.png) - - - -From: McGraw, G. et al. "An architectural risk analysis of machine learning systems: Toward more secure machine learning." Berryville Inst. ML (2020). - ----- -## ML-Specific Concerns - -Who can access/influence... -* training data -* labeling -* inference data -* models, pipeline code -* telemetry -* ... - - ----- -## Goals behind ML-Specific Attacks - -**Confidentiality attacks:** Exposure of sensitive data - * Infer a sensitive label for a data point (e.g., hospital record) - -**Integrity attacks:** Unauthorized modification of data - * Induce a model to misclassify data points from one class to another (e.g., spam filter) - -**Availability attacks:** Disruption to critical services - * Reduce the accuracy of a model (e.g., induce model to misclassify many data points) - - ----- -## Overview of Discussed ML-Specific Attacks - -* Evasion attacks/adversarial examples (integrity violation) -* Targeted poisoning attacks (integrity violation) -* Untargeted poisoning attacks (availability violation) -* Model stealing attacks (confidentiality violation against model data) -* Model inversion attack (confidentiality violation against training data) - ----- -## Evasion Attacks (Adversarial Examples) - -![](evasion-attack.png) - - -Attack at inference time -* Add noise to an existing sample & cause misclassification -* Possible with and without access to model internals -* **Q. Other examples?** - - -_Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art -Face Recognition_, Sharif et al. (2016). - ----- -## Evasion Attacks: Another Example - -![](stop-sign-attacks.png) - - - -_Robust Physical-World Attacks on Deep Learning Visual -Classification_, -Eykholt et al., in CVPR (2018). - ----- -## Task Decision Boundary vs Model Boundary - -[![Decision boundary vs model boundary](decisionboundary.png)](decisionboundary.png) - - - - -From Goodfellow et al (2018). [Making machine learning robust against adversarial inputs](https://par.nsf.gov/servlets/purl/10111674). *Communications of the ACM*, *61*(7), 56-66. - -Note: -Exploiting inaccurate model boundary and shortcuts - -* Decision boundary: Ground truth; often unknown and not specifiable -* Model boundary: What is learned; an _approximation_ of -decision boundary - - ----- -## Defense against Evasion Attacks - - - - -__How would you mitigate evasion attacks?__ - ----- -## Defense against Evasion Attacks - -![](stop-sign.png) - -Redundancy: Design multiple mechanisms to detect an attack -* Here: Insert a barcode as a checksum; harder to bypass - - -_Reliable Smart Road Signs_, Sayin et al. (2019). 
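To make evasion concrete, here is a minimal FGSM-style sketch of crafting an adversarial example, assuming white-box access to a differentiable PyTorch classifier; `model`, `x`, and `label` are hypothetical placeholders, and this is one instance of the gradient-following search described on the following slides, not a specific method from the lecture:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, label, epsilon=0.01):
    """Fast Gradient Sign Method sketch: nudge the input in the
    direction that increases the model's loss (white-box access)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # a small, often imperceptible perturbation that may flip the prediction
    return (x + epsilon * x.grad.sign()).detach()
```

Defenses such as adversarial training (next slide) re-train the model on exactly such generated examples with their correct labels.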
- ---- -## Defense against Evasion Attacks - -Redundancy: Design multiple mechanisms to detect an attack - -Adversarial training - * Improve decision boundary, robustness - * Generate/find a set of adversarial examples - * Re-train your model with correct labels - -Input sanitization - * "Clean" & remove noise from input samples - * e.g., Color depth reduction, spatial smoothing, JPEG compression - - ----- -## Generating Adversarial Examples - - - - -**How do we generate adversarial examples?** - - ----- -## Generating Adversarial Examples - -* See [counterfactual explanations](https://ckaestne.github.io/seai/F2020/slides/17_explainability/explainability.html#/7/1) -* Find small change to input that changes prediction - - $x^* = x + \arg\min \\{ |\epsilon| : f(x+\epsilon) \neq f(x) \\}$ - * Many similarity/distance measures for $|\epsilon|$ (e.g., change one feature vs small changes to many features) -* Attacks more effective with access to model internals, but black-box - attacks also feasible - - With model internals: Follow the model's gradient - - Without model internals: Learn [surrogate model](https://ckaestne.github.io/seai/F2020/slides/17_explainability/explainability.html#/6/2) - - With access to confidence scores: Heuristic search (e.g., hill climbing) - - ----- -## Untargeted Poisoning Attack on Availability - - -Inject mislabeled training data to damage model quality - * 3% poisoning => 11% decrease in accuracy (Steinhardt, 2017) - -Attacker must have some access to the public or private training set - -**Q. Other examples?** - - -*Example: Anti-virus (AV) scanner: AV company (allegedly) poisoned competitor's model by submitting fake viruses* - -![](virus.png) - - - - ----- -## Targeted Poisoning Attacks on Integrity - -Insert training data with seemingly correct labels - -![](spam.jpg) - - -More targeted than availability attacks; causes specific misclassifications - - -_Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural -Networks_, Shafahi et al. (2018) - ----- -## Defense against Poisoning Attacks - - - - -__How would you mitigate poisoning attacks?__ - ----- -## Defense against Poisoning Attacks - -![](data-sanitization.png) - -Anomaly detection & data sanitization ----- -## Defense against Poisoning Attacks - - -Anomaly detection & data sanitization - * Identify and remove outliers in training set (see [data quality lecture](https://ckaestne.github.io/seai/F2020/slides/11_dataquality/dataquality.html#/3)) - * Identify and understand drift from telemetry - -Quality control over your training data - * Who can modify or add to my training set? Do I trust the data - source? *Model data flows and trust boundaries!* - * Use security mechanisms (e.g., authentication) and logging to - track data provenance - - -_Stronger Data Poisoning Attacks Break Data Sanitization Defenses_, -Koh, Steinhardt, and Liang (2018). - - - ----- -## Model Stealing Attacks - -![Bing stealing search results from Google](bing.png) - - - -Singel. [Google Catches Bing Copying; Microsoft Says 'So What?'](https://www.wired.com/2011/02/bing-copies-google/). Wired 2011. - - ----- -## Model Stealing Attacks - -Copy a model without direct access - --> Query model repeatedly and build surrogate model - -**Defenses?** - ----- -## Defending against Model Stealing Attacks - -Use model internally - -Rate limit API - -Abuse detection - -Inject artificial noise (vs. accuracy) - ---- -## Model Inversion against Confidentiality - - -Given a model output (e.g., name of a person), infer the -corresponding, potentially sensitive input (facial image of the -person) -* e.g., exploit model confidence values & search over input space - - - - -![](model-inversion-image.png) - - - - -_Model Inversion Attacks that Exploit Confidence Information and Basic -Countermeasures_, M. Fredrikson et al. in CCS (2015). - ----- -## Defense against Model Inversion Attacks - -![](differential-privacy-example.png) - -More noise => higher privacy, but also lower model accuracy! - ----- -## Defense against Model Inversion Attacks - -Limit attacker access to confidence scores -* e.g., reduce the precision of the scores by rounding them off -* But also reduces the utility of legitimate use of these scores! - -Differential privacy in ML -* Limit what attacker can learn about the model (e.g., parameters) - based on an individual training sample -* Achieved by adding noise to input or output (e.g., DP-SGD) -* More noise => higher privacy, but also lower model accuracy! - - -_Biscotti: A Ledger for Private and Secure Peer-to-Peer Machine -Learning_, M. Shayan et al., arXiv:1811.09904 (2018). - ----- -## Review: ML-Specific Attacks - -* Evasion attacks/adversarial examples (integrity violation) -* Targeted poisoning attacks (integrity violation) -* Untargeted poisoning attacks (availability violation) -* Model stealing attacks (confidentiality violation against model data) -* Model inversion attack (confidentiality violation against training data) - ----- -## Breakout: Dashcam System - - -Recall: Dashcam system from I2/I3 - -As a group, tagging members, post in `#lecture`: - * Security requirements - * Possible (ML) attacks on the system - * Possible mitigations against these attacks - - -![](dashcam-architecture.jpg) - - - - ----- -## State of ML Security - -![](arms-race.jpg) - - ----- -## State of ML Security - -On-going arms race (mostly among researchers) - * Defenses proposed & quickly broken by novel attacks - -Assume ML component is likely vulnerable - * Design your system to minimize impact of an attack - -Focus on protecting training and inference data access - -Remember: There may be easier ways to compromise the system - * e.g., poor security misconfiguration (default password), lack of - encryption, code vulnerabilities, etc., - - - - - - - ---- -# Threat Modeling - ----- -## Why Threat Model? - -![](gate.png) - ----- -## Threat model: A profile of an attacker - -* __Goal__: What is the attacker trying to achieve? -* __Capability__: - * Knowledge: What does the attacker know? - * Actions: What can the attacker do? - * Resources: How much effort can it spend? -* __Incentive__: Why does the attacker want to do this? - -![](art-of-war.png) - ----- -## Attacker Goal - -What is the attacker trying to achieve? -* Typically, undermine security requirements (recall C.I.A) - -Example: College admission -* Access other applicants' info without being authorized (C) -* Modify application status to “accepted” (I) -* Modify admissions model to reject certain applications (I) -* Cause website shutdown to sabotage other applicants (A) - ----- -## Attacker Capability - -![](admission-threat-model.jpg) - - -What actions are available to the attacker (to achieve its goal)? 
- * Depends on system boundary/interfaces exposed to external actors - * Use an architecture diagram to identify attack surface & actions - - - - - - - - - - - ----- -## STRIDE Threat Modeling - -![](stride.png) - - -A systematic approach to identifying attacks -* Construct an architectural diagram with components & connections -* Indicate trust boundaries -* For each untrusted connection, enumerate STRIDE threats -* For each potential threat, devise a mitigation strategy - - - -[More info: STRIDE approach](https://docs.microsoft.com/en-us/archive/msdn-magazine/2006/november/uncover-security-design-flaws-using-the-stride-approach) - ----- -## STRIDE: College Admission - -![](admission-threat-model.jpg) - - -* Spoofing: ? -* Tampering: ? -* Information disclosure: ? -* Denial of service: ? - ----- -## STRIDE: Example Threats - -![](admission-threat-model.jpg) - - -
- -* Spoofing: Attacker pretends to be another applicant by logging in -* Tampering: Attacker modifies applicant info using browser exploits -* Information disclosure: Attacker intercepts HTTP requests from/to - server to read applicant info -* Denial of service: Attacker creates a large number of bogus - accounts and overwhelms system with requests - -
---- -## STRIDE: Example Mitigations - -* Spoofing: Attacker pretends to be another applicant by logging in - * -> __Require two-factor authentication__ -* Tampering: Attacker modifies applicant info using browser exploits - * -> __Add server-side security tokens__ -* Information disclosure: Attacker intercepts HTTP requests from/to server to read applicant info - * -> __Use encryption (HTTPS)__ -* Denial of service: Attacker creates many bogus accounts and overwhelms system with requests - * -> __Limit requests per IP address__ - ----- -## Breakout: Threat Modeling - - -Again: Dashcam system from I2/I3 - -Using STRIDE, discuss & post: - * Data flow throughout the system - * Possible attacks on the system - - -![](dashcam-architecture.jpg) - - - ----- -## STRIDE & Other Threat Modeling Methods - -A systematic approach to identifying threats & attacker actions - -Limitations: - * May end up with a long list of threats, not all of them critical - * False sense of security: STRIDE does not imply completeness! - -Consider cost vs. benefit trade-offs -* Implementing mitigations adds - to development cost and complexity -* Focus on most critical/likely threats - - - - - ---- -# Designing for Security - ----- -## Security Mindset - -![](security-phone.jpg) - - -* Assume that all components may be compromised eventually -* Don't assume users will behave as expected; treat all inputs to the system as potentially malicious -* Aim for risk minimization, not perfect security - ----- -## Secure Design Principles - -*Minimize the impact of a compromised component* - * **Principle of least privilege:** A component is given only the minimal privileges needed to fulfill its functionality - * **Isolation/compartmentalization:** Components should interact with each other no more than necessary - * **Zero-trust infrastructure:** Components treat inputs from each other as potentially - malicious - -*Monitoring & detection* -* Identify data drift and unusual activity - ----- -## Monolithic Design - -![](monolithic1.png) - ----- -## Monolithic Design - -![](monolithic2.png) - - -Flaw in any part => Security impact on entire system! - ----- -## Compartmentalized Design - -![](component-design1.png) - ----- -## Compartmentalized Design - -![](component-design2.png) - -Flaw in one component => Limited impact on the rest of the system! - ----- -## Example: Vehicle Security - - -* Research project from UCSD: Remotely taking over vehicle control - * Create MP3 with malicious code & burn onto CD - * Play CD => send malicious commands to brakes, engine, locks... -* Problem: Over-privilege & lack of isolation! Shared CAN bus - -![](can-bus.png) - - - -_Comprehensive Experimental Analyses of Automotive Attack Surfaces_, Checkoway et al., in USENIX Security (2011). - ----- -## Secure Design Principles for ML - - - -*Principle of least privilege* -* Who has access to training data, model internals, system input & -output, etc.,? -* Does any user/stakeholder have more access than necessary? -* If so, limit access by using authentication mechanisms - - - -![](genericarchitecture.png) - - - ----- -## Secure Design Principles for ML - - - -*Isolation & compartmentalization* -* Can a security attack on one ML component (e.g., misclassification) - adversely affect other parts of the system? -* If so, compartmentalize or build in mechanisms to limit - impact (see lecture on mitigating mistakes) - - - -![](genericarchitecture.png) - - - - - ----- -## Secure Design Principles for ML - - - -*Monitoring & detection* - * Look for odd shifts in the dataset and clean the data if needed (for poisoning attacks) - * Treat all system input as potentially malicious & sanitize - (evasion attacks) - - - - -![](genericarchitecture.png) - - - - ---- -# AI for Security - ----- -[![Article: 30 COMPANIES MERGING AI AND CYBERSECURITY TO KEEP US SAFE AND SOUND](30aisec.png)](https://builtin.com/artificial-intelligence/artificial-intelligence-cybersecurity) - - ----- -## Many Defense Systems use ML - -* Classifiers to learn malicious content: Spam filters, virus detection -* Anomaly detection: Identify unusual/suspicious activity, e.g., credit card fraud, intrusion detection -* Game theory: Model attacker costs and reactions, design countermeasures -* Automate incident response and mitigation activities, DevOps -* Network analysis: Identify bad actors and their communication in public/intelligence data -* Many more, huge commercial interest - - -Recommended reading: Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "[Anomaly detection: A survey](http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection.pdf)." ACM computing surveys (CSUR) 41, no. 3 (2009): 1-58. - ----- -## AI Security Solutions are ML-Enabled Systems Too - -The ML component is one part of a larger system - -Consider the entire system, from training to telemetry, to user interface, to pipeline automation, to monitoring - -ML-based security solutions can be attacked themselves - ----- - -![Equifax logo](equifax.png) - -One contributing factor to the Equifax attack was an expired certificate for an intrusion detection system - - ---- -# ML & Data Privacy - ----- - -![Target headline](target-headline.png) - - -> Andrew Pole, who heads a 60-person team at Target that studies -customer behavior, boasted at a conference in 2010 about a proprietary -program that could identify women - based on their purchases and -demographic profile - who were pregnant. - - -Lipka. "[What Target knows about you](https://www.reuters.com/article/us-target-breach-datamining/what-target-knows-about-you-idUSBREA0M1JM20140123)". Reuters, 2014 - ----- - -![Big tech](big-tech.png) - - ----- -## Data Lakes - -![data lakes](data-lake.png) - - -*Who has access?* - ----- -## Data Privacy vs Utility - -![healthcare](healthcare.jpg) - - ----- -## Data Privacy vs Utility - -![covid-tracing](covid-tracing.png) - - ----- -## Data Privacy vs Utility - -![iphone](iphone-unlock.jpg) - - ----- -## Data Privacy vs Utility - -![cambridge-analytica](cambridge-analytica.jpg) - - -* ML can leverage data to greatly benefit individuals and -society -* Unrestrained collection & use of data can enable abuse and -harm! -* __Viewpoint__: Users should be given the ability to learn and control how their data is - collected and used - ----- -## Best Practices for ML & Data Privacy -
- -* Data collection & processing - * Only collect and store what you need - * Remove sensitive attributes, anonymize, or aggregate -* Training: Local, on-device processing if possible - * Federated learning -* Basic security practices - * Encryption & authentication - * Provenance: Track data sources and destinations -* Provide transparency to users - * Clearly explain what data is being collected and why -* Understand and follow the data protection regulations! - * e.g., General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), HIPAA (healthcare), FERPA (educational) - -
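To make "remove sensitive attributes, anonymize, or aggregate" concrete, a minimal pandas sketch; the table and column names are hypothetical placeholders, not from the lecture:

```python
import pandas as pd

# Hypothetical user table; columns are illustrative only
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cas", "Dee"],
    "zip": ["15213", "15213", "15217", "15217"],
    "age": [23, 27, 41, 45],
    "purchases": [3, 5, 2, 7],
})

df = df.drop(columns=["name"])  # remove direct identifiers

# generalize quasi-identifiers (coarser ZIP prefix, age buckets)
df["zip"] = df["zip"].str[:3] + "**"
df["age"] = pd.cut(df["age"], bins=[0, 30, 50], labels=["<=30", "31-50"])

# aggregate instead of storing individual rows
print(df.groupby(["zip", "age"], observed=True)["purchases"].mean())
```

As the anonymization slide below notes, such simple generalization is rarely sufficient on its own.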
- ----- -## Best Practices for ML & Data Privacy - -
- -* **Data collection & processing** - * Only collect and store what you need - * Remove sensitive attributes, anonymize, or aggregate -* Training: Local, on-device processing if possible - * Federated learning -* Basic security practices - * Encryption & authentication - * Provenance: Track data sources and destinations -* Provide transparency to users - * Clearly explain what data is being collected and why -* Understand and follow the data protection regulations! - * e.g., General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), HIPAA (healthcare), FERPA (educational) - -
- ---- -## Collect and store only what you need - -![data lakes](data-lake.png) - - -*Realistic when data is seen as valuable?* - ----- -## Data Anonymization is Hard - -* Simply removing explicit identifiers (e.g., name) is often not -enough - * {ZIP, gender, birth date} can identify 87% of Americans (L. Sweeney) -* k-anonymization: Identity-revealing data tuples appear in at least k rows - * Suppression: Replace certain values in columns with an asterisk - * Generalization: Replace individual values with broader categories - -![Anonymization](anonymization.png) - - - ----- -## Best Practices for ML & Data Privacy -
- -* Data collection & processing - * Only collect and store what you need - * Remove sensitive attributes, anonymize, or aggregate -* **Training: Local, on-device processing if possible** - * Federated learning -* Basic security practices - * Encryption & authentication - * Provenance: Track data sources and destinations -* Provide transparency to users - * Clearly explain what data is being collected and why -* Understand and follow the data protection regulations! - * e.g., General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), HIPAA (healthcare), FERPA (educational) - -
- ----- -## Federated Learning - -![Federated learning](federated-learning.png) - - -* Train a global model with local data stored across multiple -devices -* Local devices push only model updates, not the raw data -* But: - increased network communication and other security risks (e.g., - backdoor injection) - - -[ML@CMU blog post on federated learning](https://blog.ml.cmu.edu/2019/11/12/federated-learning-challenges-methods-and-future-directions/) - ----- -## Best Practices for ML & Data Privacy - -
- -* Data collection & processing - * Only collect and store what you need - * Remove sensitive attributes, anonymize, or aggregate -* Training: Local, on-device processing if possible - * Federated learning -* Basic security practices - * Encryption & authentication - * Provenance: Track data sources and destinations -* Provide transparency to users - * Clearly explain what data is being collected and why -* **Understand and follow the data protection regulations!** - * e.g., General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), HIPAA (healthcare), FERPA (educational) - -
- ----- -## General Data Protection Reg. (GDPR) - -* Introduced by the European Union (EU) in 2016 -* Organizations must state: - * What personal data is being collected & stored - * Purpose(s) for which the data will be used - * Other entities that the data will be shared with -* Organizations must receive explicit consent from users - * Each user must be provided with the ability to view, modify and delete any personal data -* Compliance & enforcement - * Complaints are filed against non-compliant organizations - * A failure to comply may result in heavy penalties! - ----- -## Privacy Consent and Control - -![Techcrunch privacy](techcrunch-privacy.png) - - ----- -## But Does Informed Consent Work? - -![Popup on German news site asking for consent to advertisement](spiegel.png) - - - ----- - -![Amazon gdpr](amazon-gdpr.png) - - ----- -## Summary: Best Practices for ML & Data Privacy - -> __Be ethical and responsible with user data!__ Think about potential -> harms to users & society, caused by (mis-)handling of personal data - -* Data collection & processing -* Training: Local, on-device processing if possible -* Basic security practices -* Provide transparency to users -* Understand and follow the data protection regulations! - ---- -# Summary - -* Security requirements: Confidentiality, integrity, availability -* Threat modeling to identify security req. & attacker capabilities -* ML-specific attacks on training data, telemetry, or the model - - Poisoning attack on training data to influence predictions - - Evasion attacks (adversarial learning) to shape input data - - Model inversion attacks for privacy violations -* Security design at the system level: least privilege, isolation -* AI can be used for defense (e.g. anomaly detection) -* __Key takeaway__: Adopt a security mindset! Assume all components may be vulnerable. Design system to reduce the impact of attacks. - ----- -## Further Readings - -
- -- Gary McGraw, Harold Figueroa, Victor Shepardson, and Richie Bonett. [An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning](https://berryvilleiml.com/docs/ara.pdf). Berryville Institute of Machine Learning (BIML), 2020 -- Meftah, Barmak. Business Software Assurance: Identifying and Reducing Software Risk in the Enterprise. 9th Semi-Annual Software Assurance Forum, Gaithersburg, Md., October 2008. -- Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust Physical-World Attacks on Deep Learning Visual Classification. In CVPR, 2018. -- Ian Goodfellow, Patrick McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7), 56-66. 2018. -- Tramèr, F., Kurakin, A., Papernot, N., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. arXiv, 2017 - -
diff --git a/lectures/22_security/spam.jpg b/lectures/22_security/spam.jpg deleted file mode 100644 index 64519ae2..00000000 Binary files a/lectures/22_security/spam.jpg and /dev/null differ diff --git a/lectures/22_security/spiegel.png b/lectures/22_security/spiegel.png deleted file mode 100644 index f19adbc7..00000000 Binary files a/lectures/22_security/spiegel.png and /dev/null differ diff --git a/lectures/22_security/stop-sign-attacks.png b/lectures/22_security/stop-sign-attacks.png deleted file mode 100644 index aa51fc94..00000000 Binary files a/lectures/22_security/stop-sign-attacks.png and /dev/null differ diff --git a/lectures/22_security/stop-sign.png b/lectures/22_security/stop-sign.png deleted file mode 100644 index f66dcc6a..00000000 Binary files a/lectures/22_security/stop-sign.png and /dev/null differ diff --git a/lectures/22_security/stride.png b/lectures/22_security/stride.png deleted file mode 100644 index f1fdd47a..00000000 Binary files a/lectures/22_security/stride.png and /dev/null differ diff --git a/lectures/22_security/target-headline.png b/lectures/22_security/target-headline.png deleted file mode 100644 index 800ed112..00000000 Binary files a/lectures/22_security/target-headline.png and /dev/null differ diff --git a/lectures/22_security/techcrunch-privacy.png b/lectures/22_security/techcrunch-privacy.png deleted file mode 100644 index 68c4e019..00000000 Binary files a/lectures/22_security/techcrunch-privacy.png and /dev/null differ diff --git a/lectures/22_security/virus.png b/lectures/22_security/virus.png deleted file mode 100644 index 5ce7b177..00000000 Binary files a/lectures/22_security/virus.png and /dev/null differ diff --git a/lectures/23_safety/IEC-process.png b/lectures/23_safety/IEC-process.png deleted file mode 100644 index 96b03263..00000000 Binary files a/lectures/23_safety/IEC-process.png and /dev/null differ diff --git a/lectures/23_safety/aurora-safety-case.png b/lectures/23_safety/aurora-safety-case.png deleted file mode 100644 index 39e49d7f..00000000 Binary files a/lectures/23_safety/aurora-safety-case.png and /dev/null differ diff --git a/lectures/23_safety/automation.jpg b/lectures/23_safety/automation.jpg deleted file mode 100644 index 598ec4a7..00000000 Binary files a/lectures/23_safety/automation.jpg and /dev/null differ diff --git a/lectures/23_safety/av-hype.png b/lectures/23_safety/av-hype.png deleted file mode 100644 index c90331c2..00000000 Binary files a/lectures/23_safety/av-hype.png and /dev/null differ diff --git a/lectures/23_safety/av-miles.jpg b/lectures/23_safety/av-miles.jpg deleted file mode 100644 index 16298d54..00000000 Binary files a/lectures/23_safety/av-miles.jpg and /dev/null differ diff --git a/lectures/23_safety/av-weird-cases.jpg b/lectures/23_safety/av-weird-cases.jpg deleted file mode 100644 index 7ca4a8ae..00000000 Binary files a/lectures/23_safety/av-weird-cases.jpg and /dev/null differ diff --git a/lectures/23_safety/decisionboundary2.svg b/lectures/23_safety/decisionboundary2.svg deleted file mode 100644 index 0c835a53..00000000 --- a/lectures/23_safety/decisionboundary2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/23_safety/energy.png b/lectures/23_safety/energy.png deleted file mode 100644 index 03721b76..00000000 Binary files a/lectures/23_safety/energy.png and /dev/null differ diff --git a/lectures/23_safety/facebookdivisive.png b/lectures/23_safety/facebookdivisive.png deleted file mode 100644 index 5b6839d2..00000000 Binary files 
a/lectures/23_safety/facebookdivisive.png and /dev/null differ diff --git a/lectures/23_safety/handwriting-transformation.png b/lectures/23_safety/handwriting-transformation.png deleted file mode 100644 index efaead07..00000000 Binary files a/lectures/23_safety/handwriting-transformation.png and /dev/null differ diff --git a/lectures/23_safety/mentalhealth.png b/lectures/23_safety/mentalhealth.png deleted file mode 100644 index 011ee2bf..00000000 Binary files a/lectures/23_safety/mentalhealth.png and /dev/null differ diff --git a/lectures/23_safety/moralityabtesting.png b/lectures/23_safety/moralityabtesting.png deleted file mode 100644 index 486ec96c..00000000 Binary files a/lectures/23_safety/moralityabtesting.png and /dev/null differ diff --git a/lectures/23_safety/movie-recommendation.png b/lectures/23_safety/movie-recommendation.png deleted file mode 100644 index be668df6..00000000 Binary files a/lectures/23_safety/movie-recommendation.png and /dev/null differ diff --git a/lectures/23_safety/nader-report.png b/lectures/23_safety/nader-report.png deleted file mode 100644 index 5e340e24..00000000 Binary files a/lectures/23_safety/nader-report.png and /dev/null differ diff --git a/lectures/23_safety/robinhood.png b/lectures/23_safety/robinhood.png deleted file mode 100644 index f1f338d7..00000000 Binary files a/lectures/23_safety/robinhood.png and /dev/null differ diff --git a/lectures/23_safety/robot-uprising.jpg b/lectures/23_safety/robot-uprising.jpg deleted file mode 100644 index d6f40c9e..00000000 Binary files a/lectures/23_safety/robot-uprising.jpg and /dev/null differ diff --git a/lectures/23_safety/safety.md b/lectures/23_safety/safety.md deleted file mode 100644 index 28ba7971..00000000 --- a/lectures/23_safety/safety.md +++ /dev/null @@ -1,840 +0,0 @@ ---- -author: Eunsuk Kang & Christian Kaestner -title: "MLiP: Safety" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - -
- -## Machine Learning in Production - -# Safety - - - - - - ---- -## Mitigating more mistakes... - -![Overview of course content](../_assets/overview.svg) - - - - ----- -## Reading - - -S. Mohseni et al., [Practical Solutions for Machine Learning Safety in Autonomous Vehicles](http://ceur-ws.org/Vol-2560/paper40.pdf). -SafeAI Workshop@AAAI (2020). - - - ---- -## Learning Goals - -* Understand safety concerns in traditional and AI-enabled systems -* Understand the importance of ML robustness for safety -* Apply hazard analysis to identify risks and requirements and understand their limitations -* Discuss ways to design systems to be safe against potential failures -* Suggest safety assurance strategies for a specific project -* Describe the typical processes for safety evaluations and their limitations - - - - - - - - - - - - - - - ---- -# System Safety - ----- -## Defining Safety - -Prevention of a system failure or malfunction that results in: - * Death or serious injury to people - * Loss or severe damage to equipment/property - * Harm to the environment or society - -Safety is a system concept - * Can't talk about software/ML being "safe"/"unsafe" on its own - * Safety is defined in terms of its effect on the **environment** - ----- -## Safety != Reliability - -Reliability = absence of defects, mean time between failures - -Safety = prevents accidents, harms - -Can build safe systems from unreliable components (e.g. redundancy, safeguards) - -System may be unsafe despite reliable components (e.g. stronger gas tank causes more severe damage in incident) - -Accuracy is usually about reliability! - - ----- -## Safety of AI-Enabled Systems - -
- -Notes: Systems can be unsafe in unexpected ways - ----- -## Safety of AI-Enabled Systems - -
- -Notes: Systems can be unsafe in unexpected ways - ----- -## Safety is a broad concept - -Not just physical harms/injuries to people - -Includes harm to mental health - -Includes polluting the environment, including noise pollution - -Includes harm to society, e.g. poverty, polarization - ----- -## Case Study: Self-Driving Car - -![](self-driving.jpeg) - ----- -## How did traditional vehicles become safer? - - - -![](nader-report.png) - - - - -National Traffic & Motor Vehicle Safety Act (1966): -* Mandatory design changes (head rests, shatter-resistant windshields, safety belts) -* Road improvements (center lines, reflectors, guardrails) -* Significant reduction (13-46%) in traffic fatalities - - - ----- -## Autonomous Vehicles: What's different? - -![](av-hype.png) - - -* In traditional vehicles, humans ultimately responsible for safety - * Built-in safety features (lane keeping, emergency braking) - * i.e., safety = human control + safety mechanisms -* Use of AI in autonomous vehicles: Perception, control, routing, -etc., - * Inductive training: No explicit requirements or design insights - * __Can ML achieve safe design solely through lots of data?__ - ----- -## Demonstrating Safety - -![](av-miles.jpg) - - -__Q. More miles tested => safer?__ - - ----- -## Challenge: Edge/Unknown Cases - -![](av-weird-cases.jpg) - - -* Gaps in training data; ML unlikely to cover all unknown cases -* Is this a unique problem for AI? What about humans? - - ----- -## Safety Engineering - -Safety Engineering: An engineering discipline which assures that engineered systems provide acceptable levels of safety. - -Typical safety engineering process: - * Identify relevant hazards & safety requirements - * Identify potential root causes for hazards - * For each hazard, develop a mitigation strategy - * Provide evidence that mitigations are properly implemented - - - ----- -## Improving Safety of ML-Enabled Systems - -Anticipate problems (hazard analysis, FTA, FMEA, HAZOP, ...) - -Anticipate the existence of unanticipated problems - -Plan for mistakes, design mitigations (recall earlier lecture!) -* Human in the loop -* Undoable actions, failsoft -* Guardrails -* Mistake detection -* Redundancy, ... - - ---- -# Demonstrating and Documenting Safety - - ----- -## Demonstrating Safety - -Two main strategies: - -1. **Evidence of safe behavior in the field** - * Extensive field trials - * Usually expensive -2. **Evidence of responsible (safety) engineering process** - * Process with hazard analysis, testing mitigations, etc. - * Not sufficient to assure safety - -Most standards require both - ----- -## Demonstrating Safety - -![](av-miles.jpg) - - -__How do we demonstrate to a third-party that our system is safe?__ - ----- -## Safety & Certification Standards - -
- -* Guidelines & recommendations for achieving an acceptable level of -safety -* Examples: DO-178C (airborne systems), ISO 26262 (automotive), IEC 62304 (medical -software), Common Criteria (security) -* Typically, **prescriptive & process-oriented** - * Recommends use of certain development processes - * Requirements specification, design, hazard analysis, testing, - verification, configuration management, etc., -* Limitations - * Most not designed to handle ML systems (exception: UL 4600) - * Costly to satisfy & certify, but effectiveness unclear (e.g., many - FDA-certified products recalled due to safety incidents) -* Good processes are important, but not sufficient; provides only indirect evidence for system safety - -
- ----- -## Certification Standards: Example - - -![IEC process](IEC-process.png) - - ----- -## Demonstrating Safety - -Two main strategies: - -1. **Evidence of safe behavior in the field** - * Extensive field trials - * Usually expensive -2. **Evidence of responsible (safety) engineering process** - * Process with hazard analysis, testing mitigations, etc - * Not sufficient to assure safety - -**Most standards require both, but often not sufficient!** - ----- -## Assurance (Safety) Cases - -* An emerging approach to demonstrating safety -* An explicit argument that a system achieves a desired safety -requirement, along with supporting evidence -* Structure: - * Argument: A top-level claim decomposed into multiple sub-claims - * Evidence: Testing, software analysis, formal verification, - inspection, expert opinions, design mechanisms... - ----- -## Documenting Safety with Assurance (Safety) Cases - -![](safetycase.svg) - - - ----- -## Assurance Cases: Example - -![](safetycase.svg) - - -Questions to think about: - * Do sub-claims imply the parent claim? - * Am I missing any sub-claims? - * Is the evidence strong enough to discharge a leaf claim? - ----- -## Assurance Cases: Example - -![](aurora-safety-case.png) - - -[Aurora Safety Case](https://aurora.tech/blog/aurora-unveils-first-ever-safety-case-framework) - - ----- -## Discussion: Assurance Case for Recommender - -![](movie-recommendation.png) - - -How would you argue that your recommendation system -provides at least 95% availability? What evidence would you provide? - ----- -## Assurance Cases: Benefits & Limitations - -
- -* Provides an explicit structure to the safety argument - * Easier to navigate, inspect, and refute for third-party auditors - * Provides traceability between system-level claims & - low-level evidence - * Can also be used for other types of system quality (security, - reliability, etc.,) -* Challenges and pitfalls - * Informal links between claims & evidence, e.g., Do the sub-claims actually imply the top-level claim? - * Effort in constructing the case & evidence: How much evidence is enough? - * System evolution: If the system changes, must reproduce the case & evidence -* Tools for building & analyzing safety cases available - * e.g., [ASCE/GSN](https://www.adelard.com/gsn.html) from Adelard - * But ultimately, can't replace domain knowledge & critical thinking - -
- - - - - - - - - - - - ---- -# Robustness for ML-based Systems - ----- -## Robustness - -Environment sometimes __deviates__ from expected, normal conditions -- Extreme weather, unexpected obstacles, etc., -- Erratic user behaviors; unusually high service demand -- Adversarial actors; users trying to game your system, etc., - -Does your system work reasonably well under these deviations? i.e., is -it _robust_? - -Most safety-critical systems require some level of robustness -- Not enough to show that the system is safe in normal conditions - ----- -## Defining Robustness for ML: - -* A prediction for input $x$ is robust if the outcome is stable under -minor perturbations to the input: - - $\forall x'. d(x, x')<\epsilon \Rightarrow f(x) = f(x')$ - - distance function $d$ and permissible distance $\epsilon$ depend - on the problem domain! -* A model is said to be robust if most predictions are robust -* An important concept in safety and security settings - * In safety, perturbations tend to be random or predictable (e.g., - sensor noise due to weather conditions) - * In security, perturbations are intentionally crafted (e.g., - adversarial attacks) - ----- -## Robustness and Distance for Images - -+ Slight rotation, stretching, or other transformations -+ Change many pixels minimally (below human perception) -+ Change only a few pixels -+ Change most pixels mostly uniformly, e.g., brightness - -![Handwritten number transformation](handwriting-transformation.png) - - - -Image: [_An abstract domain for certifying neural networks_](https://dl.acm.org/doi/pdf/10.1145/3290354). - Gagandeep Singh et al., POPL (2019). - - ----- -## No Model is Fully Robust - -* Every useful model has at least one decision boundary -* Predictions near that boundary are not (and should not be) robust - -![Decision boundary](decisionboundary2.svg) - - ----- -## Robustness of Interpretable Models - -Is this model robust? - -Is the prediction for a 20-year-old male with 2 priors robust? Against -what perturbations? - -```fortran -IF age between 18–20 and sex is male THEN predict arrest -ELSE -IF age between 21–23 and 2–3 prior offenses THEN predict arrest -ELSE -IF more than three priors THEN predict arrest -ELSE predict no arrest -``` - ----- -## Evaluating ML Robustness -
-
-
-* Lots of ongoing research (especially for DNNs)
-* Formal verification
-    - Constraint solving or abstract interpretation over computations in neuron activations
-    - Conservative abstraction, may label robust inputs as not robust
-    - Currently not very scalable
-    - Example: [_An abstract domain for certifying neural networks_](https://dl.acm.org/doi/pdf/10.1145/3290354).
-      Singh et al., POPL (2019).
-* Sampling
-    - Sample within distance, compare prediction to majority prediction (see the sketch below)
-    - Probabilistic guarantees possible (with many queries, e.g., 100k)
-    - Example:
-      [_Certified adversarial robustness via randomized smoothing_](https://arxiv.org/abs/1902.02918). Cohen,
-      Rosenfeld, and Kolter, ICML (2019).
-
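-A minimal sketch of the sampling idea, assuming a classifier with an sklearn-style `model.predict` over numeric feature vectors and an L2 distance (function and parameter names are illustrative; a certified method like randomized smoothing needs far more queries plus a statistical test):
-
-```python
-import numpy as np
-
-def is_probably_robust(model, x, epsilon, n_samples=1000):
-    """Sample perturbations within an L2 ball of radius epsilon around x
-    and check whether the prediction stays stable."""
-    x = np.asarray(x, dtype=float)
-    rng = np.random.default_rng(0)
-    # Random directions, scaled to radii within epsilon
-    noise = rng.normal(size=(n_samples, x.shape[0]))
-    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
-    radii = rng.uniform(0.0, epsilon, size=(n_samples, 1))
-    samples = x + noise * radii
-    original = model.predict(x.reshape(1, -1))[0]
-    # Fraction of sampled neighbors that agree with the original prediction
-    agreement = np.mean(model.predict(samples) == original)
-    return agreement > 0.99
-```
-
-A non-robust result is not necessarily a bug (inputs near a real decision boundary should flip), but flagged inputs can be handled differently at inference time, as discussed below.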
-
-----
-## ML Robustness: Limitations
-
-* Lots of ongoing research (especially for DNNs)
-* Mostly input-centric, focusing on small ($\epsilon$) perturbations
-  * Common use case: Robustness against adversarial attacks
-  * Q. But do these perturbations matter for safety?
-* In practice: Perturbations result from environmental changes!
-  * Which parts of the world does the software sense?
-  * Can those parts change over time? Can the sensors be noisy,
-    faulty, etc.? (these are **domain-specific**)
-  * What input perturbations could be caused by these changes/noise?
-
-----
-## Robustness in a Safety Setting
-
-* Does the model detect stop signs under normal conditions?
-* Does the model detect stop signs under deviations?
-  * __Q. What deviations do we care about?__
-
-![Stop Sign](stop-sign.png)
-
-
-----
-## Robustness in a Safety Setting
-
-* Does the model detect stop signs under normal conditions?
-* Does the model detect stop signs under deviations?
-  * Poor lighting? In fog? With a tilted camera? Sensor noise?
-  * With stickers taped to the sign? (Does it matter?)
-
-![Stop Sign](stop-sign-adversarial.png)
-
-
-
-Image: David Silver. [Adversarial Traffic Signs](https://medium.com/self-driving-cars/adversarial-traffic-signs-fd16b7171906). Blog post, 2017
-
-----
-## Improving Robustness for Safety
-
-Q. How do we make ML-based systems more robust?
-
-
-
-----
-## Improving Robustness for Safety
-
-![](weather-conditions.png)
-
-
-Learn more robust models
-  - Test/think about domain-specific deviations that might result in
-    perturbations to model input (e.g., fog, snow, sensor noise)
-  - Curate data for those abnormal scenarios or augment training data with transformed inputs
-
-
-_Automated driving recognition technologies for adverse weather
-conditions._ Yoneda et al. (2019).
-
-
-----
-## Improving Robustness for Safety
-
-![](sensor-fusion.jpeg)
-
-
-Design mechanisms
-  - Deploy redundant components for critical tasks (e.g., vision + map)
-  - Ensemble learning: Combine models with different biases
-  - Multiple, independent sensors (e.g., LiDAR + radar + cameras)
-
-----
-## Improving Robustness for Safety
-
-Design mechanisms
-  - Deploy redundant components for critical tasks (e.g., vision + map)
-  - Ensemble learning: Combine models with different biases
-  - Multiple, independent sensors (e.g., LiDAR + radar + cameras)
-
-Robustness checking at inference time
-  - Handle inputs with non-robust predictions differently
-    (e.g., discard them or output a low confidence score for outliers)
-  - Downside: Raises cost of prediction; may not be suitable
-    for time-sensitive applications (e.g., self-driving cars)
-
-
-----
-## Breakout: Robustness
-
-Scenario: Medical use of transcription service, dictate diagnoses and prescriptions
-
-As a group, tagging members, post to `#lecture`:
-
-> 1. What safety concerns can you anticipate?
-> 2. What deviations are you concerned about?
-> 3. How would you improve the robustness of the overall system?
-
-
-
-
-
-
-
-
-
-
----
-# AI Safety
-
-![Robot uprising](robot-uprising.jpg)
-
-
-Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "[Concrete problems in AI safety](https://arxiv.org/abs/1606.06565)." arXiv preprint arXiv:1606.06565 (2016).
-
-----
-## Your Favorite AI Dystopia?
-
-
-
-----
-## The AI Alignment Problem
-
-AI is optimized for a specific objective/cost function
-  * May inadvertently cause undesirable effects on the environment
-  * e.g., [Transport robot](https://www.youtube.com/watch?v=JzlsvFN_5HI): Move a box to a specific destination
-  * Side effects: Scratch furniture, bump into humans, etc.
-
-Side effects may cause ethical/safety issues (e.g., social media optimizing for clicks, causing teen depression)
-
-Difficult to define sensible fitness functions:
-  * Perform X *subject to common-sense constraints on the
-    environment*
-  * Perform X *but avoid side effects to the extent
-    possible*
-
-
-
-----
-## Reward Hacking
-
-> PlayFun algorithm pauses the game of Tetris indefinitely to avoid losing
-
-> When about to lose a hockey game, the PlayFun algorithm exploits a bug to make one of the players on the opposing team disappear from the map, thus forcing a draw.
-
-> Self-driving car rewarded for speed learns to spin in circles
-
-[Example: Coast Runner](https://www.youtube.com/watch?v=tlOIHko8ySg)
-
-----
-## Reward Hacking
-
-* AI can be good at finding loopholes to achieve a goal in unintended ways
-* Technically correct, but does not follow the *designer's informal intent*
-* Many possible causes, incl. partially observed goals, abstract rewards, feedback loops
-* In general, a very challenging problem!
-  * Difficult to specify goal & reward function to avoid all
-    possible hacks
-  * Requires careful engineering and iterative reward design
-
-
-Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "[Concrete problems in AI safety](https://arxiv.org/abs/1606.06565)." arXiv preprint arXiv:1606.06565 (2016).
-
-----
-## Reward Hacking -- Many Examples
-
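-A toy sketch of why rewards are so easy to game (entirely illustrative; the functions and state fields are made up for this example):
-
-```python
-# Intent: reward progress toward the finish line.
-# A naive proxy rewards speed, which an agent can maximize
-# by spinning in circles -- technically correct, unintended.
-def naive_reward(state):
-    return state["speed"]
-
-# Less gameable: measure actual progress and penalize known
-# side effects. Still only a proxy -- new loopholes may remain.
-def better_reward(state):
-    return state["track_progress"] - 0.1 * state["collisions"]
-```
-
-Even the "better" reward only shifts the problem: every proxy invites new loopholes, which is why careful, iterative reward design is needed.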
-
-----
-## Exploiting Human Weakness
-
-[![Dark side of A/B testing story](moralityabtesting.png)](https://techcrunch.com/2014/06/29/ethics-in-a-data-driven-world/)
-
-----
-## Exploiting Human Weakness
-
-![The Social Dilemma movie poster](socialdilemma.webp)
-
-
-See also [Center for Humane Technology](https://www.humanetech.com/)
-
-----
-## AI Alignment Problem = Requirements Problem
-
-Recall: "World vs. machine"
-* Identify stakeholders in the environment & possible effects on them
-* Anticipate side effects, feedback loops
-* Constrain the scope of the system
-* Perfect contracts usually infeasible, undesirable
-
-But more requirements engineering is unlikely to be the only solution
-
-
-----
-## Other Challenges
-
- -* Safe Exploration - - Exploratory actions "in production" may have consequences - - e.g., trap robots, crash drones -* Robustness to Drift - - Drift may lead to poor performance that may not even be recognized -* Scalable Oversight - - Cannot provide human oversight over every action (or label all possible training data) - - Use indirect proxies in telemetry to assess success/satisfaction - -
-
-
-Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "[Concrete problems in AI safety](https://arxiv.org/abs/1606.06565)." arXiv preprint arXiv:1606.06565 (2016).
-
-----
-## Existential AI Risk
-
-Existential risk and AI alignment are common research topics
-
-Funding through the *longtermism* branch of effective altruism *(Longtermism is the view that positively influencing the long-term future is a key moral priority of our time.)*
-
-Ord estimates a 10% existential risk from unaligned AI within the next 100 years
-
-**Our view:** AI alignment is not a real concern for the kind of ML-enabled products we consider here
-
-
-Ord, Toby. The Precipice: Existential Risk and the Future of Humanity. Hachette Books, 2020.
-
-Note: Relevant for reinforcement learning and AGI
-
-----
-## More Pressing AI Risks?
-
-![TechCrunch article](techcrunch.png)
-
-
-> “Those hypothetical risks are the focus of a dangerous ideology
-called longtermism that ignores the actual harms resulting from the
-deployment of AI systems today,” they wrote, citing worker
-exploitation, data theft, synthetic media that props up existing power
-structures and the further concentration of those power structures in
-fewer hands.
-
-
-
-----
-## Practical Alignment Problems
-
-Does the model goal align with the system goal? Does the system goal align with the user's goals?
-* Profits (max. accuracy) vs fairness
-* Engagement (ad sales) vs enjoyment, mental health
-* Accuracy vs operating costs
-
-Test model *and* system quality *in production*
-
-(see requirements engineering and architecture lectures)
-
-
-
-
-
----
-# Beyond Traditional Safety-Critical Systems
-
-----
-## Beyond Traditional Safety-Critical Systems
-
-* Recall: Legal vs ethical
-* Safety analysis not only for regulated domains (nuclear power plants, medical devices, planes, cars, ...)
-* Many end-user applications have a safety component
-
-**Q. Examples?**
-
-
-----
-## Mental Health
-
-[![Social Media vs Mental Health](mentalhealth.png)](https://www.healthline.com/health-news/social-media-use-increases-depression-and-loneliness)
-
-
-----
-## IoT
-
-![Servers down](serversdown.png)
-
-
-
-----
-## Addiction
-
-[![Blog: Robinhood Has Gamified Online Trading Into an Addiction](robinhood.png)](https://marker.medium.com/robinhood-has-gamified-online-trading-into-an-addiction-cc1d7d989b0c)
-
-
-----
-## Society: Unemployment Engineering / Deskilling
-
-![Automated food ordering system](automation.jpg)
-
-Notes: The dangers and risks of automating jobs.
-
-Discuss issues around automated truck driving and the role of jobs.
-
-See for example: Andrew Yang. The War on Normal People. 2019
-
-
-----
-## Society: Polarization
-
-[![Article: Facebook Executives Shut Down Efforts to Make the Site Less Divisive](facebookdivisive.png)](https://www.wsj.com/articles/facebook-knows-it-encourages-division-top-executives-nixed-solutions-11590507499)
-
-
-
-Notes: Recommendations for further readings: https://www.nytimes.com/column/kara-swisher, https://podcasts.apple.com/us/podcast/recode-decode/id1011668648
-
-Also isolation, Cambridge Analytica, collaboration with ICE, ...
-
-----
-## Environmental: Energy Consumption
-
-[![Article: Creating an AI can be five times worse for the planet than a car](energy.png)](https://www.newscientist.com/article/2205779-creating-an-ai-can-be-five-times-worse-for-the-planet-than-a-car/)
-
-
-----
-## Exercise
-
-*Look at apps on your phone. Which apps have a safety risk and use machine learning?*
-
-Consider safety broadly: including stress, mental health, discrimination, and environmental pollution
-
-
-
-
-----
-## Takeaway
-
-* Many systems have safety concerns
-* ... not just nuclear power plants, planes, cars, and medical devices
-* Do the right thing, even without regulation
-* Consider safety broadly: including stress, mental health, discrimination, and environmental pollution
-* Start with requirements and hazard analysis
-
-
-
-
-
-
----
-# Designing for Safety
-
-See Lecture **Planning for Mistakes**
-
-----
-## Safety Assurance with ML Components
-
-* Consider ML components as unreliable, with at most probabilistic guarantees
-* Testing, testing, testing (+ simulation)
-  - Focus on data quality & robustness
-* *Adopt a system-level perspective!*
-* Consider safe system design with unreliable components
-  - Traditional systems and safety engineering
-  - Assurance cases
-* Understand the problem and the hazards
-  - System level, goals, hazard analysis, world vs machine
-  - Specify *end-to-end system behavior* if feasible
-
-
-
-
-
-
-
-
----
-# Summary
-
-* Defining safety: absence of harm to people, property, and environment -- consider broadly; safety != reliability
-* *Adopt a safety mindset!*
-* Assume all components will eventually fail in one way or another, especially ML components
-* Hazard analysis to identify safety risks and requirements; classic
-safety design at the system level
-* Robustness: Identify & address relevant deviations
-* AI alignment: AI goals are difficult to specify precisely; susceptible to negative
-  side effects & reward hacking
-
-----
-## Further Readings
-
- -* Borg, Markus, Cristofer Englund, Krzysztof Wnuk, Boris Duran, Christoffer Levandowski, Shenjian Gao, Yanwen Tan, Henrik Kaijser, Henrik Lönn, and Jonas Törnqvist. “[Safely entering the deep: A review of verification and validation for machine learning and a challenge elicitation in the automotive industry](https://www.atlantis-press.com/journals/jase/125905766).” Journal of Automotive Software Engineering. 2019 -* Leveson, Nancy G. [Engineering a safer world: Systems thinking applied to safety](https://direct.mit.edu/books/book/2908/Engineering-a-Safer-WorldSystems-Thinking-Applied). The MIT Press, 2016. -* Salay, Rick, and Krzysztof Czarnecki. “[Using machine learning safely in automotive software: An assessment and adaption of software process requirements in ISO 26262](https://arxiv.org/pdf/1808.01614).” arXiv preprint arXiv:1808.01614 (2018). -* Mohseni, Sina, Mandar Pitale, Vasu Singh, and Zhangyang Wang. “[Practical Solutions for Machine Learning Safety in Autonomous Vehicles](https://arxiv.org/abs/1912.09630).” SafeAI workshop at AAAI’20, (2020). -* Huang, Xiaowei, Daniel Kroening, Wenjie Ruan, James Sharp, Youcheng Sun, Emese Thamo, Min Wu, and Xinping Yi. “[A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability](https://arxiv.org/abs/1812.08342).” Computer Science Review 37 (2020). -* Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "[Concrete problems in AI safety](https://arxiv.org/pdf/1606.06565.pdf)." arXiv preprint arXiv:1606.06565 (2016). - -
diff --git a/lectures/23_safety/safetycase.svg b/lectures/23_safety/safetycase.svg deleted file mode 100644 index a9392bae..00000000 --- a/lectures/23_safety/safetycase.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/23_safety/self-driving.jpeg b/lectures/23_safety/self-driving.jpeg deleted file mode 100644 index 3748f26d..00000000 Binary files a/lectures/23_safety/self-driving.jpeg and /dev/null differ diff --git a/lectures/23_safety/sensor-fusion.jpeg b/lectures/23_safety/sensor-fusion.jpeg deleted file mode 100644 index dbfaaf76..00000000 Binary files a/lectures/23_safety/sensor-fusion.jpeg and /dev/null differ diff --git a/lectures/23_safety/serversdown.png b/lectures/23_safety/serversdown.png deleted file mode 100644 index 3d41b0a9..00000000 Binary files a/lectures/23_safety/serversdown.png and /dev/null differ diff --git a/lectures/23_safety/socialdilemma.webp b/lectures/23_safety/socialdilemma.webp deleted file mode 100644 index 47a1e16b..00000000 Binary files a/lectures/23_safety/socialdilemma.webp and /dev/null differ diff --git a/lectures/23_safety/stop-sign-adversarial.png b/lectures/23_safety/stop-sign-adversarial.png deleted file mode 100644 index c79808b3..00000000 Binary files a/lectures/23_safety/stop-sign-adversarial.png and /dev/null differ diff --git a/lectures/23_safety/stop-sign.png b/lectures/23_safety/stop-sign.png deleted file mode 100644 index 1a795719..00000000 Binary files a/lectures/23_safety/stop-sign.png and /dev/null differ diff --git a/lectures/23_safety/techcrunch.png b/lectures/23_safety/techcrunch.png deleted file mode 100644 index ef9a63ef..00000000 Binary files a/lectures/23_safety/techcrunch.png and /dev/null differ diff --git a/lectures/23_safety/weather-conditions.png b/lectures/23_safety/weather-conditions.png deleted file mode 100644 index 6e31ce4d..00000000 Binary files a/lectures/23_safety/weather-conditions.png and /dev/null differ diff --git a/lectures/24_teams/congruence.svg b/lectures/24_teams/congruence.svg deleted file mode 100644 index 08bd7784..00000000 --- a/lectures/24_teams/congruence.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/24_teams/connectedteams.svg b/lectures/24_teams/connectedteams.svg deleted file mode 100644 index fca4e4f6..00000000 --- a/lectures/24_teams/connectedteams.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/24_teams/devops.png b/lectures/24_teams/devops.png deleted file mode 100644 index 3abb34d2..00000000 Binary files a/lectures/24_teams/devops.png and /dev/null differ diff --git a/lectures/24_teams/drivebook.gif b/lectures/24_teams/drivebook.gif deleted file mode 100644 index 27490821..00000000 Binary files a/lectures/24_teams/drivebook.gif and /dev/null differ diff --git a/lectures/24_teams/groupthink.png b/lectures/24_teams/groupthink.png deleted file mode 100644 index 074ce7e9..00000000 Binary files a/lectures/24_teams/groupthink.png and /dev/null differ diff --git a/lectures/24_teams/loafinggraph.png b/lectures/24_teams/loafinggraph.png deleted file mode 100644 index 9f94ce1a..00000000 Binary files a/lectures/24_teams/loafinggraph.png and /dev/null differ diff --git a/lectures/24_teams/matrix-only.svg b/lectures/24_teams/matrix-only.svg deleted file mode 100644 index 5683c9ff..00000000 --- a/lectures/24_teams/matrix-only.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/24_teams/mmmbook.jpg b/lectures/24_teams/mmmbook.jpg deleted file mode 100644 index 
2591f058..00000000 Binary files a/lectures/24_teams/mmmbook.jpg and /dev/null differ diff --git a/lectures/24_teams/pnc.jpg b/lectures/24_teams/pnc.jpg deleted file mode 100644 index e0f8a636..00000000 Binary files a/lectures/24_teams/pnc.jpg and /dev/null differ diff --git a/lectures/24_teams/project-only.svg b/lectures/24_teams/project-only.svg deleted file mode 100644 index 8621580e..00000000 --- a/lectures/24_teams/project-only.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/24_teams/roles_venn.svg b/lectures/24_teams/roles_venn.svg deleted file mode 100644 index b2df1fbf..00000000 --- a/lectures/24_teams/roles_venn.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/24_teams/scattering.png b/lectures/24_teams/scattering.png deleted file mode 100644 index 7d2e2881..00000000 Binary files a/lectures/24_teams/scattering.png and /dev/null differ diff --git a/lectures/24_teams/spotify.webp b/lectures/24_teams/spotify.webp deleted file mode 100644 index 8e2e8c2f..00000000 Binary files a/lectures/24_teams/spotify.webp and /dev/null differ diff --git a/lectures/24_teams/team-collaboration1.png b/lectures/24_teams/team-collaboration1.png deleted file mode 100644 index c9246c44..00000000 Binary files a/lectures/24_teams/team-collaboration1.png and /dev/null differ diff --git a/lectures/24_teams/team-collaboration2.png b/lectures/24_teams/team-collaboration2.png deleted file mode 100644 index a855d480..00000000 Binary files a/lectures/24_teams/team-collaboration2.png and /dev/null differ diff --git a/lectures/24_teams/team1200.jpg b/lectures/24_teams/team1200.jpg deleted file mode 100644 index 5ae2fe82..00000000 Binary files a/lectures/24_teams/team1200.jpg and /dev/null differ diff --git a/lectures/24_teams/team15.jpg b/lectures/24_teams/team15.jpg deleted file mode 100644 index c185f772..00000000 Binary files a/lectures/24_teams/team15.jpg and /dev/null differ diff --git a/lectures/24_teams/team200.png b/lectures/24_teams/team200.png deleted file mode 100644 index 92778882..00000000 Binary files a/lectures/24_teams/team200.png and /dev/null differ diff --git a/lectures/24_teams/team4.jpg b/lectures/24_teams/team4.jpg deleted file mode 100644 index 01492d40..00000000 Binary files a/lectures/24_teams/team4.jpg and /dev/null differ diff --git a/lectures/24_teams/team50.jpg b/lectures/24_teams/team50.jpg deleted file mode 100644 index be27f48a..00000000 Binary files a/lectures/24_teams/team50.jpg and /dev/null differ diff --git a/lectures/24_teams/teams.md b/lectures/24_teams/teams.md deleted file mode 100644 index f3a909a6..00000000 --- a/lectures/24_teams/teams.md +++ /dev/null @@ -1,912 +0,0 @@ ---- -author: Christian Kaestner and Eunsuk Kang -title: "MLiP: Fostering Interdisciplinary Teams" -semester: Spring 2023 -footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023" -license: Creative Commons Attribution 4.0 International (CC BY 4.0) ---- - - -
-
-## Machine Learning in Production
-
-# Fostering Interdisciplinary Teams
-
-
----
-## Administrativa
-
-Final presentations, May 4, 5:30pm-8pm, TEP 1403
-* 8 min, make it interesting
-* Teams randomly selected (volunteers welcome)
-* Snacks?
-* Teams who do not present live are asked to record and share a link to Zoom/Box.com/YouTube on Slack
-
-
-
----
-## One last crosscutting topic
-
-![Overview of course content](../_assets/overview.svg)
-
-
-
-
-----
-## Readings
-
-
-Nahar, Nadia, Shurui Zhou, Grace Lewis, and Christian Kästner. "[Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process](https://arxiv.org/abs/2110.10234)." In International Conf. Software Engineering, 2022.
-
-
-
-----
-
-## Learning Goals
-
-* Understand different roles in projects for AI-enabled systems
-* Plan development activities in an inclusive fashion for participants in different roles
-* Diagnose and address common teamwork issues
-* Describe agile techniques to address common process and communication issues
-
-
----
-# Case Study: Depression Prognosis on Social Media
-
-![TikTok logo](tiktok.jpg)
-
-
-----
-## The Project
-
-* Social media company of about 15,000 employees, 500 developers and data scientists in the US
-* Use sentiment analysis on video data (and transcripts) to detect depression
-* Planned interventions through recommending different content and showing ads for getting support, design for small-group features
-* Collaboration with mental health professionals and ML researchers at a top university
-
-
----
-
-<!-- Venn diagram: Data Scientists vs. Software Engineers -->
-
-----
-
-
-## Data scientist
-
-* Often fixed dataset for training and evaluation (e.g., PBS interviews)
-* Focused on accuracy
-* Prototyping, often Jupyter notebooks or similar
-* Expert in modeling techniques and feature engineering
-* Model size, updateability, implementation stability typically do not matter
-
-
-
-## Software engineer
-
-* Builds a product
-* Concerned about cost, performance, stability, release time
-* Identify quality through customer satisfaction
-* Must scale solution, handle large amounts of data
-* Detect and handle mistakes, preferably automatically
-* Maintain, evolve, and extend the product over long periods
-* Consider requirements for security, safety, fairness
-
-
-
-
-----
-## Continuum of Skills
-
-* Software Engineer
-* Data Engineer
-* Data Scientist
-* Applied Scientist
-* Research Scientist
-
-
-
-Talk: Ryan Orban. [Bridging the Gap Between Data Science & Engineer: Building High-Performance Teams](https://www.slideshare.net/ryanorban/bridging-the-gap-between-data-science-engineer-building-highperformance-teams/3-Software_Engineer_Data_Engineer_Data). 2016
-
-----
-![Unicorn](unicorn.jpg)
-
-
-
-----
-
-![Roles Venn Diagram](roles_venn.svg)
-
-
-
-By Steven Geringer, via Ryan Orban. [Bridging the Gap Between Data Science & Engineer: Building High-Performance Teams](https://www.slideshare.net/ryanorban/bridging-the-gap-between-data-science-engineer-building-highperformance-teams/3-Software_Engineer_Data_Engineer_Data). 2016
-
-
-
-----
-## Many Role Descriptions
-
-* Product Data Analyst (feature analysis)
-* Business Intelligence, Analytics & Reporting (marketing)
-* Modeling Analyst (financial forecasting)
-* Machine Learning Engineer (user-facing applications)
-* Hybrid Data Engineer/Data Scientist (data pipelining)
-* Hybrid Data Visualization Expert (communication, storytelling)
-* Data Science Platforms & Tools Developer (supporting role)
-
-
-e.g., Yorgos Askalidis. [Demystifying data science roles](https://towardsdatascience.com/what-kind-of-data-science-role-is-right-for-you-9d2f4b117e81). 2019
-
-
-----
-## Evolution of Data Science Roles
-
-
-
-*More or less engineering focus? More or less statistics focus? ...*
-
-----
-## Software Engineering Specializations
-
-* Architects
-* Requirements engineers
-* Testers
-* Site reliability engineers
-* DevOps
-* Safety
-* Security
-* UIX
-* Distributed systems, cloud
-* ...
-
-
-----
-## Needed Roles in Depression Prognosis Projects?
-
-![TikTok logo](tiktok.jpg)
-
-----
-## Common Other Roles in ML-Enabled Systems?
-
-* **Domain specialists**
-* Business, management, marketing
-* Project management
-* Designers, UI experts
-* Operations
-* Safety, security specialist
-* Big data specialist
-* Lawyers
-* Social scientists, ethics
-* ...
-
-
-
-
-
-
-
-
-
----
-# Interdisciplinary Teams
-
-----
-## Unicorns -> Teams
-
-* Domain experts
-* Data scientists
-* Software engineers
-* Operators
-* Business leaders
-
-
-----
-## Necessity of Groups
-
-* Division of labor
-* Division of expertise (e.g., security expert, ML expert, data cleaning expert, database expert)
-
-----
-## Team Issues Discussed Today
-
-* Process costs
-* (Groupthink)
-* (Social loafing)
-* Multiple/conflicting goals
-
-
-
-
-
-
-
-
-
----
-# Team Issue:
-# Process Costs
-
-----
-
-# Case Studies
-
-Disclaimer: All pictures represent abstract developer groups or products to give a sense of scale; they are not necessarily the developers of those products or developers at all.
-
-----
-
-## How to structure teams?
-
-Microblogging platform; 3 friends
-
-
-![Small team](team4.jpg)
-
-![Twitter](twitter.png)
-
-
-----
-## How to structure teams?
-
-Banking app; 15 developers and data analysts
-
-
-![Small team](team15.jpg)
-
-*(Instagram had 13 employees when they were bought for $1B in 2012)*
-
-![PNC app](pnc.jpg)
-
-
-
-
-
-----
-## How to structure teams?
-
-Mobile game;
-50ish developers?
-
-![Team 50](team50.jpg)
-
-
-----
-## How to structure teams?
-
-Mobile game;
-200ish developers;
-distributed teams?
-
-![Team 200](team200.png)
-
-----
-## How to structure teams?
-
-Self-driving cars; 1200 developers and data analysts
-
-
-![Large team](team1200.jpg)
-
-![Uber self driving car](uber.png)
-
-
-
-----
-## Mythical Man-Month
-
-![](mmmbook.jpg)
-
-> Brooks's law: Adding manpower to a late software project makes it later
-
-
-1975, describing experience at
-IBM developing OS/360
-
-----
-## Process Costs
-
-![](connectedteams.svg)
-
-
-*n(n − 1) / 2* communication links within a team (e.g., 4 people have 6 links; 50 people already have 1,225)
-
-----
-## Brooks's Surgical Teams
-
- -* Chief programmer – most programming and initial documentation -* Support staff - * Copilot: supports chief programmer in development tasks, represents team at meetings - * Administrator: manages people, hardware and other resources - * Editor: editing documentation - * Two secretaries: one each for the administrator and editor - * Program clerk: keeps records of source code and documentation - * Toolsmith: builds specialized programming tools - * Tester: develops and runs tests - * Language lawyer: expert in programming languages, provides advice on producing optimal code. - -
-
-
-
-Brooks. The Mythical Man-Month. 1975
-
-Note: Would assume unicorns in today's context.
-
-----
-## Microsoft's Small Team Practices
-
-* Vision statement and milestones (2-4 months), no formal spec
-* Feature selection, prioritized by market, assigned to milestones
-* Modular architecture
-* Allows small federated teams (Conway's law)
-* Small teams of overlapping functional specialists
-
-(Windows 95: 200 developers and testers, one of 250 products)
-
-----
-## Microsoft's Feature Teams
-
-* 3-8 developers (design, develop)
-* 3-8 testers (validation, verification, usability, market analysis)
-* 1 program manager (vision, schedule communication; leader, facilitator) – working on several features
-* 1 product manager (marketing research, plan, betas)
-
-----
-## Microsoft's Process
-
-* "Synchronize and stabilize"
-* For each milestone
-  * 6-10 weeks feature development and continuous testing, frequent merges, daily builds
-  * 2-5 weeks integration and testing (“zero-bug release”, external betas)
-  * 2-5 weeks buffer
-
-----
-## Agile Practices (e.g., Scrum)
-
-* 7±2 team members, collocated
-* Self-managing
-* Scrum master (potentially shared among 2-3 teams)
-* Product owner / customer representative
-
-----
-## Spotify's Squads and Tribes
-
-
-* Small cross-functional teams with < 8 members
-* Each squad has autonomy to decide what to build, how to build it, and how to work together -- under a given squad mission and product strategy
-* Focused on regular independent releases
-* Tribes are groups of squads focused on product delivery, with a tribe leader (40-100 people)
-* Chapters coordinate people in the same role across squads
-
-----
-## Spotify's Squads and Tribes
-
-
-![Spotify tribe model](spotify.webp)
-
-----
-> Large teams (29 people) create around six times as many defects as small teams (3 people) and obviously burn through a lot more money. Yet, the large team appears to produce about the same amount of output in only an average of 12 days’ less time. This is a truly astonishing finding, though it fits with my personal experience on projects over 35 years. - [Phillip Armour, 2006, CACM 49:9](https://dl.acm.org/citation.cfm?id=1151043)
-
-----
-## Establish Communication Patterns
-
-* Avoid overhead
-* Ensure reliability
-* Constrain latency
-*
-* e.g., issue tracker vs. email; online vs. face-to-face
-
-
-----
-## Establishing Interfaces
-
-* When dividing work, need to agree on interfaces
-* Common source of mismatch and friction
-* **Examples?**
-  - Team A uses data produced by Team B
-  - Team C deploys a model produced by Team A
-  - Team D uses the model and needs to provide feedback to Team A
-  - Team D waits for an improvement/feature in Team A's model
-*
-* Ideally, interfaces are stable and well documented
-
-
-
-
-
-----
-## Conway’s Law
-
-> “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” — Mel Conway, 1967
-
-> “If you have four groups working on a compiler, you'll get a 4-pass compiler.”
-
-----
-## Congruence
-
-![](congruence.svg)
-
-
-Structural congruence,
-Geographical congruence,
-Task congruence,
-IRC communication congruence
-
-
-----
-## Leaky Abstractions for ML?
-
-* Can one team handle data quality, model quality, fairness, etc.?
-* What needs to be exposed at the interface?
-* Can we divide and conquer the work if we do not yet know what the model can do?
-*
-* Are clear abstractions/interfaces possible?
-
-
-Subramonyam, Hariharan, Jane Im, Colleen Seifert, and Eytan Adar.
"Solving Separation-of-Concerns Problems in Collaborative Design of Human-AI Systems through Leaky Abstractions." In Proc. CHI 2022. - ----- -## The Problem with Cross-Cutting Concerns - -![Scattered code across multiple modules](scattering.png) - ----- -## The Problem with Cross-Cutting Concerns - -![Illustration of the tyranny of the dominant decomposition](tyranny.png) - ----- -## Cross-Cutting Concerns - -System design involves many inter-related concerns - -Teams and engineering abstractions typically hierarchically organized - -Forced decision: What can be abstracted in a module and what concepts need to be exposed in interface and shared/coordinated/discussed across modules - -*Keep track of concerns that cannot be modularized!* - - - -Tarr, Peri, Harold Ossher, William Harrison, and Stanley M. Sutton Jr. "N degrees of separation: Multi-dimensional separation of concerns." In Proc. ICSE. 1999. - ----- -## Awareness - -* Notifications, meetings -* Brook's documentation book -* Email to all -* Code reviews - ----- -## Engineering Recommendations for Structuring ML-Enabled Systems - -* Decompose the system -* Independent components (e.g. microservices) -* Isolate ML if possible -* Clear, stable interfaces, minimal coupling, documentation -* Monitoring to observe contracts and quality -* Explicitly track cross-cutting, system-level concerns like safety, fairness, security - ----- -## Team Structure for Transcription Service? - -![Temi screenshot](temi.png) - - ----- -## Breakout: Team Structure for Depression Prognosis - -In groups, tagging team members, discuss and post in `#lecture`: -* How to decompose the work into teams? -* What roles to recruit for the teams - -![TikTok logo](tiktok.jpg) - - ---- -# Story Time: Conflicts at the Interface between Teams - ----- -![Team collaboration within a large tech company](team-collaboration1.png) - - ----- -![Team collaboration within a large tech company](team-collaboration2.png) - - ----- -## Common Challenge: Establishing Interfaces - -* Formal vs informal agreements? -* Service level agreements and automated enforcement? -* Close collaboration vs siloed teams? -* -* Many concerns: prediction accuracy, generalization, execution time, scalability, data quality, data quantity, feedback latency, privacy, explainability, time estimation, ... -* Formal agreements and enforcement expensive, slowing development? see technical debt - ----- -## Common Collaboration Points - -
-
- 1. Understanding system requirements and ML capabilities
- 2. Understanding ML-specific requirements at the system level, reasoning about feedback loops
- 3. Project planning and architecture design
- 4. Data needs, data quality, data meaning
- 5. Documenting model output (see the interface sketch below)
- 6. Planning and monitoring for drift
- 7. Planning ML component QA (offline, online, monitoring)
- 8. Planning system QA (integration, interaction, safety, feedback loops)
- 9. Tool support for data scientists
- 10. From prototype to production (pipelines, versioning, operations, user interactions, ...)
-
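-One lightweight way to support several of these collaboration points (especially documenting model output and establishing stable interfaces) is an explicitly documented model API at the team boundary. A hypothetical sketch (names, fields, and numbers are illustrative, not from the case study):
-
-```python
-from dataclasses import dataclass
-
-@dataclass
-class RiskPrediction:
-    """Contract for the model team's output, agreed on with consuming teams."""
-    label: str           # one of: "low_risk", "elevated_risk"
-    confidence: float    # calibrated probability in [0, 1]
-    model_version: str   # traces a prediction back to a training run
-
-def predict_depression_risk(transcript: str) -> RiskPrediction:
-    """Documented behavior other teams can rely on:
-    - trained on English transcripts up to 5,000 characters
-    - ~0.85 accuracy on the offline test set; monitor in production
-    - raises ValueError on empty input instead of guessing
-    """
-    if not transcript:
-        raise ValueError("empty transcript")
-    ...  # model inference goes here
-```
-
-Such documentation does not remove the leaky-abstraction problem discussed earlier, but it gives both sides something concrete to monitor and negotiate over.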
-
-
----
-# Team Issues: Multiple/Conflicting Goals
-
-(Organization of Interdisciplinary Teams)
-
-
-----
-## Conflicting Goals?
-
-![DevOps](devops.png)
-
-----
-## Conflicting Goals?
-
-<!-- Venn diagram: Data Scientists vs. Software Engineers -->
-
-----
-## Conflicting Goals?
-
-<!-- Venn diagram: Data Scientists vs. Compliance Lawyers -->
-
-
-----
-## Conflicting Goals?
-
-![TikTok logo](tiktok.jpg)
-
-
-----
-## How to Address Goal Conflicts?
-
-
-
-----
-## T-Shaped People
-
-*Broad-range generalist + Deep expertise*
-
-![T-Shaped](tshaped.png)
-
-
-
-Figure: Jason Yip. [Why T-shaped people?](https://medium.com/@jchyip/why-t-shaped-people-e8706198e437). 2018
-
-----
-## T-Shaped People
-
-*Broad-range generalist + Deep expertise*
-
-Example:
-* Basic skills of software engineering, business, distributed computing, and communication
-* Deep skills in deep neural networks (technique) and medical systems (domain)
-
-----
-## Team Composition
-
-* Cover deep expertise in all important areas
-* Aim for overlap in general skills
-  - Fosters communication, same language
-
-
-
-
-
-----
-## Matrix Organization
-
-![](matrix-only.svg)
-
-
-----
-## Project Organization
-
-![](project-only.svg)
-
-
-
-----
-## Spotify's Squads and Tribes
-
-
-![Spotify tribe model](spotify.webp)
-
-
-----
-## Case Study: Brøderbund
-
- -> As the functional departments grew, staffing the heavily matrixed projects became more and more of a nightmare. To address this, the company reorganized itself into “Studios”, each with dedicated resources for each of the major functional areas reporting up to a Studio manager. Given direct responsibility for performance and compensation, Studio managers could allocate resources freely. - -> The Studios were able to exert more direct control on the projects and team members, but not without a cost. The major problem that emerged from Brøderbund’s Studio reorganization was that members of the various functional disciplines began to lose touch with their functional counterparts. Experience wasn’t shared as easily. Over time, duplicate effort began to appear. - -
- - -Mantle, Mickey W., and Ron Lichty. [Managing the unmanageable: rules, tools, and insights for managing software people and teams](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/8lb6it/cdi_askewsholts_vlebooks_9780132981279). Addison-Wesley Professional, 2012. - - ----- -## Specialist Allocation (Organizational Architectures) - -* Centralized: development teams consult with a core group of specialists when they need help -* Distributed: development teams hire specialists to be a first-class member of the team -* Weak Hybrid: centralized group of specialists and teams with critical applications hire specialists -* Strong Hybrid: centralized group of specialists and most teams also hire specialists - -**Tradeoffs?** - ----- -## Example: Security Roles - -* Everyone: “security awareness” – buy into the process -* Developers: know the security capabilities of development tools and use them, know how to spot and avoid relevant, common vulnerabilities -* Managers: enable the use of security practices -* Security specialists: everything security - ----- -## Allocation of Data Science/Software Engineering Expertise? - -![TikTok logo](tiktok.jpg) - - ----- -## Commitment & Accountability - -* Conflict is useful, expose all views -* Come to decision, commit to it -* Assign responsibilities -* Record decisions and commitments; make record available - ----- -## Bell & Hart – 8 Causes of Conflict - -* Conflicting resources. -* Conflicting styles. -* Conflicting perceptions. -* Conflicting goals. -* Conflicting pressures. -* Conflicting roles. -* Different personal values. -* Unpredictable policies. - -*Understanding causes helps design interventions. Examples?* - - -Bell, Art. (2002). [Six ways to resolve workplace conflicts](https://www.mindtools.com/pages/article/eight-causes-conflict.htm). -University of San Francisco - ----- -## Agile Techniques to Address Conflicting Goals? - - - - - - - - - - - - - - - - - - ---- -# Recall: Team issues: Groupthink - -![](groupthink.png) - ----- -## Groupthink - -* Group minimizing conflict -* Avoid exploring alternatives -* Suppressing dissenting views -* Isolating from outside influences -* -> Irrational/dysfunctional decision making - ----- -## Experiences? - - - - - - - - - - - - - - ---- -# Recall: Team issues: Social loafing - -![](tug.png) - ----- -![](loafinggraph.png) - - - - -Latane, Bibb, Kipling Williams, and Stephen Harkins. "[Many hands make light the work: The causes and consequences of social loafing.](http://web.mit.edu/curhan/www/docs/Articles/15341_Readings/Group_Dynamics/required_reading/4Latane_et_al_1979_Many_hands_make_light_the_work.pdf)" Journal of personality and social psychology 37.6 (1979): 822. - ----- -## Social Loafing - -* People exerting less effort within a group -* Reasons - * Diffusion of responsibility - * Motivation - * Dispensability of effort / missing recognition - * Avoid pulling everybody / "sucker effect" - * Submaximal goal setting -* “Evaluation potential, expectations of co-worker performance, task meaningfulness, and culture had especially strong influence” - - - - -Karau, Steven J., and Kipling D. Williams. "[Social loafing: A meta-analytic review and theoretical integration](https://www1.psych.purdue.edu/~willia55/392F-%2706/KarauWilliamsMetaAnalysisJPSP.pdf)." Journal of personality and social psychology 65.4 (1993): 681. 
- ----- -## Motivation - -Autonomy * Mastery * Purpose - -![](drivebook.gif) - - - - ----- -## Spotify's Squads and Tribes - - -![Spotify tribe model](spotify.webp) - - ---- -# Learning from DevOps - -![DevOps](devops.png) - ----- -## DevOps: A *culture* of collaboration - -* Overcome historic role and goal conflicts between developers and operators -* Joint planning for operations, joint responsibilities for testing and deployment -* -* Joint goals, joint vocabulary -* Joint tools (e.g., Docker, versioning, A/B testing, monitoring) -* Mutual benefits (faster releases, more telemetry, improved reliability, fewer conflicts) -* T-shaped professionals - ----- -## Changing practices and culture is hard - -* Ingrained "us vs them" and blame culture -* Inertia is hard to overcome (“this is how we always did things”) -* Learning cost for new concepts and tools -* Extra effort for new practices (e.g., testing) -* Overwhelmed with current tasks, no time to learn/change -* Poor adoption may cause more costs than benefits - ----- -## Working on Culture Change - -* Bottom-up and top-down change possible -* Often introduced by individual advocates, convincing others -* Always requires supportive management -* Education helps generate buy-in -* Consultants can help with adoption and learning -* -* Demonstrate benefits in one small project, promote from there - ----- -## Beyond DevOps - -* Organizational culture and DevOps have been well studied -* Learn from joint goal setting, joint vocabulary, win-win-collaborations, joint tooling -* -* *What could this look like for other groups (MLOps, MLDev, SecDevOps, LawML, DataExp, SafeML, UIDev, ...)?* - - - ---- - -# Summary - -* Team dysfunctions well studied -* Know the signs, know the interventions -* Small teams, crossfunctional teams - * Deliberately create teams, respect congruence, define interfaces - * Hire T-shaped developers -* Create awareness and accountability - ----- - -## Further Readings - -
- -* 🕮 Brooks Jr, Frederick P. [The mythical man-month: essays on software engineering](https://bookshop.org/books/the-mythical-man-month-essays-on-software-engineering-anniversary-edition/9780201835953). Pearson Education, 1995. -* 🕮 DeMarco, Tom, and Tim Lister. [Peopleware: productive projects and teams](https://bookshop.org/books/peopleware-productive-projects-and-teams-revised/9780321934116). Addison-Wesley, 2013. -* 🕮 Mantle, Mickey W., and Ron Lichty. [Managing the unmanageable: rules, tools, and insights for managing software people and teams](https://www.managingtheunmanageable.net/). Addison-Wesley Professional, 2019. -* 🕮 Lencioni, Patrick. "[The five dysfunctions of a team: A Leadership Fable.](https://bookshop.org/books/the-five-dysfunctions-of-a-team-a-leadership-fable-9780787960759/9780787960759)" Jossey-Bass (2002). -* 🗎 Rakova, Bogdana, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. "[Where responsible AI meets reality: Practitioner perspectives on enablers for shifting organizational practices](https://arxiv.org/abs/2006.12358)." Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW1 (2021): 1-23. -* 🗎 Luz, Welder Pinheiro, Gustavo Pinto, and Rodrigo Bonifácio. "[Adopting DevOps in the real world: A theory, a model, and a case study](http://gustavopinto.org/lost+found/jss2019.pdf)." Journal of Systems and Software 157 (2019): 110384. -* 🗎 Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. "[“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)". In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-15. 2021. - - -
diff --git a/lectures/24_teams/temi.png b/lectures/24_teams/temi.png deleted file mode 100644 index 29ce2dd5..00000000 Binary files a/lectures/24_teams/temi.png and /dev/null differ
diff --git a/lectures/24_teams/tiktok.jpg b/lectures/24_teams/tiktok.jpg deleted file mode 100644 index eb9746d5..00000000 Binary files a/lectures/24_teams/tiktok.jpg and /dev/null differ
diff --git a/lectures/24_teams/tshaped.png b/lectures/24_teams/tshaped.png deleted file mode 100644 index e4b6d35b..00000000 Binary files a/lectures/24_teams/tshaped.png and /dev/null differ
diff --git a/lectures/24_teams/tug.png b/lectures/24_teams/tug.png deleted file mode 100644 index 56ee08d7..00000000 Binary files a/lectures/24_teams/tug.png and /dev/null differ
diff --git a/lectures/24_teams/twitter.png b/lectures/24_teams/twitter.png deleted file mode 100644 index de1bb920..00000000 Binary files a/lectures/24_teams/twitter.png and /dev/null differ
diff --git a/lectures/24_teams/tyranny.png b/lectures/24_teams/tyranny.png deleted file mode 100644 index 42e4f4c4..00000000 Binary files a/lectures/24_teams/tyranny.png and /dev/null differ
diff --git a/lectures/24_teams/uber.png b/lectures/24_teams/uber.png deleted file mode 100644 index 4dea159a..00000000 Binary files a/lectures/24_teams/uber.png and /dev/null differ
diff --git a/lectures/24_teams/unicorn.jpg b/lectures/24_teams/unicorn.jpg deleted file mode 100644 index 753606cb..00000000 Binary files a/lectures/24_teams/unicorn.jpg and /dev/null differ
diff --git a/lectures/25_summary/3tier-with-ml.svg b/lectures/25_summary/3tier-with-ml.svg deleted file mode 100644 index ccc2e2c2..00000000 --- a/lectures/25_summary/3tier-with-ml.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file
diff --git a/lectures/25_summary/Accuracy_and_Precision.svg b/lectures/25_summary/Accuracy_and_Precision.svg deleted file mode 100644 index a1f66057..00000000 --- a/lectures/25_summary/Accuracy_and_Precision.svg +++ /dev/null @@ -1,2957 +0,0 @@
- [2957 lines of SVG markup removed: dartboard figure by Tijmen Stam (GFDL) illustrating accuracy vs. precision, with probability-density panels over Value/Reference value axes]
diff --git a/lectures/25_summary/Kubernetes.png b/lectures/25_summary/Kubernetes.png deleted file mode 100644 index a446a7a5..00000000 Binary files a/lectures/25_summary/Kubernetes.png and /dev/null differ
diff --git a/lectures/25_summary/MNIST-fashion.png b/lectures/25_summary/MNIST-fashion.png deleted file mode 100644 index 615108d3..00000000 Binary files a/lectures/25_summary/MNIST-fashion.png and /dev/null differ
diff --git a/lectures/25_summary/Martin_Shkreli_2016.jpg b/lectures/25_summary/Martin_Shkreli_2016.jpg deleted file mode 100644 index 72ba7ffa..00000000 Binary files a/lectures/25_summary/Martin_Shkreli_2016.jpg and /dev/null differ
diff --git a/lectures/25_summary/ab-groove.jpg b/lectures/25_summary/ab-groove.jpg deleted file mode 100644 index f5da8094..00000000 Binary files a/lectures/25_summary/ab-groove.jpg and /dev/null differ
diff --git a/lectures/25_summary/accuracy-improvements.png b/lectures/25_summary/accuracy-improvements.png deleted file mode 100644 index 455cb820..00000000 Binary files a/lectures/25_summary/accuracy-improvements.png and /dev/null differ
diff --git a/lectures/25_summary/adversarialexample.png b/lectures/25_summary/adversarialexample.png deleted file mode 100644 index ba5abb74..00000000 Binary files a/lectures/25_summary/adversarialexample.png and /dev/null differ
diff --git a/lectures/25_summary/aequitas-report.png b/lectures/25_summary/aequitas-report.png deleted file mode 100644 index d11913d5..00000000 Binary files a/lectures/25_summary/aequitas-report.png and /dev/null differ
diff --git a/lectures/25_summary/airegulation.png b/lectures/25_summary/airegulation.png deleted file mode 100644 index 4bde739b..00000000 Binary files a/lectures/25_summary/airegulation.png and /dev/null differ
diff --git a/lectures/25_summary/albumy.png b/lectures/25_summary/albumy.png deleted file mode 100644 index a47c0181..00000000 Binary files a/lectures/25_summary/albumy.png and /dev/null differ
diff --git a/lectures/25_summary/alexa.png b/lectures/25_summary/alexa.png deleted file mode 100644 index 9adf5327..00000000 Binary files a/lectures/25_summary/alexa.png and /dev/null differ
diff --git a/lectures/25_summary/all.md b/lectures/25_summary/all.md deleted file mode 100644 index 0fc9fa71..00000000 --- a/lectures/25_summary/all.md +++ /dev/null @@ -1,5647 +0,0 @@
----
-author: Christian Kaestner and Eunsuk Kang
-title: "MLiP: Summary & Reflection"
-semester: Spring 2023
-footer: "Machine Learning in Production/AI Engineering • Christian Kaestner & Eunsuk Kang, Carnegie Mellon University • Spring 2023"
-license: Creative Commons Attribution 4.0 International (CC BY 4.0)
----
-
- -## Machine Learning in Production - -# Summary & Reflection - - - ---- -# Today - - -(1) - -**Looking back at the semester** - -(413 slides in 40 min) - - -(2) - -**Discussion of future of ML in Production** - - -(3) - -**Feedback for future semesters** - - - - - ---- - -
- -## Machine Learning in Production - -# Motivation, Syllabus, and Introductions - - - ----- -## Learning Goals - -* Understand how ML components are parts of larger systems -* Illustrate the challenges in engineering an ML-enabled system beyond accuracy -* Explain the role of specifications and their lack in machine learning and the relationship to deductive and inductive reasoning -* Summarize the respective goals and challenges of software engineers vs data scientists -* Explain the concept and relevance of "T-shaped people" - ----- -## Catastrophic Success - -![Crowd](crowd.jpg) - ----- - -![competitor](transcription.png) - ----- - -## Breakout: Likely challenges in building commercial product? - -
- -As a group, think about challenges that the team will likely focus when turning their research into *a product*: -* One machine-learning challenge -* One engineering challenge in building the product -* One challenge from operating and updating the product -* One team or management challenge -* One business challenge -* One safety or ethics challenge - -*Post answer to `#lecture` on Slack and tag all group members* - -
-
-## ML in a Production System
-
-
-![Architecture diagram of transcription service; many components, not just ML](transcriptionarchitecture2.svg)
-
-
-----
-
-<!-- Venn diagram: Data Scientists and Software Engineers -->
-
-and Data engineers + Domain specialists + Operators + Business team + Project managers + Designers, UI Experts + Safety, security specialists + Lawyers + Social scientists + ...
- - ----- - -![Unicorns](roles_venn.svg) - - -By Steven Geringer, via Ryan Orban. [Bridging the Gap Between Data Science & Engineer: Building High-Performance Teams](https://www.slideshare.net/ryanorban/bridging-the-gap-between-data-science-engineer-building-highperformance-teams/3-Software_Engineer_Data_Engineer_Data). 2016 - - - ----- -## T-Shaped People - -*Broad-range generalist + Deep expertise* - -![T-Shaped](tshaped.png) - - - - -Figure: Jason Yip. [Why T-shaped people?](https://medium.com/@jchyip/why-t-shaped-people-e8706198e437). 2018 - - ----- -# Syllabus and Class Structure - -17-445/17-645/17-745, Fall 2022, 12 units - -Monday/Wednesdays 1:25-2:45pm - -Recitation Fridays 10:10-11:00am / 1:25-2:45pm - ----- -## First Homework Assignment - - -*"Coding warmup assignment"* - -[Out now](https://github.com/ckaestne/seai/blob/F2022/assignments/I1_mlproduct.md), due Sep 7 - -Enhance simple web *application* with ML-based feature: Automated image captioning - -Open ended coding assignment, change existing code, learn new APIs - - - -![Screenshot of Albumy](albumy.png) - - - ----- - -![Class Overview](overview.svg) - - - ----- -## Reading Assignments & Quizzes - - -*Building Intelligent Systems* -by Geoff Hulten - -https://www.buildingintelligentsystems.com/ - -Most chapters assigned at some point in the semester - -Supplemented with research articles, blog posts, videos, podcasts, ... - -[Electronic version](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436) in the library - - - -![Building intelligent systems book](book.webp) - - - - ----- -## Timeline - - -![Timeline](timeline.svg) - - ----- -## Grading Philosophy - -Specification grading, based in adult learning theory - -Giving you choices in what to work on or how to prioritize your work - -We are making every effort to be clear about expectations (specifications), will clarify if you have questions - - -Assignments broken down into expectations with point values, each graded **pass/fail** - -Opportunities to resubmit work until last day of class - -[[Example]](https://github.com/ckaestne/seai/blob/F2022/assignments/I1_mlproduct.md#grading) - - - ----- -## ML Models Make Mistakes - -![ML image captioning mistakes](mistakes.jpg) - - - -Note: Source: https://www.aiweirdness.com/do-neural-nets-dream-of-electric-18-03-02/ - ----- -## Lack of Specifications - -```java -/** - Return the text spoken within the audio file - ???? -*/ -String transcribe(File audioFile); -``` - ----- -## It's not all new - -We routinely build: -* Safe software with unreliable components -* Cyberphysical systems -* Non-ML big data systems, cloud systems -* "Good enough" and "fit for purpose" not "correct" - -ML intensifies our challenges - ----- -## Complexity -![Complexity prediction](complexity.svg) - - ---- - -
- -## Machine Learning in Production - - -# From Models to Systems - - - - ----- -# Learning goals - -* Understand how ML components are a (small or large) part of a larger system -* Explain how machine learning fits into the larger picture of building and maintaining production systems -* Define system goals and map them to goals for ML components -* Describe the typical components relating to AI in an AI-enabled system and typical design decisions to be made - - - - - - - ----- -## Why do we care about image captioning? - -![Image captioning one step](imgcaptioning.png) - - - ----- -![Image Search on Google Photos](imagesearch.png) - - - ----- -## Products using Image Synthesis? - -![Dall-e generated example of chairs in the form of an avocado](dall-e.png) - - - -From https://openai.com/blog/dall-e/ - - ----- -## Traditional Model Focus (Data Science) - -![](pipeline.svg) - - -Focus: building models from given data, evaluating accuracy - - ----- -## Automating Pipelines and MLOps (ML Engineering) - -![](pipeline2.svg) - - -Focus: experimenting, deploying, scaling training and serving, model monitoring and updating - ----- -## ML-Enabled Systems (ML in Production) - -![](pipeline-in-system.svg) - -Interaction of ML and non-ML components, system requirements, user interactions, safety, collaboration, delivering products - - - ----- -# Model vs System Goals - - - ----- -## Case Study: Self-help legal chatbot - -![Website](lawchat.png) - - - - - -Based on the excellent paper: Passi, S., & Sengers, P. (2020). [Making data science systems work](https://journals.sagepub.com/doi/full/10.1177/2053951720939605). Big Data & Society, 7(2). - -Note: Screenshots for illustration purposes, not the actual system studied - - - - ----- -## Machine learning that matters - -
- -* 2012(!) essay lamenting focus on algorithmic improvements and benchmarks - - focus on standard benchmark sets, not engaging with problem: Iris classification, digit recognition, ... - - focus on abstract metrics, not measuring real-world impact: accuracy, ROC - - distant from real-world concerns - - lack of follow-through, no deployment, no impact -* Failure to *reproduce* and *productionize* paper contributions common -* Ignoring design choices in how to collect data, what problem to solve, how to design human-AI interface, measuring impact, ... -* *Argues: Should focus on making impact -- requires building systems* - -
-
-
-
-Wagstaff, Kiri. "Machine learning that matters." In Proceedings of the 29th International Conference on Machine Learning (2012).
-
-
-
-
-
----
-# Setting and Untangling Goals
-
-
-
-
- -* **Organizational objectives:** Innate/overall goals of the organization -* **System goals:** Goals of the software system/feature to be built -* **User outcomes:** How well the system is serving its users, from the user's perspective -* **Model properties:** Quality of the model used in a system, from the model's perspective -* -* **Leading indicators:** Short-term proxies for long-term measures, typically for organizational objectives - -*Ideally, these goals should be aligned with each other* - -
- - - -![Goal relationships](goal-relationships.svg) - - - - - - ----- -## Breakout: Automating Admission Decisions - -What are different types of goals behind automating admissions decisions to a Master's program? - -As a group post answer to `#lecture` tagging all group members using template: -> Organizational goals: ...
-> Leading indicators: ...
-> System goals: ...
-> User goals: ...
-> Model goals: ...
- - - - - - - ----- -# Systems Thinking - -![](system.svg) - - - ----- -## Feedback Loops - - -![Feedback loop with data creating model, creating decisions, creating data](feedbackloop.svg) - - ----- -## User Interaction Design - -**Automate**: Take action on user's behalf - -**Prompt**: Ask the user if an action should be taken - -**Organize/Annotate/Augment**: Add information to a display - -Hybrids of these - - ----- -## Safety is a System Property - -* Code/models are not unsafe, cannot harm people -* Systems can interact with the environment in ways that are unsafe - -![Smart Toaster](toaster.jpg) - ----- -## Safety Assurance in/outside the Model - - -
- -In the model - - Ensure maximum toasting time - - Use heat sensor and past outputs for prediction - - Hard to make guarantees - -Outside the model (e.g., "guardrails") - - Simple code check for max toasting time - - Non-ML rule to shut down if too hot - - Hardware solution: thermal fuse - -
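
A minimal sketch of such a guardrail outside the model, in Python (the names and the limit are hypothetical, not from the lecture):

```python
MAX_TOAST_SECONDS = 240  # hypothetical hard limit, enforced by non-ML code

def safe_toast_time(predicted_seconds: float) -> float:
    """Clamp the model's predicted toasting time to a hard, non-ML safety limit."""
    return max(0.0, min(predicted_seconds, MAX_TOAST_SECONDS))

assert safe_toast_time(90.0) == 90.0     # normal prediction passes through
assert safe_toast_time(9000.0) == 240.0  # runaway prediction is clamped
```
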
- - - -![Thermal fuse](thermalfuse.png) -(Image CC BY-SA 4.0, C J Cowie) - - - - - - ----- -## Things change... - - - - - -Newer better models released (better model architectures, more training data, ...) - -Goals and scope change (more domains, handling dialects, ...) - -The world changes (new products, names, slang, ...) - -Online experimentation - - - - - -![Architecture diagram of transcription service; many components, not just ML](transcriptionarchitecture.svg) - - - - - - ----- -## Monitoring in Production - -Design for telemetry - - -![Safe Browsing Feedback](safe-browsing-feedback.png) - -![Safe Browsing Statistics](safe-browsing-stats.png) - - - ----- -## Pipelines Thinking is Challenging - -

In enterprise ML teams:
* Data scientists often focus on modeling in a local environment, with a model-centric workflow
* Rarely robust infrastructure; often monolithic and tangled code
* Challenges in deploying systems and integrating with monitoring, streams, etc.

Shifting to a pipeline-centric workflow is challenging
* Requires writing robust programs; slower, less exploratory
* Standardized, modular infrastructure
* Big conceptual leap, major hurdle to adoption

- - - -O'Leary, Katie, and Makoto Uchida. "[Common problems with Creating Machine Learning Pipelines from Existing Code](https://research.google/pubs/pub48984.pdf)." Proc. Third Conference on Machine Learning and Systems (MLSys) (2020). - - - - ---- -# I1: Building an ML-enabled Product - -![Screenshot of Albumy](albumy.png) - - - ---- - -

## Machine Learning in Production

# Gathering Requirements



----
## Learning Goals

* Understand the role of requirements in ML-based systems and their
failures
* Understand the distinction between the world and the machine
* Understand the importance of environmental assumptions in
  establishing system requirements
* Understand the challenges in and techniques for gathering, validating,
  and negotiating requirements

----
## Facial Recognition in ATM

![ATM](atm.gif)


**Q. What went wrong? What is the root cause of the failure?**

----
## Automated Hiring

![Amazon Hiring Tool Scrapped due to Bias](amazonhiring.png)


**Q. What went wrong? What is the root cause of the failure?**


----
## Machine vs World

![machine-world](worldvsmachine.svg)


----
## Shared Phenomena

![phenomena](phenomena.jpg)

- -* Shared phenomena: Interface between the environment & software - * Input: Lidar, camera, pressure sensors, GPS - * Output: Signals generated & sent to the engine or brake control -* Software can influence the environment **only** through the shared interface - * Unshared parts of the environment are beyond software’s control - * We can only **assume** how these parts will behave - -
- ----- -## Breakout: Lane Assist Assumptions - -![lane-keeping](lane-keeping.png) - - -REQ: The vehicle must be prevented from veering off the lane. - -SPEC: Lane detector accurately identifies lane markings in the input image; - the controller generates correct steering commands - -**Discuss with your neighbor to come up with 2-3 assumptions** - ----- -## Lufthansa 2904 Runaway Crash - - -![Illustration of time elapsed between touchdown of the first main strut, the second and engagement of brakes.](lh2904_animation.gif) - - - -CC BY-SA 3.0 [Anynobody](https://en.wikipedia.org/wiki/Lufthansa_Flight_2904#/media/File:Lufthansa_Flight_2904.gif) - ----- -## Breakout Session: Fall detection - -![smart-watch](smartwatch.jpg) - - -As a group, post answer to `#lecture` and tag group members: -> Requirement: ...
-> Assumptions: ...
-> Specification: ...
-> What can go wrong: ...
- - ----- -## What went wrong? (REQ, ASM, SPEC)? - -![ATM](atm.gif) - - ----- -## Understanding requirements is hard - -* Customers don't know what they want until they see it -* Customers change their mind ("no, not like that") -* Descriptions are vague -* It is easy to ignore important requirements (privacy, fairness) -* Focused too narrowly on needs of few users -* Engineers think they already know the requirements -* Engineers are overly influenced by technical capability -* Engineers prefer elegant abstractions - -**Examples?** - - - -See also 🗎 Jackson, Michael. "[The world and the machine](https://web.archive.org/web/20170519054102id_/http://mcs.open.ac.uk:80/mj665/icse17kn.pdf)." In Proceedings of the International Conference on Software Engineering. IEEE, 1995. - - ----- -## Requirements elicitation techniques - -![Interview](interview.jpg) - - - - ----- -## ML Prototyping: Wizard of Oz - -![Wizard of oz excerpt](wizard.gif) - -Note: In a wizard of oz experiment a human fills in for the ML model that is to be developed. For example a human might write the replies in the chatbot. - ----- -## Personas in GenderMag - - -![Gendermag](gendermag1.png) - - -See examples and details http://gendermag.org/foundations.php - - ----- -## How much requirements eng. and when? - -![Waterfall process picture](waterfall.svg) - - ----- -# Homework I2: Requirements - -Dashcam system - - ---- - -
- -## Machine Learning in Production - -# Planning for Mistakes - - ----- -## Learning goals: - -* Consider ML models as unreliable components -* Use safety engineering techniques FTA, FMEA, and HAZOP to anticipate and analyze possible mistakes -* Design strategies for mitigating the risks of failures due to ML mistakes - - ----- -## Models make mistakes - -
- - ----- -## Common excuse: Nobody could have foreseen this... - -![Suicide rate of girls rising with the rise of social media](teen-suicide-rate.png) - ----- -## What responsibility do designers have to anticipate problems? - -![Critical headline about predictive policing](predictive-policing.png) - - ----- -## Confounding Variables - -![Confounding variable example](confoundingvariables.svg) - - ----- -## Reverse Causality - -![Chess](chess.jpg) - -Note: (from Prediction Machines, Chapter 6) Early 1980s chess program learned from Grandmaster games, learned that sacrificing queen would be a winning move, because it was occuring frequently in winning games. Program then started to sacrifice queen early. - ----- -## Reasons barely matter - -No model is every "correct" - -Some mistakes are unavoidable - -Anticipate the eventual mistake -* Make the system safe despite mistakes -* Consider the rest of the system (software + environment) -* Example: Thermal fuse in smart toaster - -**ML model = unreliable component** - - - - - - - ----- -## Bollards mitigate mistakes - -
- - - - ----- -## Today's Running Example: Autonomous Train - - - -![Docklands train](dlr.jpg) - - -
CC BY 2.0 by Matt Brown
- - -* REQ: The train shall not collide with obstacles -* REQ: The train shall not depart until all doors are closed -* REQ: The train shall not trap people between the doors -* ... - - - -Note: The Docklands Light Railway system in London has operated trains without a driver since 1987. Many modern public transportation systems use increasingly sophisticated automation, including the Paris Métro Line 14 and the Copenhagen Metro - ----- -## Human in the Loop - Examples - -* Email response suggestions - -![Example of email responses suggested by GMail](email.png) - -* Fall detection smartwatch -* Safe browsing - ----- -## Undoable actions - Examples - -![Nest thermostat](nest.jpg) - - -* Override thermostat setting -* Undo slide design suggestions -* Automated shipment + offering free return shipment -* Appeal process for banned "spammers" or "bots" -* Easy to repair bumpers on autonomous vehicles? - - - ----- -## Guardrails - Examples - -Recall: Thermal fuse in smart toaster - -![Thermal fuse](thermalfuse.png) - - - -+ maximum toasting time + extra heat sensor - - ----- -## Guardrails - Examples - -![Metro station Cour Saint-Émilion in Paris with automated platform screen doors that only open when a train is in the station](platformdoors.png) - - - - -CC BY-SA 4.0 by Chabe01 - - ----- -## Mistake detection - -Independent mechanism to detect problems (in the real world) - - -Example: Gyrosensor to detect a train taking a turn too fast - -![Train taking a corner](traincorner.jpg) - - ----- -## Graceful Degradation (Fail-safe) - - - -* Goal: When a component failure is detected, achieve system - safety by reducing functionality and performance -* Switches operating mode when failure detected (e.g., slower, conservative) - ----- -## Redundancy Example: Sensor Fusion - -![](sensor-fusion.jpeg) - - -* Combine data from a wide range of sensors -* Provides partial information even when some sensor is faulty -* A critical part of modern self-driving vehicles - - ----- -## Short Breakout - -What design strategies would you consider to mitigate ML mistakes: -* Credit card fraud detection -* Image captioning for accessibility in photo sharing site -* Speed limiter for cars (with vision system to detect traffic signs) - -Consider: Human in the loop, Undoable actions, Guardrails, Mistake detection and recovery (monitoring, doer-checker, fail-over, redundancy), Containment and isolation - - -As a group, post one design idea for each scenario to `#lecture` and tag all group members. - - - - - - ----- -## What's the worst that could happen? - -![Robot uprising](robot-uprising.jpg) - - - -*Likely?* Toby Ord predicts existential risk from GAI at 10% within 100 years: -Toby Ord, "The Precipice: Existential Risk and the Future of Humanity", 2020 - -Note: Discussion on existential risk. Toby Ord, Oxford philosopher predicts - ----- -## What's the worst that could happen? - -![Albumy screenshot](albumy.png) - - - - ----- -## What is Risk Analysis? - -What can possibly go wrong in my system, and what are potential -impacts on system requirements? - -Risk = Likelihood * Impact - -A number of methods: - * Failure mode & effects analysis (FMEA) - * Hazard analysis - * Why-because analysis - * Fault tree analysis (FTA) - * ... - - ----- -## Fault Tree Analysis (FTA) - - - -
- -* Fault tree: A top-down diagram that displays the relationships -between a system failure (i.e., requirement violation) and its potential causes. - * Identify sequences of events that result in a failure - * Prioritize the contributors leading to the failure - * Inform decisions about how to (re-)design the system - * Investigate an accident & identify the root cause -* Often used for safety & reliability, but can also be used for -other types of requirements (e.g., poor performance, security attacks...) - -
- - - -![fta-sample](fta-sample.png) - - - ----- -![FTA for trapping people in doors of a train](fta.svg) - ----- -## Consider Mitigations - -* Remove basic events with mitigations -* Increase the size of cut sets with mitigations - - -![FTA for trapping people in doors of a train](fta-without-mitigation.svg) - - ----- -## Failure Mode and Effects Analysis (FMEA) - -![](fmea-radiation.png) - - -* A __forward search__ technique to identify potential hazards -* Widely used in aeronautics, automotive, healthcare, food services, - semiconductor processing, and (to some extent) software - ----- -## Hazard and Interoperability Study (HAZOP) - -*identify hazards and component fault scenarios through guided inspection of requirements* - -![HAZOP example](hazop-perception.jpg) - - - - - - - ---- -# I2: Requirements - - - - - - - ---- - -
- - -## Machine Learning in Production - - -# Model Correctness and Accuracy - - - - ----- -# Learning Goals - -* Select a suitable metric to evaluate prediction accuracy of a model and to compare multiple models -* Select a suitable baseline when evaluating model accuracy -* Know and avoid common pitfalls in evaluating model accuracy -* Explain how software testing differs from measuring prediction accuracy of a model - ----- -# Model Quality - - -**First Part:** Measuring Prediction Accuracy -* the data scientist's perspective - -**Second Part:** What is Correctness Anyway? -* the role and lack of specifications, validation vs verification - -**Third Part:** Learning from Software Testing -* unit testing, test case curation, invariants, simulation (next lecture) - -**Later:** Testing in Production -* monitoring, A/B testing, canary releases (in 2 weeks) - - - - - ----- -## Confusion/Error Matrix - -


| | **Actually Grade 5 Cancer** | **Actually Grade 3 Cancer** | **Actually Benign** |
| :--- | --- | --- | --- |
|**Model predicts Grade 5 Cancer** | **10** | 6 | 2 |
|**Model predicts Grade 3 Cancer** | 3 | **24** | 10 |
|**Model predicts Benign** | 5 | 22 | **82** |




$\textit{accuracy} = \frac{\textit{correct predictions}}{\textit{all predictions}}$

Example's accuracy
 = $\frac{10+24+82}{10+6+2+3+24+10+5+22+82} = .707$

```python
def accuracy(model, xs, ys):
    count = len(xs)
    countCorrect = 0
    for i in range(count):
        predicted = model(xs[i])
        if predicted == ys[i]:
            countCorrect += 1
    return countCorrect / count
```

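Precision and recall per class can be computed from the same matrix; a minimal sketch (the nested-dict encoding of the table above is ours, not from the lecture):

```python
def precision_recall(confusion, cls):
    # confusion[predicted][actual] holds counts, as in the table above
    tp = confusion[cls][cls]
    predicted = sum(confusion[cls].values())              # row sum
    actual = sum(row[cls] for row in confusion.values())  # column sum
    return tp / predicted, tp / actual

confusion = {
    'grade5': {'grade5': 10, 'grade3': 6,  'benign': 2},
    'grade3': {'grade5': 3,  'grade3': 24, 'benign': 10},
    'benign': {'grade5': 5,  'grade3': 22, 'benign': 82},
}
print(precision_recall(confusion, 'grade5'))  # (0.555..., 0.555...)
```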
- - ----- - -[![Recall/Precision visualization](recallprecision.png)](https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg) - - - - -(CC BY-SA 4.0 by [Walber](https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg)) - ----- -## Area Under the Curve - -Turning numeric prediction into classification with threshold ("operating point") - -![Recall/Precision Plot](prcurve.png) - - -Notes: The plot shows the recall precision/tradeoff at different thresholds (the thresholds are not shown explicitly). Curves closer to the top-right corner are better considering all possible thresholds. Typically, the area under the curve is measured to have a single number for comparison. - - - ----- -# Short Detour: - -# Measurement - - - ----- -## What is Measurement? - -> Measurement is the empirical, objective assignment of numbers, -according to a rule derived from a model or theory, to attributes of -objects or events with the intent of describing them. – Craner, Bond, -“Software Engineering Metrics: What Do They Measure and How Do We -Know?" - -> A quantitatively expressed reduction of uncertainty based on one or more observations. – Hubbard, “How to Measure Anything …" - - ----- -## Composing/Decomposing Measures -Often higher-level measures are composed from lower level measures - -Clear trace from specific low-level measurements to high-level metric - -![Decomposition of a maintainability metric](maintainability-decomp.svg) - - - -For design strategy, see [Goal-Question-Metric approach](https://en.wikipedia.org/wiki/GQM) - - ----- -## Measuring - -Make measurement clear and unambiguous. Ideally, third party can measure independently based on description. - -Three steps: -1. **Measure:** What do we try to capture? -2. **Data collection:** What data is collected and how? -3. **Operationalization:** How is the measure computed from the data? - -(Possible to repeat recursively when composing measures) - - ----- - -## Streetlight effect - - - -* A type of _observational bias_ -* People tend to look for something where it’s easiest to do so - * Use cheap proxy metrics that only poorly correlate with goal - * e.g., number of daily active users as a measure of projected revenue - - - -![Streetlight](streetlight.jpg) - - - - - - - - - ----- - -## The Legend of the Failed Tank Detector - - -![Tank in Forest](tank.jpg) - -![Forest](forest.jpg) - - -Notes: -Widely shared story, authenticity not clear: -AI research team tried to train image recognition to identify tanks hidden in forests, trained on images of tanks in forests and images of same or similar forests without tanks. The model could clearly separate the learned pictures, but would perform poorly on other pictures. - -Turns out the pictures with tanks were taken on a sunny day whereas the other pictures were taken on a cloudy day. The model picked up on the brightness of the picture rather than the presence of a tank, which worked great for the training set, but did not generalize. - -Pictures: https://pixabay.com/photos/lost-places-panzer-wreck-metal-3907364/, https://pixabay.com/photos/forest-dark-woods-trail-path-1031022/ - - - - ----- -## Production Data -- The Ultimate Unseen Validation Data - -more in a later lecture - - -
- - ----- -## Common Pitfalls of Evaluating Model Quality? - - - - ----- -## Test Data not Representative - -Often neither training nor test data representative of production data - -![MNIST Fashion Dataset Examples](MNIST-fashion.png) - - - - ----- -## Shortcut Learning - -![Shortcut learning illustration from paper below](shortcutlearning.png) - - - -Figure from: Geirhos, Robert, et al. "[Shortcut learning in deep neural networks](https://arxiv.org/abs/2004.07780)." Nature Machine Intelligence 2, no. 11 (2020): 665-673. - -Note: (From figure caption) Toy example of shortcut learning in neural networks. When trained on a simple dataset -of stars and moons (top row), a standard neural network (three layers, fully connected) can easily -categorise novel similar exemplars (mathematically termed i.i.d. test set, defined later in Section 3). -However, testing it on a slightly different dataset (o.o.d. test set, bottom row) reveals a shortcut -strategy: The network has learned to associate object location with a category. During training, -stars were always shown in the top right or bottom left of an image; moons in the top left or bottom -right. This pattern is still present in samples from the i.i.d. test set (middle row) but not in o.o.d. test -images (bottom row), exposing the shortcut. - ----- -## Data Leakage during Data Preprocessing - -```python -wordsVectorizer = CountVectorizer().fit(text) -wordsVector = wordsVectorizer.transform(text) -invTransformer = TfidfTransformer().fit(wordsVector) -invFreqOfWords = invTransformer.transform(wordsVector) -X = pd.DataFrame(invFreqOfWords.toarray()) - -train, test, spamLabelTrain, spamLabelTest = - train_test_split(X, y, test_size = 0.5) -predictAndReport(train = train, test = test) -``` - - - ----- -## Independence of Data: Temporal - -![Temporal dependence](temporaldependence.svg) - - -Note: The curve is the real trend, red points are training data, green points are validation data. If validation data is randomly selected, it is much easier to predict, because the trends around it are known. - - ----- -## Independence of Data: Related Datapoints - -Example: Kaggle competition on detecting distracted drivers - -![Driver Picture 1](driver_phone.png) ![Driver Picture 2](driver_phone2.png) - - -Relation of datapoints may not be in the data (e.g., driver) - - - -https://www.fast.ai/2017/11/13/validation-sets/ - -Note: -Many potential subtle and less subtle problems: -* Sales from same user -* Pictures taken on same day - - - - ----- -# Part 2: -# What is Correctness Anyway? - -specifications, bugs, fit - - ----- -## SE World: Evaluating a Component's Functional Correctness - -

*Given a specification, do the outputs match it for the given inputs?*

```java
/**
 * compute deductions based on provided adjusted
 * gross income and expenses in customer data.
 *
 * see tax code 26 U.S. Code A.1.B, PART VI
 */
float computeDeductions(float agi, Expenses expenses);
```

**Each mismatch is considered a bug and should be fixed.†**

- -
-(†=not every bug is economical to fix, may accept some known bugs) -
- - ----- -## Validation vs Verification - -![Validation vs Verification](validation.png) - - - ----- -## No specification! - -
- -![Cancer prognosis with ML](cancerpred.png) - -Use ML precisely because no specifications (too complex, rules unknown) -* No specification that could tell us for any input whether the output is correct -* Intuitions, ideas, goals, examples, "implicit specifications", but nothing we can write down as rules! -* *We are usually okay with some wrong predictions* - -
- ----- -## Testing a Machine Learning Model? - - -```java -// detects cancer in an image -boolean hasCancer(Image scan); - -@Test -void testPatient1() { - assertEquals(loadImage("patient1.jpg"), false); -} -@Test -void testPatient2() { - assertEquals(loadImage("patient2.jpg"), false); -} -``` - - - ----- -## All Models Are Wrong - - -> All models are approximations. Assumptions, whether implied or clearly stated, are never exactly true. **All models are wrong, but some models are useful**. So the question you need to ask is not "Is the model true?" (it never is) but "Is the model good enough for this particular application?" -- George Box - - - -See also https://en.wikipedia.org/wiki/All_models_are_wrong -
  - - ----- -## Deductive vs Inductive Reasoning - - - -[![Contrasting inductive and deductive reasoning](inductive.png)](https://danielmiessler.com/blog/the-difference-between-deductive-and-inductive-reasoning/) - - - - -(Daniel Miessler, CC SA 2.0) - - - - ----- -## Machine Learning Models Fit, or Not - -* A model is learned from given data in given procedure - - The learning process is typically not a correctness concern - - The model itself is generated, typically no implementation issues -* Is the data representative? Sufficient? High quality? -* Does the model "learn" meaningful concepts? -* -* **Is the model useful for a problem?** Does it *fit*? -* Do model predictions *usually* fit the users' expectations? -* Is the model *consistent* with other requirements? (e.g., fairness, robustness) - - - - - - - - - - - - - - - ---- - -
- -## Machine Learning in Production - -# Navigating Conflicts in (Student) Teams - - - - - ----- -# Assigned Seating - -1. Find your team number -2. Find a seat in the range for your team -3. Introduce yourself to the other team members - - ----- -## Now: First Short Team Meeting (10 min) - -* Move to table with your team number -* Say hi, introduce yourself: Name? SE or ML background? Favorite movie? Fun fact? -* Find time for first team meeting in next few days -* Agree on primary communication until team meeting -* Pick a movie-related team name, post team name and tag all group members on slack in `#social` - ----- -## Teams are Inevitable - - -1. Projects too large to build for a single person (division of work) -2. Projects too large to fully comprehend by a single person (divide and conquer) -3. Projects need too many skills for a single person to master (division of expertise) - - ----- -## Who has had bad experiences in teams? Student teams? Teams in industry? - -![Frustration](frustrated.jpeg) - - - - - - - - - ----- -# Team issues: Groupthink - -![](groupthink.png) - ----- -## Groupthink - -* Group minimizing conflict -* Avoid exploring alternatives -* Suppressing dissenting views -* Isolating from outside influences -* -> Irrational/dysfunctional decision making ----- -![](svposter.png) - ----- -# Team issues: Social loafing - -![](tug.png) - - - - ----- -## Some past complaints - -
- -* "M. was very pleasant and would contribute while in meetings. Outside of them, he did not complete the work he said he would and did not reach out to provide an update that he was unable to. When asked, on the night the assignment was due, he completed a portion of the task he said he would after I had completed the rest of it." -* "Procrastinated with the work till the last minute - otherwise ok." -* "He is not doing his work on time. And didnt check his own responsibilities. Left work undone for the next time." -* "D. failed to catch the latest 2 meetings. Along the commit history, he merely committed 4 and the 3 earliest commits are some setups. And the latest one commits is to add his name on the meeting log, for which we almost finished when he joined." -* "Unprepared with his deliverables, very unresponsive on WhatsApp recently, and just overall being a bad team player." -* "Consistently failed to meet deadlines. Communication improved over the course of the milestone but needed repeated prompts to get things done. Did not ask for help despite multiple offers." - -
- - ----- -## Common Sources of Frustrations - -* Priority differences ("10-601 is killing me, I need to work on that first", "I have dance class tonight") -* Ambition differences ("a B- is enough for graduating") -* Ability differences ("incompetent" students on teams) -* Working style differences (deadline driven vs planner) -* Communication preferences differences (avoid distraction vs always on) -* In-team competition around grades (outdoing each other, adversarial peer grading) - - - - -Based on research and years of own experience - ----- -## How would you handle... - -> One team member has very little technical experience and is struggling with basic Python scripts and the Unix shell. It is faster for other team members to take over the task rather than helping them. - ----- -## Breakout: Navigating Team Issues - -Pick one or two of the scenarios (or another one team member faced in the past) and openly discuss proactive/reactive solutions - -As a team, tagging team members, post to `#lecture`: - -> 1. Brief problem description -> 2. How to prevent in the first place -> 3. What to do when it occurs anyway - - ----- -## Teamwork Policy in this Course - -Teams can set their own priorities and policies – do what works for you, experiment - * Not everybody will contribute equally to every assignment – that's okay - * Team members have different strength and weaknesses – that's good - -We will intervene in *team citizenship* issues! - -> Golden rule: Try to do what you agreed to do by the time you agreed to. If you cannot, seek help and communicate clearly and early. - - - - - - - - - ---- -# Milestone 1: Modeling and First Deployment - - -(Model building, model comparison, measurements, first deployment, teamwork documents) - - - - - ---- - -
- -## Machine Learning in Production - - -# Model Testing beyond Accuracy - -

(Slicing, Capabilities, Invariants, Simulation, ...)

- - - - ----- -# Learning Goals - -* Curate validation datasets for assessing model quality, covering subpopulations and capabilities as needed -* Explain the oracle problem and how it challenges testing of software and models -* Use invariants to check partial model properties with automated testing -* Select and deploy automated infrastructure to evaluate and monitor model quality - - ----- -# Curating Validation Data & Input Slicing - -![Fruit slices](slices.jpg) - - ----- -## Software Test Case Design - -
- -**Opportunistic/exploratory testing:** Add some unit tests, without much planning - -**Specification-based testing** ("black box"): Derive test cases from specifications - - Boundary value analysis - - Equivalence classes - - Combinatorial testing - - Random testing - -**Structural testing** ("white box"): Derive test cases to cover implementation paths - - Line coverage, branch coverage - - Control-flow, data-flow testing, MCDC, ... - -Test execution usually automated, but can be manual too; automated generation from specifications or code possible - -
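
As a small illustration of specification-based testing, boundary values and equivalence classes for a hypothetical discount function (spec and code are ours, not from the lecture):

```python
def discount(quantity: int) -> float:
    # spec: 0-99 items -> 0% discount, 100-999 -> 5%, 1000+ -> 10%
    if quantity < 0:
        raise ValueError("negative quantity")
    if quantity >= 1000:
        return 0.10
    if quantity >= 100:
        return 0.05
    return 0.0

# one test per equivalence class, plus the boundaries between them
assert discount(0) == 0.0 and discount(99) == 0.0
assert discount(100) == 0.05 and discount(999) == 0.05
assert discount(1000) == 0.10
```
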
- - ----- -## Not All Inputs are Equal - -![Google Home](googlehome.jpg) - - -"Call mom" -"What's the weather tomorrow?" -"Add asafetida to my shopping list" - - ----- -## Input Partitioning Example - -
- - -![Input partitioning example](inputpartitioning2.png) - -Input divided by movie age. Notice low accuracy, but also low support (i.e., little validation data), for old movies. - -![Input partitioning example](inputpartitioning.png) - -Input divided by genre, rating, and length. Accuracy differs, but also amount of test data used ("support") differs, highlighting low confidence areas. - - - -
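
A sketch of how such slices can be computed with pandas (column names and data are hypothetical):

```python
import pandas as pd

# one row per validation example: the slicing feature and whether
# the model's prediction was correct
results = pd.DataFrame({
    'genre':   ['action', 'action', 'comedy', 'comedy', 'comedy'],
    'correct': [True, False, True, True, False],
})
# accuracy ('mean') and support ('count') per slice
print(results.groupby('genre')['correct'].agg(['mean', 'count']))
```
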
- - - -Source: Barash, Guy, et al. "Bridging the gap between ML solutions and their business requirements using feature interactions." In Proc. FSE, 2019. - - ----- -## Example: Model Impr. at Apple (Overton) - -![Overton system](overton.png) - - - - - -Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. "[Overton: A Data System for Monitoring and Improving Machine-Learned Products](https://arxiv.org/abs/1909.05372)." arXiv preprint arXiv:1909.05372 (2019). - - - - ----- -# Testing Model Capabilities - - -![Checklist](checklist.jpg) - - - - - -Further reading: Christian Kaestner. [Rediscovering Unit Testing: Testing Capabilities of ML Models](https://towardsdatascience.com/rediscovering-unit-testing-testing-capabilities-of-ml-models-b008c778ca81). Toward Data Science, 2021. - - - ----- -## Testing Capabilities - -![Examples of Capabilities from Checklist Paper](capabilities1.png) - - - - -From: Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "[Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~wtshuang/static/papers/2020-acl-checklist.pdf)." In Proceedings ACL, p. 4902–4912. (2020). - - - ----- -## Generating Test Data for Capabilities - -**Idea 1: Domain-specific generators** - -Testing *negation* in sentiment analysis with template:
-`I {NEGATION} {POS_VERB} the {THING}.` - -Testing texture vs shape priority with artificial generated images: -![Texture vs shape example](texturevsshape.png) - - - -Figure from Geirhos, Robert, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” In Proc. International Conference on Learning Representations (ICLR), (2019). - - ----- -## Generating Test Data for Capabilities - -**Idea 3: Crowd-sourcing test creation** - -Testing *sarcasm* in sentiment analysis: Ask humans to minimally change text to flip sentiment with sarcasm - -Testing *background* in object detection: Ask humans to take pictures of specific objects with unusual backgrounds - -![Example of modifications to text](sarcasm.png) - - - -Figure from: Kaushik, Divyansh, Eduard Hovy, and Zachary C. Lipton. “Learning the difference that makes a difference with counterfactually-augmented data.” In Proc. International Conference on Learning Representations (ICLR), (2020). - - - - - ----- -# Automated (Random) Testing and Invariants - -(if it wasn't for that darn oracle problem) - -![Random dice throw](random.jpg) - - - ----- -## Cancer in Random Image? - -![](white-noise.jpg) - ----- -## The Oracle Problem - -*How do we know the expected output of a test?* - -```java -assertEquals(??, factorPrime(15485863)); -``` - - - - - ----- -## Examples of Invariants - -
- - -* Credit rating should not depend on gender: - - $\forall x. f(x[\text{gender} \leftarrow \text{male}]) = f(x[\text{gender} \leftarrow \text{female}])$ -* Synonyms should not change the sentiment of text: - - $\forall x. f(x) = f(\texttt{replace}(x, \text{"is not", "isn't"}))$ -* Negation should swap meaning: - - $\forall x \in \text{"X is Y"}. f(x) = 1-f(\texttt{replace}(x, \text{" is ", " is not "}))$ -* Robustness around training data: - - $\forall x \in \text{training data}. \forall y \in \text{mutate}(x, \delta). f(x) = f(y)$ -* Low credit scores should never get a loan (sufficient conditions for classification, "anchors"): - - $\forall x. x.\text{score} < 649 \Rightarrow \neg f(x)$ - -Identifying invariants requires domain knowledge of the problem! - -
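
A minimal sketch of checking one such invariant as an automated test, with a trivial stand-in model (all names are hypothetical):

```python
def sentiment_model(text: str) -> int:  # stand-in for a real model
    return 0 if ("not" in text or "n't" in text) else 1

def test_contraction_invariant():
    # replacing "is not" with "isn't" must not change the prediction;
    # no labels needed, only pairs of related inputs
    for text in ["the food is not bad", "this is not what I ordered"]:
        assert sentiment_model(text) == sentiment_model(text.replace("is not", "isn't"))

test_contraction_invariant()
```
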
- - ----- -# Simulation-Based Testing - -![Driving a simulator](simulationdriving.jpg) - - - - ----- -## One More Thing: Simulation-Based Testing - -
- - - -* Derive input-output pairs from simulation, esp. in vision systems -* Example: Vision for self-driving cars: - * Render scene -> add noise -> recognize -> compare recognized result with simulator state -* Quality depends on quality of simulator: - * examples: render picture/video, synthesize speech, ... - * Less suitable where input-output relationship unknown, e.g., cancer prognosis, housing price prediction - -![Simulation is the inverse of prediction](simulationbased-testing.svg) - - - - -
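
A toy sketch of the idea, where a stand-in "simulator" provides known ground truth for generated inputs (all names are hypothetical):

```python
def render_speed_sign(speed_limit: int) -> dict:
    # a real simulator would render an image; we return a symbolic scene
    return {"pixels": f"sign:{speed_limit}"}

def model_read_sign(scene: dict) -> int:  # stand-in for the vision model under test
    return int(scene["pixels"].split(":")[1])

# derive input-output pairs from the simulator's known state
for limit in [25, 40, 65]:
    assert model_read_sign(render_speed_sign(limit)) == limit
```
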
- - - -Further readings: Zhang, Mengshi, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In Proc. ASE. 2018. - - ----- - -## Test Coverage - -![](coverage.png) - - ----- -# Continuous Integration for Model Quality - -[![Uber's internal dashboard](uber-dashboard.png)](https://eng.uber.com/michelangelo/) - - - ----- -# Milestone 1: Modeling and First Deployment - - - - - - - - - - - - ---- - -

## Machine Learning in Production

# Toward Architecture and Design



----
## After requirements...

![Overview of course content](../_assets/overview.svg)



----
## Learning Goals

* Describe the role of architecture and design between requirements and implementation
* Identify the different ML components and organize and prioritize their quality concerns for a given project
* Explain the key ideas behind decision trees and random forests and analyze consequences for various qualities
* Demonstrate an understanding of the key ideas of deep learning and how it drives qualities
* Plan and execute an evaluation of the qualities of alternative AI components for a given purpose




----

![Simple architecture diagram of transcription service](transcriptionarchitecture2.svg)




* **ML components** for transcription model, pipeline to train the model, monitoring infrastructure...
* **Non-ML components** for data storage, user interface, payment processing, ...
* User requirements and assumptions
* 
* System quality vs model quality
* System requirements vs model requirements




----
# Thinking like a Software Architect

![Architecture between requirements and implementation](req-arch-impl.svg)



----

## Case Study: Twitter

![twitter](twitter.png)


Note: Source and additional reading: Raffi. [New Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html) Twitter Blog, 2013


----

## Twitter Case Study: Key Insights

Architectural decisions affect entire systems, not only individual modules

Abstract, different abstractions for different scenarios

Reason about quality attributes early

Make architectural decisions explicit

Question: **Did the original architect make poor decisions?**




----
## System Decomposition

![Simple architecture diagram of transcription service](transcriptionarchitecture2.svg)


Identify components and their responsibilities

Establishes interfaces and team boundaries


----
## Information Hiding

Decomposition enables scaling teams

Each team works on a component

Need to coordinate on *interfaces*, but implementations remain hidden

**Interface descriptions are crucial**
* Who is responsible for what
* Component requirements (specifications), behavioral and quality
* Especially consider nonlocal qualities: e.g., safety, privacy

Interfaces rarely fully specified in practice, source of conflicts




----
## Common components

* **Model inference service**: Uses model to make predictions for input data
* **ML pipeline**: Infrastructure to train/update the model
* **Monitoring**: Observe model and system
* **Data sources**: Manual/crowdsourcing/logs/telemetry/...
* **Data management**: Storage and processing of data, often at scale
* **Feature store**: Reusable feature engineering code, cached feature computations

----
## Common System-Wide Design Challenges

Separating concerns, understanding interdependencies
* e.g., anticipating/breaking feedback loops, conflicting needs of components

Facilitating experimentation, updates with confidence

Separating training and inference and closing the loop
* e.g., collecting telemetry to learn from user interactions

Learn, serve, and observe at scale or with resource limits
* e.g., cloud deployment, embedded devices




----
## Qualities of Interest?
- -Scenario: Component for detecting credit card frauds, as a service for banks - -![Credit card](credit-card.jpg) - - - -Note: Very high volume of transactions, low cost per transaction, frequent updates - -Incrementality - - - ----- -![Table of NFPs and their relationship to different components](nfps.png) - - - -From: Habibullah, Khan Mohammad, Gregory Gay, and Jennifer Horkoff. "[Non-Functional Requirements for Machine Learning: An Exploration of System Scope and Interest](https://arxiv.org/abs/2203.11063)." arXiv preprint arXiv:2203.11063 (2022). - - - ----- -## Decision Trees: Qualities - -![Decision tree](decisiontreeexample.png) - - -* Tasks: Classification & regression -* Qualities: __Advantages__: ?? __Drawbacks__: ?? - -Notes: -* Easy to interpret (up to a size); can capture non-linearity; can do well with - little data -* High risk of overfitting; possibly very large tree size -* Obvious ones: fairly small model size, low inference cost, -no obvious incremental training; easy to interpret locally and -even globally if shallow; easy to understand decision boundaries - - - - - - - ----- -$f_{\mathbf{W}_h,\mathbf{b}_h,\mathbf{W}_o,\mathbf{b}_o}(\mathbf{X})=\phi( \mathbf{W}_o \cdot \phi(\mathbf{W}_h \cdot \mathbf{X}+\mathbf{b}_h)+\mathbf{b}_o)$ - -![Multi Layer Perceptron](mlperceptron.svg) - - -(matrix multiplications interleaved with step function) - - ----- -## Deep Learning - -![neural-network](neural-network.png) - -* Tasks: Classification & regression -* Qualities: __Advantages__: ?? __Drawbacks__: ?? - -Notes: -* High accuracy; can capture a wide range of problems (linear & non-linear) -* Difficult to interpret; high training costs (time & amount of -data required, hyperparameter tuning) - ----- -## Network Size - -
- -* 50 Layer ResNet network -- classifying 224x224 images into 1000 categories - * 26 million weights, computes 16 million activations during inference, 168 MB to store weights as floats -* Google in 2012(!): 1TB-1PB of training data, 1 billion to 1 trillion parameters -* OpenAI’s GPT-2 (2019) -- text generation - - 48 layers, 1.5 billion weights (~12 GB to store weights) - - released model reduced to 117 million weights - - trained on 7-8 GPUs for 1 month with 40GB of internet text from 8 million web pages -* OpenAI’s GPT-3 (2020): 96 layers, 175 billion weights, 700 GB in memory, $4.6M in approximate compute cost for training -
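
The memory figures follow from simple arithmetic; a one-line sketch using the numbers above:

```python
params = 175_000_000_000   # GPT-3 weights
bytes_per_weight = 4       # 32-bit float
print(params * bytes_per_weight / 1e9, "GB")  # 700.0 GB, as stated above
```
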
- -Notes: https://lambdalabs.com/blog/demystifying-gpt-3/ - ----- -## Cost & Energy Consumption - -
- - - -| Consumption | CO2 (lbs) | -| - | - | -| Air travel, 1 passenger, NY↔SF | 1984 | -| Human life, avg, 1 year | 11,023 | -| American life, avg, 1 year | 36,156 | -| Car, avg incl. fuel, 1 lifetime | 126,000 | - - - -| Training one model (GPU) | CO2 (lbs) | -| - | - | -| NLP pipeline (parsing, SRL) | 39 | -| w/ tuning & experimentation | 78,468 | -| Transformer (big) | 192 | -| w/ neural architecture search | 626,155 | - - - -
- - -Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "[Energy and Policy Considerations for Deep Learning in NLP](https://arxiv.org/pdf/1906.02243.pdf)." In Proc. ACL, pp. 3645-3650. 2019. - - - - - ----- -# Constraints and Tradeoffs - -![Pareto Front Example](pareto-front.svg) - - - - ----- -## Constraints - -Constraints define the space of attributes for valid design solutions - -![constraints](design-space.svg) - - -Note: Design space exploration: The space of all possible designs (dotted rectangle) is reduced by several constraints on qualities of the system, leaving only a subset of designs for further consideration (highlighted center area). - - ----- -## Trade-offs: Cost vs Accuracy - - - -![Netflix prize leaderboard](netflix-leaderboard.png) - - - - -_"We evaluated some of the new methods offline but the additional -accuracy gains that we measured did not seem to justify the -engineering effort needed to bring them into a production -environment.”_ - - - - - -Amatriain & Basilico. [Netflix Recommendations: Beyond the 5 stars](https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429), -Netflix Technology Blog (2012) - - ----- -## Breakout: Qualities & ML Algorithms - -Consider two scenarios: -1. Credit card fraud detection -2. Pedestrian detection in sidewalk robot - -As a group, post to `#lecture` tagging all group members: -> * Qualities of interests: ?? -> * Constraints: ?? -> * ML algorithm(s) to use: ?? - - - - - ---- - -
- -## Machine Learning in Production - -# Deploying a Model - - - - ----- - -## Learning Goals - -
- -* Understand important quality considerations when deploying ML components -* Follow a design process to explicitly reason about alternative designs and their quality tradeoffs -* Gather data to make informed decisions about what ML technique to use and where and how to deploy it -* Understand the power of design patterns for codifying design knowledge -* -* Create architectural models to reason about relevant characteristics -* Critique the decision of where an AI model lives (e.g., cloud vs edge vs hybrid), considering the relevant tradeoffs -* Deploy models locally and to the cloud -* Document model inference services - -
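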
- - ----- -## Deploying a Model is Easy - -Model inference component as a service - - -```python -from flask import Flask, escape, request -app = Flask(__name__) -app.config['UPLOAD_FOLDER'] = '/tmp/uploads' -detector_model = … # load model… - -# inference API that returns JSON with classes -# found in an image -@app.route('/get_objects', methods=['POST']) -def pred(): - uploaded_img = request.files["images"] - coverted_img = … # feature encoding of uploaded img - result = detector_model(converted_img) - return jsonify({"response": - result['detection_class_entities']}) - -``` - - - - - - ----- -## But is it really easy? - -Offline use? - -Deployment at scale? - -Hardware needs and operating cost? - -Frequent updates? - -Integration of the model into a system? - -Meeting system requirements? - -**Every system is different!** - ----- -![](pgh-cycling.jpg) -Notes: Cycling map of Pittsburgh. Abstraction for navigation with bikes and walking. - ----- -## What can we reason about? - -![Apollo Self-Driving Car Architecture](apollo.png) - - - -Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE, 2020. - - - ----- -## Case Study: Augmented Reality Translation -![Google Glasses](googleglasses.jpg) - - -Notes: Consider you want to implement an instant translation service similar toGoogle translate, but run it on embedded hardware in glasses as an augmented reality service. - ----- -## Where Should the Models Live? - -![AR Translation Architecture Sketch](ar-architecture.svg) - - -Cloud? Phone? Glasses? - -What qualities are relevant for the decision? - -Notes: Trigger initial discussion - - ----- -## Breakout: Latency and Bandwidth Analysis - - -1. Estimate latency and bandwidth requirements between components -2. Discuss tradeoffs among different deployment models - - -![AR Translation Architecture Sketch](ar-architecture.svg) - - - -As a group, post in `#lecture` tagging group members: -* Recommended deployment for OCR (with justification): -* Recommended deployment for Translation (with justification): - - - -Notes: Identify at least OCR and Translation service as two AI components in a larger system. Discuss which system components are worth modeling (e.g., rendering, database, support forum). Discuss how to get good estimates for latency and bandwidth. - -Some data: -200ms latency is noticable as speech pause; -20ms is perceivable as video delay, 10ms as haptic delay; -5ms referenced as cybersickness threshold for virtual reality -20ms latency might be acceptable - -bluetooth latency around 40ms to 200ms - -bluetooth bandwidth up to 3mbit, wifi 54mbit, video stream depending on quality 4 to 10mbit for low to medium quality - -google glasses had 5 megapixel camera, 640x360 pixel screen, 1 or 2gb ram, 16gb storage - - ----- -## Reusing Feature Engineering Code - - -![Feature encoding shared between training and inference](shared-feature-encoding.svg) - - - -Avoid *training–serving skew* - ----- -## Tecton Feature Store - - - ----- -## Separating Models and Business Logic - -![3-tier architecture integrating ML](3tier-with-ml.svg) - - - -Based on: Yokoyama, Haruki. "Machine learning system architectural pattern for improving operational stability." In Int'l Conf. Software Architecture Companion, pp. 267-274. IEEE, 2019. 
- ----- -## Documenting Input/Output Types for Inference Components - -```js -{ - "mid": string, - "languageCode": string, - "name": string, - "score": number, - "boundingPoly": { - object (BoundingPoly) - } -} -``` -From Google’s public [object detection API](https://cloud.google.com/vision/docs/object-localizer). - ----- -![Model card screenshot from Google](modelcard2.png) - - - -From: https://modelcards.withgoogle.com/object-detection - - ----- -## Anti-Patterns - -* Big Ass Script Architecture -* Dead Experimental Code Paths -* Glue code -* Multiple Language Smell -* Pipeline Jungles -* Plain-Old Datatype Smell -* Undeclared Consumers - - - - - -See also: 🗎 Washizaki, Hironori, Hiromu Uchida, Foutse Khomh, and Yann-Gaël Guéhéneuc. "[Machine Learning Architecture and Design Patterns](http://www.washi.cs.waseda.ac.jp/wp-content/uploads/2019/12/IEEE_Software_19__ML_Patterns.pdf)." Draft, 2019; 🗎 Sculley, et al. "[Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)." In NeurIPS, 2015. - ---- -# I3: Architecture - - - ---- - -
- -## Machine Learning in Production - -# Testing in Production - - ----- -
- - ----- -## Learning Goals - -* Design telemetry for evaluation in practice -* Understand the rationale for beta tests and chaos experiments -* Plan and execute experiments (chaos, A/B, shadow releases, ...) in production -* Conduct and evaluate multiple concurrent A/B tests in a system -* Perform canary releases -* Examine experimental results with statistical rigor -* Support data scientists with monitoring platforms providing insights from production data - ----- -## Beta Testing - -![Windows 95 beta release](windowsbeta.jpg) - - -Note: Early release to select users, asking them to send feedback or report issues. No telemetry in early days. - ----- -## Crash Telemetry - -![Windows 95 Crash Report](wincrashreport_windows_xp.png) - -Note: With internet availability, send crash reports home to identify problems "in production". Most ML-based systems are online in some form and allow telemetry. - ----- -## A/B Testing - -![A/B test example](ab-groove.jpg) - -Notes: Usage observable online, telemetry allows testing in production. Picture source: https://www.designforfounders.com/ab-testing-examples/ - - ----- - - -![Skype feedback dialog](skype1.jpg) - -![Skype report problem button](skype2.jpg) - - -Notes: -Expect only sparse feedback and expect negative feedback over-proportionally - ----- -![Flight cost forcast](flightforcast.jpg) - -Notes: Can just wait 7 days to see actual outcome for all predictions - - ----- -## Measuring Model Quality with Telemetry - -
- -* Usual 3 steps: (1) Metric, (2) data collection (telemetry), (3) operationalization -* Telemetry can provide insights for correctness - - sometimes very accurate labels for real unseen data - - sometimes only mistakes - - sometimes delayed - - often just samples - - often just weak proxies for correctness -* Often sufficient to *approximate* precision/recall or other model-quality measures -* Mismatch to (static) evaluation set may indicate stale or unrepresentative data -* Trend analysis can provide insights even for inaccurate proxy measures -
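
A tiny sketch of operationalizing such a proxy from telemetry events (the event format is hypothetical):

```python
# telemetry: did users accept a prediction or manually correct it?
events = [
    {"prediction_id": 1, "action": "accepted"},
    {"prediction_id": 2, "action": "edited"},   # correction suggests a mistake
    {"prediction_id": 3, "action": "accepted"},
]
accepted = sum(e["action"] == "accepted" for e in events)
proxy_accuracy = accepted / len(events)  # weak proxy; mainly useful for trends
print(round(proxy_accuracy, 2))  # 0.67
```
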
- ----- -## Breakout: Design Telemetry in Production - -
- -Discuss how to collect telemetry, the metric to monitor, and how to operationalize - -Scenarios: -* Front-left: Amazon: Shopping app detects the shoe brand from photos -* Front-right: Google: Tagging uploaded photos with friends' names -* Back-left: Spotify: Recommended personalized playlists -* Back-right: Wordpress: Profanity filter to moderate blog posts - -As a group post to `#lecture` and tag team members: -> * Quality metric: -> * Data to collect: -> * Operationalization: - -
- ----- -![Grafana screenshot from Movie Recommendation Service](grafana.png) - - ----- -## Detecting Drift - -![Drift](drift.jpg) - - -Image source: Joel Thomas and Clemens Mewald. [Productionizing Machine Learning: From Deployment to Drift Detection](https://databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html). Databricks Blog, 2019 - ----- -## Engineering Challenges for Telemetry -![Amazon news story](alexa.png) - ----- -## Model Quality vs System Quality - -![Booking.com homepage](bookingcom.png) - - - -Bernardi, Lucas, et al. "150 successful machine learning models: 6 lessons learned at Booking.com." In Proc. Int'l Conf. Knowledge Discovery & Data Mining, 2019. - - ----- - - -![A/B experiment at Bing](kohavi-bing-search.jpg) - - -

## Bing Experiment

* Experiment: Ad Display at Bing
* Suggestion prioritized low
* Not implemented for 6 months
* Ran A/B test in production
* Within 2h *revenue-too-high* alarm triggered suggesting serious bug (e.g., double billing)
* Revenue increased by 12% - $100M annually in US
* Did not hurt user-experience metrics
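
Evaluating such an experiment usually involves a significance test over the two groups; a minimal sketch with illustrative numbers (not Bing's data):

```python
from scipy import stats

control   = [2.0, 2.2, 1.9, 2.1, 2.0, 2.3, 1.8, 2.1]  # metric per user, illustrative
treatment = [2.3, 2.5, 2.2, 2.4, 2.6, 2.3, 2.2, 2.5]
t, p = stats.ttest_ind(treatment, control)
print(f"t={t:.2f}, p={p:.4f}")  # small p: difference unlikely to be chance
```
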
- -
- -From: Kohavi, Ron, Diane Tang, and Ya Xu. "[Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing](https://bookshop.org/books/trustworthy-online-controlled-experiments-a-practical-guide-to-a-b-testing/9781108724265)." 2020. - -
- - - ----- -## Feature Flags (Boolean flags) - -
- -```java -if (features.enabled(userId, "one_click_checkout")) { - // new one click checkout function -} else { - // old checkout functionality -} -``` - -* Good practices: tracked explicitly, documented, keep them localized and independent -* External mapping of flags to customers, who should see what configuration - * e.g., 1% of users sees `one_click_checkout`, but always the same users; or 50% of beta-users and 90% of developers and 0.1% of all users - -```scala -def isEnabled(user): Boolean = (hash(user.id) % 100) < 10 -``` - -
- ----- -![t-test in an A/B testing dashboard](testexample.png) - - - -Source: https://conversionsciences.com/ab-testing-statistics/ - - ----- -## Canary Releases - - - -Release new version to small percentage of population (like A/B testing) - -Automatically roll back if quality measures degrade - -Automatically and incrementally increase deployment to 100% otherwise - - - -![Canary bird](canary.jpg) - - - - ----- -## Chaos Experiments - -[![Simian Army logo by Netflix](simianarmy.jpg)](https://en.wikipedia.org/wiki/Chaos_engineering) - - - - - ---- - -

## Machine Learning in Production


# Data Quality




----

## Learning Goals

* Distinguish precision and accuracy; understand the better-models-vs-more-data tradeoff
* Use schema languages to enforce data schemas
* Design and implement automated quality assurance steps that check data schema conformance and distributions
* Devise infrastructure for detecting data drift and schema violations
* Consider data quality as part of a system; design an organization that values data quality


----

> Data cleaning and repairing account for about 60% of the work of data scientists.


**Own experience?**



Quote: Gil Press. “[Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/).” Forbes Magazine, 2016.


----
## Case Study: Inventory Management

![Shelves in a warehouse](warehouse.jpg)



----
## Many Data Sources
(Diagram: Twitter, SalesTrends, AdNetworks, VendorSales, ProductData, Marketing, Expired/Lost/Theft, and PastSales all feed into the Inventory ML system)
- -*sources of different reliability and quality* - - ----- -## *Raw Data* is an Oxymoron - -![shipment receipt form](shipment-delivery-receipt.jpg) - - - - - -Recommended Reading: Gitelman, Lisa, Virginia Jackson, Daniel Rosenberg, Travis D. Williams, Kevin R. Brine, Mary Poovey, Matthew Stanley et al. "[Data bite man: The work of sustaining a long-term study](https://ieeexplore.ieee.org/abstract/document/6462156)." In "Raw Data" Is an Oxymoron, (2013), MIT Press: 147-166. - ----- -## Accuracy vs Precision - - - -Accuracy: Reported values (on average) represent real value - -Precision: Repeated measurements yield the same result - -Accurate, but imprecise: Average over multiple measurements - -Inaccurate, but precise: ? - - - - -![Accuracy-vs-precision visualized](Accuracy_and_Precision.svg) - - - - - -(CC-BY-4.0 by [Arbeck](https://commons.wikimedia.org/wiki/File:Accuracy_and_Precision.svg)) - - ----- -## Data Cascades - -![Data cascades figure](datacascades.png) - -Detection almost always delayed! Expensive rework. -Difficult to detect in offline evaluation. - - -Sambasivan, N., et al. (2021, May). “[Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://dl.acm.org/doi/abs/10.1145/3411764.3445518). In Proc. CHI (pp. 1-15). - - - - ----- -## Schema in Relational Databases - -```sql -CREATE TABLE employees ( - emp_no INT NOT NULL, - birth_date DATE NOT NULL, - name VARCHAR(30) NOT NULL, - PRIMARY KEY (emp_no)); -CREATE TABLE departments ( - dept_no CHAR(4) NOT NULL, - dept_name VARCHAR(40) NOT NULL, - PRIMARY KEY (dept_no), UNIQUE KEY (dept_name)); -CREATE TABLE dept_manager ( - dept_no CHAR(4) NOT NULL, - emp_no INT NOT NULL, - FOREIGN KEY (emp_no) REFERENCES employees (emp_no), - FOREIGN KEY (dept_no) REFERENCES departments (dept_no), - PRIMARY KEY (emp_no,dept_no)); -``` - - - ----- -## Example: HoloClean - -![HoloClean](holoclean.jpg) - - -
- -* User provides rules as integrity constraints (e.g., "two entries with the same -name can't have different city") -* Detect violations of the rules in the data; also detect statistical outliers -* Automatically generate repair candidates (with probabilities) -
- - -Image source: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “[HoloClean - Weakly Supervised Data Repairing](https://dawn.cs.stanford.edu/2017/05/12/holoclean/).” Blog, 2017. - ----- - - -## Drift & Model Decay - -

**Concept drift** (or concept shift)
  * properties to predict change over time (e.g., what is credit card fraud)
  * model has not learned the relevant concepts
  * over time: different expected outputs for same inputs

**Data drift** (or covariate shift, distribution shift, or population drift)
  * characteristics of input data change (e.g., customers with face masks)
  * input data differs from training data
  * over time: predictions less confident, further from training data

**Upstream data changes**
  * external changes in data pipeline (e.g., format changes in weather service)
  * model interprets input data incorrectly
  * over time: abrupt changes due to faulty inputs

**How do we fix these drifts?**
-Notes: - * fix1: retrain with new training data or relabeled old training data - * fix2: retrain with new data - * fix3: fix pipeline, retrain entirely - ----- -## Breakout: Drift in the Inventory System - -*What kind of drift might be expected?* - -As a group, tagging members, write plausible examples in `#lecture`: - -> * Concept Drift: -> * Data Drift: -> * Upstream data changes: - - -![Shelves in a warehouse](warehouse.jpg) - - - - - - ----- -## Microsoft Azure Data Drift Dashboard - -![Dashboard](drift-ui-expanded.png) - - -Image source and further readings: [Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-datasets?tabs=python) - - - ----- - -> "Everyone wants to do the model work, not the data work" - - -Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “[Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://dl.acm.org/doi/abs/10.1145/3411764.3445518). In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15). - ----- -## Data Quality Documentation - -
-
-Teams rarely document expectations of data quantity or quality
-
-Data quality tests are rare, but some teams adopt defensive monitoring
-* Local tests about assumed structure and distribution of data (see the sketch below)
-* Identify drift early and reach out to producing teams
-
-Several ideas for documenting distributions, including [Datasheets](https://dl.acm.org/doi/fullHtml/10.1145/3458723) and [Dataset Nutrition Label](https://arxiv.org/abs/1805.03677)
-* Mostly focused on static datasets, describing origin, considerations, labeling procedure, and distributions; [Example](https://dl.acm.org/doi/10.1145/3458723#sec-supp)
-
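-*A minimal sketch of such a defensive data quality test (file name, columns, and thresholds are hypothetical):*
-
-```python
-import pandas as pd
-
-def test_delivery_data_quality():
-    df = pd.read_csv('deliveries.csv')  # hypothetical local dataset
-    # structural expectations
-    assert {'datetime', 'delivery_count'} <= set(df.columns)
-    assert df['datetime'].notna().all()
-    # value and distribution expectations, to catch drift or unit changes early
-    assert df['delivery_count'].between(0, 10_000).all()
-    assert 0 < df['delivery_count'].mean() < 500
-```
-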
- - -🗎 Gebru, Timnit, et al. "[Datasheets for datasets](https://dl.acm.org/doi/fullHtml/10.1145/3458723)." Communications of the ACM 64, no. 12 (2021).
-🗎 Nahar, Nadia, et al. “[Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process](https://arxiv.org/abs/2110.10234).” In Proc. ICSE, 2022.
-
-
-
-
---
-
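-## Sketch: Detecting Data Drift
-
-*A minimal sketch of the drift-detection idea from the earlier drift slides: compare a feature's distribution in recent production data against the training data (frames and threshold are hypothetical):*
-
-```python
-from scipy.stats import ks_2samp
-
-# train_df: data the model was trained on; prod_df: recent production inputs
-statistic, p_value = ks_2samp(train_df['delivery_count'],
-                              prod_df['delivery_count'])
-if p_value < 0.01:  # small p-value: distributions likely differ
-    print("Possible data drift in delivery_count, consider retraining")
-```
-
---
-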
-
-## Machine Learning in Production
-
-
-# Automating and Testing ML Pipelines
-
-
-
-----
-
-# Learning Goals
-
-* Decompose an ML pipeline into testable functions
-* Implement and automate tests for all parts of the ML pipeline
-* Understand testing opportunities beyond functional correctness
-* Describe the different testing levels and testing opportunities at each level
-* Automate test execution with continuous integration
-
-----
-# ML Pipelines
-
-![Pipeline](pipeline.svg)
-
-
-All steps to create (and deploy) the model
-
-
-----
-## Notebooks as Production Pipeline?
-
-[![How to Notebook in Production Blog post](notebookinproduction.png)](https://tanzu.vmware.com/content/blog/how-data-scientists-can-tame-jupyter-notebooks-for-use-in-production-systems)
-
-Parameterize and use `nbconvert`?
-
-
-----
-## Real Pipelines can be Complex
-
-![Connections between the pipeline and other components](pipeline-connections.svg)
-
-
-----
-## Possible Mistakes in ML Pipelines
-
-![Pipeline](pipeline.svg)
-
-
-Danger of "silent" mistakes in many phases
-
-**Examples?**
-
-
-----
-## Pipeline restructured into separate functions
-
-
-```python
-import pandas as pd
-from sklearn.linear_model import LinearRegression
-
-def encode_day_of_week(df):
-    if 'datetime' not in df.columns: raise ValueError("Column datetime missing")
-    if df.datetime.dtype != 'object': raise ValueError("Invalid type for column datetime")
-    df['dayofweek'] = pd.to_datetime(df['datetime']).dt.day_name()
-    df = pd.get_dummies(df, columns=['dayofweek'])
-    return df
-
-
-# ... (clean_data, encode_month, encode_weather, encode_count, evaluate elided)
-
-
-def prepare_data(df):
-    df = clean_data(df)
-    df = encode_day_of_week(df)
-    df = encode_month(df)
-    df = encode_weather(df)
-    df.drop(['datetime'], axis=1, inplace=True)
-    return (df.drop(['delivery_count'], axis=1),
-            encode_count(pd.Series(df['delivery_count'])))
-
-
-def learn(X, y):
-    lr = LinearRegression()
-    lr.fit(X, y)
-    return lr
-
-
-def pipeline():
-    train = pd.read_csv('train.csv', parse_dates=True)
-    test = pd.read_csv('test.csv', parse_dates=True)
-    X_train, y_train = prepare_data(train)
-    X_test, y_test = prepare_data(test)
-    model = learn(X_train, y_train)
-    accuracy = evaluate(model, X_test, y_test)  # renamed from eval to avoid shadowing the built-in
-    return model, accuracy
-```
-
-
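-*A smoke test of the whole pipeline can complement the module tests on the next slide; a minimal pytest sketch (assumes the elided helpers accept these columns; the `weather` column is hypothetical):*
-
-```python
-import pandas as pd
-
-def test_pipeline_smoke(tmp_path, monkeypatch):
-    df = pd.DataFrame({
-        'datetime': pd.date_range('2020-01-01', periods=30, freq='D').astype(str),
-        'weather': ['sunny'] * 30,
-        'delivery_count': list(range(30)),
-    })
-    df.to_csv(tmp_path / 'train.csv', index=False)
-    df.to_csv(tmp_path / 'test.csv', index=False)
-    monkeypatch.chdir(tmp_path)  # pipeline() reads train.csv/test.csv from cwd
-
-    model, accuracy = pipeline()
-    assert model is not None
-    assert 0 <= accuracy <= 1
-```
-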
- ----- -## Test the Modules - -```python -def encode_day_of_week(df): - if 'datetime' not in df.columns: raise ValueError("Column datetime missing") - if df.datetime.dtype != 'object': raise ValueError("Invalid type for column datetime") - df['dayofweek']= pd.to_datetime(df['datetime']).dt.day_name() - df = pd.get_dummies(df, columns = ['dayofweek']) - return df -``` - -```python -def test_day_of_week_encoding(): - df = pd.DataFrame({'datetime': ['2020-01-01','2020-01-02','2020-01-08'], 'delivery_count': [1, 2, 3]}) - encoded = encode_day_of_week(df) - assert "dayofweek_Wednesday" in encoded.columns - assert (encoded["dayofweek_Wednesday"] == [1, 0, 1]).all() - -# more tests... -``` - - - - ----- -## Subtle Bugs in Data Wrangling Code - - -```python -df['Join_year'] = df.Joined.dropna().map( - lambda x: x.split(',')[1].split(' ')[1]) -``` -```python -df.loc[idx_nan_age,'Age'].loc[idx_nan_age] = - df['Title'].loc[idx_nan_age].map(map_means) -``` -```python -df["Weight"].astype(str).astype(int) -``` - - ----- -## Build systems & Continuous Integration - -Automate all build, analysis, test, and deployment steps from a command line call - -Ensure all dependencies and configurations are defined - -Ideally reproducible and incremental - -Distribute work for large jobs - -Track results - -**Key CI benefit: Tests are regularly executed, part of process** - - ----- - -![](mltestingandmonitoring.png) - - -Source: Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. [The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction](https://research.google.com/pubs/archive/46555.pdf). Proceedings of IEEE Big Data (2017) - ----- - -## Case Study: Covid-19 Detection - - - -(from S20 midterm; assume cloud or hybrid deployment) - - - ----- -## Stubbing the Dependency - -![Test driver-code-stub](driver-code-stub.svg) - -```python -def test_do_not_overwrite_gender(): - def model_stub(first, last, location): - return 'M' - - df = pd.DataFrame({'firstname': ['John', 'Jane', 'Jim'], 'lastname': ['Doe', 'Doe', 'Doe'], 'location': ['Pittsburgh, PA', 'Rome, Italy', 'Paris, PA '], 'gender': [np.nan, 'F', np.nan]}) - out = clean_gender(df, model_stub) - assert(out['gender'] ==['M', 'F', 'M']).all() -``` - - - ----- -## General Testing Strategy: Decoupling Code Under Test - -![Test driver-code-stub](driver-stubs-interface.svg) - - -(Mocking frameworks provide infrastructure for expressing such tests compactly.) - ----- -# Integration and system tests - -![Testing levels](unit-integration-system-testing.svg) - ----- -## The V-Model of Testing - -![V-Model](vmodel.svg) - - - - - ----- -# Code Review and Static Analysis - - ----- -![Code Review on GitHub](review_github.png) - ----- -## Static Analysis, Code Linting - -Automatic detection of problematic patterns based on code structure - -```java -if (user.jobTitle = "manager") { - ... -} -``` - -```javascript -function fn() { - x = 1; - return x; - x = 3; -} -``` - - - - - - - - - - - - - - ---- -# Midterm - -(UPMC predictive maintenance) - ---- -# Milestone 2: Infrastructure Quality - -(online and offline evaluation, data quality, pipeline testing, continuous integrations, pull requests) - - - - ---- - -
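-## Aside: Stubs with a Mocking Framework
-
-*As noted on the stubbing slides, mocking frameworks express such tests compactly; a sketch with Python's `unittest.mock`, reusing the `clean_gender` example (the call-count check assumes `clean_gender` only queries the model for missing values):*
-
-```python
-import numpy as np
-import pandas as pd
-from unittest.mock import Mock
-
-def test_do_not_overwrite_gender():
-    model_stub = Mock(return_value='M')  # stands in for the gender model
-    df = pd.DataFrame({'firstname': ['John', 'Jane', 'Jim'],
-                       'lastname': ['Doe', 'Doe', 'Doe'],
-                       'location': ['Pittsburgh, PA', 'Rome, Italy', 'Paris, PA'],
-                       'gender': [np.nan, 'F', np.nan]})
-    out = clean_gender(df, model_stub)
-    assert (out['gender'] == ['M', 'F', 'M']).all()
-    assert model_stub.call_count == 2  # only consulted for the two missing values
-```
-
---
-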
-
-## Machine Learning in Production
-
-# Scaling Data Storage and Data Processing
-
-
-
-----
-
-# Learning Goals
-
-* Organize different data management solutions and their tradeoffs
-* Understand the scalability challenges involved in large-scale machine learning and specifically deep learning
-* Explain the tradeoffs between batch processing and stream processing and the lambda architecture
-* Recommend and justify a design and corresponding technologies for a given system
-
-----
-# Case Study
-
-![Google Photos Screenshot](gphotos.png)
-
-
-Notes:
-* Discuss possible architecture and when to predict (and update)
-* in May 2017: 500M users, uploading 1.2 billion photos per day (14k/sec)
-* in Jun 2019: 1 billion users
-
-----
-
-## Adding capacity
-
-
-
-*Stories of catastrophic success?*
-
-----
-## Distributed Everything
-
-Distributed data cleaning
-
-Distributed feature extraction
-
-Distributed learning
-
-Distributed large prediction tasks
-
-Incremental predictions
-
-Distributed logging and telemetry
-
-
-----
-## Distributed Gradient Descent
-
-[![Parameter Server](parameterserver.png)](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf)
-
-
-----
-## Relational Data Models
-
- -**Photos:** - -|photo_id|user_id|path|upload_date|size|camera_id|camera_setting| -|-|-|-|-|-|-|-| -|133422131|54351|/st/u211/1U6uFl47Fy.jpg|2021-12-03T09:18:32.124Z|5.7|663|ƒ/1.8; 1/120; 4.44mm; ISO271| -|133422132|13221| /st/u11b/MFxlL1FY8V.jpg |2021-12-03T09:18:32.129Z|3.1|1844|ƒ/2, 1/15, 3.64mm, ISO1250| -|133422133|54351|/st/x81/ITzhcSmv9s.jpg|2021-12-03T09:18:32.131Z|4.8|663|ƒ/1.8; 1/120; 4.44mm; ISO48| - - - -**Users:** - - -| user_id |account_name|photos_total|last_login| -|-|-|-|-| -|54351| ckaestne | 5124 | 2021-12-08T12:27:48.497Z | -|13221| eva.burk |3|2021-12-21T01:51:54.713Z| - - - -**Cameras:** - - -| camera_id |manufacturer|print_name| -|-|-|-| -|663| Google | Google Pixel 5 | -|1844|Motorola|Motorola MotoG3| - - - -```sql -select p.photo_id, p.path, u.photos_total -from photos p, users u -where u.user_id=p.user_id and u.account_name = "ckaestne" -``` - -
-
-----
-
-## Document Data Models
-
-
-```js
-{
-    "_id": 133422131,
-    "path": "/st/u211/1U6uFl47Fy.jpg",
-    "upload_date": "2021-12-03T09:18:32.124Z",
-    "user": {
-        "account_name": "ckaestne",
-        "account_id": "a/54351"
-    },
-    "size": "5.7",
-    "camera": {
-        "manufacturer": "Google",
-        "print_name": "Google Pixel 5",
-        "settings": "ƒ/1.8; 1/120; 4.44mm; ISO271"
-    }
-}
-
-```
-
-```js
-db.getCollection('photos').find( { "user.account_name": "ckaestne"})
-```
-
-----
-## Log files, unstructured data
-
-```text
-02:49:12 127.0.0.1 GET /img13.jpg 200
-02:49:35 127.0.0.1 GET /img27.jpg 200
-03:52:36 127.0.0.1 GET /main.css 200
-04:17:03 127.0.0.1 GET /img13.jpg 200
-05:04:54 127.0.0.1 GET /img34.jpg 200
-05:38:07 127.0.0.1 GET /img27.jpg 200
-05:44:24 127.0.0.1 GET /img13.jpg 200
-06:08:19 127.0.0.1 GET /img13.jpg 200
-```
-
-----
-## Partitioning
-
-
-Divide data:
-
-* *Horizontal partitioning:* Different rows in different tables; e.g., movies by decade, hashing often used
-* *Vertical partitioning:* Different columns in different tables; e.g., movie title vs. all actors
-
-**Tradeoffs?**
-
-
-
-![Horizontal partitioning](horizonalpartition.svg)
-
-
-
-
-----
-## Replication with Leaders and Followers
-
-![Leader-follower replication](leaderfollowerreplication.svg)
-
-
-
-----
-## Microservices
-
-![Audible example](microservice.svg)
-
-
-
-Figure based on Christopher Meiklejohn. [Dynamic Reduction: Optimizing Service-level Fault Injection Testing With Service Encapsulation](http://christophermeiklejohn.com/filibuster/2021/10/14/filibuster-4.html). Blog Post 2021
-
-----
-[![Map Reduce example](mapreduce.svg)](mapreduce.svg)
-
-
-
-----
-## Key Design Principle: Data Locality
-
-> Moving Computation is Cheaper than Moving Data -- [Hadoop Documentation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#aMoving_Computation_is_Cheaper_than_Moving_Data)
-
-Data often large and distributed, code small
-
-Avoid transferring large amounts of data
-
-Perform computation where data is stored (distributed)
-
-Transfer only results as needed
-
-*"The map reduce way"*
-
-
-
-----
-## Stream Processing (e.g., Kafka)
-![Stream example](stream.svg)
-
-
-----
-## Common Designs
-
-Like shell programs: Read from stream, produce output in other stream. 
-> loose coupling - - -![](stream-dataflow.svg) - - ----- -## Event Sourcing - -* Append only databases -* Record edit events, never mutate data -* Compute current state from all past events, can reconstruct old state -* For efficiency, take state snapshots -* *Similar to traditional database logs, but persistent* - -```text -addPhoto(id=133422131, user=54351, path="/st/u211/1U6uFl47Fy.jpg", date="2021-12-03T09:18:32.124Z") -updatePhotoData(id=133422131, user=54351, title="Sunset") -replacePhoto(id=133422131, user=54351, path="/st/x594/vipxBMFlLF.jpg", operation="/filter/palma") -deletePhoto(id=133422131, user=54351) -``` - - ----- -## Lambda Architecture and Machine Learning - -![Lambda Architecture](lambda.svg) - - - -* Learn accurate model in batch job -* Learn incremental model in stream processor - - ----- -## Data Lake - -Trend to store all events in raw form (no consistent schema) - -May be useful later - -Data storage is comparably cheap - -Bet: *Yet unknown future value of data is greater than storage costs* - ----- -# Breakout: Vimeo Videos - -As a group, discuss and post in `#lecture`, tagging group members: -* How to distribute storage: -* How to design scalable copy-right protection solution: -* How to design scalable analytics (views, ratings, ...): - -[![Vimeo page](vimeo.png)](https://vimeo.com/about) - ----- -## ETL: Extract, Transform, Load - -* Transfer data between data sources, often OLTP -> OLAP system -* Many tools and pipelines - - Extract data from multiple sources (logs, JSON, databases), snapshotting - - Transform: cleaning, (de)normalization, transcoding, sorting, joining - - Loading in batches into database, staging -* Automation, parallelization, reporting, data quality checking, monitoring, profiling, recovery -* Many commercial tools - - -Examples of tools in [several](https://www.softwaretestinghelp.com/best-etl-tools/) [lists](https://www.scrapehero.com/best-data-management-etl-tools/) - - - - - - - - - - - - - ---- - -
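-## Sketch: A Minimal ETL Job
-
-*A toy end-to-end ETL sketch in Python (file names are hypothetical): extract the request logs shown earlier, transform them into aggregates, and load the result into an analytics database:*
-
-```python
-import sqlite3
-import pandas as pd
-
-# Extract: raw request logs (time ip method path status)
-log = pd.read_csv('access.log', sep=' ',
-                  names=['time', 'ip', 'method', 'path', 'status'])
-
-# Transform: clean and aggregate
-views = (log[log['status'] == 200]
-         .groupby('path').size()
-         .reset_index(name='view_count'))
-
-# Load: into an analytics (OLAP) database
-with sqlite3.connect('analytics.db') as conn:
-    views.to_sql('page_views', conn, if_exists='replace', index=False)
-```
-
---
-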
- -## Machine Learning in Production - - -# Planning for Operations - - - ----- - -# Learning Goals - - -* Deploy a service for models using container infrastructure -* Automate common configuration management tasks -* Devise a monitoring strategy and suggest suitable components for implementing it -* Diagnose common operations problems -* Understand the typical concerns and concepts of MLOps - - ----- -## Operations - - - -Provision and monitor the system in production, respond to problems - -Avoid downtime, scale with users, manage operating costs - -Heavy focus on infrastructure - -Traditionally sysadmin and hardware skills - - - -![SRE Book Cover](srebook.jpg) - - - - ----- -## Service Level Objectives - -Quality requirements in operations, such as -* maximum latency -* minimum system throughput -* targeted availability/error rate -* time to deploy an update -* durability for storage - -Each with typical measures - -For the system as a whole or individual services - - - - ----- -# Dev vs. Ops - -![](devops_meme.jpg) - ----- -## Common Release Problems (Examples) - -* Missing dependencies -* Different compiler versions or library versions -* Different local utilities (e.g. unix grep vs mac grep) -* Database problems -* OS differences -* Too slow in real settings -* Difficult to roll back changes -* Source from many different repositories -* Obscure hardware? Cloud? Enough memory? - ----- -# DevOps -![DevOps Cycle](devops.png) - ----- -## Common Practices - -All configurations in version control - -Test and deploy in containers - -Automated testing, testing, testing, ... - -Monitoring, orchestration, and automated actions in practice - -Microservice architectures - -Release frequently - ----- -## Heavy tooling and automation - -[![DevOps tooling overview](devops_tools.jpg)](devops_tools.jpg) - ----- -## Automate Everything - -![CD vs CD](continuous_delivery.gif) - ----- -## Containers - - -* Lightweight virtual machine -* Contains entire runnable software, incl. all dependencies and configurations -* Used in development and production -* Sub-second launch time -* Explicit control over shared disks and network connections - - -![Docker logo](docker_logo.png) - - - ----- - -![Kubernetes](Kubernetes.png) - - - -CC BY-SA 4.0 [Khtan66](https://en.wikipedia.org/wiki/Kubernetes#/media/File:Kubernetes.png) - - ----- - -![Hawkular Dashboard](https://www.hawkular.org/img/hawkular-apm/components.png) - - -https://www.hawkular.org/hawkular-apm/ - - - - ----- -## The DevOps Mindset - -* Consider the entire process and tool chain holistically -* Automation, automation, automation -* Elastic infrastructure -* Document, test, and version everything -* Iterate and release frequently -* Emphasize observability -* Shared goals and responsibilities - - - - - ----- -![MLOps](https://ml-ops.org/img/mlops-loop-banner.jpg) - - - -https://ml-ops.org/ - - ----- -## MLOps Tools -- Examples - -* Model registry, versioning and metadata: MLFlow, Neptune, ModelDB, WandB, ... -* Model monitoring: Fiddler, Hydrosphere -* Data pipeline automation and workflows: DVC, Kubeflow, Airflow -* Model packaging and deployment: BentoML, Cortex -* Distributed learning and deployment: Dask, Ray, ... -* Feature store: Feast, Tecton -* Integrated platforms: Sagemaker, Valohai, ... -* Data validation: Cerberus, Great Expectations, ... - -Long list: https://github.com/kelvins/awesome-mlops - - ----- -## Breakout: MLOps Goals - -For the blog spam filter scenario, consider DevOps and MLOps infrastructure (CI, CD, containers, config. 
mgmt, monitoring, model registry, pipeline automation, feature store, data validation, ...)
-
-As a group, tagging group members, post to `#lecture`:
-> * Which DevOps or MLOps goals to prioritize?
-> * Which tools to try?
-
-
-----
-## Incident Response Plan
-
-* Provide contact channel for problem reports
-* Have expert on call
-* Design process for anticipated problems, e.g., rollback, reboot, takedown
-* Prepare for recovery
-* Proactively collect telemetry
-* Investigate incidents
-* Plan public communication (responsibilities)
-
-----
-# Excursion: Organizational Culture
-
-![Book Cover: Organizational Culture and Leadership](orgculture.jpg)
-
-
-----
-## Organizational Culture
-
-*“this is how we always did things”*
-
-Implicit and explicit assumptions and rules guiding behavior
-
-Often grounded in history, very difficult to change
-
-Examples:
-* Move fast and break things
-* Privacy first
-* Development opportunities for all employees
-
-
-----
-![Org chart comic](orgchart.png)
-
-
-
-Source: Bonkers World
-
-
-----
-## Culture Change
-
-Changing organizational culture is very difficult
-
-Top down: espoused values, management buy-in, incentives
-
-Bottom up: activism, show value, spread
-
-
-**Examples of success or failure stories?**
-
-
-
-
---
-# I4: Tools for Production ML Systems
-
-
-
-
---
-
- -## Machine Learning in Production - -# Process and Technical Debt - - - - ----- -## Process... - -![Overview of course content](../_assets/overview.svg) - - - - ----- - -## Learning Goals - -
- - -* Overview of common data science workflows (e.g., CRISP-DM) - * Importance of iteration and experimentation - * Role of computational notebooks in supporting data science workflows -* Overview of software engineering processes and lifecycles: costs and benefits of process, common process models, role of iteration and experimentation -* Contrasting data science and software engineering processes, goals and conflicts -* Integrating data science and software engineering workflows in process model for engineering AI-enabled systems with ML and non-ML components; contrasting different kinds of AI-enabled systems with data science trajectories -* Overview of technical debt as metaphor for process management; common sources of technical debt in AI-enabled systems - -
- ----- -## Case Study: Real-Estate Website - -![Zillow front page](zillow_main.png) - - ----- -## Data Science is Iterative and Exploratory - - -![CRISP-DM](crispdm.png) - - - - -Martínez-Plumed et al. "[CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories](https://research-information.bris.ac.uk/files/220614618/TKDE_Data_Science_Trajectories_PF.pdf)." IEEE Transactions on Knowledge and Data Engineering (2019). - ----- -## Computational Notebooks - - - -
- -* Origins in "literate programming", interleaving text and code, treating programs as literature (Knuth'84) -* First notebook in Wolfram Mathematica 1.0 in 1988 -* Document with text and code cells, showing execution results under cells -* Code of cells is executed, per cell, in a kernel -* Many notebook implementations and supported languages, Python + Jupyter currently most popular - -
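-*One way such notebooks are executed head-to-tail in automated workflows is a parameterized runner like papermill; a minimal sketch (notebook names and parameters are hypothetical):*
-
-```python
-import papermill as pm
-
-# run all cells in order, with injected parameters, producing a new notebook
-pm.execute_notebook('model_training.ipynb', 'runs/2024-01.ipynb',
-                    parameters={'train_data': 'train_jan.csv', 'epochs': 10})
-```
-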
-
-
-
-![Notebook example](notebook-example.png)
-
-
-
-
-Notes:
-* See also https://en.wikipedia.org/wiki/Literate_programming
-* Demo with public notebook, e.g., https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb
-
-----
-![full](process4.png)
-
-Notes: Real experience if little attention is paid to process: increasingly complicated, increasing rework; attempts to rescue by introducing process
-
-
-----
-## Waterfall Model
-
-
-![Waterfall model](waterfall.svg)
-
-
-*taming chaos, understand req., plan before coding, remember testing*
-
-
-Notes: Although dated, the key idea is still essential -- think and plan before implementing. Not all requirements and design can be made upfront, but planning is usually helpful.
-
-----
-## Risk First: Spiral Model
-
-![Spiral model](spiral_model.svg)
-
-
-*incremental prototypes, starting with most risky components*
-
-----
-## Constant iteration: Agile
-
-![Scrum Process](scrum.svg)
-
-
-*working with customers, constant replanning*
-
-
-(Image CC BY-SA 4.0, Lakeworks)
-
-
-----
-## Discussion: Iteration in Notebook vs Agile?
-
-
-[![Experimental results showing incremental accuracy improvement](accuracy-improvements.png)](accuracy-improvements.png)
-
-![Scrum Process](scrum.svg)
-
-
-(CC BY-SA 4.0, Lakeworks)
-
-
-----
-## Model first vs Product first
-
-![Combined process](combinedprocess5.svg)
-
-
-
-
-----
-# Technical debt
-
-
-[![](debt.png)](https://www.monkeyuser.com/2018/tech-debt/)
-
-
-----
-![Technical Debt Quadrant](techDebtQuadrant.png)
-
-
-Source: Martin Fowler 2009, https://martinfowler.com/bliki/TechnicalDebtQuadrant.html
-
-
-----
-## Breakout: Technical Debt from ML
-
-As a group in `#lecture`, tagging members: Post two plausible examples of technical debt in the housing price prediction system:
- 1. Deliberate, prudent:
- 2. Reckless, inadvertent:
-
-![Zillow](zillow_main.png)
-
-
-
-Sculley, David, et al. [Hidden technical debt in machine learning systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf). Advances in Neural Information Processing Systems. 2015.
---
-
- -## Machine Learning in Production - -# Responsible ML Engineering - -(Intro to Ethics and Fairness) - - - - - - ----- -## Changing directions... - -![Overview of course content](../_assets/overview.svg) - - - - - ----- - - - -![Martin Shkreli](Martin_Shkreli_2016.jpg) - - - -
- -*In 2015, Shkreli received widespread criticism [...] obtained the manufacturing license for the antiparasitic drug Daraprim and raised its price from USD 13.5 to 750 per pill [...] referred to by the media as "the most hated man in America" and "Pharma Bro".* -- [Wikipedia](https://en.wikipedia.org/wiki/Martin_Shkreli) - -"*I could have raised it higher and made more profits for our shareholders. Which is my primary duty.*" -- Martin Shkreli - -
- - -Note: Image source: https://en.wikipedia.org/wiki/Martin_Shkreli#/media/File:Martin_Shkreli_2016.jpg - - ----- -## Another Example: Social Media - -![zuckerberg](mark-zuckerberg.png) - - -*What is the (real) organizational objective of the company?* - ----- -## Mental Health - -![teen-suicide-rate](teen-suicide-rate.png) - - -* 35% of US teenagers with low social-emotional well-being have been bullied on social media. -* 70% of teens feel excluded when using social media. - - -https://leftronic.com/social-media-addiction-statistics - ----- -## Who's to blame? - -![dont-be-evil](dont-be-evil.png) - - -*Are these companies intentionally trying to cause harm? If not, - what are the root causes of the problem?* - - ----- -## Liability? - -> THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. - -Note: Software companies have usually gotten away with claiming no liability for their products - ----- -## Buzzword or real progress? - -![Microsoft responsible AI principles](responsibleai.png) - - - ----- -## Legally protected classes (US) - -
- -- Race ([Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964)) -- Religion ([Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964)) -- National origin ([Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964)) -- Sex, sexual orientation, and gender identity ([Equal Pay Act of 1963](https://en.wikipedia.org/wiki/Equal_Pay_Act_of_1963), [Civil Rights Act of 1964](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1964), and [Bostock v. Clayton](https://en.wikipedia.org/wiki/Bostock_v._Clayton_County)) -- Age (40 and over, [Age Discrimination in Employment Act of 1967](https://en.wikipedia.org/wiki/Age_Discrimination_in_Employment_Act_of_1967)) -- Pregnancy ([Pregnancy Discrimination Act of 1978](https://en.wikipedia.org/wiki/Pregnancy_Discrimination_Act)) -- Familial status (preference for or against having children, [Civil Rights Act of 1968](https://en.wikipedia.org/wiki/Civil_Rights_Act_of_1968)) -- Disability status ([Rehabilitation Act of 1973](https://en.wikipedia.org/wiki/Rehabilitation_Act_of_1973); [Americans with Disabilities Act of 1990](https://en.wikipedia.org/wiki/Americans_with_Disabilities_Act_of_1990)) -- Veteran status ([Vietnam Era Veterans’ Readjustment Assistance Act of 1974](https://en.wikipedia.org/wiki/Vietnam_Era_Veterans'_Readjustment_Assistance_Act); [Uniformed Services Employment and Reemployment Rights Act of 1994](https://en.wikipedia.org/wiki/Uniformed_Services_Employment_and_Re-employment_Rights_Act_of_1994)) -- Genetic information ([Genetic Information Nondiscrimination Act of 2008](https://en.wikipedia.org/wiki/Genetic_Information_Nondiscrimination_Act)) - -
- - - -https://en.wikipedia.org/wiki/Protected_group - - ----- -## Dividing a Pie? - - - - -* Equal slices for everybody -* Bigger slices for active bakers -* Bigger slices for inexperienced/new members (e.g., children) -* Bigger slices for hungry people -* More pie for everybody, bake more - -*(Not everybody contributed equally during baking, not everybody is equally hungry)* - - - - -![Pie](../_chapterimg/16_fairness.jpg) - - - ----- -## Harms of Allocation - -* Withhold opportunities or resources -* Poor quality of service, degraded user experience for certain groups - -![](gender-detection.png) - - - - -_Gender Shades: Intersectional Accuracy Disparities in -Commercial Gender Classification_, Buolamwini & Gebru, ACM FAT* (2018). - ----- -## Harms of Representation - -* Over/under-representation of certain groups in organizations -* Reinforcement of stereotypes - -![](online-ad.png) - - - - -_Discrimination in Online Ad Delivery_, Latanya Sweeney, SSRN (2013). - ----- -## Historical Bias - -*Data reflects past biases, not intended outcomes* - -![Image search for "CEO"](ceo.png) - - -*Should the algorithm reflect the reality?* - -Note: "An example of this type of bias can be found in a 2018 image search -result where searching for women CEOs ultimately resulted in fewer female CEO images due -to the fact that only 5% of Fortune 500 CEOs were woman—which would cause the search -results to be biased towards male CEOs. These search results were of course reflecting -the reality, but whether or not the search algorithms should reflect this reality is an issue worth -considering." - ----- -## Tainted Labels - -*Bias in dataset labels assigned (directly or indirectly) by humans* - -![](amazon-hiring.png) - - -Example: Hiring decision dataset -- labels assigned by (possibly biased) experts or derived from past (possibly biased) hiring decisions - ----- -## Skewed Sample - -*Bias in how and what data is collected* - -![](crime-map.jpg) - - -Crime prediction: Where to analyze crime? What is considered crime? Actually a random/representative sample? - -Recall: Raw data is an oxymoron - - ----- -## Proxies - -*Features correlate with protected attribute, remain after removal* - - -![](neighborhoods.png) - - -* Example: Neighborhood as a proxy for race -* Extracurricular activities as proxy for gender and social class (e.g., “cheerleading”, “peer-mentor for ...”, “sailing team”, “classical music”) - - ----- -## Feedback Loops reinforce Bias - -![Feedback loop](feedbackloop.svg) - - - - -> "Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. " -- Cathy O'Neil in [Weapons of Math Destruction](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991016462699704436) - ----- -## Breakout: College Admission - -Scenario: Evaluate applications & identify students who are -likely to succeed - -Features: GPA, GRE/SAT, gender, race, undergrad institute, alumni - connections, household income, hometown, transcript, etc. - -As a group, post to `#lecture` tagging members: - * **Possible harms:** Allocation of resources? Quality of service? Stereotyping? Denigration? Over-/Under-representation? - * **Sources of bias:** Skewed sample? Tainted labels? Historical bias? Limited features? - Sample size disparity? Proxies? - ---- - -
-
-## Machine Learning in Production
-
-
-# Measuring Fairness
-
-
-
-----
-# Learning Goals
-
-* Understand different definitions of fairness
-* Discuss methods for measuring fairness
-* Outline interventions to improve fairness at the model level
-
-----
-## Past bias, different starting positions
-
-![Severe median income and worth disparities between white and black households](mortgage.png)
-
-
-Source: Federal Reserve’s [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm)
-
-----
-## Anti-Classification
-
-
-![](justice.jpeg)
-
-
-
-* Also called _fairness through blindness_ or _fairness through unawareness_
-* Ignore certain sensitive attributes when making a decision
-* Example: Remove gender and race from mortgage model
-* *Easy to implement, but any limitations?*
-
-----
-## Group fairness
-
-Key idea: Compare outcomes across two groups
-* Similar rates of accepted loans across racial/gender groups?
-* Similar chance of being hired/promoted between gender groups?
-* Similar rates of (predicted) recidivism across racial groups?
-
-Outcomes matter, not accuracy!
-
-----
-## Equalized odds
-
-Key idea: Focus on accuracy (not outcomes) across two groups
-
-* Similar default rates on accepted loans across racial/gender groups?
-* Similar rate of "bad hires" and "missed stars" between gender groups?
-* Similar accuracy of predicted recidivism vs actual recidivism across racial groups?
-
-Accuracy matters, not outcomes!
-
-----
-# Breakout: Cancer Prognosis
-
-![](cancer-stats.png)
-
-In groups, post to `#lecture` tagging members:
-
-* Does the model meet anti-classification fairness wrt. sex?
-* Does the model meet group fairness?
-* Does the model meet equalized odds?
-* Is the model fair enough to use?
-
-
-----
-## Intuitive Justice
-
-Research on what most people perceive as fair/just (psychology)
-
-When rewards depend on inputs and participants can choose contributions: Most people find it fair to split rewards proportional to inputs
-* *Which fairness measure does this relate to?*
-
-Most people agree that for a decision to be fair, personal characteristics that do not influence the reward, such as sex or age, should not be considered when dividing the rewards.
-* *Which fairness measure does this relate to?*
-
-----
-## Equality vs Equity
-
-![Contrasting equality, equity, and justice](eej2.jpeg)
-
-
-----
-![](fairness_tree.png)
-
-
-
-
-🕮 Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane. [Big Data and Social Science: Data Science Methods and Tools for Research and Practice](https://textbook.coleridgeinitiative.org/). Chapter 11, 2nd ed, 2020
-
-----
-## Discussion: Fairness Goal for College Admission?
-
-Strong legal precedents
-
-Very limited scope of *affirmative action*
-
-Most forms of group fairness likely illegal
-
-In practice: Anti-classification
-
-
-
-
-
-----
-# Improving Fairness of a Model
-
-In all pipeline stages:
-* Data collection
-* Data cleaning, processing
-* Training
-* Inference
-* Evaluation and auditing
-
-
-
-----
-## Example audit tool: Aequitas
-
-![](aequitas-report.png)
-
-
-----
-## Example: Tweaking Thresholds
-
-![](tweakthresholds.svg)
-
---
-
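-## Sketch: Computing the Fairness Measures
-
-*A minimal sketch of how group fairness and equalized odds might be checked from logged predictions (column and group names are hypothetical):*
-
-```python
-import pandas as pd
-
-def fairness_report(df: pd.DataFrame, group_col: str):
-    # df: one row per individual with y_true, y_pred, and a group column
-    for group, sub in df.groupby(group_col):
-        pos_rate = (sub['y_pred'] == 1).mean()                     # group fairness
-        tpr = (sub.loc[sub['y_true'] == 1, 'y_pred'] == 1).mean()  # equalized odds
-        fpr = (sub.loc[sub['y_true'] == 0, 'y_pred'] == 1).mean()  # equalized odds
-        print(f"{group}: positive rate={pos_rate:.2f}, TPR={tpr:.2f}, FPR={fpr:.2f}")
-```
-
-Compare rates across groups: group fairness asks for similar positive rates, equalized odds for similar TPR and FPR.
-
---
-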
-
-## Machine Learning in Production
-
-
-# Building Fair Products
-
-
-
-----
-## Learning Goals
-
-* Understand the role of requirements engineering in selecting ML fairness criteria
-* Understand the process of constructing datasets for fairness
-* Document models and datasets to communicate fairness concerns
-* Consider the potential impact of feedback loops on AI-based systems and the need for continuous monitoring
-* Consider achieving fairness in AI-based systems as an activity throughout the entire development cycle
-
-----
-## Most Fairness Discussions are Model-Centric or Pipeline-Centric
-
-![](fairness-lifecycle.jpg)
-
-
-
-
-_Fairness-aware Machine Learning_, Bennett et al., WSDM Tutorial (2019).
-
-
-----
-## Fairness Problems are System-Wide Challenges
-
-* **Requirements engineering challenges:** How to identify fairness concerns, fairness metric, design data collection and labeling
-* **Human-computer-interaction design challenges:** How to present results to users, fairly collect data from users, design mitigations
-* **Quality assurance challenges:** Evaluate the entire system for fairness, continuously assure in production
-* **Process integration challenges:** Incorporate fairness work in development process
-* **Education and documentation challenges:** Create awareness, foster interdisciplinary collaboration
-
-
-----
-## Negotiate Fairness Goals/Measures
-
-Equality or equity? Equalized odds? ...
-
-Cannot satisfy all. People have conflicting preferences...
-
-> *Treating everybody equally in a meritocracy will reinforce existing inequalities whereas uplifting disadvantaged communities can be seen as giving unfair advantages to people who contributed less, making it harder to succeed in the advantaged group merely due to group status.*
-
-
-
-----
-## Making Rare Skills Attainable
-
-
-![radiology](radiology.jpg)
-
-> We should stop training radiologists now. It’s just completely obvious that within five years, deep learning is going to do better than radiologists. -- [Geoffrey Hinton](https://www.youtube.com/watch?v=2HMPRXstSvQ&t=29s), 2016
-
-----
-[![Headline Rutkowski not happy about AI art](rutkowski.png)](https://www.technologyreview.com/2022/09/16/1059598/this-artist-is-dominating-ai-generated-art-and-hes-not-happy-about-it/)
-
-
-----
-## Who does the Fairness Work?
-
-Within organizations usually little institutional support for fairness work, few activists
-
-Fairness issues often raised by communities affected, after harm occurred
-
-Affected groups may need to organize to effect change
-
-
-
-*Do we place the cost of unfair systems on those already marginalized and disadvantaged?*
-
-
-
-----
-## Breakout: College Admission
-
-![](college-admission.jpg)
-
-
-Assume most universities want to automate admissions decisions.
-
-
-As a group in `#lecture`, tagging group members:
-
-> What good or bad societal implications can you anticipate, beyond a single product?
-> Should we do something about it?
-
-
-
-
-
-
-----
-## 1. Avoid Unnecessary Distinctions
-
-
-![Healthcare worker applying blood pressure monitor](blood-pressure-monitor.jpg)
-
-
-"Doctor/nurse applying blood pressure monitor" -> "Healthcare worker applying blood pressure monitor"
-
-
-----
-## 2. Suppress Potentially Problematic Outputs
-
-![Twitter post of user complaining about misclassification of friends as Gorilla](apes.png)
-
-
-*How to fix?*
-
-
-----
-## 4. 
Keep Humans in the Loop
-
-
-![Temi.com screenshot](temi.png)
-
-
-TV subtitles: Humans check transcripts, especially with heavy dialects
-
-
-----
-## Fairer Data Collection
-
-
-Carefully review data collection procedures, sampling biases, what data is collected, how trustworthy labels are, etc.
-
-Can address most sources of bias: tainted labels, skewed samples, limited features, sample size disparity, proxies:
-* deliberate what data to collect
-* collect more data, oversample where needed
-* extra effort to obtain unbiased labels
-
-> Requirements engineering, system engineering
-
-> World vs machine, data quality, data cascades
-
-
-----
-## Feedback Loops
-
-![Feedback loop](feedbackloop.svg)
-
-
-
-----
-## Barriers to Fairness Work
-
-1. Rarely an organizational priority, mostly reactive (media pressure, regulators)
-   * Limited resources for proactive work
-   * Fairness work rarely required as deliverable, low priority, ignorable
-   * No accountability for actually completing fairness work, unclear responsibilities
-
-
-*What to do?*
-
-----
-## Effecting Culture Change
-
-Buy-in from management is crucial
-
-Show that fairness work is taken seriously through action (funding, hiring, audits, policies), not just lofty mission statements
-
-Reported success strategies:
-1. Frame fairness work as financially profitable, avoiding rework and reputation cost
-2. Demonstrate concrete, quantified evidence of benefits of fairness work
-3. Continuous internal activism and education initiatives
-4. External pressure from customers and regulators
-
-
-----
-## Documenting Model Fairness
-
-Recall: Model cards
-
-![Model Card Example](modelcards.png)
-
-
-
-
-Mitchell, Margaret, et al. "[Model cards for model reporting](https://www.seas.upenn.edu/~cis399/files/lecture/l22/reading2.pdf)." In Proc. FAccT, 220-229. 2019.
-
-----
-## Documenting Fairness of Datasets
-
-![Datasheet describing labeling procedure](datasheet2.png)
-
-
-
-*Excerpt from a “Data Card” for Google’s [Open Images Extended](https://storage.googleapis.com/openimages/web/extended.html#miap) dataset ([full data card](https://storage.googleapis.com/openimages/open_images_extended_miap/Open%20Images%20Extended%20-%20MIAP%20-%20Data%20Card.pdf))*
-
---
-
-
-## Machine Learning in Production
-
-
-# Explainability and Interpretability
-
-
-----
-## Explainability as Building Block in Responsible Engineering
-
-![Overview of course content](../_assets/overview.svg)
-
-
-
-----
-# Learning Goals
-
-* Understand the importance of and use cases for interpretability
-* Explain the tradeoffs between inherently interpretable models and post-hoc explanations
-* Measure interpretability of a model
-* Select and apply techniques to debug/provide explanations for data, models and model predictions
-* Evaluate when to use interpretable models rather than ex-post explanations
-
-
-----
-
-![Adversarial examples](adversarialexample.png)
-
-
-
-Image: Gong, Yuan, and Christian Poellabauer. "[An overview of vulnerabilities of voice controlled systems](https://arxiv.org/pdf/1803.09156.pdf)." arXiv preprint arXiv:1803.09156 (2018).
-
-----
-## Detecting Anomalous Commits
-
-[![Reported commit](nodejs-unusual-commit.png)](nodejs-unusual-commit.png)
-
-
-
-Goyal, Raman, Gabriel Ferreira, Christian Kästner, and James Herbsleb. "[Identifying unusual commits on GitHub](https://www.cs.cmu.edu/~ckaestne/pdf/jsep17.pdf)." Journal of Software: Evolution and Process 30, no. 1 (2018): e1893.
-
-----
-## Is this recidivism model fair?
-
-```fortran
-IF age between 18–20 and sex is male THEN
-  predict arrest
-ELSE IF age between 21–23 and 2–3 prior offenses THEN
-  predict arrest
-ELSE IF more than three priors THEN
-  predict arrest
-ELSE
-  predict no arrest
-```
-
-
-
-Rudin, Cynthia. "[Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead](https://arxiv.org/abs/1811.10154)." Nature Machine Intelligence 1, no. 5 (2019): 206-215.
-
-----
-## How to interpret the results?
-
-![Screenshot of the COMPAS tool](compas_screenshot.png)
-
-
-
-Image source (CC BY-NC-ND 4.0): Christin, Angèle. (2017). Algorithms in practice: Comparing web journalism and criminal justice. Big Data & Society. 4.
-
-----
-![News headline: Stanford algorithm for vaccine priority controversy](stanford.png)
-
-
-
-----
-## Debugging
-
-
-
-* Why did the system make a wrong prediction in this case?
-* What does it actually learn?
-* What data makes it better?
-* How reliable/robust is it?
-* How much does the second model rely on outputs of the first?
-* Understanding edge cases
-
-
-
-![Turtle recognized as gun](gun.png)
-
-
-
-**Debugging is the most common use in practice** (Bhatt et al. "Explainable machine learning in deployment." In Proc. FAccT. 2020.)
-
-
-
-----
-# Understanding a Model
-
-Levels of explanations:
-
-* **Understanding a model**
-* Explaining a prediction
-* Understanding the data
-
-----
-## Inherently Interpretable: Sparse Linear Models
-
-$f(x) = \alpha + \beta_1 x_1 + ... + \beta_n x_n$
-
-Truthful explanations, easy to understand for humans
-
-Easy to derive contrastive explanation and feature importance
-
-Requires feature selection/regularization to narrow the model down to a few important features (e.g., Lasso); possibly restricting possible parameter values
-
-----
-## Score card: Sparse linear model with "round" coefficients
-
-![Scoring card](scoring.png)
-
-
-
-
-----
-## Post-Hoc Model Explanation: Global Surrogates
-
-1. Select dataset X (previous training set or new dataset from same distribution)
-2. Collect model predictions for every value: $y_i=f(x_i)$
-3. Train *inherently interpretable* model $g$ on (X,Y)
-4. 
Interpret surrogate model $g$
-
-
-Can measure how well $g$ fits $f$ with common model quality measures, typically $R^2$
-
-**Advantages? Disadvantages?**
-
-Notes:
-Flexible, intuitive, easy approach, easy to compare quality of surrogate model with validation data ($R^2$).
-But: Insights not based on real model; unclear how well a good surrogate model needs to fit the original model; surrogate may not be equally good for all subsets of the data; illusion of interpretability.
-Why not use surrogate model to begin with?
-
-
-----
-## Post-Hoc Model Explanation: Feature Importance
-
-
-![FI example](featureimportance.png)
-
-
-
-
-Source:
-Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019
-
-----
-## Post-Hoc Model Explanation: Partial Dependence Plot (PDP)
-
-
-![PDP Example](pdp.png)
-
-
-
-Source:
-Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019
-
-Note: bike rental data in DC
-
-
-
-----
-## Understanding Predictions from Inherently Interpretable Models is Easy
-
-Derive key influence factors or decisions from model parameters
-
-Derive contrastive counterfactuals from models
-
-**Examples:** Predict arrest for an 18-year-old male with one prior:
-
-```fortran
-IF age between 18–20 and sex is male THEN predict arrest
-ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest
-ELSE IF more than three priors THEN predict arrest
-ELSE predict no arrest
-```
-
-
-----
-## Post-Hoc Prediction Explanation: Feature Influences
-
-*Which features were most influential for a specific prediction?*
-
-
-![Lime Example](lime2.png)
-
-
-
-
-Source: https://github.com/marcotcr/lime
-
-----
-## Feature Influences in Images
-
-![Lime Example](lime_cat.png)
-
-
-
-Source: https://github.com/marcotcr/lime
-
-
-
-
-
-----
-## Multiple Counterfactuals
-
- - - -Often long or multiple explanations - -> Your loan application has been *declined*. If your *savings account* ... - -> Your loan application has been *declined*. If your lived in ... - -Report all or select "best" (e.g. shortest, most actionable, likely values) - - -*(Rashomon effect)* - -![Rashomon](rashomon.jpg) - - - - -
- - ----- -![Adversarial examples](adversarialexample.png) - - - ----- -## Prototypes and Criticisms - -* *Prototype* is a data instance that is representative of all the data -* *Criticism* is a data instance not well represented by the prototypes - -![Example](prototype-dogs.png) - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - ----- -## Influential Instance - -**Data debugging:** *What data most influenced the training?* - -![Example](influentialinstance.png) - - - -Source: -Christoph Molnar. "[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/)." 2019 - - - - ----- -## Breakout: Debugging with Explanations - -
- -In groups, discuss which explainability approaches may help and why. Tagging group members, write to `#lecture`. - - -*Algorithm bad at recognizing some signs in some conditions:* -![Stop Sign with Bounding Box](stopsign.jpg) - -*Graduate appl. system seems to rank applicants from HBCUs low:* -![Cheyney University founded in 1837 is the oldest HBCU](cheyneylibrary.jpeg) - - -
- - -Left Image: CC BY-SA 4.0, Adrian Rosebrock - - - - - - - - ----- -## Setting Cancer Imaging -- What explanations do radiologists want? - -![](cancerpred.png) - -* *Past attempts often not successful at bringing tools into production. Radiologists do not trust them. Why?* -* [Wizard of oz study](https://en.wikipedia.org/wiki/Wizard_of_Oz_experiment) to elicit requirements - ----- -## Explanations foster Trust - -Users are less likely to question the model when explanations provided -* Even if explanations are unreliable -* Even if explanations are nonsensical/incomprehensible - -**Danger of overtrust and intentional manipulation** - - -Stumpf, Simone, Adrian Bussone, and Dympna O’sullivan. "Explanations considered harmful? user interactions with machine learning systems." In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 2016. - ----- -![3 Conditions of the experiment with different explanation designs](explanationexperimentgame.png) - -(a) Rationale, (b) Stating the prediction, (c) Numerical internal values - -Observation: Both experts and non-experts overtrust numerical explanations, even when inscrutable. - - -Ehsan, Upol, Samir Passi, Q. Vera Liao, Larry Chan, I. Lee, Michael Muller, and Mark O. Riedl. "The who in explainable AI: how AI background shapes perceptions of AI explanations." arXiv preprint arXiv:2107.13509 (2021). - - - - ----- -## "Stop explaining ..." - -
- -Hypotheses: -* It is a myth that there is necessarily a trade-off between accuracy and interpretability (when having meaningful features) -* Explainable ML methods provide explanations that are not faithful to what the original model computes -* Explanations often do not make sense, or do not provide enough detail to understand what the black box is doing -* Black box models are often not compatible with situations where information outside the database needs to be combined with a risk assessment -* Black box models with explanations can lead to an overly complicated decision pathway that is ripe for human error - -
- - - -Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1.5 (2019): 206-215. ([Preprint](https://arxiv.org/abs/1811.10154)) - ---- - -
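-## Sketch: Training a Global Surrogate
-
-*A minimal sketch of the global-surrogate recipe from earlier in this lecture (`blackbox_model` and the DataFrame `X` are hypothetical stand-ins):*
-
-```python
-from sklearn.tree import DecisionTreeClassifier, export_text
-
-y_surrogate = blackbox_model.predict(X)   # 2. label X with the model's predictions
-g = DecisionTreeClassifier(max_depth=3)   # 3. train an interpretable model g
-g.fit(X, y_surrogate)
-
-print(export_text(g, feature_names=list(X.columns)))  # 4. interpret g
-print("fidelity:", g.score(X, y_surrogate))  # how closely g mimics the black box
-```
-
---
-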
- -## Machine Learning in Production - - -# Transparency and Accountability - - - ----- -## More Explainability, Policy, and Politics - -![Overview of course content](../_assets/overview.svg) - - - - ----- -# Learning Goals - -* Explain key concepts of transparency and trust -* Discuss whether and when transparency can be abused to game the system -* Design a system to include human oversight -* Understand common concepts and discussions of accountability/culpability -* Critique regulation and self-regulation approaches in ethical machine learning - - - ----- -## Case Study: Facebook's Feed Curation - -![Facebook with and without filtering](facebook.png) - - - - - -Eslami, Motahhare, et al. [I always assumed that I wasn't really that close to [her]: Reasoning about Invisible Algorithms in News Feeds](http://eslamim2.web.engr.illinois.edu/publications/Eslami_Algorithms_CHI15.pdf). In Proc. CHI, 2015. - - - - - - ----- -# Gaming/Attacking the Model with Explanations? - -*Does providing an explanation allow customers to 'hack' the system?* - -* Loan applications? -* Apple FaceID? -* Recidivism? -* Auto grading? -* Cancer diagnosis? -* Spam detection? - - ----- -## Human Oversight and Appeals - -* Unavoidable that ML models will make mistakes -* Users knowing about the model may not be comforting -* Inability to appeal a decision can be deeply frustrating - -
- - ----- -## Who is responsible? - -![teen-suicide-rate](teen-suicide-rate.png) - - ----- -## Easy to Blame "The Algorithm" / "The Data" / "Software" - -> "Just a bug, things happen, nothing we could have done" - -- But system was designed by humans -- But humans did not anticipate possible mistakes, did not design to mitigate mistakes -- But humans made decisions about what quality was good enough -- But humans designed/ignored the development process -- But humans gave/sold poor quality software to other humans -- But humans used the software without understanding it -- ... - ----- -![Stack overflow survey on responsible](stackoverflow.png) - - -Results from the [2018 StackOverflow Survey](https://insights.stackoverflow.com/survey/2018/#technology-and-society) - ----- -![Self regulation of tech companies on facial recognition](npr_facialrecognition.png) - - ----- -![Responsible AI website from Microsoft](responsibleai.png) - - ----- -[![Forbes Article: This Is The Year Of AI Regulations](airegulation.png)](https://www.forbes.com/sites/cognitiveworld/2020/03/01/this-is-the-year-of-ai-regulations/#1ea2a84d7a81) - - - ---- - -
-
-## Machine Learning in Production
-
-
-# Versioning, Provenance, and Reproducibility
-
-
-
-----
-## More Foundational Technology for Responsible Engineering
-
-![Overview of course content](../_assets/overview.svg)
-
-
-
-
-----
-# Learning Goals
-
-* Judge the importance of data provenance, reproducibility and explainability for a given system
-* Create documentation for data dependencies and provenance in a given system
-* Propose versioning strategies for data and models
-* Design and test systems for reproducibility
-
-----
-![Example of dataflows between 4 sources and 3 models in credit card application scenario](creditcard-provenance.svg)
-
-
-
-----
-
-## Breakout Discussion: Movie Predictions
-
-
-> Assume you are receiving complaints that a child gets many recommendations about R-rated movies
-
-In a group, discuss how you could address this in your own system and post to `#lecture`, tagging team members:
-
-* How could you identify the problematic recommendation(s)?
-* How could you identify the model that caused the prediction?
-* How could you identify the training code and data that learned the model?
-* How could you identify what training data or infrastructure code "caused" the recommendations?
-
-
-
-
-K.G Orphanides. [Children's YouTube is still churning out blood, suicide and cannibalism](https://www.wired.co.uk/article/youtube-for-kids-videos-problems-algorithm-recommend). Wired UK, 2018;
-Kristie Bertucci. [16 NSFW Movies Streaming on Netflix](https://www.gadgetreview.com/16-nsfw-movies-streaming-on-netflix). Gadget Reviews, 2020
-
-
-----
-
-## Data Provenance
-
-
-* Track origin of all data
-    - Collected where?
-    - Modified by whom, when, why?
-    - Extracted from what other data or model or algorithm?
-* ML models often based on data derived from many sources through many steps, including other models
-
-
-![Example of dataflows between 4 sources and 3 models in credit card application scenario](creditcard-provenance.svg)
-
-
-
-
-
-----
-## Versioning Strategies for Datasets
-
-1. Store copies of entire datasets (like Git), identify by checksum
-2. Store deltas between datasets (like Mercurial)
-3. Offsets in append-only database (like Kafka), identify by offset
-4. History of individual database records (e.g. S3 bucket versions)
-    - some databases specifically track provenance (who has changed what entry when and how)
-    - specialized data science tools, e.g., [Hangar](https://github.com/tensorwerk/hangar-py) for tensor data
-5. Version pipeline to recreate derived datasets ("views", different formats)
-    - e.g. version data before or after cleaning?
-
-
-----
-## Aside: Git Internals
-
-![Git internal model](https://git-scm.com/book/en/v2/images/data-model-4.png)
-
-
-
-Scott Chacon and Ben Straub. [Pro Git](https://git-scm.com/book/en/v2/Git-Internals-Git-References). 2014
-
-
-----
-## Example: DVC
-
-```sh
-dvc add images
-dvc run -d images -o model.p cnn.py
-dvc remote add myrepo s3://mybucket
-dvc push
-```
-
-* Tracks models and datasets, built on Git
-* Splits learning into steps, incrementalization
-* Orchestrates learning in cloud resources
-
-
-https://dvc.org/
-
-
-
-----
-## ModelDB Example
-
-```python
-from verta import Client
-client = Client("http://localhost:3000")
-
-proj = client.set_project("My first ModelDB project")
-expt = client.set_experiment("Default Experiment")
-
-# log the first run
-run = client.set_experiment_run("First Run")
-run.log_hyperparameters({"regularization" : 0.5})
-run.log_dataset_version("training_and_testing_data", dataset_version)
-model1 = ...  # model training code goes here
-run.log_metric('accuracy', accuracy(model1, validationData))
-run.log_model(model1)
-
-# log the second run
-run = client.set_experiment_run("Second Run")
-run.log_hyperparameters({"regularization" : 0.8})
-run.log_dataset_version("training_and_testing_data", dataset_version)
-model2 = ...  # model training code goes here
-run.log_metric('accuracy', accuracy(model2, validationData))
-run.log_model(model2)
-```
-
-
-----
-
-## Logging and Audit Traces
-
-
-**Key goal:** If a customer complains about an interaction, can we reproduce the prediction with the right model? Can we debug the model's pipeline and data? Can we reproduce the model?
-
-* Version everything
-* Record every model evaluation with model version
-* Append only, backed up
-
-
-
-```
-<date>,<model>,<model version>,<feature inputs>,<output>
-<date>,<model>,<model version>,<feature inputs>,<output>
-<date>,<model>,<model version>,<feature inputs>,<output>
-```
-
-
---
-# Milestone 3: Monitoring and Continuous Deployment
-
-(Containers, Monitoring, A/B Testing, Provenance, Updates, Availability)
-
-
---
-
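-## Sketch: An Audit Log for Predictions
-
-*A minimal sketch of the append-only prediction log from the logging slide above (the `model.predict` interface and the log destination are assumptions):*
-
-```python
-import json, logging, uuid
-from datetime import datetime, timezone
-
-audit = logging.getLogger("predictions")
-
-def predict_logged(model, model_version, features):
-    prediction = model.predict([features])[0]
-    audit.info(json.dumps({  # one append-only record per prediction
-        'id': str(uuid.uuid4()),
-        'time': datetime.now(timezone.utc).isoformat(),
-        'model_version': model_version,
-        'features': features,
-        'prediction': str(prediction),
-    }))
-    return prediction
-```
-
---
-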
- -## Machine Learning in Production - -# Security and Privacy - - - - - - - ----- -## More responsible engineering... - -![Overview of course content](../_assets/overview.svg) - - - ----- -## Learning Goals - -* Explain key concerns in security (in general and with regard to ML models) -* Identify security requirements with threat modeling -* Analyze a system with regard to attacker goals, attack surface, attacker capabilities -* Describe common attacks against ML models, including poisoning and evasion attacks -* Understand design opportunities to address security threats at the system level -* Apply key design principles for secure system design - - ----- -## Evasion Attacks (Adversarial Examples) - -![](evasion-attack.png) - - -Attack at inference time -* Add noise to an existing sample & cause misclassification -* Possible with and without access to model internals - - -_Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art -Face Recognition_, Sharif et al. (2016). - ----- -## Task Decision Boundary vs Model Boundary - -[![Decision boundary vs model boundary](decisionboundary.png)](decisionboundary.png) - - - -From Goodfellow et al (2018). [Making machine learning robust against adversarial inputs](https://par.nsf.gov/servlets/purl/10111674). *Communications of the ACM*, *61*(7), 56-66. - -Note: -Exploiting inaccurate model boundary and shortcuts - -* Decision boundary: Ground truth; often unknown and not specifiable -* Model boundary: What is learned; an _approximation_ of -decision boundary - - - ----- -## Untargeted Poisoning Attack on Availability - - - -Inject mislabeled training data to damage model quality - * 3% poisoning => 11% decrease in accuracy (Steinhardt, 2017) - -Attacker must have some access to the public or private training set - - - -*Example: Anti-virus (AV) scanner: AV company (allegedly) poisoned competitor's model by submitting fake viruses* - -![](virus.png) - - - - - ----- -## Targeted Poisoning Attacks on Integrity - -Insert training data with seemingly correct labels - -![](spam.jpg) - - -More targeted than availability attack, cause specific misclassification - - -_Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural -Networks_, Shafahi et al. (2018) - - ----- -## Model Stealing Attacks - -![Bing stealing search results from Google](bing.png) - - - -Singel. [Google Catches Bing Copying; Microsoft Says 'So What?'](https://www.wired.com/2011/02/bing-copies-google/). Wired 2011. - ----- -## Model Inversion against Confidentiality - - - -Given a model output (e.g., name of a person), infer the -corresponding, potentially sensitive input (facial image of the -person) -* e.g., gradient descent on input space - - - - -![](model-inversion-image.png) - - - - - -_Model Inversion Attacks that Exploit Confidence Information and Basic -Countermeasures_, M. Fredrikson et al. in CCS (2015). 
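-
-*A minimal sketch of the "gradient descent on input space" idea for a simple differentiable model, here a logistic regression with known weights `w` and bias `b` (purely illustrative):*
-
-```python
-import numpy as np
-
-def sigmoid(z):
-    return 1 / (1 + np.exp(-z))
-
-def invert(w, b, steps=1000, lr=0.1):
-    x = np.zeros_like(w)           # start from a blank input
-    for _ in range(steps):
-        p = sigmoid(w @ x + b)     # model's confidence in the target class
-        x += lr * (1 - p) * w      # gradient of log p(target|x) is (1-p)*w
-    return x                       # an input the model scores as the target
-```
-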

----
## Breakout: Dashcam System

Recall: Dashcam system from I2/I3

As a group, tagging members, post in `#lecture`:
 * Security requirements
 * Possible (ML) attacks on the system
 * Possible mitigations against these attacks

![](dashcam-architecture.jpg)

----
## State of ML Security

![](arms-race.jpg)

----
## STRIDE Threat Modeling

![](stride.png)

A systematic approach to identifying threats (i.e., attacker actions)
* Construct an architectural diagram with components & connections
* Designate the trust boundary
* For each untrusted component/connection, identify threats
* For each potential threat, devise a mitigation strategy

[More info: STRIDE approach](https://docs.microsoft.com/en-us/archive/msdn-magazine/2006/november/uncover-security-design-flaws-using-the-stride-approach)

----

![Target headline](target-headline.png)

> Andrew Pole, who heads a 60-person team at Target that studies customer behavior, boasted at a conference in 2010 about a proprietary program that could identify women - based on their purchases and demographic profile - who were pregnant.

Lipka. "[What Target knows about you](https://www.reuters.com/article/us-target-breach-datamining/what-target-knows-about-you-idUSBREA0M1JM20140123)". Reuters, 2014

----
## Data Lakes

![data lakes](data-lake.png)

*Who has access?*

----
## Privacy Consent and Control

![Techcrunch privacy](techcrunch-privacy.png)

---
# Milestone 4: Fairness, Feedback Loops, Security

---

## Machine Learning in Production

# Safety

----
## Mitigating more mistakes...

![Overview of course content](../_assets/overview.svg)

----
## Learning Goals

* Understand safety concerns in traditional and AI-enabled systems
* Apply hazard analysis to identify risks and requirements and understand their limitations
* Discuss ways to design systems to be safe against potential failures
* Suggest safety assurance strategies for a specific project
* Describe the typical processes for safety evaluations and their limitations

----
## Demonstrating Safety

Two main strategies:

1. **Evidence of safe behavior in the field**
   * Extensive field trials
   * Usually expensive
2. **Evidence of responsible (safety) engineering process**
   * Process with hazard analysis, testing mitigations, etc.
   * Not sufficient to assure safety

Most standards require both

----
## Documenting Safety with Assurance (Safety) Cases

![](safetycase.svg)

----
## Robustness in a Safety Setting

* Does the model reliably detect stop signs?
* Also in poor lighting? In fog? With a tilted camera? Sensor noise?
* With stickers taped to the sign? (adversarial attacks)

![Stop Sign](stop-sign.png)

Image: David Silver. [Adversarial Traffic Signs](https://medium.com/self-driving-cars/adversarial-traffic-signs-fd16b7171906). Blog post, 2017

----
## No Model is Fully Robust

* Every useful model has at least one decision boundary
* Predictions near that boundary are not (and should not be) robust

![Decision boundary](decisionboundary2.svg)

----
## Breakout: Robustness

Scenario: Medical use of transcription service, dictate diagnoses and prescriptions

As a group, tagging members, post to `#lecture`:

> 1. What safety concerns can you anticipate?
> 2. What notion of robustness are you concerned about (i.e., what distance function)?
> 3. How could you use robustness to improve the product (i.e., when/how to check robustness)?

----
# AI Safety

![Robot uprising](robot-uprising.jpg)

Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "[Concrete problems in AI safety](https://arxiv.org/abs/1606.06565)." arXiv preprint arXiv:1606.06565 (2016).

----
## Reward Hacking -- Many Examples

----
## Practical Alignment Problems

Does the model goal align with the system goal? Does the system goal align with the user's goals?
* Profits (max. accuracy) vs fairness
* Engagement (ad sales) vs enjoyment, mental health
* Accuracy vs operating costs

Test model *and* system quality *in production*

(see requirements engineering and architecture lectures)

---
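
## Aside: An Empirical Robustness Check (sketch)

Relating back to the robustness slides in this deck, one can probe local robustness by sampling perturbations around an input. The sketch below is an empirical spot check under illustrative assumptions (a `predict` function that returns a label, an L-infinity distance); it is not a formal verification technique.

```python
import numpy as np

def seems_locally_robust(predict, x, radius, n_samples=100, seed=0):
    # Sample random perturbations within an L-infinity ball of the given
    # radius and report whether the predicted label ever changes.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    baseline = predict(x)
    for _ in range(n_samples):
        delta = rng.uniform(-radius, radius, size=x.shape)
        if predict(x + delta) != baseline:
            return False  # found a nearby input with a different label
    return True  # no counterexample found (not a proof of robustness)
```

Inputs near a decision boundary will (correctly) fail such a check; the safety question is which inputs we *require* to pass it, and with what radius.

---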

## Machine Learning in Production

# Fostering Interdisciplinary Teams

----
## One last crosscutting topic

![Overview of course content](../_assets/overview.svg)

----
## Learning Goals

* Understand different roles in projects for AI-enabled systems
* Plan development activities in an inclusive fashion for participants in different roles
* Diagnose and address common teamwork issues
* Describe agile techniques to address common process and communication issues

----
# Case Study: Depression Prognosis on Social Media

![TikTok logo](tiktok.jpg)

----
## Continuum of Skills

* Software Engineer
* Data Engineer
* Data Scientist
* Applied Scientist
* Research Scientist

Talk: Ryan Orban. [Bridging the Gap Between Data Science & Engineer: Building High-Performance Teams](https://www.slideshare.net/ryanorban/bridging-the-gap-between-data-science-engineer-building-highperformance-teams/3-Software_Engineer_Data_Engineer_Data). 2016

----
## Process Costs

![](connectedteams.svg)

*n(n − 1) / 2* communication links within a team

----
## Congruence

![](congruence.svg)

Structural congruence,
Geographical congruence,
Task congruence,
IRC communication congruence

----
## Breakout: Team Structure for Depression Prognosis

In groups, tagging team members, discuss and post in `#lecture`:
* How to decompose the work into teams?
* What roles to recruit for the teams?

![TikTok logo](tiktok.jpg)

----
![Team collaboration within a large tech company](team-collaboration1.png)

----
![Team collaboration within a large tech company](team-collaboration2.png)

----
## Conflicting Goals?

![DevOps](devops.png)

----
## Matrix Organization

![](matrix-only.svg)

----
## Project Organization

![](project-only.svg)

----
# Learning from DevOps

![DevOps](devops.png)

---
# Today

(1)

**Looking back at the semester**

(413 slides in 40 min)

(2)

**Discussion of future of ML in Production**

(3)

**Feedback for future semesters**

---

# The Future of Machine Learning in Production?

(closing remarks)

----
## Are Software Engineers Disappearing?

see also Andrej Karpathy. [Software 2.0](https://medium.com/@karpathy/software-2-0-a64152b37c35). Blog, 2017

Note: Andrej Karpathy is the director of AI at Tesla and coined the term Software 2.0

----
## Are Data Scientists Disappearing?

[![Forbes Article: AutoML 2.0: Is The Data Scientist Obsolete?](automl.png)](https://www.forbes.com/sites/cognitiveworld/2020/04/07/automl-20-is-the-data-scientist-obsolete/#28f4a5b053c9)

Ryohei Fujimaki. [AutoML 2.0: Is The Data Scientist Obsolete?](https://www.forbes.com/sites/cognitiveworld/2020/04/07/automl-20-is-the-data-scientist-obsolete/#28f4a5b053c9) Forbes, 2020

----
## Are Data Scientists Disappearing?

> However, AutoML does not spell the end of data scientists, as it doesn’t “AutoSelect” a business problem to solve, it doesn’t AutoSelect indicative data, it doesn’t AutoAlign stakeholders, it doesn’t provide AutoEthics in the face of potential bias, it doesn’t provide AutoIntegration with the rest of your product, and it doesn’t provide AutoMarketing after the fact. -- [Frederik Bussler](https://towardsdatascience.com/will-automl-be-the-end-of-data-scientists-9af3e63990e0)

Frederik Bussler. [Will AutoML Be the End of Data Scientists?](https://towardsdatascience.com/will-automl-be-the-end-of-data-scientists-9af3e63990e0), Blog 2020

----
## SE4AI Research: More SE Power to Data Scientists?

## SE4AI Research: More DS Power to Software Engineers?

----
![Tweet: "Virtually *everyone* is / will soon be building ML applications. Only few can afford dedicated software engineers to team up with, or SE education for themselves. It would be more inclusive to build SE into the ML processes more fundamentally, so that everyone could build better"](virtuallyeveryone.png)

----
![Unicorn](unicorn.jpg)

----
## Analogy

![Renovation](renovation.jpg)

----
## Analogy

![Hammer](hammer2.jpg)

![Nail gun](nailgun.jpg)

*(better tools don't replace the knowledge to use them)*

----
## My View

> This is an education problem, more than a research problem.

> Interdisciplinary teams, mutual awareness and understanding

> Software engineers and data scientists will each play an essential role

----
## DevOps as a Role Model

![DevOps](devops.png)

Joint responsibilities, joint processes, joint tools, joint vocabulary

---
# One Last Time: Transcription

![](temi.png)

---
# Feedback

----
## Some things we tried

* Recitations all focused on tooling
* More teamwork guidance upfront
* Teamwork meetings with TAs
* Allowing generative AI
* In-class interactions and breakouts with 100+ students
* Clear specifications for homework, pass/fail grading, allow resubmission
* Credit for social activities in teams
* Slack for coordination and questions

----
## Last Breakout: Feedback

Discuss feedback in small groups. Post feedback you feel comfortable sharing publicly to `#lecture`, tagging group members:

> 1. Which topics were valuable and should be kept for future versions of the course?
> 2. To improve the course in future offerings, which topics should we cover additionally, cover more, or cover less?
> 3. Are there any tools (specific or general categories of tools) we should have covered or should have covered more?
> 4. How can we improve the way we teach the class (e.g., lectures, readings, recitations, assignments, scheduling, web site, guest lectures, group work, grading, bonus points)?

For an anonymous feedback option, see the link on Slack.

----
## Interested in TAing in Spring 2024?

Email Christian

---
- -# Thank you! - - - - - - diff --git a/lectures/25_summary/amazon-hiring.png b/lectures/25_summary/amazon-hiring.png deleted file mode 100644 index 94822f89..00000000 Binary files a/lectures/25_summary/amazon-hiring.png and /dev/null differ diff --git a/lectures/25_summary/amazonhiring.png b/lectures/25_summary/amazonhiring.png deleted file mode 100644 index 7d4c7df3..00000000 Binary files a/lectures/25_summary/amazonhiring.png and /dev/null differ diff --git a/lectures/25_summary/apes.png b/lectures/25_summary/apes.png deleted file mode 100644 index 8e0a6f16..00000000 Binary files a/lectures/25_summary/apes.png and /dev/null differ diff --git a/lectures/25_summary/apollo.png b/lectures/25_summary/apollo.png deleted file mode 100644 index 03609231..00000000 Binary files a/lectures/25_summary/apollo.png and /dev/null differ diff --git a/lectures/25_summary/ar-architecture.svg b/lectures/25_summary/ar-architecture.svg deleted file mode 100644 index 57759c9b..00000000 --- a/lectures/25_summary/ar-architecture.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/arms-race.jpg b/lectures/25_summary/arms-race.jpg deleted file mode 100644 index dc0b97e4..00000000 Binary files a/lectures/25_summary/arms-race.jpg and /dev/null differ diff --git a/lectures/25_summary/atm.gif b/lectures/25_summary/atm.gif deleted file mode 100644 index c966f66c..00000000 Binary files a/lectures/25_summary/atm.gif and /dev/null differ diff --git a/lectures/25_summary/automl.png b/lectures/25_summary/automl.png deleted file mode 100644 index 6dd76b43..00000000 Binary files a/lectures/25_summary/automl.png and /dev/null differ diff --git a/lectures/25_summary/bing.png b/lectures/25_summary/bing.png deleted file mode 100644 index 8d5708d9..00000000 Binary files a/lectures/25_summary/bing.png and /dev/null differ diff --git a/lectures/25_summary/blood-pressure-monitor.jpg b/lectures/25_summary/blood-pressure-monitor.jpg deleted file mode 100644 index 2ec3a2fa..00000000 Binary files a/lectures/25_summary/blood-pressure-monitor.jpg and /dev/null differ diff --git a/lectures/25_summary/book.webp b/lectures/25_summary/book.webp deleted file mode 100644 index 2a79cc4d..00000000 Binary files a/lectures/25_summary/book.webp and /dev/null differ diff --git a/lectures/25_summary/bookingcom.png b/lectures/25_summary/bookingcom.png deleted file mode 100644 index 9a77b15a..00000000 Binary files a/lectures/25_summary/bookingcom.png and /dev/null differ diff --git a/lectures/25_summary/canary.jpg b/lectures/25_summary/canary.jpg deleted file mode 100644 index 79b6948a..00000000 Binary files a/lectures/25_summary/canary.jpg and /dev/null differ diff --git a/lectures/25_summary/cancer-stats.png b/lectures/25_summary/cancer-stats.png deleted file mode 100644 index 6a0f1ad5..00000000 Binary files a/lectures/25_summary/cancer-stats.png and /dev/null differ diff --git a/lectures/25_summary/cancerpred.png b/lectures/25_summary/cancerpred.png deleted file mode 100644 index 5ebc2552..00000000 Binary files a/lectures/25_summary/cancerpred.png and /dev/null differ diff --git a/lectures/25_summary/capabilities1.png b/lectures/25_summary/capabilities1.png deleted file mode 100644 index 0fd08926..00000000 Binary files a/lectures/25_summary/capabilities1.png and /dev/null differ diff --git a/lectures/25_summary/ceo.png b/lectures/25_summary/ceo.png deleted file mode 100644 index edccbe99..00000000 Binary files a/lectures/25_summary/ceo.png and /dev/null differ diff --git a/lectures/25_summary/checklist.jpg 
b/lectures/25_summary/checklist.jpg deleted file mode 100644 index 64d7b725..00000000 Binary files a/lectures/25_summary/checklist.jpg and /dev/null differ diff --git a/lectures/25_summary/chess.jpg b/lectures/25_summary/chess.jpg deleted file mode 100644 index f4e9a9fc..00000000 Binary files a/lectures/25_summary/chess.jpg and /dev/null differ diff --git a/lectures/25_summary/cheyneylibrary.jpeg b/lectures/25_summary/cheyneylibrary.jpeg deleted file mode 100644 index ad0113b0..00000000 Binary files a/lectures/25_summary/cheyneylibrary.jpeg and /dev/null differ diff --git a/lectures/25_summary/college-admission.jpg b/lectures/25_summary/college-admission.jpg deleted file mode 100644 index 44b03d35..00000000 Binary files a/lectures/25_summary/college-admission.jpg and /dev/null differ diff --git a/lectures/25_summary/combinedprocess5.svg b/lectures/25_summary/combinedprocess5.svg deleted file mode 100644 index 25207d46..00000000 --- a/lectures/25_summary/combinedprocess5.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/compas_screenshot.png b/lectures/25_summary/compas_screenshot.png deleted file mode 100644 index 79cb7e83..00000000 Binary files a/lectures/25_summary/compas_screenshot.png and /dev/null differ diff --git a/lectures/25_summary/complexity.svg b/lectures/25_summary/complexity.svg deleted file mode 100644 index a7b69a4f..00000000 --- a/lectures/25_summary/complexity.svg +++ /dev/null @@ -1 +0,0 @@ - diff --git a/lectures/25_summary/confoundingvariables.svg b/lectures/25_summary/confoundingvariables.svg deleted file mode 100644 index 7f81842e..00000000 --- a/lectures/25_summary/confoundingvariables.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/congruence.svg b/lectures/25_summary/congruence.svg deleted file mode 100644 index 08bd7784..00000000 --- a/lectures/25_summary/congruence.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/connectedteams.svg b/lectures/25_summary/connectedteams.svg deleted file mode 100644 index fca4e4f6..00000000 --- a/lectures/25_summary/connectedteams.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/continuous_delivery.gif b/lectures/25_summary/continuous_delivery.gif deleted file mode 100644 index 30a22de7..00000000 Binary files a/lectures/25_summary/continuous_delivery.gif and /dev/null differ diff --git a/lectures/25_summary/coverage.png b/lectures/25_summary/coverage.png deleted file mode 100644 index 35f64927..00000000 Binary files a/lectures/25_summary/coverage.png and /dev/null differ diff --git a/lectures/25_summary/credit-card.jpg b/lectures/25_summary/credit-card.jpg deleted file mode 100644 index d843a542..00000000 Binary files a/lectures/25_summary/credit-card.jpg and /dev/null differ diff --git a/lectures/25_summary/creditcard-provenance.svg b/lectures/25_summary/creditcard-provenance.svg deleted file mode 100644 index f7ee80d3..00000000 --- a/lectures/25_summary/creditcard-provenance.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/crime-map.jpg b/lectures/25_summary/crime-map.jpg deleted file mode 100644 index b5c6b5ab..00000000 Binary files a/lectures/25_summary/crime-map.jpg and /dev/null differ diff --git a/lectures/25_summary/crispdm.png b/lectures/25_summary/crispdm.png deleted file mode 100644 index bf0e4547..00000000 Binary files a/lectures/25_summary/crispdm.png and /dev/null differ diff --git 
a/lectures/25_summary/crowd.jpg b/lectures/25_summary/crowd.jpg deleted file mode 100644 index ec92a578..00000000 Binary files a/lectures/25_summary/crowd.jpg and /dev/null differ diff --git a/lectures/25_summary/dall-e.png b/lectures/25_summary/dall-e.png deleted file mode 100644 index c5cef7c9..00000000 Binary files a/lectures/25_summary/dall-e.png and /dev/null differ diff --git a/lectures/25_summary/dashcam-architecture.jpg b/lectures/25_summary/dashcam-architecture.jpg deleted file mode 100644 index 5274e8ae..00000000 Binary files a/lectures/25_summary/dashcam-architecture.jpg and /dev/null differ diff --git a/lectures/25_summary/data-lake.png b/lectures/25_summary/data-lake.png deleted file mode 100644 index 49ce0215..00000000 Binary files a/lectures/25_summary/data-lake.png and /dev/null differ diff --git a/lectures/25_summary/datacascades.png b/lectures/25_summary/datacascades.png deleted file mode 100644 index d3b31f9e..00000000 Binary files a/lectures/25_summary/datacascades.png and /dev/null differ diff --git a/lectures/25_summary/datasheet2.png b/lectures/25_summary/datasheet2.png deleted file mode 100644 index 3093eefe..00000000 Binary files a/lectures/25_summary/datasheet2.png and /dev/null differ diff --git a/lectures/25_summary/debt.png b/lectures/25_summary/debt.png deleted file mode 100644 index 47dbe67a..00000000 Binary files a/lectures/25_summary/debt.png and /dev/null differ diff --git a/lectures/25_summary/decisionboundary.png b/lectures/25_summary/decisionboundary.png deleted file mode 100644 index ba126585..00000000 Binary files a/lectures/25_summary/decisionboundary.png and /dev/null differ diff --git a/lectures/25_summary/decisionboundary2.svg b/lectures/25_summary/decisionboundary2.svg deleted file mode 100644 index 0c835a53..00000000 --- a/lectures/25_summary/decisionboundary2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/decisiontreeexample.png b/lectures/25_summary/decisiontreeexample.png deleted file mode 100644 index 58c7973e..00000000 Binary files a/lectures/25_summary/decisiontreeexample.png and /dev/null differ diff --git a/lectures/25_summary/design-space.svg b/lectures/25_summary/design-space.svg deleted file mode 100644 index f945ca6e..00000000 --- a/lectures/25_summary/design-space.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/devops.png b/lectures/25_summary/devops.png deleted file mode 100644 index 3abb34d2..00000000 Binary files a/lectures/25_summary/devops.png and /dev/null differ diff --git a/lectures/25_summary/devops_meme.jpg b/lectures/25_summary/devops_meme.jpg deleted file mode 100644 index ac3b1911..00000000 Binary files a/lectures/25_summary/devops_meme.jpg and /dev/null differ diff --git a/lectures/25_summary/devops_tools.jpg b/lectures/25_summary/devops_tools.jpg deleted file mode 100644 index 4140b6d4..00000000 Binary files a/lectures/25_summary/devops_tools.jpg and /dev/null differ diff --git a/lectures/25_summary/dlr.jpg b/lectures/25_summary/dlr.jpg deleted file mode 100644 index d2dcbf34..00000000 Binary files a/lectures/25_summary/dlr.jpg and /dev/null differ diff --git a/lectures/25_summary/docker_logo.png b/lectures/25_summary/docker_logo.png deleted file mode 100644 index c08509fe..00000000 Binary files a/lectures/25_summary/docker_logo.png and /dev/null differ diff --git a/lectures/25_summary/dont-be-evil.png b/lectures/25_summary/dont-be-evil.png deleted file mode 100644 index 8e01d9d8..00000000 Binary files 
a/lectures/25_summary/dont-be-evil.png and /dev/null differ diff --git a/lectures/25_summary/drift-ui-expanded.png b/lectures/25_summary/drift-ui-expanded.png deleted file mode 100644 index 8fc3c03d..00000000 Binary files a/lectures/25_summary/drift-ui-expanded.png and /dev/null differ diff --git a/lectures/25_summary/drift.jpg b/lectures/25_summary/drift.jpg deleted file mode 100644 index ff35da56..00000000 Binary files a/lectures/25_summary/drift.jpg and /dev/null differ diff --git a/lectures/25_summary/driver-code-stub.svg b/lectures/25_summary/driver-code-stub.svg deleted file mode 100644 index 7d3b444c..00000000 --- a/lectures/25_summary/driver-code-stub.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/driver-stubs-interface.svg b/lectures/25_summary/driver-stubs-interface.svg deleted file mode 100644 index 60e25ba3..00000000 --- a/lectures/25_summary/driver-stubs-interface.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/driver_phone.png b/lectures/25_summary/driver_phone.png deleted file mode 100644 index 20a64a25..00000000 Binary files a/lectures/25_summary/driver_phone.png and /dev/null differ diff --git a/lectures/25_summary/driver_phone2.png b/lectures/25_summary/driver_phone2.png deleted file mode 100644 index a024d1dc..00000000 Binary files a/lectures/25_summary/driver_phone2.png and /dev/null differ diff --git a/lectures/25_summary/eej2.jpeg b/lectures/25_summary/eej2.jpeg deleted file mode 100644 index 2354a176..00000000 Binary files a/lectures/25_summary/eej2.jpeg and /dev/null differ diff --git a/lectures/25_summary/email.png b/lectures/25_summary/email.png deleted file mode 100644 index 4df7af16..00000000 Binary files a/lectures/25_summary/email.png and /dev/null differ diff --git a/lectures/25_summary/evasion-attack.png b/lectures/25_summary/evasion-attack.png deleted file mode 100644 index 4b596c38..00000000 Binary files a/lectures/25_summary/evasion-attack.png and /dev/null differ diff --git a/lectures/25_summary/explanationexperimentgame.png b/lectures/25_summary/explanationexperimentgame.png deleted file mode 100644 index 83cf10c8..00000000 Binary files a/lectures/25_summary/explanationexperimentgame.png and /dev/null differ diff --git a/lectures/25_summary/facebook.png b/lectures/25_summary/facebook.png deleted file mode 100644 index fd9bd21c..00000000 Binary files a/lectures/25_summary/facebook.png and /dev/null differ diff --git a/lectures/25_summary/fairness-lifecycle.jpg b/lectures/25_summary/fairness-lifecycle.jpg deleted file mode 100644 index 4ea3d720..00000000 Binary files a/lectures/25_summary/fairness-lifecycle.jpg and /dev/null differ diff --git a/lectures/25_summary/fairness_tree.png b/lectures/25_summary/fairness_tree.png deleted file mode 100644 index 74e3c01d..00000000 Binary files a/lectures/25_summary/fairness_tree.png and /dev/null differ diff --git a/lectures/25_summary/featureimportance.png b/lectures/25_summary/featureimportance.png deleted file mode 100644 index cf9c0620..00000000 Binary files a/lectures/25_summary/featureimportance.png and /dev/null differ diff --git a/lectures/25_summary/feedbackloop.svg b/lectures/25_summary/feedbackloop.svg deleted file mode 100644 index 66334f7c..00000000 --- a/lectures/25_summary/feedbackloop.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/flightforcast.jpg b/lectures/25_summary/flightforcast.jpg deleted file mode 100644 index 74101165..00000000 Binary files 
a/lectures/25_summary/flightforcast.jpg and /dev/null differ diff --git a/lectures/25_summary/fmea-radiation.png b/lectures/25_summary/fmea-radiation.png deleted file mode 100644 index 875b45a4..00000000 Binary files a/lectures/25_summary/fmea-radiation.png and /dev/null differ diff --git a/lectures/25_summary/forest.jpg b/lectures/25_summary/forest.jpg deleted file mode 100644 index 6e11580a..00000000 Binary files a/lectures/25_summary/forest.jpg and /dev/null differ diff --git a/lectures/25_summary/frustrated.jpeg b/lectures/25_summary/frustrated.jpeg deleted file mode 100644 index 4015e00d..00000000 Binary files a/lectures/25_summary/frustrated.jpeg and /dev/null differ diff --git a/lectures/25_summary/fta-sample.png b/lectures/25_summary/fta-sample.png deleted file mode 100644 index 270b11a9..00000000 Binary files a/lectures/25_summary/fta-sample.png and /dev/null differ diff --git a/lectures/25_summary/fta-without-mitigation.svg b/lectures/25_summary/fta-without-mitigation.svg deleted file mode 100644 index 02955c13..00000000 --- a/lectures/25_summary/fta-without-mitigation.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/fta.svg b/lectures/25_summary/fta.svg deleted file mode 100644 index 31b7416c..00000000 --- a/lectures/25_summary/fta.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/gender-detection.png b/lectures/25_summary/gender-detection.png deleted file mode 100644 index d02b1d8c..00000000 Binary files a/lectures/25_summary/gender-detection.png and /dev/null differ diff --git a/lectures/25_summary/gendermag1.png b/lectures/25_summary/gendermag1.png deleted file mode 100644 index 012b2a1b..00000000 Binary files a/lectures/25_summary/gendermag1.png and /dev/null differ diff --git a/lectures/25_summary/goal-relationships.svg b/lectures/25_summary/goal-relationships.svg deleted file mode 100644 index 98f27b59..00000000 --- a/lectures/25_summary/goal-relationships.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/googleglasses.jpg b/lectures/25_summary/googleglasses.jpg deleted file mode 100644 index 73e19a59..00000000 Binary files a/lectures/25_summary/googleglasses.jpg and /dev/null differ diff --git a/lectures/25_summary/googlehome.jpg b/lectures/25_summary/googlehome.jpg deleted file mode 100644 index 9c7660d2..00000000 Binary files a/lectures/25_summary/googlehome.jpg and /dev/null differ diff --git a/lectures/25_summary/gphotos.png b/lectures/25_summary/gphotos.png deleted file mode 100644 index 585309d3..00000000 Binary files a/lectures/25_summary/gphotos.png and /dev/null differ diff --git a/lectures/25_summary/grafana.png b/lectures/25_summary/grafana.png deleted file mode 100644 index 8bc0a0f7..00000000 Binary files a/lectures/25_summary/grafana.png and /dev/null differ diff --git a/lectures/25_summary/groupthink.png b/lectures/25_summary/groupthink.png deleted file mode 100644 index 074ce7e9..00000000 Binary files a/lectures/25_summary/groupthink.png and /dev/null differ diff --git a/lectures/25_summary/gun.png b/lectures/25_summary/gun.png deleted file mode 100644 index d14bdf4c..00000000 Binary files a/lectures/25_summary/gun.png and /dev/null differ diff --git a/lectures/25_summary/hammer2.jpg b/lectures/25_summary/hammer2.jpg deleted file mode 100644 index c3c79df0..00000000 Binary files a/lectures/25_summary/hammer2.jpg and /dev/null differ diff --git a/lectures/25_summary/hazop-perception.jpg 
b/lectures/25_summary/hazop-perception.jpg deleted file mode 100644 index a82b9c2a..00000000 Binary files a/lectures/25_summary/hazop-perception.jpg and /dev/null differ diff --git a/lectures/25_summary/holoclean.jpg b/lectures/25_summary/holoclean.jpg deleted file mode 100644 index 7ea1c8f4..00000000 Binary files a/lectures/25_summary/holoclean.jpg and /dev/null differ diff --git a/lectures/25_summary/horizonalpartition.svg b/lectures/25_summary/horizonalpartition.svg deleted file mode 100644 index 78172554..00000000 --- a/lectures/25_summary/horizonalpartition.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/imagesearch.png b/lectures/25_summary/imagesearch.png deleted file mode 100644 index 0110df70..00000000 Binary files a/lectures/25_summary/imagesearch.png and /dev/null differ diff --git a/lectures/25_summary/imgcaptioning.png b/lectures/25_summary/imgcaptioning.png deleted file mode 100644 index 9de8d250..00000000 Binary files a/lectures/25_summary/imgcaptioning.png and /dev/null differ diff --git a/lectures/25_summary/inductive.png b/lectures/25_summary/inductive.png deleted file mode 100644 index ec2359b8..00000000 Binary files a/lectures/25_summary/inductive.png and /dev/null differ diff --git a/lectures/25_summary/influentialinstance.png b/lectures/25_summary/influentialinstance.png deleted file mode 100644 index 9da24ed7..00000000 Binary files a/lectures/25_summary/influentialinstance.png and /dev/null differ diff --git a/lectures/25_summary/inputpartitioning.png b/lectures/25_summary/inputpartitioning.png deleted file mode 100644 index e10dfcb8..00000000 Binary files a/lectures/25_summary/inputpartitioning.png and /dev/null differ diff --git a/lectures/25_summary/inputpartitioning2.png b/lectures/25_summary/inputpartitioning2.png deleted file mode 100644 index b2a8f1ea..00000000 Binary files a/lectures/25_summary/inputpartitioning2.png and /dev/null differ diff --git a/lectures/25_summary/interview.jpg b/lectures/25_summary/interview.jpg deleted file mode 100644 index 28a627bc..00000000 Binary files a/lectures/25_summary/interview.jpg and /dev/null differ diff --git a/lectures/25_summary/justice.jpeg b/lectures/25_summary/justice.jpeg deleted file mode 100644 index e3193d49..00000000 Binary files a/lectures/25_summary/justice.jpeg and /dev/null differ diff --git a/lectures/25_summary/kohavi-bing-search.jpg b/lectures/25_summary/kohavi-bing-search.jpg deleted file mode 100644 index 4400b526..00000000 Binary files a/lectures/25_summary/kohavi-bing-search.jpg and /dev/null differ diff --git a/lectures/25_summary/lambda.svg b/lectures/25_summary/lambda.svg deleted file mode 100644 index 241033f1..00000000 --- a/lectures/25_summary/lambda.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/lane-keeping.png b/lectures/25_summary/lane-keeping.png deleted file mode 100644 index 9c544a1f..00000000 Binary files a/lectures/25_summary/lane-keeping.png and /dev/null differ diff --git a/lectures/25_summary/lawchat.png b/lectures/25_summary/lawchat.png deleted file mode 100644 index dcd31098..00000000 Binary files a/lectures/25_summary/lawchat.png and /dev/null differ diff --git a/lectures/25_summary/leaderfollowerreplication.svg b/lectures/25_summary/leaderfollowerreplication.svg deleted file mode 100644 index 86704fa8..00000000 --- a/lectures/25_summary/leaderfollowerreplication.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/lh2904_animation.gif 
b/lectures/25_summary/lh2904_animation.gif deleted file mode 100644 index 5ba913fc..00000000 Binary files a/lectures/25_summary/lh2904_animation.gif and /dev/null differ diff --git a/lectures/25_summary/lime2.png b/lectures/25_summary/lime2.png deleted file mode 100644 index 63bc05a5..00000000 Binary files a/lectures/25_summary/lime2.png and /dev/null differ diff --git a/lectures/25_summary/lime_cat.png b/lectures/25_summary/lime_cat.png deleted file mode 100644 index 17a7391b..00000000 Binary files a/lectures/25_summary/lime_cat.png and /dev/null differ diff --git a/lectures/25_summary/maintainability-decomp.svg b/lectures/25_summary/maintainability-decomp.svg deleted file mode 100644 index c65e5c6f..00000000 --- a/lectures/25_summary/maintainability-decomp.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/mapreduce.svg b/lectures/25_summary/mapreduce.svg deleted file mode 100644 index 230a66d2..00000000 --- a/lectures/25_summary/mapreduce.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/mark-zuckerberg.png b/lectures/25_summary/mark-zuckerberg.png deleted file mode 100644 index 05cb03dd..00000000 Binary files a/lectures/25_summary/mark-zuckerberg.png and /dev/null differ diff --git a/lectures/25_summary/matrix-only.svg b/lectures/25_summary/matrix-only.svg deleted file mode 100644 index 5683c9ff..00000000 --- a/lectures/25_summary/matrix-only.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/microservice.svg b/lectures/25_summary/microservice.svg deleted file mode 100644 index 09cdf95d..00000000 --- a/lectures/25_summary/microservice.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/mistakes.jpg b/lectures/25_summary/mistakes.jpg deleted file mode 100644 index 3345d1a4..00000000 Binary files a/lectures/25_summary/mistakes.jpg and /dev/null differ diff --git a/lectures/25_summary/mlperceptron.svg b/lectures/25_summary/mlperceptron.svg deleted file mode 100644 index 69feea0c..00000000 --- a/lectures/25_summary/mlperceptron.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/mltestingandmonitoring.png b/lectures/25_summary/mltestingandmonitoring.png deleted file mode 100644 index 1b00ab01..00000000 Binary files a/lectures/25_summary/mltestingandmonitoring.png and /dev/null differ diff --git a/lectures/25_summary/model-inversion-image.png b/lectures/25_summary/model-inversion-image.png deleted file mode 100644 index 71f95a78..00000000 Binary files a/lectures/25_summary/model-inversion-image.png and /dev/null differ diff --git a/lectures/25_summary/modelcard2.png b/lectures/25_summary/modelcard2.png deleted file mode 100644 index d9ff48c5..00000000 Binary files a/lectures/25_summary/modelcard2.png and /dev/null differ diff --git a/lectures/25_summary/modelcards.png b/lectures/25_summary/modelcards.png deleted file mode 100644 index e54afe1c..00000000 Binary files a/lectures/25_summary/modelcards.png and /dev/null differ diff --git a/lectures/25_summary/mortgage.png b/lectures/25_summary/mortgage.png deleted file mode 100644 index 8a774d3a..00000000 Binary files a/lectures/25_summary/mortgage.png and /dev/null differ diff --git a/lectures/25_summary/nailgun.jpg b/lectures/25_summary/nailgun.jpg deleted file mode 100644 index d5abb093..00000000 Binary files a/lectures/25_summary/nailgun.jpg and /dev/null differ diff --git a/lectures/25_summary/neighborhoods.png 
b/lectures/25_summary/neighborhoods.png deleted file mode 100644 index fe0a0d6e..00000000 Binary files a/lectures/25_summary/neighborhoods.png and /dev/null differ diff --git a/lectures/25_summary/nest.jpg b/lectures/25_summary/nest.jpg deleted file mode 100644 index 825fe7aa..00000000 Binary files a/lectures/25_summary/nest.jpg and /dev/null differ diff --git a/lectures/25_summary/netflix-leaderboard.png b/lectures/25_summary/netflix-leaderboard.png deleted file mode 100644 index fd264669..00000000 Binary files a/lectures/25_summary/netflix-leaderboard.png and /dev/null differ diff --git a/lectures/25_summary/neural-network.png b/lectures/25_summary/neural-network.png deleted file mode 100644 index 00cc3d8b..00000000 Binary files a/lectures/25_summary/neural-network.png and /dev/null differ diff --git a/lectures/25_summary/nfps.png b/lectures/25_summary/nfps.png deleted file mode 100644 index 88dfb163..00000000 Binary files a/lectures/25_summary/nfps.png and /dev/null differ diff --git a/lectures/25_summary/nodejs-unusual-commit.png b/lectures/25_summary/nodejs-unusual-commit.png deleted file mode 100644 index af195477..00000000 Binary files a/lectures/25_summary/nodejs-unusual-commit.png and /dev/null differ diff --git a/lectures/25_summary/notebook-example.png b/lectures/25_summary/notebook-example.png deleted file mode 100644 index 2b614ce0..00000000 Binary files a/lectures/25_summary/notebook-example.png and /dev/null differ diff --git a/lectures/25_summary/notebookinproduction.png b/lectures/25_summary/notebookinproduction.png deleted file mode 100644 index fe12e5aa..00000000 Binary files a/lectures/25_summary/notebookinproduction.png and /dev/null differ diff --git a/lectures/25_summary/npr_facialrecognition.png b/lectures/25_summary/npr_facialrecognition.png deleted file mode 100644 index eb31a7f0..00000000 Binary files a/lectures/25_summary/npr_facialrecognition.png and /dev/null differ diff --git a/lectures/25_summary/online-ad.png b/lectures/25_summary/online-ad.png deleted file mode 100644 index 933c5627..00000000 Binary files a/lectures/25_summary/online-ad.png and /dev/null differ diff --git a/lectures/25_summary/orgchart.png b/lectures/25_summary/orgchart.png deleted file mode 100644 index 6df71aa3..00000000 Binary files a/lectures/25_summary/orgchart.png and /dev/null differ diff --git a/lectures/25_summary/orgculture.jpg b/lectures/25_summary/orgculture.jpg deleted file mode 100644 index 71b95468..00000000 Binary files a/lectures/25_summary/orgculture.jpg and /dev/null differ diff --git a/lectures/25_summary/overton.png b/lectures/25_summary/overton.png deleted file mode 100644 index a3ccce38..00000000 Binary files a/lectures/25_summary/overton.png and /dev/null differ diff --git a/lectures/25_summary/overview.svg b/lectures/25_summary/overview.svg deleted file mode 100644 index 7ce87f93..00000000 --- a/lectures/25_summary/overview.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/parameterserver.png b/lectures/25_summary/parameterserver.png deleted file mode 100644 index 2cc17a72..00000000 Binary files a/lectures/25_summary/parameterserver.png and /dev/null differ diff --git a/lectures/25_summary/pareto-front.svg b/lectures/25_summary/pareto-front.svg deleted file mode 100644 index 74179042..00000000 --- a/lectures/25_summary/pareto-front.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/pdp.png b/lectures/25_summary/pdp.png deleted file mode 100644 index 67ba194c..00000000 Binary 
files a/lectures/25_summary/pdp.png and /dev/null differ diff --git a/lectures/25_summary/pgh-cycling.jpg b/lectures/25_summary/pgh-cycling.jpg deleted file mode 100644 index 7f44fe0a..00000000 Binary files a/lectures/25_summary/pgh-cycling.jpg and /dev/null differ diff --git a/lectures/25_summary/phenomena.jpg b/lectures/25_summary/phenomena.jpg deleted file mode 100644 index d40495e2..00000000 Binary files a/lectures/25_summary/phenomena.jpg and /dev/null differ diff --git a/lectures/25_summary/pipeline-connections.svg b/lectures/25_summary/pipeline-connections.svg deleted file mode 100644 index 9fe37f55..00000000 --- a/lectures/25_summary/pipeline-connections.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/pipeline-in-system.svg b/lectures/25_summary/pipeline-in-system.svg deleted file mode 100644 index 5bbddab8..00000000 --- a/lectures/25_summary/pipeline-in-system.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/pipeline.svg b/lectures/25_summary/pipeline.svg deleted file mode 100644 index 5195af76..00000000 --- a/lectures/25_summary/pipeline.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/pipeline2.svg b/lectures/25_summary/pipeline2.svg deleted file mode 100644 index 4a5afe53..00000000 --- a/lectures/25_summary/pipeline2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/platformdoors.png b/lectures/25_summary/platformdoors.png deleted file mode 100644 index b632da6a..00000000 Binary files a/lectures/25_summary/platformdoors.png and /dev/null differ diff --git a/lectures/25_summary/prcurve.png b/lectures/25_summary/prcurve.png deleted file mode 100644 index 1870a0d5..00000000 Binary files a/lectures/25_summary/prcurve.png and /dev/null differ diff --git a/lectures/25_summary/predictive-policing.png b/lectures/25_summary/predictive-policing.png deleted file mode 100644 index 0e3d4607..00000000 Binary files a/lectures/25_summary/predictive-policing.png and /dev/null differ diff --git a/lectures/25_summary/process4.png b/lectures/25_summary/process4.png deleted file mode 100644 index 4113b23d..00000000 Binary files a/lectures/25_summary/process4.png and /dev/null differ diff --git a/lectures/25_summary/project-only.svg b/lectures/25_summary/project-only.svg deleted file mode 100644 index 8621580e..00000000 --- a/lectures/25_summary/project-only.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/prototype-dogs.png b/lectures/25_summary/prototype-dogs.png deleted file mode 100644 index a8fdb02c..00000000 Binary files a/lectures/25_summary/prototype-dogs.png and /dev/null differ diff --git a/lectures/25_summary/radiology.jpg b/lectures/25_summary/radiology.jpg deleted file mode 100644 index 5bc31795..00000000 Binary files a/lectures/25_summary/radiology.jpg and /dev/null differ diff --git a/lectures/25_summary/random.jpg b/lectures/25_summary/random.jpg deleted file mode 100644 index 5ec8eded..00000000 Binary files a/lectures/25_summary/random.jpg and /dev/null differ diff --git a/lectures/25_summary/rashomon.jpg b/lectures/25_summary/rashomon.jpg deleted file mode 100644 index b8494521..00000000 Binary files a/lectures/25_summary/rashomon.jpg and /dev/null differ diff --git a/lectures/25_summary/rc-car.mp4 b/lectures/25_summary/rc-car.mp4 deleted file mode 100644 index 4db3ca96..00000000 Binary files a/lectures/25_summary/rc-car.mp4 and /dev/null differ diff --git 
a/lectures/25_summary/recallprecision.png b/lectures/25_summary/recallprecision.png deleted file mode 100644 index 48b89018..00000000 Binary files a/lectures/25_summary/recallprecision.png and /dev/null differ diff --git a/lectures/25_summary/renovation.jpg b/lectures/25_summary/renovation.jpg deleted file mode 100644 index cb7894d3..00000000 Binary files a/lectures/25_summary/renovation.jpg and /dev/null differ diff --git a/lectures/25_summary/req-arch-impl.svg b/lectures/25_summary/req-arch-impl.svg deleted file mode 100644 index 34bea3ff..00000000 --- a/lectures/25_summary/req-arch-impl.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/responsibleai.png b/lectures/25_summary/responsibleai.png deleted file mode 100644 index 1b3ec1f5..00000000 Binary files a/lectures/25_summary/responsibleai.png and /dev/null differ diff --git a/lectures/25_summary/review_github.png b/lectures/25_summary/review_github.png deleted file mode 100644 index 523142ce..00000000 Binary files a/lectures/25_summary/review_github.png and /dev/null differ diff --git a/lectures/25_summary/robot-uprising.jpg b/lectures/25_summary/robot-uprising.jpg deleted file mode 100644 index d6f40c9e..00000000 Binary files a/lectures/25_summary/robot-uprising.jpg and /dev/null differ diff --git a/lectures/25_summary/roles_venn.svg b/lectures/25_summary/roles_venn.svg deleted file mode 100644 index b2df1fbf..00000000 --- a/lectures/25_summary/roles_venn.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/rutkowski.png b/lectures/25_summary/rutkowski.png deleted file mode 100644 index 0b2a55d4..00000000 Binary files a/lectures/25_summary/rutkowski.png and /dev/null differ diff --git a/lectures/25_summary/safe-browsing-feedback.png b/lectures/25_summary/safe-browsing-feedback.png deleted file mode 100644 index 53a13f99..00000000 Binary files a/lectures/25_summary/safe-browsing-feedback.png and /dev/null differ diff --git a/lectures/25_summary/safe-browsing-stats.png b/lectures/25_summary/safe-browsing-stats.png deleted file mode 100644 index ee7d8301..00000000 Binary files a/lectures/25_summary/safe-browsing-stats.png and /dev/null differ diff --git a/lectures/25_summary/safetycase.svg b/lectures/25_summary/safetycase.svg deleted file mode 100644 index a9392bae..00000000 --- a/lectures/25_summary/safetycase.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/sarcasm.png b/lectures/25_summary/sarcasm.png deleted file mode 100644 index eb58e9c8..00000000 Binary files a/lectures/25_summary/sarcasm.png and /dev/null differ diff --git a/lectures/25_summary/scoring.png b/lectures/25_summary/scoring.png deleted file mode 100644 index 37528717..00000000 Binary files a/lectures/25_summary/scoring.png and /dev/null differ diff --git a/lectures/25_summary/scrum.svg b/lectures/25_summary/scrum.svg deleted file mode 100644 index a8149ac1..00000000 --- a/lectures/25_summary/scrum.svg +++ /dev/null @@ -1,383 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - Drawing - - - - - - - - - - - - Drawing - - - - - - - - - - - - Drawing - - - - - - - - - - - - Drawing - - - - - - - - - - - - Drawing - - - - - - - - - - - - - - - - - - 30 days - 24 h - Working incrementof the software - - - - - - - Sprint Backlog - Sprint - Product Backlog - - - - - - - - - - diff --git a/lectures/25_summary/sensor-fusion.jpeg b/lectures/25_summary/sensor-fusion.jpeg deleted file mode 100644 index dbfaaf76..00000000 Binary files 
a/lectures/25_summary/sensor-fusion.jpeg and /dev/null differ diff --git a/lectures/25_summary/shared-feature-encoding.svg b/lectures/25_summary/shared-feature-encoding.svg deleted file mode 100644 index ea221aeb..00000000 --- a/lectures/25_summary/shared-feature-encoding.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/shipment-delivery-receipt.jpg b/lectures/25_summary/shipment-delivery-receipt.jpg deleted file mode 100644 index 6ff1940a..00000000 Binary files a/lectures/25_summary/shipment-delivery-receipt.jpg and /dev/null differ diff --git a/lectures/25_summary/shortcutlearning.png b/lectures/25_summary/shortcutlearning.png deleted file mode 100644 index c522e6b4..00000000 Binary files a/lectures/25_summary/shortcutlearning.png and /dev/null differ diff --git a/lectures/25_summary/simianarmy.jpg b/lectures/25_summary/simianarmy.jpg deleted file mode 100644 index 8a1f2bbe..00000000 Binary files a/lectures/25_summary/simianarmy.jpg and /dev/null differ diff --git a/lectures/25_summary/simulationbased-testing.svg b/lectures/25_summary/simulationbased-testing.svg deleted file mode 100644 index 5db26eb8..00000000 --- a/lectures/25_summary/simulationbased-testing.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/simulationdriving.jpg b/lectures/25_summary/simulationdriving.jpg deleted file mode 100644 index 6716946f..00000000 Binary files a/lectures/25_summary/simulationdriving.jpg and /dev/null differ diff --git a/lectures/25_summary/skype1.jpg b/lectures/25_summary/skype1.jpg deleted file mode 100644 index a5dd482d..00000000 Binary files a/lectures/25_summary/skype1.jpg and /dev/null differ diff --git a/lectures/25_summary/skype2.jpg b/lectures/25_summary/skype2.jpg deleted file mode 100644 index 76110b4b..00000000 Binary files a/lectures/25_summary/skype2.jpg and /dev/null differ diff --git a/lectures/25_summary/slices.jpg b/lectures/25_summary/slices.jpg deleted file mode 100644 index e4ac6a08..00000000 Binary files a/lectures/25_summary/slices.jpg and /dev/null differ diff --git a/lectures/25_summary/smartwatch.jpg b/lectures/25_summary/smartwatch.jpg deleted file mode 100644 index 7d8947df..00000000 Binary files a/lectures/25_summary/smartwatch.jpg and /dev/null differ diff --git a/lectures/25_summary/spam.jpg b/lectures/25_summary/spam.jpg deleted file mode 100644 index 64519ae2..00000000 Binary files a/lectures/25_summary/spam.jpg and /dev/null differ diff --git a/lectures/25_summary/spiral_model.svg b/lectures/25_summary/spiral_model.svg deleted file mode 100644 index 14ea98b8..00000000 --- a/lectures/25_summary/spiral_model.svg +++ /dev/null @@ -1,434 +0,0 @@ - - - - - - - - - - - - - - image/svg+xml - - - - - - - - - 1.Determineobjectives - 2. Identify and resolve risks - 3. Development and Test - 4. 
Plan the next iteration - - - - Progress - Cumulative cost - Requirementsplan - Concept ofoperation - Concept ofrequirements - Prototype 1 - Prototype 2 - Operationalprototype - Requirements - Draft - Detaileddesign - Code - Integration - Test - Implementation - Release - Test plan - Verification & Validation - Developmentplan - - Review - - diff --git a/lectures/25_summary/srebook.jpg b/lectures/25_summary/srebook.jpg deleted file mode 100644 index cc6e1f24..00000000 Binary files a/lectures/25_summary/srebook.jpg and /dev/null differ diff --git a/lectures/25_summary/stackoverflow.png b/lectures/25_summary/stackoverflow.png deleted file mode 100644 index 9db6ea59..00000000 Binary files a/lectures/25_summary/stackoverflow.png and /dev/null differ diff --git a/lectures/25_summary/stanford.png b/lectures/25_summary/stanford.png deleted file mode 100644 index 549cd647..00000000 Binary files a/lectures/25_summary/stanford.png and /dev/null differ diff --git a/lectures/25_summary/stop-sign.png b/lectures/25_summary/stop-sign.png deleted file mode 100644 index c79808b3..00000000 Binary files a/lectures/25_summary/stop-sign.png and /dev/null differ diff --git a/lectures/25_summary/stopsign.jpg b/lectures/25_summary/stopsign.jpg deleted file mode 100644 index cb76fe96..00000000 Binary files a/lectures/25_summary/stopsign.jpg and /dev/null differ diff --git a/lectures/25_summary/stream-dataflow.svg b/lectures/25_summary/stream-dataflow.svg deleted file mode 100644 index 19f22a93..00000000 --- a/lectures/25_summary/stream-dataflow.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/stream.svg b/lectures/25_summary/stream.svg deleted file mode 100644 index 601451b2..00000000 --- a/lectures/25_summary/stream.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/streetlight.jpg b/lectures/25_summary/streetlight.jpg deleted file mode 100644 index dc41b4d8..00000000 Binary files a/lectures/25_summary/streetlight.jpg and /dev/null differ diff --git a/lectures/25_summary/stride.png b/lectures/25_summary/stride.png deleted file mode 100644 index f1fdd47a..00000000 Binary files a/lectures/25_summary/stride.png and /dev/null differ diff --git a/lectures/25_summary/svposter.png b/lectures/25_summary/svposter.png deleted file mode 100644 index 27d5531f..00000000 Binary files a/lectures/25_summary/svposter.png and /dev/null differ diff --git a/lectures/25_summary/system.svg b/lectures/25_summary/system.svg deleted file mode 100644 index 9d3cfe66..00000000 --- a/lectures/25_summary/system.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/tank.jpg b/lectures/25_summary/tank.jpg deleted file mode 100644 index b507eb27..00000000 Binary files a/lectures/25_summary/tank.jpg and /dev/null differ diff --git a/lectures/25_summary/target-headline.png b/lectures/25_summary/target-headline.png deleted file mode 100644 index 800ed112..00000000 Binary files a/lectures/25_summary/target-headline.png and /dev/null differ diff --git a/lectures/25_summary/team-collaboration1.png b/lectures/25_summary/team-collaboration1.png deleted file mode 100644 index c9246c44..00000000 Binary files a/lectures/25_summary/team-collaboration1.png and /dev/null differ diff --git a/lectures/25_summary/team-collaboration2.png b/lectures/25_summary/team-collaboration2.png deleted file mode 100644 index a855d480..00000000 Binary files a/lectures/25_summary/team-collaboration2.png and /dev/null differ diff --git 
a/lectures/25_summary/techDebtQuadrant.png b/lectures/25_summary/techDebtQuadrant.png deleted file mode 100644 index d298c812..00000000 Binary files a/lectures/25_summary/techDebtQuadrant.png and /dev/null differ diff --git a/lectures/25_summary/techcrunch-privacy.png b/lectures/25_summary/techcrunch-privacy.png deleted file mode 100644 index 68c4e019..00000000 Binary files a/lectures/25_summary/techcrunch-privacy.png and /dev/null differ diff --git a/lectures/25_summary/teen-suicide-rate.png b/lectures/25_summary/teen-suicide-rate.png deleted file mode 100644 index 0e04315e..00000000 Binary files a/lectures/25_summary/teen-suicide-rate.png and /dev/null differ diff --git a/lectures/25_summary/temi.png b/lectures/25_summary/temi.png deleted file mode 100644 index 29ce2dd5..00000000 Binary files a/lectures/25_summary/temi.png and /dev/null differ diff --git a/lectures/25_summary/temporaldependence.svg b/lectures/25_summary/temporaldependence.svg deleted file mode 100644 index 01072f57..00000000 --- a/lectures/25_summary/temporaldependence.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/testexample.png b/lectures/25_summary/testexample.png deleted file mode 100644 index 031680a1..00000000 Binary files a/lectures/25_summary/testexample.png and /dev/null differ diff --git a/lectures/25_summary/texturevsshape.png b/lectures/25_summary/texturevsshape.png deleted file mode 100644 index d0da33f7..00000000 Binary files a/lectures/25_summary/texturevsshape.png and /dev/null differ diff --git a/lectures/25_summary/thermalfuse.png b/lectures/25_summary/thermalfuse.png deleted file mode 100644 index cfc851ad..00000000 Binary files a/lectures/25_summary/thermalfuse.png and /dev/null differ diff --git a/lectures/25_summary/tiktok.jpg b/lectures/25_summary/tiktok.jpg deleted file mode 100644 index eb9746d5..00000000 Binary files a/lectures/25_summary/tiktok.jpg and /dev/null differ diff --git a/lectures/25_summary/timeline.svg b/lectures/25_summary/timeline.svg deleted file mode 100644 index e3393e0c..00000000 --- a/lectures/25_summary/timeline.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/toaster.jpg b/lectures/25_summary/toaster.jpg deleted file mode 100644 index 7d65d1ea..00000000 Binary files a/lectures/25_summary/toaster.jpg and /dev/null differ diff --git a/lectures/25_summary/traincorner.jpg b/lectures/25_summary/traincorner.jpg deleted file mode 100644 index f0228c2b..00000000 Binary files a/lectures/25_summary/traincorner.jpg and /dev/null differ diff --git a/lectures/25_summary/transcription.png b/lectures/25_summary/transcription.png deleted file mode 100644 index 2180ccbe..00000000 Binary files a/lectures/25_summary/transcription.png and /dev/null differ diff --git a/lectures/25_summary/transcriptionarchitecture.svg b/lectures/25_summary/transcriptionarchitecture.svg deleted file mode 100644 index 90d6cfb4..00000000 --- a/lectures/25_summary/transcriptionarchitecture.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/transcriptionarchitecture2.svg b/lectures/25_summary/transcriptionarchitecture2.svg deleted file mode 100644 index 212a40f7..00000000 --- a/lectures/25_summary/transcriptionarchitecture2.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/tshaped.png b/lectures/25_summary/tshaped.png deleted file mode 100644 index e4b6d35b..00000000 Binary files a/lectures/25_summary/tshaped.png and /dev/null differ 
diff --git a/lectures/25_summary/tug.png b/lectures/25_summary/tug.png deleted file mode 100644 index 56ee08d7..00000000 Binary files a/lectures/25_summary/tug.png and /dev/null differ diff --git a/lectures/25_summary/tweakthresholds.svg b/lectures/25_summary/tweakthresholds.svg deleted file mode 100644 index 776b562f..00000000 --- a/lectures/25_summary/tweakthresholds.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/twitter.png b/lectures/25_summary/twitter.png deleted file mode 100644 index de1bb920..00000000 Binary files a/lectures/25_summary/twitter.png and /dev/null differ diff --git a/lectures/25_summary/uber-dashboard.png b/lectures/25_summary/uber-dashboard.png deleted file mode 100644 index 381ea6c5..00000000 Binary files a/lectures/25_summary/uber-dashboard.png and /dev/null differ diff --git a/lectures/25_summary/unicorn.jpg b/lectures/25_summary/unicorn.jpg deleted file mode 100644 index 753606cb..00000000 Binary files a/lectures/25_summary/unicorn.jpg and /dev/null differ diff --git a/lectures/25_summary/unit-integration-system-testing.svg b/lectures/25_summary/unit-integration-system-testing.svg deleted file mode 100644 index daf0d8dc..00000000 --- a/lectures/25_summary/unit-integration-system-testing.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/validation.png b/lectures/25_summary/validation.png deleted file mode 100644 index c531eab0..00000000 Binary files a/lectures/25_summary/validation.png and /dev/null differ diff --git a/lectures/25_summary/vimeo.png b/lectures/25_summary/vimeo.png deleted file mode 100644 index 88a8b401..00000000 Binary files a/lectures/25_summary/vimeo.png and /dev/null differ diff --git a/lectures/25_summary/virtuallyeveryone.png b/lectures/25_summary/virtuallyeveryone.png deleted file mode 100644 index 0c00effd..00000000 Binary files a/lectures/25_summary/virtuallyeveryone.png and /dev/null differ diff --git a/lectures/25_summary/virus.png b/lectures/25_summary/virus.png deleted file mode 100644 index 5ce7b177..00000000 Binary files a/lectures/25_summary/virus.png and /dev/null differ diff --git a/lectures/25_summary/vmodel.svg b/lectures/25_summary/vmodel.svg deleted file mode 100644 index b5d6207e..00000000 --- a/lectures/25_summary/vmodel.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/warehouse.jpg b/lectures/25_summary/warehouse.jpg deleted file mode 100644 index 762addb8..00000000 Binary files a/lectures/25_summary/warehouse.jpg and /dev/null differ diff --git a/lectures/25_summary/waterfall.svg b/lectures/25_summary/waterfall.svg deleted file mode 100644 index 044f1645..00000000 --- a/lectures/25_summary/waterfall.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/white-noise.jpg b/lectures/25_summary/white-noise.jpg deleted file mode 100644 index 97d15bd7..00000000 Binary files a/lectures/25_summary/white-noise.jpg and /dev/null differ diff --git a/lectures/25_summary/wincrashreport_windows_xp.png b/lectures/25_summary/wincrashreport_windows_xp.png deleted file mode 100644 index e6e28968..00000000 Binary files a/lectures/25_summary/wincrashreport_windows_xp.png and /dev/null differ diff --git a/lectures/25_summary/windowsbeta.jpg b/lectures/25_summary/windowsbeta.jpg deleted file mode 100644 index 32ca1457..00000000 Binary files a/lectures/25_summary/windowsbeta.jpg and /dev/null differ diff --git a/lectures/25_summary/worldvsmachine.svg 
b/lectures/25_summary/worldvsmachine.svg deleted file mode 100644 index d1ecb609..00000000 --- a/lectures/25_summary/worldvsmachine.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/lectures/25_summary/zillow_main.png b/lectures/25_summary/zillow_main.png deleted file mode 100644 index 67a83fea..00000000 Binary files a/lectures/25_summary/zillow_main.png and /dev/null differ diff --git a/recitations/README.md b/recitations/README.md deleted file mode 100644 index 8c3babf8..00000000 --- a/recitations/README.md +++ /dev/null @@ -1,2 +0,0 @@ -# Recitation Material -This folder contains recitation material. diff --git a/recitations/Recitation 10/README.md b/recitations/Recitation 10/README.md deleted file mode 100644 index 8733ac6a..00000000 --- a/recitations/Recitation 10/README.md +++ /dev/null @@ -1,171 +0,0 @@ -# Recitation 10: Container Orchestration with Kubernetes - - -## Overview -In this recitation, we explore Kubernetes, a container orchestration system. We will use Kind to deploy a local Kubernetes cluster. While this recitation aims to introduce you to Kubernetes, it is not meant to be a comprehensive guide. For more information, please refer to the [Kubernetes documentation](https://kubernetes.io/docs/home/). - -*Note: Kind is not a production-ready system and is used in this recitation only to demo a subset of the capabilities of Kubernetes* - - -## Installation -- *Docker* - Docker is installed by default on most Linux distributions. However, if you do not have it, you can install it by following the instructions [here](https://docs.docker.com/engine/install/). - - Note: Make sure you have Docker version `23.0.1`; the `kind load docker-image` command may fail otherwise. -- *Kind* - Kind is a tool for running local Kubernetes clusters using Docker container “nodes”. You can install it with the following on Linux: - ``` - curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.17.0/kind-linux-amd64 - chmod +x ./kind - sudo mv ./kind /usr/local/bin/kind - ``` -- *kubectl* - kubectl is a command-line tool for controlling Kubernetes clusters. You can install it with the following on Linux: - ``` - sudo snap install kubectl --classic - ``` -- *Helm* - Helm is a package manager for Kubernetes. You can install it with the following on Linux: - - Install Helm: - ``` - sudo snap install helm --classic - ``` - - Add the prometheus-community Helm repository: - ``` - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts - ``` - -## Demo -1. Build a Docker image from the Dockerfile in `server/`: - ``` - docker build -t server:flask server - ``` -2. Create a Kind Cluster with the following command: - ``` - kind create cluster --config configs/kind-config.yaml - ``` - This will create a cluster with 3 nodes. You can see the nodes by running: - ``` - kubectl get nodes - ``` -3. Load the Docker image into the cluster: - ``` - kind load docker-image server:flask - ``` -4. Deploy the server to the cluster: - ``` - kubectl apply -f configs/cluster-config.yaml - ``` - - To view all running pods, services, etc., run: - ``` - kubectl get all -o wide - ``` - - To view the logs of a pod, run: - ``` - kubectl logs [-f] <pod_name> - ``` -5.
Let us now query some endpoints on the Flask server via the NodePort Service: - - To get the NodePort Service IP: - ``` - kubectl get svc -o wide - ``` - - To get the Control Node IP: - ``` - kubectl get nodes -o wide - ``` - - To query the server: - ``` - curl http://<node_ip>:<node_port>/ - ``` - - To kill the server: - ``` - curl http://<node_ip>:<node_port>/health/kill - ``` - Once the server is killed, you can see that the pod is restarted by running: - ``` - kubectl get pods -o wide - ``` - - - Some other endpoints: - ``` - # Endpoint to see server status - curl http://<node_ip>:<node_port>/health/status - - # Endpoint to get the current time - curl http://<node_ip>:<node_port>/datetime - ``` -6. To set up Prometheus and Grafana with Helm: - ``` - # Replace [RELEASE_NAME] with a name of your choice - helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack - - ``` -7. To be able to view the Prometheus and Grafana UIs, we need to forward the ports: - - For Prometheus: - ``` - kubectl port-forward --address 0.0.0.0 service/prom-kube-prometheus-stack-prometheus 9090 - ``` - - For Grafana: - ``` - kubectl port-forward --address 0.0.0.0 service/prom-grafana 3000:80 - ``` -8. To view the UIs: - - For Prometheus: - - Open `http://<host_ip>:9090` - - For Grafana: - - Open `http://<host_ip>:3000` - - Username: `admin`, Password: `prom-operator` - -## Cleanup -The entire cluster can be deleted by running: -``` -kind delete cluster -``` - -However, for individual components of our deployment, we can run the following commands: -- To delete the server deployment along with the NodePort service: - ``` - kubectl delete -f configs/cluster-config.yaml - ``` - -To bring down the Prometheus and Grafana Helm release: - ``` - helm uninstall [RELEASE_NAME] - ``` -- To delete the Docker image we built earlier: - ``` - docker rmi server:flask - ``` - - -## Additional Commands -- Docker - ``` - # To run the docker image locally - docker run -p 5555:5555 server:flask - - # To enter one of the nodes in the cluster - docker exec -it <node_name> sh - ``` - -- Kubernetes - ``` - # To port forward our server to the host machine - kubectl port-forward svc/server-service 5555:5555 - - # To get the password for the Grafana UI - kubectl get secret --namespace default prom-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo - - # To get information about a specific resource - kubectl describe <resource_type> <resource_name> - - ``` -- Helm - ``` - # To search for a Helm chart from the prometheus-community repository - helm search repo prometheus-community - - # To list all Helm releases - helm list - ``` - - - -## Resources -- [Kubernetes Documentation](https://kubernetes.io/docs/home/) -- [Kind Documentation](https://kind.sigs.k8s.io/docs/user/quick-start/) -- [Helm Documentation](https://helm.sh/docs/intro/quickstart/) diff --git a/recitations/Recitation 10/configs/cluster-config.yaml b/recitations/Recitation 10/configs/cluster-config.yaml deleted file mode 100644 index c9bd579f..00000000 --- a/recitations/Recitation 10/configs/cluster-config.yaml +++ /dev/null @@ -1,35 +0,0 @@ -apiVersion: apps/v1 -kind: Deployment -metadata: - name: server-deployment -spec: - replicas: 2 - selector: - matchLabels: - app: server - template: - metadata: - labels: - app: server - spec: - containers: - - name: server - image: server:flask - ports: - - containerPort: 5555 ---- -apiVersion: v1 -kind: Service -metadata: - name: server-service -spec: - selector: - app: server - type: NodePort - sessionAffinity: None - ports: - - name: http - port: 5555 - targetPort: 5555 - protocol: TCP -
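For reference, the kill-and-restart behavior that the recitation above demonstrates with `curl` can also be scripted. Below is a minimal sketch (not part of the recitation handout): `smoke_test.py` is a hypothetical helper, and the `NODE_IP`/`NODE_PORT` values are placeholders to be filled in from `kubectl get nodes -o wide` and `kubectl get svc -o wide`.

```python
# smoke_test.py -- hypothetical helper, not part of the recitation handout.
import time

import requests

NODE_IP = "172.18.0.2"  # placeholder: take from `kubectl get nodes -o wide`
NODE_PORT = 30080       # placeholder: take from `kubectl get svc -o wide`
BASE = f"http://{NODE_IP}:{NODE_PORT}"

# The server should report healthy before we kill it.
print(requests.get(f"{BASE}/health/status", timeout=5).text)  # expect "Healthy"

# /health/kill makes the Flask process exit; the Deployment should restart the pod.
try:
    requests.get(f"{BASE}/health/kill", timeout=5)
except requests.exceptions.RequestException:
    pass  # the process dies mid-response, so a connection error is expected

# Poll until the restarted pod serves traffic again.
for _ in range(30):
    try:
        print("recovered:", requests.get(f"{BASE}/health/status", timeout=2).text)
        break
    except requests.exceptions.RequestException:
        time.sleep(2)
```

Watching the `RESTARTS` counter increase in `kubectl get pods -o wide` confirms that Kubernetes, not the script, brought the server back.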
diff --git a/recitations/Recitation 10/configs/kind-config.yaml b/recitations/Recitation 10/configs/kind-config.yaml deleted file mode 100644 index a1be03a5..00000000 --- a/recitations/Recitation 10/configs/kind-config.yaml +++ /dev/null @@ -1,18 +0,0 @@ -# Config file that defines our Kind Kubernetes cluster - -kind: Cluster -apiVersion: kind.x-k8s.io/v1alpha4 -# One control plane node and two "workers". -# -# While these will not add more real compute capacity and -# have limited isolation, this can be useful for testing -# rolling updates etc. -# -# The API-server and other control plane components will be -# on the control-plane node. -# -# You probably don't need this unless you are testing Kubernetes itself. -nodes: -- role: control-plane -- role: worker -- role: worker \ No newline at end of file diff --git a/recitations/Recitation 10/handout.zip b/recitations/Recitation 10/handout.zip deleted file mode 100644 index 84216867..00000000 Binary files a/recitations/Recitation 10/handout.zip and /dev/null differ diff --git a/recitations/Recitation 10/server/Dockerfile b/recitations/Recitation 10/server/Dockerfile deleted file mode 100644 index a90f2eaa..00000000 --- a/recitations/Recitation 10/server/Dockerfile +++ /dev/null @@ -1,8 +0,0 @@ -FROM python:3.9 -COPY . / -WORKDIR / -RUN pip install --upgrade pip -RUN pip install -r requirements.txt -EXPOSE 5555 -ENTRYPOINT [ "python" ] -CMD [ "app.py" ] \ No newline at end of file diff --git a/recitations/Recitation 10/server/__pycache__/app.cpython-39.pyc b/recitations/Recitation 10/server/__pycache__/app.cpython-39.pyc deleted file mode 100644 index ed3de227..00000000 Binary files a/recitations/Recitation 10/server/__pycache__/app.cpython-39.pyc and /dev/null differ diff --git a/recitations/Recitation 10/server/app.py b/recitations/Recitation 10/server/app.py deleted file mode 100644 index ea5b5793..00000000 --- a/recitations/Recitation 10/server/app.py +++ /dev/null @@ -1,39 +0,0 @@ -""" Implements a Flask server that responds with the current time""" -from flask import Flask -from datetime import datetime -import os - - -app = Flask(__name__) - - -@app.route('/') -def home(): - """ Returns the index page""" - return """
-<h1>Welcome to Recitation 10: Container Orchestration with Kubernetes</h1> - <p> - Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. - Originally designed by Google, the project is now maintained by the Cloud Native Computing Foundation. - The name Kubernetes originates from Greek, meaning 'helmsman' or 'pilot'. - </p>
""" - - -@app.route('/datetime') -def get_datetime(): - """ Returns the current time""" - return f"The current time is: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}" - - -@app.route('/health/', methods=['GET']) -def health(req): - """ Returns the health of the server""" - if req == "kill": - print("Killing server...") - os._exit(0) - else: - return "Healthy" - - -if __name__ == '__main__': - app.run(host='0.0.0.0', port=5555) diff --git a/recitations/Recitation 10/server/requirements.txt b/recitations/Recitation 10/server/requirements.txt deleted file mode 100644 index 17997236..00000000 --- a/recitations/Recitation 10/server/requirements.txt +++ /dev/null @@ -1 +0,0 @@ -flask diff --git a/recitations/Recitation 5/Docker Demo Code.zip b/recitations/Recitation 5/Docker Demo Code.zip deleted file mode 100644 index c9e38a8d..00000000 Binary files a/recitations/Recitation 5/Docker Demo Code.zip and /dev/null differ diff --git a/recitations/Recitation 5/Docker Demo Code/README.md b/recitations/Recitation 5/Docker Demo Code/README.md deleted file mode 100644 index b08306f4..00000000 --- a/recitations/Recitation 5/Docker Demo Code/README.md +++ /dev/null @@ -1,30 +0,0 @@ -# Setup -[Install](https://docs.docker.com/get-docker/) and Start docker on your machine, either the docker application on Mac/Windows. -Install requirements -``` -pip install -r requirements.py -``` - -# Running Docker -``` -cd docker_demo -docker-compose up -``` -Now you can see the server up message on two different ports by two different containerised apps running simultaneously: -* http://localhost:7004 -* http://localhost:7005 - -Press Ctrl+C to stop the containers or run `docker-compose down` to bring down the containers. -# Load Balancer -While the docker conatiners are running, do the following: -``` -cd load_balancer -python3 load_balancer.py -``` -Open: -http://localhost:8082 \ -With 70% chance you will see Server A and with 30% Server B. This is an example of load balancing. -# Remove all Docker Images -``` -docker system prune -a -``` diff --git a/recitations/Recitation 5/Docker Demo Code/docker_demo/Dockerfile b/recitations/Recitation 5/Docker Demo Code/docker_demo/Dockerfile deleted file mode 100644 index 46362264..00000000 --- a/recitations/Recitation 5/Docker Demo Code/docker_demo/Dockerfile +++ /dev/null @@ -1,8 +0,0 @@ -FROM python:3.9 -COPY . / -WORKDIR / -RUN pip install --upgrade pip -RUN pip install -r requirements.txt -EXPOSE 8081 -ENTRYPOINT [ "python" ] -CMD [ "app/server.py" ] \ No newline at end of file diff --git a/recitations/Recitation 5/Docker Demo Code/docker_demo/app/server.py b/recitations/Recitation 5/Docker Demo Code/docker_demo/app/server.py deleted file mode 100644 index f325dfc8..00000000 --- a/recitations/Recitation 5/Docker Demo Code/docker_demo/app/server.py +++ /dev/null @@ -1,13 +0,0 @@ -from flask import Flask - -app = Flask('demo-flask-server') - -@app.route('/') -def healthcheck(): - return 'Server UP!' - -def main(): - app.run(host='0.0.0.0', port=8081, debug=False) - -if __name__ == '__main__': - main() \ No newline at end of file diff --git a/recitations/Recitation 5/Docker Demo Code/docker_demo/docker-compose.yml b/recitations/Recitation 5/Docker Demo Code/docker_demo/docker-compose.yml deleted file mode 100644 index 28d052c1..00000000 --- a/recitations/Recitation 5/Docker Demo Code/docker_demo/docker-compose.yml +++ /dev/null @@ -1,18 +0,0 @@ -version: "3.9" # optional since v1.27.0 -services: - web_a: - build: - context: . 
diff --git a/recitations/Recitation 5/Docker Demo Code/docker_demo/docker-compose.yml b/recitations/Recitation 5/Docker Demo Code/docker_demo/docker-compose.yml deleted file mode 100644 index 28d052c1..00000000 --- a/recitations/Recitation 5/Docker Demo Code/docker_demo/docker-compose.yml +++ /dev/null @@ -1,18 +0,0 @@ -version: "3.9" # optional since v1.27.0 -services: - web_a: - build: - context: . - ports: - - "7004:8081" - volumes: - - ./logvolume01:/var/log - web_b: - build: - context: . - ports: - - "7005:8081" - volumes: - - ./logvolume02:/var/log -volumes: - logvolume01: {} - logvolume02: {} \ No newline at end of file diff --git a/recitations/Recitation 5/Docker Demo Code/docker_demo/requirements.txt b/recitations/Recitation 5/Docker Demo Code/docker_demo/requirements.txt deleted file mode 100644 index 8ab6294c..00000000 --- a/recitations/Recitation 5/Docker Demo Code/docker_demo/requirements.txt +++ /dev/null @@ -1 +0,0 @@ -flask \ No newline at end of file diff --git a/recitations/Recitation 5/Docker Demo Code/load_balancer_demo/load_balancer.py b/recitations/Recitation 5/Docker Demo Code/load_balancer_demo/load_balancer.py deleted file mode 100644 index acf2c681..00000000 --- a/recitations/Recitation 5/Docker Demo Code/load_balancer_demo/load_balancer.py +++ /dev/null @@ -1,40 +0,0 @@ -import os -import random - -from flask import Flask -import requests - -app = Flask('load-balancer-server') - -# fraction of traffic routed to Server A when both servers are up -probability = 0.7 - -def checkHealth(ip_addr): - # nc exits with 0 if something is listening at the given host/port - return os.system('nc -vz '+ip_addr) == 0 - -@app.route('/') -def welcome(): - # health-check both backends before routing - A_server_up = checkHealth('0.0.0.0 7004') - B_server_up = checkHealth('0.0.0.0 7005') - - if not B_server_up and A_server_up: - response = f"Server A: {requests.get('http://0.0.0.0:7004/').text}" - elif B_server_up and not A_server_up: - response = f"Server B: {requests.get('http://0.0.0.0:7005/').text}" - elif A_server_up and B_server_up: - # weighted routing: 70% of requests go to A, 30% to B - if random.uniform(0, 1) < probability: - response = f"Server A: {requests.get('http://0.0.0.0:7004/').text}" - else: - response = f"Server B: {requests.get('http://0.0.0.0:7005/').text}" - else: - response = '' - return str(response) - -if __name__ == '__main__': - app.run(host='0.0.0.0', port=8082, debug=False) \ No newline at end of file diff --git a/recitations/Recitation 5/Docker Demo Code/requirements.txt b/recitations/Recitation 5/Docker Demo Code/requirements.txt deleted file mode 100644 index 196055db..00000000 --- a/recitations/Recitation 5/Docker Demo Code/requirements.txt +++ /dev/null @@ -1,2 +0,0 @@ -flask -requests \ No newline at end of file diff --git a/recitations/Recitation 5/Recitation 5 - Docker.pdf b/recitations/Recitation 5/Recitation 5 - Docker.pdf deleted file mode 100644 index 28d98b3f..00000000 Binary files a/recitations/Recitation 5/Recitation 5 - Docker.pdf and /dev/null differ diff --git a/recitations/Recitation 6/Model Testing.pdf b/recitations/Recitation 6/Model Testing.pdf deleted file mode 100644 index 02dae3e6..00000000 Binary files a/recitations/Recitation 6/Model Testing.pdf and /dev/null differ diff --git a/recitations/Recitation 6/adatest-demo.zip b/recitations/Recitation 6/adatest-demo.zip deleted file mode 100644 index 52920174..00000000 Binary files a/recitations/Recitation 6/adatest-demo.zip and /dev/null differ
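The routing rule in `load_balancer.py` above is just a biased coin flip. A self-contained sketch of the expected traffic split, reusing the same threshold logic (no containers needed; `split_sim.py` is a hypothetical name):

```python
# split_sim.py -- illustrative simulation of the 70/30 routing in load_balancer.py.
import random
from collections import Counter

probability = 0.7  # same threshold as the load balancer
counts = Counter(
    "Server A" if random.uniform(0, 1) < probability else "Server B"
    for _ in range(10_000)
)
print(counts)  # roughly 7,000 hits on Server A and 3,000 on Server B
```

Against the running stack, the same tally can be taken by requesting http://localhost:8082 repeatedly and counting which server label each response carries.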
diff --git a/schedule.md b/schedule.md index 2c72a232..2b52d675 100644 --- a/schedule.md +++ b/schedule.md @@ -2,47 +2,47 @@ | Date | Topic | Reading | Assignment due | | - | - | - | - | -| Wed, Jan 18 | [Introduction and Motivation](https://mlip-cmu.github.io/s2023/slides/01_introduction/intro.html) ([book chapter](https://ckaestne.medium.com/introduction-to-machine-learning-in-production-eef7427426f1)) | | | -| Fri, Jan 20 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Calling, securing, and creating APIs: Flask | | | -| Mon, Jan 23 | [From Models to AI-Enabled Systems](https://mlip-cmu.github.io/s2023/slides/02_systems/systems.html) ([book chapter 1](https://ckaestne.medium.com/machine-learning-in-production-from-models-to-systems-e1422ec7cd65), [chapter 2](https://ckaestne.medium.com/when-to-use-machine-learning-83fe9be1b8e1), [chapter 3](https://ckaestne.medium.com/setting-and-measuring-goals-for-machine-learning-projects-c887bc6ab9d0)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch. 4, 5, 7, 8 | | -| Wed, Jan 25 | [Gathering and Untangling Requirements](https://mlip-cmu.github.io/s2023/slides/03_requirements/requirements.html) ([book chapter](https://ckaestne.medium.com/gathering-requirements-for-ml-enabled-systems-4f0a7a23730f)) | [The World and the Machine](http://mcs.open.ac.uk/mj665/icse17kn.pdf) | | -| Fri, Jan 27 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Stream processing: Apache Kafka | | | -| Mon, Jan 30 | [Planning for Mistakes](https://mlip-cmu.github.io/s2023/slides/04_mistakes/mistakes.html) ([book chapter](https://ckaestne.medium.com/planning-for-machine-learning-mistakes-2574f4fcf529)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch. 6, 7, 24 | [I1: ML Product](https://github.com/mlip-cmu/s2023/blob/main/assignments/I1_mlproduct.md) | -| Wed, Feb 01 | [Model Quality](https://mlip-cmu.github.io/s2023/slides/05_modelaccuracy/modelquality1.html) ([book chapter 1](https://ckaestne.medium.com/model-quality-defining-correctness-and-fit-a8361b857df), [chapter 2](https://ckaestne.medium.com/model-quality-measuring-prediction-accuracy-38826216ebcb)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch. 19 | | -| Fri, Feb 03 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Collaborating with GitHub: Pull requests, GitFlow, GitHub actions | | | -| Mon, Feb 06 | [Fostering Interdisciplinary (Student) Teams](https://mlip-cmu.github.io/s2023/slides/06_teamwork/teams.html) | | [I2: Requirements](https://github.com/mlip-cmu/s2023/blob/main/assignments/I2_requirements.md) | -| Wed, Feb 08 | [Model Testing Beyond Accuracy](https://mlip-cmu.github.io/s2023/slides/07_modeltesting/modelquality2.html) ([book chapter](https://ckaestne.medium.com/model-quality-slicing-capabilities-invariants-and-other-testing-strategies-27e456027bd)) | [Behavioral Testing of NLP Models with CheckList](https://homes.cs.washington.edu/~wtshuang/static/papers/2020-acl-checklist.pdf) | | -| Fri, Feb 10 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Collaboration tools: Jira, Miro, Slack, ... | | | -| Mon, Feb 13 | [Toward Architecture and Design](https://mlip-cmu.github.io/s2023/slides/08_architecture/tradeoffs.html) ([book chapter 1](https://ckaestne.medium.com/architectural-components-in-ml-enabled-systems-78cf76b29a92), [chapter 2](https://ckaestne.medium.com/thinking-like-a-software-architect-121ea6919871), [chapter 3](https://ckaestne.medium.com/quality-drivers-in-architectures-for-ml-enabled-systems-836f21c44334)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch.
18 & [Choosing the right ML alg.](https://hackernoon.com/choosing-the-right-machine-learning-algorithm-68126944ce1f) | | -| Wed, Feb 15 | [Model Deployment](https://mlip-cmu.github.io/s2023/slides/09_deploying_a_model/deployment.html) ([book chapter](https://ckaestne.medium.com/deploying-a-model-f0b7ffefd06a)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch. 13 and [Machine Learning Design Patterns](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/1feg4j8/alma991019735160604436), Ch. 16 | | -| Fri, Feb 17 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Containers: Docker | | | -| Mon, Feb 20 | [Testing in Production](https://mlip-cmu.github.io/s2023/slides/10_qainproduction/qainproduction.html) ([book chapter](https://ckaestne.medium.com/quality-assurance-in-production-for-ml-enabled-systems-4d1b3442316f)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch. 14, 15 | [M1: Modeling and First Deployment](https://github.com/mlip-cmu/s2023/blob/main/assignments/project.md) | -| Wed, Feb 22 | [Data Quality](https://mlip-cmu.github.io/s2023/slides/11_dataquality/dataquality.html) ([book chapter](https://ckaestne.medium.com/data-quality-for-building-production-ml-systems-2e0cc7e6113f)) | [Data Cascades in High-Stakes AI](https://dl.acm.org/doi/abs/10.1145/3411764.3445518) | | -| Fri, Feb 24 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) [Model Testing: Zeno and AdaTest](https://github.com/mlip-cmu/s2023/tree/main/recitations/Recitation%206) | | | -| Mon, Feb 27 | [Automating and Testing ML Pipelines](https://mlip-cmu.github.io/s2023/slides/12_pipelinequality/pipelinequality.html) ([book chapter 1](https://ckaestne.medium.com/quality-assurance-basics-6ce1eca9921), [chapter 2](https://ckaestne.medium.com/quality-assurance-for-machine-learning-pipelines-d495b8e5ad6a), [chapter 3](https://ckaestne.medium.com/integration-and-system-testing-bc4db6650d1)) | [The ML Test Score](https://research.google.com/pubs/archive/46555.pdf) | [I3: Architecture](https://github.com/mlip-cmu/s2023/blob/main/assignments/I3_architecture.md) | -| Wed, Mar 01 | ![Midterm](https://img.shields.io/badge/-midterm-blue.svg)[Midterm](https://github.com/mlip-cmu/s2023/tree/main/exams) | | | -| Fri, Mar 03 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Continuous Integration: Jenkins | | | -| Mon, Mar 06 | ![Break](https://img.shields.io/badge/-break-red.svg) Spring break, no classes | | | -| Wed, Mar 08 | ![Break](https://img.shields.io/badge/-break-red.svg) Spring break, no classes | | | -| Fri, Mar 10 | ![Break](https://img.shields.io/badge/-break-red.svg) Spring break, no classes | | | -| Mon, Mar 13 | [Scaling Data Storage and Data Processing](https://mlip-cmu.github.io/s2023/slides/13_dataatscale/dataatscale.html) ([book chapter](https://ckaestne.medium.com/scaling-ml-enabled-systems-b5c6b1527bc)) | [Big Data](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019577936304436), Ch. 
1 | | -| Wed, Mar 15 | [Planning for Operations](https://mlip-cmu.github.io/s2023/slides/14_operations/operations.html) ([chapter](https://ckaestne.medium.com/planning-for-operations-of-ml-enabled-systems-a3d18e07ef7c)) | [Operationalizing Machine Learning](https://arxiv.org/abs/2209.09125) | | -| Fri, Mar 17 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Pipeline automation: MLFlow | | | -| Mon, Mar 20 | [Process & Technical Debt](https://mlip-cmu.github.io/s2023/slides/15_process/process.html) ([book chapter 1](https://ckaestne.medium.com/responsible-ai-engineering-c97e44e6c57a), [chapter 2](https://ckaestne.medium.com/fairness-in-machine-learning-and-ml-enabled-products-8ee05ed8ffc4)) | [Hidden Technical Debt in Machine Learning Systems](http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf) | | -| Wed, Mar 22 | [Intro to Ethics + Fairness](https://mlip-cmu.github.io/s2023/slides/16_intro_ethics_fairness/intro-ethics-fairness.html) ([book chapter 1](https://ckaestne.medium.com/responsible-ai-engineering-c97e44e6c57a), [chapter 2](https://ckaestne.medium.com/fairness-in-machine-learning-and-ml-enabled-products-8ee05ed8ffc4)) | [Algorithmic Accountability: A Primer](https://datasociety.net/wp-content/uploads/2018/04/Data_Society_Algorithmic_Accountability_Primer_FINAL-4.pdf) | | -| Fri, Mar 24 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Monitoring: Prometheus, Grafana | | | -| Mon, Mar 27 | [Measuring Fairness](https://mlip-cmu.github.io/s2023/slides/17_fairness_measures/model_fairness.html) ([book chapter](https://ckaestne.medium.com/fairness-in-machine-learning-and-ml-enabled-products-8ee05ed8ffc4)) | [Human Perceptions of Fairness in Algorithmic Decision Making](https://dl.acm.org/doi/pdf/10.1145/3178876.3186138) | [M2: Infrastructure Quality](https://github.com/mlip-cmu/s2023/blob/main/assignments/project.md#milestone-2-model-and-infrastructure-quality) | -| Wed, Mar 29 | [Building Fairer Systems](https://mlip-cmu.github.io/s2023/slides/18_system_fairness/system_fairness.html) ([book chapter](https://ckaestne.medium.com/fairness-in-machine-learning-and-ml-enabled-products-8ee05ed8ffc4)) | [Improving Fairness in Machine Learning Systems](http://users.umiacs.umd.edu/~hal/docs/daume19fairness.pdf) | | -| Fri, Mar 31 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) [Container Orchestration: Kubernetes](https://github.com/mlip-cmu/s2023/tree/main/recitations/Recitation%2010) | | | -| Mon, Apr 03 | [Explainability & Interpretability](https://mlip-cmu.github.io/s2023/slides/19_explainability/explainability.html) ([book chapter](https://ckaestne.medium.com/interpretability-and-explainability-a80131467856)) | [Black boxes not required](https://dataskeptic.com/blog/episodes/2020/black-boxes-are-not-required) or [Stop Explaining Black Box ML Models…](https://arxiv.org/abs/1811.10154) | [I4: Open Source Tools](https://github.com/mlip-cmu/s2023/blob/main/assignments/I4_mlops_tools.md) | -| Wed, Apr 05 | [Transparency & Accountability](https://mlip-cmu.github.io/s2023/slides/20_transparency/transparency.html) ([book chapter](https://ckaestne.medium.com/transparency-and-accountability-in-ml-enabled-systems-f8ed0b6fd183)) | [People + AI, Ch.
Explainability and Trust](https://pair.withgoogle.com/chapter/explainability-trust/) | | -| Fri, Apr 07 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Fairness Toolkits | | | -| Mon, Apr 10 | [Versioning, Provenance, and Reproducibility](https://mlip-cmu.github.io/s2023/slides/21_provenance/provenance.html) ([book chapter](https://ckaestne.medium.com/versioning-provenance-and-reproducibility-in-production-machine-learning-355c48665005)) | | | -| Wed, Apr 12 | [Security and Privacy](https://mlip-cmu.github.io/s2023/slides/22_security/security.html) ([book chapter](https://ckaestne.medium.com/security-and-privacy-in-ml-enabled-systems-1855f561b894)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch. 25 & [The Top 10 Risks of Machine Learning Security](https://ieeexplore.ieee.org/document/9107290) | | -| Fri, Apr 14 | ![Break](https://img.shields.io/badge/-break-red.svg) Spring Carnival, no classes | | | -| Mon, Apr 17 | [Safety](https://mlip-cmu.github.io/s2023/slides/23_safety/safety.html) ([book chapter](https://ckaestne.medium.com/safety-in-ml-enabled-systems-b5a5901933ac)) | [Practical Solutions for Machine Learning Safety in Autonomous Vehicles](http://ceur-ws.org/Vol-2560/paper40.pdf) | | -| Wed, Apr 19 | [Safety continued](https://mlip-cmu.github.io/s2023/slides/23_safety/safety.html) | [The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation](https://maliciousaireport.godaddysites.com/) | [M3: Monitoring and CD](https://github.com/mlip-cmu/s2023/blob/main/assignments/project.md#milestone-3-monitoring-and-continuous-deployment) | -| Fri, Apr 21 | ![Recitation](https://img.shields.io/badge/-rec-yellow.svg) Model Explainability Tools | | | -| Mon, Apr 24 | [Fostering Interdisciplinary Teams](https://mlip-cmu.github.io/s2023/slides/24_teams/teams.html) ([book chapter](https://ckaestne.medium.com/building-machine-learning-products-with-interdisciplinary-teams-a1fdfbf49e81)) | [Collaboration Challenges in Building ML-Enabled Systems](https://arxiv.org/abs/2110.10234) | | -| Wed, Apr 26 | [Summary and Review](https://mlip-cmu.github.io/s2023/slides/25_summary/all.html) | | [M4: Fairness, Security and Feedback Loops](https://github.com/mlip-cmu/s2023/blob/main/assignments/project.md#milestone-4-fairness-security-and-feedback-loops) | -| Thu, May 4 05:30-08:30pm | **Final Project Presentations** | | [Final report](https://github.com/mlip-cmu/s2023/blob/main/assignments/project.md#final-report-and-presentation) | +| Wed, Jan 17 | [Introduction and Motivation](https://mlip-cmu.github.io/s2024/slides/01_introduction/intro.html) ([book chapter](https://ckaestne.medium.com/introduction-to-machine-learning-in-production-eef7427426f1)) | | | +| Fri, Jan 19 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) [Calling, securing, and creating APIs](https://github.com/mlip-cmu/s2024/blob/main/labs/lab01.md) | | | +| Mon, Jan 22 | [From Models to AI-Enabled Systems](https://mlip-cmu.github.io/s2024/slides/02_systems/systems.html) ([book chapter 1](https://ckaestne.medium.com/machine-learning-in-production-from-models-to-systems-e1422ec7cd65), [chapter 2](https://ckaestne.medium.com/when-to-use-machine-learning-83fe9be1b8e1), [chapter 3](https://ckaestne.medium.com/setting-and-measuring-goals-for-machine-learning-projects-c887bc6ab9d0)) | [Building Intelligent Systems](https://cmu.primo.exlibrisgroup.com/permalink/01CMU_INST/6lpsnm/alma991019649190004436), Ch.
4, 5, 7, 8 | | +| Wed, Jan 24 | [Gathering and Untangling Requirements](https://mlip-cmu.github.io/s2024/slides/03_requirements/requirements.html) ([book chapter](https://ckaestne.medium.com/gathering-requirements-for-ml-enabled-systems-4f0a7a23730f)) | [The World and the Machine](https://scholar.google.com/scholar?cluster=1090758480873197042) | | +| Fri, Jan 26 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) [Stream processing: Apache Kafka](https://github.com/mlip-cmu/s2024/blob/main/labs/lab02.md) | | | +| Mon, Jan 29 | [Planning for Mistakes](https://mlip-cmu.github.io/s2024/slides/04_mistakes/mistakes.html) ([book chapter](https://ckaestne.medium.com/planning-for-machine-learning-mistakes-2574f4fcf529)) | | [I1: ML Product](https://github.com/mlip-cmu/s2024/blob/main/assignments/I1_mlproduct.md) | +| Wed, Jan 31 | [Model Quality](https://mlip-cmu.github.io/s2024/slides/05_modelaccuracy/modelquality1.html) ([book chapter 1](https://ckaestne.medium.com/model-quality-defining-correctness-and-fit-a8361b857df), [chapter 2](https://ckaestne.medium.com/model-quality-measuring-prediction-accuracy-38826216ebcb)) | | | +| Fri, Feb 02 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) [Git](https://github.com/mlip-cmu/s2024/blob/main/labs/lab03.md) | | | +| Mon, Feb 05 | [Fostering Interdisciplinary (Student) Teams](https://mlip-cmu.github.io/s2024/slides/06_teamwork/teams.html) | | [I2: Requirements](https://github.com/mlip-cmu/s2024/blob/main/assignments/I2_requirements.md) | +| Wed, Feb 07 | Behavioral Model Testing ([book chapter](https://ckaestne.medium.com/model-quality-slicing-capabilities-invariants-and-other-testing-strategies-27e456027bd)) | | | +| Fri, Feb 09 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Model Testing: Zeno and AdaTest | | | +| Mon, Feb 12 | Toward Architecture and Design ([book chapter 1](https://ckaestne.medium.com/architectural-components-in-ml-enabled-systems-78cf76b29a92), [chapter 2](https://ckaestne.medium.com/thinking-like-a-software-architect-121ea6919871), [chapter 3](https://ckaestne.medium.com/quality-drivers-in-architectures-for-ml-enabled-systems-836f21c44334)) | | | +| Wed, Feb 14 | Deploying a Model ([book chapter](https://ckaestne.medium.com/deploying-a-model-f0b7ffefd06a)) | | | +| Fri, Feb 16 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Containers: Docker | | | +| Mon, Feb 19 | Testing in Production ([book chapter](https://ckaestne.medium.com/quality-assurance-in-production-for-ml-enabled-systems-4d1b3442316f)) | | [M1: Modeling and First Deployment](https://github.com/mlip-cmu/s2024/blob/main/assignments/project.md) | +| Wed, Feb 21 | Data Quality ([book chapter](https://ckaestne.medium.com/data-quality-for-building-production-ml-systems-2e0cc7e6113f)) | | | +| Fri, Feb 23 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) tbd | | | +| Mon, Feb 26 | Automating and Testing ML Pipelines ([book chapter 1](https://ckaestne.medium.com/quality-assurance-basics-6ce1eca9921), [chapter 2](https://ckaestne.medium.com/quality-assurance-for-machine-learning-pipelines-d495b8e5ad6a), [chapter 3](https://ckaestne.medium.com/integration-and-system-testing-bc4db6650d1)) | | | +| Wed, Feb 28 | ![Midterm](https://img.shields.io/badge/-midterm-blue.svg) Midterm | | | +| Fri, Mar 01 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Continuous Integration: Jenkins | | | +| Mon, Mar 04 | ![Break](https://img.shields.io/badge/-break-red.svg) Spring break, no classes | | | +| Wed, Mar 06 | 
![Break](https://img.shields.io/badge/-break-red.svg) Spring break, no classes | | | +| Fri, Mar 08 | ![Break](https://img.shields.io/badge/-break-red.svg) Spring break, no classes | | | +| Mon, Mar 11 | Scaling Data Storage and Data Processing ([book chapter](https://ckaestne.medium.com/scaling-ml-enabled-systems-b5c6b1527bc)) | | | +| Wed, Mar 13 | Planning for Operations ([book chapter](https://ckaestne.medium.com/planning-for-operations-of-ml-enabled-systems-a3d18e07ef7c)) | | | +| Fri, Mar 15 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Pipeline automation: MLFlow | | | +| Mon, Mar 18 | Versioning, Provenance, and Reproducibility ([book chapter](https://ckaestne.medium.com/versioning-provenance-and-reproducibility-in-production-machine-learning-355c48665005)) | | [M2: Infrastructure Quality](https://github.com/mlip-cmu/s2024/blob/main/assignments/project.md) | +| Wed, Mar 20 | Process & Technical Debt ([book chapter 1](https://ckaestne.medium.com/data-science-and-software-engineering-process-models-ea997ea53711), [chapter 2](https://ckaestne.medium.com/technical-debt-in-machine-learning-systems-62035b82b6de)) | | | +| Fri, Mar 22 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Monitoring: Prometheus, Grafana | | | +| Mon, Mar 25 | Intro to Ethics + Fairness ([book chapter 1](https://ckaestne.medium.com/responsible-ai-engineering-c97e44e6c57a), [chapter 2](https://ckaestne.medium.com/fairness-in-machine-learning-and-ml-enabled-products-8ee05ed8ffc4)) | | [I3: Open Source Tools](https://github.com/mlip-cmu/s2024/blob/main/assignments/I3_mlops_tools.md) | +| Wed, Mar 27 | Measuring Fairness ([book chapter](https://ckaestne.medium.com/fairness-in-machine-learning-and-ml-enabled-products-8ee05ed8ffc4)) | | | +| Fri, Mar 29 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Container orchestration: Kubernetes | | | +| Mon, Apr 01 | Building Fairer Systems ([book chapter](https://ckaestne.medium.com/fairness-in-machine-learning-and-ml-enabled-products-8ee05ed8ffc4)) | | | +| Wed, Apr 03 | Explainability & Interpretability ([book chapter](https://ckaestne.medium.com/interpretability-and-explainability-a80131467856)) | | | +| Fri, Apr 05 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Fairness Toolkits | | | +| Mon, Apr 08 | Transparency & Accountability ([book chapter](https://ckaestne.medium.com/transparency-and-accountability-in-ml-enabled-systems-f8ed0b6fd183)) | | [M3: Monitoring and CD](https://github.com/mlip-cmu/s2024/blob/main/assignments/project.md) | +| Wed, Apr 10 | Security and Privacy ([book chapter](https://ckaestne.medium.com/security-and-privacy-in-ml-enabled-systems-1855f561b894)) | | | +| Fri, Apr 12 | ![Break](https://img.shields.io/badge/-break-red.svg) Spring Carnival, no classes | | | +| Mon, Apr 15 | Safety ([book chapter](https://ckaestne.medium.com/safety-in-ml-enabled-systems-b5a5901933ac)) | | | +| Wed, Apr 17 | More safety, security, privacy | | [I4: Explainability](https://github.com/mlip-cmu/s2024/blob/main/assignments/I4_explainability.md) | +| Fri, Apr 19 | ![Lab](https://img.shields.io/badge/-lab-yellow.svg) Model Explainability Tools | | | +| Mon, Apr 22 | Fostering Interdisciplinary Teams ([book chapter](https://ckaestne.medium.com/building-machine-learning-products-with-interdisciplinary-teams-a1fdfbf49e81)) | | | +| Wed, Apr 24 | Summary and Review | | [M4: Fairness, Security and Feedback Loops](https://github.com/mlip-cmu/s2024/blob/main/assignments/project.md) | +| [tbd](https://www.cmu.edu/hub/docs/final-exams.pdf) | Final
Project Presentations | | [Final report](https://github.com/mlip-cmu/s2024/blob/main/assignments/project.md) |