diff --git a/slides/section_3_software_dev_process.md b/slides/section_3_software_dev_process.md index c75677886..af987952b 100644 --- a/slides/section_3_software_dev_process.md +++ b/slides/section_3_software_dev_process.md @@ -11,7 +11,7 @@ jupyter: theme: solarized --- - + # Section 3: Software Development as a Process
@@ -24,7 +24,7 @@ jupyter: - We are going to step up a level and look at the overall process of developing software - + ## Writing Code versus Engineering Software - Software is _not_ just a tool for answering a research question @@ -36,7 +36,7 @@ jupyter: - Software can be reused 🔁 - + - Software is _not_ just a tool for answering a research question - Software is shared frequently between researchers and _reused_ after publication - Therefore, we need to be concerned with more than just the implementation, i.e. "writing code" @@ -47,7 +47,7 @@ jupyter: - Software can be reused: like with stakeholders, it is hard to predict how the software will be used in the future, and we want to make it easy for reuse to happen - + ## Software Development Lifecycle
@@ -548,145 +548,120 @@ Regardless of doing Object Oriented Programming or Functional Programming - + ## ☕ 10 Minute Break ☕ - + ## Refactoring **Refactoring** is modifying code, such that: - * external behaviour unchanged, + * external behaviour is unchanged, * code itself is easier to read / test / extend. - + -## Refactoring +### Refactoring -Refactoring is vital for improving code quality. +Refactoring is vital for improving code quality. It might include things such as: +* Code decoupling and abstractions +* Renaming variables +* Reorganising functions to avoid code duplication +* Simplifying conditional statements to improve readability - + Often working on existing software - refactoring is how we improve it - - -## Refactoring Loop + +### Refactoring Loop -When making a change to a piece of software, do the following: +When refactoring a piece of software, a good process to follow is: -* Automated tests verify current behaviour -* Refactor code (so new change slots in cleanly) -* Re-run tests to ensure nothing is broken -* Make the desired change, which now fits in easily. +* Make sure you have tests that verify the current behaviour +* Refactor the code +* Re-run tests to verify the behaviour of the code is unchanged - + +### Refactoring -## Refactoring +In the rest of this section we will learn how to refactor an existing piece of code. We need to: -Rest of section we will learn how to refactor an existing piece of code +* Add more tests so we can be more confident that future changes will not break the existing code. +* Further split the analyse_data() function into a number of smaller and more decoupled functions -```python -``` + +When refactoring, first we need to make sure there are tests in place that can verify the code behaviour as it is now (or write them if they are missing), then refactor the code and, finally, check that the original tests still pass. 
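To make the loop concrete, here is a small self-contained sketch (the function names are made up for illustration and are not from the project code): we pin down the current behaviour with a test, extract a helper, and re-run the same check.

```python
# Before: the mean calculation is inlined in the reporting function.
def report(values):
    total = 0
    for v in values:
        total += v
    return f"mean={total / len(values)}"

# Step 1: a test that captures the current behaviour.
assert report([1, 2, 3]) == "mean=2.0"

# Step 2: refactor - extract the calculation into its own function.
def mean(values):
    return sum(values) / len(values)

def report_refactored(values):
    return f"mean={mean(values)}"

# Step 3: the same check must still pass against the refactored code.
assert report_refactored([1, 2, 3]) == "mean=2.0"
```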
- In the process of refactoring, we will try to target some of the "good practices" we just talked about, like making good abstractions and reducing cognitive load. - - -## Refactoring Exercise + -Look at `inflammation/compute_data.py` - - - -Bring up the code +### Writing Regression Tests Before Refactoring -Explain the feature: -In it, if the user adds --full-data-analysis then the program will scan the directory of one of the provided files, compare standard deviations across the data by day and plot a graph. +Look at the `analyse_data` function within `inflammation/compute_data.py`: -The main body of it exists in inflammation/compute_data.py in a function called analyse_data. - - - - -## Key Points - -> "Good code is written so that is readable, understandable, covered by automated tests, not over complicated and does well what is intended to do." - - - - -## ☕ 5 Minute Break ☕ - - - -## Refactoring Functions to do Just One Thing - - - -## Introduction - -Functions that just do one thing are: - -* Easier to test -* Easier to read -* Easier to re-use - - +```python +def analyse_data(data_dir): + data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv')) + if len(data_file_paths) == 0: + raise ValueError(f"No inflammation data CSV files found in path {data_dir}") + data = map(models.load_csv, data_file_paths) - -We identified last episode that the code has a function that does many more than one thing -Hard to understand - high cognitive load + means_by_day = map(models.daily_mean, data) + means_by_day_matrix = np.stack(list(means_by_day)) -Hard to test as mixed lots of different things together + daily_standard_deviation = np.std(means_by_day_matrix, axis=0) -Hard to reuse as was very fixed in its behaviour. 
+ graph_data = { + 'standard deviation by day': daily_standard_deviation, + } + views.visualize(graph_data) +``` + +Bring up the code - -## Test Before Refactoring +Explain the feature: +When using inflammation-analysis.py, if the user adds `--full-data-analysis` then the program will scan the directory of one of the provided files, compare standard deviations across the data by day and plot a graph. -* Write tests *before* refactoring to ensure we do not change behaviour. +The main body of it exists in inflammation/compute_data.py in a function called analyse_data. +We want to add extra regression tests to this function. Firstly, modify the function to return the data instead of visualising it so that it is easier to automatically test. Next, we will add assert statements that verify that the current outcome always remains the same, rather than checking if it is *correct* or not. These are called regression tests. - -## Writing Tests for Code that is Hard to Test - -What can we do? + -* Test at a higher level, with coarser accuracy -* Write "hacky" temporary tests +### Exercise: Writing Regression Tests - +Add a new test file called `test_compute_data.py` in the `tests` folder and add a regression test to verify the current output of `analyse_data()`. - -Think of hacky tests like scaffolding - we will use them to ensure we can do the work safely, -but we will remove them in the end. - +Remember that this is a *regression test* to check that we don't break our code during refactoring, and so ensure that this result remains unchanged. It does *not* necessarily check that the result is correct. - -## Exercise: Write a Regression Test for Analyse Data Before Refactoring +```python +from inflammation.compute_data import analyse_data -Add a new test file called `test_compute_data.py` in the tests folder. There is more information on the relevant web page. 
-Complete the regression test to verify the current output of analyse_data is unchanged by the refactorings we are going to do. +def test_analyse_data(): + path = Path.cwd() / "../data" + data_source = CSVDataSource(path) + result = analyse_data(data_source) -Time: 10min + # TODO: add assert statement(s) to test the result value is as expected +``` - + Hint: You might find it helpful to assert the results equal some made up array, observe the test failing and copy and paste the correct result into the test. When talking about the solution: @@ -696,28 +671,36 @@ When talking about the solution: * Brittle - changing the files will break the tests + +### Refactoring Functions to do Only One Thing - -## Pure Functions +Functions which just do one thing are: -A **pure function** takes in some inputs as parameters, and it produces a consistent output. +* Easier to test +* Easier to read +* Easier to re-use -That is, just like a mathematical function. +We can take this further by making our single-purpose functions **pure**. -The output does not depend on externalities. + -There will be no side effects from running the function + +### Pure Functions - +A **pure function** is effectively what we think of as a mathematical function: - -Externalities like what is in a database or the time of day +- they take some input, and produce an output +- they do not rely on any information other than the inputs provided +- they do not cause any side effects. + +As a result, the output of a **pure function** does not depend on externalities or program state, such as global variables. + +Moreover, there will be no side effects from running the function, e.g. it won't edit any files or modify global variables, so behaviour in other parts of our code is unaffected. 
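As a small illustrative contrast (hypothetical code, not from the project), compare an impure function with a pure equivalent:

```python
# Impure: reads and mutates state outside the function,
# so its output depends on how many times it has been called.
running_total = 0

def add_to_total(x):
    global running_total
    running_total += x   # side effect: mutates a global variable
    return running_total

# Pure: the output depends only on the inputs, with no side effects.
def add(total, x):
    return total + x
```

Calling `add(0, 5)` always returns the same value, whereas two identical calls to `add_to_total(5)` return different values because of the hidden state.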
-Side effects like modifying a global variable or writing a file - -## Pure Functions + +### Pure Functions Pure functions have a number of advantages for maintainable code: @@ -726,8 +709,8 @@ Pure functions have a number of advantages for maintainable code: - -## Refactor Code into a Pure Function + +### Exercise: Refactor Code into a Pure Function Refactor the analyse_data function into a pure function with the logic, and an impure function that handles the input and output. The pure function should take in the data, and return the analysis results: @@ -741,8 +724,8 @@ Time: 10min - -## Testing Pure Functions + +### Testing Pure Functions Pure functions are also easier to test @@ -752,12 +735,12 @@ Pure functions are also easier to test - + Can focus on making sure we get all edge cases without real world considerations - -## Write Test Cases for the Pure Function + +### Exercise: Write Test Cases for the Pure Function Now we have refactored our a pure function, we can more easily write comprehensive tests. Add tests that check for when there is only one file with multiple rows, multiple files with one row and any other cases you can think of that should be tested. @@ -765,18 +748,36 @@ Time: 10min - -## Functional Programming + +```python +from inflammation.compute_data import compute_standard_deviation_by_data -Pure functions are a concept from an approach to programming called **functional programming**. 
+@pytest.mark.parametrize('data,expected_output', [ + ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]), + ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]), + ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0]) +], +ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files']) +def test_compute_standard_deviation_by_day(data, expected_output): -Python, and other languages, provide features that make it easier to write "functional" code: * `map` / `filter` / `reduce` can be used to chain pure functions together into pipelines + result = compute_standard_deviation_by_data(data) + npt.assert_array_almost_equal(result, expected_output) +``` + + + +### Functional Programming + +Pure functions are a concept from an approach to programming called **functional programming**, where programs are constructed by chaining together these pure functions. + +Writing code in this way is particularly useful for data processing and analysis, or translating data from one format to another. + +We have so far mostly focussed on Procedural Programming, where a series of sequential steps is performed in a specific order. Different programming paradigms have different strengths and weaknesses, and are useful for solving different types of problems. - + If there is time - do some live coding to show imperative code, then transform into a pipeline: * Sequence of numbers @@ -813,384 +814,139 @@ total = sum(map(squared, filter(is_even, numbers))) ## ☕ 10 Minute Break ☕ - + ## Architecting Code to Separate Responsibilities - -## Using Classes to Decouple Code - + +Recall that we are using a Model-View-Controller architecture in our project, whose components are located in:
- -Two units are **decoupled** if changes in one can be made independently of the other +* **Model**: `inflammation/models.py` +* **View**: `inflammation/views.py` +* **Controller**: `inflammation-analysis.py` +But the code we were previously analysing was added in a separate script `inflammation/compute_data.py` and contains a mix of all three. - -E.g we have the part that loads a file and the part that draws a graph - -Or the part that the user interacts with and the part that does the calculations - + +### Exercise: Identify Model, View and Controller - -### Decoupled Code +Looking at the code inside `compute_data.py`, what parts could be considered Model, View and Controller code? -Abstractions allow decoupling code +Time: 5min - -When we have a suitable abstraction, we do not need to worry about the inner workings of the other part. - -For example break of a car, the details of how to slow down are abstracted, so when we change how -breaking works, we do not need to retrain the driver. + +Computing the standard deviation belongs to Model. +Reading the data from CSV files also belongs to Model. +Displaying the output as a graph is View. +The logic that processes the supplied files is Controller. - -### Exercise: Decouple the File Loading from the Computation -Currently the function is hard coded to load all the files in a directory. + +### Exercise: Split Out Model, View and Controller -Decouple this into a separate function that returns all the files to load +Refactor the analyse_data() function so that the Model, View and Controller code we identified in the previous exercise is moved to appropriate modules. Time: 10min - -### Decoupled... but not completely + +### Merge the Feature In -Although we have separated out the data loading, there is still an assumption and therefore coupling in terms of the format of that data (in this case CSV). 
+Hopefully you have now refactored the feature to conform to our MVC structure, and run our regression tests to check that the outputs remain the same. -Is there a way we could make this more flexible? - +We can commit this to our branch, and then switch to the `develop` branch and merge it in. - -- The format of the data stored is a practical detail which we don't want to limit the use of our `analyse_data()` function -- We could add an argument to our function to specify the format, but then we might have quite a long conditional list of all the different possible formats, and the user would need to request changes to `analyse_data()` any time they want to add a new format -- Is there a way we can let the user more flexibly specify the way in which their data gets read? - - - -One way is with **classes**! - - - -### Python Classes - -A **class** is a Python feature that allows grouping methods (i.e. functions) with some data. - - - - -Do some live coding, ending with: - -```python -import math - -class Circle: - def __init__(self, radius): - self.radius = radius - - def get_area(self): - return math.pi * self.radius * self.radius - -my_circle = Circle(10) -print(my_circle.get_area()) +```bash +$ git switch develop +$ git merge full-data-analysis ``` - -### Exercise: Use a Class to Configure Loading - -Put the `load_inflammation_data` function we wrote in the last exercise as a member method of a new class called `CSVDataSource`. - -Put the configuration of where to load the files in the class' initialiser. - -Once this is done, you can construct this class outside the the statistical analysis and pass the instance in to analyse_data. - -Time: 10min - - - - -### Interfaces - -**Interfaces** describe how different parts of the code interact with each other. - - - - -For example, the interface of the breaking system in a car, is the break pedal. -The user can push the pedal harder or softer to get more or less breaking. 
-The interface of our circle class is the user can call get_area to get the 2D area of the circle -as a number. - + +### Controller Structure - -### Interfaces - -Question: what is the interface for CSVDataSource +The structure of our controller is as follows: ```python -class CSVDataSource: - """ - Loads all the inflammation csvs within a specified folder. - """ - def __init__(self, dir_path): - self.dir_path = dir_path +# import modules - def load_inflammation_data(self): - data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv')) - if len(data_file_paths) == 0: - raise ValueError(f"No inflammation csv's found in path {self.dir_path}") - data = map(models.load_csv, data_file_paths) - return list(data) -``` - - - - -Suggest discuss in groups for 1min. - -Answer: the interface is the signature of the `load_inflammation_data()` method, i.e. what arguments it takes and what it returns. - +def main(args): + # perform some actions - -### Common Interfaces - -If we have two classes that share the same interface, we can use the interface without knowing which class we have - - - - -Easiest shown with an example, lets do more live coding: - -```python -class Rectangle(Shape): - def __init__(self, width, height): - self.width = width - self.height = height - def get_area(self): - return self.width * self.height - -my_circle = Circle(radius=10) -my_rectangle = Rectangle(width=5, height=3) -my_shapes = [my_circle, my_rectangle] -total_area = sum(shape.get_area() for shape in my_shapes) +if __name__ == "__main__": + # perform some actions before main() + main(args) ``` - - - -### Polymorphism - -Using an interface to call different methods is a technique known as **polymorphism**. - -A form of abstraction - we have abstracted what kind of shape we have. +This is a common pattern for entry points to Python packages. Actions performed by the script are contained within the `main` function. 
The main function is run automatically if the `__name__` variable (a special variable set by the Python interpreter) is `"__main__"`. So if our file is run by the Python interpreter on the command line, this condition will be satisfied, and our script gets run as expected. +However, if our Python module is imported from another, `__name__` will instead be set to the module's name (e.g. `"inflammation_analysis"`), and the `main()` function will not automatically be run. - + - -### Exercise: Introduce an alternative implementation of DataSource - -Polymorphism is very useful - suppose we want to read a JSON (JavaScript Object Notation) file. - -Write a class that has the same interface as `CSVDataSource` that -loads from JSON. - -There is a function in `models.py` that loads from JSON. - -Time: 15min - + +It is useful to have this dual behaviour for our entry point scripts so that functions defined within them can be used by other modules without the main function being run on import, while still making it clear how the core functionality is run. Moreover, this pattern makes it possible to test the functions within our script because everything is put inside more easily callable functions. - -Remind learners to check the course webpage for further details and some important hints. - + +### Passing Command-Line Options to Controller - -### Mocks - -Another use of polymorphism is **mocking** in tests. - - - - - -Lets live code a mock shape: +To read command line arguments passed into a script, we use `argparse`. 
To use this, we import it in our controller script, initialise a parser class, and then add arguments which we want to look out for: ```python -from unittest.mock import Mock - -def test_sum_shapes(): +import argparse - mock_shape1 = Mock() - mock_shape1.get_area().return_value = 10 +parser = argparse.ArgumentParser( + description='A basic patient inflammation data management system') - mock_shape2 = Mock() - mock_shape2.get_area().return_value = 13 - my_shapes = [mock_shape1, mock_shape2] - total_area = sum(shape.get_area() for shape in my_shapes) +parser.add_argument( + 'infiles', + nargs='+', + help='Input CSV(s) containing inflammation series for each patient') - assert total_area = 23 +args = parser.parse_args() ``` - -Easier to read this test as do not need to understand how -get_area might work for a real shape. - -Focus on testing behaviour rather than implementation. - - - - -## Exercise: Test Using a Mock Implementation - -Complete the exercise to write a mock data source for `analyse_data`. - -Time: 15min - - - - - -## Object Oriented Programming - -These are techniques from **object oriented programming**. - -There is a lot more that we will not go into: - -* Inheritance -* Information hiding - - - -## A note on Data Classes + +Take people through each of these parts: -Regardless of doing Object Oriented Programming or Functional Programming +Import the library -**Grouping data into logical classes is vital for writing maintainable code.** +Initialise the parser class - +Define an argument called 'infiles' which will hold a list of input CSV file(s) to read inflammation data from. The user can specify 1 or more of these files, so we define the number of args as '+'. It also contains a help string for the user, which will be displayed if they use `--help` on the command line. - -## ☕ 10 Minute Break ☕ +You then parse the arguments, which returns an object we called `args` which contains all of the arguments requested. 
These can be accessed by their name, eg `args.infiles`. - -## Model-View-Controller - -Reminder - this program is using the MVC Architecture: - -* Model - Internal data of the program, and operations that can be performed on it -* View - How the data is presented to the user -* Controller - Responsible for how the user interacts with the system - - - - -### Breakout: Read and do the exercise - -Read the section **Separating Out Responsibilities**. - -Complete the exercise. - -Time: 10min - - - - -Suggest discussing answer to the exercise as a table. -Once time is up, ask one table to share their answer and any questions -Then do the other exercise - - - -### Breakout Exercise: Split out the model code from the view code - -Refactor `analyse_data` such the view code we identified in the last exercise is removed from the function, so the function contains only model code, and the view code is moved elsewhere. - -Time: 10min - - - - - -## Programming Patterns - -* MVC is a programming pattern -* Others exist - like the visitor pattern -* Useful for discussion and ideas - not a complete solution + +### Positional and Optional Arguments +Positional arguments are required arguments which must be provided all together and in the proper order when calling the script. Optional arguments are indicated by a `-` or `--` prefix, and these do not have to be provided to run the script. For example we can see the help string: - - - - -Next slide if it feels like we have got loads of time. - - - - -### Breakout Exercise: Read about a random pattern on the website and share it with the group - -Go to the website linked and pick a random pattern, see if you can understand what it is doing -and why you'd want to use it. 
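Putting the entry-point pattern and `argparse` together, here is a minimal runnable sketch (simplified and hypothetical - the real inflammation-analysis.py does more than this):

```python
import argparse

def parse_args(argv=None):
    # With argv=None argparse reads sys.argv[1:]; passing an explicit
    # list instead makes the parser easy to test.
    parser = argparse.ArgumentParser(
        description='A basic patient inflammation data management system')
    parser.add_argument(
        'infiles',
        nargs='+',
        help='Input CSV(s) containing inflammation series for each patient')
    return parser.parse_args(argv)

def main(args):
    # Core functionality lives here so other modules can import and call it.
    return f"analysing {len(args.infiles)} file(s)"

if __name__ == "__main__":
    # Only runs when the script is executed directly, not on import.
    print(main(parse_args()))
```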
- -Time: 15min - - - - - -## Architecting larger changes - -* Use diagrams of boxes and lines to sketch out how code will be structured -* Useful for larger changes, new code, or even understanding complex projects - - - - - -## Exercise: Design a high-level architecture - -Sketch out a design for something you have come up with or the current project. - - -Time: 10min - - - - - -At end of time, share diagrams, discussion. - - - - - -## Breakout: Read to end of page - -Read til the end, including the exercise on real world examples +```bash +$ python3 inflammation-analysis.py --help +``` -Time: 15min +```bash +usage: inflammation-analysis.py [-h] infiles [infiles ...] - +A basic patient inflammation data management system - - -At end of time, reconvene to discuss real world examples as a group. +positional arguments: + infiles Input CSV(s) containing inflammation series for each patient +optional arguments: + -h, --help show this help message and exit +``` - - + ## Conclusion Good software architecture and design is a **huge** topic. @@ -1205,6 +961,6 @@ Practise makes perfect: - + ## 🕓 End of Section 3 🕓