The Data Science & Machine Learning industry is growing steadily. Less than 10 years ago, a Data Science project was usually developed by a single engineer and consisted of little more than a model and a simple deployment mechanism. Nowadays, developing a Data Science project is a complex procedure that may involve a team of several people with different expertise, and it usually spans multiple domains such as Data Science, Data Engineering, MLOps, and others. Naturally, this requires development to follow at least basic Software Engineering principles, and the simplest of them is writing unit tests. In this paper, we will learn about unit tests for Data Science and the best practices for building them.
Hi there! If you still don't write unit tests in your Data Science projects, this is the right place for you. In one of our previous papers about common mistakes in Data Science, we highlighted this case as a separate antipattern called Unit tests avoider. So our current goal is to shed light on the importance of writing unit tests. As usual, I propose to use the following plan.
General need of unit tests.
Unit test structure.
Unit tests for Data Science.
Test coverage.
Bonuses.
General need of unit tests
I will not provide much new information here, just recap some well-known details. There are many different reasons to write unit tests for your project. The most important of them are:
Finding bugs in code as soon as possible. Running unit tests after implementing new functionality significantly reduces the time needed to identify the majority of bugs. Correctly written unit tests should allow you to catch about 80% of the bugs in your code (excluding bugs at the integration stage and bugs between different modules). As a result, they help developers debug code more effectively, because fixing bugs without unit tests often introduces new, unexpected bugs - which is not very effective.
Identifying unexpected behavior. I am sure you have come across the phrase "It's not a bug, it's a feature" at least once. Basically, it means that the algorithm fails in some situation but keeps working according to an unexpected scenario. We can easily identify this behavior by writing unit tests.
Simplifying refactoring. Refactoring in large projects is not just renaming variables and removing constants. It is a complex process that may require changes to the implementation of modules and sub-modules. Of course, this procedure can easily break existing code and introduce a lot of new bugs - just one more reason to have unit tests in such projects.
Increasing the speed of integration. Code integration is usually a complicated process accompanied by a lot of different errors and failures. Writing unit tests can resolve a big part of these issues by providing more information about the algorithm's behavior in a new environment.
Preventing broken code from being deployed. This topic is not only about deployment but also about pushing code into a repository. A good approach is to set up something like git hooks that run before, for example, a commit. If the tests do not pass, the commit is automatically rejected. Nearly the same procedure applies to deployment: tests fail - deployment is rejected. A pretty simple and effective procedure (a minimal sketch of such a hook is shown right after this list).
Improving the documentation of the code. As we know, project development is a process that involves teams, not single developers. That's why documentation is key to other engineers quickly understanding the code. Unit tests can also be treated as unofficial documentation, because any engineer can look at the tests and understand the purpose of each function from them.
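To illustrate the commit-blocking idea, here is a minimal sketch of a pre-commit hook written in Python. The .git/hooks/pre-commit location is the standard one in git; running test discovery from the project root is an assumption about how the project is laid out.

#!/usr/bin/env python3
# Hypothetical .git/hooks/pre-commit hook: run the unit tests and
# reject the commit if any of them fail.
import subprocess
import sys

# Discover and run all test_*.py files in the current repository.
result = subprocess.run([sys.executable, "-m", "unittest", "discover"])

# A non-zero exit code makes git abort the commit.
sys.exit(result.returncode)

Make the file executable (chmod +x .git/hooks/pre-commit) and git will run it before every commit.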
It would be possible to provide more and more examples demonstrating the importance of writing unit tests, but the general idea is simple: you should do it. At least try it and see the difference.
Unit test structure
There are a lot of different modules for writing unit tests in Python. It is really easy to google all of them, but our goal in this paper is not to describe each one; our goal is to explain unit tests in the context of Data Science projects. So I propose to focus on the unittest Python module. unittest is part of the Python 3 Standard Library, so you don't even need to install it.
Let's go over a simple example of a test written with unittest (code snippet below). Basically, each test should be wrapped into a class - in our case, it is ExampleTest. The class can contain the following predefined methods:
setUp: will be executed before running each test.
tearDown: will be executed after running each test.
setUpClass: will be executed before running all tests only once.
tearDownClass: will be executed after running all tests only once.
As we can see, we implemented all these methods in our sample code. We also added 3 dummy tests that print some messages and check whether two equal values are really equal. By the way, each test case name in the class should start with the prefix test_. Moreover, each script containing unit tests should also have the prefix test_ in its name.
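The original snippet is not reproduced here, so below is a minimal reconstruction of example.py based on the description above; the exact assertions and printed messages are assumptions chosen to match the output shown next.

# example.py - a minimal reconstruction of the sample test class described above.
import unittest


class ExampleTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        print('setUpClass')

    @classmethod
    def tearDownClass(cls):
        print('tearDownClass')

    def setUp(self):
        print('setUp')

    def tearDown(self):
        print('tearDown')

    def test_1(self):
        print('test1')
        self.assertEqual(1, 1)

    def test_2(self):
        print('test2')
        self.assertEqual('a', 'a')

    def test_3(self):
        print('test3')
        self.assertEqual(2 + 2, 4)


if __name__ == '__main__':
    unittest.main()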
Let's run our code with the command python -m unittest example and check the output.
setUpClass
setUp
test1
tearDown
.setUp
test2
tearDown
.setUp
test3
tearDown
.tearDownClass
----------------------------------------------------------------
Ran 3 tests in 0.000s
OK
Everything goes as expected. The execution starts with setUpClass and ends with tearDownClass, and each test case starts with setUp and ends with tearDown. You can also see some dots in the output - each of them marks the end of a test case's execution, so we have three dots, one for each test case from our ExampleTest class.
Unit tests for Data Science
I will not go over all possible test cases for Data Science because there are infinitely many of them. In this section, we will go over several groups of examples, just to shed some light on unit tests in Data Science and provide ideas for beginners in this field.
Model level tests. Here we will go over examples that check the correctness of the model we construct. For testing purposes, we will take a simple convolutional network written in Keras from one of our previous papers. Let's check the following cases here:
Model output shape. Each convolutional network (at least the one we will use) should produce an output of the same shape regardless of the input image. It is the first case that we will check.
Number of model layers. To make the code more flexible, engineers usually pass the number of layers as input to get_model or a similar function, inside which the model is constructed automatically. So let's check that the number of layers is as expected.
Model training is performed. Engineers do not always use the built-in fit function from Keras; sometimes they write complex functions for training a model. That's why it is critical to check whether the procedure is correct. A good way to do it is to compare the weights (at least some of them) before and after training.
So let's do some coding. First, we will write the model_functions.py script containing the model definition, a function for extracting the training set, and a fit function. All these methods will be used for testing purposes. Let's imagine that it is our production code that should be tested (of course, the code is simple, but that only makes things clearer).
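The full model_functions.py lives in the repository; the sketch below is a simplified reconstruction based on the description above. The exact architecture and the helper names get_train_data and fit_model are assumptions - only a 7-layer structure with a 10-class output matters for the tests that follow.

# model_functions.py - a simplified sketch of the "production" code under test.
import tensorflow as tf
from tensorflow import keras


def get_model():
    # Simple convolutional network for 28x28 grayscale images (MNIST-like input).
    model = keras.Sequential([
        keras.layers.Conv2D(16, 3, activation='relu', input_shape=(28, 28, 1)),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(32, 3, activation='relu'),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model


def get_train_data(limit=1000):
    # Load a small slice of MNIST so that the tests stay fast.
    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    x_train = x_train[:limit].astype('float32') / 255.0
    return x_train[..., None], y_train[:limit]


def fit_model(model, x_train, y_train, epochs=1):
    # In a real project this could be a custom training loop;
    # here we simply delegate to the built-in Keras fit.
    model.fit(x_train, y_train, epochs=epochs, batch_size=32, verbose=0)
    return model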
Now let's focus on the implementation of the tests. For this purpose, we will write a NeuralNetworkTest class containing the setUp function and 3 functions for our tests (a sketch of this test script is shown right after the list below). In the setUp function, we simply prepare a new model every time so that it is ready to use in the tests.
test_model_output_shape: we take a model, get the prediction, and check if the shape of the prediction is exactly the same as the expected shape.
test_layers_number: we just compare a constant value with the number of layers in our model. I do a bit of cheating here - ideally we would implement a parameterized get_model function that takes the number of layers as input, but let's assume we did and that the value is 7.
test_fit_function: we take the mnist dataset, push it into a model, train it for 1 epoch, and check whether the biases of the first layer changed their values.
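Here is a possible sketch of test_neural_network.py implementing the three cases; it assumes the helper names from the model_functions.py sketch above.

# test_neural_network.py - a possible sketch of the model-level tests described above.
import unittest

import numpy as np

from model_functions import get_model, get_train_data, fit_model


class NeuralNetworkTest(unittest.TestCase):
    def setUp(self):
        # A fresh model for every test so that the cases do not affect each other.
        self.model = get_model()

    def test_model_output_shape(self):
        dummy_batch = np.random.rand(4, 28, 28, 1).astype('float32')
        prediction = self.model.predict(dummy_batch)
        self.assertEqual(prediction.shape, (4, 10))

    def test_layers_number(self):
        # Assumes the architecture above with exactly 7 layers.
        self.assertEqual(len(self.model.layers), 7)

    def test_fit_function(self):
        x_train, y_train = get_train_data(limit=256)
        biases_before = self.model.layers[0].get_weights()[1].copy()
        fit_model(self.model, x_train, y_train, epochs=1)
        biases_after = self.model.layers[0].get_weights()[1]
        # The biases of the first layer should change after one epoch of training.
        self.assertFalse(np.allclose(biases_before, biases_after))


if __name__ == '__main__':
    unittest.main()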
Let's run our tests and check the output. To keep things short, I will not share the whole output (you can reproduce it yourself), just the most important rows.
Ran 3 tests in 1.108s
OK
All cases are OK, which means that our code is correct and we have finished writing tests for our neural network model.
Data level tests. In this section, we discuss what test cases can be applied to the data. Basically, there are many more possibilities here compared to the previous block, but I propose to check just a couple of simple examples:
Tabular data preprocessing function. We take an abstract tabular dataset and do some manipulations with it. The resulting data frame should contain an expected list of columns.
Image preprocessing function. We normalize an image and then check that all pixels are scaled and their values are between 0 and 1.
Let's do some coding again. The data_functions.py script will contain 3 functions - the first for dummy preprocessing of a pandas dataframe, the second for retrieving an image from the mnist dataset, and the third for scaling this image. Again, do not pay attention to how simple these functions are; try to think about their real purpose and analogies in your own code.
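As before, the sketch below is a simplified reconstruction of data_functions.py. The column names and the exact preprocessing steps are illustrative assumptions.

# data_functions.py - a simplified sketch of the data-level "production" code.
import numpy as np
import pandas as pd
from tensorflow import keras


def preprocess_pandas_df(df):
    # Dummy preprocessing: drop rows with missing values and add a derived column.
    df = df.dropna().copy()
    df['price_per_unit'] = df['price'] / df['quantity']
    return df


def get_image(index=0):
    # Retrieve a single image from the MNIST dataset.
    (x_train, _), _ = keras.datasets.mnist.load_data()
    return x_train[index]


def preprocess_image(image):
    # Scale pixel values into the [0, 1] range.
    return image.astype('float32') / 255.0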
Test cases for this part are in the test_data.py file (sketched after the list below).
As we discussed previously, we have 2 tests here:
test_preprocess_pandas_df: takes columns from the processed dataframe and compares them with the expected list.
test_preprocess_image: checks if min and max values in the image after scaling are in the required range.
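A possible sketch of test_data.py, assuming the function and column names from the data_functions.py sketch above:

# test_data.py - a possible sketch of the data-level tests described above.
import unittest

import pandas as pd

from data_functions import preprocess_pandas_df, get_image, preprocess_image


class DataTest(unittest.TestCase):
    def test_preprocess_pandas_df(self):
        raw_df = pd.DataFrame({'price': [10.0, 20.0, None],
                               'quantity': [2, 4, 1]})
        processed_df = preprocess_pandas_df(raw_df)
        expected_columns = ['price', 'quantity', 'price_per_unit']
        self.assertEqual(list(processed_df.columns), expected_columns)

    def test_preprocess_image(self):
        image = get_image(index=0)
        scaled = preprocess_image(image)
        # All pixel values must fall into the [0, 1] range after scaling.
        self.assertGreaterEqual(scaled.min(), 0.0)
        self.assertLessEqual(scaled.max(), 1.0)


if __name__ == '__main__':
    unittest.main()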
If we run this code, we will see that everything is okay.
Ran 2 tests in 1.378s
OK
Loading procedure level tests. Machine learning and Data Science are fields that require saving and loading models, artifacts, datasets, and other resources with high frequency. So it is really important to ensure these mechanisms work without any exceptions. Let's discuss this topic a little.
Here we have a single test_load.py script with 2 tests inside it - the first deals with dataset saving and loading, and the second with the same procedure but for the model.
In real life, the mechanisms inside these 2 tests will be much more complex. But for demonstration purposes, we use simple saving to and reading from a JSON file for the dataset test and saving and loading a Keras model for the model test. In the first case, we check whether the loaded dataset is the same as it was before; in the second case, we compare the configs of the two models (initial and loaded).
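A possible sketch of test_load.py; the get_model helper is reused from the earlier model_functions.py sketch, and the temporary-directory handling and file names are illustrative choices.

# test_load.py - a possible sketch of the saving/loading tests described above.
import json
import os
import tempfile
import unittest

from tensorflow import keras

from model_functions import get_model


class LoadTest(unittest.TestCase):
    def test_dataset_save_load(self):
        dataset = {'features': [[1, 2], [3, 4]], 'labels': [0, 1]}
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = os.path.join(tmp_dir, 'dataset.json')
            with open(path, 'w') as f:
                json.dump(dataset, f)
            with open(path) as f:
                loaded_dataset = json.load(f)
        # The dataset loaded from disk should match the original one.
        self.assertEqual(dataset, loaded_dataset)

    def test_model_save_load(self):
        model = get_model()
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = os.path.join(tmp_dir, 'model.h5')
            model.save(path)
            loaded_model = keras.models.load_model(path)
        # Compare the configs of the initial and the loaded models.
        self.assertEqual(model.get_config(), loaded_model.get_config())


if __name__ == '__main__':
    unittest.main()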
After running these tests, we can also see that everything is okay:
Ran 2 tests in 0.743s
OK
Test coverage
Okay, our tests are written, and now it is time to understand whether they are really effective. For this, we can use the coverage Python module. After we install it, we just need to run a couple of commands in the terminal to get a test coverage report.
python -m coverage run -m unittest test_neural_network.py
python -m coverage report test_neural_network.py
As a result, we get a table that contains the number of statements, the number of missed statements (statements never executed by the tests), and the coverage percentage.
Name Stmts Miss Cover
--------------------------------------------
test_neural_network.py 23 1 96%
--------------------------------------------
TOTAL 23 1 96%
So, according to the report, we covered 96% of the statements here.
Bonuses
To highlight the importance of writing unit tests once again, I would like to say that even famous deep learning frameworks contain APIs for testing purposes. Yes, we are speaking about TensorFlow and PyTorch.
In TensorFlow, it is the tf.test module and mainly the TestCase class. It brings us one cool thing: built-in assertions implemented to support TensorFlow-specific objects. We really missed this feature during today's coding session. Unfortunately, its functionality is still not very extensive.
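As a small illustration, here is a minimal sketch of a tf.test.TestCase with tensor-aware assertions such as assertAllClose and assertAllEqual; the scaling check itself is just a toy example.

# A minimal sketch using tf.test.TestCase and its tensor-aware assertions.
import tensorflow as tf


class ScalingTest(tf.test.TestCase):
    def test_image_scaling(self):
        image = tf.constant([[0.0, 127.5, 255.0]])
        scaled = image / 255.0
        # These assertions accept tensors directly, unlike plain unittest ones.
        self.assertAllClose(scaled, [[0.0, 0.5, 1.0]])
        self.assertAllEqual(tf.shape(scaled), [1, 3])


if __name__ == '__main__':
    tf.test.main()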
In PyTorch, we have torch.testing, but its functionality is also still quite limited. The advantage is exactly the same as with TensorFlow: we can use framework-specific structures in assertion mechanisms.
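And a minimal sketch with torch.testing, assuming a recent PyTorch version where torch.testing.assert_close is available; the round-trip scaling check is again only a toy example.

# A minimal sketch using torch.testing to compare tensors.
import torch


def test_scaling_round_trip():
    x = torch.rand(8, 3, 28, 28)
    restored = (x / 255.0) * 255.0
    # Raises an AssertionError with a readable diff if the tensors differ.
    torch.testing.assert_close(restored, x)


if __name__ == '__main__':
    test_scaling_round_trip()
    print('OK')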
Despite this, I think it is a really good sign that even deep learning frameworks have native functionality for unit testing.
Finally, the story is ending, and we have completed a brief overview of Unit Testing for Data Science projects. As usual, all source code related to this paper is in the repository. I hope it was interesting and at least a little informative for you.
What interesting cases of unit tests related to Data Science projects did you have? Write down your answers in the comments and, of course, thanks for reading!