Exploratory data analysis is a core tool in solving any Data Science problem. It is the process of searching for insights in the data and thoroughly investigating the dataset. Whatever your activity, professional software development or Data Science competitions like Kaggle, exploratory data analysis is the starting point of any project. The success of your future model depends heavily on how well the analysis is performed. In this article, we talk about the steps needed to produce a perfect exploratory data analysis. We will not focus on any specific Data Science domain but will go through the most critical topics.
Hi there! As we already know, writing a good exploratory data analysis (EDA) is a valuable skill. In one of our previous articles about antipatterns and mistakes in Data Science, we even have an entire category of issues related to initial data analysis. So in case you are still looking for reasons to write EDA and the advantages of doing so, I recommend checking that article first.
In this article, we will go through the best practices for writing EDA. I tried to collect all the hints and approaches that I used on my road to the Kaggle Notebooks Grandmaster title. As a result, we will construct something like a must-have list for building EDA notebooks and talk a little bit about the insights that you can apply. So let's start, as usual, with a short plan.
General principles.
What an EDA notebook should contain.
General principles
You should remember two main reasons for writing EDA notebooks. The first one is pretty clear: to get a deep understanding of the data that you are working with. But what is the second, given that understanding the data looks like it should be enough to finish the task? Yes and no. The second critical point is that your EDA notebook should be clear to any other person (I mean a Data Science engineer, not a random person) who takes your notebook for review. And it is a real art to make your work not only understandable for others but also engaging enough that they want to read the notebook and continue working on the described problem. To achieve this, I propose to review the following pit stops:
Introduction. A person who opens your Jupyter notebook will not have any knowledge of the problem to start with. This includes information about the task, the available files, the goal of the work, and so on. That's why it is better to start with an initial description of the problem and explain exactly what each file/folder of the current dataset means. Also, feel free to add a picture related to your topic at the top of the notebook. Visual information is much easier to perceive, so this way you will draw the attention of readers more efficiently than with raw text alone.
Contents. The main goal of the contents is to simplify navigation in your project. A good EDA notebook can be complex, and splitting it into logically complete blocks will increase its readability. Moreover, attaching a hyperlink to each step in the contents speeds up access to the target parts of the notebook without scrolling through the full page.
Storytelling. All people think differently. Information that looks pretty clear in your head can be confusing for other people. So your EDA should be well structured and clearly written. The best way to do this is to use a story approach. Your notebook should contain not only technical information but also blocks that speak to the reader. They should reduce the load on the reader and increase the chance that your work is understood. There are different ways to achieve this, from small comments between code blocks to text descriptions that contain conclusions and explanations of your decisions.
Visualization tools. The core of any EDA is visualization of the data. You shouldn't call your notebook an EDA if it contains no plots or charts. Also, remember that you should select the right tool for the visualization, because not every tool suits every task. Finally, we receive most of our information through our eyes, so you can understand how important chart-based content is inside your EDA.
Simple and clear code. The source code for EDA should be as simple as possible. Of course, you shouldn't write it as if for first-year students, but you also shouldn't use complex constructions to speed up and optimize your code. The goal of any EDA is to transfer information to people in the simplest way, not to show off your skills in coding and performance optimization.
Own style. This point is not critical but nice to have, especially if you publish public EDAs. You don't need to be a designer, but you should think about something that would be unique to your notebooks. It can be a styling theme for the Jupyter notebook, some interesting section typical only of your works, or something else. In other words, it can be anything that identifies your notebooks among all others.
What an EDA notebook should contain
There is no rigid structure for an EDA notebook, but you can always follow the short guide below, which can serve as an unofficial skeleton. Each person has their own style, strengths, favorite topics, and other things affecting the content of the notebook. But it is always good practice for your EDA to include these steps.
1. How to read the data. When you work on the same task every day, you remember this procedure and can replicate it without any problems. But let's imagine a person facing your dataset for the first time. How should it be loaded, what specific flags should be used, what is the column separator? These and other questions come up. So, to simplify life for everybody, try to start your EDA notebook with an example and a short explanation of how to load your data.
Consider exactly this case: we are working with a regular .csv file, but it uses a ';' delimiter, a special encoding, and has 7 garbage rows at the beginning. Without this information, it would be really difficult to start with this dataset.
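A minimal sketch of such a read with pandas could look like this. The file contents, the column names, and the latin-1 encoding are made up for illustration; only the ';' separator and the 7 leading garbage rows come from the description above.

```python
import io

import pandas as pd

# Hypothetical file contents: 7 garbage rows, then a ';'-separated table.
raw_text = "\n".join(
    [f"export note {i}" for i in range(7)]
    + ["city;temperature", "Zürich;18.0", "Málaga;27.5"]
)
raw_bytes = raw_text.encode("latin-1")

df = pd.read_csv(
    io.BytesIO(raw_bytes),  # in a real notebook this would be the file path
    sep=";",                # columns are separated by ';', not ','
    encoding="latin-1",     # assumed encoding, replace with the real one
    skiprows=7,             # skip the garbage rows before the header
)
print(df.head())
```

Writing these three arguments down once at the top of the notebook saves every later reader from rediscovering them by trial and error.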
Also, turning to computer vision, not all images can be read with OpenCV or similar libraries. A medical image, for example, can be read using the pydicom Python module, which was developed for processing medical imagery.
So, as you can see, this is quite an important step that can't be skipped. By the way, it is really hard to skip it, because you have to start your notebook by loading the data anyway. But I have sometimes seen people hide these parts of the code. Please never do that.
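As a rough sketch, reading such a file with pydicom can be wrapped in a small function. The function names here are hypothetical, and the min-max scaling helper is my own addition for displaying images with a wide intensity range, not something pydicom requires.

```python
import numpy as np


def read_dicom_pixels(path):
    """Return the pixel data of a DICOM file as a NumPy array."""
    # pydicom is a third-party package; it is imported inside the function
    # so that the rest of the notebook still runs when it is not installed.
    import pydicom

    dataset = pydicom.dcmread(path)
    return dataset.pixel_array


def to_uint8(pixels):
    """Min-max scale raw intensities to 0..255 so they can be displayed."""
    pixels = pixels.astype(np.float64)
    span = pixels.max() - pixels.min()
    if span == 0:
        return np.zeros(pixels.shape, dtype=np.uint8)
    return ((pixels - pixels.min()) / span * 255).astype(np.uint8)


# Hypothetical raw intensities standing in for a loaded scan.
raw = np.array([[0, 1000], [2000, 4000]])
print(to_uint8(raw))
```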
2. General data understanding. Before you go deeper into the dataset, you should clearly understand what you are working with. Here I recommend following these sub-steps for tabular data:
Understand the physical meaning of each column. Of course, machine learning models will work with variables of any quality, but it is much better when you understand how much information each feature can bring. At the very least, it will improve your ability to explain the outcomes of your model. In more complex cases (when you are a domain expert), you can manipulate these features yourself (remove them or build composite ones).
Understand the type of each variable. To apply the right preprocessing to each variable, you should understand its type. There is no magic here: at a high level, a variable can be categorical, binary, or continuous.
Remove redundant data. Each dataset contains some information that can't be used for modeling at all. It can be various ids, URLs, fragments of the metadata (remember that not all metadata is useless!), and other columns. They can't provide any information required for modeling, so it is better to remove them at the beginning.
Prevent data leakage. This is a critical point, because some features can carry information that will not be available in production. We can't use these features under any circumstances and should remove them as soon as possible.
Define a strategy for working with missing data. When we speak about machine learning on tabular data, missing data is a common problem. So if it happens to you, select a strategy to deal with it. There are a lot of options: filling in missing values with different approaches, deleting samples with missing values, deleting columns with missing values, and so on. There is no single correct way to do this; you should decide based on your current situation.
If we speak about a computer vision dataset, the set of actions in this step is smaller than for tabular data. Here I recommend that you:
Visually check the data. Some images may be out of domain or have other problems and should be removed from the dataset. Of course, I wouldn't say that you should check every image, but visualizing a couple of them is always useful. If we deal with classification, show examples that belong to different classes to make the meaning of each class clear.
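The sub-steps above can be sketched as one small overview table. The toy data, the function names, and the type-guessing rule are assumptions for illustration; in particular, the rule treats every numeric column with more than two values as continuous, so integer-coded categories would need a manual correction.

```python
import numpy as np
import pandas as pd


def classify_column(series):
    """Rough type guess: 'binary', 'categorical', or 'continuous'."""
    n_unique = series.dropna().nunique()
    if n_unique <= 2:
        return "binary"
    if pd.api.types.is_numeric_dtype(series):
        return "continuous"
    return "categorical"


def overview(df):
    """One row per column: guessed type, unique values, share of missing."""
    return pd.DataFrame({
        "kind": df.apply(classify_column),
        "n_unique": df.nunique(),
        "missing_share": df.isna().mean(),
    })


# Hypothetical toy table.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "age": [23.0, 35.0, np.nan, 41.0],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
    "is_member": [1, 0, 1, 1],
})

df = df.drop(columns=["id"])  # an identifier carries nothing for modeling
report = overview(df)
print(report)
```

The `missing_share` column is a convenient starting point for choosing between filling values, dropping rows, and dropping the column altogether.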
Check the metadata. Depending on how detailed the metadata is, you can replicate some actions related to the tabular dataset described above.
Estimate image sizes. It is perfect when all images have the same shape, but that is rare. By obtaining information about the shape distribution, you can select the right strategy for processing images.
3. Target understanding. Of course, you can't successfully solve any machine learning problem without a full understanding of the target variable. I recommend performing the following:
Check the class distribution, or the distribution of values if we deal with regression. This tells you how balanced the data is or, in the case of regression, what kind of distribution the target follows. Based on this information, you will be able to select the right strategy for model preparation and data preprocessing.
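Estimating the shape distribution takes only a few lines. In this sketch the images are hypothetical zero-filled arrays standing in for whatever your loader returns.

```python
from collections import Counter

import numpy as np


def shape_distribution(images):
    """Count how many images have each (height, width) shape."""
    return Counter(image.shape[:2] for image in images)


# Hypothetical stand-ins for loaded images (height, width, channels).
images = [
    np.zeros((64, 64, 3), dtype=np.uint8),
    np.zeros((64, 64, 3), dtype=np.uint8),
    np.zeros((128, 96, 3), dtype=np.uint8),
]

for shape, count in shape_distribution(images).most_common():
    print(shape, count)
```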
Try to find outliers in the data. The ability to identify anomalies in the dataset is a valuable skill. Sometimes dropping several outlier samples can increase the score of your model, so you shouldn't underrate this step.
Correlation analysis. The goal of this step is to estimate the impact of each variable on the target. This is a critical step that allows you to reduce the number of features used for training, but you should be careful here because of the high risk of losing good features.
Speaking of image segmentation, you should implement an encoder for the masks and check that the masks are placed exactly where they should be.
4. Features understanding. In this step we need to finalize the list of features that will be used for modeling. Yes, we may still have some uncertainty about which subset of features to use, but that is pretty okay. The real value of such features can be estimated only in the modeling phase. So, here is what I recommend doing in this block:
Feature engineering. We already know something about our features (from step 2), so it is the right time to conduct feature engineering. By feature engineering we mean a set of actions like removing features that are not useful, building new features from combinations of existing ones, scaling existing ones, encoding, and so on. It is a large step whose size will grow together with your experience. For image processing tasks there is also a wide window of opportunities. You can scale values, convert to grayscale, resize images, use only selected channels (like only green), convert to other color spaces, apply augmentation, and much more.
Cross-correlation analysis. To understand the dependencies between non-target features, you should build at least a correlation matrix for all features. Sometimes it reveals that two variables are strongly correlated with each other, so one of them can be easily removed.
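The three checks above can be sketched on a toy table as follows. The data and names are hypothetical, the 1.5 * IQR rule is just one common outlier heuristic that I picked for the example, and `corrwith` computes Pearson correlation, which only captures linear relationships.

```python
import numpy as np
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical toy data with a binary target.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
    "feature_b": [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

# 1) Class distribution as shares of the whole dataset.
class_share = df["target"].value_counts(normalize=True)

# 2) Outliers in one feature.
outlier_mask = iqr_outliers(df["feature_a"])

# 3) Correlation of every feature with the target.
target_corr = df.drop(columns="target").corrwith(df["target"])

print(class_share, outlier_mask.sum(), target_corr, sep="\n")
```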
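To check mask placement you need the masks as arrays. The format depends on the dataset, so the following is only a sketch under one assumption: masks stored as run-length encoded strings of "start length" pairs with 1-based starts and column-major pixel order. If your dataset uses another convention, the decoding has to change accordingly.

```python
import numpy as np


def rle_decode(rle, shape):
    """Decode 'start length start length ...' into a binary (H, W) mask.

    Assumes 1-based start positions and column-major pixel order.
    """
    height, width = shape
    flat = np.zeros(height * width, dtype=np.uint8)
    tokens = np.asarray(rle.split(), dtype=int)
    starts, lengths = tokens[0::2] - 1, tokens[1::2]
    for start, length in zip(starts, lengths):
        flat[start:start + length] = 1
    return flat.reshape((height, width), order="F")


# Hypothetical encoded mask for a 3x2 image.
mask = rle_decode("1 2 5 1", shape=(3, 2))
print(mask)
```

Overlaying the decoded mask on the image is then the actual placement check.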
Experimenting with preprocessing. Different tasks can require different processing for the same features. So you should apply several methods for processing features and select the most suitable for the current problem.
Visualization. It is always a core part of any EDA. You can understand almost everything about a feature from its visualization. Is it a noisy feature? What is its distribution? What are the ranges of its values? How does it behave compared to other features? You should never skip the visualization. If you have even a slight feeling that you don't have enough plots, it is better to add more. Of course, in a smart way, not randomly.
As image augmentation is a common and effective method, you should also draw conclusions about which types of augmentation can or can't be applied here and why.
For image datasets, useful information can sometimes be extracted from the color histograms.
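Two of the checks from this step are easy to sketch: listing strongly correlated feature pairs and computing per-channel color histograms. The data, the function names, the 0.9 threshold, and the number of bins are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """List (feature_1, feature_2, corr) for pairs with |corr| >= threshold."""
    corr = df.corr()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            value = corr.loc[first, second]
            if abs(value) >= threshold:
                pairs.append((first, second, float(value)))
    return pairs


def channel_histograms(image, bins=8):
    """Per-channel histograms of an (H, W, C) uint8 image, shape (C, bins)."""
    return np.stack([
        np.histogram(image[..., channel], bins=bins, range=(0, 256))[0]
        for channel in range(image.shape[-1])
    ])


# Hypothetical tabular features: one column is an exact multiple of another.
features = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "noise": [3.0, -1.0, 4.0, -1.0, 5.0],
})
pairs = correlated_pairs(features)
print(pairs)

# Hypothetical image: only the second channel is non-zero.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[..., 1] = 200
hist = channel_histograms(image)
print(hist)
```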
5. Insights search. Target and feature understanding are two steps that are more or less similar for all tasks in the same category. But the insights search is the thing that I love the most in the EDA building process. To understand what I mean, let's review a couple of examples.
1) MNIST classification dataset. As we know, the core idea of this dataset is to classify handwritten digits from 0 to 9. So a cool feature here is to prepare a heatmap for each class in the training set. For example, you take all samples of class "0", add them together into a single image, and calculate the probability of each pixel being colored for the "0" class. Let's go deeper into this example.
To construct this feature, we need a few pieces of code. Let's start with the function that loads the MNIST dataset.
Next, we check that everything went well and that we can work with our data.
And now it is time for some magic. Here we build our heatmaps (10 in total) inside a list named arr. After this, we add each training sample to the corresponding heatmap and, as the last step, scale the results.
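One possible loader is sketched here. It is an assumption, not the original function: it uses scikit-learn's `fetch_openml` with the "mnist_784" dataset and the conventional split of the first 60000 rows for training and the rest for testing. The download itself needs network access, so the reshaping and splitting live in a separate helper that can be tried on fake data.

```python
import numpy as np


def split_mnist(data, target, n_train=60000):
    """Reshape flat 784-pixel rows to 28x28 images and split train/test."""
    images = np.asarray(data, dtype=np.float32).reshape(-1, 28, 28)
    labels = np.asarray(target).astype(np.int64)  # OpenML labels are strings
    train = (images[:n_train], labels[:n_train])
    test = (images[n_train:], labels[n_train:])
    return train, test


def load_mnist():
    """Download MNIST from OpenML (needs scikit-learn and network access)."""
    from sklearn.datasets import fetch_openml  # imported lazily on purpose

    bunch = fetch_openml("mnist_784", version=1, as_frame=False)
    return split_mnist(bunch.data, bunch.target)


# Quick check of the helper on fake data instead of the real download.
fake_data = np.zeros((5, 784))
fake_target = np.array(["1", "2", "3", "4", "5"])
(x_train, y_train), (x_test, y_test) = split_mnist(fake_data, fake_target, n_train=3)
print(x_train.shape, y_test)
```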
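A minimal sketch of this step follows. The list name arr comes from the description above; the function name and the choice to scale each heatmap by its own maximum are my assumptions, since the exact scaling is not spelled out. Tiny 2x2 arrays stand in for the 28x28 digits.

```python
import numpy as np


def build_heatmaps(images, labels, n_classes=10):
    """Sum the samples of each class, then scale each heatmap to [0, 1]."""
    arr = [np.zeros(images.shape[1:], dtype=np.float64) for _ in range(n_classes)]
    # Add each training sample to the heatmap of its class.
    for image, label in zip(images, labels):
        arr[label] += image
    # Scale every heatmap by its own peak (assumed scaling).
    for digit in range(n_classes):
        peak = arr[digit].max()
        if peak > 0:
            arr[digit] /= peak
    return arr


# Hypothetical 2x2 "images" with two classes.
images = np.array([
    [[255, 0], [0, 0]],    # class 0
    [[255, 255], [0, 0]],  # class 0
    [[0, 0], [0, 255]],    # class 1
], dtype=np.float64)
labels = np.array([0, 0, 1])

arr = build_heatmaps(images, labels, n_classes=2)
print(arr[0])
print(arr[1])
```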
Now we just visualize our heatmaps.
Looks like we are good with our heatmaps. At least visually we can identify each number from the output.
To understand the performance of our non-machine-learning model, let's write some simple code that assigns each sample the class whose heatmap gives the lowest MSE.
And compute the accuracy score of this "prediction".
As we can see, the score is not so bad: over 76% accuracy for a solution without any machine learning model, and it can still be improved.
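Such a classifier can be sketched with NumPy broadcasting. The function name is hypothetical, and the sketch assumes that images and heatmaps are on the same scale (here both in [0, 1]); otherwise the MSE values are not comparable.

```python
import numpy as np


def predict_by_mse(images, heatmaps):
    """Assign each image the class whose heatmap has the lowest MSE to it."""
    heatmaps = np.asarray(heatmaps, dtype=np.float64)  # (classes, H, W)
    images = np.asarray(images, dtype=np.float64)      # (samples, H, W)
    # errors[i, c] is the MSE between image i and the heatmap of class c.
    errors = ((images[:, None] - heatmaps[None]) ** 2).mean(axis=(2, 3))
    return errors.argmin(axis=1)


# Hypothetical 2x2 heatmaps for two classes and two test samples.
heatmaps = [
    [[1.0, 0.5], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
test_images = np.array([
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
])
test_labels = np.array([0, 1])

predictions = predict_by_mse(test_images, heatmaps)
accuracy = float((predictions == test_labels).mean())
print(predictions, accuracy)
```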
0.7629
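And finally, we look at the confusion matrix on our test set.
Basically, it makes sense. Reading the rows as predicted classes and the columns as true classes (each column sums to the size of that class in the test set), row 1 shows some strange outputs: many digits of other classes were assigned to class 1. The rest is completely expected.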
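For reference, a confusion matrix can be counted with plain NumPy. This sketch puts predicted classes in rows and true classes in columns, which is the orientation the matrix in this section appears to use, since its columns add up to the per-class test counts. Note that scikit-learn's `confusion_matrix(y_true, y_pred)` uses the opposite orientation, with true classes in rows. The labels below are hypothetical.

```python
import numpy as np


def confusion_matrix_pred_rows(y_true, y_pred, n_classes):
    """Count matrix with predicted classes in rows, true classes in columns."""
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(matrix, (np.asarray(y_pred), np.asarray(y_true)), 1)
    return matrix


# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 1])

matrix = confusion_matrix_pred_rows(y_true, y_pred, n_classes=3)
accuracy = np.trace(matrix) / matrix.sum()  # correct predictions sit on the diagonal
print(matrix)
print(accuracy)
```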
array([[ 794, 0, 18, 2, 0, 3, 10, 1, 6, 7],
[ 19, 1115, 205, 99, 46, 162, 59, 110, 105, 45],
[ 0, 0, 559, 13, 0, 1, 5, 3, 2, 1],
[ 8, 3, 42, 766, 0, 116, 0, 1, 76, 7],
[ 11, 0, 73, 7, 840, 42, 52, 26, 24, 129],
[ 64, 1, 1, 28, 0, 478, 21, 0, 27, 7],
[ 64, 3, 40, 8, 13, 26, 810, 1, 15, 4],
[ 3, 0, 22, 12, 0, 12, 0, 815, 9, 21],
[ 8, 13, 65, 46, 1, 12, 1, 13, 671, 7],
[ 9, 0, 7, 29, 82, 40, 0, 58, 39, 781]])
2) Samples visualization. Sometimes you can find boundaries for the positive class (tabular binary classification) just by visualizing some features. In one such visualization, we plot all samples by 3 features (c-31, c-78, and c-32) and label the classes by color (yellow for the positive class and blue for the negative).
We can highlight here that all positive samples are located below the value -2 for these 3 features. This is a really good observation that can be used in the post-processing phase of our pipeline. It is one more proof that visualization is a powerful tool.
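As a sketch of how such an observation could be used in post-processing, the rule below zeroes the positive-class probability for every sample that is not below -2 on all three features. Reading the observation as "all three features at once" is my assumption, and the samples and model outputs are made up.

```python
import numpy as np
import pandas as pd

RULE_FEATURES = ["c-31", "c-78", "c-32"]
THRESHOLD = -2.0


def apply_region_rule(features, predicted_proba):
    """Zero the positive-class probability outside the observed region."""
    inside = (features[RULE_FEATURES] < THRESHOLD).all(axis=1)
    return np.where(inside, predicted_proba, 0.0)


# Hypothetical samples and model outputs.
features = pd.DataFrame({
    "c-31": [-3.0, -2.5, 0.4],
    "c-78": [-4.1, 1.0, 0.2],
    "c-32": [-2.2, -3.0, -0.5],
})
predicted_proba = np.array([0.7, 0.6, 0.1])

adjusted = apply_region_rule(features, predicted_proba)
print(adjusted)
```

A rule like this should be validated on held-out data before it goes into the pipeline, because it is derived from the training samples only.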
3) Segmentation instances localization. The general idea is quite similar to the MNIST example. We can build a probability map from all samples that we have in the dataset. This way we will be able to filter out regions where the probability of the object appearing is close to zero.
4) Segmentation instances size distribution. One more way to control predictions in image segmentation tasks is to explore the distribution of instance sizes in the training set. It can be the area of the object, its length, or its width. In simple words, it would be strange if you had a lot of small objects in the training set but predicted only huge instances.
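Both ideas can be sketched in a few lines, assuming the training masks are binary arrays of the same shape with one instance per mask. The function names and the toy masks are hypothetical.

```python
import numpy as np


def probability_map(masks):
    """Per-pixel share of training masks in which the object is present."""
    return np.asarray(masks, dtype=np.float64).mean(axis=0)


def suppress_unseen_regions(prediction, prob_map, min_prob=0.0):
    """Clear predicted pixels where training objects (almost) never appear."""
    return np.where(prob_map > min_prob, prediction, 0)


def instance_areas(masks):
    """Object area in pixels for each mask."""
    masks = np.asarray(masks)
    return masks.reshape(len(masks), -1).sum(axis=1)


# Hypothetical 2x3 training masks.
masks = np.array([
    [[1, 1, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 0]],
])
prob_map = probability_map(masks)

# A hypothetical prediction with one pixel in a never-seen region.
prediction = np.array([[1, 0, 0], [0, 0, 1]])
cleaned = suppress_unseen_regions(prediction, prob_map)
areas = instance_areas(masks)

print(prob_map, cleaned, areas, sep="\n")
```

Comparing `areas` from the training set with the areas of predicted instances is a cheap sanity check on the model output.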
6. Baseline modeling. A complete understanding of the data is great, but you shouldn't forget why you are doing this. We build EDA to increase our chances of constructing an efficient model. That's why it is important to take care of a baseline model for the current problem. It serves several purposes: it sheds light on the relative scores and provides an understanding of the requirements for model inputs and outputs. It is also a perfect way to make an assumption about the required model complexity. Don't try to build state-of-the-art models here; you don't need to do this in your EDA. Save your enthusiasm and computational resources for building production models.
I think we now have enough information to build our perfect notebook with exploratory data analysis. Remember that each person is unique and there is no rigid template for preparing EDA. Create your own style, continue experimenting with content, carry on a conversation with your readers, and all will be good.
I hope this was useful and at least a little bit informative for you. What interesting approaches do you follow when writing EDAs? Write your answers in the comments and, of course, thanks for reading!
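The simplest possible reference point for the relative scores is a predictor that always answers with the most frequent training class. It is not a real baseline model, only a floor that any baseline should beat; the labels below are hypothetical.

```python
import numpy as np


def majority_class_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    classes, counts = np.unique(y_train, return_counts=True)
    majority = classes[counts.argmax()]
    accuracy = float(np.mean(np.asarray(y_test) == majority))
    return majority, accuracy


# Hypothetical labels.
y_train = np.array([0, 0, 0, 1, 1])
y_test = np.array([0, 1, 0, 0])

majority, accuracy = majority_class_baseline(y_train, y_test)
print(majority, accuracy)
```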