Everyone, regardless of experience and skill level, makes mistakes from time to time. This is perfectly normal - you can't enrich your knowledge without failing along the way. But junior specialists, including Data Scientists, tend to make a lot of mistakes simply because they lack expertise. To help prevent this, I propose going through a classification of the most common mistakes and antipatterns in the Data Science area.
Hi there! Before we start, I should say that there is no official classification of Data Science mistakes and antipatterns. Everything you will see here is a product of my own observations and experience from the companies where I have worked. Of course, I would be happy if this post became the basis for an official classification, but the list is not limited to what is covered here, and you can certainly add more insights of your own. So let's look at the proposed classification and start our journey into the world of failures in Data Science. I propose splitting all cases into five main categories. Some of them belong only to the Data Science domain, while others apply to software development in general. Below you can see the high-level categories we will go through. I hope my naming of the cases won't confuse you too much.
Student's fever.
Self-management issues.
Analysis and data preparation issues.
Data Science solution issues.
Coding issues.
Now we are ready to look at what each category means and go through the individual cases. We will focus not only on describing the problems but also on advice for solving them.
Student's Fever
This is the most typical category of mistakes at the beginning of a Data Science career. The groundwork is usually laid at university. As the name suggests, these mistakes come from people who arrive at a company armed with knowledge from lectures and courses but no practical experience. Several typical cases belong to this category.
SOTA lover. When you start working on a new task, you usually try to find the best existing solution to the problem. In the world of Data Science, this is called "state of the art" (SOTA). Inspired by this, many juniors try to apply a SOTA model to every task they get. This approach is powerful, of course, but accuracy is rarely the only requirement coming from the customer. Sometimes it is enough to fit a simple linear regression model rather than spend the customer's money on GPU machines and long training runs. Advice: before training a model, make sure you clearly understand all the requirements from the customer or your team lead. Do some initial research and try a baseline model first. Only after that should you think about the final solution and the tools to implement it. You may well end up with the same SOTA model, but now its use will be justified.
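To make this concrete, here is a minimal sketch of what such a baseline could look like, assuming a tabular regression task with already-prepared train/test arrays (the variable names are hypothetical):

```python
# Baseline sketch: X_train, X_test, y_train, y_test are assumed to be
# already-prepared numeric arrays (hypothetical names for illustration).
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

baseline = LinearRegression()
baseline.fit(X_train, y_train)

mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"Baseline MAE: {mae:.3f}")  # any heavier model has to beat this number
```

If a deep SOTA architecture improves on this baseline only marginally, that is already a strong argument for keeping the simple model.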
Development hater. I often meet newcomers who think a Data Science position means working only on modeling in Jupyter notebooks. Some companies do operate this way, but in most of them, working as a Data Scientist means dealing with modeling and integration as well as other activities specific to the task. Advice: be ready to work not only on the Data Science part but also on general engineering. Take care of the quality of your code and keep developing your programming skills.
Model designer. Data Science is a complex area where becoming an expert takes a lot of time and many disciplines: Math, Linear Algebra, Probability Theory, Statistics, Programming, Machine Learning, and more. Some people think they can learn a few Python basics, complete a couple of ML courses, and become great professionals because all they will ever need to do is build models. Advice: during your education, spend enough time on the fundamental disciplines. It is much easier to go deep into Data Science when you have a strong background in classical university courses.
Young senior. I often see private courses that promise to turn you into a great Data Scientist in a couple of months with no background at all. It looks like a good ad for making money. People without a background don't know how hard it is to become a good Data Scientist and sign up for these courses. After completing them, they expect impressive job offers despite not having enough knowledge or experience. From another angle, some students get carried away by papers about SOTA models and motivational videos about fast-growing career paths. Enthusiasm is always an advantage, but such people sometimes start to believe they are already seniors and can hold a corresponding position in any company. Conclusion: courses and similar resources are a great way to enrich your skills and knowledge, but they will not make you a great professional from scratch. Becoming a good specialist always takes a huge amount of work.
Gladiator. Data Science competitions are very popular today, and some of them attract thousands of competitors, which makes them a perfect opportunity to expand your knowledge. Personally, I enjoy spending time on such competitions. You only need to keep one thing in mind: not every competition solution is applicable in real life. To squeeze out a 0.000001% improvement, people usually stack a lot of models. In a real-time project, for example, this is not a good approach because you need to process a lot of data every minute. And that is only one example. In most cases, a customer will also not pay for several months of development that brings only a 0.000001% improvement. Advice: take part in competitions! It is a perfect way to train yourself and build up stress tolerance. But don't forget that not every competition solution translates to commercial data science. Also, don't focus only on competitions during your studies, or you will miss a lot of skills required in production development.
Wheel inventor. Young specialists without much experience like to write every solution from scratch. It takes a lot of time and usually works no better than existing solutions. Writing something from scratch is always a good exercise for understanding the logic under the hood, but in production development it costs a lot of money, and people usually prefer existing, well-tested solutions. Advice: before you start implementing your own solution, check what others have already done. With high probability, you will find that someone has already built at least one library for your problem.
Self-management issues
This category usually shows up when an engineer pays too little attention to, or completely ignores, the materials and practices for developing non-technical (soft) and management skills during their studies. Newcomers often think that the most important part of the job is simply solving the task, and that the ability to communicate with colleagues or take part in project management is not that critical. As with the previous category, let's go over the concrete cases.
Estimate shooter. Project estimation is always a complex process; even experienced specialists can't provide completely accurate time estimates. In most cases, engineers underestimate, and in practice underestimating is a much more frequent problem than overestimating. There are many reasons for this: unfamiliar technology, weak task decomposition, or simple inattention. Advice: before giving any time estimates, make sure all the tasks are clear to you and account for unfamiliar technologies. It is always good practice to ask for help from people who already have experience in the field, or who can at least review your estimates.
Business doesn't matter. Understanding the problem and its domain from a business point of view is always a big boost and often the key to solving the problem most effectively. It usually reduces modeling time and helps avoid using complex models without need. Unfortunately, this step is sometimes skipped, and engineers start working on the problem without understanding why the customer needs it at all. Advice: spend enough time understanding the business problem. If there is no business analyst or project manager on your project, try to get this information from your Lead or directly from the customers during meetings.
Technical robot. Developing soft skills is a significant part of professional growth for every developer, data scientists included. Unfortunately, some engineers prefer to develop only in the technical direction, which can cause a lot of discomfort when moving up to Senior and higher levels. Advice: don't forget to improve your soft skills. Try to join activities offered by your company or by other institutions. Don't pass up the chance to participate in a customer call or to initiate a brainstorming session in your team.
Distorting mirror. Defining a task well is a real art. It is important to provide a clear and concise description that won't raise a lot of questions from the engineer working on it. But sometimes the engineer doesn't fully understand the problem statement, or interprets it the wrong way, and starts working on it anyway. Nothing good comes out of this: deadlines are missed and money is lost. Advice: don't be afraid to talk to your Team Lead or Project Manager and clarify the confusing points in your task. It is much better to resolve all issues at the beginning than to do the same task twice (at least).
Analysis and data preparation issues
This category is entirely specific to Data Science and related projects. The main reasons for the mistakes in this block are a lack of the required analytical skills and a strong desire to get a quick, visible result. The outcome is that all effort goes into getting a model that performs well, often skipping the steps required to build a proper Data Science solution.
Blind analyst. Initial data analysis is a critical and mandatory part of any data science solution, and its quality correlates directly with the visualization tools chosen for it. In this case, engineers don't skip data analysis entirely, but they do skip the important steps that should involve visualization. Advice: never skip visualization. It can provide hints that are not visible from raw code, or that would take a lot of programming effort to extract. Besides, humans take in most information through their eyes, so use that channel as much as possible.
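As an illustration, here is a small sketch of the kind of visual checks that are hard to get from raw numbers alone; it assumes a pandas DataFrame `df` with a numeric column `price` and a categorical column `segment` (both hypothetical):

```python
# Quick visual checks on a hypothetical DataFrame `df`
# with a numeric "price" column and a categorical "segment" column.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

df["price"].hist(bins=50, ax=axes[0])                  # skew and outliers are visible at a glance
axes[0].set_title("Distribution of price")

df.boxplot(column="price", by="segment", ax=axes[1])   # differences between segments
axes[1].set_title("Price by segment")

plt.tight_layout()
plt.show()
```

Two plots like these often reveal skewed distributions, outliers, or odd segments faster than any printed statistic.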
Broken faucet. This is a common problem that can happen to engineers of any level. Its better-known name is "data leakage". Basically, it means using features during training that cannot be accessed in production (such as features from the future). This usually gives the model fantastic scores in the training phase but produces unpleasant surprises in production. Advice: there is no single recommendation here. Be careful during the feature engineering step, understand the meaning of all your features, and check that every one of them can be constructed in production.
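One frequent source of leakage is fitting preprocessing on the whole dataset before splitting it. Here is a minimal sketch of the leakage-safe way to do it, assuming hypothetical `X` and `y` arrays:

```python
# Leakage-prone vs. leakage-safe scaling (hypothetical X, y arrays).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Leaky: the scaler "sees" test-set statistics before evaluation.
# scaler = StandardScaler().fit(X)

# Safe: statistics are estimated on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

For time-dependent data, the same logic applies to the split itself: validate on data that comes strictly after the training period, so no "features from the future" sneak in.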
Lost EDA. In this case, the Data Science solution is built without any initial analysis at all. You will still be able to train a model, but in most situations it will not reach acceptable quality. Exploratory data analysis is a mandatory part of many Data Science methodologies, including the most famous one, CRISP-DM. Advice: always spend time on exploratory data analysis. It is one of the most critical stages of a Data Science project. Check public resources and learn how other people explore their data.
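Even a very short first pass already tells you a lot. A minimal sketch in pandas, assuming a hypothetical `train.csv` file with a `target` column:

```python
# First-pass EDA on a hypothetical dataset "train.csv" with a "target" column.
import pandas as pd

df = pd.read_csv("train.csv")

print(df.shape)                          # how much data do we actually have?
print(df.dtypes)                         # which columns are numeric vs. categorical?
print(df.isna().mean().sort_values(ascending=False).head(10))  # worst missing-value offenders
print(df.describe())                     # ranges, suspicious constants, obvious outliers
print(df["target"].value_counts(normalize=True))               # class balance of the target
```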
Garbage in - garbage out. Sometimes I see situations where engineers try to solve a task simply by searching over a large number of model architectures. The data preparation and feature engineering steps are completely skipped; all that gets done is shaping the data to fit the inputs of the selected model and optimizing its hyperparameters. Although we have a lot of powerful SOTA models nowadays, none of them works perfectly on raw data without any preprocessing or feature engineering. Advice: data preparation and feature engineering should be part of your general pipeline for solving Data Science and Machine Learning problems. They are no less important for the final result than a good EDA or a correctly selected model.
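One convenient way to make sure preprocessing is never "forgotten" is to keep it in the same pipeline as the model. A sketch with hypothetical column names:

```python
# Preprocessing and the model kept together in one pipeline (hypothetical columns).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]            # hypothetical numeric columns
categorical = ["city", "device"]       # hypothetical categorical columns

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

model.fit(X_train, y_train)  # X_train is a DataFrame containing the columns above
```

With this structure, the same transformations are guaranteed to be applied in training, validation, and production.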
Data Science solution issues
While the previous category was mainly about the analytical part of the solution, this one is more about model-building problems. This class of issues is typical for beginners, but in any case we should take care of it. The main reason this group of problems appears is a lack of experience in building production Data Science solutions.
ML Golden Hammer. As the name suggests, this is an analogy of the classical Golden Hammer antipattern, adapted to the machine learning domain. The idea is simple: for the majority of their tasks, the person making this mistake trains exactly the same model. You could argue that this is just experience at work. In general, yes, but the point here is that there is no baseline model at all, just training of the "hammer model". Advice: don't forget to build simple baseline models and to try different approaches to the problem. You can even build a semi-automated pipeline that lets you quickly run a set of different models and obtain a baseline.
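A semi-automated comparison can be as simple as a loop over candidate models. A sketch with hypothetical `X` and `y` and an arbitrary metric choice:

```python
# Compare a few model families against a trivial baseline
# (hypothetical X, y; use whatever metric fits your task).
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name:10s} f1 = {scores.mean():.3f} ± {scores.std():.3f}")
```

If the "hammer model" barely beats the dummy baseline, that is a signal to rethink the approach rather than keep swinging.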
Black box user. The main idea of this case is that "model.fit" is the key to solving any problem: no need to focus on hyperparameter tuning, data format, or interpretation of the model's results. Just take the data, make it fit the model, and start training. There are similarities with "Garbage in - garbage out", but the main difference is that the "Black box user" doesn't think about tuning the model parameters simply because he doesn't understand their origin and meaning. Advice: try to identify the pitfalls hidden in your solution. You should be able to understand why something fails and how to fix it, and you can't do that if all you ever run is the fit function.
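As a contrast to a bare "model.fit", here is a minimal tuning sketch over a few parameters whose meaning is easy to reason about (hypothetical `X_train`, `y_train`):

```python
# Tune the few parameters whose effect you actually understand
# (hypothetical X_train, y_train).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],     # more trees: more stable, but slower
    "max_depth": [5, 10, None],     # depth controls how easily the model overfits
    "min_samples_leaf": [1, 5],     # larger leaves smooth out noisy splits
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
```

The value is not only the extra score points: being able to explain why each of these parameters matters is exactly what separates you from a black box user.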
High score or nothing. In this case, the only thing that matters during model development is the target metric. Everything else, such as the size of the model, inference time, or the feature selection mechanism, is treated as unimportant; the single objective is to push the target metric as high as possible. Advice: it is always good to have a model with a high target metric, but never forget about business requirements, available resources, and the features you are using. Maybe you just overfitted your model, and it makes no sense for production.
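A simple habit that helps here is to always report more than the target metric. A sketch assuming a hypothetical fitted `model` and test data:

```python
# Report more than the target metric (hypothetical fitted `model` and test data).
import pickle
import time

from sklearn.metrics import roc_auc_score

start = time.perf_counter()
proba = model.predict_proba(X_test)[:, 1]
latency_ms = (time.perf_counter() - start) / len(X_test) * 1000

print(f"ROC AUC:          {roc_auc_score(y_test, proba):.3f}")
print(f"Latency per row:  {latency_ms:.3f} ms")
print(f"Serialized size:  {len(pickle.dumps(model)) / 1e6:.1f} MB")
```

If the second and third numbers break the project's constraints, the first one doesn't matter much.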
Coding issues
This is a common group of problems that appears when an engineer has more knowledge and experience in modeling and analysis than in writing code. Such problems are typical for Data Scientists whose main activity is running experiments in Jupyter notebooks without ever integrating the code.
Unit tests avoider. This is a simple type of problem: engineers just skip writing unit tests. Some will say that unit tests for Data Science solutions are somewhat redundant, but believe my experience - even in Data Science, well-written tests save you from a lot of problems down the road. Advice: spend some time writing unit tests, and make sure they are genuinely useful pieces of code, not something you write only because your Lead told you to.
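Even feature engineering code can be tested. A small pytest sketch for a hypothetical helper `add_ratio_feature` living in a hypothetical `features` module:

```python
# test_features.py -- pytest sketch for a hypothetical feature-engineering helper.
import pandas as pd

from features import add_ratio_feature  # hypothetical module and function


def test_add_ratio_feature_handles_zero_denominator():
    df = pd.DataFrame({"clicks": [10, 5], "impressions": [100, 0]})
    result = add_ratio_feature(df)
    assert result.loc[0, "ctr"] == 0.1
    assert result.loc[1, "ctr"] == 0.0   # no division-by-zero surprises
    assert len(result) == len(df)        # no rows silently dropped
```

Tests like this one are cheap to write and catch exactly the silent data bugs that are hardest to notice from model metrics alone.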
Rock manuscripts. Writing code is an interesting and exciting process, and it is very easy, while implementing a new feature, to forget to write docstrings for it. As a result, nobody understands your genius idea, and after a while, even you can't remember what is happening in that code fragment without any documentation. Advice: a good habit is to leave at least docstrings for the public methods in your code. It takes a bit of extra time now, but saves a lot of it later.
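A docstring doesn't have to be long to be useful. A sketch with a hypothetical imputation helper:

```python
import pandas as pd


def fill_missing_prices(df: pd.DataFrame, group_col: str = "category") -> pd.DataFrame:
    """Fill missing values in the `price` column with the median of its group.

    Args:
        df: Input frame containing `price` and the grouping column.
        group_col: Column used to compute per-group medians.

    Returns:
        A copy of `df` with `price` imputed; groups without a median keep NaN.
    """
    out = df.copy()
    out["price"] = out.groupby(group_col)["price"].transform(lambda s: s.fillna(s.median()))
    return out
```

A few lines like these are usually enough for a colleague (or your future self) to use the function without reading its body.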
Hardcoder. A typical problem that starts with every developer's first project. Hardcoded values scattered through the code bring a lot of challenges when you need to change them: it is hard to find every instance that should be updated. The same goes for paths - an absolute path that works on Ubuntu will fail on Windows, and so on. Advice: always define constants at the top of your source file or in a separate settings file. To get rid of absolute paths, use pathlib in Python or a similar library.
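A minimal sketch of such a settings file (the file name and constants are hypothetical):

```python
# config.py -- hypothetical settings module: all constants live in one place.
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent
DATA_DIR = PROJECT_ROOT / "data"
RAW_TRAIN_FILE = DATA_DIR / "raw" / "train.csv"

RANDOM_SEED = 42
TEST_SIZE = 0.2
```

The rest of the code then does `from config import RAW_TRAIN_FILE, RANDOM_SEED`, and the paths stay valid on any operating system.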
Heap of everything. Like any other project, a Data Science project should be well structured. Dumping all source code, data samples, and other artifacts together in the root directory is not a good way to develop a project. Advice: before starting development, always try to build a skeleton for the project. After a few iterations, this procedure will become automatic for you.
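There is no single correct layout, but as an illustration, here is a small sketch that creates one possible skeleton with pathlib (the folder names are just an example):

```python
# Hypothetical helper that lays out one possible project skeleton with pathlib.
from pathlib import Path

FOLDERS = [
    "data/raw", "data/processed",   # datasets (not committed, see the next case)
    "notebooks",                    # exploratory work
    "src",                          # reusable, tested code
    "models",                       # serialized artifacts
    "reports",                      # figures and summaries
]

root = Path("my_ds_project")
for folder in FOLDERS:
    (root / folder).mkdir(parents=True, exist_ok=True)
```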
Fatal commit. Imagine the situation: you are developing a model that should change the world. During your experiments, you find a lot of different datasets and commit each of them. The model is ready, all the source code and artifacts are pushed to the repository, and... your repository is now 10 GB in size. Not great for the people who will use your work in the future, right? Advice: don't store full versions of datasets inside your repository. Use separate resources for that and remember to mention them in the project's readme. Of course, small samples needed to demonstrate and understand the project's flow can be stored in the repository.
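As a simple safeguard, a few entries in the project's .gitignore keep heavy artifacts out of the repository by default (the paths below assume the hypothetical layout from the previous case):

```
# .gitignore sketch: keep heavy artifacts out of the repository
data/raw/
data/processed/
models/*.pkl
*.h5
*.parquet
```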
And that completes our overview of common mistakes and antipatterns in Data Science projects. I hope it was interesting and at least a little informative for you. And, of course, if you have run into some of the issues from this list in real life, I hope you will now be able to fix them as soon as possible. Write in the comments which types of mistakes you have faced during your career, and remember the most important thing: the only way to make no mistakes is to do nothing. Thanks for reading!