Testing coding skills is a mandatory part of the hiring process at many companies. A Data Science interview test usually combines pure coding exercises with domain-specific ones, which makes it much harder to complete successfully. That's why many people get stuck at this step without ever reaching the live interview session. The problem is even more critical for Data Science internship candidates, who lack experience and a general understanding of the process. In this article, we will talk about the most common mistakes of candidates for different Data Science positions and ways to overcome them.
Hi there! If you decided to keep reading this article, I assume you have already completed at least one Data Science test task and didn't achieve your goals. So we will talk about some important points that are easy to miss while preparing a solution. We will not cover any deep, task-specific details like "how to solve ..." questions. Of course, to be invited to an interview, you need to solve all the tasks correctly. If you are interested in general Data Science mistakes, I recommend reading the article Classification of the common mistakes in Data Science first. But what if you think your solution is correct from the technical point of view, yet you still got negative feedback? In this case, the problem may be hidden in the organizational part of the work. Let's try to understand what could go wrong in such a case. As usual, we start with a short plan.
Readme file preparation
Requirements.txt file
Conditions of the test task
Ignoring .py files
Ignoring data analysis
Task understanding
Excessive initiative
Git usage
Readme file preparation
Usually, companies ask candidates to submit coding exercises via a repository (GitHub, GitLab, or similar). The readme is the first file that should be created in a new repository. In short, it should answer all the questions a user or developer might have about the project. To get a feel for what this means, visit a few open GitHub repositories of Python libraries you use regularly.
Now let's talk about what should be inside this file (a minimal skeleton follows this list):
What the project is about and what its purpose is. This should be at the very beginning of every readme file. It helps users understand whether your project covers their needs or whether they should keep searching for something else. Moreover, it prepares them for reading the code, because they will roughly know what to expect after reading this file.
Information on how to get the code onto a local computer. Keep in mind that users with different skill levels will consume your work. They may not even be familiar with Git syntax and may try to download your code script by script manually. So a good way to continue the readme file is to provide instructions for cloning and setting up the project, including the installation of all libraries.
Project structure. For quick navigation, it is better to include the project tree, or something similar, containing a brief explanation of each directory.
In the case of a test task, it's reasonable to provide a short summary of the results. For instance: "during the experiments, I achieved 95% of the target metric", and so on.
Any other information that will help readers understand the project.
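To make this concrete, here is a minimal readme skeleton for a test-task repository. Treat it as a sketch: the section names, the placeholder repository URL, and the directory names are illustrative, not mandatory.

# Project title
One or two sentences about what the project does and why.

## Setup
git clone <repository-url>
cd <project-directory>
pip install -r requirements.txt

## Project structure
notebooks/  - EDA and experiments
src/        - training and inference scripts
data/       - input data (not tracked by Git)

## Results
A short summary of the experiments and the achieved metrics.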
Requirements.txt file
One more essential file that should be present in a repository with Python code (I assume the majority of Data Science tasks are related to Python) is requirements.txt. The mission of this file is pretty simple: it stores information about all the modules and libraries that the project depends on. Why do we need it? Because you can install all the dependencies into the project by running the command pip install -r requirements.txt. There is no manual effort; the installation is performed automatically. It is a really useful thing, and I always mention this file in the test task definition.
In general, requirements.txt contains a list of libraries with pinned versions:
boto3==1.26.42
sagemaker==2.127.0
pandas==1.1.2
Pinning the library versions is critical because different developers may use different versions of the same library. In that case, some functionality may be missing, and the code will not work.
During test task reviews, I usually see three groups of mistakes linked to this file.
Absence of the file or totally incorrect content. In this case, the solution doesn't contain a requirements.txt file at all, or the file has the wrong content. For example, I often see people put the task definition into this file, or other text that has nothing to do with its purpose.
A long list of requirements. Sometimes people put into requirements.txt every library they have ever used. My personal record observed so far is 615 libraries. The reason is quite simple: people work inside a single virtual environment and keep there all the packages of their whole career. In practice, this is not a good approach; it is better to have a separate environment for each project (see the commands after this list). Otherwise, the reviewer of the work will spend hours just on preparing the environment. Put into requirements.txt only the libraries that are actually required by your solution.
Local paths in the file. Some Python libraries can be installed from local wheels. In this case, if you use the pip freeze command, you will get something like this:
pandas @ file:///C:/ci/pandas_1602083338010/work
Of course, nobody will be able to install such a dependency on their machine. So if you use pip freeze, be careful to validate the output file: it should be installable on any machine, not only your own.
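A minimal sketch of a workflow that avoids both of the last two mistakes: create a dedicated virtual environment per project, install only what the solution needs, and review the frozen file before submitting. The package versions below are taken from the example above and are illustrative.

python -m venv .venv
source .venv/bin/activate            # on Windows: .venv\Scripts\activate
pip install pandas==1.1.2 boto3==1.26.42
pip freeze > requirements.txt
# Review requirements.txt and replace any "package @ file:///..." entry
# with a normal pinned entry such as pandas==1.1.2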
Conditions of the test task
Serious companies take care with all their procedures, including test task preparation. Usually, the document with the task definition contains mandatory details that shouldn't be skipped: for instance, the framework, the model architecture, the target metric, the programming language, coding style guides, and other information. Why do people do this? With high probability, this is the stack regularly used by the team, or it is mandatory on the project you are going to be hired for. So if you see in the requirements that the solution should be implemented in PyTorch, please follow that and don't try to use TensorFlow, even if you have 10 years of experience with that framework. If you are interested in how to implement a simple neural network in different frameworks, check the article Deep learning frameworks comparison.
Ignoring .py files
Production code usually assumes the use of Python scripts (.py files). Yet even when the task description explicitly mentions .py files, I often see only Jupyter Notebooks for all parts of the solution. Yes, a notebook is quite a useful tool for analysis and prototyping, but all the final source code (except the EDA) should be written in Python scripts.
The second problem is a misunderstanding of what .py files are for. People just take the content of the Jupyter Notebook and paste it into a .py file, with all the comments, variable outputs, and other notebook-specific leftovers. You wanted a .py file - you got it. The work is done.
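A proper script is structured code with an entry point, not a notebook dump. Here is a minimal sketch of what a training script could look like; the file name, paths, default column name, and choice of model are hypothetical, and the features are assumed to be numeric.

# train.py - a minimal training script skeleton (hypothetical names and paths)
import argparse

import pandas as pd
from sklearn.linear_model import LogisticRegression


def load_data(path: str) -> pd.DataFrame:
    """Read the training data from a CSV file."""
    return pd.read_csv(path)


def train(df: pd.DataFrame, target: str) -> LogisticRegression:
    """Fit a simple baseline model on the given dataframe."""
    X = df.drop(columns=[target])
    y = df[target]
    return LogisticRegression(max_iter=1000).fit(X, y)


def main() -> None:
    parser = argparse.ArgumentParser(description="Train a baseline model")
    parser.add_argument("--data", required=True, help="Path to the training CSV")
    parser.add_argument("--target", default="target", help="Name of the target column")
    args = parser.parse_args()

    df = load_data(args.data)
    model = train(df, args.target)
    print(f"Train accuracy: {model.score(df.drop(columns=[args.target]), df[args.target]):.3f}")


if __name__ == "__main__":
    main()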
Ignoring data analysis
When I prepare a test task that contains a data-science-related problem, I always ask people to build a notebook with Exploratory Data Analysis (EDA). Unfortunately, the submitted Jupyter Notebook often contains only the code to read the data and train the model. That is not EDA. To learn how to prepare an EDA notebook properly, check the article Exploratory data analysis guideline.
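As a reference point, a minimal EDA notebook could start with cells like the ones below and then go deeper into distributions, correlations, and the relationship between features and the target. The file name and the target column are hypothetical.

import pandas as pd

df = pd.read_csv("data.csv")               # hypothetical input file

df.head()                                  # first look at the raw records
df.info()                                  # column types and non-null counts
df.describe()                              # basic statistics of numeric columns
df.isna().sum()                            # missing values per column
df["target"].value_counts(normalize=True)  # class balance of a hypothetical target column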
Task understanding
A well-understood task is half the success. Sometimes I face situations where people don't read the problem definition carefully and start to solve a totally different task. This happens even with classical algorithmic problems that have strictly defined test cases for input and output. As a result, we often get feedback that our test cases are wrong and that it is not an issue with the candidate's algorithm. So if you can't clearly understand the task definition, contact the person responsible for communication and clarify all open questions.
Excessive initiative
Sometimes candidates like to demonstrate their full stack of skills and try to put everything they know into a solution, even if it is not useful for the task at hand. For instance, it can be a grid search over all possible models and parameters for a problem that can be solved by plain linear regression. Another example is implementing every known algorithm, including the weak ones, to solve some mathematical equation.
Of course, initiative is a good thing; if you don't have it at all, you probably shouldn't work as a Data Scientist. But you have to remember the following points (a short baseline-first sketch follows the list):
Your solution should contain only the things required by the task. In real life, it doesn't matter what additional features you add to the solution; what matters is that it is correct and does the job it should do.
Adding anything unrelated to the task increases the time needed to review your work. With high probability, an examiner checking 50+ such tests will not be happy about it. Inside a company, you would increase the time for code review and integration with other developers, because they would need to understand everything you added.
More code means more opportunities to make a mistake. Mistakes made in code that doesn't solve the task will not add value to your work.
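To illustrate the baseline-first idea from the grid-search example above, here is a minimal sketch. It assumes scikit-learn, and the synthetic data stands in for whatever dataset the task provides.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the task's dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Start with the simplest model that can plausibly solve the task;
# reach for grid search only if the baseline is demonstrably insufficient.
model = LinearRegression().fit(X_train, y_train)
print(f"Test R^2: {r2_score(y_test, model.predict(X_test)):.3f}")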
Git usage
Usually, examiners ask you to share the work via a Git repository. The most common case is GitHub, but the exact platform is not important here. The way you push your work into the repository can be really critical, because it shows how good you are at working with Git. So if you just push a single commit with all the files straight into master under a cryptic name, it will not positively affect the evaluation of your work.
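As an illustration, a history like the sketch below reads much better than one bulk commit; the branch and file names are hypothetical.

git checkout -b feature/solution
git add README.md requirements.txt
git commit -m "Add project description and dependencies"
git add notebooks/eda.ipynb
git commit -m "Add exploratory data analysis notebook"
git add src/train.py
git commit -m "Add training script"
git push origin feature/solution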
Now we understand that even if your algorithm works perfectly, there are many factors that can negatively affect the evaluation of your work. So try to focus not only on the technical solution of the task but also on the organizational part of your work. With this combination, passing the test will not be a problem, and you will soon get the job offer you wish for.