Data science is the name given to studies conducted on data with the aim of extracting meaningful insights. Large amounts of data are analyzed using mathematics, statistics, artificial intelligence, and computer engineering in an interdisciplinary approach. This analysis helps data scientists and engineers ask what can be done with the results and answer those questions.
The field of data science grows every day, and so does the number of people pursuing a data science career. Most of the time, the candidates who land jobs are not those with the strongest technical skills, but those who can combine solid skills with interview preparation.
Although data science is a very broad field, we have compiled a list of data science interview questions along with their answers. 📊💼
1. What are the necessary assumptions for Linear Regression?
Linear regression rests on four assumptions (a short diagnostic sketch follows the list):
- Linearity: The relationship between the independent variable x and the dependent variable y must be linear.
- Independence: Observations must be independent of one another; one observation should not affect another. Violations of this assumption are common in time series data.
- Homoscedasticity: The variance of the residuals should be constant at every level of x.
- Normality: The errors should have a normal distribution.
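As an illustration, here is a minimal diagnostic sketch that checks the independence and normality assumptions, assuming synthetic data and the statsmodels library (the variable names are purely illustrative):

```python
# Minimal sketch: checking linear regression assumptions with statsmodels.
# The data here is synthetic; replace X and y with your own dataset.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid

# Independence: a Durbin-Watson statistic near 2 suggests uncorrelated residuals.
print("Durbin-Watson:", durbin_watson(residuals))

# Normality: the Jarque-Bera test on the residuals is reported in the summary.
print(model.summary())
```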
2. What is missing data?
Missing data refers to values that are not stored or not available for some variables in a given dataset. For example, in a passenger dataset with 'Age' and 'Cabin' columns, some rows may have no value recorded in those columns.

3. How are missing data handled in a dataset?
There are various ways to handle missing data (a short pandas sketch follows the list).
- Rows with missing data are left out.
- Columns with some missing data are left out.
- Missing data is filled with a string or numerical constant.
- Missing data is replaced with the mean or median value of the column.
- Multiple regression analyses are used to predict missing data.
- Multiple imputation is used, where missing values are filled in repeatedly based on other columns plus a random error term.
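As an illustration of the first few strategies, here is a minimal pandas sketch (the small DataFrame is made up for demonstration):

```python
# Minimal sketch of common missing-data strategies with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, np.nan, 35, 29],
                   "Cabin": ["C85", None, None, "E46"]})

dropped_rows = df.dropna()                          # drop rows with any missing value
dropped_cols = df.dropna(axis=1)                    # drop columns with any missing value
filled_const = df.fillna({"Cabin": "Unknown"})      # fill with a constant
filled_mean = df.fillna({"Age": df["Age"].mean()})  # fill with the column mean
print(filled_mean)
```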
4. How is a missing value represented in a dataset?
Missing values or blanks are usually represented by NaN, which stands for Not a Number. When a dataset is loaded and displayed with Pandas, missing entries appear as NaN. 👇
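A minimal sketch, assuming a hypothetical file named "data.csv":

```python
import pandas as pd

df = pd.read_csv("data.csv")   # "data.csv" is a placeholder path
print(df.head())               # missing entries are displayed as NaN
print(df.isnull().sum())       # count of missing values per column
```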

5. What are the feature selection methods used to select the right variables?
Selecting the right features is a critical step in data analysis and model building processes. There are three main feature selection methods: Filter, Wrapper, and Embedded.
1️⃣ Filter Methods
Filter methods are usually used in the preprocessing step. They select features from a dataset independently of any machine learning algorithm, are fast, have low resource requirements, and are reproducible. Some of the techniques used are listed below (a short sketch follows the list):
- Correlation Analysis: By examining the relationships between variables, variables that have a strong correlation with the target variable are selected.
- Variance Analysis: By examining the variances of the variables, variables with low variance are eliminated.
- Comparative Statistics: Statistical tests that evaluate differences between groups (for example, t-tests or ANOVA) are used to select variables strongly related to the target variable.
- Chi-square Test: A statistical test that measures the dependence between a categorical feature and the target by comparing observed counts with expected counts.
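As a minimal illustration of a filter method, the sketch below scores features with the chi-square test using scikit-learn (the iris dataset is used purely as an example):

```python
# Minimal sketch of a filter method: scoring features independently of any model.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)
print("Chi-square scores:", selector.scores_)
print("Selected shape:", X_selected.shape)
```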
2️⃣ Wrapper Methods
In wrapper methods, a model is iteratively trained using a subset of features; based on the results of the trained model, features are added or removed. Wrapper methods are computationally more expensive than filter methods but usually provide better model accuracy. Some of the techniques used are listed below (see the sketch after the list):
- Forward Selection: Starting from an empty model, the best features are added one by one.
- Backward Elimination: Starting from a model that includes all features, the least impactful features are removed one by one.
- Recursive Feature Elimination (RFE): Features that contribute the least are iteratively removed based on the model’s performance evaluation.
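A minimal RFE sketch with scikit-learn (iris data and logistic regression chosen purely for illustration):

```python
# Minimal sketch of a wrapper method: Recursive Feature Elimination (RFE).
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print("Selected features:", rfe.support_)   # boolean mask of kept features
print("Feature ranking:", rfe.ranking_)     # 1 = selected
```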
3️⃣ Embedded Methods
Embedded methods combine the strengths of filter and wrapper methods: they are fast like filter methods, accurate like wrapper methods, and also take feature interactions into account. Common examples are listed below (a short sketch follows):
- Lasso and Ridge Regression: These regression techniques filter out unnecessary features by shrinking coefficients toward zero; Lasso can set coefficients to exactly zero, effectively removing those features.
- Decision Trees: These models help distinguish insignificant features by calculating the importance levels of features.
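A minimal embedded-method sketch using scikit-learn's Lasso on the diabetes dataset (both the dataset and the alpha value are chosen purely for illustration):

```python
# Minimal sketch of an embedded method: Lasso shrinks the coefficients of
# unnecessary features to exactly zero during training.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)   # indices of features with nonzero weight
print("Features kept by Lasso:", kept)
```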
6. What are the validation set and test set?
🆗 The validation set is used to tune the model's hyperparameters. It is carved out of the training data and helps detect and avoid overfitting while the model is being developed.
🔎 The test set is used to obtain an unbiased estimate of a trained machine learning model's performance on unseen data.
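As a minimal sketch (assuming scikit-learn and the iris dataset for illustration), both sets can be carved out with two successive train_test_split calls:

```python
# Minimal sketch: creating train, validation, and test sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test
print(len(X_train), len(X_val), len(X_test))
```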
7. What are the four main components of data science?
1️⃣ Data collection and cleaning
2️⃣ Data Exploration and Visualization
3️⃣ Model Building and Analysis
4️⃣ Interpretation of Results
8. What is overfitting in machine learning? How to avoid overfitting?
Overfitting refers to a model that is trained very well on a training dataset but fails on the test and validation datasets. It is an undesirable machine learning behavior.
An overfit model produces inaccurate predictions and fails to generalize to new, unseen data.
Overfitting can be prevented by:
- Keeping the model simple by reducing model complexity, considering fewer variables, and reducing the number of parameters in neural networks.
- Using cross-validation techniques.
- Training the model with more data.
- Using data augmentation to increase the number of training samples.
- Using ensemble methods (Bagging and Boosting).
- Using regularization techniques that penalize the model parameters most likely to cause overfitting (a brief sketch follows).
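As one illustration, here is a minimal sketch of L2 regularization with scikit-learn's LogisticRegression (the dataset and C value are chosen arbitrarily):

```python
# Minimal sketch: L2 regularization. A smaller C means stronger regularization,
# which constrains the weights and helps reduce overfitting.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
regularized = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)
print("Training accuracy:", regularized.score(X, y))
```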
9. How is overfitting detected?
It is important to test model fit to understand the accuracy of machine learning models. One of the best methods to detect overfitting models is the K-fold cross-validation method.
🗯️ K-fold cross-validation works as follows: the data is divided into k equal-sized subsets, also called "folds". One of the k folds serves as the test set (also known as the validation set) while the remaining folds train the model. The process is repeated k times so that each fold is used for evaluation exactly once.
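A minimal 5-fold cross-validation sketch with scikit-learn (dataset and model chosen purely for illustration):

```python
# Minimal sketch: 5-fold cross-validation. A large gap between training
# accuracy and cross-validation accuracy suggests overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())
```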
10. What is a recommendation system in machine learning?
Recommendation systems are information filtering systems that aim to predict a user's rating of, or preference for, a product.
For example, the product recommendations section on Amazon is a recommendation system. This section lists products based on the user's search history and past orders.
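As a rough illustration of the idea, here is a minimal item-based collaborative filtering sketch using cosine similarity (the ratings matrix is made up):

```python
# Minimal sketch: item-item cosine similarity from a user-item ratings matrix.
# Rows are users, columns are items; 0 means "not rated".
import numpy as np

ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Cosine similarity between item columns: values near 1 mean the items
# receive similar rating patterns, so one can be recommended alongside the other.
norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)
print(np.round(similarity, 2))
```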
11. What are the different types of relationships in SQL?
There are four main types of SQL relationships:
- 1️⃣ One-to-One: Each record in one table is related to exactly one record in another table.
- 2️⃣ One-to-Many: Each record in one table is linked to several records in another table.
- 3️⃣ Many-to-Many: Each record in the first table is related to multiple records in the second table, and each record in the second table is related to multiple records in the first.
- 4️⃣ Self-Referencing: Occurs when a table needs to reference itself, so records are related to other records within the same table.
12. What is Dimensionality Reduction? What are its benefits?
✅ Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while preserving as much information as possible.
✅ It is done to reduce the complexity of a model, improve the performance of the learning algorithm, or facilitate data visualization.
✅ There are various techniques such as principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA). Each projects the data into a lower-dimensional space while retaining as much significant information as possible.
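A minimal PCA sketch with scikit-learn, reducing the 4-dimensional iris data to 2 dimensions (the dataset is chosen purely for illustration):

```python
# Minimal sketch: dimensionality reduction with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```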
13. What is Supervised and Unsupervised learning?
👉 Supervised learning is a machine learning approach defined by the use of labeled datasets. The labels are used to train algorithms to classify data or predict outcomes accurately; in effect, they "supervise" the learning.
👉 Unsupervised learning is an approach where inferences are made from datasets containing input data without labeled responses. Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled datasets.
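To make the contrast concrete, here is a minimal sketch assuming scikit-learn (the iris dataset is illustrative): the supervised classifier learns from the labels y, while the clustering algorithm ignores them entirely.

```python
# Minimal sketch: supervised classification vs. unsupervised clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

supervised = LogisticRegression(max_iter=1000).fit(X, y)        # uses labels
unsupervised = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # labels unused
print("Supervised accuracy:", supervised.score(X, y))
print("Cluster assignments:", unsupervised.labels_[:10])
```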
14. What are the differences between Supervised and Unsupervised learning?
| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Uses techniques like Decision Trees, K-Nearest Neighbors, Neural Networks, Regression, and Support Vector Machines. | Uses techniques like Anomaly Detection, Clustering, Latent Variable Models, and Neural Networks. |
| The goal is to predict outcomes for new data. | The goal is to discover hidden patterns and structure in the data. |
| Uses labeled input and output data. | Uses unlabeled data. |
| It is a simpler approach and is usually run with tools like R or Python. | It is a more complex approach and requires more powerful tools. |
15. What is Selection Bias and what are the different types of it?
Selection bias is a type of error that arises when the process of selecting a dataset yields a subset that is not representative of the population (for example, a non-random or non-probability sample).
Selection bias often occurs in studies like cohort studies, case-control studies, and cross-sectional studies where the selection of participants is not random. Selection bias is a type of systematic error.
There are various types of selection bias:
- ✔️ Sampling Selection Bias: It occurs as a result of not selecting randomly during the sampling process or excluding a specific subset from the sample. For example, if a survey only includes people from a certain age group, the survey results may not reflect the general population.
- ✔️ Data Selection Bias: A type of selection bias that results from errors or biases in how individual data points are chosen.
- ✔️ Attrition Bias: Refers to the situation where participants drop out of a study, or incomplete observations are excluded from the analysis.
16. Which would you choose between Python and R for text analysis?
For text analysis, Python has an advantage over R for several reasons.
🆚 Python’s Pandas library offers easy-to-use data structures. Additionally, it provides high-performance data analysis tools.
🆚 Python generally offers faster performance for most text analysis tasks.
17. What is the purpose of data cleaning in data analysis?
Data cleaning identifies and corrects errors, repetitions, and irrelevant data in a dataset. Data cleaning is part of the data preparation process.
As the number of data sources increases, the time required for data cleaning grows rapidly, which can make it seem like a daunting task: each additional source produces more, and more varied, data.
The purposes of data cleaning are:
🧹 Cleaning data from different sources makes working with data easier.
🧹 Data cleaning increases the accuracy of the machine learning model.
🧹 Data cleaning produces consistent, structured, accurate data that enables smart decision-making.
🧹 It saves time and money and highlights areas for improvement.
18. How can Euclidean distance be calculated in Python?
Euclidean distance is a metric that measures the straight-line distance between two points. There are several built-in modules and functions for calculating it in Python: the NumPy module, math.dist(), or SciPy's distance.euclidean() can be used.
📍 It is calculated in this way using NumPy:
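```python
# The points below are illustrative; substitute your own coordinates.
import numpy as np

point1 = np.array([1, 2, 3])
point2 = np.array([4, 5, 6])
print(np.linalg.norm(point1 - point2))  # ≈ 5.196

# Equivalent alternatives:
import math
print(math.dist([1, 2, 3], [4, 5, 6]))

from scipy.spatial import distance
print(distance.euclidean([1, 2, 3], [4, 5, 6]))
```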

19. What is an Epoch?
An epoch is one complete pass of the entire training dataset through the learning algorithm. In neural network training, an epoch is complete when every training sample has passed through the network once in both the forward and backward directions. The number of epochs is considered a hyperparameter.
20. What is an Iteration?
An iteration is a single update step within an epoch: the model processes one batch of data and updates its parameters. The number of iterations per epoch equals the dataset size divided by the batch size.
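For example, with 2,000 training samples and a batch size of 100, one epoch consists of 2,000 / 100 = 20 iterations.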
21. What is a cost function?
A cost function quantifies the error between a model's predictions and the actual values; training seeks the parameters that minimize it.
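For example, mean squared error (MSE), a common cost function for regression, can be computed as follows (the values are illustrative):

```python
# Minimal sketch: mean squared error (MSE) as a regression cost function.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0 + 0.25) / 3 ≈ 0.1667
```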
22. What are Hyperparameters?
Hyperparameters are parameters that control the learning process and, through it, the final values of the model parameters. The prefix 'hyper' indicates that they are 'higher-level' parameters governing the learning process. In short, they are parameters whose values are set before learning begins.
23. What is batch normalization?
Batch normalization is a technique used in deep learning and artificial neural networks to help networks converge faster and more stably during training, by normalizing each layer's inputs over the current mini-batch. It was developed to make the training of deep neural networks more efficient.
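A minimal NumPy sketch of the batch normalization forward pass (shapes and parameters are illustrative; real frameworks such as Keras or PyTorch provide built-in layers):

```python
# Minimal sketch: batch normalization forward pass. Each feature is
# normalized using the batch mean and variance, then scaled and shifted
# by learnable parameters gamma and beta.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                   # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))    # ≈ 0 and ≈ 1 per feature
```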
Start Your Data Science Career
At Coderspace, companies apply to candidates! Software developers and data scientists land roles at top companies through their Coderspace profiles. You can sign up to Coderspace and we can help you find the right job.