Data Science Interview Questions and Answers
Data Science is no longer just a buzzword; companies increasingly need to analyze their available data to steer the business in the right direction. If you start searching for a job, you will see the growing demand for Data Scientists on job portals across the globe. The increasing use of AI, machine learning, and deep learning applications in daily life will create more jobs in the market. All you need to do is upgrade your skill set with the Mindmajix Data Science Course to match the changing demand before it's too late.
Our experts have also compiled the best Data Science Interview Questions to help you clear your interview and land your dream job as a Data Scientist.
1Q) What is Data Science?
Ans: Data Science is the practice of using algorithms, big data analytics, machine learning, and programming languages such as Python and R to uncover hidden patterns in data and predict useful results.
2Q) What is the difference between supervised and unsupervised learning?
Ans: Supervised learning uses labeled training data, i.e., data with known outputs, and uses a feedback mechanism (the error between predictions and known labels), whereas unsupervised learning uses unlabeled data and has no such feedback mechanism. Decision trees, regression, and support vector machines are examples of supervised learning. K-means clustering, hierarchical clustering, and the Apriori algorithm are examples of unsupervised learning.
3Q) What is selection bias?
Ans: Selection bias is an error that occurs when the data is selected incorrectly, i.e., when individuals, groups, or data are chosen in such a way that they don't actually represent the population intended for the research.
4Q) What is a Confusion matrix?
Ans: A Confusion matrix is a 2 x 2 table (for a binary classifier) that summarizes four outcomes by comparing predicted classes with actual classes. Based on these four results, you can measure characteristics of the model such as error rate, sensitivity, accuracy, precision, etc.
The Confusion matrix contains the following four values:
True Positive (TP): Correct positive prediction
True Negative (TN): Correct negative prediction
False Positive (FP): Incorrect positive prediction
False Negative (FN): Incorrect negative prediction
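As a rough illustration of how the four outcomes turn into metrics, here is a minimal sketch in plain Python. The labels are made up for the example, and `confusion_counts` is a hypothetical helper name, not a library function.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical labels, for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
total = tp + tn + fp + fn

accuracy = (tp + tn) / total      # share of all predictions that are correct
error_rate = (fp + fn) / total    # share of all predictions that are wrong
sensitivity = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)        # share of predicted positives that are right

print(tp, tn, fp, fn, accuracy, sensitivity, precision)
```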
5Q) What is the difference between the 'wide' and 'long' format?
Ans: The 'wide' format has all the responses in the same row, whereas the 'long' format captures only one response in a row - for another response, it uses another row.
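The two layouts can be sketched with Pandas, assuming it is installed. The survey data and the column names (`respondent`, `q1`, `q2`) are invented for the example.

```python
import pandas as pd

# Hypothetical survey data in wide format: one row per respondent,
# one column per response.
wide = pd.DataFrame({
    "respondent": ["A", "B"],
    "q1": [5, 3],
    "q2": [4, 2],
})

# Long format: one row per (respondent, question) response.
long = wide.melt(id_vars="respondent", var_name="question", value_name="answer")

# Going back from long to wide.
back = long.pivot(index="respondent", columns="question", values="answer")

print(long)
print(back)
```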
6Q) What is the normal distribution?
Ans: A data set can be distributed in different ways: skewed to the left, skewed to the right, or jumbled up with no clear pattern. In a normal distribution, the data is distributed symmetrically around a central peak, where the mean, median, and mode coincide. This gives it a bell-shaped curve.
7Q) What do you mean by correlation and covariance?
Ans: Correlation and covariance are concepts from probability and statistics. Covariance signifies how two random variables deviate together from their expected values, and its magnitude depends on the units of the variables. Correlation is the covariance scaled by the standard deviations of the two variables, so it gives a unit-free measure of the strength of the linear relationship, between -1 and +1. A positive value shows that the variables tend to increase together, whereas a negative value shows that one tends to decrease as the other increases.
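To make the relationship between the two measures concrete, here is a small sketch in plain Python using the sample (n - 1) definitions. The `hours` and `score` values are hypothetical.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled into the range [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

print(covariance(hours, score))   # depends on the units of both variables
print(correlation(hours, score))  # unit-free, close to +1 for this data
```

Rescaling one variable (for example, minutes instead of hours) changes the covariance but leaves the correlation unchanged.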
8Q) Define A/B testing?
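Ans: A/B testing is a controlled experiment that compares two versions (A and B) of something, such as an advertisement or a web page, to see which one performs better. For an advertisement, it helps decide what to modify, how much to modify, at what places, etc., and then shows what the result of the modification was, for example whether the number of clicks increased.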
9Q) What is Pandas? Why is it used?
Ans: Pandas is an open-source Python library used mainly for data manipulation and analysis. It can calculate the mean, median, and mode of a given data set, and operations such as modifying rows and columns become easy with it. It is built on top of NumPy and works well with Python lists and dictionaries. It handles one-dimensional data (Series) and two-dimensional data (DataFrame), including time-series data, and higher-dimensional data can be represented with hierarchical indexes.
10Q) What is Data alignment in Pandas?
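Ans: Data alignment is how Pandas matches data by index and column labels, rather than by position, when objects are combined. For example, when you add two Series, values with the same index label are added together, and labels that appear in only one of them produce missing values (NaN).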
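A short sketch of label-based alignment when two Series are added, assuming Pandas is installed. The values and labels are arbitrary.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Values are matched by index label, not by position.
# Labels present in only one Series ("x" and "w") become NaN.
total = a + b

# Treat a missing label as 0 instead of producing NaN.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```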
11Q) What is Matplotlib?
Ans: Matplotlib is a Python library used for data visualization in the form of plots such as line plots, bar plots, scatter plots, box plots, etc. Its main module is pyplot: Matplotlib is the package name and pyplot is the module inside it that provides the plotting functions. It integrates easily with NumPy and Pandas.
12Q) What do you understand by Series object and Dataframe in Pandas?
Ans: A Series object holds one-dimensional labeled data, whereas a DataFrame holds two-dimensional data in rows and columns.
13Q) What is Numpy?
Ans: NumPy is a Python library for numerical computing, including linear algebra. It is used for mathematical and logical operations on arrays. Its arrays have extra features compared to normal Python lists, such as element-wise operations. It can be used for one-dimensional, two-dimensional, and multi-dimensional arrays.
The syntax to import NumPy in Python:
import numpy as np
14Q) What is a subplot?
Ans: A subplot is used to place multiple graphs in a single figure.
Data Science Certification Questions
15Q) How many different types of plots do you use in Matplotlib? Where are these plots used?
Ans: Following are some of the plots that you can use in Matplotlib:
Line plot: Points are joined by a line. It is used for continuous data, for example to show a trend or the fitted line of a linear regression model.
Bar plot: It is used to compare values across categories.
Scatter plot: Points are not joined. If continuous data is not available, use a scatter plot; otherwise you can use a line plot. It is also used to visualize different clusters.
Histogram: It shows the frequency of data points in each range of values, for example the distribution of CPU usage readings or of pixel intensities in image processing.
16Q) What is a Box plot?
Ans: A Box plot contains quartile information. It shows the minimum value, first quartile, median, third quartile, and maximum value, and often marks outliers as separate points. It shows the distribution and spread of the data. The height of the box signifies the interquartile range, i.e., the range that contains the middle 50% of the points.
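The numbers a box plot draws can be computed directly. The sketch below uses the standard library's `statistics.quantiles` with the "inclusive" method; plotting libraries may use slightly different quartile conventions, and the readings are invented for the example.

```python
from statistics import quantiles

# Hypothetical readings, for illustration only.
readings = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# The three cut points that split the data into four equal parts.
q1, median, q3 = quantiles(readings, n=4, method="inclusive")

# The box spans q1..q3, so its height is the interquartile range.
iqr = q3 - q1

five_numbers = (min(readings), q1, median, q3, max(readings))
print(five_numbers, iqr)
```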
17Q) What is a Quiver plot?
Ans: The Quiver plot shows the vector direction as arrows.
18Q) What is a Violin plot?
Ans: The plot is shaped like a violin. It combines a box plot with a mirrored estimate of the probability density function, so it shows where the data is concentrated.
19Q) How do you calculate the fitness of a regression line?
Ans: First calculate the error of the line, which is the distance between the actual points and the predicted points.
Suppose (x1, y1), (x2, y2), ..., (xn, yn) are points on the plane. The vertical distances between the regression line and the data points are known as error values.
Calculate the squared error of the line, where $y_i$ is the actual value (YA) and $\hat{y}_i$ is the value predicted by the line (YP) at $x_i$:
$SE_{line} = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_n - \hat{y}_n)^2$
This is the total squared distance from each y value of the points to the regression line.
Calculate the mean of all the y values, $\bar{y}$, and then the squared error from the mean line:
$SE_{\bar{y}} = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_n - \bar{y})^2$
This is the total variation of the y values around their central tendency.
The goodness of fit of the line is $R^2 = 1 - \frac{SE_{line}}{SE_{\bar{y}}}$. This is known as the R2 value (coefficient of determination); the regression line itself is usually fitted with the least squares method.
Following are the inferences from the R2 method:
If the SE line value is close to 0, R2 is nearly 1. This is a good fit.
If the SE line value is nearly as high as the SE Y̅ value, R2 is approximately 0. This is a bad fit.
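The R2 calculation described above can be sketched in a few lines of plain Python. The data points and the two sets of predictions are hypothetical; `bad_fit` simply predicts the mean of y for every point.

```python
def r_squared(actual, predicted):
    """R2 = 1 - SE_line / SE_mean for paired actual and predicted y values."""
    mean_y = sum(actual) / len(actual)
    # Squared error between the actual points and the regression line.
    se_line = sum((ya - yp) ** 2 for ya, yp in zip(actual, predicted))
    # Squared error between the actual points and the mean line.
    se_mean = sum((ya - mean_y) ** 2 for ya in actual)
    return 1 - se_line / se_mean


# Hypothetical values, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
good_fit = [2.1, 3.9, 6.2, 7.8]   # predictions close to the actual values
bad_fit = [5.0, 5.0, 5.0, 5.0]    # always predicts the mean of y

print(r_squared(actual, good_fit))  # close to 1
print(r_squared(actual, bad_fit))   # 0.0
```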
20Q) What is Data Visualization?
Ans: In real scenarios, you have a large number of rows and columns of data that are difficult to interpret directly. Representing the data graphically makes it easier to analyze; this is called data visualization.
21Q) What is the difference between a bar graph and a histogram?
Ans: A bar graph shows categorical data (place, product type, etc.) whereas a histogram shows numerical data. A bar graph compares discrete categories, but a histogram displays the frequency distribution of continuous data grouped into ranges (bins).
22Q) Where are pie charts used?
Ans: A Pie chart is used to show how a whole is divided into parts, as percentages or proportions. It is easiest to read when there are only a few categories.
23Q) What is a Doughnut chart?
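Ans: A Doughnut chart is the same as a Pie chart, the difference being that it has a hole in the center, so the proportions are read from the arc lengths of the ring rather than from the angles of the slices. One way to build it is to draw a Pie chart and place a smaller blank circle over its center.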
24Q) What is a Sigmoid curve?
Ans: The S-shaped curve is known as the Sigmoid curve. It is asymptotic: its output approaches 0 and 1 but never reaches either, so it always stays in between.
Its input x can take any value between -∞ and +∞.
Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
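A minimal sketch of the Sigmoid function in Python follows. The two-branch form is one common way to avoid overflow in `math.exp` for large negative inputs; it is mathematically the same function as 1 / (1 + e^-x).

```python
import math


def sigmoid(x):
    """Sigmoid 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, e^x / (1 + e^x) is the same value and cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-10, -2, 0, 2, 10):
    print(x, sigmoid(x))
```

Mathematically the output never reaches 0 or 1; in floating-point arithmetic it can round to those values for very large inputs.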
25Q) What is a probability density function?
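Ans: The probability density function describes how likely a continuous random variable is to take values near a given point. The area under the curve over a range gives the probability of the value falling in that range.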
26Q) What do you mean by P-value?
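Ans: The P-value signifies the strength of the evidence against the null hypothesis. It is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. If P<0.05 (a commonly used significance level), the result is considered statistically significant and the null hypothesis is rejected. When P>0.05, the evidence is not strong enough and the null hypothesis is not rejected.
P=0.05 is a borderline value, where the hypothesis can go either way.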
Data Science Manager Interview Questions
27Q) What is Bayes' theorem?
Ans: Bayes' theorem gives the probability of an event (A) given that another event (B) has already happened or is true. For example, based upon an examination, 30% of people who had a positive test for cancer were diagnosed with cancer later; here A is having cancer and B is testing positive.
P(A|B) = P(B|A) P(A) / P(B)
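Here is a small worked sketch of the formula for a diagnostic test. All three input numbers (prevalence, sensitivity, false positive rate) are hypothetical and are not taken from the example above; `posterior` is an illustrative helper name.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) from P(A), P(B|A), and P(B|not A) using Bayes' theorem."""
    # P(B) by the law of total probability.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, for illustration only.
p_disease = 0.01            # P(A): prevalence of the disease
p_pos_given_disease = 0.9   # P(B|A): test is positive when the disease is present
p_pos_given_healthy = 0.05  # P(B|not A): test is positive when it is absent

p_disease_given_pos = posterior(p_disease, p_pos_given_disease, p_pos_given_healthy)
print(round(p_disease_given_pos, 4))
```

With these assumed inputs, a positive test still leaves the probability of disease well below 50%, because the disease is rare in the first place.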
28Q) What are the differences between overfitting and underfitting?
Ans: Overfitting and underfitting are errors that occur when selecting and training a model. Overfitting happens when the model fits the training data too closely, including noise and details that are not needed for an accurate decision, so it performs well on the training data but poorly on new data. You can use cross-validation, regularization, pruning, and model comparison to reduce the chances of overfitting.
Underfitting happens when the model is too simple or has not learned enough from the data to capture the underlying pattern. Such models perform poorly on both training and new data. You can reduce underfitting by using a more expressive model, adding informative features, or training for longer.
29Q) What is the K-means algorithm?
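Ans: K-means is a clustering algorithm. Here, K is the number of clusters, and "means" refers to the mean (centroid) of each cluster, calculated from the data points assigned to it. The algorithm repeatedly assigns each point to its nearest centroid and recalculates the centroids until the assignments stop changing.
It is used, for example, for grouping sensor measurements, behavioral segmentation, inventory categorization, etc.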
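The assign-and-update loop behind K-means can be sketched in plain Python as below. This is a simplified illustration with fixed starting centroids and a fixed number of iterations, not a production implementation; the points are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means on 2-D points; K is the number of starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        centroids = new_centroids
    return centroids, clusters


# Hypothetical points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```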
30Q) What is data cleaning?
Ans: Data cleaning comprises many steps to ensure that impurities such as missing, duplicate, inconsistent, or incorrect data are removed or corrected. It can also include removing data that is not useful for the analysis, even though that data is logical and meaningful.
31Q) What is a ROC curve?
Ans: ROC stands for Receiver Operating Characteristic. The ROC curve plots the true positive rate against the false positive rate at different classification thresholds. It helps to find the right trade-off between the true positive and false positive rates.
32Q) What would you prefer - Python or R for text analytics and why?
Ans: Python would often be preferred because it has Pandas, which provides easy-to-use data structures and data analysis tools. Moreover, Python is generally considered faster than R for text analytics.
33Q) Differentiate among Univariate, Bivariate, and Multivariate.
Ans: There are several variables used for the analysis. When only one variable is used for the statistics, such as to calculate the sales in a particular area, it is known as univariate.
When two variables are used for analysis, it is known as bivariate. For example, to analyze the sales and the expenses incurred for a particular area.
The multivariate analysis uses more than two variables for analysis.
34Q) What are interpolation and extrapolation?
Ans: Interpolation is the method of estimating a value that lies between two known values in a given set of values. Extrapolation is the method of estimating a value that lies beyond the range of the given set by extending its trend.
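As a simple sketch, both ideas can be shown with a straight line through two known points. Linear estimation is only one of several possible methods, and the points used here are hypothetical.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Known (hypothetical) points: (2, 20) and (3, 30).
interpolated = linear_estimate(2.5, 2, 20, 3, 30)  # x lies between the known points
extrapolated = linear_estimate(5, 2, 20, 3, 30)    # x lies beyond the known range

print(interpolated, extrapolated)
```

Extrapolated values are usually less reliable, because nothing guarantees that the trend continues outside the observed range.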
35Q) What is clustering? How is it different from classification?
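Ans: Grouping unlabeled data into sets of similar items is called clustering. The groups are not known in advance; the algorithm discovers them from the data.
In classification, the groups (classes) are defined in advance, and a model trained on labeled examples assigns each data point to one of them. The class labels can be numeric or string, such as (0, 1) or (cat, dog, monkey, ...).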
36Q) A person goes fishing five times a week. He caught a fish twice and failed to catch one three times. What's the probability and odds of getting a fish for dinner?
Ans: Probability = Chances that he caught the fish/Total number of chances = 2/5 = 0.4
Odds = Chances that he caught the fish/Chances that he failed to catch = 2/3 ≈ 0.67
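The same arithmetic can be checked with exact fractions, using the counts from the question (two catches, three misses):

```python
from fractions import Fraction

caught, missed = 2, 3

# Probability compares successes with all attempts.
probability = Fraction(caught, caught + missed)

# Odds compare successes with failures.
odds = Fraction(caught, missed)

print(probability, float(probability))
print(odds, round(float(odds), 2))
```

The two are linked by odds = p / (1 - p), so either one can be recovered from the other.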
37Q) Define sampling. What are the different types of sampling?
Ans: Sampling is the method of selecting a representative group of data from a large set of data, and then analyzing it to draw conclusions or make forecasts about the whole set.
Sampling can be of the following types:
Simple random sampling
38Q) Explain the Validation dataset and Test dataset?
Ans: A validation dataset is used to tune the model and avoid overfitting. It compares the performance of the candidate models fitted on the training dataset and helps decide which one to select for the analysis.
A test dataset is kept separate from the training and validation datasets. It assesses the performance of the selected model on unseen data.
39Q) What is cross-validation?
Ans: Cross-validation is used to estimate how well a machine-learning model will perform on data it has not seen.
Cross-validation consists of the following steps:
The data set is divided arbitrarily into a training dataset and a test dataset.
The model is fitted on the training dataset, and the test dataset verifies the performance of the model.
Step 1 and step 2 are repeated several times with different splits, and the results are averaged to estimate the performance.
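One common way to organize the repeated splits is k-fold cross-validation, where every sample is used for testing exactly once. The sketch below only generates the index splits; `k_fold_indices` is a hypothetical helper, and fitting and scoring a model on each split is left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # arbitrary but reproducible division
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n_samples=10, k=5):
    # A model would be fitted on `train` and scored on `test` here,
    # and the k scores averaged afterwards.
    print(sorted(train), sorted(test))
```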
Data Science Interview Questions for Freshers
40Q) What is Machine learning?
Ans: Machine learning is the study of algorithms and models that perform a specific task with the help of patterns and inferences found in data, rather than explicit instructions. It uses mathematical models that are built on sample data, known as training data, to make predictions. Machine learning is the basis of much of predictive analytics.
41Q) How does the recommender system work?
Ans: A recommender system predicts how a user will rate a product, for example whether the user will give it a good rating, and recommends items accordingly. It is used in areas such as news, movies, social media, online shopping websites, etc.
42Q) What is linear regression?
Ans: Linear regression is the method of finding out the relationship between the criterion variable (dependent variable) and one or more predictor variables (independent variables).
43Q) What is logistic regression? How is it different from linear regression?
Ans: Logistic regression provides the probability of an event happening, such as pass/fail, yes/no, alive/dead, etc. The predicted probability lies between 0 and 1 and is converted to a class of either 1 or 0 using a threshold. It is also known as logit regression.
The main difference between linear and logistic regression is that the output of linear regression is continuous and can take a range of values, whereas the outcome of logistic regression is binary and takes either a True (1) or False (0) value. Linear regression can even produce negative values, which is not possible with logistic regression.
Logistic regression uses the Sigmoid function to map its linear output to a probability.
44Q) Why can't we use mean squared error in logistic regression?
Ans: Gradient descent searches for the minimum of the cost function. In logistic regression, the prediction passes through the Sigmoid function, so a mean squared error cost is no longer convex in the model parameters: the cost curve becomes wavy and has many false minimum values, also known as local minima. Gradient descent can get stuck at these local minima, which gives you a false result.
With the log loss (cross-entropy) cost normally used for logistic regression, this is not the case, as that cost function is convex and there is no question of local minima.
45Q) How can you handle missing values while conducting an analysis?
Ans: First, find out which variables have missing values and then look for patterns in the missing data. If there is no identifiable pattern, substitute the missing values with the mean or median values. If most of the values of a variable are missing, say 80%, you can drop that variable altogether.
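A minimal sketch of the substitute-or-drop rule in plain Python, with `None` marking a missing value. The column names, the data, and the helper names (`missing_fraction`, `impute`, `clean`) are invented for the example; the 80% threshold follows the figure mentioned above.

```python
from statistics import median


def missing_fraction(values):
    """Share of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute(values, strategy=median):
    """Replace missing entries with the median (or mean) of the known ones."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


def clean(columns, drop_threshold=0.8):
    """Drop columns that are mostly missing; impute the rest."""
    cleaned = {}
    for name, values in columns.items():
        if missing_fraction(values) >= drop_threshold:
            continue  # too little information left to be worth keeping
        cleaned[name] = impute(values)
    return cleaned


# Hypothetical data set, for illustration only.
data = {
    "age": [25, None, 35, 45, None],
    "fax": [None, None, None, None, 1],
}
print(clean(data))
```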
46Q) What is Deep learning?
Ans: Deep learning is a subset of machine learning that trains multi-layer neural networks so that machines can perform tasks associated with the human brain, such as data processing, decision making, etc. It is useful in many fields, from social networking to medical image processing to speech recognition.
Following are the deep learning frameworks:
Microsoft Cognitive Toolkit
47Q) Define Artificial Neural Network?
Ans: An Artificial Neural Network is a framework, modeled loosely on the human brain, made of layers of interconnected nodes (neurons). The network learns from the examples it is given in order to process a request and provide the solution. It learns from examples instead of following explicitly programmed task-specific rules.
48Q) Explain backpropagation and its variants?
Ans: Backpropagation is the method of propagating errors backward through the network. It calculates the gradient of the error function with respect to the neural network weights, and gradient descent then uses this gradient to update the weights.
Following are the variants of gradient descent used with Backpropagation:
Stochastic Gradient Descent: A single training example is used to calculate the gradient.
Batch Gradient Descent: The complete dataset is used to calculate the gradient.
Mini-batch Gradient Descent: Mini batch is used to calculate the gradient.
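The three variants differ only in how many training examples feed each gradient step. The sketch below shows this on a deliberately tiny model, y = w * x with a squared error cost, rather than a real neural network; the data, learning rate, and function names are assumptions made for the illustration.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, lr=0.01, epochs=200, seed=0):
    """Gradient descent; batch_size selects the variant."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= lr * gradient(w, batch)
    return w


# Hypothetical noise-free data generated from y = 3x.
data = [(x, 3.0 * x) for x in range(1, 9)]

w_batch = train(data, batch_size=len(data))  # batch: whole dataset per step
w_sgd = train(data, batch_size=1)            # stochastic: one example per step
w_mini = train(data, batch_size=4)           # mini-batch: small groups per step

print(w_batch, w_sgd, w_mini)
```

On this noise-free data all three settle near w = 3; on real data the stochastic and mini-batch variants take noisier steps but update the weights far more often per pass over the data.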
49Q) What's the difference between training data and test data?
Ans: The whole data set is first cleaned and then divided into a training data set and a test data set. The ratio could be 80/20, 75/25, 70/30, etc. The training data set is usually larger than the test data set, although it cannot be 100%, because some data must be held back for testing.
The training data set is used to fit the model, i.e., the algorithm learns the patterns in this data. The test data set, which the model has not seen during training, is then used to test the model for its correctness and accuracy.
50Q) What is a Random forest?
Ans: A random forest is a machine learning approach used to perform regression and classification tasks. It is made up of many decision trees, whose outputs are combined into one prediction. This method can also be used to handle missing values, outlier values, etc. It is an ensemble learning method, as it combines weak models to form a powerful model.
51Q) Differentiate among machine learning, deep learning, and reinforcement learning?
Ans: Machine learning is the learning approach in which computers learn patterns from data without being explicitly programmed. Deep learning is a subset of machine learning in which multi-layer neural networks learn to make decisions from large numbers of examples. Reinforcement learning helps computers learn to make decisions based on reward/penalty feedback.
52Q) What is pruning and splitting in a decision tree?
Ans: Pruning is the process of removing the less powerful nodes from a decision tree. It reduces the size of the decision tree. Dividing a node into child nodes is called splitting.
Data Science Interview Questions for Experienced
53Q) What is information gain in a decision tree?
Ans: Information gain signifies the reduction in entropy after splitting a dataset on an attribute. A decision tree aims at finding the splits that give the highest information gain.
54Q) What is a Gini index?
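Ans: A Gini index measures the impurity of a node in a decision tree, i.e., how mixed the classes in that node are. It is 0 when all samples in the node belong to one class, and it is used as a criterion for choosing splits.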
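A small sketch of the Gini impurity of a single node, computed as 1 minus the sum of squared class proportions. The labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini(pure_node))   # 0.0: only one class in the node
print(gini(mixed_node))  # 0.5: two classes in equal proportion
```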
55Q) Differentiate between false positives and false negatives.
Ans: False positives are the instances when a non-event is wrongly classified as an event. This is also referred to as a Type I error.
False negatives are the instances when events are wrongly classified as a non-event. This type of error can also be known as a Type II error.
56Q) What are Entropy and Information gain in a Decision Tree algorithm?
Ans: The decision tree works with a top-down approach. It starts from the root node, and the data is partitioned downwards to make the sub-nodes or branches of the tree. Entropy measures the impurity of a partition: if a partition is completely homogeneous (all samples belong to one class), the entropy is zero; if a two-class partition is divided equally between the classes, the entropy is one.
Information gain is the reduction in entropy achieved by splitting the data on an attribute. It is highest when the split partitions the data into homogeneous sets. Hence, an increase in information gain indicates a decrease in entropy.
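Entropy and information gain can be sketched as follows, using base-2 logarithms so that an evenly divided two-class node has entropy 1. The labels and the two candidate splits are made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]

# A split that produces homogeneous children removes all the entropy.
clean_split = [["yes", "yes"], ["no", "no"]]
# A split whose children are as mixed as the parent gains nothing.
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent))
print(information_gain(parent, clean_split))
print(information_gain(parent, useless_split))
```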
57Q) What would you prefer, a decision tree method or a random forest method?
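Ans: Though both methods can provide similar results in many cases, the random forest method is usually preferred because combining many trees reduces overfitting and generally gives higher accuracy. A single decision tree can still be preferred when the model must be easy to interpret.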
58Q) What is Dropout and Batch Normalization?
Ans: Dropout is the process of randomly dropping (ignoring) nodes in the hidden and visible layers during training to avoid overfitting. Batch normalization is a method of improving the performance and increasing the stability of the neural network. It normalizes the inputs of every layer so that, over each mini-batch, the mean is zero and the standard deviation is one.
59Q) What are TensorFlow and its role in Deep learning?
Ans: TensorFlow is an open-source library used for numerical computation to ease machine learning and deep learning. It provides both C++ and Python APIs, and its compilation time is often reported to be shorter than that of some other deep learning libraries such as Torch.
60Q) What does Tensor mean in 'TensorFlow'?
Ans: A tensor is an n-dimensional array of data, where n is its rank (the number of dimensions), fed as an input to the neural network.
61Q) What is the significance of the R programming language in Data Science?
Ans: R is heavily used in big data analytics and is one of the most widely used languages in data science worldwide. For example, Google, Facebook, and LinkedIn use R for many of their functions. R is an open-source, highly compatible, and platform-independent language.
62Q) What is sorting in Python?
Ans: In Pandas, a DataFrame can be sorted by its index with the sort_index() method. By default, it sorts in ascending order. To sort by the values of a column instead, use the sort_values() method.
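A short sketch of index and value sorting, assuming Pandas is installed. The DataFrame contents are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"score": [70, 55, 61]}, index=["c", "a", "b"])

by_index = df.sort_index()                      # ascending by index label
by_index_desc = df.sort_index(ascending=False)  # descending by index label
by_value = df.sort_values("score")              # ascending by column value

print(by_index)
print(by_index_desc)
print(by_value)
```

Each call returns a new sorted DataFrame and leaves `df` unchanged.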
63Q) How much Python knowledge is required to learn data science?
Ans: To do data analytics using Python, you should have a good understanding of basic programming concepts such as functions, libraries, data types, lists, tuples, etc. Basic knowledge of array operations, NumPy, the Pandas DataFrame, matrix operations, and element-wise vector operations is also desired. Besides, you should also know scikit-learn, loops, how to write small functions, sorting, and indexing.