**Data Science Interview Questions And Answers**

**Ans. **Data Science is the study of data. It involves defining standard methods for recording, extracting, analyzing, and storing data so that useful information can be derived from it.

The ultimate goal of Data Science is to gain better insight into data, whether it is in a structured or an unstructured format.

The below tabular format will provide more details about Data Science, Machine Learning and Artificial Intelligence

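**Ans. **A botnet is a network of compromised computers (bots) that an attacker controls remotely, typically after infecting them with malware such as a Trojan. Early botnets were commonly coordinated over IRC networks.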

**Ans. **Data visualization is a common term that describes any effort to help people understand the significance of data by placing it in a visual context.

**Ans. **Cleaning up data to the point where you can work with it is a huge amount of work. If we are trying to reconcile many sources of data that we do not control, cleaning can take as much as 80% of our time.

**Ans. **

- Design and interpret experiments to inform product decisions.
- Build models that predict signal, not noise.
- Turn big data into the big picture.
- Understand user retention, engagement, conversion, and leads.
- Give your users what they want.
- Estimate intelligently.
- Tell the story with the data.

**Ans. **Data modeling: in software engineering, data modeling is the process of creating a data model for an information system by applying formal data modeling techniques. Database design: database design is the process of producing a detailed data model of a database. The term can also be used to describe many different parts of the design of an overall database system.

**Ans. **Data is collected from sensors in the environment.

The data is “cleaned” or otherwise processed to produce a data set (typically a data table) that is usable for analysis.

Exploratory data analysis and statistical modelling may be performed.

A data product is a program, such as the ones retailers use to suggest new purchases based on purchase history. It may also create data and feed it back into the environment.


**Ans. **A recommender system is a subclass of information filtering systems that is meant to predict the preference or rating that a user would give to a product. Recommender systems are widely used for movies, news, research articles, products, social tags, music, etc.

**Ans. **Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
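
As a minimal sketch of the idea, the snippet below fits a straight line with ordinary least squares using only plain Python. The data (hours studied as X, exam score as Y) and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line Y = 5X + 50.
hours = [1, 2, 3, 4, 5]
scores = [55, 60, 65, 70, 75]
slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # predict Y for a new X of 6
```

Here X (`hours`) is the predictor variable and Y (`scores`) is the criterion variable.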

**Ans. **A hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values. Ideally, the hash function will assign each key to a unique bucket, but sometimes two keys generate an identical hash, causing both keys to point to the same bucket. This is known as a hash collision.
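
One common way to handle collisions is separate chaining, where each bucket holds a small list of key/value pairs. The class below is a simplified sketch of that approach (the name `ChainedHashTable` is hypothetical, and Python's built-in `dict` uses a different strategy).

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining pairs in a bucket."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The hash function maps a key to one of the buckets.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(num_buckets=4)
table.put(1, "one")
table.put(5, "five")  # 1 % 4 == 5 % 4, so both keys land in the same bucket
```

Both values remain retrievable even though the two keys collide.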

**Ans. **SAS is commercial software, whereas R is free and open source and can be downloaded by anyone. SAS is easy to learn and provides an easy option for people who already know SQL, whereas R is a full programming language, so simple procedures can take more code.

**Ans. **R is a language and environment for statistical computing and graphics. It is a GNU project that is similar to the S language and environment, which was developed at Bell Laboratories.

**Ans. **Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond the observed range.
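
A small sketch can make the distinction concrete. The same straight-line formula is used in both cases; the only difference is whether the query point lies inside or outside the known range. The readings and the function name `linear_estimate` are hypothetical.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Hypothetical readings: 10 degrees at hour 2 and 20 degrees at hour 4.
interpolated = linear_estimate(2, 10, 4, 20, 3)  # x = 3 lies inside [2, 4]
extrapolated = linear_estimate(2, 10, 4, 20, 6)  # x = 6 lies outside [2, 4]
```

Extrapolated values are usually less trustworthy, because nothing in the data confirms that the trend continues outside the observed range.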

**Ans. **Collaborative filtering is the filtering process used by most recommender systems to find patterns or information by combining viewpoints, various data sources, and multiple agents.

**Ans. **Cluster sampling is a technique used when the target population is spread across a wide area, which makes it difficult to study and makes simple random sampling impractical. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame at a regular interval. The list is traversed in a circular manner, so once you reach the end of the list you continue from the top. A common example of systematic sampling is the equal-probability method.
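
The two techniques can be sketched in a few lines of plain Python. The sampling frame, the city clusters, and the function names are invented for illustration; for simplicity this systematic sample stops at the end of the list rather than wrapping around.

```python
import random

def systematic_sample(frame, k, start):
    """Take every k-th element of an ordered frame, beginning at index start."""
    return frame[start::k]

def cluster_sample(clusters, num_clusters, rng):
    """Pick whole clusters at random and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [item for name in chosen for item in clusters[name]]

frame = list(range(1, 21))  # ordered sampling frame of 20 units
every_fifth = systematic_sample(frame, k=5, start=2)

# Hypothetical clusters: customer ids grouped by city.
cities = {"A": [1, 2, 3], "B": [4, 5], "C": [6, 7, 8, 9]}
picked = cluster_sample(cities, num_clusters=2, rng=random.Random(0))
```

Note that cluster sampling randomizes over groups, not over individual elements, which is what makes it practical for widely spread populations.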

**Ans. **They are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or a sample population, whereas expected value is generally used in the context of a random variable.

**Ans. **The p-value is used to determine the significance of results after a hypothesis test in statistics. It helps readers draw conclusions and always lies between 0 and 1. With the commonly used threshold of 0.05:

- P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
- P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
- P-value = 0.05 is the marginal value, indicating it could go either way.
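
To see where a p-value comes from, here is a minimal sketch of a two-sided z-test, assuming the population standard deviation is known. The numbers and the function name `two_sided_p_value` are hypothetical; real analyses often need a t-test or another test instead.

```python
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, std_dev, n):
    """P-value of a two-sided z-test with a known population standard deviation."""
    z = (sample_mean - null_mean) / (std_dev / n ** 0.5)
    # Probability of a result at least this extreme if the null hypothesis holds.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: null mean 100, known sigma 15, sample of 36 with mean 106.
p = two_sided_p_value(106, 100, 15, 36)
reject_null = p <= 0.05
```

With these numbers the z-score is 2.4, which gives a p-value of roughly 0.016, so the null hypothesis would be rejected at the 0.05 level.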

**Ans. **No, they do not, because in some cases the method reaches a local minimum (a local optimum) rather than the global optimum. Where it ends up depends on the data and the starting conditions.
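
Assuming the answer above refers to gradient descent, the following sketch shows the dependence on starting conditions. The function `f` is an arbitrary example with two valleys, chosen only for illustration.

```python
def f(x):
    # Example function with a global minimum near -1.30 and a local one near 1.13.
    return x ** 4 - 3 * x ** 2 + x

def gradient(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

left = descend(-2.0)   # ends in the deeper valley (global minimum)
right = descend(2.0)   # gets stuck in the shallower valley (local minimum)
```

Both runs use the same algorithm and settings; only the starting point differs, yet `f(right)` is noticeably higher than `f(left)`.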

**Ans. **A/B testing is a form of testing in which two variants (A and B) are compared and the result is derived from the difference in their outcomes. A/B testing is valuable because the outcome points to concrete improvements to the system. For example, A/B testing a web page shows how the current page performs against an alternative, and the result provides feedback on which changes to the page layout are worthwhile.

**Ans. **Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
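
As a rough sketch, power iteration is one simple way to approximate the dominant eigenvalue and eigenvector of a matrix. The 2x2 covariance matrix below is made up; in practice a numerical library would normally be used instead.

```python
def power_iteration(matrix, steps=100):
    """Approximate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)
    vector = [1.0] * n
    for _ in range(steps):
        # Apply the transformation, then rescale to unit length.
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in product) ** 0.5
        vector = [v / norm for v in product]
    # For a unit vector v, the Rayleigh quotient v^T A v estimates the eigenvalue.
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return eigenvalue, vector

# Hypothetical covariance matrix of two positively related variables.
cov = [[2.0, 1.0], [1.0, 3.0]]
eigenvalue, eigenvector = power_iteration(cov)
```

The returned vector is the direction in which the transformation stretches the most, and the eigenvalue (about 3.618 here) is the stretch factor along that direction.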

**Ans. **Outlier values can be identified using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually; if there are many, the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outliers. The most common ways to treat outlier values are:

- Change the value to bring it within a range.
- Simply remove the value.
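
The percentile substitution described above can be sketched as follows. This version uses a simple nearest-rank percentile; the data and the function names are hypothetical, and statistical libraries offer several other percentile definitions.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct between 0 and 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def cap_outliers(values, low_pct=1, high_pct=99):
    """Replace values below/above the given percentiles with those percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, low_pct)
    high = percentile(ordered, high_pct)
    return [min(max(v, low), high) for v in values]

# Hypothetical data: 1..99 plus one extreme value.
data = list(range(1, 100)) + [10_000]
capped = cap_outliers(data)
```

The extreme value 10,000 is pulled back to the 99th percentile (99), while the ordinary values are left unchanged.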

**Ans. **There are various methods to assess the results of logistic regression analysis-

- Using Classification Matrix to look at the true negatives and false positives.
- Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.
- Lift helps assess the logistic model by comparing it with random selection.

**Ans. **

- Understand the business problem
- Explore the data and become familiar with it.
- Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
- After data preparation, start running the model, analyse the result and tweak the approach. This is an iterative step until the best possible outcome is achieved.
- Validate the model using a new data set.
- Start implementing the model and track the result to analyse the performance of the model over the period of time.

**Ans. **First identify the variables with missing values, then the extent of the missing values. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. Several factors should be considered when answering this question:

- Understand the problem statement and the data before giving the answer. A default value, such as the mean, minimum, or maximum, can be assigned, but getting into the data is important.
- If it is a categorical variable, the missing value is assigned a default value.
- If the data follows a normal distribution, fill in the mean value.
- Whether missing values should be treated at all is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
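
A minimal sketch of mean/median imputation, with missing entries represented as `None`. The `ages` data and the helper names are made up for illustration.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def missing_share(values):
    """Fraction of entries that are missing; useful when deciding to drop a variable."""
    return sum(v is None for v in values) / len(values)

# Hypothetical column with two missing entries and one extreme value.
ages = [20, None, 30, 40, None, 110]
filled_mean = impute(ages)              # fills with 50, pulled up by the 110
filled_median = impute(ages, "median")  # fills with 35, more robust to the 110
share = missing_share(ages)
```

The median is usually the safer choice when the observed values contain outliers.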

**Ans. **The term machine learning is widely used in the data analysis world. Machine learning is an application of Artificial Intelligence in which algorithms learn from data automatically without being explicitly programmed. As the algorithms run, the data is parsed, patterns are identified, and predictions are made accordingly.

The following are the uses of Machine Learning:

- Mining of the database to understand the growth automation
- Prediction and improvement of process based on the data analysis
- Achieve high orders of data quality and management
- Better decision-making process

**Ans. **Validation set can be considered as a part of the training set as it is used for parameter selection and to avoid Overfitting of the model being built. On the other hand, the test set is used for testing or evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as-

- The training set is used to fit the parameters, i.e. the weights.
- The test set is used to assess the performance of the model, i.e. to evaluate its predictive power and generalization.
- The validation set is used to tune the hyperparameters.
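
The three-way split can be sketched as below. The 60/20/20 proportions and the function name `split_dataset` are assumptions for illustration, not a fixed rule.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # fit the weights
        shuffled[n_train:n_train + n_val],  # tune settings, watch for overfitting
        shuffled[n_train + n_val:],         # final, untouched evaluation
    )

train_set, validation_set, test_set = split_dataset(list(range(100)))
```

The key point is that the three sets do not overlap, so the test set gives an honest estimate of performance on unseen data.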

**Ans. **The best possible answer for this would be Python, because it has the pandas library, which provides easy-to-use data structures and high-performance data analysis tools.

**Ans. **Logistic regression is a technique in which the model predicts a binary outcome from a linear combination of predictor variables.

Logistic regression is explained with an example:

The output of the logistic regression is whether the politician will win or not, i.e. the output takes the form of a binary value (1 or 0): win or lose.

To derive the output, the inputs are the following :

- The amount of money the politician has spent for campaigning
- Amount of time in campaigning.
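
The politician example can be sketched with the logistic (sigmoid) function. The coefficients below are hand-picked, hypothetical values, not the result of fitting any real data.

```python
import math

def win_probability(money_spent, hours_campaigning, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the two inputs."""
    score = bias + weights[0] * money_spent + weights[1] * hours_campaigning
    return 1 / (1 + math.exp(-score))

# Hypothetical coefficients (assumed for illustration only).
weights, bias = (0.5, 0.1), -4.0
p = win_probability(money_spent=6, hours_campaigning=20, weights=weights, bias=bias)
prediction = 1 if p >= 0.5 else 0  # 1 = win, 0 = lose
```

The sigmoid squeezes the weighted sum into a probability between 0 and 1, which is then thresholded to obtain the binary output.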

**Ans. **These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analysing the volume of sales against spending is a bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

**Ans. **Part of a data scientist's role in certain companies involves working closely with the product teams to help define, measure, and report on these metrics. This is an exercise you can go through by yourself at home, and can really help during your interview process.

**Ans. **The following are a few common data quality issues:

- Constraint ranges
- A mixture of different languages
- Noise in the data set
- Missing values / null values
- Outliers in the data set

**Ans. **Univariate analysis is an analysis in which only one variable is examined and presented to business users. For example, sales figures broken down by country in a pie chart involve a single variable.

**Ans. **As the name suggests, bivariate analysis is a process in which two variables are considered for data analysis and representation. An example is comparing the amount of money spent on marketing with the number of sales triggered.

**Ans. **As the name suggests, multivariate analysis is a process in which more than two variables are considered for analysis purposes.

**Ans. **The normal distribution is one of the most important statistical distributions: the data points are distributed symmetrically around the mean, so that plotting them produces a bell curve.

**Ans. **Power analysis is a standard technique that is widely used to estimate the minimum sample size required for an experiment to detect an effect of a given size.
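
As a rough sketch, the sample size for a two-sided comparison of two group means can be estimated with the normal approximation n = 2 * ((z_alpha + z_power) / d)^2, where d is the standardized effect size. The defaults (alpha of 0.05, power of 0.8) and the function name are assumptions for illustration; exact t-based calculations give slightly larger numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for comparing two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_medium = sample_size_per_group(effect_size=0.5)
n_small = sample_size_per_group(effect_size=0.2)
```

Smaller effects need far more data: under this approximation a medium effect of 0.5 needs about 63 observations per group, while an effect of 0.2 needs several hundred.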

**Ans. **Supervised learning is a technique or a process where the human will train the machine with a “labeled” data set. In a sense, the correct answer is associated with the data set. This form of training or learning will help to predict the outcomes.

To build a robust model, the trainer or supervisor has to spend a significant amount of time on this. Also, if the insights in the data change, the model should be changed accordingly.

Unsupervised learning, on the other hand, is a process that needs no external human labelling. The model is left on its own to understand the data and discover information on the fly. The data used here is an unlabelled data set. Unsupervised algorithms can take on tasks, such as discovering hidden structure, for which no labelled answers exist.

**Ans. **A confusion matrix is a table (2x2 for a binary classifier) whose entries are counts of predictions. It summarizes both correct and incorrect predictions, broken down by class. With the help of the confusion matrix, one can see the types of errors made by the classifier.
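
For a binary classifier, the four cells can be counted directly from the actual and predicted labels. The labels below are invented for illustration, with 1 as the positive class.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count the four cells of a binary confusion matrix (1 = positive class)."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # predicted positive, actually positive
        "FN": counts[(1, 0)],  # predicted negative, actually positive
        "FP": counts[(0, 1)],  # predicted positive, actually negative
        "TN": counts[(0, 0)],  # predicted negative, actually negative
    }

# Hypothetical labels for eight observations.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_counts(actual, predicted)
```

The false positives and false negatives are the two error types the classifier can make, and they often carry very different business costs.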

**Ans. **Correlation and covariance are two statistical measures that describe the relationship between two variables. Both are commonly used in statistics and probability.

**Covariance:**

- Covariance indicates the direction in which two variables vary together: whether a change in one variable tends to accompany a change in the other.
- Its value can range from -infinity to +infinity. A negative value describes a negative relationship, whereas a positive value expresses a positive relationship.
- It is mostly used for linear relationships between variables, and its magnitude depends on the units of the variables.

**Correlation:**

- Correlation helps to understand how strongly the variables are related to one another.
- The values range from -1 to +1. Values close to -1 indicate a strong negative correlation, and values close to +1 indicate a strong positive correlation.
- It is covariance rescaled by the standard deviations of the two variables, so it does not depend on their units.
- It therefore denotes the strength of the relationship between the variables, not only its direction.
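
The difference in scale between the two measures is easy to see in a short sketch. The marketing spend and sales figures are hypothetical, and the sample (n - 1) form of covariance is used.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) ** 0.5 * covariance(ys, ys) ** 0.5)

spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
sales_in_cents = [s * 100 for s in sales]  # same relationship, different unit

cov_units = covariance(spend, sales)
cov_cents = covariance(spend, sales_in_cents)
corr_units = correlation(spend, sales)
corr_cents = correlation(spend, sales_in_cents)
```

Changing the unit multiplies the covariance by 100, while the correlation stays at 1 in both cases.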

**Ans. **Re-sampling is a process in which samples are repeatedly drawn from the original data set. Re-sampling methods are empirical rather than analytical: the sampling distribution is generated from the data itself.

Re-sampling is usually carried out in one of the following ways:

- Substitute (permute) the labels on the data points while performing significance tests.
- Validate models on random subsets of the data with the help of bootstrapping, cross-validation, etc.
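
Bootstrapping, one of the methods mentioned above, can be sketched as drawing many samples with replacement and looking at how a statistic varies across them. The sample values, the number of resamples, and the function name are assumptions for illustration.

```python
import random
from statistics import mean

def bootstrap_means(sample, num_resamples, rng):
    """Means of resamples drawn with replacement, each the size of the original."""
    return [mean(rng.choices(sample, k=len(sample))) for _ in range(num_resamples)]

# Hypothetical sample of eight measurements.
sample = [12, 15, 9, 20, 14, 11, 18, 13]
means = sorted(bootstrap_means(sample, 1000, random.Random(42)))

# Rough 95% percentile interval: cut off about 2.5% of the means at each end.
lower, upper = means[25], means[974]
```

The spread between `lower` and `upper` gives an empirical sense of how uncertain the sample mean is, without relying on an analytical formula.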

**Ans. **In machine learning and statistical analysis, one of the vital tasks is to fit a model to a training data set so that the model can make predictions. Two closely related concepts affect how well the model predicts: overfitting and underfitting.

**Overfitting:**

As the word suggests, an overfitted model is too complex and has too many parameters compared to the number of observations. Overfitted models usually show weak predictive performance because the model reacts to the slightest fluctuations in the training data.

**Underfitting:**

Underfitting occurs when the algorithm cannot capture the trend of the data, for example when a linear model is fitted to a non-linear data set. Because the model does not match the data properly, it cannot produce strong predictive output and performs poorly.

**Ans. **Both languages are open source, widely implemented, and have large user bases. Both are widely used for data analysis in particular.

**Python:**

- Python is a programming language with easy-to-read syntax and wide acceptance.
- It is used widely for data analysis purposes. It is a popular tool to deploy and execute machine learning processes.
- The code is easy to maintain, and a range of libraries simplifies the work. Popular libraries in the data science world include Scikit-learn, Pandas, SciPy, Seaborn, and NumPy.

**R:**

- R is also a programming language; it is primarily used for statistical analysis.
- R has a rich source of libraries which will help in any type of data analysis work. The libraries are available for everyone as it is an open-source programming language. The acceptance rate for R has been huge in the data analysis world.
- With the help of a reporting library, the results that are identified with the use of R language can be easily represented.
- A lot of help is available within the R community where a lot of documentation is available for the developers to go through.

**Ans. **

- Data cleaning plays an important role in the analysis phase. It helps data scientists and analysts understand the data patterns and represent the data in one consistent format rather than dealing with each data source's layout.
- It also improves the accuracy of the model in machine learning.
- As the number of data sources increases, the time and effort associated with data cleaning also increase, but the end result is data that is available in a readable, business-centric format.

**Ans. **In the computing world, a star schema is the simplest form of data mart schema and is widely used to build data warehouses and dimensional data marts. A star schema has one or more fact tables, each of which references a number of dimension tables.

The name comes from its appearance: the fact table sits at the centre and the dimension tables surround it like the points of a star.

**Ans. **Data sampling is a statistical analysis technique that is widely used to select, translate, manipulate, and analyze a subset of the data. This provides information about the patterns and trends in the larger data set.

There are different sampling methods; the following are some techniques that are used to analyze data sets.

- Simple random sampling
- Stratified sampling
- Cluster sampling
- Multistage sampling
- Systematic sampling
- Convenience sampling
- Consecutive sampling
- Purposive or judgemental sampling
- Quota sampling

**Ans. **A validation set is a portion of the training data that is held out and used for parameter selection. It also helps to make sure that the model is not overfitted.

**Ans. **As the name indicates, a decision tree provides a pictorial form of all the decisions in a process. It is a supervised machine learning algorithm that is mainly used for classification and regression.

Within this process, the entire data set is broken down into smaller and smaller subsets while the associated decision tree is built up incrementally. The result is a flowchart-like tree with decision nodes and leaf nodes.

A decision tree is well capable of handling categorical data sets and numerical data sets.

Pruning is an effective technique in machine learning that is primarily used to reduce the size of a decision tree. It also reduces the complexity of the classifier, which can improve predictive accuracy by reducing overfitting.
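
To illustrate how a decision node might choose where to split, the sketch below scores candidate thresholds with Gini impurity, one common criterion (entropy is another). The tiny data set and the function names are made up for illustration.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, higher for a mixed one."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(rows, threshold):
    """Weighted Gini impurity after splitting (value, label) rows at threshold."""
    left = [label for value, label in rows if value <= threshold]
    right = [label for value, label in rows if value > threshold]
    n = len(rows)
    return sum(len(part) / n * gini(part) for part in (left, right) if part)

# Hypothetical rows: (numeric feature, class label).
rows = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]

# Try each observed value as a threshold and keep the purest split.
best_threshold = min((value for value, _ in rows),
                     key=lambda t: split_impurity(rows, t))
```

A full tree repeats this search recursively on each resulting subset until the leaves are pure enough or a size limit is reached.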

**Ans. **The term “Boosting” refers to a set of algorithms that are primarily used to turn weak learners into strong learners. The combined model gives better results than any of the initial weak learners alone.

Boosting is a method in which weak learners are trained in sequential order. Because the process is sequential, each learner focuses on correcting the errors of its predecessors.

There are three common types of boosting:

- AdaBoost (adaptive boosting)
- Gradient boosting
- XGBoost

**Ans. **The following are the cases where the algorithm is updated.

- As the model evolves, the algorithm has to be updated.
- When the data source is changing regularly
- When the results are not accurate, the algorithm will need to be updated
- If the current algorithm doesn’t serve the purpose of data analysis.

**Ans. **Deep learning is a subfield of machine learning. It uses algorithms whose structure is inspired by the way the brain functions; these are called artificial neural networks.

**Ans. **Reinforcement learning is a technique oriented towards maximizing a reward. The learning process is about working out which actions to take in which situations so as to maximize the reward signal.

In this process, the learner is not told which action to take but has to discover which actions yield the best result. The process is derived from human learning, where great importance is given to the reward/penalty mechanism.

**Ans. **A hyperparameter is a parameter whose value is set before the learning process is executed. Hyperparameters determine how a network is trained and how it is structured.

For example:

- The number of hidden units
- The learning rate
- The number of epochs

**Ans. **CNN stands for Convolutional Neural Network. There are four different layers in this network:

- Convolutional Layer: In this layer, several small windows (filters) slide over the data to extract features.
- ReLU Layer: In this layer, all negative values in the feature map are converted to zero, producing a rectified feature map.
- Pooling Layer: In this layer, the dimensions of the feature map are reduced. It is a down-sampling operation.
- Fully Connected Layer: In this layer, the extracted features are used to recognize and classify the objects in an image.
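
The ReLU and pooling steps are simple enough to sketch in plain Python. The 4x4 feature map is invented, and 2x2 max pooling with non-overlapping windows is assumed; real frameworks operate on tensors and support other window sizes and strides.

```python
def relu(feature_map):
    """Convert every negative value in the feature map to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Down-sample by taking the maximum of each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]

# Hypothetical output of a convolutional layer.
feature_map = [
    [1, -2, 3, 0],
    [-1, 5, -4, 2],
    [0, -3, -1, -2],
    [4, 1, -6, -5],
]
rectified = relu(feature_map)
pooled = max_pool_2x2(rectified)
```

The 4x4 map shrinks to 2x2, keeping only the strongest activation in each window.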

**Ans. **There are three different variants of gradient descent used with backpropagation; they are listed below:

- Stochastic Gradient Descent: Only one training example is used to calculate the gradient, and the parameters are updated accordingly.
- Batch Gradient Descent: The gradient is calculated over the entire dataset, and one update is performed per iteration.
- Mini-batch Gradient Descent: It is considered one of the best optimization algorithms. It works like stochastic gradient descent, but instead of a single training example it uses mini-batches.
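
The three variants differ only in how many examples feed each update, so one sketch with a `batch_size` parameter can cover all of them. The one-parameter model y = w * x, the noiseless data, and the settings are assumptions chosen to keep the example small.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x by minimising squared error with the given batch size."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of (w * x - y)^2 over the current batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

xs = [1, 2, 3, 4, 5, 6]
ys = [3 * x for x in xs]  # hypothetical noiseless data, true w = 3

w_sgd = gradient_descent(xs, ys, batch_size=1)         # stochastic
w_mini = gradient_descent(xs, ys, batch_size=2)        # mini-batch
w_batch = gradient_descent(xs, ys, batch_size=len(xs)) # full batch
```

On this easy problem all three reach the same answer; on large, noisy data sets they trade off update cost against the noise in each step.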

**Ans. **Different deep learning frameworks are listed below:

- PyTorch
- TensorFlow
- Microsoft Cognitive Toolkit
- Keras
- Caffe
- Chainer

**Ans. **

- A Boltzmann machine has a simple learning algorithm that helps to discover key features which in turn represent complex regularities within the training dataset.
- The Boltzmann machine is primarily used to optimize the weights and quantities for a given problem.
- The algorithm is relatively slow and has a lot of feature detectors within many layers.

**Ans. **The following are certain skills that are vital for an individual to have to excel in data analysis :

- Fair understanding of all built-in data types like lists, dictionaries, tuples, and sets
- Should master N-dimensional arrays
- Should have a fair understanding of pandas DataFrames.
- Familiar with Scikit-learn process
- Able to write comprehensions rather than loops
- Capable of writing small functions which are easy to understand and execute
- Optimization and customization of Python scripts.

**Ans. **A uniform distribution is one where the data is spread equally across the whole range.

A skewed distribution is one where the data is concentrated on one side of the plot. A skewed distribution is either left-skewed or right-skewed.

**Ans. **Precision is the percentage of positive predictions that are actually correct, i.e. true positives divided by all predicted positives.

Recall is the percentage of actual positives that the model correctly identified, i.e. true positives divided by all actual positives.
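
Using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)), both metrics can be computed directly from the labels. The labels below are hypothetical, with 1 as the positive class.

```python
def precision_recall(actual, predicted):
    """Precision and recall for a binary classifier (1 = positive class)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels: four actual positives, three predicted positives.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(actual, predicted)
```

Here two of the three positive predictions are right (precision of 2/3), but only two of the four actual positives were found (recall of 1/2).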

**Ans. **The data is collected from various social media channels like Twitter, Facebook, etc. With the help of their API’s the information is gathered. For example, the data can be collected from a single Tweet, i.e. Tweeted date, Number of retweets, course, content of the tweet, number of favourites for the tweet, etc.

Using all of this information, a multivariate time series model is equipped to predict the answer.

**Ans. **It is advised to run through the features in a Gradient Boosting Machine or Random Forest process where the plots are generated with relative importance. Further, it is advised to look for the variables that were added in the forward variable selection process.

**Ans. **The Naive Bayes assumption is that all the features are equally important and that they are independent of each other given the class. In reality, this rarely holds fully, but the method still works well for many classification problems.

**Ans. **Within the multinomial distribution, the values are assigned as n=12 and k=3, the outcome is as per multinomial distribution. The classes are distinct.

**Ans. **Having more data can actually cause more issues if it is not managed well. A few of them are listed below for reference:

- If the data quality is not appropriate then having more data is not useful.
- If your model is not equipped to handle huge data then there is no point in having huge data.
- Additional data always comes with additional storage space and also computing power and resources. Pricing can also be considered as another factor.

**Ans. **High dimensionality makes clustering hard, because the clusters have to accommodate a large number of dimensions.

For example:

To cover a fraction of the data volume, the model has to capture a wide range of variables.

**Ans. **A confidence interval is a range of values, calculated from sample data, that is likely to contain the true value of a population parameter such as the mean. The confidence level describes how often the procedure succeeds: for example, at a confidence level of 95%, about 95% of the intervals constructed from repeated samples would contain the true value.
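
A minimal sketch of an interval for a population mean, using the normal approximation (for a sample this small a t-based interval would be somewhat wider). The height data and the function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / len(sample) ** 0.5  # z times the standard error
    centre = mean(sample)
    return centre - margin, centre + margin

# Hypothetical sample of eight heights in centimetres.
heights = [170, 172, 168, 175, 171, 169, 173, 174]
low_95, high_95 = mean_confidence_interval(heights)
low_99, high_99 = mean_confidence_interval(heights, confidence=0.99)
```

Asking for more confidence (99% instead of 95%) produces a wider interval around the same sample mean.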

**Ans. **Correlation measures the statistical relationship between two or more variables.

Causation describes a causal relationship between two events, i.e. cause and effect.

Causation generally implies correlation, but correlation does not imply causation.

Name

TekSlate

Author Bio

TekSlate is the best online training provider in delivering world-class IT skills to individuals and corporates from all parts of the globe. We are proven experts in accumulating every need of an IT skills upgrade aspirant and have delivered excellent services. We aim to bring you all the essentials to learn and master new technologies in the market with our articles, blogs, and videos. Build your career success with us, enhancing most in-demand skills in the market.