#### 1. What is Data Science? How would you say it is similar or different to business analytics and business intelligence?

Data science is a field that deals with the analysis of data. It studies where information comes from and what it represents, and turns it into a valuable resource by producing insights that are later used to create strategies. It combines business perspective, computer programming and statistical techniques.

Business analytics, or simply analytics, is at the core of both business intelligence and data science. Data science is a relatively new term for the analysis of big data to produce insights.

Analytics generally involves a higher degree of business perspective than data science, which is more programming-heavy. The terms are, however, often used interchangeably.

**2. How would you create a taxonomy to identify key customer trends in unstructured data?**

The best way to approach this question is to mention that it is good to check with the business owner and understand their objectives before categorizing the data. After that, follow an iterative approach: pull new data samples, refine the model, and validate its accuracy by soliciting feedback from business stakeholders. This helps ensure that the model produces actionable results and improves over time.

**3. Python or R – Which one would you prefer for text analytics?**

The best possible answer for this would be Python, because it has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.


#### 4. How do you build a custom function in Python or R?

In R, use the function command.

The structure of a function is given below:

    myfunction <- function(arg1, arg2, ...) {
      statements
      return(object)
    }

Example:

    # function example - get measures of central tendency
    # and spread for a numeric vector x. The user has a
    # choice of measures and whether the results are printed.
    mysummary <- function(x, npar=TRUE, print=TRUE) {
      if (!npar) {
        center <- mean(x); spread <- sd(x)
      } else {
        center <- median(x); spread <- mad(x)
      }
      if (print & !npar) {
        cat("Mean=", center, "\n", "SD=", spread, "\n")
      } else if (print & npar) {
        cat("Median=", center, "\n", "MAD=", spread, "\n")
      }
      result <- list(center=center, spread=spread)
      return(result)
    }

    # invoking the function
    set.seed(1234)
    x <- rpois(500, 4)
    y <- mysummary(x)
    Median= 4
    MAD= 1.4826
    # y$center is the median (4)
    # y$spread is the median absolute deviation (1.4826)

    y <- mysummary(x, npar=FALSE, print=FALSE)
    # no output
    # y$center is the mean (4.052)
    # y$spread is the standard deviation (2.01927)

In Python, use the def statement.

Structure of the function:

    def func(arg1, arg2, …):
        statement 1
        statement 2
        …
        return value

Example: determine the mean of a list of values.

    def find_mean(given_list):
        sum_values = sum(given_list)
        num_values = len(given_list)
        return sum_values / num_values

    print(find_mean([i for i in range(1, 9)]))
    # 4.5

#### 5. Which package is used to do data import in R and Python? How do you do data import in SAS?

We can import data using several methods:

– In R we use RODBC for RDBMS data, and data.table for fast import.

– We use jsonlite for JSON data, and the foreign package for data from other tools such as SPSS.

– We use the sas7bdat package for SAS data.

– In Python we use the Pandas package, with the read_csv and read_sql commands for reading data. We can also use SQLAlchemy in Python for connecting to databases.

#### 6. What is an RDBMS? Name some examples for RDBMS? What is CRUD?

A relational database management system (RDBMS) is a database management system that is based on a relational model. The relational model uses the basic concept of a relation or table. RDBMS is the basis for SQL, and for database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.

In computer programming, create, read, update and delete (CRUD) are the four basic functions of persistent storage. The set is sometimes called SCRUD, with an "S" for search.
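The four CRUD operations map directly onto SQL statements. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the customers table and its values are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Create: insert a new row.
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Pune"))

# Read: query the row back.
cur.execute("SELECT name, city FROM customers WHERE name = ?", ("Asha",))
row_after_create = cur.fetchone()

# Update: change a column of the existing row.
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Mumbai", "Asha"))
cur.execute("SELECT city FROM customers WHERE name = ?", ("Asha",))
city_after_update = cur.fetchone()[0]

# Delete: remove the row.
cur.execute("DELETE FROM customers WHERE name = ?", ("Asha",))
cur.execute("SELECT COUNT(*) FROM customers")
rows_after_delete = cur.fetchone()[0]

conn.commit()
conn.close()
```

The same four statements apply to the other RDBMS products listed above, although connection details differ per system.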


#### 7. How do you check for data quality?

Data quality is an assessment of data’s fitness to serve its purpose in a given context. Different aspects of data quality include:

– Accuracy

– Completeness

– Update status

– Relevance

– Consistency across data sources

– Reliability

– Appropriate presentation

– Accessibility

Maintaining data quality requires going through the data at regular intervals and scrubbing it. This involves updating it, standardizing it, and removing duplicates to create a single view of the data, even if it is stored in multiple systems.

#### 8. What is missing value imputation? How do you handle missing values in Python or R?

Imputation is the process of replacing missing data with substitute values.

In R:

Missing values are represented in R by the NA symbol. NA is a special value whose properties are different from other values. NA is one of the very few reserved words in R: you cannot give anything this name. Here are some examples of operations that produce NAs.

    > var(8) # Variance of one number
    [1] NA
    > as.numeric(c("1", "2", "three", "4")) # Illegal conversion
    [1] 1 2 NA 4

Operations on missing values:

Almost every operation performed on an NA produces an NA. For example:

    > x <- c(1, 2, NA, 4) # Set up a numeric vector
    > x # There's an NA in there
    [1] 1 2 NA 4
    > x + 1 # NA + 1 = NA
    [1] 2 3 NA 5

Excluding missing values:

Math functions generally have a way to exclude missing values in their calculations. mean(), median(), colSums(), var(), sd(), min() and max() all take the na.rm argument. When this is TRUE, missing values are omitted. The default is FALSE, meaning that each of these functions returns NA if any input number is NA. Note that cor() and its relatives don’t work that way: with those you need to supply the use= argument. This is to permit more complicated handling of missing values than simply omitting them.

R's modeling functions accept an na.action argument that tells the function what to do when it encounters an NA. The filter functions are:

– na.fail: Stop if any missing values are encountered.

– na.omit: Drop any rows with missing values anywhere in them and forget them forever.

– na.exclude: Drop rows with missing values, but keep track of where they were (so that when you make predictions, for example, you end up with a vector whose length is that of the original response).

– na.pass: Take no action.

A couple of other packages supply more alternatives:

– na.tree.replace (library(tree)): For discrete variables, adds a new category called "NA" to replace the missing values.

– na.gam.replace (library(gam)): Operates on discrete variables like na.tree.replace(); for numerics, NAs are replaced by the mean of the non-missing entries.

In Python:

Missing values in pandas are represented by NaN or None. They can be detected using the isnull() and notnull() functions.

Operations on missing values:

Math functions such as sum(), mean(), max() and min() skip NA (missing) values by default; for sum() this is the same as treating them as zero. If the data are all NA, sum() returns 0 in current pandas versions, while mean(), max() and min() return NaN.

    df["one"]
    one
    a NaN
    c NaN
    e 0.294633
    f -0.685597
    h NaN

    df["one"].sum()
    -0.39096437337883205

Cleaning/filling missing values

– fillna: fills in NA values with non-null data.

– dropna: removes the rows or columns (axis labels) that contain missing values.

Imputing missing data:

SimpleImputer is a transformer in the scikit-learn library in Python that fills in missing values using a chosen strategy, such as the mean (older scikit-learn versions provided this as sklearn.preprocessing.Imputer, which has since been removed). Example:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    s = pd.Series([1, 2, 3, np.nan, 5, 6, None])

    # Learn the column mean (4) from complete data, then fill the gaps in s.
    imp = SimpleImputer(missing_values=np.nan, strategy="mean")
    imp.fit([[1], [2], [3], [4], [5], [6], [7]])
    x = pd.Series(imp.transform(s.to_numpy().reshape(-1, 1)).ravel())
    print(x)

Output:

    0    1.0
    1    2.0
    2    3.0
    3    4.0
    4    5.0
    5    6.0
    6    4.0
    dtype: float64

#### 9. Why do you need a for loop? How do you do for loops in Python and R?

We use the ‘for’ loop if we need to do the same task a specific number of times.

In R, it looks like this:

for (counter in vector) {commands}

We will set up a loop to square every element of the dataset foo, which contains the odd integers from 1 to 100 (keep in mind that the vectorized form, foo^2, would be faster for a trivial example like this):

    foo = seq(1, 100, by=2)
    foo.squared = NULL
    for (i in 1:50) {
      foo.squared[i] = foo[i]^2
    }

If the creation of a new vector is the goal, we first have to set up a vector to store things in before running the loop. This is the foo.squared = NULL part.

Next, the real for loop begins. This code says we'll loop 50 times (1:50). The counter we set up is i (but we can put whatever variable name we want there). On each pass, the ith element of our new vector foo.squared is set to the square of the ith element of foo (for the first loop, i=1; for the second loop, i=2).
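The question also asks about Python. A rough equivalent of the R loop is sketched below; the variable names mirror the R example, and the list comprehension at the end is the more idiomatic Python form of the same computation.

```python
# Odd integers from 1 to 99, mirroring seq(1, 100, by=2) in R.
foo = list(range(1, 100, 2))

# Explicit for loop: append the square of each element to a result list.
foo_squared = []
for value in foo:
    foo_squared.append(value ** 2)

# Equivalent list comprehension, usually preferred for simple transformations.
foo_squared_short = [value ** 2 for value in foo]
```

Unlike R, Python's for loop iterates over the elements themselves, so no index counter is needed here.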

#### 10. What is advantage of using apply family of functions in R? How do you use lambda in Python?

The apply function allows us to make entry-by-entry changes to data frames and matrices.

The usage in R is as follows:

apply(X, MARGIN, FUN, …)

where:

X is an array or matrix;

MARGIN is a variable that determines whether the function is applied over rows (MARGIN=1), columns (MARGIN=2), or both (MARGIN=c(1,2));

FUN is the function to be applied.

If MARGIN=1, the function accepts each row of X as a vector argument, and returns a vector of the results. Similarly, if MARGIN=2 the function acts on the columns of X. Most impressively, when MARGIN=c(1,2) the function is applied to every entry of X.

Advantage:

With the apply function we can transform every entry of a matrix or data frame with a single-line command, without writing an explicit loop.

Lambda:

    afunc = lambda a: func_on_a

Here func_on_a stands for any expression in a. You can then use lambda with the map, reduce and filter functions as required; they apply the lambda to the elements one at a time.
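As a small illustration of lambda with map, filter and reduce, here is a sketch on a hypothetical list of numbers. Note that in Python 3, reduce lives in the functools module, and map and filter return iterators, so they are wrapped in list() here.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# A lambda bound to a name, as in the pattern above.
square = lambda a: a * a

# map applies the function to each element in turn.
squares = list(map(square, numbers))

# filter keeps only the elements for which the lambda returns True.
evens = list(filter(lambda a: a % 2 == 0, numbers))

# reduce folds the list into a single value, here a running sum.
total = reduce(lambda acc, a: acc + a, numbers, 0)
```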

**11. What packages are used for data mining in Python and R?**

Python:

– Scikit-learn: machine learning library, built on top of NumPy, SciPy and matplotlib.

– NumPy and SciPy: provide mathematical functionality similar to Matlab.

– Matplotlib: visualization library; provides plots like those in Matlab.

– NLTK: natural language processing library, extensively used for text mining.

– Orange: provides visualization and machine learning features, and also association rule learning.

– Pandas: inspired by R; provides functionality for working with data frames.

R:

– data.table: provides fast reading of large files.

– rpart and caret: for machine learning models.

– arules: for association rule learning.

– ggplot2: provides various data visualization plots.

– tm: for text mining.

– forecast: provides functions for time series analysis.

#### 12. What is machine learning? What is the difference between supervised and unsupervised methods?

Machine learning is the study of computer algorithms that learn to perform a task from data rather than from explicit instructions. There are many examples of machine learning problems, e.g.:

– optical character recognition: categorize images of handwritten characters by the letters represented

– face detection: find faces in images (or indicate if a face is present)

– spam filtering: identify email messages as spam or non-spam

– topic spotting: categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.

– spoken language understanding: within the context of a limited domain, determine the meaning of something uttered by a speaker to the extent that it can be classified into one of a fixed set of categories

– medical diagnosis: diagnose a patient as a sufferer or non-sufferer of some disease

– customer segmentation: predict, for instance, which customers will respond to a particular promotion

– fraud detection: identify credit card transactions (for instance) which may be fraudulent in nature

– weather prediction: predict, for instance, whether or not it will rain tomorrow

Supervised learning is the type of learning that takes place when the training instances are labelled with the correct result, which gives feedback about how learning is progressing. Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. Digit recognition is a common example of classification learning.

In unsupervised learning, there are no pre-determined categorizations. Two approaches are commonly described under this heading:

- The first approach is to teach the agent not by giving explicit categorizations, but by using some sort of reward system to indicate success. This approach generalizes nicely to the real world, where agents might be rewarded for doing certain actions and punished for doing others. This reward-based approach is usually called reinforcement learning and is often treated as a category of its own.
- A second type of unsupervised learning is called clustering. In this type of learning, the goal is not to maximize a utility function, but simply to find similarities in the training data. The assumption is often that the clusters discovered will match reasonably well with an intuitive classification. For instance, clustering individuals based on demographics might result in a clustering of the wealthy in one group and the poor in another.

**13. What is random forests and how is it different from decision trees?**

Random forests involve building several decision trees on bootstrap samples of the data, with each tree using a random subset of the features, and then making predictions by majority voting among the trees for classification problems or by averaging for regression problems. This reduces the overfitting problem of single decision trees.

Algorithm:-

Repeat K times:

– Draw a bootstrap sample from the dataset.

– Train a Decision Tree by selecting m features from available p features.

– Measure out of bag error. Evaluate against the samples which were not selected in bootstrap.

Make a prediction by majority voting among K trees

Random forests are more difficult to interpret than single decision trees, so understanding variable importance helps.

Random forests are easy to parallelize because the trees can be built independently. They also handle "small n, large p" problems naturally, since only a subset of attributes is selected for each tree.
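The sampling and voting steps of the algorithm above can be sketched without any machine learning library. The helper names below are hypothetical, tree training itself is left out, and the feature subset is drawn once per tree as the steps above describe (many implementations redraw it at each split instead).

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def out_of_bag(n_rows, sampled):
    """Rows never drawn into the bootstrap sample; used for the out-of-bag error."""
    drawn = set(sampled)
    return [i for i in range(n_rows) if i not in drawn]

def random_feature_subset(n_features, m, rng):
    """Pick m of the p available feature indices for one tree."""
    return sorted(rng.sample(range(n_features), m))

def majority_vote(predictions):
    """Combine the class labels predicted by the K trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_indices(10, rng)
oob = out_of_bag(10, sample)
features = random_feature_subset(n_features=6, m=2, rng=rng)
vote = majority_vote(["spam", "ham", "spam"])
```

Each tree would be trained on the rows in sample using only the columns in features, then scored on the rows in oob.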

**14. What is linear optimization? Where is it used? What is the travelling salesman problem? How do you use Goal Seek in Excel?**

Linear optimization or Linear Programming (LP) involves minimizing or maximizing an objective function subject to bounds, linear equality, and inequality constraints. Example problems include design optimization in engineering, profit maximization in manufacturing, portfolio optimization in finance, and scheduling in energy and transportation.

The following algorithms are commonly used to solve linear programming problems:

– Interior point: Uses a primal-dual predictor-corrector algorithm and is especially useful for large-scale problems that have structure or can be defined using sparse matrices.

– Active-set: Minimizes the objective at each iteration over the active set (a subset of the constraints that are locally active) until it reaches a solution.

– Simplex: Uses a systematic procedure for generating and testing candidate vertex solutions to a linear program. The simplex algorithm is the most widely used algorithm for linear programming.

The Travelling Salesman Problem (TSP) is NP-hard, and its decision version belongs to the class of NP-complete problems. TSP is a special case of the travelling purchaser problem and the vehicle routing problem. It is used as a benchmark for many optimization methods. It is a problem in graph theory that asks for the most efficient route, i.e. the one with the least total distance, that a salesman can take through n cities, visiting each exactly once and returning to the start.
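For a handful of cities, the problem can be solved by simply trying every ordering. The sketch below does exactly that on four hypothetical city coordinates; because the number of orderings grows factorially with n, this brute-force approach is only practical for very small instances.

```python
from itertools import permutations
from math import dist

def shortest_tour(cities):
    """Return (length, order) of the shortest closed tour by trying every ordering."""
    start, *rest = range(len(cities))
    best_length, best_order = float("inf"), None
    for perm in permutations(rest):
        # Fix the start city and close the tour by returning to it.
        order = (start, *perm, start)
        length = sum(dist(cities[a], cities[b]) for a, b in zip(order, order[1:]))
        if length < best_length:
            best_length, best_order = length, order
    return best_length, best_order

# Hypothetical cities at the corners of a 4 x 3 rectangle.
cities = [(0, 0), (0, 3), (4, 3), (4, 0)]
length, order = shortest_tour(cities)
```

Here the best tour walks around the rectangle, so its length equals the perimeter.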

#### 15. What is CART and CHAID? How is bagging different from boosting?

**CART:**

– Classification And Regression Tree (CART) analysis is an umbrella term covering classification tree analysis, in which the predicted outcome is the class to which the data belongs, and regression tree analysis, in which the predicted outcome is a real number.

– Splits in Tree are made by variables that best differentiate the target variable.

– Each node can be split into two child nodes.

– Stopping rule governs the size of the tree.

**CHAID:**

– Chi Square Automatic Interaction Detection.

– Performs multi-level splits whereas CART uses binary splits.

– Well suited for large data sets.

– Commonly used for market segmentation studies.

**Bagging:**

- Draw N bootstrap samples.
- Retrain the model on each sample.
- Combine the results: averaging for regression, majority voting for classification.
- Works well for overfit (high-variance) models: it decreases variance without changing bias, and doesn't help much with underfit (high-bias) models.
- Relatively insensitive to small changes in the training data.

**Boosting:**

– Instead of selecting data points purely at random with the bootstrap, favor the misclassified points by adjusting the weights down for correctly classified examples.

– The models are trained sequentially, so boosting is difficult to apply to large data.

**16. What is clustering? What is the difference between kmeans clustering and hierarchical clustering?**

A cluster is a group of objects that belong to the same class. Clustering is the process of grouping a set of abstract objects into classes of similar objects.

Clustering algorithms used in data analysis should meet the following requirements:

– Scalability − We need highly scalable clustering algorithms to deal with large databases.

– Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data such as interval-based (numerical) data, categorical, and binary data.

– Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound to distance measures that tend to find only small spherical clusters.

– High dimensionality − The clustering algorithm should not only be able to handle low-dimensional data but also the high dimensional space.

– Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.

– Interpretability − The clustering results should be interpret-able, comprehensible, and usable.

K-MEANS clustering:

K-means clustering is a well-known partitioning method. In this method, objects are classified as belonging to one of K groups. The result of a partitioning method is a set of K clusters, with each object of the data set belonging to one cluster. Each cluster may have a centroid or cluster representative. For real-valued data, the arithmetic mean of the attribute vectors of all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases. Hierarchical clustering, by contrast, builds a tree of nested clusters by successively merging or splitting groups, and does not require K to be fixed in advance.
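A minimal sketch of the k-means procedure in plain Python follows. The six points and the two starting centroids are hypothetical; real implementations usually choose the starting centroids at random and stop once the assignments no longer change.

```python
from math import dist

def kmeans(points, centroids, n_iter=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the arithmetic mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

With these well-separated points, the three points near (1, 1.5) end up in one cluster and the three near (8.5, 8.5) in the other.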

#### 17. What is churn? How would it help predict and control churn for a customer?

Customer churn, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers.

Banks, telephone service companies, internet service providers, pay TV companies, insurance firms, and alarm monitoring services, often use customer churn analysis and customer churn rates as one of their key business metrics because the cost of retaining an existing customer is far less than acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

Survival analysis methods, which have been applied for decades in medicine and engineering, come in handy any time we are interested in understanding how long something (customers, patients, car parts) survives and what actions can help it survive longer.

#### 18. What is market basket analysis? How would you do it in R and Python?

Market basket analysis is the study of items that are purchased or grouped together in a single transaction or multiple, sequential transactions. Understanding the relationships and the strength of those relationships is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons, etc.

The analysis reveals patterns such as that of the well-known study which found an association between purchases of diapers and beer.

In a market basket analysis the transactions are analysed to identify rules of association. For example, one rule could be: {pencil, paper} => {rubber}. This means that if a customer has a transaction that contains a pencil and paper, then they are likely to be interested in also buying a rubber.

Before acting on a rule, a retailer needs to know whether there is sufficient evidence to suggest that it will result in a beneficial outcome. We therefore measure the strength of a rule by calculating the following three metrics (note other metrics are available, but these are the three most commonly used):

- Support: the percentage of transactions that contain all of the items in an item set (e.g., pencil, paper and rubber). The higher the support the more frequently the item set occurs. Rules with a high support are preferred since they are likely to be applicable to a large number of future transactions.
- Confidence: the probability that a transaction that contains the items on the left hand side of the rule (in our example, pencil and paper) also contains the item on the right hand side (a rubber). The higher the confidence, the greater the likelihood that the item on the right hand side will be purchased or, in other words, the greater the return rate we can expect for a given rule.
- Lift: the probability of all of the items in a rule occurring together (otherwise known as the support) divided by the product of the probabilities of the items on the left and right hand side occurring as if there was no association between them. For example, if pencil, paper and rubber occurred together in 2.5% of all transactions, pencil and paper in 10% of transactions and rubber in 8% of transactions, then the lift would be: 0.025/(0.1*0.08) = 3.125. A lift of more than 1 suggests that the presence of pencil and paper increases the probability that a rubber will also occur in the transaction. Overall, lift summarizes the strength of association between the products on the left and right hand side of the rule; the larger the lift the greater the link between the two products.
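The three metrics can be computed directly from a list of transactions. The sketch below uses a small hypothetical set of five baskets and then repeats the lift arithmetic from the worked example above; the function names are illustrative and not taken from any particular library.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Share of transactions containing lhs that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Joint support divided by the support expected if lhs and rhs were independent."""
    joint = support(transactions, set(lhs) | set(rhs))
    return joint / (support(transactions, lhs) * support(transactions, rhs))

# Hypothetical baskets, each represented as a set of items.
transactions = [
    {"pencil", "paper", "rubber"},
    {"pencil", "paper"},
    {"pencil", "paper", "rubber"},
    {"rubber"},
    {"paper"},
]
lhs, rhs = {"pencil", "paper"}, {"rubber"}

rule_support = support(transactions, lhs | rhs)
rule_confidence = confidence(transactions, lhs, rhs)
rule_lift = lift(transactions, lhs, rhs)

# The lift from the worked example in the text: 0.025 / (0.1 * 0.08).
text_lift = 0.025 / (0.10 * 0.08)
```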

#### 19. What is association analysis? Where is it used?

Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction. The technique of association rules is widely used for retail basket analysis. It can also be used for classification by using rules with class labels on the right-hand side. It is even used for outlier detection with rules indicating infrequent/abnormal association.

Association analysis also helps us to identify cross-selling opportunities. For example, we can use the rules resulting from the analysis to place associated products together in a catalog, in the supermarket, or in the Web shop, or apply them when targeting a marketing campaign for product B at customers who have already purchased product A.

Association analysis determines these rules by using historic data to train the model. We can display and export the determined association rules.

#### 20. What is the central limit theorem? How is a normal distribution different from chi square distribution?

The central limit theorem (CLT) states that the distribution of a sample average tends toward a normal distribution as the sample size increases, regardless of the distribution from which the samples are drawn, except when the moments of the parent distribution do not exist. All practical distributions in statistical engineering have defined moments, and thus the CLT applies.

The chi-square distribution is built from standard normal variates, which are derived from the normal distribution. In statistical terms:

If X is normally distributed with mean μ and variance σ² > 0, then:

$$V = \left(\frac{X - \mu}{\sigma}\right)^2 = Z^2$$

is distributed as a chi-square random variable with 1 degree of freedom.

#### 21. What is a Z test, Chi Square test, F test and T test?

The **Z-test** is a statistical test based on the normal distribution, and it is generally used with large samples, where the sample size n ≥ 30.

It is used to determine whether two population means are different when the variances are known and the sample size is large. The test statistic is assumed to have a normal distribution and parameters such as standard deviation should be known in order for z-test to be performed.

A one-sample location test, two-sample location test, paired difference test and maximum likelihood estimate are examples of tests that can be conducted as z-tests

Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size. Also, t-tests assume that the standard deviation is unknown, while z-tests assume that it is known. If the standard deviation of the population is unknown, the assumption that the sample variance equals the population variance is made.

In R, the simple.z.test function (from the UsingR package) implements a z-test similar to the t.test function.

Usage:

    simple.z.test(x, sigma, conf.level=0.95)
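For comparison, a one-sample z-test can be sketched in Python using only the standard library. The function name and the numbers below are hypothetical; the test assumes the population standard deviation sigma is known, as described above.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test with known population standard deviation."""
    # Standardize the distance between the sample mean and the hypothesized mean.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 36 observations averaging 5.1, tested against a mean of 5.0.
z, p_value = one_sample_z_test(sample_mean=5.1, mu0=5.0, sigma=0.3, n=36)
```

With these numbers the standard error is 0.3 / 6 = 0.05, so the sample mean lies two standard errors above the hypothesized mean.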

The **T-test** assesses whether the means of two groups are statistically different from each other.

A two-sample t-test examines whether two samples are different and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size.

For example, a t-test could be used to compare the average floor routine score of the U.S. women's Olympic gymnastic team to the average floor routine score of China's women's team.

In R, t.test performs one- and two-sample t-tests on vectors of data.

Usage:

    t.test(x, ...)

    ## Default S3 method:
    t.test(x, y = NULL,
           alternative = c("two.sided", "less", "greater"),
           mu = 0, paired = FALSE, var.equal = FALSE,
           conf.level = 0.95, ...)

    ## S3 method for class 'formula'
    t.test(formula, data, subset, na.action, ...)

**Chi square** is a statistical test used to compare the observed data with the data that we would expect to obtain according to a specific hypothesis.

The formula for the chi-square test statistic is:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O_i are the observed counts and E_i are the expected counts.

In R, chisq.test performs chi-squared contingency table tests and goodness-of-fit tests.

Usage:

    chisq.test(x, y = NULL, correct = TRUE,
               p = rep(1/length(x), length(x)), rescale.p = FALSE,
               simulate.p.value = FALSE, B = 2000)

The **F-test** is designed to test if two population variances are equal. It does this by comparing the ratio of two variances. So, if the variances are equal, the ratio of the variances will be 1.

Usage:

    var.test(x, ...)

    ## Default S3 method:
    var.test(x, y, ratio = 1,
             alternative = c("two.sided", "less", "greater"),
             conf.level = 0.95, ...)

    ## S3 method for class 'formula'
    var.test(formula, data, subset, na.action, ...)

**22. What is Collaborative filtering?**

Collaborative filtering is the filtering process used by most recommender systems to find patterns or information by combining viewpoints, various data sources and multiple agents.

**23. What is the difference between Cluster and Systematic Sampling?**

Cluster sampling is a technique used when it becomes difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is traversed in a circular manner, so once you reach the end of the list, you continue from the top again. The best-known example of systematic sampling is the equal-probability method.
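Systematic sampling with a circular list can be sketched in a few lines. The function name and the frame of 100 numbered units below are hypothetical; the interval k is the frame size divided by the desired sample size.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick every k-th element of an ordered frame from a random start, wrapping around."""
    k = len(frame) // n                 # sampling interval
    start = rng.randrange(len(frame))   # random starting position
    # The modulo makes the list circular: past the end, continue from the top.
    return [frame[(start + i * k) % len(frame)] for i in range(n)]

frame = list(range(1, 101))  # ordered sampling frame of 100 units
sample = systematic_sample(frame, n=10, rng=random.Random(42))
```

Because every unit has the same chance of being the starting point, each unit has an equal probability of selection.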

**24. Are expected value and mean value different?**

They are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or a sample population, whereas expected value is generally used in the context of a random variable.

**For Sampling Data**

The mean value is the only value that comes from the sampling data.

The expected value is the mean of all the means, i.e. the value that is built from multiple samples. The expected value is the population mean.

**For Distributions**

The mean value and the expected value are the same irrespective of the distribution, provided that the distribution belongs to the same population.

**25. What does P-value signify about the statistical data?**

The p-value is used to determine the significance of results after a hypothesis test in statistics. It helps readers draw conclusions and always lies between 0 and 1. With the conventional 0.05 threshold:

A p-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.

A p-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

A p-value close to 0.05 is marginal, indicating it is possible to go either way.

**26. Do gradient descent methods always converge to same point?**

No, they do not, because in some cases the method reaches a local minimum (a local optimum) rather than the global optimum. The result depends on the data and the starting conditions.
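The dependence on starting conditions is easy to demonstrate. The sketch below runs plain gradient descent on a hypothetical one-dimensional function with two minima; the function, learning rate and step count are illustrative choices, not taken from the text.

```python
def gradient_descent(grad, x0, learning_rate=0.01, n_steps=1000):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * grad(x)
    return x

# f(x) = x^4 - 3x^2 + x has a global minimum near x = -1.30
# and a shallower local minimum near x = 1.13.
f = lambda x: x**4 - 3 * x**2 + x
f_grad = lambda x: 4 * x**3 - 6 * x + 1

x_left = gradient_descent(f_grad, x0=-2.0)   # starts left of the hump
x_right = gradient_descent(f_grad, x0=2.0)   # starts right of the hump
```

Both runs converge, but to different points: only the run started at -2.0 finds the global minimum.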

**27. A test has a true positive rate of 100% and false positive rate of 5%. There is a population with a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is the probability of having that condition?**

Let’s suppose you are being tested for a disease, if you have the illness the test will end up saying you have the illness. However, if you don’t have the illness- 5% of the times the test will end up saying you have the illness and 95% of the times the test will give accurate result that you don’t have the illness. Thus there is a 5% error in case you do not have the illness.

Out of 1000 people, the 1 person who has the disease will get a true positive result.

Out of the remaining 999 people, 5% will get a false positive result.

That is close to 50 people with a false positive result for the disease.

This means that out of 1000 people, about 51 will test positive for the disease even though only one person has the illness. The probability of having the disease given a positive test is therefore about 1 in 51, or roughly 2%.

**28. What is the difference between Supervised Learning and Unsupervised Learning?**

If an algorithm learns from labelled training data so that the knowledge can be applied to test data, it is referred to as supervised learning. Classification is an example of supervised learning. If there is no response variable, so the algorithm has no labelled examples to learn from beforehand, it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.

**29. What is the goal of A/B Testing?**

It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example of this could be measuring the click-through rate for a banner ad.

**30. What is an Eigenvalue and Eigenvector?**

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. The eigenvalue can be thought of as the strength of the transformation in the direction of the eigenvector, or the factor by which the stretching or compression occurs.
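A small numeric check makes the definition concrete. The 2 x 2 matrix below is a hypothetical example whose eigenvectors are (1, 1) and (1, -1): multiplying by the matrix only rescales each of them, by the eigenvalues 3 and 1 respectively.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

v1, v2 = [1, 1], [1, -1]

# A stretches v1 by a factor of 3 and leaves v2 unchanged (factor 1).
Av1 = matvec(A, v1)
Av2 = matvec(A, v2)
```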

**31. How can outlier values be treated?**

Outlier values can be identified by using univariate or any other graphical analysis method. If there are only a few outlier values, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outliers. The most common ways to treat outlier values are:

1) To change the value and bring it within a range.

2) To just remove the value.
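Bringing values within a range using the 1st and 99th percentiles can be sketched with the standard library. The function name and the data are hypothetical, and the "inclusive" percentile method is one of several reasonable choices.

```python
from statistics import quantiles

def cap_outliers(values):
    """Clamp values below the 1st percentile or above the 99th percentile."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[0], cuts[-1]
    return [min(max(v, low), high) for v in values]

# Hypothetical data: the integers 1..99 plus one extreme value.
values = list(range(1, 100)) + [10_000]
capped = cap_outliers(values)
```

Values in the middle of the distribution are left untouched; only the tails are pulled in to the percentile limits.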

**32. How can you assess a good logistic model?**

There are various methods to assess the results of a logistic regression analysis-

– Using a classification matrix to look at the true and false positives and negatives.

– Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.

– Lift helps assess the logistic model by comparing it with random selection.

**33. What are various steps involved in an analytics project?**

– Understand the business problem

– Explore the data and become familiar with it.

– Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.

– After data preparation, start running the model, analyse the result and tweak the approach. This is an iterative step till the best possible outcome is achieved.

– Validate the model using a new data set.

– Start implementing the model and track the result to analyse the performance of the model over the period of time.

**34. How can you iterate over a list and also retrieve element indices at the same time?**

This can be done in Python using the enumerate function, which takes a sequence such as a list and yields each element paired with its index, with the index first.
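A short sketch with a hypothetical list shows the pattern. The optional start argument changes the first index, which is handy for human-readable numbering.

```python
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs, with indices starting at 0.
indexed = []
for index, fruit in enumerate(fruits):
    indexed.append((index, fruit))

# Passing start=1 numbers the elements from 1 instead.
numbered = list(enumerate(fruits, start=1))
```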

**35. During analysis, how do you treat missing values?**

First identify the variables with missing values, then the extent of the missing values in each. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. There are various factors to be considered when answering this question:

– Understand the problem statement and the data before giving an answer. A default value, which can be the mean, minimum or maximum value, may be assigned. Getting into the data is important.

– If it is a categorical variable, the missing value is assigned a default value.

– If the data follow a normal distribution, use the mean value.

– Whether missing values should be treated at all is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
