We take a look at 5 common interview questions Data Scientists can expect to face when interviewing for a new job. We’ve also selected some of the best advice on how to answer these kinds of questions.
Interview Questions for Data Scientists
1. What is selection bias, why is it important and how can you avoid it?
Answer by Matthew Mayo: “Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that probability could be the determining predictive factor. Avoiding non-random samples is the best way to deal with bias. However, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.”
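To make the weighting idea concrete, here is a minimal sketch in Python. It assumes the 60/20/15/5 sample from the answer above and a population in which the four classes are equally common. The class names `A` to `D` and the helper `class_weights` are hypothetical, introduced only for illustration.

```python
from collections import Counter

# Hypothetical biased sample: 100 cases with a 60/20/15/5 class split,
# although the four classes are assumed equally common in the population.
sample_labels = ["A"] * 60 + ["B"] * 20 + ["C"] * 15 + ["D"] * 5

# Assumed true population shares (equal for all four classes).
population_share = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}


def class_weights(labels, population_share):
    """Return one weight per class: population share / sample share."""
    counts = Counter(labels)
    n = len(labels)
    return {c: population_share[c] / (counts[c] / n) for c in counts}


weights = class_weights(sample_labels, population_share)

# After weighting, every class contributes the same total weight,
# which mirrors the balanced population instead of the skewed sample.
weighted_totals = {c: weights[c] * sample_labels.count(c) for c in weights}
print(weights)
print(weighted_totals)
```

Under-represented classes receive larger weights, so a model trained with these weights is less likely to treat the skewed class frequencies as a real signal. Weighting only corrects for the imbalance you know about. It cannot recover information about cases that were never sampled.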
2. Which data scientists do you admire most?
This, of course, does not have any one right answer. But do your research: show that you are passionate about the industry and that you understand, and have learnt from, some of the best in the business. Do you know much about the major data scientists in the industry? If not, this is certainly worth researching before heading to the interview.
3. Explain what resampling methods are and why they are useful. Also, explain their limitations.
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology based on repeated sampling within the same sample.
Resampling refers to methods for doing one of the following:
- Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
- Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
- Validating models by using random subsets (bootstrapping, cross-validation)
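The first two items in the list above can be sketched with the Python standard library alone. This is an illustrative sketch, not a reference implementation: the function names, the default of 2000 resamples, and the fixed seed are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_se(data, stat, n_resamples=2000, seed=0):
    """Estimate the standard error of `stat` by bootstrapping.

    Each resample draws len(data) points from data *with replacement*.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Approximate two-sided permutation test on the difference in means.

    Group labels are exchanged by shuffling the pooled data; the p-value is
    the share of shuffles at least as extreme as the observed difference.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up measurements, purely for illustration.
control = [1, 2, 3, 4, 5, 6, 7, 8]
treated = [11, 12, 13, 14, 15, 16, 17, 18]

print(bootstrap_se(control, statistics.median))
print(permutation_p_value(control, treated))
```

Neither function assumes a theoretical sampling distribution, which is the main appeal of resampling. The cost is repeated computation, and the results are only as good as the original sample, because every resample is drawn from it.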
4. Python or R – Which one would you prefer for text analytics?
“The best possible answer for this would be Python. Because it has Pandas library that provides easy to use data structures and high-performance data analysis tools” (De Zyre). However, as usual with questions like these
5. What is the difference between Supervised Learning and Unsupervised Learning?
Answer from De Zyre: “If an algorithm learns something from the training data so that the knowledge can be applied to the test data, then it is referred to as Supervised Learning. Classification is an example of Supervised Learning. If the algorithm does not learn anything beforehand because there is no response variable or any training data, then it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.”
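The contrast can be shown with a toy sketch in plain Python. The data, the labels `small` and `large`, and both helper functions are hypothetical. The classifier is a simple 1-nearest-neighbour rule and the clustering is a two-centroid k-means on one-dimensional points, chosen only because they are short enough to read at a glance.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: use labelled training points to label a new point."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


def two_means(points, n_iter=10):
    """Unsupervised: split unlabelled 1-D points into two clusters.

    Assumes the points contain at least two distinct values.
    """
    lo, hi = min(points), max(points)  # initial centroids
    for _ in range(n_iter):
        left = [p for p in points if abs(p - lo) <= abs(p - hi)]
        right = [p for p in points if abs(p - lo) > abs(p - hi)]
        lo = sum(left) / len(left)
        hi = sum(right) / len(right)
    return left, right


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]

# Supervised: the labels (the response variable) guide the prediction.
print(nearest_neighbor_predict(values, labels, 1.1))
print(nearest_neighbor_predict(values, labels, 4.5))

# Unsupervised: the same values, but no labels are supplied at all.
print(two_means(values))
```

The classifier needs `labels` to work. The clustering function only ever sees `values` and still recovers the two groups, but it cannot name them.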