Core Statistical Topics in Data Science Interviews
If you're preparing for a data science interview, expect to be quizzed on a range of foundational statistical concepts. The following sections break down essential topics, explain key distinctions, and provide practical approaches for handling real data scenarios.
Major Areas Frequently Examined in Interviews
- Central tendency (mean, median, mode)
- Variability (range, variance, standard deviation, interquartile range)
- Statistical relationships (covariance, correlation)
- Probability distributions
- Standardization and normalization
- Central Limit Theorem
- Sampling (population vs sample)
- Hypothesis testing

Exploring and Understanding Data
What is Exploratory Data Analysis (EDA)?
EDA refers to the process of visually and quantitatively investigating datasets to identify patterns, uncover relationships, and spot irregularities. It sets the foundation for further pre-processing, modeling, and interpretation.
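As a minimal illustration, a first EDA pass with pandas might look like the sketch below (the dataset and column names are fabricated):

```python
import pandas as pd

# Fabricated data; in practice you would load your own, e.g. pd.read_csv("data.csv")
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 29],
    "income": [40000, 52000, 61000, 58000, 45000],
})

df.info()               # column types and non-null counts
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
print(df.corr())        # pairwise correlations between numeric columns
```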
Qualitative vs Quantitative Data
- Quantitative: Numerical information that can be measured or counted; includes discrete data (like number of books) and continuous data (like weight).
- Qualitative: Non-numeric details describing qualities or categories, such as eye color or marital status. Can be nominal (no natural order) or ordinal (ordered categories).
Data Cleaning: Handling Incomplete and Outlier Data
Addressing Large Amounts of Missing Data
- For numeric columns with a normal spread, consider replacing missing entries with the mean or median.
- For categories, substitute with the most frequent value.
- If the dataset allows, apply KNN or regression-based imputation.
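A hedged sketch of these options, using pandas for simple imputation and scikit-learn's KNNImputer for the model-based variant (all data here is fabricated):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, 175.0],
    "weight": [65.0, 70.0, np.nan, 80.0, 72.0],
    "city": ["Oslo", "Oslo", None, "Bergen", "Oslo"],
})

# Numeric column: replace missing entries with the median (robust to outliers)
df["height"] = df["height"].fillna(df["height"].median())

# Categorical column: substitute the most frequent value (mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Model-based alternative: impute from the nearest neighbors in feature space
numeric = df[["height", "weight"]]
df[["height", "weight"]] = KNNImputer(n_neighbors=2).fit_transform(numeric)
```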
Approaches for Outliers
- Exclude them if they result from mistakes.
- Transform using log or square root to lessen their influence.
- Limit outlier values to specific percentiles (winsorization).
- Choose robust statistical techniques (e.g., median instead of mean).
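For instance, log transformation and winsorization might be applied as in this sketch (fabricated data; scipy's `winsorize` does the percentile capping):

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([3.0, 4.0, 5.0, 4.0, 6.0, 5.0, 120.0])  # 120 is an extreme value

# Log transform compresses large values; log1p also handles zeros safely
x_log = np.log1p(x)

# Winsorization: cap the lowest and highest 5% at those percentile values
x_wins = winsorize(x, limits=[0.05, 0.05])

print(x_log.round(2))
print(x_wins)
```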
When is Median Preferable to Mean?
The median gives a better sense of the central value when the data are skewed or contain extreme values, because it is resistant to outliers, whereas the mean can be pulled toward them.
Common Techniques for Outlier Detection
- Visual tools like boxplots, whose whiskers flag points beyond 1.5 × IQR from the quartiles, for quick outlier identification.
- Z-score method: any value more than 2 or 3 standard deviations from the mean may be flagged.
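A minimal sketch of both rules on a small fabricated sample:

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])  # fabricated data with one outlier

# Z-score rule: flag points more than 2 standard deviations from the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 2])

# IQR rule (what a boxplot uses): flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```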
Measures of Spread and Correlation
Frequent Measures of Dispersion
- Range: Difference between the largest and smallest numbers.
- Variance: Mean squared difference from the average.
- Standard Deviation: Square root of variance, represents spread in original units.
- Interquartile Range (IQR): Distance between the 75th percentile (Q3) and 25th percentile (Q1).
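All four measures can be computed in a few lines of NumPy (fabricated data):

```python
import numpy as np

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])

print(x.max() - x.min())      # range
print(np.var(x, ddof=1))      # sample variance (with Bessel's correction)
print(np.std(x, ddof=1))      # sample standard deviation, in original units
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)                # interquartile range
```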
Covariance vs. Correlation
| Feature | Covariance | Correlation |
| --- | --- | --- |
| Scale | In the units of measurement | Unitless, ranges between -1 and 1 |
| Interpretation | Shows direction, not magnitude | Shows both direction and strength |
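A quick numerical check of the distinction (fabricated data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

print(np.cov(x, y)[0, 1])       # covariance: sign gives direction, scale depends on units
print(np.corrcoef(x, y)[0, 1])  # correlation: unitless, between -1 and 1
```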
The Normal Distribution and Statistical Rules
Central Limit Theorem Simplified
The Central Limit Theorem explains that as the sample size increases, the distribution of the sample mean becomes approximately normal, regardless of the population's original shape, provided samples are independent and the population variance is finite.
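A small simulation can make this concrete; here the population is deliberately non-normal (exponential), yet the sample means cluster around the population mean with an approximately normal shape:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population that is clearly non-normal: an exponential distribution
population = rng.exponential(scale=2.0, size=100_000)

# Distribution of sample means for samples of size 50
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(np.mean(sample_means), np.std(sample_means))
```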
Key Traits of a Normal Curve
- Symmetrical, with mean, median, and mode at the center
- Shape characterized by mean (μ) and standard deviation (σ)
- Almost all data (99.7%) falls within three standard deviations from the center
The Empirical Rule
- 68% within 1 std dev of mean
- 95% within 2 std devs
- 99.7% within 3 std devs
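These percentages can be verified directly from the normal CDF:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean for a normal curve
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))
# Prints roughly 0.68, 0.95, 0.997
```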
Standardization with Z-scores
A z-score tells you how far (in standard deviations) a value is from the mean. It is calculated as:
Z = (X − μ) / σ, i.e., (Value − Mean) / Standard Deviation
Standardization allows comparisons across different variables or samples.
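A minimal sketch of standardization (the scores are fabricated):

```python
import numpy as np

scores = np.array([55.0, 60.0, 72.0, 85.0, 90.0])

# Each z-score is the value's distance from the mean, in standard deviations
z = (scores - scores.mean()) / scores.std(ddof=1)
print(z.round(2))
```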
Descriptive vs Inferential Statistics
- Descriptive: Focuses on summarizing and displaying information about collected data (e.g., averages, charts).
- Inferential: Uses sample data to make broader conclusions about a population (e.g., confidence intervals, regression, hypothesis testing).
Basic and Advanced Types of Analysis
- Univariate: Study of a single feature
- Bivariate: Examines two variables and their association
- Multivariate: Looks at three or more variables, often to uncover complex relationships or patterns
Dealing with Population and Sampling
- Population: All possible subjects/events of interest
- Sample: Subgroup drawn for analysis, ideally representative
Statistical Estimation and Intervals
- Point Estimate: Single-value guess of a population attribute (e.g., the sample mean for the population mean)
- Confidence Interval: Provides a range likely to contain the actual population parameter, at a defined confidence level (like 95%)
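As a sketch, a point estimate and a 95% confidence interval for a mean can be computed with scipy's t distribution (fabricated measurements):

```python
import numpy as np
from scipy import stats

x = np.array([5.1, 4.8, 5.4, 5.0, 4.9, 5.3, 5.2])

mean = x.mean()     # point estimate of the population mean
se = stats.sem(x)   # standard error of the mean
ci = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=se)
print(mean, ci)     # 95% confidence interval
```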
Understanding Errors and Testing Hypotheses
Types of Errors
- Type I: False positive—erroneously rejecting a true null hypothesis
- Type II: False negative—failing to reject a false null hypothesis
Hypothesis Testing Steps
- Specify the null and alternative hypotheses
- Pick a significance threshold (e.g., 0.05)
- Select an appropriate test (t-test, chi-square, etc.)
- Compute the test statistic and the p-value
- Compare p-value to significance threshold, decide on null hypothesis
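Here is a sketch of those steps for a two-sample t-test (the group measurements are fabricated):

```python
from scipy import stats

# Fabricated samples for two groups (e.g., A/B test measurements)
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.8, 13.1, 12.9, 13.4, 12.7, 13.0]

# H0: the group means are equal; H1: they differ
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(t_stat, p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```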
Bessel’s Correction: Adjusting Variance Calculation
With Bessel’s correction, the sample variance (as opposed to the population variance) divides the sum of squared deviations by n − 1 rather than n. This compensates for the fact that deviations are measured from the sample mean rather than the unknown population mean, which would otherwise bias the estimate downward.
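NumPy exposes this choice via the ddof argument:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.var(x))          # population variance: divides by n
print(np.var(x, ddof=1))  # sample variance with Bessel's correction: divides by n - 1
```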
Common Probability Distributions
| Category | Examples |
| --- | --- |
| Discrete | Bernoulli, Binomial, Poisson |
| Continuous | Normal, Uniform, Exponential |
Distribution Shapes and Skew
- Left-skewed: Tail extends left; order: mean < median < mode
- Right-skewed: Tail extends right; order: mean > median > mode
- Symmetric: Tails even on both sides
Skewness & Kurtosis: Asymmetry and Shape
- Skewness: Indicates asymmetry (positive for a right tail, negative for a left tail, zero for symmetry); commonly quantified with coefficients such as Pearson’s or Bowley’s
- Kurtosis: Describes tail weight, often loosely called “peakedness”: Mesokurtic (normal), Leptokurtic (heavy-tailed/peaked), Platykurtic (light-tailed/flat)
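Both can be computed with scipy (here on a deliberately right-skewed simulated sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=1.0, size=10_000)

print(stats.skew(right_skewed))      # positive: long right tail
print(stats.kurtosis(right_skewed))  # excess kurtosis: 0 for a normal distribution
```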
Comparing Descriptive Metrics
- Standard Deviation: Measures overall spread
- Standard Error: Variability of the sample mean from sample to sample; equals SD / √n
- Coefficient of Variation: (SD/mean) * 100%, enables comparison of spread across datasets with different scales
Sample Selection and Bias Handling
- Selection bias: Occurs when samples are not representative of the population. Types include sampling bias, attrition bias, and observer bias, among others.
Handling Missing Data
- Omitting missing entries (only if few)
- Imputation using statistical or model-based approaches
- Flagging missingness as a feature
Understanding Confidence and Significance
- Confidence level (like 95%) means that, across repeated samples, intervals constructed this way are expected to capture the true parameter that proportion of the time
- Significance level (alpha, often 0.05) is the probability of a Type I error
Types of Statistical Analysis
- T-test: Compares the means of two distinct groups
- ANOVA: Assesses means across three or more groups
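A one-way ANOVA sketch with scipy (fabricated group measurements):

```python
from scipy import stats

# Fabricated measurements from three groups
g1 = [20, 22, 19, 24, 25]
g2 = [28, 30, 27, 26, 29]
g3 = [18, 20, 22, 19, 24]

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs
```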
Assumptions for Regression and ANOVA
- Linear relationships
- Homogeneous variances (homoscedasticity)
- Normality of errors
- Independence of residuals
- Lack of excessive multicollinearity
Mistakes and Model Risks
- Overfitting: Model too closely matches sample noise; poor at predicting new data
- Underfitting: Model is too simple; fails to capture core patterns
When to Use Which Sampling Approach
- Simple random: Each item equally likely
- Stratified: Divide population into subgroups, then sample each
- Cluster: Randomly pick entire groups
- Systematic: Every nth item
- Convenience: Easiest access, but most prone to bias
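As one illustration of stratification in practice, scikit-learn's train_test_split can preserve class proportions when splitting data (fabricated, imbalanced labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)  # imbalanced classes, 75% / 25%

# Stratified split preserves the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(np.bincount(y_train), np.bincount(y_test))
```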
Assessing and Handling Multicollinearity
- Use Variance Inflation Factor (VIF); high values signal collinearity issues
- Check correlation matrix for highly correlated predictors
- Remediate by dropping variables, dimensionality reduction (PCA), or regularization
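A sketch of the VIF check using statsmodels, with x2 deliberately constructed to be nearly collinear with x1 (simulated data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.9 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})

X = sm.add_constant(df)  # include an intercept, as in the regression itself
for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(col, variance_inflation_factor(X.values, i))  # x1 and x2 come out large
```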
Goodness-of-Fit and Model Assessment
- R-squared/Adjusted R-squared: Proportion of variance explained by the model
- Residual analysis: Visual checks for abnormal patterns
- Information criteria: AIC and BIC balance model fit and complexity
- Cross-validation: Validates out-of-sample performance
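A minimal sketch pulling several of these metrics from a fitted statsmodels OLS model (simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

print(model.rsquared, model.rsquared_adj)  # proportion of variance explained
print(model.aic, model.bic)                # fit vs. complexity trade-off
```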
Applied and Advanced Concepts
- Pareto Principle: 80% of consequences often originate from 20% of causes
- Bootstrapping: Resampling with replacement to estimate parameter variability (a sketch follows this list)
- Jackknife: Systematic resampling by leaving out observations to assess bias or variance
- PCA vs Factor Analysis: PCA finds directions of max variance; Factor Analysis seeks latent factors explaining correlations
- Survival Analysis: Focuses on time-to-event; methods include Kaplan-Meier and Cox models
- Markov Chains: System transitions depend solely on current state, not on sequence of events that preceded it
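As referenced above, here is a minimal bootstrap sketch: resampling with replacement to obtain a percentile-based confidence interval for the mean (simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)

# Resample with replacement many times and record the statistic of interest
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(5_000)]

# Percentile-based 95% confidence interval for the mean
print(np.percentile(boot_means, [2.5, 97.5]))
```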
Model Evaluation: Imbalanced Data, ROC vs PR AUC
- ROC-AUC: Evaluates overall classification performance, less informative with severe class imbalance
- PR-AUC: Focuses on precision and recall, better suited for unbalanced datasets
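A sketch comparing the two metrics on a heavily imbalanced, fabricated dataset with scikit-learn:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Heavily imbalanced labels: roughly 5% positives
y_true = (rng.random(5_000) < 0.05).astype(int)

# Fabricated scores that are mildly informative (positives shifted upward)
y_score = 0.3 * y_true + rng.random(5_000)

print(roc_auc_score(y_true, y_score))            # can look optimistic under imbalance
print(average_precision_score(y_true, y_score))  # PR-AUC, focused on the positive class
```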
Key Statistical Definitions
| Term | Meaning |
| --- | --- |
| Bessel’s Correction | n-1 divisor for unbiased sample variance |
| Z-score | Standardized measure of deviation from the mean |
| Gini Coefficient | Inequality measure; 0 means perfect equality, 1 maximal disparity |
| F-Statistic | Ratio of variances used in ANOVA |
| P-value | Probability of results at least as extreme as those observed, assuming the null hypothesis is true |
Conclusion
Mastering these fundamental and advanced statistical principles equips you to tackle a range of data science interview questions confidently. Use visualization, robust statistical methods, and a clear understanding of analysis assumptions to stand out.