Top 50 Python Data Science Interview Questions and Answers
If you are getting into Data Science, Python is your best friend. Why? It's simple, powerful, and packed with libraries
like NumPy, Pandas, and Scikit-learn that make data analysis, visualization, and machine learning easy.
Now, if you're preparing for a Python Data Science interview, expect a mix of Python fundamentals, data manipulation, statistics, and machine learning concepts. You might face questions like:
How do you handle missing data in Pandas?
What’s the difference between a list and a NumPy array?
How does the groupby function work in Pandas?
Explain the difference between shallow and deep copy in Python.
What is broadcasting in NumPy?
These questions test your problem-solving skills and your understanding of data structures in Python. So, brush up on your Python basics, get comfortable with data manipulation, and practice writing efficient code. In this tutorial, we will walk through these and many more Python Data Science interview questions and answers.
Python Data Science Interview Questions for Entry-Level (0-1 year experience)
Python Basics & Data Structures
Q 1. What are Python’s key features that make it popular in data science?
1. Easy to Learn & Use
Python has a simple syntax that’s close to natural language, making it beginner-friendly and easy to write, read,
and debug.
2. Rich Ecosystem of Libraries
Python has a vast collection of pre-built libraries specifically for Data Science, including:
NumPy – Numerical computing and arrays
Pandas – Data manipulation and analysis
Matplotlib & Seaborn – Data visualization
Scikit-learn – Machine learning algorithms
TensorFlow & PyTorch – Deep learning
3. Platform Independence
Python is cross-platform, meaning you can run the same code on Windows, macOS, or Linux without
modification.
4. Strong Community Support
A large global community contributes to continuous improvements, vast documentation, and active forums like Stack Overflow and GitHub.
5. Integration with Other Tools
Python integrates well with SQL databases, cloud platforms, and big data tools like Apache Spark and Hadoop.
6. Versatility in Data Science & AI
It supports data wrangling, statistical analysis, machine learning, deep learning, and automation, making it a one-stop solution for all Data Science needs.
7. Automation & Scripting
Python is great for automating repetitive tasks such as data cleaning, web scraping, and ETL (Extract, Transform, Load) processes.
8. Scalability & Performance
With optimizations like Cython, NumPy (vectorized operations), and multiprocessing, Python handles large-scale data efficiently.
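As a quick illustration of the vectorization point (this timing sketch is not from the original answer and assumes NumPy is installed):
import numpy as np
import time
data = np.arange(1_000_000, dtype=np.int64)
start = time.time()
squared_loop = [x ** 2 for x in data]   # Plain Python loop over one million items
print("Loop:", round(time.time() - start, 3), "seconds")
start = time.time()
squared_vec = data ** 2                 # Vectorized NumPy operation on the whole array
print("Vectorized:", round(time.time() - start, 3), "seconds")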
Q2. What is the difference between a list and a tuple?
Difference Between List and Tuple in Python:
Feature | List | Tuple
Mutability | Mutable (can be changed) | Immutable (cannot be changed)
Syntax | list = [1, 2, 3] (square brackets) | tuple = (1, 2, 3) (parentheses)
Performance | Slower due to dynamic resizing | Faster due to fixed size
Memory Usage | Uses more memory | Uses less memory
Modification | Can add, remove, or modify elements | Cannot modify elements
Use Case | When data needs to change frequently | When data should remain constant (e.g., coordinates, configurations)
Example:
1. List (Mutable)
my_list = [1, 2, 3]
my_list.append(4)  # Allowed
print(my_list)  # Output: [1, 2, 3, 4]
Output
[1, 2, 3, 4]
2. Tuple (Immutable)
my_tuple = (1, 2, 3)
my_tuple[0] = 10  # TypeError: 'tuple' object does not support item assignment
When to Use?
Use a list when you need a modifiable sequence.
Use a tuple when you need a fixed sequence for faster execution and memory efficiency.
Q3. How do you handle missing values in Pandas?
1. Detect Missing Values
Use .isnull() or .notnull() to check for missing values.
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]}
df = pd.DataFrame(data)
print(df.isnull())  # Returns True for missing values
print(df.notnull())  # Returns True for non-missing values
2. Remove Missing Values
Use .dropna() to remove rows or columns with missing values.
df.dropna()  # Removes rows with NaN values
df.dropna(axis=1)  # Removes columns with NaN values
3. Fill Missing Values
Use .fillna() to replace missing values.
df.fillna(0)  # Replaces all NaNs with 0
df.fillna(df.mean())  # Replaces NaNs with column mean
df.fillna(method='ffill')  # Forward fill (previous value)
df.fillna(method='bfill')  # Backward fill (next value)
4. Replace Specific Values
Use .replace() to replace specific missing values.
df.replace(to_replace=None, value=0, inplace=True)
5. Interpolate Missing Values
Use .interpolate() to estimate missing values based on available data.
df.interpolate(method='linear')  # Linear interpolation
Q4. Explain the difference between shallow copy and deep copy in Python.
In Python, copying an object can be done using shallow copy or deep copy:
Shallow Copy (copy.copy()): Creates a new object but does not create copies of nested objects. Instead, it references them.
Deep Copy (copy.deepcopy()): Creates a completely independent copy of the original object, including all nested objects.
import copy
list1 = [[1, 2], [3, 4]]
shallow_copy = copy.copy(list1)
shallow_copy[0][0] = 99
print(list1)  # Output: [[99, 2], [3, 4]]
deep_copy = copy.deepcopy(list1)
deep_copy[0][0] = 42
print(list1)  # Output: [[99, 2], [3, 4]]
Output
[[99, 2], [3, 4]]
[[99, 2], [3, 4]]
Q5. What is the difference between is and == in Python?
is checks memory reference, while == checks values.
a = [1, 2, 3]
b = a
c = [1, 2, 3]
print(a == c)  # True
print(a is c)  # False
print(a is b)  # True
Q6. How does Python’s memory management work?
Python uses automatic garbage collection, reference counting, and memory pools.
import sys
x = [1, 2, 3]
print(sys.getrefcount(x))
Q7. What are Python’s built-in data types?
Python has various built-in data types, including:
Numeric Types: int, float, complex
Sequence Types: list, tuple, range
Text Type: str
Set Types: set, frozenset
Mapping Type: dict
Boolean Type: bool
Binary Types: bytes, bytearray, memoryview
Q8. What is the difference between mutable and immutable data types?
1. Mutable: Can be changed after creation (e.g., list, dict, set).
lst = [1, 2, 3]
lst[0] = 99  # Allowed
2. Immutable: Cannot be changed after creation (e.g., int, str, tuple).
tup = (1, 2, 3)
tup[0] = 99  # Error
Q9. Explain the difference between a Python set and a dictionary.
1. Set: Unordered collection of unique elements.
s = {1, 2, 3, 4}
2. Dictionary: Stores key-value pairs.
d = {"name": "Alice", "age": 25}
Q10. What is the difference between a generator and a list comprehension?
1. List Comprehension: Returns a list, storing all elements in memory.
lst = [x**2 for x in range(5)]
print(lst)  # [0, 1, 4, 9, 16]
2. Generator: Returns an iterator that generates values on the fly, saving memory.
gen = (x**2 for x in range(5))
print(next(gen))  # 0
print(next(gen))  # 1
Data Processing & Analysis
Q16. What are Pandas DataFrames, and how do they differ from Series?
DataFrame: A 2D labeled data structure similar to a table with rows and columns.
Series: A 1D labeled array, essentially a single column of a DataFrame.
import pandas as pd
s = pd.Series([10, 20, 30])
df = pd.DataFrame({"A": [10, 20, 30], "B": [40, 50, 60]})
Q17. How do you merge, join, and concatenate data in Pandas?
merge(): Used for database-style joins (inner, outer, left, right).
join(): Similar to merge() but works with index-based joins.
concat(): Combines data along an axis.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
result = pd.concat([df1, df2])
merged = df1.merge(df2, on="A", how="inner")
Q18. Explain the difference between .iloc[] and .loc[] in Pandas.
1. iloc[]: Selects rows/columns by integer location (position-based).
2. loc[]: Selects rows/columns by label (name-based).
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['x', 'y', 'z'])
print(df.iloc[0])  # First row using integer position
print(df.loc['x'])  # First row using label
Q19. What are lambda functions in Python?
Lambda functions in Python are anonymous (one-liner) functions defined using the lambda keyword.
add = lambda x, y: x + y
print(add(3, 5))  # Output: 8
Output
8
Q20. How do you handle duplicate values in Pandas?
Use drop_duplicates() to remove duplicates.
df = pd.DataFrame({"A": [1, 2, 2, 3], "B": [4, 5, 5, 6]})
df = df.drop_duplicates()
Q21. What is the role of the apply() function in Pandas?
apply() applies a function to rows or columns.
df["A_squared"] = df["A"].apply(lambda x: x**2)
Q22. How do you optimize performance in Pandas for large datasets?
Use vectorized operations.
Convert data types to save memory.
Use chunk processing for large files.
df["A"] = df["A"].astype("int8")
Q23. Explain the difference between .any() and .all() functions in Pandas.
.any() returns True if any value is True; .all() returns True only if all values are True.
df = pd.DataFrame({"A": [True, False, True], "B": [True, True, True]})
print(df.any())  # Checks if any value is True in each column
print(df.all())  # Checks if all values are True in each column
Q24. How do you convert categorical data into numerical form?
Use pd.get_dummies() for one-hot encoding.
df = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
df_encoded = pd.get_dummies(df, columns=["Color"])
Q25. What is the difference between .pivot() and .pivot_table() in Pandas?
.pivot() reshapes data but requires unique index/column combinations. .pivot_table() works even with duplicate values and allows aggregation.
df = pd.DataFrame({"Date": ["2024-01", "2024-01", "2024-02"],
                   "Category": ["A", "B", "A"],
                   "Value": [100, 200, 150]})
pivot = df.pivot(index="Date", columns="Category", values="Value")
pivot_table = df.pivot_table(index="Date", columns="Category", values="Value", aggfunc="sum")
Machine Learning & Statistical Computing
Q26. What are the different types of probability distributions used in Data Science?
Some common probability distributions include Normal, Binomial, Poisson, Exponential, and Uniform distributions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x = np.linspace(-3, 3, 100)
plt.plot(x, norm.pdf(x))
plt.title("Normal Distribution")
plt.show()
Q27. Explain the concept of feature scaling in machine learning.
Feature scaling ensures numerical features are on the same scale using Standardization or Min-Max Scaling.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform([[10], [20], [30]])
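Since the answer also mentions Min-Max Scaling, a comparable illustrative snippet using scikit-learn's MinMaxScaler (not part of the original answer) would be:
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()  # Rescales each feature to the [0, 1] range
print(min_max.fit_transform([[10], [20], [30]]))  # [[0.], [0.5], [1.]]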
Q28. How does NumPy handle multidimensional arrays?
NumPy uses ndarray objects to handle multidimensional data.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)  # (2, 3)
print(arr[:, 1])  # Second column
Q29. What is the difference between mean, median, and mode?
Mean is the average, Median is the middle value, and Mode is the most frequent value.
import numpy as np
from scipy import stats
data = [1, 2, 2, 3, 4]
print(np.mean(data))  # Mean
print(np.median(data))  # Median
print(stats.mode(data))  # Mode
Q30. How do you implement linear regression using NumPy?
import numpy as np
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
A = np.vstack([X, np.ones(len(X))]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"Slope: {m}, Intercept: {c}")
Q31. Explain the difference between supervised and unsupervised learning.
Supervised learning uses labeled data, while Unsupervised learning finds patterns without labels.
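As a rough, illustrative contrast (not from the original article), scikit-learn can show both paradigms on the same toy data:
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
X = [[1], [2], [8], [9]]
y = [0, 0, 1, 1]                      # Labels available -> supervised
clf = LogisticRegression().fit(X, y)  # Learns a mapping from X to y
print(clf.predict([[1.5], [8.5]]))    # Expected: [0 1]
km = KMeans(n_clusters=2, n_init=10).fit(X)  # No labels -> unsupervised
print(km.labels_)                     # Cluster assignments discovered from X alone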
Q32. How do you calculate the correlation between two variables in Python?
import pandas as pd
data = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8]}
df = pd.DataFrame(data)
print(df.corr())
Q33. What is the purpose of the scipy.stats module?
The scipy.stats module provides statistical functions like probability distributions and hypothesis testing.
from scipy import stats
data = [1, 2, 3, 4, 5]
print(stats.ttest_1samp(data, 3))  # One-sample t-test
Q34. How do you handle imbalanced datasets in Python?
Methods include oversampling (SMOTE), undersampling, and class weighting.
from imblearn.over_sampling import SMOTE
X, y = [[1], [2], [3], [4]], [0, 0, 1, 1]
smote = SMOTE(k_neighbors=1)  # Toy data has only two samples per class, so reduce neighbors
X_resampled, y_resampled = smote.fit_resample(X, y)
Q35. How do you implement a k-means clustering algorithm using Python?
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [5, 8], [8, 8]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
Python Data Science Interview Questions for Advanced-Level (5+ years experience)
Performance Optimization & Best Practices
Q36. How do you handle memory-efficient operations in NumPy?
To optimize memory usage in NumPy:
Use appropriate data types (e.g., int8 instead of int64).
Leverage views instead of copies.
Use in-place operations to reduce memory overhead.
Utilize NumPy’s broadcasting to avoid large intermediate arrays.
import numpy as np
arr = np.array([1, 2, 3, 4], dtype=np.int8)
arr += 2  # In-place, memory-efficient operation
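To illustrate the views-versus-copies and broadcasting points above, a small sketch (illustrative, not from the original answer) might look like:
import numpy as np
big = np.arange(1_000_000, dtype=np.int32)
view = big[::2]           # Basic slicing returns a view that shares memory with the original
print(view.base is big)   # True: no data was copied
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
grid = col + row          # Broadcasting builds the 3x4 result without tiling either input
print(grid.shape)         # (3, 4)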
Q37. What are some techniques to speed up Pandas operations?
Use vectorized operations instead of loops.
Convert object data types to categorical for efficiency.
Use chunk processing for large datasets.
Enable multi-threading with modin.pandas.
df = pd.DataFrame({"A": [1, 2, 3, 4]})
df["A"] = df["A"] * 2  # Faster than using apply()
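For the categorical-dtype point, one illustrative sketch (the column name and values are made up) is:
import pandas as pd
df = pd.DataFrame({"city": ["Delhi", "Pune", "Delhi", "Pune"] * 25_000})
print(df["city"].memory_usage(deep=True))   # Object dtype: comparatively large
df["city"] = df["city"].astype("category")
print(df["city"].memory_usage(deep=True))   # Category dtype: much smaller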
Q38. How does Python’s Global Interpreter Lock (GIL) impact performance?
The Global Interpreter Lock (GIL) restricts execution to one thread at a time, meaning:
CPU-bound tasks do not benefit from multithreading.
I/O-bound tasks can still benefit from multithreading.
Multiprocessing is better for CPU-heavy tasks.
import threading
def compute():
    sum([i**2 for i in range(1000000)])
threads = [threading.Thread(target=compute) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
Q39. Explain multithreading vs multiprocessing in Python.
Feature | Multithreading | Multiprocessing
Execution | Multiple threads in a single process | Multiple separate processes
Best For | I/O-bound tasks | CPU-bound tasks
GIL Effect | Affected | Not affected
Memory Usage | Shares memory space | Each process has separate memory
Example: Multithreading (I/O-bound task)
import threading, time
def task():
    time.sleep(2)
    print("Task complete")
threads = [threading.Thread(target=task) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
Example: Multiprocessing (CPU-bound task)
import multiprocessing
def compute():
    sum([i**2 for i in range(1000000)])
processes = [multiprocessing.Process(target=compute) for _ in range(4)]
for p in processes: p.start()
for p in processes: p.join()
Q40. How do you parallelize computations in Python?
Python provides several ways to parallelize computations:
Use multiprocessing for CPU-bound tasks.
Use joblib for parallel loops.
Use concurrent.futures for efficient parallelism.
Use Dask for parallel Pandas-like operations.
Example: Parallel processing with concurrent.futures
import concurrent.futures
def compute(n):
    return sum(i**2 for i in range(n))
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(compute, [1000000, 2000000, 3000000])
print(list(results))
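For the joblib option, a minimal sketch (assuming joblib is installed; the square function is made up for illustration) could be:
from joblib import Parallel, delayed
def square(n):
    return n ** 2
# Runs the loop body across 2 worker processes
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]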
Deep Learning & Advanced Machine Learning
Q41. What are the different ways to preprocess text data for NLP?
Text preprocessing includes:
Tokenization
Lowercasing
Removing Stopwords
Stemming and Lemmatization
Removing Punctuation
Vectorization (TF-IDF, Bag-of-Words, Word Embeddings)
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("stopwords")
text = "Natural Language Processing is amazing!"
tokens = word_tokenize(text.lower())
clean_tokens = [word for word in tokens if word not in stopwords.words("english")]
print(clean_tokens)
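For the vectorization step mentioned above, a short illustrative TF-IDF example with scikit-learn (the sentences are made up) is:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["data science with python", "python for machine learning"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # Sparse matrix of TF-IDF weights
print(vectorizer.get_feature_names_out())
print(tfidf.toarray())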
Q42. How do you implement a decision tree from scratch in Python?
Steps:
Calculate Gini Impurity or Entropy
Choose the best feature to split the dataset
Recursively build the tree
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Sample dataset
data = {'Feature': [0, 1, 1, 0, 1], 'Label': [0, 1, 1, 0, 1]}
df = pd.DataFrame(data)
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(df[['Feature']], df['Label'])
print(clf.predict([[1]]))
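The snippet above leans on scikit-learn; a from-scratch sketch of just the first step (computing Gini impurity, with made-up labels) could look like:
import numpy as np
def gini_impurity(labels):
    # Gini = 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
print(gini_impurity([0, 1, 1, 0, 1]))  # 0.48 for a 2-vs-3 split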
Q43. Explain the role of TensorFlow and PyTorch in Deep Learning.
Feature | TensorFlow | PyTorch
Developed By | Google | Facebook
Computation Graph | Static & Dynamic | Dynamic
Best For | Production | Research
import torch
import torch.nn as nn
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(2, 1)
    def forward(self, x):
        return self.fc1(x)
model = SimpleNN()
print(model)
Q44. How does a convolutional neural network (CNN) process images?
Key CNN layers:
Convolutional Layer - Extracts features
ReLU - Activation function
Pooling Layer - Reduces dimensionality
Fully Connected Layer - Classification
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])
model.summary()
Q45. What are hyperparameter tuning techniques in Python?
Common methods:
Grid Search
Random Search
Bayesian Optimization
HyperOpt & Optuna
Genetic Algorithms
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, None]}
clf = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
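For the Random Search option, an analogous illustrative sketch is shown below; it assumes the same X_train and y_train as the grid-search example above:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
param_dist = {'n_estimators': [10, 50, 100, 200], 'max_depth': [3, 5, 10, None]}
# Samples 5 random parameter combinations instead of trying every one
search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)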
Big Data & Scalability
Q46. How do you work with large datasets that don’t fit in memory?
Use Chunking in Pandas
Utilize Dask for parallel processing
Store data in Databases instead of memory
Use Apache Spark for distributed computing
import pandas as pd
chunk_size = 10000
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    process(chunk)
Q47. What are some Python libraries for handling big data?
Pandas
Dask
Vaex
PySpark
Modin
import dask.dataframe as dd
df = dd.read_csv("large_dataset.csv")
df.groupby('column_name').mean().compute()
Q48. Explain how Apache Spark integrates with Python (PySpark).
PySpark is the Python API for Apache Spark, enabling distributed computing.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BigDataProcessing").getOrCreate()
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
df.show()
Q49. How do you deploy machine learning models using Flask or FastAPI?
1. Using Flask:
from flask import Flask, request, jsonify
import pickle
app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['features']
    prediction = model.predict([data])
    return jsonify({'prediction': prediction.tolist()})
if __name__ == "__main__":
    app.run(debug=True)
2. Using FastAPI:
from fastapi import FastAPI
import pickle
app = FastAPI()
model = pickle.load(open("model.pkl", "rb"))
@app.post("/predict/")
async def predict(features: list):
    prediction = model.predict([features])
    return {"prediction": prediction.tolist()}
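Once either service is running, a client could call the endpoint roughly like this (illustrative only; the URL, port, and payload shape are assumptions based on the Flask snippet above):
import requests
response = requests.post("http://127.0.0.1:5000/predict", json={"features": [1.5, 2.3]})
print(response.json())  # e.g., {"prediction": [...]}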
Q50. How do you schedule and automate data pipelines in Python?
Common tools for scheduling data pipelines:
Apache Airflow
Luigi
Cron Jobs
Prefect
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
def process_data():
    print("Data Processing Task Executed!")
dag = DAG('data_pipeline', schedule_interval='@daily', start_date=datetime(2024, 1, 1))
task = PythonOperator(task_id='process_task', python_callable=process_data, dag=dag)
Conclusion
In this article, we covered fundamental and advanced Python interview questions and answers for data science related to data handling, memory management, deep learning, and big data technologies like Apache Spark and PySpark. Understanding these topics will not only help you ace interviews but also enable you to build efficient and scalable data-driven applications.