Top 50 Python Data Science Interview Questions and Answers
Python Data Science Interview
If you are getting into Data Science, Python is your best friend. Why? It's simple, powerful, and packed with libraries
like NumPy, Pandas, and Scikit-learn that make data analysis, visualization, and machine learning easy.
Python Data Science Interview Questions for Entry-Level (0-1 year experience)
Now, if you're preparing for a Python Data Science interview, expect a mix of Python fundamentals, data
manipulation, statistics, and machine learning concepts. You might face questions like:
How do you handle missing data in Pandas?
What’s the difference between a list and a NumPy array?
How does the groupby function work in Pandas?
Explain the difference between shallow and deep copy in Python.
What is broadcasting in NumPy?
These questions test your problem-solving skills and your understanding of Python's data structures. So brush up on your Python basics, get comfortable with data manipulation, and practice writing efficient code. In this tutorial, we will walk through these and many more Python Data Science interview questions and answers.
Python Basics & Data Structures
Q1. What are Python’s key features that make it popular in data science?
1. Easy to Learn & Use
Python has a simple syntax that’s close to natural language, making it beginner-friendly and easy to write, read,
and debug.
2. Rich Ecosystem of Libraries
Python has a vast collection of pre-built libraries specifically for Data Science, including:
NumPy – Numerical computing and arrays
Pandas – Data manipulation and analysis
Matplotlib & Seaborn – Data visualization
Scikit-learn – Machine learning algorithms
TensorFlow & PyTorch – Deep learning
3. Platform Independence
Python is cross-platform, meaning you can run the same code on Windows, macOS, or Linux without
modification.
4. Strong Community Support
A large global community contributes to continuous improvements, vast documentation, and active forums like Stack Overflow and GitHub.
5. Integration with Other Tools
Python integrates well with SQL databases, cloud platforms, and big data tools like Apache Spark and Hadoop.
6. Versatility in Data Science & AI
It supports data wrangling, statistical analysis, machine learning, deep learning, and automation, making it a one-stop solution for all Data Science needs.
7. Automation & Scripting
Python is great for automating repetitive tasks such as data cleaning, web scraping, and ETL (Extract, Transform, Load) processes.
8. Scalability & Performance
With optimizations like Cython, NumPy (vectorized operations), and multiprocessing, Python handles large-scale data efficiently.
Q2. What is the difference between a list and a tuple?
Difference Between List and Tuple in Python:

| Feature | List | Tuple |
| --- | --- | --- |
| Mutability | Mutable (can be changed) | Immutable (cannot be changed) |
| Syntax | list = [1, 2, 3] (square brackets) | tuple = (1, 2, 3) (parentheses) |
| Performance | Slower due to dynamic resizing | Faster due to fixed size |
| Memory Usage | Uses more memory | Uses less memory |
| Modification | Can add, remove, or modify elements | Cannot modify elements |
| Use Case | When data needs to change frequently | When data should remain constant (e.g., coordinates, configurations) |

1. List (Mutable)
Example:
my_list = [1, 2, 3]
my_list.append(4) # Allowed
print(my_list) # Output: [1, 2, 3, 4]
Output
[1, 2, 3, 4]
Try it Yourself >>
2. Tuple (Immutable)
my_tuple = (1, 2, 3)
my_tuple[0] = 10 # TypeError: 'tuple' object does not support item assignment
When to Use?
Use a list when you need a modifiable sequence.
Use a tuple when you need a fixed sequence for faster execution and memory efficiency.
Q3. How do you handle missing values in Pandas?
1. Detect Missing Values
Use .isnull() or .notnull() to check for missing values.
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]}
df = pd.DataFrame(data)
print(df.isnull()) # Returns True for missing values
print(df.notnull()) # Returns True for non-missing values
2. Remove Missing Values
Use .dropna() to remove rows or columns with missing values.
df.dropna() # Removes rows with NaN values
df.dropna(axis=1) # Removes columns with NaN values
3. Fill Missing Values
Use .fillna() to replace missing values.
df.fillna(0) # Replaces all NaNs with 0
df.fillna(df.mean()) # Replaces NaNs with column mean
df.fillna(method='ffill') # Forward fill (previous value)
df.fillna(method='bfill') # Backward fill (next value)
4. Replace Specific Values
Use .replace() to replace specific values, including NaN.
import numpy as np
df.replace(to_replace=np.nan, value=0, inplace=True)
5. Interpolate Missing Values
Use .interpolate() to estimate missing values based on available data.
df.interpolate(method='linear') # Linear interpolation
Try it Yourself >>
Q4. Explain the difference between shallow copy and deep copy in Python.
In Python, copying an object can be done using shallow copy or deep copy:
Shallow Copy (copy.copy()): Creates a new object but does not create copies of nested objects. Instead, it references them.
Deep Copy (copy.deepcopy()): Creates a completely independent copy of the original object, including all nested objects.
import copy
list1 = [[1, 2], [3, 4]]
shallow_copy = copy.copy(list1)
shallow_copy[0][0] = 99 # Also changes list1, because the nested lists are shared
print(list1) # Output: [[99, 2], [3, 4]]
deep_copy = copy.deepcopy(list1)
deep_copy[0][0] = 42 # Does not affect list1
print(list1) # Output: [[99, 2], [3, 4]]
Output
[[99, 2], [3, 4]]
[[99, 2], [3, 4]]
Q5. What is the difference between is and == in Python?
is checks memory reference, while == checks values.
a = [1, 2, 3]
b = a
c = [1, 2, 3]
print(a == c) # True (same values)
print(a is c) # False (different objects)
print(a is b) # True (same object)
Try it Yourself >>
Q6. How does Python’s memory management work?
Python uses automatic garbage collection, reference counting, and memory pools.
import sys
x = [1, 2, 3]
print(sys.getrefcount(x)) # Number of references to x
Q7. What are Python’s built-in data types?
Python has various built-in data types, including:
Numeric Types: int, float, complex
Sequence Types: list, tuple, range
Text Type: str
Set Types: set, frozenset
Mapping Type: dict
Boolean Type: bool
Binary Types: bytes, bytearray, memoryview
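As a quick, minimal illustration (nothing beyond standard Python), calling type() on a literal confirms its class:
print(type(3.14)) # <class 'float'>
print(type("hi")) # <class 'str'>
print(type(range(3))) # <class 'range'>
print(type({1, 2})) # <class 'set'>
print(type({"a": 1})) # <class 'dict'>
print(type(b"raw")) # <class 'bytes'>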
Q8. What is the difference between mutable and immutable data types?
1. Mutable: Can be changed after creation (e.g., list, dict, set).
lst = [1, 2, 3]
lst[0] = 99 # Allowed
2. Immutable: Cannot be changed after creation (e.g., int, str, tuple).
tup = (1, 2, 3)
tup[0] = 99 # TypeError: 'tuple' object does not support item assignment
Q9. Explain the difference between a Python set and a dictionary.
1. Set: Unordered collection of unique elements.
s = {1, 2, 3, 4}
2. Dictionary: Stores key-value pairs.
d = {"name": "Alice", "age": 25}
Q10. What is the difference between a generator and a list comprehension?
1. List Comprehension: Returns a list, storing all elements in memory.
lst = [x**2 for x in range(5)]
print(lst) # [0, 1, 4, 9, 16]
2. Generator: Returns an iterator that generates values on the fly, saving memory.
gen = (x**2 for x in range(5))
print(next(gen)) # 0
print(next(gen)) # 1
Data Processing & Analysis
Q16. What are Pandas DataFrames, and how do they differ from Series?
DataFrame: A 2D labeled data structure similar to a table with rows and columns.
Series: A 1D labeled array, essentially a single column of a DataFrame.
import pandas as pd
s = pd.Series([10, 20, 30])
df = pd.DataFrame({"A": [10, 20, 30], "B": [40, 50, 60]})
Q17. How do you merge, join, and concatenate data in Pandas?
merge(): Used for database-style joins (inner, outer, left, right).
join(): Similar to merge() but works with index-based joins.
concat(): Combines data along an axis.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
result = pd.concat([df1, df2])
merged = df1.merge(df2, on="A", how="inner")
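The example above covers concat() and merge(); for completeness, here is a minimal join() sketch, assuming two frames that share an index:
left = pd.DataFrame({"X": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"Y": [3, 4]}, index=["a", "b"])
joined = left.join(right) # Aligns rows on the shared index
print(joined)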
Q18. Explain the difference between .iloc[] and .loc[] in Pandas.
1. iloc[]: Selects rows/columns by integer location (index-based).
2. loc[]: Selects rows/columns by label (name-based).
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['x', 'y', 'z'])
print(df.iloc[0]) # First row by integer position
print(df.loc['x']) # First row by label
Try it Yourself >>
Q19. What are lambda functions in Python?
Lambda functions in Python are anonymous (one-liner) functions defined with the lambda keyword.
add = lambda x, y: x + y
print(add(3, 5))
Output
8
Q20. How do you handle duplicate values in Pandas?
Use drop_duplicates() to remove duplicates.
df = pd.DataFrame({"A": [1, 2, 2, 3], "B": [4, 5, 5, 6]})
df = df.drop_duplicates()
Q21. What is the role of the apply() function in Pandas?
apply() applies a function along the rows or columns of a DataFrame.
df["A_squared"] = df["A"].apply(lambda x: x**2)
Q22. How do you optimize performance in Pandas for large datasets?
Use vectorized operations.
Convert data types to save memory.
Use chunk processing for large files.
df["A"] = df["A"].astype("int8") # Downcasting saves memory
Q23. Explain the difference between .any() and .all() functions in Pandas.
.any() returns True if any value is True; .all() returns True only if all values are True.
df = pd.DataFrame({"A": [True, False, True], "B": [True, True, True]})
print(df.any()) # Checks if any value is True in each column
print(df.all()) # Checks if all values are True in each column
Q24. How do you convert categorical data into numerical form?
Use pd.get_dummies() for one-hot encoding.
df = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
df_encoded = pd.get_dummies(df, columns=["Color"])
Q25. What is the difference between .pivot() and .pivot_table() in Pandas?
.pivot() reshapes data but requires unique index/column combinations. .pivot_table() works even with duplicate values and allows aggregation.
df = pd.DataFrame({"Date": ["2024-01", "2024-01", "2024-02"],
                   "Category": ["A", "B", "A"],
                   "Value": [100, 200, 150]})
pivot = df.pivot(index="Date", columns="Category", values="Value")
pivot_table = df.pivot_table(index="Date", columns="Category", values="Value", aggfunc="sum")
Machine Learning & Statistical Computing
Q26. What are the different types of probability distributions used in Data Science?
Some common probability distributions include Normal, Binomial, Poisson, Exponential, and Uniform distributions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x = np.linspace(-3, 3, 100)
plt.plot(x, norm.pdf(x))
plt.title("Normal Distribution")
plt.show()
Q27. Explain the concept of feature scaling in machine learning.
Feature scaling ensures numerical features are on the same scale using Standardization or Min-Max Scaling.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform([[10], [20], [30]])
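For the Min-Max alternative, scikit-learn's MinMaxScaler follows the same fit_transform pattern; a minimal sketch:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler() # Rescales each feature to the [0, 1] range
print(minmax.fit_transform([[10], [20], [30]])) # [[0.], [0.5], [1.]]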
Q28. How does NumPy handle multidimensional arrays?
NumPy uses ndarray objects to handle multidimensional data.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # (2, 3)
print(arr[:, 1]) # Second column
Q29. What is the difference between mean, median, and mode?
Mean is the average, Median is the middle value, and Mode is the most frequent value.
import numpy as np
from scipy import stats
data = [1, 2, 2, 3, 4]
print(np.mean(data)) # Mean
print(np.median(data)) # Median
print(stats.mode(data)) # Mode
Q30. How do you implement linear regression using NumPy?
Use np.linalg.lstsq to fit a least-squares line:
import numpy as np
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
A = np.vstack([X, np.ones(len(X))]).T # Design matrix with an intercept column
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"Slope: {m}, Intercept: {c}")
Q31. Explain the difference between supervised and unsupervised learning.
Supervised learning uses labeled data, while Unsupervised learning finds patterns without labels.
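As a minimal scikit-learn sketch of the contrast (toy data, purely illustrative): the classifier is given labels y, while the clusterer must discover structure from X alone.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
X = [[0], [1], [8], [9]]
y = [0, 0, 1, 1] # Labels available: supervised
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2]])) # Predicts a label for new data
km = KMeans(n_clusters=2, n_init=10).fit(X) # No labels: unsupervised
print(km.labels_) # Cluster assignments found from X alone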
Q32. How do you calculate the correlation between two variables in Python?
import pandas as pd
data = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8]}
df = pd.DataFrame(data)
print(df.corr())
Q33. What is the purpose of the scipy.stats module?
The scipy.stats module provides statistical functions like probability distributions and hypothesis testing.
from scipy import stats
data = [1, 2, 3, 4, 5]
print(stats.ttest_1samp(data, 3)) # One-sample t-test
Q34. How do you handle imbalanced datasets in Python?
Methods include oversampling (SMOTE), undersampling, and class weighting.
from imblearn.over_sampling import SMOTE
X, y = [[1], [2], [3], [4], [5], [6]], [0, 0, 0, 0, 1, 1]
smote = SMOTE(k_neighbors=1) # Tiny toy data: the default k_neighbors=5 needs more minority samples
X_resampled, y_resampled = smote.fit_resample(X, y)
Q35. How do you implement a k-means clustering algorithm using Python?
Use scikit-learn's KMeans:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [5, 8], [8, 8]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
Python Data Science Interview Questions for Advanced-Level (5+ years experience)
Performance Optimization & Best Practices
Q36. How do you handle memory-efficient operations in NumPy?
To optimize memory usage in NumPy:
Use appropriate data types (e.g., int8 instead of int64).
Leverage views instead of copies.
Use in-place operations to reduce memory overhead.
Utilize NumPy’s broadcasting to avoid large intermediate arrays.
import numpy as np
arr = np.array([1, 2, 3, 4], dtype=np.int8)
arr += 2 # In-place, memory-efficient operation
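To make the views-versus-copies bullet concrete, a small sketch: slicing produces a view sharing the original buffer, while .copy() allocates new memory.
big = np.zeros(1_000_000)
view = big[::2] # A view: no new data buffer is allocated
dup = big[::2].copy() # A copy: duplicates the data
print(view.base is big) # True, shares memory with big
print(dup.base is None) # True, owns its own memory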
Q37. What are some techniques to speed up Pandas operations?
Use vectorized operations instead of loops.
Convert object data types to categorical for efficiency.
Use chunk processing for large datasets.
Enable multi-threading with modin.pandas.
df = pd.DataFrame({"A": [1, 2, 3, 4]})
df["A"] = df["A"] * 2 # Vectorized: faster than using apply()
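The categorical-conversion tip can be sketched as follows; for a repetitive string column the memory saving is dramatic (the column name and sizes here are illustrative assumptions):
df = pd.DataFrame({"city": ["NY", "LA", "NY", "LA"] * 250000})
print(df["city"].memory_usage(deep=True)) # Object dtype: large
df["city"] = df["city"].astype("category") # Stores integer codes plus a small lookup table
print(df["city"].memory_usage(deep=True)) # A fraction of the original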
Q38. How does Python’s Global Interpreter Lock (GIL) impact performance?
The Global Interpreter Lock (GIL) restricts execution to one thread at a time, meaning:
CPU-bound tasks do not benefit from multithreading.
I/O-bound tasks can still benefit from multithreading.
Multiprocessing is better for CPU-heavy tasks.
import threading
def compute():
    sum([i**2 for i in range(1000000)])
threads = [threading.Thread(target=compute) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Despite four threads, the GIL serializes this CPU-bound work
Q39. Explain multithreading vs multiprocessing in Python.

| Feature | Multithreading | Multiprocessing |
| --- | --- | --- |
| Execution | Multiple threads in a single process | Multiple separate processes |
| Best For | I/O-bound tasks | CPU-bound tasks |
| GIL Effect | Affected | Not affected |
| Memory Usage | Shares memory space | Each process has separate memory |

Example: Multithreading (I/O-bound task)
import threading, time
def task():
    time.sleep(2)
    print("Task complete")
threads = [threading.Thread(target=task) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
Example: Multiprocessing (CPU-bound task)
import multiprocessing
def compute():
    sum([i**2 for i in range(1000000)])
processes = [multiprocessing.Process(target=compute) for _ in range(4)]
for p in processes: p.start()
for p in processes: p.join()
# Note: on Windows/macOS, process start-up should sit under an if __name__ == "__main__": guard
Q40. How do you parallelize computations in Python?
Python provides several ways to parallelize computations:
Use joblib for parallel loops.
Use multiprocessing for CPU-bound tasks.
Use concurrent.futures for efficient parallelism.
Use Dask for parallel Pandas-like operations.
Example: Parallel processing with concurrent.futures
import concurrent.futures
def compute(n):
    return sum(i**2 for i in range(n))
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(compute, [1000000, 2000000, 3000000])
print(list(results))
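Of the options above, joblib is the most compact for parallel loops; a minimal sketch reusing compute() from the previous example (assuming joblib is installed):
from joblib import Parallel, delayed
results = Parallel(n_jobs=4)(delayed(compute)(n) for n in [1000000, 2000000, 3000000])
print(results) # Same results, computed across 4 worker processes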
Deep Learning & Advanced Machine Learning
Q41. What are the different ways to preprocess text data for NLP?
Text preprocessing includes:
Tokenization
Lowercasing
Removing Stopwords
Stemming and Lemmatization
Removing Punctuation
Vectorization (TF-IDF, Bag-of-Words, Word Embeddings)
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("stopwords")
text = "Natural Language Processing is amazing!"
tokens = word_tokenize(text.lower())
clean_tokens = [word for word in tokens if word not in stopwords.words("english")]
print(clean_tokens)
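The stemming and lemmatization bullets can likewise be sketched with NLTK (the WordNet corpus must be downloaded first):
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet")
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("running")) # 'run' (rule-based suffix stripping)
print(lemmatizer.lemmatize("better", pos="a")) # 'good' (dictionary lookup)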
Q42. How do you implement a decision tree from scratch in Python?
Steps:
Calculate Gini Impurity or Entropy
Choose the best feature to split the dataset
Recursively build the tree
In practice, these steps can be demonstrated with scikit-learn's DecisionTreeClassifier:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Sample dataset
data = {'Feature': [0, 1, 1, 0, 1], 'Label': [0, 1, 1, 0, 1]}
df = pd.DataFrame(data)
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(df[['Feature']], df['Label'])
print(clf.predict([[1]]))
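For the genuinely from-scratch part, step 1 (Gini impurity) can be written directly; this helper is an illustrative sketch, not code from the original article:
def gini_impurity(labels):
    # Gini = 1 - sum of squared class proportions
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity([0, 1, 1, 0, 1])) # 0.48 for a 2-vs-3 split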
Q43. Explain the role of TensorFlow and PyTorch in Deep Learning.

| Feature | TensorFlow | PyTorch |
| --- | --- | --- |
| Developed By | Google | Facebook |
| Computation Graph | Static & Dynamic | Dynamic |
| Best For | Production | Research |

import torch
import torch.nn as nn
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(2, 1)
    def forward(self, x):
        return self.fc1(x)
model = SimpleNN()
print(model)
Q44. How does a convolutional neural network (CNN) process images?
Key CNN layers:
Convolutional Layer - Extracts features
ReLU - Activation function
Pooling Layer - Reduces dimensionality
Fully Connected Layer - Classification
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])
model.summary()
Q45. What are hyperparameter tuning techniques in Python?
Common methods:
Grid Search
Random Search
Bayesian Optimization
HyperOpt & Optuna
Genetic Algorithms
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, None]}
clf = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
clf.fit(X_train, y_train) # Assumes X_train and y_train are already defined
print(clf.best_params_)
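Random Search uses the same API via RandomizedSearchCV, sampling a fixed number of combinations instead of the full grid (a sketch, again assuming X_train and y_train exist):
from sklearn.model_selection import RandomizedSearchCV
param_dist = {'n_estimators': [10, 50, 100, 200], 'max_depth': [3, 5, 10, None]}
rnd = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=5, random_state=42)
rnd.fit(X_train, y_train) # Tries 5 random combinations instead of all of them
print(rnd.best_params_)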
Big Data & Scalability
Q46. How do you work with large datasets that don’t fit in memory?
Use Chunking in Pandas
Utilize Dask for parallel processing
Store data in Databases instead of memory
Use Apache Spark for distributed computing
import pandas as pd
chunk_size = 10000
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    process(chunk) # process() is a placeholder for your per-chunk logic
Q47. What are some Python libraries for handling big data?
Pandas
Dask
Vaex
PySpark
Modin
import dask.dataframe as dd
df = dd.read_csv("large_dataset.csv")
df.groupby('column_name').mean().compute()
Q48. Explain how Apache Spark integrates with Python (PySpark).
PySpark is the Python API for Apache Spark, enabling distributed computing.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BigDataProcessing").getOrCreate()
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
df.show()
Q49. How do you deploy machine learning models using Flask or FastAPI?
1. Using Flask:
from flask import Flask, request, jsonify
import pickle
app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['features']
    prediction = model.predict([data])
    return jsonify({'prediction': prediction.tolist()})
if __name__ == "__main__":
    app.run(debug=True)
2. Using FastAPI:
from fastapi import FastAPI
import pickle
app = FastAPI()
model = pickle.load(open("model.pkl", "rb"))
@app.post("/predict/")
async def predict(features: list):
    prediction = model.predict([features])
    return {"prediction": prediction.tolist()}
Q50. How do you schedule and automate data pipelines in Python?
Common tools for scheduling data pipelines:
Apache Airflow
Luigi
Cron Jobs
Prefect
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
def process_data():
    print("Data Processing Task Executed!")
dag = DAG('data_pipeline', schedule_interval='@daily', start_date=datetime(2024, 1, 1))
task = PythonOperator(task_id='process_task', python_callable=process_data, dag=dag)
Conclusion
In this article, we covered fundamental and advanced Python interview questions and answers for data science related to data handling, memory management, deep learning, and big data technologies like Apache Spark and PySpark. Understanding these topics will not only help you ace interviews but also enable you to build efficient and scalable data-driven applications.