Harish Kumar Thota
Irving, TX | (469) 287-6899 | [email protected] | LinkedIn
PROFESSIONAL SUMMARY:
Senior Data Scientist with over 7 years of progressive experience in Applied Machine Learning, Natural Language
Processing (NLP), Generative AI, and Data Analytics, delivering high-impact AI solutions across banking,
healthcare, and retail domains.
Proven expertise in Generative AI system design, including LLaMA 2, Falcon, and Retrieval-Augmented Generation
(RAG) for domain-specific conversational assistants, improving automation rates and reducing decision
turnaround times.
Proficient in managing the complete data science project lifecycle, contributing to every phase from data
acquisition and cleaning through feature engineering and feature scaling.
Skilled in building machine learning models using algorithms such as Regression, Time Series (ARIMA, Holt-
Winters), Clustering, Apriori, Decision Trees, KNN, Neural Networks, SVM, and Ensemble methods (Random
Forest, Boosting).
Hands-on experience implementing Naive Bayes and proficient in Random Forests, Decision Trees, Linear &
Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis (PCA).
Experienced in dimensionality reduction techniques (PCA, SVD), model evaluation with metrics such as AUC-ROC
alongside K-fold cross-validation, and delivering insights through data visualization.
Skilled in domain adaptation of LLMs using LoRA fine-tuning, enabling cost-effective training on secure datasets
while boosting model performance in specialized terminology.
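A minimal LoRA sketch with Hugging Face PEFT, assuming a LLaMA-family base checkpoint; the rank, dropout, and target modules below are illustrative defaults, not the values used on the secure datasets described above:

```python
# Illustrative LoRA fine-tuning setup (placeholder model and hyperparameters).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base checkpoint

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank dimension keeps trainable params small
    lora_alpha=32,                         # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of total weights
```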
Proficient with AWS services (DMS, Glue, Athena, S3, EMR, Step Functions, Lambda, SNS, SES, EC2, QuickSight) as
well as GCP (Dataproc, BigQuery, GCS, Vertex AI) and Azure Databricks for large-scale data analytics and insights.
Extensive hands-on experience in Natural Language Processing (NLP) and Generative AI, with a strong foundation
in developing Large Language Model (LLM)-based solutions using GPT-4, Gemini, LLaMA, and Cortex Analyst.
Skilled in time series forecasting using SARIMA, Prophet, and machine learning models for demand prediction,
achieving over 90% forecast accuracy.
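A minimal SARIMA sketch with statsmodels, assuming a weekly demand series; the file name and the (seasonal) orders are placeholders rather than tuned production settings:

```python
# Sketch of a seasonal demand forecast; "weekly_demand.csv" is a hypothetical input.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

demand = pd.read_csv("weekly_demand.csv", index_col="week", parse_dates=True)["units"]

model = SARIMAX(demand, order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))  # yearly cycle on weekly data
fit = model.fit(disp=False)
forecast = fit.forecast(steps=12)  # next 12 weeks of predicted demand
print(forecast)
```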
Skilled in developing and deploying ML models using cloud-native platforms including AWS SageMaker, GCP
Vertex AI, and AzureML Studio.
Proficient in Python with extensive use of packages such as Pandas, NumPy, Scikit-learn, TensorFlow, Keras, NLTK,
Matplotlib, Seaborn and more for data analysis, visualization, and machine learning.
Experience in cloud-native AI development, delivering end-to-end data science solutions with Docker, MLflow,
FastAPI, and CI/CD pipelines built on GitHub Actions and Jenkins.
Highly proficient in developing tailored optimization models and performing scenario-based sensitivity analyses
to support well-informed, data-driven decision-making.
Experienced in data parsing, manipulation, and preparation using techniques like descriptive statistics, regex,
merge, subset, reindex, melt, and reshape to enable high-quality datasets for ML pipelines.
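A short pandas sketch of the preparation steps named above (merge, melt, subset, reindex, descriptive statistics) on toy data:

```python
# Toy illustration of the pandas reshaping operations listed above.
import pandas as pd

orders = pd.DataFrame({"id": [1, 2], "q1": [10, 7], "q2": [12, 9]})
regions = pd.DataFrame({"id": [1, 2], "region": ["TX", "AR"]})

merged = orders.merge(regions, on="id")                         # merge
tidy = merged.melt(id_vars=["id", "region"],                    # melt wide quarters into rows
                   value_vars=["q1", "q2"],
                   var_name="quarter", value_name="units")
subset = tidy[tidy["units"] > 8].reset_index(drop=True)         # subset + reindex
print(subset.describe())                                         # descriptive statistics
```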
Experienced in customer segmentation, anomaly detection, and recommender systems, driving measurable
revenue and risk reduction outcomes.
Built and deployed fraud detection models with Logistic Regression, Random Forest, XGBoost, LightGBM, and
Neural Networks, achieving 96% accuracy in identifying suspicious transactions.
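A hedged sketch of one such gradient-boosted fraud classifier; the features and labels below are synthetic stand-ins for real transaction data, and the class-weighting choice is illustrative:

```python
# Synthetic-data sketch of an imbalanced fraud classifier with XGBoost.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))                          # stand-in transaction features
y = (X[:, 0] + rng.normal(size=5000) > 2).astype(int)    # rare positive class ~ fraud

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = XGBClassifier(
    n_estimators=300, max_depth=4,
    scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1),  # offset class imbalance
)
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```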
Proficient in big data ETL pipelines with Apache Spark, enabling large-scale data ingestion and transformation for
analytics and ML workflows.
Adept at hyperparameter optimization, feature engineering, and ensemble modeling to improve predictive
accuracy and model robustness.
Expertise in Python, SQL, Machine Learning, Deep Learning, and Big Data technologies. Adept at translating
business requirements into actionable ML models, collaborating with cross-functional teams, and deploying
solutions on cloud platforms. Passionate about using data to drive meaningful business impact.
Strong background in data visualization and stakeholder communication through interactive dashboards
(Tableau, Power BI) and analytical reporting.
SKILLS:
Programming Languages: Python, C, Java, R, SQL, HTML5, Bash, Linux, Shell Scripting, PySpark, MATLAB, Data Structures and Algorithms, OOP, RDBMS
Database & Warehousing: MySQL, Oracle, MongoDB, NoSQL, Google BigQuery, Snowflake, Apache Hive, HDFS, ETL pipelines
AI Frameworks & Technologies: Machine Learning, Deep Learning, Artificial Intelligence, Computer Vision, OpenCV, Generative AI, Natural Language Processing, Scikit-learn, TensorFlow, Keras, PyTorch
Cloud Services: Azure, AWS, Docker, SageMaker, Redshift, Dataproc, Kubernetes
Tools & Platforms: Jupyter Notebook, VS Code, Google Colab, Microsoft Excel, Tableau, Power BI, Google Cloud Platform, Databricks, SAS
Machine Learning Modeling: Linear Regression, Logistic Regression, Decision Trees, Random Forest, XGBoost, LightGBM, SVM, K-Means Clustering, DBSCAN, ARIMA, SARIMA, Holt-Winters, Anomaly Detection, PCA
Deep Learning Architectures: CNN, RNN, LSTM, GAN, Transformers, Autoencoders, BERT, GPT
Generative AI: GPT, LLaMA, Falcon, LoRA, RAG, Hugging Face Transformers, LangChain
Interpersonal Skills: Time Management, Teamwork, Communication, Adaptability, Work Ethic, Empathy, Decision Making
EDUCATION:
UNIVERSITY OF NORTH TEXAS
Master of Science in Advanced Data Analytics GPA: 4.0/4.0
Relevant Coursework: Data Analytics, Business Intelligence, Data Warehousing, Big Data, Cloud Computing Tools,
Machine Learning, Artificial Intelligence, Natural Language Processing
CERTIFICATIONS:
Machine Learning: Stanford University
OpenCV Bootcamp: OpenCV University
Data Analysis with Python, Pandas and NumPy
Artificial Intelligence Foundations - Machine Learning
Programming, Data Structures, and Algorithms using Python: NPTEL – IIT Kharagpur
Introduction to Internet of Things: NPTEL – IIT Kharagpur
Oxford Achiever: Certificate of Merit
WORK EXPERIENCE:
NLP DATA SCIENTIST / ML ENGINEER Muskogee, OK
Armstrong Bank Oct 2023 – Present
Designed Generative AI architecture for conversational banking assistants using LLaMA 2 and Falcon models on
secure financial datasets, automating 80% of credit risk assessments and cutting regulatory compliance review
time by 35%.
Created GenAI-powered data analysis engine using Claude-v3.0-Sonnet LLM on Vertex AI, enabling automated
Python/SQL code generation for descriptive and aggregative analytics.
Built ML models for customer retention prediction using Logistic Regression, KNN, Decision Trees, Random
Forest, XGBoost, LightGBM, and Neural Networks on Vertex AI, achieving an F1 score of 87%.
Optimized ML pipelines on AWS SageMaker, reducing training time by 40% and cutting infrastructure costs by
30%.
Built LLM-based earnings call summarization tool using Claude-v2.1 on Vertex AI, converting video to audio,
transcribing with Speech-to-Text, and generating executive summaries via prompt engineering. Designed a user
interface with Streamlit for business accessibility.
Implemented Retrieval-Augmented Generation (RAG) workflows combining LLMs with vector databases,
enabling real-time, domain-specific Q&A systems.
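A minimal RAG sketch under stated assumptions: sentence-transformers for embeddings (one plausible choice), an in-memory index in place of a managed vector database, and a placeholder call_llm function standing in for the completion endpoint:

```python
# RAG sketch: embed passages, retrieve the nearest ones, pass them to an LLM as context.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["Wire transfers over $10k require review.", "Card disputes resolve in 10 days."]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 1):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                          # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "How long do card disputes take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
# answer = call_llm(prompt)  # placeholder: any chat-completion API fits here
```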
Designed anomaly detection frameworks using Isolation Forest, Autoencoders, and statistical methods to flag
fraud or irregular activity in high-volume datasets.
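A compact Isolation Forest sketch on synthetic data illustrating the flagging step; the contamination rate is an assumed parameter:

```python
# Unsupervised anomaly flagging with scikit-learn's IsolationForest (synthetic data).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(0, 1, size=(1000, 4))      # typical transaction behavior
outliers = rng.normal(6, 1, size=(10, 4))      # injected irregular activity
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
flags = iso.predict(X)                          # -1 = anomaly, 1 = normal
print("flagged:", int((flags == -1).sum()))
```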
Created bias detection frameworks for LLM outputs, integrating SHAP and counterfactual evaluation to ensure
fairness, reducing demographic bias in credit approvals by 20%.
Created a GenAI Terraform code generator with Mistral-7B LLM on Vertex AI, parsing Draw.io XML architecture
diagrams to extract cloud services/relationships and generate Terraform templates, improving performance via
prompt optimization and XML parsing enhancements.
Implemented a RAG-based chat engine integrating the Pinecone vector database with OpenAI GPT-3.5 Turbo to
analyze customer feedback and support logs, enhancing detection of service pain points such as loan delays and
digital banking issues.
Built real-time inference systems using Kafka Streams, PyTorch and TensorFlow Serving, reducing decision
latency from minutes to seconds.
Designed model drift detection frameworks with statistical monitoring, retriggering retraining pipelines
automatically.
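One way such statistical monitoring can look, sketched with a two-sample Kolmogorov-Smirnov test per feature; the alert threshold is an assumption, not the production value:

```python
# Drift check: compare a live feature window against the training distribution.
import numpy as np
from scipy.stats import ks_2samp

train_feature = np.random.normal(0.0, 1.0, 10_000)   # reference distribution from training
live_feature = np.random.normal(0.3, 1.0, 1_000)     # recent production window (shifted)

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:                                    # assumed alert threshold
    print("drift detected -> trigger retraining pipeline")
```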
Implemented feature store architecture in Snowflake and BigQuery to ensure consistent feature availability
across training and inference.
Developed LLMOps pipelines for lifecycle management of large language models, including prompt versioning,
automated evaluation (BLEU, ROUGE, RAGAS), and feedback-driven retraining.
Designed BigQuery-based batch inference pipelines for automated scoring of retention and basket value models.
Applied optimization techniques including regularization, cross-validation, and hyperparameter tuning to
maximize model performance.
Sr. DATA SCIENTIST Nashville, TN
HCA Healthcare Jan 2020 – Sep 2023
Engineered clinical text classification models using BERT and RoBERTa fine-tuned on medical corpora, achieving
94% F1-score in automatic ICD-10 code extraction from unstructured EHR notes.
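A minimal Transformers sketch of the classification setup; the generic bert-base-uncased checkpoint and three-label head are placeholders for the medical-domain models and ICD-10 label space described above:

```python
# Sequence-classification scoring sketch; in practice the head is fine-tuned on labeled notes.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

note = "Patient presents with shortness of breath and elevated troponin."
inputs = tok(note, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # per-label probabilities
```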
Improved model robustness using regularization (L1/L2) and cross-validation, ensuring high accuracy and
generalization across diverse patient datasets.
Built predictive models in Python (Scikit-learn) using regression and classification algorithms to identify high-risk
patients, improving preventive care strategies and reducing hospital readmission rates by 14%.
Developed custom Named Entity Recognition (NER) pipelines with spaCy, extracting clinical terms with 97%
precision and improving entity linking accuracy by 22% over baseline.
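A minimal spaCy extraction sketch using an off-the-shelf pipeline; a clinical deployment would use a custom-trained NER component rather than en_core_web_sm:

```python
# Entity extraction with a stock spaCy pipeline (requires: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Patient started on 20 mg lisinopril at Nashville General on March 3.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # extracted entities with their labels
```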
Designed interactive dashboards in Tableau and Power BI to track key clinical KPIs such as patient outcomes, length
of stay, readmissions and operational efficiency.
Built efficient, intelligent chatbots using algorithms such as Naive Bayes, Decision Trees, Support Vector
Machines, Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) networks, combined with Natural
Language Processing (NLP) techniques.
Established A/B testing frameworks for NLP-driven patient support tools, quantifying improvements in automation
rates and user satisfaction.
Developed and deployed reinforcement learning models to optimize hospital resource allocation and decision-
making in dynamic environments.
Built deep learning models (CNN, LSTM) for image classification, sentiment analysis, and time-series forecasting,
delivering accuracy improvements over prior approaches.
Developed real-time clinical alerting systems using Kafka and Spark streaming to notify providers of high-risk patient
events within seconds.
Led cross-functional AI initiatives, collaborating with data engineering, DevOps, and clinical operations teams to
deliver high-value ML solutions.
Automated continuous training and deployment workflows for ML models using MLflow, Kubeflow, and SageMaker
Pipelines.
Applied LSTM-based models to predict probability of equipment or device failure and used AWS SageMaker for
model training and deployment.
Applied advanced optimization algorithms (Bayesian Optimization, Genetic Algorithms, Hyperband) to tune
hyperparameters, improving model accuracy and efficiency.
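A sketch of Bayesian-style tuning using Optuna (whose default TPE sampler is one such method); the model and search space are illustrative:

```python
# Hyperparameter search sketch with Optuna on a synthetic classification task.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
    }
    return cross_val_score(RandomForestClassifier(**params, random_state=0), X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")   # TPE sampler by default
study.optimize(objective, n_trials=20)
print(study.best_params)
```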
Developed privacy-preserving data augmentation and resampling pipelines for model training, ensuring compliance
with HIPAA/GDPR without compromising model performance.
Implemented anomaly detection models using Isolation Forests and Autoencoders to identify abnormal patient
vitals and reduce false alarms in clinical monitoring systems.
Optimized ETL workflows for large-scale healthcare data integration, improving processing speed by 35% using
Azure Databricks and distributed data lakes.
Built end-to-end ML pipelines with Apache Airflow for data ingestion, preprocessing, model training, evaluation,
and deployment, reducing model update cycles from weeks to days.
Processed petabyte-scale healthcare datasets using PySpark, Hadoop, and Azure for machine learning and reporting
use cases.
Built streaming data pipelines with Kafka, Azure Event Hubs, and Spark Structured Streaming for real-time analytics
and inference on patient data.
Automated recurring healthcare performance reports using Power BI and SQL, saving over 20 hours of manual
effort monthly.
Developed automated model monitoring dashboards to track data drift, model drift and real-time inference
accuracy, enabling proactive retraining.
DATA ANALYST/DATA SCIENTIST Lowell, AR
J.B. Hunt June 2017 – Dec 2020
Developed time series forecasting models using SARIMA and Facebook Prophet, achieving 92% forecast accuracy
for shipment demand prediction in retail supply chains and reducing stockouts by 15%.
Developed regression models for spot market transportation price prediction with tree-based boosting models
(XGBoost, LightGBM), saving $12M annually in transportation costs.
Implemented clustering techniques (K-Means, DBSCAN, Hierarchical Clustering) to segment shippers, improving
targeted pricing and contract strategies.
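A minimal K-Means segmentation sketch; the synthetic features stand in for shipper behavior attributes such as volume and lane diversity:

```python
# Segmentation sketch: scale features so none dominates distance, then cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
features = rng.normal(size=(500, 3))               # stand-in shipper behavior features

X = StandardScaler().fit_transform(features)
segments = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)
print(np.bincount(segments))                        # shippers per segment
```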
Integrated end-to-end MLOps practices for model lifecycle management, versioning, and monitoring of
deployed ML-based solutions in logistics workflows, ensuring reliable demand forecasting, lane optimization and
pricing models in production.
Designed federated learning pipelines to train models on decentralized carrier and regional shipment data,
maintaining security and reducing compliance risks.
Conducted A/B testing for advanced analytics and decision support models, quantifying uplift in automation
rates and accuracy.
Collaborated on the Predictive Demand Generation (PDG) feature with the team and trained an Artificial Neural
Network (ANN) using TensorFlow to identify high-potential shipper prospects from sales leads.
Designed scalable ML pipelines for deploying and managing demand forecasting, shipper segmentation and
route recommendation models with automated evaluation and feedback loops.
Implemented data quality and governance frameworks ensuring lineage, cataloging, and compliance with
enterprise policies.
Developed domain-specific embeddings using Word2Vec and TF-IDF, trained on shipment attributes, lane histories,
and carrier performance data to enhance semantic search, route-similarity matching, and carrier
recommendations.
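A gensim Word2Vec sketch on a toy corpus; real training would run over tokenized lane histories and shipment attributes rather than this two-record stand-in:

```python
# Domain-embedding sketch with gensim Word2Vec on toy "lane history" tokens.
from gensim.models import Word2Vec

corpus = [["dallas", "memphis", "dry_van"], ["dallas", "houston", "reefer"]]
w2v = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, epochs=20)
print(w2v.wv.most_similar("dallas", topn=2))   # nearest tokens in embedding space
```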
Designed optimization algorithms for load matching and route planning, reducing fuel costs and improving fleet
utilization.
Implemented predictive maintenance models for transportation assets, reducing breakdowns and repair costs
by 12%.
Created interactive operational dashboards in Tableau/Power BI to visualize logistics KPIs like shipment volumes,
on-time delivery, carrier utilization, enabling faster decision-making.
Conducted data preprocessing, cleaning and exploratory data analysis using Pandas, SQL, and Python libraries,
ensuring data integrity.
Managed data storage and large-scale processing pipelines in Azure using SQL, Spark, and distributed data lakes
for production, development, and testing environments.
Integrated weather and traffic data into shipment demand forecasting models, improving accuracy in volatile
market conditions.
Developed a scalable, configurable AutoML solution spanning multiple regression and classification algorithms
that optimizes features, algorithms, and hyperparameters, reducing experimentation time by weeks.
Generated predictive analytics reports using Python and Tableau, including visualizations of model performance
and business impact.
Applied what-if simulations for transportation pricing strategies, increasing revenue predictability in spot market
bidding.
Conducted machine learning proof-of-concepts and led production deployment of intelligent logistics solutions,
delivering measurable business value and continuous innovation.
Developed explainable AI (XAI) dashboards using SHAP and LIME for model transparency, improving trust with
regulatory stakeholders.
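A short SHAP sketch for a tree model; the dataset and regressor are stand-ins, and summary_plot is one of the views such a dashboard can embed:

```python
# Explainability sketch: per-feature SHAP contributions for a gradient-boosted regressor.
import shap
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=6, random_state=0)
model = XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # contribution of each feature to each prediction
shap.summary_plot(shap_values, X, show=False)   # global importance view for the dashboard
```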
Engineered high-performance lane-level demand prediction models with Random Forest and XGBoost,
improving SKU-level sales forecast accuracy by 18% compared to previous models.