Week 2
316 – Data quality
Dr Patrick Mukala
2
Getting and Cleaning Data overview
Getting Cleaning Storing
Hadoop/Hive (SQL)
MongoDB (NoSQL)
Data Quality
Missing Data
MapReduce
3
Data Quality
Data Quality measures:
• Validity
• Accuracy
• Completeness
• Consistency
• Uniformity
4
Missing Data origins (Types)
• Not applicable, N/A
• Not available (e.g. a variable added to the questionnaire at a later date, or data lost due to a defect)
• Unknown
• Refusal to answer
• True missing (e.g. a question was skipped)
Analyze why data are missing before deciding how to handle them.
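As a minimal sketch of how these origins could be tracked in practice, the snippet below tallies the reason behind each missing value in a column. The numeric codes (-1 to -4) and the names MISSING_CODES, missing_reason and tally_missing are assumptions made for illustration; the slides do not define a coding scheme.

```python
from collections import Counter

# Hypothetical coding scheme (an assumption, not taken from the slides):
# negative sentinel codes mark known reasons, None marks a skipped question.
MISSING_CODES = {
    -1: "not applicable",
    -2: "not available",
    -3: "unknown",
    -4: "refusal",
    None: "true missing",
}

def missing_reason(value):
    """Return the reason a value is missing, or None if it is observed."""
    return MISSING_CODES.get(value)

def tally_missing(values):
    """Count how often each missing-data origin occurs in a column."""
    reasons = (missing_reason(v) for v in values)
    return Counter(r for r in reasons if r is not None)

# Made-up income column mixing observed values and missing codes.
income = [42000, -4, None, 51000, -1, -4, 38000]
print(tally_missing(income))
```

Keeping the reason alongside the value makes it possible to treat, say, refusals differently from not-applicable entries later on.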
5
Missing Data mechanisms
1. Missing completely at random (MCAR): the probability of missing data on Y is unrelated to the value of Y itself or to the values of any other variable in the data set.
2. Missing at random (MAR): the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis (say X).
3. Not missing at random (NMAR): the probability of missing data on Y depends on the unobserved value of Y itself, even after controlling for the other variables.
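The three mechanisms differ only in what the probability of missingness is allowed to depend on. The sketch below makes that explicit for an (age, income) pair, with X = age and Y = income. The probabilities (0.3, 0.6, 0.1), the thresholds and the function names are illustrative assumptions, not values from the lecture.

```python
import random

def missing_probability(mechanism, x, y):
    """Probability that y is missing under each mechanism (illustrative values)."""
    if mechanism == "MCAR":
        return 0.3                          # constant: ignores both x and y
    if mechanism == "MAR":
        return 0.6 if x < 30 else 0.1       # depends only on the observed x (age)
    if mechanism == "NMAR":
        return 0.6 if y > 50000 else 0.1    # depends on y itself (income)
    raise ValueError(f"unknown mechanism: {mechanism}")

def apply_missingness(rows, mechanism, seed=0):
    """Return a copy of rows (age, income) with income set to None at random."""
    rng = random.Random(seed)
    out = []
    for age, income in rows:
        p = missing_probability(mechanism, age, income)
        out.append((age, None if rng.random() < p else income))
    return out

# Made-up sample: (age, income)
rows = [(22, 30000), (27, 65000), (41, 48000), (58, 90000)]
for mechanism in ("MCAR", "MAR", "NMAR"):
    print(mechanism, apply_missingness(rows, mechanism))
```

Under MAR the probability changes with age but not with income at a fixed age; under NMAR it changes with income even when age is held fixed.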
6
Example MCAR
• We want to assess the main determinants of income (such as age). The MCAR assumption would be violated if people who did not report their income were, on average, younger than people who reported it.
This can be checked by dividing the sample into those who did and did not report their income, and then testing for a difference in mean age. A significant difference is evidence against MCAR; the absence of a difference does not prove MCAR.
• Paper test results destroyed by water damage before data entry (the loss is unrelated to the results themselves).
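The age comparison described in the income example can be sketched as follows. The code splits respondents by whether income was reported and computes Welch's t statistic for the difference in mean age. The sample data and the helper names are made up for illustration, and turning the statistic into a p-value would need a t-distribution from a statistics library, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def split_ages_by_reporting(rows):
    """Split ages into those with reported income and those with missing income."""
    reported = [age for age, income in rows if income is not None]
    missing = [age for age, income in rows if income is None]
    return reported, missing

def welch_t(a, b):
    """Welch's t statistic for a difference in means (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Made-up sample: (age, income); None marks an unreported income.
rows = [(25, None), (27, None), (29, None),
        (40, 50000), (45, 52000), (50, 61000), (55, 58000)]

reported, missing = split_ages_by_reporting(rows)
t = welch_t(reported, missing)
print(f"mean age reported={mean(reported)}, missing={mean(missing)}, t={t:.2f}")
```

A large absolute t value suggests that non-reporters differ in age from reporters, which speaks against MCAR for this variable.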
7
Example MAR
• The MAR assumption would be satisfied if the probability of missing data on income depended on a person’s age, but within each age group the probability of missing income was unrelated to income.
However, this cannot be tested: we do not know the missing values, so we cannot compare those with and without missing data to see whether they differ systematically on that variable.
• For very sick patients, clinicians may not draw blood for lab analysis (MAR only if illness severity is recorded in the data).
• Titanic: missing (NaN) age entries for third-class passengers.
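Although MAR itself cannot be verified, one can at least check whether missingness varies with an observed variable, as in the Titanic example. The sketch below computes the share of missing values per group. The passenger rows are invented for illustration and are not figures from the actual Titanic data set; missing_rate_by_group is a hypothetical helper.

```python
from collections import defaultdict

def missing_rate_by_group(rows):
    """rows: iterable of (group, value). Returns {group: share of None values}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for group, value in rows:
        total[group] += 1
        if value is None:
            missing[group] += 1
    return {g: missing[g] / total[g] for g in total}

# Invented (passenger class, age) pairs; None marks a missing age.
passengers = [(1, 38.0), (1, 54.0), (1, None), (1, 29.0),
              (3, None), (3, 22.0), (3, None), (3, None)]

rates = missing_rate_by_group(passengers)
print(rates)
```

Clearly different rates across groups rule out MCAR and are consistent with MAR given the grouping variable, but they cannot exclude NMAR.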
8
Example NMAR
• The NMAR assumption would hold if people with high income were less likely to report their income.
• Titanic: missing (NaN) cabin entries for third-class passengers.
9
Options for handling Missing Data
• Listwise deletion (complete case analysis)
• Imputation methods
  • Marginal mean imputation
  • Conditional mean imputation
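The options listed above can be sketched on a tiny data set of (group, y) pairs. This is a simplified illustration under stated assumptions: conditional mean imputation is shown in its simplest form, the mean of y within a categorical group, whereas in general the conditional mean is often estimated with a regression model. The function names and data are made up.

```python
from statistics import mean

def listwise_delete(rows):
    """Keep only complete cases (rows without any missing value)."""
    return [r for r in rows if None not in r]

def marginal_mean_impute(rows):
    """Replace each missing y with the overall mean of the observed y values."""
    fill = mean(y for _, y in rows if y is not None)
    return [(g, fill if y is None else y) for g, y in rows]

def conditional_mean_impute(rows):
    """Replace each missing y with the mean of observed y in the same group.

    Assumes every group has at least one observed y; otherwise a KeyError
    is raised.
    """
    by_group = {}
    for g, y in rows:
        if y is not None:
            by_group.setdefault(g, []).append(y)
    group_means = {g: mean(ys) for g, ys in by_group.items()}
    return [(g, group_means[g] if y is None else y) for g, y in rows]

# Made-up data: (age group, income in thousands); None marks a missing value.
rows = [("young", 20), ("young", None), ("young", 30),
        ("old", 60), ("old", None), ("old", 80)]

print(listwise_delete(rows))
print(marginal_mean_impute(rows))
print(conditional_mean_impute(rows))
```

Both mean-based variants keep the sample size but understate the variability of y, since every imputed value sits exactly on a mean.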
10
Handling Missing Data
• First, do no harm. Use best practices and careful methodology to minimize missingness.
• Be transparent. Report every instance of missing data.
• Explicitly discuss whether the data are missing at random.
• Discuss how you as a researcher have dealt with the issue of incomplete data and the results of your intervention.
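To support transparent reporting, a short summary of missingness per variable can accompany any analysis. The sketch below assumes records stored as dictionaries with None for missing values; missingness_report and the field names are hypothetical.

```python
def missingness_report(records, fields):
    """Summarise missing values per field and the number of complete cases."""
    per_field = {
        f: sum(1 for r in records if r.get(f) is None) for f in fields
    }
    complete = sum(
        1 for r in records if all(r.get(f) is not None for f in fields)
    )
    return {
        "n": len(records),
        "missing_per_field": per_field,
        "complete_cases": complete,
    }

# Made-up records; None marks a missing value.
records = [
    {"age": 30, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 45, "income": None},
    {"age": 51, "income": 61000},
]

report = missingness_report(records, ["age", "income"])
print(report)
```

Reporting the number of complete cases alongside the per-variable counts shows how much data a listwise deletion would discard.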


Guidelines for Data Quality and Preprocessing


Editor's Notes

  • #5 (Missing Data mechanisms):
    • Missing completely at random (MCAR): data are missing independently of both observed and unobserved data. Example: a participant flips a coin to decide whether to complete the depression survey.
    • Missing at random (MAR): given the observed data, data are missing independently of unobserved data. Example: male participants are more likely to refuse to fill out the depression survey, but the refusal does not depend on their level of depression.
    • MCAR implies MAR, but not the other way round. Most methods assume MAR.
    • Under MCAR, omitting the incomplete observations does not bias the estimates, although it reduces the sample size. Under MAR, the missingness mechanism can be ignored by methods that condition on the observed variables, but simply omitting incomplete observations can bias the results.
    • Missing not at random (MNAR, the NMAR of the slides): missingness is related to the values of the unobserved data. Example: participants with severe depression, or side effects from the medication, were more likely to be missing at the end of the study.