Methods for making the
best use of admin data
Chair: Andy Teague, Admin Data Census, ONS
andy.teague@ons.gov.uk
Use of administrative,
commercial, Big and other data
sources to transform the 2021
Census
Andy Teague – Admin Data Census
How can admin and other data help?
Prepare
• Volumes for planning/testing
• Demographic characteristics for sampling & targeting
• Estimating response patterns
Collect &
Process
• Targeting resources during operation
• Data validating, editing, imputing or substituting
• Quality assuring
Output
• Enhancing outputs
• Quality reporting
QUALITY
Examples of how admin and other data
can help
Task: Volumes for planning/testing
  Example data sets: Ofcom: broadband connections and signal strength; digital uptake on other public services, e.g. DVLA and Electoral Registration
  Example use: Estimating digital uptake versus number of paper questionnaires required
Task: Demographic characteristics for sampling & targeting
  Example data sets: Address Register supplemented with, for example, information on population churn from GP registers, Tax and Benefits data, school census data, HESA data
  Example use: Basic demographic characteristics to help with CCS sample stratification / Hard to count index
Task: Targeting publicity during operation
  Example data sets: Twitter, Facebook or other social media; population churn indicators from admin data (see above)
  Example use: Sentiment analysis techniques to identify opinions about census operation for use in publicity campaign; targeting publicity at high churn/hard to count areas
Task: Data validating, editing, imputing or substituting
  Example data sets: GP patient registrations supplemented with activity data
  Example use: Substitute good quality record level admin/other data to populate imputed records following coverage assessment
Task: Quality assuring and reporting
  Example data sets: HESA, School censuses, Council Tax, GP registers, activity data, Tax and Benefits data
  Example use: Compare census results with good quality data sets to check accuracy of census estimates
Task: Enhancing outputs
  Example data sets: HMRC tax and DWP benefits data, VOA data
  Example use: Use alongside census to produce estimates of income and/or replace census questions (e.g. no. of rooms)
Linking and Matching Data
Pete Jones – Admin Data Census
Shelley Gammon - Methodology
Presentation overview
• Overview of matching requirements in 2021
• Summary of methods and findings from B2011 matching
research
• Research update on admin data and Census-CCS
matching
• Outline the Matching Strategy for 2021
• Future developments:
• Graph Databases
• Evaluating Linkage Quality
• Wider benefits
Matching requirements in 2021
Beyond 2011 Data Linkage
• Extensive feasibility research into administrative data matching
during Beyond 2011
• Admin data population estimates produced by linking NHS
Patient Register (PR), DWP Customer Information System (CIS)
and Higher Education Statistics Agency (HESA) data
• Focused on the development of automated matching algorithms:
- Required to link large admin datasets (100m + records)
- Adopted a ‘pseudonymisation’ approach as part of a tactical
solution to preserve privacy in the linkage process
• ONS is reviewing its longer term approach to privacy and data
linkage but continues to build on methodological research to date
Beyond 2011 Methods
• Beyond 2011 matching was largely based around automated
linkage
• No common identifier, so matching on name, sex, date of birth
and address
• Three stages to the algorithm:
(1) Deterministic matching: Linking records on a series of ‘match-keys’ where
exact or partial agreement is found across match fields
(2) Logistic regression matching: Using clerically matched training data to
model matching decisions for more complex cases
(3) Associative matching: linking individuals based on collectively resolving
matches within a household
• Three stages are designed to run sequentially in a combined
algorithm
• Match-keys are used to resolve minor inconsistencies that commonly
occur between match-fields on two datasets
Match-Keys
Key | Type | Unique records on PR
1 | Forename, Surname, DoB, Sex, Postcode | 100.0%
2 | Forename initial, Surname initial, DoB, Sex, Postcode District | 99.6%
3 | Forename bi-gram, Surname bi-gram, DoB, Sex, Postcode Area | 99.4%
4 | Forename initial, DoB, Sex, Postcode | 99.8%
5 | Surname initial, DoB, Sex, Postcode | 99.4%
6 | Forename, Surname, Age, Sex, Postcode Area | 99.5%
7 | Forename, Surname, Sex, Postcode | 99.2%
8 | Forename, Surname, DoB, Sex | 98.9%
9 | Forename, Surname, DoB, Postcode | 99.5%
10 | Surname, Forename, DoB, Sex, Postcode (matched on key 1) | 100.0%
11 | Middle name, Surname, DoB, Sex, Postcode (matched on key 1) | 99.9%
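The staged use of match-keys can be sketched roughly as follows. This is a minimal illustration, not the ONS implementation: the record layout, the helper names (`postcode_district`, `unique_index`, `deterministic_link`) and the rule that a key value must be unique on both datasets before a link is made are all assumptions, and only keys 1, 2 and 8 from the table are shown.

```python
from collections import defaultdict

def postcode_district(postcode):
    # Outward code, e.g. "NP10 8XG" -> "NP10" (assumes a space-separated postcode).
    return postcode.split()[0]

# (key number from the table, function building the key from a record)
MATCH_KEYS = [
    (1, lambda r: (r["forename"], r["surname"], r["dob"], r["sex"], r["postcode"])),
    (2, lambda r: (r["forename"][0], r["surname"][0], r["dob"], r["sex"],
                   postcode_district(r["postcode"]))),
    (8, lambda r: (r["forename"], r["surname"], r["dob"], r["sex"])),
]

def unique_index(records, key):
    """Index records by key value, keeping only values held by exactly one record."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        buckets[key(rec)].append(rec_id)
    return {value: ids[0] for value, ids in buckets.items() if len(ids) == 1}

def deterministic_link(left, right, keys=MATCH_KEYS):
    """Apply match-keys in order; records linked on an earlier key are set aside."""
    links = {}
    unmatched_left, unmatched_right = dict(left), dict(right)
    for key_number, key in keys:
        left_index = unique_index(unmatched_left, key)
        right_index = unique_index(unmatched_right, key)
        for value, left_id in left_index.items():
            right_id = right_index.get(value)
            if right_id is not None:
                links[left_id] = (right_id, key_number)
                del unmatched_left[left_id]
                del unmatched_right[right_id]
    return links

# Hypothetical records, for illustration only.
pr = {
    "p1": {"forename": "ANNA", "surname": "SMITH", "dob": "1980-02-01", "sex": "F", "postcode": "PO15 5RR"},
    "p2": {"forename": "JOHN", "surname": "JONES", "dob": "1975-06-30", "sex": "M", "postcode": "NP10 8XG"},
}
cis = {
    "c1": {"forename": "ANNA", "surname": "SMITH", "dob": "1980-02-01", "sex": "F", "postcode": "PO15 5RR"},
    "c2": {"forename": "JON", "surname": "JONES", "dob": "1975-06-30", "sex": "M", "postcode": "NP10 8ZZ"},
}
links = deterministic_link(pr, cis)
```

In this toy data the first pair agrees exactly (key 1), while the second pair differs in forename spelling and postcode unit and is only picked up by the looser key 2.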
What did we learn from B2011?
• Match rates between key admin datasets were high
- 95% of PR records link to the CIS
- 97% of School Census records link to the PR
- Formed the basis of a Statistical Population Dataset (SPD)
• Large volumes of administrative and census data can be linked efficiently using
automated methods:
- BUT trade off between quality and scale of matching
• A pseudonymisation approach restricts the implementation of some methods:
- Probabilistic matching
- Clerical matching
• The quality of linkage delivered through auto-matching will not be high enough for
certain applications:
- Coverage adjustment (for example, linking records for dual system estimation)
- Multivariate estimates of population characteristics
Research Update on Admin Data Matching: Automated
Probabilistic Linkage
• Tested and implemented methods for automated probabilistic matching
(Fellegi-Sunter matching framework)
• Threshold setting traditionally relies on large amounts of clerical review of
match candidates
• Implemented a ‘duplicate link’ method (Blakely & Salmond, 2002) that detects
points in the distribution that are most likely to have true matches
• No clerical review is required
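In the Fellegi-Sunter framework, each compared field contributes an agreement or disagreement weight derived from two probabilities: m (the field agrees given a true match) and u (the field agrees given a non-match). The sketch below illustrates the scoring only. The m and u values, field names and thresholds are hypothetical, and the duplicate link method for setting thresholds is not reproduced here.

```python
import math

# Hypothetical (m, u) probabilities per field:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "forename": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "dob": (0.98, 0.001),
    "postcode": (0.80, 0.0005),
}

def field_weight(agrees, m, u):
    """Log2 likelihood ratio for agreement or disagreement on one field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_score(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum the field weights, assuming fields are conditionally independent."""
    return sum(
        field_weight(rec_a[field] == rec_b[field], m, u)
        for field, (m, u) in params.items()
    )

def classify(score, upper, lower):
    """Classify a candidate pair against two (hypothetical) thresholds."""
    if score >= upper:
        return "link"
    if score <= lower:
        return "non-link"
    return "undecided"

a = {"forename": "ANNA", "surname": "SMITH", "dob": "1980-02-01", "postcode": "PO15 5RR"}
b = {"forename": "ANNA", "surname": "SMITH", "dob": "1980-02-01", "postcode": "PO15 5AA"}
c = {"forename": "JOHN", "surname": "JONES", "dob": "1975-06-30", "postcode": "NP10 8XG"}
```

Pairs that agree on every field score highest, a pair that disagrees only on postcode scores lower but still positive, and a pair that disagrees everywhere scores negative.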
Duplicate Link Method
Duplicate Link Method in Practice
• Have included a further 180,000 probabilistic links in the Admin
Data Population Estimates for 2011
2021 Census and CCS Auto-matching
• 70% of CCS-Census links were automated in 2011
• Testing ONS methods to increase the rate of auto-matching in 2021
• Using linked 2011 Census-CCS data for testing and comparison
• Anticipating higher auto-match rate with increased on-line completion
2021 Census and CCS Auto-matching
Matching Strategy for 2021
Methodology current research areas
• Graph databases – for management of linked data
• Sampling procedures to assess the quality of a
data linkage project
• Increasing the level of linkage automation,
whilst maintaining stringent quality standards
• Which string comparators work best for partial agreement,
e.g. in a Fellegi-Sunter model?
• Machine learning techniques for data linkage
Issues for data linkage
• Multiple sources of data
• Large data sets
• Updates required – changing data over time
• Need to target clerical resource
• Master linked data file – viewed as ‘truth’
• Different matching requirements for different
users
• Inconsistent clusters
What is a graph database?
• A graph database stores data points and also
the relationships between the data.
• A graph can also be represented as a matrix.
• Graph format storage in certain use cases is more
efficient than other forms of storage
Adjacency matrix
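As a rough illustration of the two representations, the sketch below stores the same small set of candidate links as an edge list and as an adjacency matrix. The node labels and scores are hypothetical. With few links relative to the number of possible pairs, the edge list holds far fewer values than the matrix, which is one reason graph-style storage can be more efficient for linked data.

```python
# Hypothetical records: "1"-"4" from dataset 1, "A"-"D" from dataset 2.
nodes = ["1", "2", "3", "4", "A", "B", "C", "D"]

# Hypothetical candidate links with match scores (1.0 = exact match).
edges = {("1", "A"): 1.0, ("2", "C"): 0.9, ("4", "B"): 1.0}

def adjacency_matrix(nodes, edges):
    """Build a symmetric matrix whose cells hold the match score (0.0 = no link)."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for (a, b), score in edges.items():
        matrix[index[a]][index[b]] = score
        matrix[index[b]][index[a]] = score  # links are undirected
    return matrix

matrix = adjacency_matrix(nodes, edges)
cells_in_matrix = len(nodes) ** 2   # values stored by the matrix form
values_in_edge_list = len(edges)    # values stored by the edge-list form
```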
How can you use a graph to model
linked data?
• Blue nodes - Records in dataset 1
• Green nodes – records in dataset 2
[Figure: graph linking records 1, 2, 3 and 4 in dataset 1 to records A, B, C and D in dataset 2. Numbers on the edges represent the match score, where 1 is an exact match. No match for records 3 & D.]
Graphical representation of linked data
• Add in data on weaker links
• Model strength of linkage score
[Figure: the same graph of records 1 to 4 and A to D, with a weaker link scored 0.6 added.]
Keep information on all comparisons, not just the highest scoring pairs.
Graphical representation of linked data
• Add a third data source
[Figure: the same graph with a third data source added (records X, Y, Z1 and Z2), including the link scored 0.6.]
The weak link between 3 & D now looks more plausible.
Can detect duplicate records.
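One simple way to explore this idea is to treat connected groups of strongly linked records as clusters. The sketch below is an assumption-laden illustration rather than the method under development: the edges, scores, threshold and source labels are hypothetical, and connected components are used as a stand-in for whatever cluster metrics a graph database would offer.

```python
def find(parent, node):
    """Return the representative of the set containing node (union-find)."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node

def link_clusters(nodes, edges, threshold):
    """Group records connected by links scoring at or above the threshold."""
    parent = {node: node for node in nodes}
    for (a, b), score in edges.items():
        if score >= threshold:
            parent[find(parent, a)] = find(parent, b)
    groups = {}
    for node in nodes:
        groups.setdefault(find(parent, node), []).append(node)
    return sorted(sorted(group) for group in groups.values())

def same_source_duplicates(cluster, source):
    """Return groups of records in one cluster that come from the same source."""
    by_source = {}
    for node in cluster:
        by_source.setdefault(source[node], []).append(node)
    return [ids for ids in by_source.values() if len(ids) > 1]

# Hypothetical records and the dataset (1, 2 or 3) each one comes from.
source = {"3": 1, "4": 1, "D": 2, "X": 3, "Z1": 3, "Z2": 3}
# Hypothetical match scores; 3-D is the weak link.
edges = {
    ("3", "D"): 0.6,
    ("3", "X"): 0.9,
    ("D", "X"): 0.9,
    ("4", "Z1"): 0.95,
    ("4", "Z2"): 0.95,
}
clusters = link_clusters(source, edges, threshold=0.8)
```

Here records 3 and D end up in the same cluster through their strong links to X, even though their direct link falls below the threshold, and the second cluster contains two records from the same source, which flags a possible duplicate.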
Graphical representation of linked data
• AUTOMATIC resolution of some clusters of records relating to
the same entity
[Figure: cluster of records C, D, 2, 3, X and Y.]
Even if the match score for 3-X is greater than the score for 2-X, cluster metrics can (sometimes) be used to generate clusters and break the 3-X link AUTOMATICALLY.
Graph databases (progress)
• Still early stages
• Learning about technology
• Improving linkage quality
• Next step is to use real data - lots of it!
- Larger data sets (whole population)
- More data sources
Quality of linkage
• After we have matched two datasets, we need
to assess the quality of linkage
• Errors in matching can impact subsequent
analyses, especially if there is a bias in
matching errors towards certain groups of
people
• Informing data users of estimated linkage error
enables them to adjust their analysis methods
to account for this
How good is our linking?
[Diagram: linked file formed from Dataset 1 and Dataset 2, showing the correct links]
Quality depends on:
1. Missed matches
2. Incorrect links
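The two error types map onto the standard precision and recall measures. A minimal sketch, with hypothetical counts:

```python
def linkage_quality(correct_links, incorrect_links, missed_matches):
    """Return (precision, recall) for a linked file.

    precision: share of links made that are correct (penalises incorrect links)
    recall:    share of true matches that were linked (penalises missed matches)
    """
    precision = correct_links / (correct_links + incorrect_links)
    recall = correct_links / (correct_links + missed_matches)
    return precision, recall

# Hypothetical counts, e.g. from a clerically checked sample.
precision, recall = linkage_quality(correct_links=950, incorrect_links=10, missed_matches=40)
```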
Sampling to estimate linkage quality (1)
• Linkage is often undertaken in stages
e.g. exact, deterministic, probabilistic.
• Estimate level of incorrect links (false positives) overall, by stage
(& other factors, e.g. age, sex, LA)
• Estimate level of missed matches (false negatives) overall, by linkage
probability score (& other factors?)
• How many records do we need to assess
clerically?
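A common starting point for the sample-size question is the normal-approximation formula for estimating a proportion under simple random sampling. The sketch below assumes that design, a prior guess at the error rate and a target margin of error; it is not the stratified approach the research evaluates, and all the numbers are hypothetical.

```python
import math

def sample_size(expected_rate, margin, z=1.96, population=None):
    """Records to check clerically to estimate an error rate within +/- margin.

    Uses n = z^2 * p * (1 - p) / margin^2, with an optional finite
    population correction when the number of links is known.
    """
    n = z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

# Hypothetical: expect about 2% false positives, want +/- 0.5 percentage points.
n_large_file = sample_size(0.02, 0.005)
n_small_stage = sample_size(0.02, 0.005, population=10_000)
```

The formula shows why prior information on quality matters: the lower the expected error rate, the fewer checks are needed for a given margin.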
Sampling for precision and recall (2)
• Sampling approach for clerical checks to estimate
false positives and false negatives?
- Used gold-standard matched data from the 2011 Census
to evaluate different approaches.
- Can stratification improve quality and reduce costs?
• Optimum sampling strategy depends on:
1. availability of prior information on quality
2. What level of detail is required in the estimates
Wider benefits
• Data linkage playing an increasing role at
ONS and more widely
o Increased use of administrative data
o Our methods being reused
• Collaborating externally
o e.g. with experts at UCL, developing guidance for
linkage standards
Contact:
Pete Jones – Admin Data Census,
Census Transformation Programme
Peter.jones@ons.gov.uk
Shelley Gammon – Methodology,
Digital Services, Technology and Methodology
Shelley.gammon@ons.gov.uk
Towards an integrated census-
administrative data approach to item
imputation for the 2021 Census
Fern Leather, Katie Sharp and Steven Rogers
Collection and Editing Methodology and
Statistical Computing
Background
• Item-level edit and imputation process for 2011
Census used CANCEIS software for nearest
neighbour imputation 1
• General move towards use of more
administrative data in 2021 Census 2
• Can we use linked admin data to provide
auxiliary information to improve the accuracy of
the imputation process?
1 Bankier, M., Lachance, M., & Poirier, P. (1999). A generic implementation of the new imputation methodology.
http://www.ssc.ca/survey/documents/SSC2000_M_Bankier.pdf
2 http://www.ons.gov.uk/ons/guide-method/census/2021-census/about-the-census-transformation-programme/programme-objectives/index.html
Research questions
• What does an improvement on the conventional
(no admin) method look like?
Based on evaluating the predictive and
distributional accuracy of imputation under
differing conditions 3
• Can using linked admin data deliver that
improvement?
• Where do real admin data sources fall on the
continuum of imputation performance?
3 Chambers (2001), National Statistics Methodological Series Report 28: Evaluation criteria for statistical editing and imputation, Office for
National Statistics
Simulation study
• Baseline test data – clean and consistent
census data – approx 600,000 records from a
contiguous geographic area
• Perturbation strategy – age randomly
removed for 5% of records by age and
household size
• A series of synthetic admin datasets with
systematically increasing error in the age
variable were created: 0 years, +/- 3 years,
+/- 6 years, +/- 12 years, to compare with the
conventional no admin approach
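The set-up can be sketched as below. This is only an illustration under stated assumptions: age is removed by simple random sampling rather than by age and household size, the admin error is drawn uniformly within the stated bound, and the function names are hypothetical. The slides do not say how the synthetic error was distributed.

```python
import random

def perturb_ages(ages, missing_rate, rng):
    """Blank out age for a random share of records (simple random sample)."""
    ids = sorted(ages)
    n_missing = round(len(ids) * missing_rate)
    missing = set(rng.sample(ids, n_missing))
    return {rec_id: (None if rec_id in missing else age) for rec_id, age in ages.items()}

def synthetic_admin_age(true_age, max_error, rng):
    """Return an admin age within +/- max_error years of the true age.

    Assumes a uniform error; ages are floored at 0.
    """
    if max_error == 0:
        return true_age
    return max(true_age + rng.randint(-max_error, max_error), 0)

rng = random.Random(2021)
census_ages = {rec_id: rec_id % 90 for rec_id in range(1000)}  # hypothetical data
perturbed = perturb_ages(census_ages, missing_rate=0.05, rng=rng)

# One synthetic admin dataset per error level used in the study.
admin_datasets = {
    max_error: {rec_id: synthetic_admin_age(age, max_error, rng)
                for rec_id, age in census_ages.items()}
    for max_error in (0, 3, 6, 12)
}
```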
Predictive accuracy
[Charts of imputation error]
Exact admin data: Mean = 0.20 +/- 0.03, St Dev = 3.18
No admin data (control): Mean = 0.15 +/- 0.05, St Dev = 11.77
+/- 3 years error: Mode = 3
+/- 6 years error: Mode = 6
Distributional accuracy
RMSE = 25.3
SSE = 68,387
RMSE = 22.32
SSE = 51,294
RMSE = 30.46
SSE = 97,425
RMSE = 40.52
SSE = 172,354
RMSE = 36.10
SSE = 140,709
+/-3 years admin error +/-6 years error +/-12 years error
Other measures
Measure | Exact | No Admin (control) | Admin3 | Admin6 | Admin12
Size of total donor pool | 56,077 | 258,116 | 57,529 | 59,807 | 70,068
Average age range of potential donors (years) | 1 | 14.3 | 1.1 | 1.4 | 2.5
Percentage of records where true age is within range of imputation actions | 53.1 | 58.3 | 20.9 | 12.9 | 20.5
Percentage of records failing edit checks if admin age directly substituted in | 0 | N/A | 3.8 | 9.1 | 19.4
Evaluating actual admin datasets
[Charts: RMSE = 994 and RMSE = 1257, with annotations marking differences >1%]
Difference from Census (years)  PR exact %  CIS exact %  PR probabilistic %  CIS probabilistic %
-10 0.03 0.03 0.16 0.26
-9 0.01 0.01 0.16 0.25
-8 0.01 0.01 0.17 0.26
-7 0.01 0.01 0.19 0.27
-6 0.02 0.02 0.24 0.32
-5 0.02 0.02 0.27 0.33
-4 0.02 0.02 0.31 0.33
-3 0.03 0.03 0.38 0.37
-2 0.06 0.06 0.46 0.42
-1 0.30 0.38 0.76 0.79
0 98.40 98.16 86.71 85.91
1 0.33 0.33 0.70 0.81
2 0.06 0.07 0.39 0.36
3 0.03 0.03 0.33 0.30
4 0.02 0.02 0.25 0.27
5 0.02 0.02 0.23 0.25
6 0.02 0.02 0.20 0.23
7 0.01 0.01 0.16 0.20
8 0.01 0.01 0.15 0.19
9 0.01 0.01 0.13 0.18
10 0.02 0.03 0.14 0.19
Imputation with observed admin data error
[Figure: NHS Patient Register compared with no admin data, exact admin data, +/-3 years error etc.]
Measure | Exact admin data | PR exact | PR probabilistic | No admin data
Mean, estimate (95% confidence limits) | -0.2 (-0.22, -0.17) | 0.20 (0.17, 0.23) | 0.23 (0.19, 0.27) | -0.15 (-0.19, -0.1)
SD | 3.18 | 3.32 | 5.46 | 11.77
Variance | 10.1 | 11.02 | 29.81 | 138.47
Distributional accuracy
Measure | PR exact | PR probabilistic | CIS exact | CIS probabilistic
SSE | 47,500 | 58,421 | 46,146 | 45,478
RMSE | 21.69 | 24.05 | 21.38 | 21.22
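The two measures are related by RMSE = sqrt(SSE / n), where n is the number of categories being compared. The slides do not state n; the figures in the table above appear consistent with n = 101 (for example, single years of age 0 to 100), which is an inference from the numbers rather than a documented fact. A small check under that assumption:

```python
import math

N_CATEGORIES = 101  # assumption inferred from the reported figures, not stated in the slides

reported = {
    # dataset: (SSE, RMSE) as reported in the table
    "PR exact": (47_500, 21.69),
    "PR probabilistic": (58_421, 24.05),
    "CIS exact": (46_146, 21.38),
    "CIS probabilistic": (45_478, 21.22),
}

def rmse_from_sse(sse, n):
    """Root mean squared error from a sum of squared errors over n categories."""
    return math.sqrt(sse / n)

recomputed = {
    name: round(rmse_from_sse(sse, N_CATEGORIES), 2)
    for name, (sse, _) in reported.items()
}
```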
Simulating different coverage levels
• So far, we have assumed 100% coverage of all
admin datasets – this is a best case scenario
which is not realistic for live operations
• Next step - simulate 80%, 60%, 40%, 20%
coverage of the PR and CIS synthetic datasets
Predictive accuracy – PR exact dataset
Predictive accuracy - overall
Predictive accuracy – what can we
expect in a real situation?
Conclusion and next steps
• Under certain conditions, auxiliary information
from linked admin data can improve imputation
accuracy for age, even when coverage is low
• The method is able to protect observed Census
data from highly erroneous admin data
• Further research:
Explore use for other variables – joint distributions and
variables with high imputation rates in 2011
Contact:
Fern Leather, Katie Sharp and Steven Rogers
Collection and Editing Methodology and
Statistical Computing
fern.leather@ons.gsi.gov.uk
steven.rogers@ons.gsi.gov.uk
katie.sharp@ons.gsi.gov.uk
Using administrative data to
improve 2021 Census population
estimates
Owen Abbott
Office for National Statistics
This presentation will cover:
• Context:
- Use of admin data in 2011 Census
• How we produce population estimates
• Potential ways to improve estimates:
- Improving response
- Coverage measurement
- Quality assurance
Background - 2011 Census
• 2011 Census used administrative data to:
• Support collection
• Build Address Register (e.g. NLPG)
• HtC index (e.g. House prices)
• Enhance population estimates through
Quality Assurance
• Mainly aggregates
• Some experimental record linkage (Blackwell et
al, 2013)
[Figure: All persons - Exeter. Number of usual residents (0 to 18,000) by quinary age group. Series: 2011 Census counts, 2011 Census estimates with 95% confidence interval, rolled forward estimates, Patient Register 2011, School Census 2011, Social Security and Revenue Information 2011, comparator lower bound and comparator upper bound.]
Source: 2011 Census QA pack available at http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-user-guide/quality-and-methods/quality/quality-assurance/local-authority-quality-assurance/index.html
Census population estimates
• Before looking at ways to improve the
estimates, need to examine how they are
produced
• Important to highlight things which influence
the quality of those estimates
Census population estimates
• The census outputs are not just the count of
responses, as not every household or person
is counted
• Methodology for estimating those missed:
• Carry out large post-enumeration survey (CCS)
• Match to Census
• Dual-system Estimation
• Estimate total population
• Quality Assurance
• Impute households and persons into output
database
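The core of dual-system estimation is the capture-recapture (Lincoln-Petersen) estimator: if the census counts n1 people, the CCS counts n2, and m are matched in both, the population is estimated as n1 * n2 / m. The sketch below shows only that step, with hypothetical counts. It assumes the two sources are independent and that matching is error-free, and leaves out the stratification, ratio estimation and adjustments used in practice.

```python
def dual_system_estimate(census_count, ccs_count, matched_count):
    """Lincoln-Petersen estimate of the true population in an area."""
    if matched_count <= 0:
        raise ValueError("at least one matched record is needed")
    return census_count * ccs_count / matched_count

def implied_census_coverage(ccs_count, matched_count):
    """Share of CCS respondents also found in the census."""
    return matched_count / ccs_count

# Hypothetical sample area: census counted 900, CCS counted 800, 720 in both.
estimate = dual_system_estimate(census_count=900, ccs_count=800, matched_count=720)
coverage = implied_census_coverage(ccs_count=800, matched_count=720)
```

With these counts the census is estimated to have covered 90% of the population, giving an estimate of 1,000 people. A missed match lowers m and inflates the estimate, which is why matching accuracy has such a direct effect on the results.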
Framework for producing population
estimates using a census
[Diagram components: Census; Census Coverage Survey; Matching; Dual System Estimation; Ratio estimation; Quality Assurance; Population estimates]
Key influences on quality of estimates
[Diagram: the same framework (Census; Census Coverage Survey; Matching; Dual System Estimation; Ratio estimation; Quality Assurance; Population estimates), annotated with influences on quality: overall response rates; sample size; response rates; matching accuracy; Census/CCS independence; access to sources; quality of sources; response by age-sex; response variation; response by LA; over-count]
Potential ways to improve estimates
• Three main areas where administrative data
could be used:
1. Supporting the collection to improve census
response
2. Enhancing coverage assessment
3. Quality Assurance of population estimates
Enhance census coverage
[Diagram: the same framework (Census; Census Coverage Survey; Matching; Dual System Estimation; Ratio estimation; Quality Assurance; Population estimates), with Admin Data added]
Enhance census coverage
[Figure: an address register listing 1 to 15 The High Street, compared across three columns. Census: no response; response, 1 person missed. Administrative data (filtered for activity): no data; 1 person missed. Enhanced Census: no response; admin data; admin data, 1 person missed; response, 1 person missed.]
Enhance census coverage
• This approach used by NISRA in 2011
• Full evaluation by Ross (2015)
• Findings:
• Added 68k persons (3.9%) in 31k (4.5%)
households
• Reduced confidence interval widths by around
20%
• Gains in variance with small risk of additional bias
Age-Sex Distribution of Census and Administrative data records
[Figure: percentage of records (0.0% to 6.0%) by age-sex group, Census records compared with CUE records]
Estimates and Confidence Intervals
Estimation Area | With CUE records: Estimate | Variance | Relative CI width | Without CUE records: Estimate | Variance | Relative CI width
Eastern Northern Ireland | 791,900 | 9,146,600 | 0.75% | 782,000 | 14,372,000 | 0.95%
Western Northern Ireland | 535,800 | 10,800,700 | 1.20% | 528,600 | 13,991,800 | 1.39%
Belfast | 462,000 | 11,562,800 | 1.44% | 452,500 | 21,540,600 | 2.01%
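The relative CI widths in the table appear to follow from the estimates and variances as 1.96 * sqrt(variance) / estimate, i.e. the 95% half-width relative to the estimate. That reading is an inference from the reported figures rather than a definition given in the slides. A short check under that assumption:

```python
import math

def relative_ci_width(estimate, variance, z=1.96):
    """95% confidence interval half-width as a fraction of the estimate."""
    return z * math.sqrt(variance) / estimate

# (estimate, variance) pairs from the table: with and without CUE records.
areas = {
    "Eastern Northern Ireland": ((791_900, 9_146_600), (782_000, 14_372_000)),
    "Western Northern Ireland": ((535_800, 10_800_700), (528_600, 13_991_800)),
    "Belfast": ((462_000, 11_562_800), (452_500, 21_540_600)),
}

widths = {
    area: tuple(round(100 * relative_ci_width(est, var), 2) for est, var in pairs)
    for area, pairs in areas.items()
}
```

Because the width depends on the square root of the variance, a variance that almost halves (as in Belfast) narrows the interval by a little under 30%.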
Enhance census coverage
• This is being pursued for 2021:
• Fits within existing framework
• Improves quality
• Does not require large scale person level linkage
• Can evaluate through existing Census-CCS
linkage
• Can link the administrative data to the address
frame in advance
BUT
• Requires high quality activity data
What about the census collection?
• Using admin data in this way is an attempt to
improve quality after the census collection
• What about using it during the collection?
• Potential to prioritise follow-up activities based on
data on presence of age-sex groups…could this
reduce differential non-response?
• Needs further work…
Potential ways to improve estimates
• Three main areas where administrative data
could be used:
1. Supporting the collection to improve census
response
2. Enhancing coverage assessment
3. Quality Assurance of population estimates
Expand DSE to MSE
[Diagram: the same framework (Census; Census Coverage Survey; Matching; Dual System Estimation; Ratio estimation; Quality Assurance; Population estimates), extended with Administrative Data, Multiple System Estimation, Log-linear modelling and Over-coverage. Matching error ~2%.]
Other areas to explore
• Also considering use of administrative data
for:
- Coverage adjustment
- Bias adjustments
- National adjustment
Potential ways to improve estimates
• Three main areas where administrative data
could be used:
1. Supporting the collection to improve census
response
2. Enhancing coverage assessment
3. Quality Assurance of population estimates
Quality Assurance
• Can use administrative data for expanding
scope of QA in 2021
• Quality assuring population size as was done
in 2011
• Additional focus on household estimates
• Lower level QA e.g. output areas, SYOA
• Expanding topic QA
Summary
• Number of ways in which Administrative data
could improve the 2021 Census population
estimates
• These innovations fit into the existing
framework – evolution not revolution
• Number of challenges
• Access
• Quality
• Methodological
• Communicating methods
References
• Abbott, O. (2009) 2011 UK Census Coverage Assessment and Adjustment Methodology.
Population Trends, 137, pp. 25-32.
• Abbott, O. and Compton, G. (2014) Counting and estimating hard-to-survey populations in
the 2011 Census. In: R. Tourangeau, B. Edwards, T. Johnson, K. Wolter and N. Bates, eds.
2014. Hard-to-survey populations. Cambridge: Cambridge University Press. Ch.4.
• Blackwell, L., Charlesworth, A., Rogers, N. and Thorne, R. (2013) Matching of Census and
administrative data for Census data quality assurance in the 2011 Census of England and
Wales. Paper presented at NTTS2013 conference, Brussels. Available at
http://www.cros-portal.eu/sites/default/files/NTTS2013fullPaper_168.pdf
• ONS (2012) 2011 Census Address register. Available at
http://www.ons.gov.uk/ons/guide-method/census/2011/how-our-census-works/how-did-we-do-in-2011-/evaluation---address-register.pdf
• ONS (2015) 2011 Census Quality Assurance Methodology: Evaluation Report. Available at
http://www.ons.gov.uk/ons/guide-method/census/2011/how-our-census-works/how-did-we-do-in-2011-/2011-census-quality-assurance-methodology.pdf
• Ross, H. (2015) Using administrative data to enhance the quality of census population
estimates. MSc. University of Southampton.
Contact:
Owen Abbott – Methodology,
Digital Services, Technology and Methodology
Shelley.gammon@ons.gov.uk

 
PPTX
Employee Salary Presentation.l based on data science collection of data
barridevakumari2004
 
PPTX
Presentation1.pptxvhhh. H ycycyyccycycvvv
ItratBatool16
 
PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PPTX
Probability systematic sampling methods.pptx
PrakashRajput19
 
PDF
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 
PDF
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
PDF
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
PDF
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
PPTX
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
PDF
Company Presentation pada Perusahaan ADB.pdf
didikfahmi
 
PPTX
Analysis of Employee_Attrition_Presentation.pptx
AdawuRedeemer
 
PDF
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Company Profile 2023 PT. ZEKON INDONESIA.pdf
hendranofriadi26
 
INFO8116 - Week 10 - Slides.pptx big data architecture
guddipatel10
 
A Systems Thinking Approach to Algorithmic Fairness.pdf
Epistamai
 
Data_Cleaning_Infographic_Series_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
CH2-MODEL-SETUP-v2017.1-JC-APR27-2017.pdf
jcc00023con
 
Trading Procedures (1).pptxcffcdddxxddsss
garv794
 
Employee Salary Presentation.l based on data science collection of data
barridevakumari2004
 
Presentation1.pptxvhhh. H ycycyyccycycvvv
ItratBatool16
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
Probability systematic sampling methods.pptx
PrakashRajput19
 
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
Company Presentation pada Perusahaan ADB.pdf
didikfahmi
 
Analysis of Employee_Attrition_Presentation.pptx
AdawuRedeemer
 
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 

Methods for making the best use of admin data

  • 1. Methods for making the best use of admin data Chair: Andy Teague, Admin Data Census, ONS andy.teague@ons.gov.uk
  • 2. Use of administrative, commercial, Big and other data sources to transform the 2021 Census Andy Teague – Admin Data Census
  • 3. How admin and other data can help? Prepare: volumes for planning/testing; demographic characteristics for sampling & targeting; estimating response patterns. Collect & Process: targeting resources during operation; data validating, editing, imputing or substituting; quality assuring. Output: enhancing outputs; quality reporting. QUALITY
  • 4. Examples of how admin and other data can help
      Task | Example data sets | Example use
      Volumes for planning/testing | Ofcom – broadband connections and signal strength; digital uptake on other public services eg DVLA and Electoral Registration | Estimating digital uptake versus number of paper questionnaires required
      Demographic characteristics for sampling & targeting | Address Register supplemented with, for example, information on population churn from GP registers, Tax and Benefits data, school census data, HESA data | Basic demographic characteristics to help with CCS sample stratification / Hard to count index
      Targeting publicity during operation | Twitter, Facebook or other social media, population churn indicators from admin data (see above) | Sentiment analysis techniques to identify opinions about census operation for use in publicity campaign; targeting publicity at high churn/hard to count areas
      Data validating, editing, imputing or substituting | GP patient registrations supplemented with activity data | Substitute good quality record level admin/other data to populate imputed records following coverage assessment
      Quality assuring and reporting | HESA, School censuses, Council Tax, GP registers, activity data, Tax and Benefits data | Compare census results with good quality data sets to check accuracy of census estimates
      Enhancing outputs | HMRC tax and DWP benefits data, VOA data | Use alongside census to produce estimates of income and/or replace census questions (eg no. of rooms)
  • 5. Linking and Matching Data Pete Jones – Admin Data Census Shelley Gammon - Methodology
  • 6. Presentation overview • Overview of matching requirements in 2021 • Summary of methods and findings from B2011 matching research • Research update on admin data and Census-CCS matching • Outline the Matching Strategy for 2021 • Future developments: • Graph Databases • Evaluating Linkage Quality • Wider benefits
  • 8. Beyond 2011 Data Linkage • Extensive feasibility research into administrative data matching during Beyond 2011 • Admin data population estimates produced by linking NHS Patient Register (PR), DWP Customer Information System (CIS) and Higher Education Statistics Agency (HESA) data • Focused on the development of automated matching algorithms: - Required to link large admin datasets (100m + records) - Adopted a ‘pseudonymisation’ approach as part of a tactical solution to preserve privacy in the linkage process • ONS is reviewing its longer term approach to privacy and data linkage but continues to build on methodological research to date
  • 9. Beyond 2011 Methods • Beyond 2011 matching was largely based around automated linkage • No common identifier, so matching on name, sex, date of birth and address • Three stages to the algorithm: (1) Deterministic matching: Linking records on a series of ‘match-keys’ where exact or partial agreement is found across match fields (2) Logistic regression matching: Using clerically matched training data to model matching decisions for more complex cases (3) Associative matching: linking individuals based on collectively resolving matches within a household • The three stages are designed to run sequentially in a combined algorithm
  • 10. Match-Keys • Match-keys are used to resolve minor inconsistencies that commonly occur between match-fields on two datasets
      Key | Type | Unique records on PR
      1 | Forename, Surname, DoB, Sex, Postcode | 100.0%
      2 | Forename initial, Surname initial, DoB, Sex, Postcode District | 99.6%
      3 | Forename bi-gram, Surname bi-gram, DoB, Sex, Postcode Area | 99.4%
      4 | Forename initial, DoB, Sex, Postcode | 99.8%
      5 | Surname initial, DoB, Sex, Postcode | 99.4%
      6 | Forename, Surname, Age, Sex, Postcode Area | 99.5%
      7 | Forename, Surname, Sex, Postcode | 99.2%
      8 | Forename, Surname, DoB, Sex | 98.9%
      9 | Forename, Surname, DoB, Postcode | 99.5%
      10 | Surname, Forename, DoB, Sex, Postcode (matched on key 1) | 100.0%
      11 | Middle name, Surname, DoB, Sex, Postcode (matched on key 1) | 99.9%
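The deterministic stage described on slides 9 and 10 can be illustrated with a minimal sketch. It is not the ONS implementation: the field names, the sample records and the rule of linking only where a key is unique on both datasets are assumptions made for illustration. Only keys 1, 2 and 8 from the slide are included, applied in that order.

```python
from collections import defaultdict


def postcode_district(postcode):
    # Outward code, e.g. "SO17 1BJ" -> "SO17" (assumes a space-separated postcode).
    return postcode.split()[0]


# Three of the slide's match-keys (keys 1, 2 and 8), applied in that order.
MATCH_KEYS = [
    (1, lambda r: (r["forename"], r["surname"], r["dob"], r["sex"], r["postcode"])),
    (2, lambda r: (r["forename"][:1], r["surname"][:1], r["dob"], r["sex"],
                   postcode_district(r["postcode"]))),
    (8, lambda r: (r["forename"], r["surname"], r["dob"], r["sex"])),
]


def link_on_match_keys(records_a, records_b):
    """Return {index_in_a: (index_in_b, key_number)} for unique key agreements."""
    links = {}
    used_b = set()
    for key_number, make_key in MATCH_KEYS:
        index_a = defaultdict(list)
        index_b = defaultdict(list)
        # Only records still unlinked after earlier keys are considered.
        for i, rec in enumerate(records_a):
            if i not in links:
                index_a[make_key(rec)].append(i)
        for j, rec in enumerate(records_b):
            if j not in used_b:
                index_b[make_key(rec)].append(j)
        # Link only when the key value is unique on both sides (an assumption).
        for key, a_ids in index_a.items():
            b_ids = index_b.get(key, [])
            if len(a_ids) == 1 and len(b_ids) == 1:
                links[a_ids[0]] = (b_ids[0], key_number)
                used_b.add(b_ids[0])
    return links


def person(forename, surname, dob, sex, postcode):
    return {"forename": forename, "surname": surname, "dob": dob,
            "sex": sex, "postcode": postcode}


# Hypothetical records, invented for this example.
dataset_a = [
    person("ANNA", "SMITH", "1980-01-02", "F", "PO15 5RR"),
    person("JON", "JONES", "1975-05-05", "M", "SO17 1BJ"),
    person("MARY", "BROWN", "1990-09-09", "F", "NP10 8XG"),
]
dataset_b = [
    person("ANNA", "SMITH", "1980-01-02", "F", "PO15 5RR"),      # agrees on key 1
    person("JONATHAN", "JONES", "1975-05-05", "M", "SO17 2AA"),  # agrees on key 2
    person("MARY", "BROWN", "1990-09-09", "F", "CF10 1AA"),      # agrees on key 8
]

links = link_on_match_keys(dataset_a, dataset_b)
```

Running the keys from strictest to loosest means each pair is linked on the strongest key it satisfies, and recording the key number keeps the evidence behind every link available for later quality assessment.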
  • 11. What did we learn from B2011? • Match rates between key admin datasets were high - 95% of PR records link to the CIS - 97% of School Census records link to the PR - Formed the basis of a Statistical Population Dataset (SPD) • Large volumes of administrative and census data can be linked efficiently using automated methods: - BUT trade off between quality and scale of matching • A pseudonymisation approach restricts the implementation of some methods: - Probabilistic matching - Clerical matching • The quality of linkage delivered through auto-matching will not be high enough for certain applications: - Coverage adjustment (for example, linking records for dual system estimation) - Multivariate estimates of population characteristics
  • 12. Research Update on Admin Data Matching: Automated Probabilistic Linkage • Tested and implemented methods for automated probabilistic matching (Fellegi-Sunter matching framework) • Threshold setting traditionally relies on large amounts of clerical review of match candidates • Implemented a ‘duplicate link’ method (Blakely & Salmond, 2002) that detects points in the distribution that are most likely to have true matches • No clerical review is required
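As a rough illustration of the Fellegi-Sunter framework named on slide 12, the sketch below scores a candidate pair by summing log-likelihood-ratio weights per field. The m-probabilities (agreement given a true match) and u-probabilities (agreement given a non-match) are invented for illustration, as are the field names. The sketch does not cover threshold setting, so the 'duplicate link' method is not reproduced here.

```python
import math

# Hypothetical (m, u) probabilities per field; real values would be estimated from data.
M_U = {
    "forename": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "dob": (0.98, 0.001),
    "postcode": (0.80, 0.0001),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum of Fellegi-Sunter agreement/disagreement weights across fields."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


a = {"forename": "ANNA", "surname": "SMITH", "dob": "1980-01-02", "postcode": "PO15 5RR"}
moved = dict(a, postcode="SO17 1BJ")  # same person, different postcode
other = {"forename": "JOHN", "surname": "JONES", "dob": "1975-05-05", "postcode": "CF10 1AA"}

w_all = match_weight(a, a)          # every field agrees
w_one_off = match_weight(a, moved)  # only postcode disagrees
w_none = match_weight(a, other)     # no field agrees
```

Pairs scoring above an upper threshold are accepted as links and pairs below a lower threshold are rejected; the slide's point is that choosing those thresholds has traditionally needed clerical review.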
  • 14. Duplicate Link Method in Practice • Have included a further 180,000 probabilistic links in the Admin Data Population Estimates for 2011
  • 15. 2021 Census and CCS Auto-matching • 70% of CCS-Census links were automated in 2011 • Testing ONS methods to increase the rate of auto-matching in 2021 • Using linked 2011 Census-CCS data for testing and comparison • Anticipating higher auto-match rate with increased on-line completion
  • 16. 2021 Census and CCS Auto-matching
  • 18. Methodology current research areas • Graph databases – for management of linked data • Sampling procedures to assess the quality of a data linkage project • Increasing the level of linkage automation, whilst maintaining stringent quality standards • Which are the best string comparators? - for partial agreement e.g. in a Fellegi Sunter model • Machine learning techniques for data linkage
  • 19. Issues for data linkage • Multiple sources of data • Large data sets • Updates required – changing data over time • Need to target clerical resource • Master linked data file – viewed as ‘truth’ • Different matching requirements for different users • Inconsistent clusters
  • 20. What is a graph database? • A graph database stores data points and also the relationships between the data. • A graph can also be represented as a matrix. • Graph format storage in certain use cases is more efficient than other forms of storage Adjacency matrix
  • 21. How can you use a graph to model linked data? • Blue nodes - Records in dataset 1 • Green nodes – records in dataset 2 1 A C D B 4 2 3 No match for records 3 & D Numbers on the edges represent match score, where 1 is an exact match.
  • 22. Graphical representation of linked data • Add in data on weaker links • Model strength of linkage score 1 A C D B 4 2 3 0.6 Keep information on all comparisons, not just the highest scoring pairs.
  • 23. Graphical representation of linked data • Add a third data source 1 A C D B 4 2 3 0.6 X Y The weak link between 3 & D now looks more plausible. Z1 Z2 Can detect duplicate records
  • 24. Graphical representation of linked data • AUTOMATIC resolution of some clusters of records relating to the same entity C D 2 3 X Y Even if match score for 3X > score for 2X, can (sometimes) use cluster metrics to generate clusters and break the 3-X link. AUTOMATICALLY
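The idea on slides 21 to 24 can be sketched without a graph database: store every scored comparison as a weighted edge, then treat connected components over sufficiently strong edges as clusters of records for the same entity. The edge list, scores and the 0.8 threshold below are hypothetical, and plain connected components stand in for the unspecified 'cluster metrics' on the slide.

```python
from collections import defaultdict


def clusters(edges, threshold):
    """Connected components using only edges with score >= threshold."""
    adjacency = defaultdict(set)
    nodes = set()
    for a, b, score in edges:
        nodes.update((a, b))
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical scores: records "1"-"4" in dataset 1, "A"-"D" in dataset 2.
TWO_SOURCES = [("1", "A", 1.0), ("2", "B", 1.0), ("4", "C", 1.0), ("3", "D", 0.6)]
# A record "X" from a third source that links strongly to both "3" and "D".
THIRD_SOURCE = [("3", "X", 0.9), ("D", "X", 0.9)]

before = clusters(TWO_SOURCES, threshold=0.8)
after = clusters(TWO_SOURCES + THIRD_SOURCE, threshold=0.8)
```

With two sources the weak 3-D link (0.6) falls below the threshold, so the two records stay apart; once the third source is added they fall into one cluster through X, which mirrors the slide's remark that the weak link then looks more plausible.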
  • 25. Graph databases (progress) • Still early stages • Learning about technology • Improving linkage quality • Next step is to use real data - lots of it! Larger data sets (whole population) More data sources
  • 26. Quality of linkage • After we have matched two datasets, we need to assess the quality of linkage • Errors in matching can impact subsequent analyses, especially if there is a bias in matching errors towards certain groups of people • Informing data users of estimated linkage error enables them to adjust their analysis methods to account for this
  • 27. How good is our linking? Linked file Dataset 1 Dataset 2 Correct links Quality depends on: 1. Missed matches 2. Incorrect links
  • 28. Sampling to estimate linkage quality (1) • Linkage is often undertaken in stages e.g. exact, deterministic, probabilistic. • Estimate level of incorrect links (false positives) overall, by stage (& other factors, e.g. age, sex, LA) • Estimate level of missed matches (false negatives) overall, by linkage probability score (& other factors?) • How many records do we need to assess clerically?
  • 29. Sampling for precision and recall (2) • Sampling approach for clerical checks to estimate false positives and false negatives? - Used gold-standard matched data from the 2011 Census to evaluate different approaches. - Can stratification improve quality and reduce costs? • Optimum sampling strategy depends on: 1. availability of prior information on quality 2. What level of detail is required in the estimates
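A minimal sketch of how clerical samples drawn by linkage stage might be combined into an overall false-positive estimate, alongside the usual precision and recall definitions. All counts are invented, and the stage-weighted estimator is a standard stratified proportion rather than a description of the ONS sampling design.

```python
def precision_recall(true_links, false_links, missed_matches):
    """Precision = correct links / all links made; recall = correct links / all true matches."""
    precision = true_links / (true_links + false_links)
    recall = true_links / (true_links + missed_matches)
    return precision, recall


def stratified_error_rate(strata):
    """Estimate the overall error rate from per-stage clerical samples.

    strata: iterable of (links_in_stage, sample_size, errors_found_in_sample).
    Each stage's sample error rate is weighted by the stage's share of all links.
    """
    total_links = sum(links for links, _, _ in strata)
    weighted = sum(links * (errors / sample) for links, sample, errors in strata)
    return weighted / total_links


# Hypothetical project: (links made, records clerically checked, incorrect links found).
stages = [
    (900_000, 500, 0),   # exact matching
    (80_000, 500, 5),    # deterministic match-keys
    (20_000, 500, 25),   # probabilistic matching
]
false_positive_rate = stratified_error_rate(stages)
precision, recall = precision_recall(true_links=95, false_links=5, missed_matches=10)
```

Because most errors sit in the smaller, later stages, sampling by stage concentrates clerical effort where it tells you most, which is the case for stratification raised on the slide.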
  • 30. Wider benefits • Data linkage playing an increasing role at ONS and more widely o Increased use of administrative data o Our methods being reused • Collaborating externally o e.g. with experts at UCL, developing guidance for linkage standards
  • 31. Contact: Pete Jones – Admin Data Census, Census Transformation Programme [email protected] Shelley Gammon – Methodology, Digital Services, Technology and Methodology [email protected]
  • 32. Towards an integrated census- administrative data approach to item imputation for the 2021 Census Fern Leather, Katie Sharp and Steven Rogers Collection and Editing Methodology and Statistical Computing
  • 33. Background • Item-level edit and imputation process for 2011 Census used CANCEIS software for nearest neighbour imputation 1 • General move towards use of more administrative data in 2021 Census 2 • Can we use linked admin data to provide auxiliary information to improve the accuracy of the imputation process? 1 Bankier, M., Lachance, M., & Poirer, P. (1999). A generic implementation of the new imputation methodology. http://www.ssc.ca/survey/documents/SSC2000_M_Bankier.pdf 2 http://www.ons.gov.uk/ons/guide-method/census/2021-census/about-the-census-transformation-programme/programme-objectives/index.html
  • 34. Research questions • What does an improvement on the conventional (no admin) method look like? Based on evaluating the predictive and distributional accuracy of imputation under differing conditions3 • Can using linked admin data deliver that improvement? • Where do real admin data sources fall on the continuum of imputation performance? 3 Chambers (2001), National Statistics Methodological Series Report 28: Evaluation criteria for statistical editing and imputation, Office for National Statistics
  • 35. Simulation study • Baseline test data – clean and consistent census data – approx 600,000 records from a contiguous geographic area • Perturbation strategy – age randomly removed for 5% of records by age and household size • A series of synthetic admin datasets with systematically increasing error in the age variable were created: 0 years, +/- 3 years, +/- 6 years, +/- 12 years, to compare with the conventional no admin approach
  • 36. Predictive accuracy Mean = 0.20 +/- 0.03 Mean = 0.15 +/- 0.05 St Dev = 3.18 St Dev = 11.77 Mode = 3 Mode = 6 Exact admin data No admin data (control) +/- 3 years error +/- 6 years error
  • 37. Distributional accuracy RMSE = 25.3 SSE = 68,387 RMSE = 22.32 SSE = 51,294 RMSE = 30.46 SSE = 97,425 RMSE = 40.52 SSE = 172,354 RMSE = 36.10 SSE = 140,709 +/-3 years admin error +/-6 years error +/-12 years error
  • 38. Other measures
      Measure | Exact | No Admin (control) | Admin3 | Admin6 | Admin12
      Size of total donor pool | 56,077 | 258,116 | 57,529 | 59,807 | 70,068
      Average age range of potential donors (years) | 1 | 14.3 | 1.1 | 1.4 | 2.5
      Percentage of records where true age is within range of imputation actions | 53.1 | 58.3 | 20.9 | 12.9 | 20.5
      Percentage of records failing edit checks if admin age directly substituted in | 0 | N/A | 3.8 | 9.1 | 19.4
  • 39. Evaluating actual admin datasets RMSE = 994 Difference >1% Difference >1% RMSE = 1257 Difference >1%
      Difference from Census (years) | PR exact % | CIS exact % | PR probabilistic % | CIS probabilistic %
      -10 | 0.03 | 0.03 | 0.16 | 0.26
      -9 | 0.01 | 0.01 | 0.16 | 0.25
      -8 | 0.01 | 0.01 | 0.17 | 0.26
      -7 | 0.01 | 0.01 | 0.19 | 0.27
      -6 | 0.02 | 0.02 | 0.24 | 0.32
      -5 | 0.02 | 0.02 | 0.27 | 0.33
      -4 | 0.02 | 0.02 | 0.31 | 0.33
      -3 | 0.03 | 0.03 | 0.38 | 0.37
      -2 | 0.06 | 0.06 | 0.46 | 0.42
      -1 | 0.30 | 0.38 | 0.76 | 0.79
      0 | 98.40 | 98.16 | 86.71 | 85.91
      1 | 0.33 | 0.33 | 0.70 | 0.81
      2 | 0.06 | 0.07 | 0.39 | 0.36
      3 | 0.03 | 0.03 | 0.33 | 0.30
      4 | 0.02 | 0.02 | 0.25 | 0.27
      5 | 0.02 | 0.02 | 0.23 | 0.25
      6 | 0.02 | 0.02 | 0.20 | 0.23
      7 | 0.01 | 0.01 | 0.16 | 0.20
      8 | 0.01 | 0.01 | 0.15 | 0.19
      9 | 0.01 | 0.01 | 0.13 | 0.18
      10 | 0.02 | 0.03 | 0.14 | 0.19
  • 40. Imputation with observed admin data error NHS Patient Register No admin data Exact admin data +/-3 years error etc.
      Statistic: estimate (95% confidence limits) | Exact admin data | PR exact | PR probabilistic | No admin data
      Mean | -0.2 (-0.22, -0.17) | 0.20 (0.17, 0.23) | 0.23 (0.19, 0.27) | -0.15 (-0.19, -0.1)
      SD | 3.18 (-, -) | 3.32 (-, -) | 5.46 (-, -) | 11.77 (-, -)
      Variance | 10.1 (-, -) | 11.02 (-, -) | 29.81 (-, -) | 138.47 (-, -)
  • 41. Distributional accuracy
      Measure | PR exact | PR probabilistic | CIS exact | CIS probabilistic
      SSE | 47,500 | 58,421 | 46,146 | 45,478
      RMSE | 21.69 | 24.05 | 21.38 | 21.22
  • 42. Simulating different coverage levels • So far, we have assumed 100% coverage of all admin datasets – this is a best case scenario which is not realistic for live operations • Next step - simulate 80%, 60%, 40%, 20% coverage of the PR and CIS synthetic datasets
  • 43. Predictive accuracy – PR exact dataset
  • 45. Predictive accuracy – what can we expect in a real situation?
  • 46. Conclusion and next steps • Under certain conditions, auxiliary information from linked admin data can improve imputation accuracy for age, even when coverage is low • The method is able to protect observed Census data from highly erroneous admin data • Further research: Explore use for other variables – joint distributions and variables with high imputation rates in 2011
  • 47. Contact: Fern Leather, Katie Sharp and Steven Rogers Collection and Editing Methodology and Statistical Computing [email protected] [email protected] [email protected]
  • 48. Using administrative data to improve 2021 Census population estimates Owen Abbott Office for National Statistics
  • 49. This presentation will cover..... • Context: - Use of admin data in 2011 Census • How do we produce population estimates • Potential ways to improve estimates: - Improving response - Coverage measurement - Quality assurance
  • 50. Background - 2011 Census • 2011 Census used administrative data to: • Support collection • Build Address Register (e.g. NLPG) • HtC index (e.g. House prices) • Enhance population estimates through Quality Assurance • Mainly aggregates • Some experimental record linkage (Blackwell et al, 2013)
  • 51. [Figure: All persons - Exeter. Number of usual residents by quinary age group, with 95% confidence interval. Series: 2011 Census Counts¹, 2011 Census Estimates¹, Rolled forward estimates¹, Patient Register 2011², School Census 2011³,⁴, Social Security and Revenue Information 2011⁵, Comparator Lower Bound, Comparator Upper Bound.] Source: 2011 Census QA pack available at http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-user-guide/quality-and-methods/quality/quality-assurance/local-authority-quality-assurance/index.html
  • 52. Census population estimates • Before looking at ways to improve the estimates, need to examine how they are produced • Important to highlight things which influence the quality of those estimates
  • 53. Census population estimates • The census outputs are not just the count of responses, as not every household or person is counted • Methodology for estimating those missed: • Carry out large post-enumeration survey (CCS) • Match to Census • Dual-system Estimation • Estimate total population • Quality Assurance • Impute households and persons into output database
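The dual-system estimation step on slide 53 can be illustrated in its simplest textbook form (the Lincoln-Petersen estimator), assuming the census and the CCS are independent and matching is error-free. The counts below are hypothetical, and the production methodology adds stratification, ratio estimation and adjustments that this sketch omits.

```python
def dual_system_estimate(census_count, ccs_count, matched_count):
    """Lincoln-Petersen estimate of the true population in a sampled area.

    census_count:  people counted by the census in the area
    ccs_count:     people counted by the coverage survey in the same area
    matched_count: people found in both after matching
    """
    if matched_count <= 0:
        raise ValueError("at least one matched record is needed")
    return census_count * ccs_count / matched_count


# Hypothetical sample area: the census found 900 people, the CCS found 850,
# and 800 of them were matched between the two.
estimate = dual_system_estimate(census_count=900, ccs_count=850, matched_count=800)
census_coverage = 900 / estimate  # estimated share of the population the census counted
```

The estimate depends directly on the matched count, which is why matching accuracy appears among the key influences on quality: missed matches inflate the estimate and false matches deflate it.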
  • 54. Framework for producing population estimates using a census Dual System Estimation Matching Ratio estimation Census Quality Assurance Census Coverage Survey Population estimates
  • 55. Key influences on quality of estimates Dual System Estimation Matching Ratio estimation Census Quality Assurance Census Coverage Survey Population estimates Overall response rates Sample Size Response rates Matching accuracy Census/CCS Independence Access to sources Quality of Sources Response by age-sex Response variation Response by LA Over-count
  • 56. Potential ways to improve estimates • Three main areas where administrative data could be used: 1. Supporting the collection to improve census response 2. Enhancing coverage assessment 3. Quality Assurance of population estimates
  • 57. Enhance census coverage Dual System Estimation Matching Ratio estimation Census Quality Assurance Census Coverage Survey Population estimates Admin Data
  • 58. Enhance census coverage Address register 1 The High Street 2 The High Street 3 The High Street 4 The High Street 5 The High Street 6 The High Street 7 The High Street 8 The High Street 9 The High Street 10 The High Street 11 The High Street 12 The High Street 13 The High Street 14 The High Street 15 The High Street Enhanced Census Admin data No response Admin data, 1 person missed Response, 1 person missed Census No response No response No response Response, 1 person missed Administrative data (filtered for activity) No data No data No data No data 1 person missed No data No data
  • 59. Enhance census coverage • This approach was used by NISRA in 2011 • Full evaluation by Ross (2015) • Findings: • Added 68k persons (3.9%) in 31k (4.5%) households • Reduced confidence interval widths by around 20% • Gains in variance with small risk of additional bias
  • 60. Age-Sex Distribution of Census and Administrative data records [Figure: percentage of records, 0.0% to 6.0%; series: Census records, CUE records]
  • 61. Estimates and Confidence Intervals
      Estimation Area | With CUE records: Estimate | Variance | Relative CI width¹ | Without CUE records: Estimate | Variance | Relative CI width¹
      Eastern Northern Ireland | 791,900 | 9,146,600 | 0.75% | 782,000 | 14,372,000 | 0.95%
      Western Northern Ireland | 535,800 | 10,800,700 | 1.20% | 528,600 | 13,991,800 | 1.39%
      Belfast | 462,000 | 11,562,800 | 1.44% | 452,500 | 21,540,600 | 2.01%
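The footnote defining the relative CI width is not reproduced in this transcript. The figures on slide 61 are, however, consistent with a 95% half-width expressed as a share of the estimate, that is 1.96 × √variance ÷ estimate. The sketch below recomputes the column under that assumed definition.

```python
import math


def relative_ci_width(estimate, variance, z=1.96):
    """95% confidence interval half-width as a percentage of the estimate (assumed definition)."""
    return 100 * z * math.sqrt(variance) / estimate


# (estimate, variance) pairs taken from slide 61.
with_cue = {
    "Eastern Northern Ireland": (791_900, 9_146_600),
    "Western Northern Ireland": (535_800, 10_800_700),
    "Belfast": (462_000, 11_562_800),
}
without_cue = {
    "Eastern Northern Ireland": (782_000, 14_372_000),
    "Western Northern Ireland": (528_600, 13_991_800),
    "Belfast": (452_500, 21_540_600),
}

widths_with = {area: relative_ci_width(*vals) for area, vals in with_cue.items()}
widths_without = {area: relative_ci_width(*vals) for area, vals in without_cue.items()}
```

Under this definition the recomputed widths agree with the published percentages to two decimal places for all three estimation areas.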
  • 62. Enhance census coverage • This is being pursued for 2021: • Fits within existing framework • Improves quality • Does not require large scale person level linkage • Can evaluate through existing Census-CCS linkage • Can link the administrative data to the address frame in advance BUT • Requires high quality activity data
  • 63. What about the census collection? • Using admin data in this way is an attempt to improve quality after the census collection • What about using it during the collection? • Potential to prioritise follow-up activities based on data on presence of age-sex groups…could this reduce differential non-response? • Needs further work…
  • 64. Potential ways to improve estimates • Three main areas where administrative data could be used: 1. Supporting the collection to improve census response 2. Enhancing coverage assessment 3. Quality Assurance of population estimates
  • 65. Expand DSE to MSE Dual System Estimation Matching Ratio estimation Census Quality Assurance Census Coverage Survey Population estimates Administrative Data Log-linear modelling Over-coverage Matching error ~2% Multiple System Estimation
  • 66. Other areas to explore • Also considering use of administrative data for: - Coverage adjustment - Bias adjustments - National adjustment
  • 67. Potential ways to improve estimates • Three main areas where administrative data could be used: 1. Supporting the collection to improve census response 2. Enhancing coverage assessment 3. Quality Assurance of population estimates
  • 68. Quality Assurance • Can use administrative data for expanding scope of QA in 2021 • Quality assuring population size as was done in 2011 • Additional focus on household estimates • Lower level QA e.g. output areas, SYOA • Expanding topic QA
  • 69. Summary • Number of ways in which Administrative data could improve the 2021 Census population estimates • These innovations fit into the existing framework – evolution not revolution • Number of challenges • Access • Quality • Methodological • Communicating methods
  • 70. References • Abbott, O. (2009) 2011 UK Census Coverage Assessment and Adjustment Methodology. Population Trends, 137, pp. 25-32. • Abbott, O. and Compton, G. (2014) Counting and estimating hard-to-survey populations in the 2011 Census. In: R. Tourangeau, B. Edwards, T. Johnson, K. Wolter and N. Bates, eds. 2014. Hard-to-survey populations. Cambridge: Cambridge University Press. Ch.4. • Blackwell, L., Charlesworth, A., Rogers, N. and Thorne, R. (2013) Matching of Census and administrative data for Census data quality assurance in the 2011 Census of England and Wales. Paper presented at NTTS2013 conference, Brussels. Available at http://www.cros-portal.eu/sites/default/files/NTTS2013fullPaper_168.pdf • ONS (2012) 2011 Census Address register. Available at http://www.ons.gov.uk/ons/guide-method/census/2011/how-our-census-works/how-did-we-do-in-2011-/evaluation---address-register.pdf • ONS (2015) 2011 Census Quality Assurance Methodology: Evaluation Report. Available at http://www.ons.gov.uk/ons/guide-method/census/2011/how-our-census-works/how-did-we-do-in-2011-/2011-census-quality-assurance-methodology.pdf • Ross, H. (2015) Using administrative data to enhance the quality of census population estimates. MSc. University of Southampton.
  • 71. Contact: Owen Abbott – Methodology, Digital Services, Technology and Methodology [email protected]