Statistical Databases
By Samir Rana
A database used for
statistical analysis.
● Researchers have access to aggregate statistics, but not to the individual records
behind them.
● Access control and a restricted set of queries keep prying eyes away from
sensitive data.
● Basic operations are limited to aggregates: count, sum, mean, standard deviation, etc.
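The idea of exposing statistics but not records can be illustrated with a minimal sketch. The class name `StatisticalView`, its method names, and the record layout are assumptions made for illustration only; they do not come from any particular statistical DBMS.

```python
import statistics
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Predicate = Callable[[Record], bool]


class StatisticalView:
    """Hypothetical wrapper: answers aggregate queries, never returns rows."""

    def __init__(self, records: List[Record]) -> None:
        # The records are held privately; no method hands them back.
        self._records = list(records)

    def _values(self, field: str, where: Predicate) -> List[float]:
        return [r[field] for r in self._records if where(r)]

    def count(self, where: Predicate = lambda r: True) -> int:
        return sum(1 for r in self._records if where(r))

    def total(self, field: str, where: Predicate = lambda r: True) -> float:
        return sum(self._values(field, where))

    def mean(self, field: str, where: Predicate = lambda r: True) -> float:
        return statistics.fmean(self._values(field, where))

    def std_dev(self, field: str, where: Predicate = lambda r: True) -> float:
        # Population standard deviation of the selected values.
        return statistics.pstdev(self._values(field, where))
```

A caller would write something like `view.mean("salary", lambda r: r["dept"] == "a")`. Restricting the interface in this way does not by itself prevent inference, since a narrow enough predicate can still single out one record.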
Requirements of Statistical Databases
● Data Model
● Query Language
● Integrity Constraints
● Recovery
● Physical DB organization
● Data analysis requirements
Data Model
1. Due to the multidimensionality of SD,
the relational model is not well suited.
2. New data structures and operations
are needed, such as the data cube
operator and aggregate data
structures.
Several data models have been proposed:
SUBJECT ,
Semantic Association Models,
GRASS,
Conceptual Statistical Model,
Statistical Object Representation Model
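As a rough illustration of the data cube operator mentioned above, the sketch below sums a measure over every combination of dimensions, using the placeholder `"ALL"` for a dimension that has been aggregated away. The function name `data_cube` and the dimension and measure names in the usage note are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Any, Dict, List, Sequence, Tuple

ALL = "ALL"  # placeholder for a dimension that has been rolled up


def data_cube(
    rows: List[Dict[str, Any]], dims: Sequence[str], measure: str
) -> Dict[Tuple[Any, ...], float]:
    """Sum `measure` for every subset of `dims` (2**len(dims) groupings)."""
    cube: Dict[Tuple[Any, ...], float] = defaultdict(float)
    for row in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                # Keep the row's value for retained dimensions, ALL otherwise.
                key = tuple(row[d] if d in kept else ALL for d in dims)
                cube[key] += row[measure]
    return dict(cube)
```

With dimensions `("region", "year")`, the key `("East", "ALL")` holds the total for East over all years, and `("ALL", "ALL")` holds the grand total. This brute-force version touches every grouping for every row; it is meant to show the result shape, not an efficient evaluation strategy.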
Query Language
● Statistical databases need powerful,
easy-to-use query languages to define
and manipulate statistical data.
● Evaluation Criteria of Statistical
query languages :
○ data and metadata definition,
○ data manipulation,
○ interface to statistical packages,
○ the expressive power of the
language.
Query Languages
● SDBMS built on top of a CDBMS:
○ GRAFSTAT on DB2 (SQL/DS),
○ STRAND on INGRES
● Generalized interface systems that link
together available CDBMS, statistical
packages and graphics software:
○ SIBYL, GPI and PEPIN-SICLA
● Separately developed SDBMS:
○ RAPID, CAS SDB, ABE, SIR/SQL,
GENISYS, CANTOR.
○ SIR/DBMS, TPL, TPLDCS, BROWSE.
● SDBMS with graphical user interfaces:
○ SUBJECT, GUIDE, ABE, STBE,
SEEDS online code book.
● Formal extensions of the relational model:
○ SSDL
● Natural-language-based user interface:
○ LIDS 86.
● Query languages that calculate
aggregates from temporal data:
○ QUEL, HQUEL, TBE.
Tree Based Statistics Access Method (TBSAM)
● Calculates a set of aggregates over all data items that satisfy a Boolean qualification.
● Based on the B+ tree, and exploits all the benefits of the B+ tree's dynamic nature.
● The aim is the efficient retrieval of a tuple, given the value of its index attribute.
● The index is dynamic, and thus supports insertion, deletion and modification of tuples in a
relation.
● Various types of statistical queries can be facilitated :
○ descriptive statistics,
○ order statistics,
○ statistical sampling types of queries.
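The general idea of keeping aggregates inside an index can be sketched as follows. This is not TBSAM itself: it is a simplified, static binary tree (no insertion or deletion, unlike a B+ tree) in which every node stores the count, sum and sum of squares of the values beneath it, so a range query can often be answered from a few inner nodes. All names here are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    lo: float        # smallest key under this node
    hi: float        # largest key under this node
    count: int       # number of values under this node
    total: float     # sum of values
    total_sq: float  # sum of squared values
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(pairs: List[Tuple[float, float]]) -> Optional[Node]:
    """Build a balanced tree over (key, value) pairs already sorted by key."""
    if not pairs:
        return None
    if len(pairs) == 1:
        key, value = pairs[0]
        return Node(key, key, 1, value, value * value)
    mid = len(pairs) // 2
    left, right = build(pairs[:mid]), build(pairs[mid:])
    return Node(left.lo, right.hi, left.count + right.count,
                left.total + right.total, left.total_sq + right.total_sq,
                left, right)


def range_stats(node: Optional[Node], lo: float, hi: float) -> Tuple[int, float, float]:
    """Return (count, sum, sum of squares) for keys in the closed range [lo, hi]."""
    if node is None or node.hi < lo or node.lo > hi:
        return (0, 0.0, 0.0)
    if lo <= node.lo and node.hi <= hi:
        # Whole subtree qualifies: use the stored aggregates directly.
        return (node.count, node.total, node.total_sq)
    c1, s1, q1 = range_stats(node.left, lo, hi)
    c2, s2, q2 = range_stats(node.right, lo, hi)
    return (c1 + c2, s1 + s2, q1 + q2)


def mean_and_std(count: int, total: float, total_sq: float) -> Tuple[float, float]:
    """Derive mean and population standard deviation from the three aggregates."""
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

Storing the sum of squares alongside the count and sum is what lets descriptive statistics such as the standard deviation be derived without revisiting the underlying tuples.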
Processing and Optimization
● A large portion of statistical data is either
spatial or temporal.
● Plain relational tables cannot store or
retrieve such data efficiently.
● Optimization algorithms reorder the
operations to be performed → build an
optimal or near-optimal query processing
tree → choose the best available strategy
for querying the data, depending on the
physical data storage structures.
Operations on Temporal data
● Temporal theta join : the conjunction of two
sets of predicates, the time join predicate
and the non-time join predicate.
● TE-join : two tuples (or rows) in two join
relations (tables) are joined if their time
intervals intersect.
● T-join : causes the concatenation of tuples
from the operand relations only if their time
intervals intersect.
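The joins above share one building block: testing whether two time intervals intersect. The sketch below uses a naive nested loop and assumes each row is a dictionary with hypothetical `start` and `end` keys denoting a closed interval; the optional `non_time_pred` argument plays the role of the non-time join predicate in the temporal theta join.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]  # assumed to carry "start" and "end" keys (closed interval)


def intervals_intersect(a: Row, b: Row) -> bool:
    """Two closed intervals intersect if the later start is not after the earlier end."""
    return max(a["start"], b["start"]) <= min(a["end"], b["end"])


def temporal_join(
    left: List[Row],
    right: List[Row],
    non_time_pred: Callable[[Row, Row], bool] = lambda a, b: True,
) -> List[Row]:
    """Concatenate row pairs whose intervals intersect and that satisfy non_time_pred."""
    result: List[Row] = []
    for a in left:
        for b in right:
            if intervals_intersect(a, b) and non_time_pred(a, b):
                joined: Row = {}
                # Prefix attribute names so the two sides cannot collide.
                joined.update({f"l_{k}": v for k, v in a.items() if k not in ("start", "end")})
                joined.update({f"r_{k}": v for k, v in b.items() if k not in ("start", "end")})
                # The joined tuple is valid only while both inputs are valid.
                joined["start"] = max(a["start"], b["start"])
                joined["end"] = min(a["end"], b["end"])
                result.append(joined)
    return result
```

Called with the default predicate, this behaves like a join on interval intersection alone; passing, for example, `lambda a, b: a["dept"] == b["dept"]` adds an equality condition on a non-time attribute. A real system would replace the nested loop with a sort-based or index-based strategy.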
Security
● The contents of specific records can
easily be inferred from statistical data.
● The conflict between providing statistics and
securing individual records gives rise to
inference control.
● Types of inference control:
○ Query set Restriction
○ Data perturbation
○ Output Perturbation
○ Conceptual Approach
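Two of the approaches listed above can be sketched in a few lines. The first function is a query-set-size restriction: a SUM is answered only when the number of selected records lies between k and n - k, since a very small query set exposes individuals directly and a very large one exposes them through the complement. The second is a simple form of output perturbation that adds bounded random noise to the true answer. The function names, the parameter `k`, and the uniform noise model are illustrative assumptions.

```python
import random
from typing import List, Optional


def restricted_sum(values: List[float], selected: List[bool], k: int) -> Optional[float]:
    """Answer SUM only if the query set size is within [k, n - k]; otherwise refuse."""
    n = len(values)
    size = sum(selected)
    if size < k or size > n - k:
        return None  # refusing to answer is the restriction
    return sum(v for v, chosen in zip(values, selected) if chosen)


def perturbed(answer: float, scale: float, rng: random.Random) -> float:
    """Output perturbation: add zero-mean uniform noise in [-scale, scale]."""
    return answer + rng.uniform(-scale, scale)
```

The two mechanisms trade off differently against the evaluation criteria: restriction keeps answers exact but refuses some queries, while perturbation answers everything at the cost of precision.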
Evaluation of Effectiveness of Inference control:
● Security
● Robustness
● Bias
● Precision
● Consistency
● Cost
Other Applications
● Data Visualization:
○ A point in multidimensional space.
○ Can be used as a basis to build an interactive data visualization system.
○ User can browse in the multidimensional space.
● Statistical expert systems:
○ A program that can act as an expert statistical consultant.
○ Gives expert advice on how to design a study, what data to collect to answer the research
questions, and how to analyze the data collected.
Conclusion
● Applications that collect vast amounts of data and require interactive, real-time
analysis of it are on the rise.
● The standard approach to statistical analysis, which loads part of the data from a file
or database into a statistical package and then analyzes it there, will
not work for efficiency reasons.
● The overall goal of research in statistical database management has been to
make this analysis an integral part of the data management system itself.
● The focus of the research community has been on developing techniques to
make this happen.
Thank you


