Managing XML and Semistructured Data
Lecture 19: Compressing XML Data
Prof. Dan Suciu
Spring 2001

In this lecture
  • XML compression
  • Motivation
  • XMill approach and results
Resources
  • XMill: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2000

Compression: The Problem
  • XML is for exchange (space or time), but XML is verbose
  • Users prefer application-specific formats: Web server logs, EMBL, G2
  • Is XML doomed to fail?

An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped: 1.6 MB):
    202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized, inflates to 24.2 MB (gzipped: 2.1 MB):
    <apache:entry>
      <apache:host>202.239.238.16</apache:host>
      <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
      <apache:contentType>text/html</apache:contentType>
      <apache:statusCode>200</apache:statusCode>
      <apache:date>1997/10/01-00:00:02</apache:date>
      <apache:byteCount>4478</apache:byteCount>
      <apache:referer>http://www.net.jp/</apache:referer>
      <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
    </apache:entry>

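To make the inflation concrete, here is a minimal Python sketch of the "XML-izing" step for one log line. The position-to-element mapping in `FIELDS` is an assumption read off the sample line and the XML entry on the slide; the positions that hold "-" are not named on the slide, so the sketch simply skips them.

```python
from xml.sax.saxutils import escape

# Assumed mapping from field position in the pipe-delimited line to the
# element name used on the slide. Positions 5, 7 and 8 hold "-" in the
# sample and have no element in the slide's XML, so they are left out.
FIELDS = {
    0: "host",
    1: "requestLine",
    2: "contentType",
    3: "statusCode",
    4: "date",
    6: "byteCount",
    9: "referer",
    10: "userAgent",
}

def log_line_to_xml(line: str) -> str:
    """Turn one pipe-delimited log line into an <apache:entry> element."""
    parts = line.rstrip("\n").split("|")
    out = ["<apache:entry>"]
    for pos, name in FIELDS.items():
        value = parts[pos]
        if value == "-":  # "-" marks a missing value in the log
            continue
        out.append(f"  <apache:{name}>{escape(value)}</apache:{name}>")
    out.append("</apache:entry>")
    return "\n".join(out)

line = ("202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02"
        "|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)")
xml_entry = log_line_to_xml(line)
print(xml_entry)
print(len(line), "bytes as ASCII,", len(xml_entry), "bytes as XML")
```

The repeated tag names account for most of the extra bytes, which is why the XML version grows so much before compression.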
XMill
  • Specialized compressor for XML data
  • Makes XML look “small”
  • Download:
      Now: www.research.att.com/sw/tools/xmill
      Soon: www.cs.washington.edu/homes/suciu/XMILL

How XMill Works: Three Ideas
Compress the structure separately from the data:
    Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
    Data:      202.239.238.16  GET / HTTP/1.0  text/html  200 …
    gzip(Structure) + gzip(Data) = 1.75 MB

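The first idea can be sketched in a few lines of Python. This is a simplified illustration, not XMill's actual stream format: the token scheme (`T<n>` for a start tag, `/` for an end tag, `#` for a text value) is an assumption made for this example, attributes are ignored, and `zlib` stands in for gzip.

```python
import xml.sax
import zlib

class Splitter(xml.sax.ContentHandler):
    """Split a document into a structure token stream and a data stream."""

    def __init__(self):
        super().__init__()
        self.tag_ids = {}    # tag name -> small integer id
        self.structure = []  # tokens: "T<n>" start tag, "/" end tag, "#" text
        self.data = []       # text values in document order
        self._buf = []

    def _flush(self):
        # Emit any buffered text as one data value plus a "#" placeholder.
        text = "".join(self._buf).strip()
        self._buf = []
        if text:
            self.structure.append("#")
            self.data.append(text)

    def startElement(self, name, attrs):
        self._flush()
        tag_id = self.tag_ids.setdefault(name, len(self.tag_ids))
        self.structure.append(f"T{tag_id}")

    def endElement(self, name):
        self._flush()
        self.structure.append("/")

    def characters(self, content):
        self._buf.append(content)

def split_and_compress(xml_bytes):
    handler = Splitter()
    xml.sax.parseString(xml_bytes, handler)
    structure = " ".join(handler.structure).encode()
    data = "\n".join(handler.data).encode()
    return handler, zlib.compress(structure), zlib.compress(data)

SAMPLE = (b"<apache:entry>"
          b"<apache:host>202.239.238.16</apache:host>"
          b"<apache:requestLine>GET / HTTP/1.0</apache:requestLine>"
          b"</apache:entry>")

handler, packed_structure, packed_data = split_and_compress(SAMPLE)
print(handler.structure)
print(handler.data)
```

On a real log with many entries the structure stream is highly repetitive, so it compresses to almost nothing, and the data stream no longer has tags interleaved with the values.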
How XMill Works: Three Ideas
Group the data values according to their types:
    Structure: <apache:entry> . . . </apache:entry>
    Data1:     202.23.23.16  224.42.24.55 …
    Data2:     GET / HTTP/1.0  GET / HTTP/1.1 …
    gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33 MB

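A rough Python sketch of the grouping idea follows. It assumes that the "type" of a value is simply the name of its enclosing element, which is a simplification chosen for this example; the `<log>` root element and the function names are hypothetical, and `zlib` again stands in for gzip.

```python
import xml.sax
import zlib
from collections import defaultdict

class ContainerSplitter(xml.sax.ContentHandler):
    """Route each text value to a container named after its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.path = []                       # stack of open element names
        self.containers = defaultdict(list)  # tag name -> list of values
        self._buf = []

    def _flush(self):
        text = "".join(self._buf).strip()
        self._buf = []
        if text:
            self.containers[self.path[-1]].append(text)

    def startElement(self, name, attrs):
        self._flush()
        self.path.append(name)

    def endElement(self, name):
        self._flush()
        self.path.pop()

    def characters(self, content):
        self._buf.append(content)

def compress_containers(xml_bytes):
    handler = ContainerSplitter()
    xml.sax.parseString(xml_bytes, handler)
    # Compress each container on its own so similar values sit together.
    packed = {name: zlib.compress("\n".join(values).encode())
              for name, values in handler.containers.items()}
    return handler.containers, packed

# <log> is a hypothetical root element added so the sample is well-formed.
SAMPLE = (b"<log>"
          b"<apache:entry><apache:host>202.23.23.16</apache:host>"
          b"<apache:requestLine>GET / HTTP/1.0</apache:requestLine></apache:entry>"
          b"<apache:entry><apache:host>224.42.24.55</apache:host>"
          b"<apache:requestLine>GET / HTTP/1.1</apache:requestLine></apache:entry>"
          b"</log>")

containers, packed = compress_containers(SAMPLE)
for name, values in containers.items():
    print(name, values)
```

Values in one container tend to share prefixes and alphabets (all hosts, all request lines), which gives a general-purpose compressor more redundancy to exploit than a single mixed stream.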
How XMill Works: Three Ideas
Apply semantic (specialized) compressors. Examples:
  • 8-, 16-, and 32-bit integer encoding (signed/unsigned)
  • Differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
  • Compression of lists and records (e.g. 104.32.23.1 -> 4 bytes)
User input is needed to select the semantic compressor.
    gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82 MB

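Two of the semantic compressors listed above can be sketched in Python as follows. The function names are hypothetical and the byte layouts are assumptions for illustration, not XMill's own encodings: an IPv4 address is packed into 4 bytes, and a list of integers is stored as its first value followed by successive differences.

```python
import ipaddress
import struct

def pack_ipv4(addresses):
    """Encode each dotted-quad IPv4 address as exactly 4 bytes."""
    return b"".join(ipaddress.IPv4Address(a).packed for a in addresses)

def delta_encode(values):
    """Keep the first value, then store each value minus its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack_deltas(deltas):
    """Pack a non-empty delta list: 32-bit first value, 8-bit differences.

    Assumes every difference fits in a signed byte (-128..127);
    struct.pack raises struct.error otherwise.
    """
    return struct.pack(f"<i{len(deltas) - 1}b", *deltas)

ip_bytes = pack_ipv4(["104.32.23.1"])
years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)
packed_years = pack_deltas(deltas)

print(len(ip_bytes), "bytes for the IP address")
print(deltas)
print(len(packed_years), "bytes for", len(years), "years")
```

The output of such an encoder would then be passed to gzip, as in the formula above. Choosing which encoder applies to which container is the step where the slide says user input is needed.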
XML Compression
Compression Tradeoff
Summary of XML Data Management
  • XML = old data type (trees) with new interpretation (data)
  • We discussed traditional management techniques for XML:
      Data model
      Query language
      Optimizations
      ...
  • Many traditional problems are still unsolved (storage, processing, optimization, ...)

Summary of XML Data Management
More interesting question: what are the novel applications enabled by XML? Some ideas:
  • Approximate queries over unfamiliar data instances
      “Search the database for a pattern similar to this one”
      Rank results based on their similarity to the pattern
      What is an appropriate query language for that?
  • Linking independent databases
      We have XLink; how do we use it?
