Querying Rich Text with Lucene XQuery

Michael Sokolov
Senior Architect
Safari Books Online

The plan for this talk

- Overview of Lux
- Why we need (or want) a rich(er) query language
- Implementation stories
  - Indexing tagged text
  - Storing documents in Lucene
  - Lazy searching
- Demo

What is Lux?

- XQuery in Solr
  - Query optimizer
  - Efficient XML document format
  - XQuery function library
- Available:
  - as a Java library (Lucene only)
  - as Solr plugins
  - as a standalone App Server

Search: to find something

Query: to get an answer

XML is not cool

- Maybe it was once, 10 years ago?
- Legacy stuff: DTDs, namespaces, etc.
- Arcane Java programming interfaces
- Don't we use JSON now?
- So why do we care about it?

But it still matters

- There's a huge amount of it out there
- HTML is XML, or can be
- Lux is about making it easy (and free) to deal with XML

Why did we make it?

- We make content-rich sites:
  - our own site: safaribooksonline.com
  - our clients' sites: oed.com, degruyter.com, oxfordreference.com, …
- Publishers provide us with content
  - we debug content problems
  - we add new features nimbly
- Piles of random data (XML, mostly)

How can XQuery help?

- Complex queries over semi-structured data, typically documents
  - You don't need it for edismax-style "quick" search
  - or highly-structured data
- XQuery comes with a rich function library:
  - rich string, numeric and date functions
  - extensions for HTTP, filesystem, zip

How does Lux work?

[Architecture diagram. Components shown: DispatchFilter, UpdateProcessor, XML Indexer, XML text fields, Tagged TokenStream, XPath fields, Tinybin storage, External Field Codec, QueryComponent, QParserPlugin, Evaluator, Saxon XQuery XSLT Processor, XQuery Function Library, Lazy Searcher, ResponseWriter, Compiler, Optimizer, Tagged Highlighter]

XML is text with context

- "hamlet"
- "hamlet" in //title
- "hamlet" in //scene/title, //speaker, etc…
- XQuery, but we need an index
- DIH XPathEntityProcessor
- But are XPath indexes enough?

How do we index context?

- In which speeches does Hamlet talk about poison?
  - +speaker:Hamlet +line:poison
  - Works great if we indexed speaker and line for each speech
- What if we only indexed at the scene level?
- What if we just indexed speech text as a field?
- XPath indexes are precise and fine-grained
  - Great when you know exactly what you need

Contextual Indexes

<play>
  <title>Hamlet</title>
  <act act="1">
    <scene act="1" scene="1">
      <title>SCENE I. Elsinore ... </title>

Index        Values
Tags         title, act, @act
Tag Paths    /play, /play/title, /play/act, /play/act/@act
Text         hamlet, scene, elsinore
Tagged Text  play:hamlet, title:hamlet, @act:1
XPath        user-defined XPath 2.0 expression; e.g. count(//line), replace(//title, 'SCENE|ACT S+', '')

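
The index types listed above can be illustrated with a small Python sketch. This is not Lux code: the function name `contextual_index` is hypothetical, the sample is the slide's fragment with closing tags added so that it parses, and the tokenizer (lowercase, split on non-word characters) is an assumption standing in for a real analyzer. The user-defined XPath index is left out.

```python
import re
import xml.etree.ElementTree as ET

# The slide's fragment, closed so it is well-formed (the "..." is dropped).
SAMPLE = """<play>
  <title>Hamlet</title>
  <act act="1">
    <scene act="1" scene="1">
      <title>SCENE I. Elsinore</title>
    </scene>
  </act>
</play>"""


def contextual_index(xml_text):
    """Collect tag names, tag paths, text tokens and tagged text tokens."""
    tags, paths, text, tagged = set(), set(), set(), set()

    def visit(elem, parent_path, ancestors):
        path = parent_path + "/" + elem.tag
        ancestors = ancestors + [elem.tag]
        tags.add(elem.tag)
        paths.add(path)
        for name, value in elem.attrib.items():
            tags.add("@" + name)
            paths.add(path + "/@" + name)
            tagged.add("@" + name + ":" + value.lower())
        # Only leading text is tokenized; tail text is ignored in this sketch.
        for token in re.findall(r"\w+", (elem.text or "").lower()):
            text.add(token)
            for tag in ancestors:
                tagged.add(tag + ":" + token)
        for child in elem:
            visit(child, path, ancestors)

    visit(ET.fromstring(xml_text), "", [])
    return {"tags": tags, "paths": paths, "text": text, "tagged": tagged}


index = contextual_index(SAMPLE)
```

Each value shown on the slide (for example `/play/act/@act` or `title:hamlet`) appears in the corresponding set, alongside the entries for the deeper `scene` element.
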
General purpose indexes

- Tagged Text, Path index
- Imprecise, generic indexes, but more context than just full text
- XQuery post-processing to patch over the gaps
- Query optimizer applies indexes
- For when you don't want to sweat the details: ad hoc queries, content analysis and debugging

TaggedTokenStream

<scene><speech>
<speaker>Hamlet</speaker>
<line>To be or not to be, … </line>

Tag stack for each token:
  Hamlet:  … scene speech speaker
  To:      … scene speech line
  be:      … scene speech line

Tokens emitted:
  ttext:scene:hamlet     pos=1
  ttext:speech:hamlet    pos=1
  ttext:speaker:hamlet   pos=1
  ttext:scene:to         pos=2
  ttext:speech:to        pos=2
  …

- Wraps an existing Analyzer (for the text)
- Responds to XML events (start element, etc.)
- Maintains a tag name stack
- Emits each token prefixed by enclosing tags

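
The four behaviours listed for TaggedTokenStream can be sketched with Python's standard SAX parser. This is an illustration under assumptions, not the Lux class: `TaggedTokenHandler` and `tagged_tokens` are hypothetical names, a regular expression stands in for the wrapped Analyzer, and the field-name prefix shown on the slide is omitted so that terms look like `scene:hamlet`.

```python
import re
import xml.sax


class TaggedTokenHandler(xml.sax.ContentHandler):
    """Emit (term, position) pairs, one term per enclosing tag of each word."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of the currently open elements
        self.buffer = []    # text chunks seen since the last tag event
        self.position = 0   # position of the most recent word
        self.tokens = []    # emitted (term, position) pairs

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        # SAX may deliver text in several chunks, so buffer until a tag event.
        self.buffer.append(content)

    def _flush(self):
        # Stand-in for the wrapped analyzer: lowercase, split on non-word chars.
        words = re.findall(r"\w+", "".join(self.buffer).lower())
        self.buffer = []
        for word in words:
            self.position += 1
            for tag in self.stack:
                # Every tagged variant of a word shares the same position.
                self.tokens.append((tag + ":" + word, self.position))


def tagged_tokens(xml_text):
    handler = TaggedTokenHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


tokens = tagged_tokens(
    "<scene><speech><speaker>Hamlet</speaker>"
    "<line>To be or not to be</line></speech></scene>"
)
```

The first emitted terms are `scene:hamlet`, `speech:hamlet` and `speaker:hamlet`, all at position 1, followed by `scene:to`, `speech:to` and `line:to` at position 2, which mirrors the token listing on the slide.
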
TagQueryParser

- XPath:
    //speech[speaker="Hamlet"][contains(.,"poison")]
- "optimized" XQuery:
    lux:search("+<speaker:Hamlet +<speech:poison")
      //speech[speaker="Hamlet"][contains(.,"poison")]
- Lucene Query:
    tagged_text:(+speaker:Hamlet +speech:poison)

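
The optimized query keeps the original XPath predicates after the index search. A short Python sketch, assuming one indexed document per scene, shows why: the tagged-text index is a coarse filter, and the exact predicate still has to reject documents where the speaker and the word occur in different speeches. The documents, the function names and the set-based index are all made up for illustration; none of this is the Lux API.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample documents, one per scene.
DOCS = {
    "a": "<scene><speech><speaker>Hamlet</speaker>"
         "<line>the drink is poison</line></speech></scene>",
    "b": "<scene><speech><speaker>Hamlet</speaker><line>good night</line></speech>"
         "<speech><speaker>Laertes</speaker>"
         "<line>the blade bears poison</line></speech></scene>",
    "c": "<scene><speech><speaker>Horatio</speaker>"
         "<line>good night</line></speech></scene>",
}


def tagged_terms(elem, ancestors=()):
    """Return tag:word terms for every word under every enclosing tag."""
    ancestors = ancestors + (elem.tag,)
    terms = set()
    for word in re.findall(r"\w+", (elem.text or "").lower()):
        terms.update(tag + ":" + word for tag in ancestors)
    for child in elem:
        terms |= tagged_terms(child, ancestors)
    return terms


def build_index(docs):
    index = {}
    for doc_id, xml_text in docs.items():
        for term in tagged_terms(ET.fromstring(xml_text)):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, required_terms):
    """Documents containing all required terms (like +a +b in Lucene syntax)."""
    postings = [index.get(term, set()) for term in required_terms]
    return set.intersection(*postings) if postings else set()


def has_hamlet_poison_speech(xml_text):
    """Exact check: //speech[speaker="Hamlet"][contains(., "poison")]."""
    root = ET.fromstring(xml_text)
    return any(
        speech.findtext("speaker") == "Hamlet" and "poison" in "".join(speech.itertext())
        for speech in root.iter("speech")
    )


index = build_index(DOCS)
candidates = search(index, ["speaker:hamlet", "speech:poison"])
hits = [doc_id for doc_id in sorted(candidates) if has_hamlet_poison_speech(DOCS[doc_id])]
```

Here the index returns documents `a` and `b` as candidates, and the post-filter keeps only `a`, because in `b` it is Laertes who mentions poison.
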
Tagged token examples

- Generic JSON index
- Overlapping tags (part-of-speech, phrase-labeling, NLP)
  - citation classification w/probabilistic labeling
- One stored field for all the text makes highlighting easier
- One Lucene field means you can use PhraseQuery, e.g.:
    PhraseQuery(<speaker:hamlet <speech:to) finds all speeches by Hamlet starting with "to".

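
The PhraseQuery example works because every tagged variant of a word is emitted at the same position, so terms with different tag prefixes can still be adjacent. The following Python sketch imitates that positional matching with plain dictionaries; `positional_index` and `phrase_positions` are hypothetical helpers, and the token list is hand-built from the "To be or not to be" sample rather than produced by Lucene.

```python
from collections import defaultdict

# (word, enclosing tags) in document order; positions start at 1.
WORDS = [("hamlet", ["scene", "speech", "speaker"])] + [
    (word, ["scene", "speech", "line"]) for word in "to be or not to be".split()
]
TOKENS = [
    (tag + ":" + word, pos)
    for pos, (word, tags) in enumerate(WORDS, start=1)
    for tag in tags
]


def positional_index(tokens):
    """Map each term to the set of positions where it occurs."""
    index = defaultdict(set)
    for term, pos in tokens:
        index[term].add(pos)
    return index


def phrase_positions(index, terms):
    """Start positions at which the terms occur at consecutive positions."""
    starts = index.get(terms[0], set())
    return sorted(
        p for p in starts
        if all(p + i in index.get(term, set()) for i, term in enumerate(terms))
    )


index = positional_index(TOKENS)
speech_opening = phrase_positions(index, ["speaker:hamlet", "speech:to"])
to_be = phrase_positions(index, ["speech:to", "speech:be"])
no_match = phrase_positions(index, ["speaker:hamlet", "speech:be"])
```

`speaker:hamlet` sits at position 1 and `speech:to` at position 2, so the first phrase matches once, while `speech:to speech:be` matches at both occurrences of "to be".
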
What's the cost?

- stored document = 100%
- qnames = +1.3%
- paths = +2.4%
- text tokens = 18%
- tagged text (opaque) = 18%
- tagged text (all transparent) = 71%

Query optimization

Before:

  subsequence(
    for $doc in collection()[.//SPEAKER="Hamlet"]
    order by $doc/lux:key("title")
    return $doc, 1000, 20)

After optimization:

  subsequence(
    lux:search("<SPEAKER:Hamlet", "title", 1000)[.//SPEAKER="Hamlet"],
    1, 20)

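
The benefit of this rewrite can be sketched in Python by counting how many documents each form has to load. The sketch assumes that the second and third arguments of `lux:search` are a sort key and a 1-based start offset, which is an interpretation of the slide rather than a documented signature; the data, the precomputed `HAMLET_BY_TITLE` list standing in for a sorted index, and all function names are hypothetical.

```python
from itertools import islice

DOCS = [
    {"title": f"doc {i:04d}", "speaker": "Hamlet" if i % 2 == 0 else "Ophelia"}
    for i in range(5000)
]
# Stand-in for a sorted index: ids of Hamlet documents, ordered by title.
HAMLET_BY_TITLE = sorted(
    (i for i, doc in enumerate(DOCS) if doc["speaker"] == "Hamlet"),
    key=lambda i: DOCS[i]["title"],
)


def eager_page(start, length, loaded):
    """Unoptimized form: load and test every document, sort, then slice."""
    matches = []
    for doc_id, doc in enumerate(DOCS):
        loaded.append(doc_id)
        if doc["speaker"] == "Hamlet":
            matches.append(doc)
    matches.sort(key=lambda doc: doc["title"])
    return matches[start - 1:start - 1 + length]


def index_search(start, loaded):
    """Yield index hits in sort order from a 1-based start, loading on demand."""
    for doc_id in HAMLET_BY_TITLE[start - 1:]:
        loaded.append(doc_id)
        yield DOCS[doc_id]


def lazy_page(start, length, loaded):
    """Optimized form: the index sorts and skips, the predicate re-checks lazily."""
    verified = (doc for doc in index_search(start, loaded) if doc["speaker"] == "Hamlet")
    return list(islice(verified, length))


eager_loaded, lazy_loaded = [], []
eager = eager_page(1000, 20, eager_loaded)
lazy = lazy_page(1000, 20, lazy_loaded)
```

Both functions return the same 20 documents, but the eager version loads all 5000 documents while the lazy version loads only the 20 it returns.
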
Document storage

- Lux uses Lucene as its primary document store
- Lux tinybin (based on Saxon TinyTree) storage format avoids XML parsing overhead
- Experimental new codec stores fields as files

"Big" binary stored fields

- Problem: "big" stored fields
  - Text documents get stored for highlighting
  - Take time to copy when merging
- Can we do better by storing as files, but managing w/Lucene?

ExternalFieldCodec

[Diagram labels: large stored fields, small stored fields]

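
The idea of keeping large stored fields in files while small ones stay inline can be sketched as follows. This is only a toy model under stated assumptions: the class `ExternalFieldStore`, its size threshold, the one-file-per-document layout and the `merge_cost` measure are all invented for illustration and say nothing about how the actual codec is implemented.

```python
import os
import tempfile


class ExternalFieldStore:
    """Toy store: small values stay inline, large values go to separate files."""

    def __init__(self, directory, threshold=1024):
        self.directory = directory
        self.threshold = threshold  # hypothetical cut-off in bytes
        self.inline = {}            # doc_id -> small value kept inline
        self.external = {}          # doc_id -> path of the file holding a large value

    def put(self, doc_id, value):
        if len(value) < self.threshold:
            self.inline[doc_id] = value
        else:
            path = os.path.join(self.directory, doc_id + ".bin")
            with open(path, "wb") as handle:
                handle.write(value)
            self.external[doc_id] = path

    def get(self, doc_id):
        if doc_id in self.inline:
            return self.inline[doc_id]
        with open(self.external[doc_id], "rb") as handle:
            return handle.read()

    def merge_cost(self):
        """Bytes a merge would copy if external files are left in place (assumption)."""
        return sum(len(value) for value in self.inline.values())


with tempfile.TemporaryDirectory() as tmp:
    store = ExternalFieldStore(tmp)
    store.put("small", b"x" * 10)
    store.put("big", b"y" * 100_000)
    roundtrip_ok = store.get("big") == b"y" * 100_000
    copied = store.merge_cost()
```

In this model a merge copies only the 10 inline bytes, since the 100,000-byte value lives in its own file and is referenced by path.
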
Deleting is complicated

- Real-time deletes
  - Track deletions when merging
  - Keep commits with IndexDeletionPolicy
  - Delete unmerged (empty) segments
- Off-line deletes
  - Cleanup tool traverses entire index

Codec Performance (preliminary)

- 2-3x write speedup for unindexed stored fields
- A bit slower in the worst case
- But, text analysis can take most of the time
- Net: useful if you are storing large binaries

App Server

- Custom DispatchFilter provides:
  - HTTP request/response handling in XQuery
  - file uploads, redirects
  - ability to roll your own: cookies, authentication
- Rapid prototyping, testing query performance, relevance, in an application setting

Isn't Solr enough?

- Yes, but did you remember to index all the fields you need in advance?
- Yes, but did you want to format the result into a nice report *using your query language*?
- Yes, but did you want access to a complete XPath 2.0 implementation in your indexer?

Example uses

- Find some sample content with a new tag we need to support
- Perform complex updates to patch broken content
- Troubleshoot content
- Explore unfamiliar content
- Write prototypes and admin tools entirely in HTML, JS and XQuery
- Demo: http://localhost:8080

Thank You!

- Downloads and documentation at http://luxdb.org
- Source code at http://github.com/msokolov/lux
- Freely available under OSS license (MPL 2)
- Contributions welcome
- Thank you, Safari Books!

