Querying rich text with XQuery

Querying Rich Text
with Lucene XQuery


Michael Sokolov
Senior Architect
Safari Books Online

!   Overview of Lux

!   Why we want a rich(er) query language

!   Implementation Stories

!   Indexing tagged text
!   Storing documents in Lucene
!   Lazy searching

!   Demo

The plan for this talk

!  XQuery in Solr

!   Query optimizer
!   Efficient XML document format
!   XQuery function library

!   as a Java library (Lucene only)
!   as Solr plugins
!   as a standalone App Server

What is Lux?

Search

to find something

Query

to get an answer

!   maybe it was once – 10 years ago?
!   Legacy stuff: DTDs, namespaces, etc
!   arcane Java programming interfaces
!   Don’t we use JSON now?
!   so why do we care about it?

XML is not cool

!   There’s a huge amount of it out there
!   HTML is XML, or can be
!   Lux is about making it easy (and free) to deal
with XML

But it still matters

!   We make content-rich sites:

!   our own site: safaribooksonline.com
!   our clients’ sites: oed.com, degruyter.com,
oxfordreference.com, …

!   Publishers provide us with content

!   we debug content problems
!   we add new features nimbly
!   Piles of random data (XML, mostly)

Why did we make it?

!   Complex queries over semi-structured data, typically
documents
!   You don’t need it for edismax-style “quick” search
!   or highly-structured data
!   XQuery comes with a rich function library;
!   rich string, numeric and date functions
!   extensions for HTTP, filesystem, zip

How can XQuery help?

[Architecture diagram. Components shown: DispatchFilter; UpdateProcessor;
XML Indexer; XML text fields; Tagged TokenStream; XPath fields;
Tinybin storage; External Field Codec; QueryComponent; QParserPlugin;
Evaluator; Saxon XQuery/XSLT Processor; XQuery Function Library;
Lazy Searcher; ResponseWriter; Compiler; Optimizer; Tagged Highlighter]

How does Lux work?

!   “hamlet”
!   “hamlet” in //title
!   “hamlet” in //scene/title, //speaker, etc…
!   XQuery, but we need an index
!   DIH XPathEntityProcessor
!   But are XPath indexes enough?

XML is text with context

!   In which speeches does Hamlet talk about poison?
!   +speaker:Hamlet +line:poison
!   Works great if we indexed speaker and line for each
speech

!   What if we only indexed at the scene level?
!   What if we just indexed speech text as a field?
!   XPath indexes are precise and fine-grained
!   Great when you know exactly what you need

How do we index context?
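The trade-off in the bullets above can be sketched with a toy in-memory index. This is only an illustration, not Lux or Lucene code: the sample speeches are made up, and `build_index` and `search_all` are hypothetical helpers. It shows that `+speaker:Hamlet +line:poison` is precise when each speech is its own indexed unit, and over-matches when the same fields are indexed per scene.

```python
from collections import defaultdict

# Made-up sample data: one record per speech.
SPEECHES = [
    {"id": 1, "speaker": "Hamlet", "line": "to be or not to be"},
    {"id": 2, "speaker": "Claudius", "line": "the poison is in the cup"},
    {"id": 3, "speaker": "Hamlet", "line": "the point envenomed with poison"},
]


def build_index(docs):
    """Map (field, term) to the set of document ids containing that term."""
    index = defaultdict(set)
    for doc in docs:
        for field in ("speaker", "line"):
            for term in doc[field].lower().split():
                index[(field, term)].add(doc["id"])
    return index


def search_all(index, *clauses):
    """AND together (field, term) clauses, like +field:term in Lucene syntax."""
    postings = [index.get((field, term.lower()), set()) for field, term in clauses]
    return sorted(set.intersection(*postings)) if postings else []


# Speech-level indexing: only the speech where Hamlet himself says "poison".
speech_hits = search_all(build_index(SPEECHES),
                         ("speaker", "Hamlet"), ("line", "poison"))

# Scene-level indexing: speeches 1 and 2 merged into one unit. The scene
# matches even though Hamlet never says "poison" in it.
scene = {
    "id": 10,
    "speaker": " ".join(s["speaker"] for s in SPEECHES[:2]),
    "line": " ".join(s["line"] for s in SPEECHES[:2]),
}
scene_hits = search_all(build_index([scene]),
                        ("speaker", "Hamlet"), ("line", "poison"))
```

The scene-level false positive is the kind of gap that a post-processing step has to close.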

<play>
<title>Hamlet</title>
<act act="1">
<scene act="1" scene="1">
<title>SCENE I. Elsinore ... </title>

Index         Values
Tags          title, act, @act
Tag Paths     /play, /play/title, /play/act, /play/act/@act
Text          hamlet, scene, elsinore
Tagged Text   play:hamlet, title:hamlet, @act:1
XPath         user-defined XPath 2.0 expression; e.g.
              count(//line), replace(//title, 'SCENE|ACT S+','')

Contextual Indexes
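As a rough sketch of how the index values listed above could be derived from the sample document, the following uses Python's standard `xml.etree.ElementTree`. It is not Lux's indexer: the function name `index_values`, the tokenizer, and the closing tags added to make the fragment well-formed are all assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The slide's fragment, closed off here so that it parses.
SAMPLE = """<play>
<title>Hamlet</title>
<act act="1">
<scene act="1" scene="1">
<title>SCENE I. Elsinore</title>
</scene>
</act>
</play>"""


def tokenize(text):
    """Lowercased word tokens; a stand-in for a real text analyzer."""
    return re.findall(r"\w+", text.lower())


def index_values(xml_text):
    tags, paths, text, tagged = set(), set(), set(), set()

    def visit(elem, stack):
        stack = stack + [elem.tag]
        path = "/" + "/".join(stack)
        tags.add(elem.tag)
        paths.add(path)
        for name, value in elem.attrib.items():
            tags.add("@" + name)
            paths.add(path + "/@" + name)
            for token in tokenize(value):
                tagged.add("@" + name + ":" + token)
        # Text directly inside this element, including text after each child.
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        for token in tokenize(" ".join(chunks)):
            text.add(token)
            for tag in stack:  # tag the token with every enclosing element
                tagged.add(tag + ":" + token)
        for child in elem:
            visit(child, stack)

    visit(ET.fromstring(xml_text), [])
    return {"tags": tags, "paths": paths, "text": text, "tagged_text": tagged}


values = index_values(SAMPLE)
```

The XPath index is left out because it depends on user-supplied expressions.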

!   Tagged Text, Path index
!   Imprecise, generic indexes, but more context
than just full text
!   XQuery post-processing to patch over the gaps
!   Query optimizer applies indexes
!   For when you don’t want to sweat the details:
ad hoc queries, content analysis and debugging

General purpose indexes

<scene><speech>
<speaker>Hamlet</speaker>
<line>To be or not to be, … </line>

Token "Hamlet": tag stack …, scene, speech, speaker
Token "To": tag stack …, scene, speech, line
Token "be": tag stack …, scene, speech, line

!   Wraps an existing Analyzer (for the text)
!   Responds to XML events (start element, etc)
!   Maintains a tag name stack
!   Emits each token prefixed by enclosing tags

Tokens emitted:

ttext:scene:hamlet pos=1
ttext:speech:hamlet pos=1
ttext:speaker:hamlet pos=1
ttext:scene:to pos=2
ttext:speech:to pos=2
…

TaggedTokenStream
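The four behaviours listed for TaggedTokenStream can be mimicked in a few lines with Python's standard SAX parser. This is a sketch of the idea only, not the Lux class: `TaggedTokenHandler`, `simple_analyzer` and `tagged_tokens` are hypothetical names, and the field-name prefix shown in the emitted tokens is omitted.

```python
import re
import xml.sax


def simple_analyzer(text):
    """Stand-in for a wrapped Analyzer: lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


class TaggedTokenHandler(xml.sax.ContentHandler):
    """Emits every text token once per enclosing tag, all at one position."""

    def __init__(self, analyzer):
        super().__init__()
        self.analyzer = analyzer
        self.stack = []    # names of the currently open elements
        self.buffer = []   # text chunks seen since the last start/end tag
        self.position = 0
        self.tokens = []   # (tagged term, position) pairs

    def _flush(self):
        # characters() may deliver one text node in several chunks,
        # so analyze only when a tag boundary is reached.
        text = "".join(self.buffer)
        self.buffer.clear()
        for token in self.analyzer(text):
            self.position += 1
            for tag in self.stack:
                self.tokens.append((f"{tag}:{token}", self.position))

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        self.buffer.append(content)


def tagged_tokens(xml_text):
    handler = TaggedTokenHandler(simple_analyzer)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


tokens = tagged_tokens(
    "<scene><speech><speaker>Hamlet</speaker>"
    "<line>To be or not to be</line></speech></scene>"
)
```

Because all tagged variants of a token share one position, positional queries can mix tags freely.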

!   XPath:
//speech[speaker="Hamlet"][contains(.,"poison")]
!   “optimized” XQuery:
lux:search("+<speaker:Hamlet +<speech:poison")
//speech[speaker="Hamlet"][contains(.,"poison")]
!   Lucene Query:
tagged_text:(+speaker:Hamlet +speech:poison)

TagQueryParser

!   Generic JSON index
!   Overlapping tags (part-of-speech, phrase-labeling, NLP)
!   citation classification w/probabilistic labeling

!   One stored field for all the text makes highlighting easier
!   One Lucene field means you can use PhraseQuery, eg:
PhraseQuery(<speaker:hamlet <speech:to) finds all
speeches by hamlet starting with “to”.

Tagged token examples
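The PhraseQuery example works because every tagged variant of a token sits at the same position in one field. A toy positional index makes this concrete. This is a sketch under stated assumptions: `index_speech` and `phrase_search` are hypothetical helpers, the sample speeches are illustrative, and the `<` prefix of the slide's query syntax is dropped.

```python
from collections import defaultdict


def index_speech(index, doc_id, segments):
    """segments: (enclosing tags, text) pairs; positions run across the speech."""
    position = 0
    for tags, text in segments:
        for token in text.lower().split():
            position += 1
            for tag in tags:
                index[f"{tag}:{token}"][doc_id].add(position)


def phrase_search(index, terms):
    """Ids of documents in which the terms occur at consecutive positions."""
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + offset in index.get(term, {}).get(doc_id, ())
                   for offset, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


index = defaultdict(lambda: defaultdict(set))
index_speech(index, 1, [(("speech", "speaker"), "Hamlet"),
                        (("speech", "line"), "To be or not to be")])
index_speech(index, 2, [(("speech", "speaker"), "Hamlet"),
                        (("speech", "line"), "Angels and ministers of grace")])
index_speech(index, 3, [(("speech", "speaker"), "Horatio"),
                        (("speech", "line"), "To what issue will this come")])

# Speeches by Hamlet whose first word is "to".
hamlet_to = phrase_search(index, ["speaker:hamlet", "speech:to"])
```

With a separate field per tag, the two terms would live in different position spaces and could not be combined in a single phrase.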

!   stored document = 100%
!   qnames = +1.3%
!   paths = +2.4%
!   text tokens = 18%
!   tagged text (opaque) = 18%
!   tagged text (all transparent) = 71%

What’s the cost?

subsequence(
  for $doc in collection()[.//SPEAKER="Hamlet"]
  order by $doc/lux:key("title")
  return $doc,
  1000, 20)

subsequence(
  lux:search("<SPEAKER:Hamlet", "title", 1000)
    [.//SPEAKER="Hamlet"],
  1, 20)

Query optimization
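The rewrite above keeps the `[.//SPEAKER="Hamlet"]` predicate even after adding the index lookup. A small sketch shows why: the index narrows the candidate set cheaply, but it is imprecise, so the exact predicate still runs on whatever survives. This is not the Lux optimizer; the documents are made up, the helper names are hypothetical, and the lookup assumes a single-word speaker name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up documents; "d3" contains the token "hamlet" but is not an exact match.
DOCS = {
    "d1": "<PLAY><TITLE>Hamlet</TITLE><SPEECH><SPEAKER>Hamlet</SPEAKER></SPEECH></PLAY>",
    "d2": "<PLAY><TITLE>Macbeth</TITLE><SPEECH><SPEAKER>Macbeth</SPEAKER></SPEECH></PLAY>",
    "d3": "<PLAY><TITLE>A Sequel</TITLE><SPEECH><SPEAKER>Young Hamlet</SPEAKER></SPEECH></PLAY>",
    "d4": "<PLAY><TITLE>Othello</TITLE><SPEECH><SPEAKER>Othello</SPEAKER></SPEECH></PLAY>",
}


def title_of(root):
    return root.findtext("TITLE", default="")


def has_speaker(root, name):
    """Exact test, playing the role of the XPath predicate."""
    return any(s.text == name for s in root.iter("SPEAKER"))


def page(roots, start, count):
    """Order by title, then take a 1-based window like subsequence()."""
    ordered = sorted(roots, key=title_of)
    return [title_of(r) for r in ordered[start - 1:start - 1 + count]]


def build_index(docs):
    """Simplified tagged-text index: innermost tag only, term to doc ids."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for elem in ET.fromstring(xml_text).iter():
            for token in (elem.text or "").lower().split():
                index[f"{elem.tag}:{token}"].add(doc_id)
    return index


def naive_query(docs, name, start, count):
    roots = [ET.fromstring(x) for x in docs.values()]  # parses everything
    return page([r for r in roots if has_speaker(r, name)], start, count), len(roots)


def indexed_query(docs, index, name, start, count):
    candidates = index.get(f"SPEAKER:{name.lower()}", set())
    roots = [ET.fromstring(docs[d]) for d in sorted(candidates)]  # candidates only
    return page([r for r in roots if has_speaker(r, name)], start, count), len(roots)


INDEX = build_index(DOCS)
naive_result = naive_query(DOCS, "Hamlet", 1, 20)
indexed_result = indexed_query(DOCS, INDEX, "Hamlet", 1, 20)
```

Both calls return the same titles; the second element of each result counts the documents that had to be parsed.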

!   Lux uses Lucene as its primary document store
!   Lux tinybin (based on Saxon TinyTree) storage
format avoids XML parsing overhead
!   Experimental new codec stores fields as files

Document storage

!   Problem: “big” stored fields
!   Text documents get stored for highlighting

!   Take time to copy when merging
!   Can we do better by storing as files, but
managing w/Lucene?

“Big” binary stored fields

large stored fields
small stored fields

ExternalFieldCodec

!   Real-time deletes
!   Track deletions when merging
!   Keep commits with IndexDeletionPolicy
!   Delete unmerged (empty) segments

!   Off-line deletes
!   Cleanup tool traverses entire index

Deleting is complicated

!   2-3x write speedup for unindexed stored fields
!   a bit slower in the worst case
!   But, text analysis can take most of the time
!   Net: useful if you are storing large binaries

Codec Performance
(preliminary)

!   custom DispatchFilter provides:
!   HTTP request/response handling in XQuery
!   file uploads, redirects
!   Ability to roll your own: cookies, authentication

!   Rapid prototyping, testing query performance,
relevance, in an application setting

App Server

!   Yes, but did you remember to index all the
fields you need in advance?
!   Yes, but did you want to format the result into a
nice report *using your query language*?
!   Yes, but did you want access to a complete
XPath 2.0 implementation in your indexer?

Isn’t Solr enough?

!   Find some sample content with a new tag we need
to support
!   Perform complex updates to patch broken content
!   Troubleshoot content
!   Explore unfamiliar content
!   Write prototypes and admin tools entirely in HTML,
JS and XQuery
!   Demo: http://localhost:8080

Example uses

!   Downloads and Documentation at
http://luxdb.org
!   Source code at http://github.com/msokolov/lux
!   Freely available under OSS license (MPL 2)
!   Contributions welcome
!   Thank you, Safari Books!

Thank You!
