Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component 
Steven Bower 
©2014 Bloomberg L.P.
Bloomberg 
• Largest provider of financial news and information 
• Our strength is quickly and accurately delivering data, news and analytics 
• Creating high performance and accurate information retrieval systems is core to 
our strength
Bloomberg Search Team 
• Search infrastructure 
• Develop and support search as a service platform 
• Support for other search applications within the company 
• Consultancy 
• Provide design consultancy/support to application teams 
• Promote search best practices/standardization throughout the company 
• Machine learning 
• Develop machine learning techniques to improve relevancy 
• Create natural language processors to answer questions 
• Unified search 
• Create information retrieval tools to organize and connect the vast and varied 
datasets provided to our clients
Our Challenge
Our Approach 
• Use Search/Solr as it provides flexible search/filtering over large, fast-moving 
result sets 
• Initially used StatsComponent, but quickly ran into limitations 
• Wanted to push the bounds of analytics capabilities in Solr/Lucene 
• Needed a pluggable framework to perform complex calculations/aggregations on 
numerical time-series data 
• DocValues provided high performance columnar access to fields in the index 
(without un-inversion cost)
DocValues 
• DocValues provide high performance 
columnar access to fields in the index 
• No un-inversion cost 
• Increased storage footprint 
• Helps achieve NRT 
• Values live off-heap in memory-mapped files
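
To make the "no un-inversion cost" point concrete, here is a toy sketch in Python. It is only an illustration of the two data layouts, not how Lucene stores or reads DocValues; the `docs` data and the `uninvert` helper are hypothetical names invented for this example.

```python
from collections import defaultdict

# Hypothetical toy index: three documents, one numeric field each.
docs = {0: 101.5, 1: 99.0, 2: 101.5}

# Inverted layout (value -> doc IDs), the shape used for search lookups.
inverted = defaultdict(list)
for doc_id, value in docs.items():
    inverted[value].append(doc_id)


def uninvert(inverted_index, num_docs):
    """Rebuild a per-document column by walking every posting list."""
    column = [None] * num_docs
    for value, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            column[doc_id] = value
    return column


# Columnar layout (doc ID -> value), written once at index time.
# Analytics can read it directly instead of paying for uninvert() at query time.
column = [docs[i] for i in range(len(docs))]

print(uninvert(inverted, len(docs)) == column)
```

Both layouts hold the same values; the difference is whether the per-document column has to be rebuilt before aggregation can start.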
Analytics Component 
• New component from the ground up 
• Designed/Implemented by the Bloomberg Search Team over summer of 2013 
• Initial implementation was built using DocValues API directly, but moved to 
FieldCache 
• Refactored existing faceting implementation to support analytics 
• Created simple prefix notation for statistical expressions 
• Available as a Solr Contrib module in Solr 5.x, or as patches for 4.8+ on SOLR-5302
Features 
• Flexible/extensible framework for adding statistics/faceting 
• Supports Multiple Analytics Requests per query execution 
• Multiple statistic calculations per request 
• Multiple facets per request 
• Each request can facet statistics over different fields and ranges
Features - Faceting 
• Field Faceting 
• Support for int, long, float, double, date, string fields 
• Support for multi-value fields 
• Support for limit, offset and mincount 
• Support for sorting of stats-facets by any statistic (e.g. sort by mean) 
• Range faceting 
• Numeric types and dates 
• Dynamically calculate range/gap based on calculated statistics 
• Support for query faceting of stats 
• Use calculated statistics to generate facet queries
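
The field-faceting options above (limit, offset, mincount, sort by a statistic) can be sketched in plain Python. This is a hedged illustration of the behaviour the slide describes, not the component's implementation; `facet_stats`, the `rows` data and the field names `sector` and `price` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def facet_stats(rows, facet_field, value_field, sort_by="mean",
                limit=10, offset=0, mincount=1):
    """Group rows by facet_field and compute statistics per bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[facet_field]].append(row[value_field])

    results = []
    for key, values in buckets.items():
        if len(values) < mincount:
            continue  # drop buckets with too few documents
        results.append({
            "value": key,
            "count": len(values),
            "sum": sum(values),
            "mean": mean(values),
        })

    # Sort buckets by the chosen statistic, largest first, then page.
    results.sort(key=lambda bucket: bucket[sort_by], reverse=True)
    return results[offset:offset + limit]


rows = [
    {"sector": "tech", "price": 10.0},
    {"sector": "tech", "price": 20.0},
    {"sector": "energy", "price": 40.0},
    {"sector": "energy", "price": 50.0},
    {"sector": "retail", "price": 5.0},
]
top = facet_stats(rows, "sector", "price", sort_by="mean", limit=2, mincount=2)
print([bucket["value"] for bucket in top])
```

Here the `retail` bucket is filtered out by `mincount=2`, and the remaining buckets are ordered by their mean price.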
Features – Map Operators 
• Basic Math 
• neg(<expr>) 
• add(<expr>,...) 
• mult(<expr>,...) 
• div(<expr>,<expr>) 
• pow(<expr>,<expr>) 
• log(<expr>,<expr>) 
• Constants 
• const_num(<number>) 
• const_date(<date>) 
• const_str(<string>) 
• Date Math 
• date_math(<date expr>,<date op>,...) 
• String operations 
• rev(<expr>) 
• concat(<expr>,...) 
• Field 
• <field> 
• Missing Values 
• miss(<expr>,<value>)
Features – Reduction Operators 
• Statistical 
• min(<expr>) 
• max(<expr>) 
• sum(<expr>) 
• count(<expr>) 
• miss(<expr>) 
• unique(<expr>) 
• Complex 
• sumofsquares(<expr>) 
• mean(<expr>) 
• stddev(<expr>) 
• median(<expr>) 
• percentile(<expr>)
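
How map and reduction operators compose in prefix notation can be sketched with a small evaluator. This is a minimal Python illustration covering only a handful of the operators listed; the parser, the `evaluate` function and the sample documents are assumptions made for this sketch and do not reflect the component's actual Java code.

```python
import math
import re

# Map operators: applied to values, one result per call.
MAP_OPS = {
    "neg": lambda x: -x,
    "add": lambda *xs: sum(xs),
    "mult": lambda *xs: math.prod(xs),
    "div": lambda x, y: x / y,
    "pow": lambda x, y: x ** y,
}

# Reduction operators: collapse one value per document into a single number.
REDUCE_OPS = {
    "min": min,
    "max": max,
    "sum": sum,
    "count": len,
    "mean": lambda xs: sum(xs) / len(xs),
}

TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|-?\d*\.?\d+|[(),])")


def parse(text):
    """Parse 'op(arg, ...)' into nested (name, [args]) tuples; leaves are strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def expr():
        nonlocal pos
        name = tokens[pos]
        pos += 1
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            args = [expr()]
            while tokens[pos] == ",":
                pos += 1
                args.append(expr())
            pos += 1  # skip the closing ")"
            return (name, args)
        return name

    return expr()


def eval_map(node, doc):
    """Evaluate a map-level expression against a single document."""
    if isinstance(node, str):
        return doc[node]  # a bare name is a field reference
    name, args = node
    if name == "const_num":
        return float(args[0])
    return MAP_OPS[name](*(eval_map(arg, doc) for arg in args))


def evaluate(node, docs):
    """Evaluate an expression whose innermost reductions run over all docs."""
    name, args = node
    if name == "const_num":
        return float(args[0])
    if name in REDUCE_OPS:
        return REDUCE_OPS[name]([eval_map(args[0], doc) for doc in docs])
    return MAP_OPS[name](*(evaluate(arg, docs) for arg in args))


docs = [
    {"field_a": 1.0, "field_b": 2.0},
    {"field_a": 3.0, "field_b": 4.0},
]
print(evaluate(parse("sum(mult(field_a, field_b))"), docs))
```

Inside a reduction, map operators run once per document; outside a reduction, the same map operators combine the already-reduced results.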
Examples 
• Weighted Average 
• Calculate weighted average of field_a with field_b as the weight 
div( sum( mult(field_a, field_b) ), sum(field_b) ) 
• Variance 
• Calculate the variance of field_a 
pow( stddev(field_a), const_num(2) )
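
As a cross-check of what these two statistics compute, here is a short Python sketch using the standard definitions: a weighted average is the sum of value times weight divided by the sum of the weights, and variance is the square of the standard deviation. The sample values are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the slides do not say which form `stddev` uses.

```python
from statistics import pstdev

field_a = [10.0, 20.0, 30.0]  # hypothetical values
field_b = [1.0, 2.0, 1.0]     # hypothetical weights

# Weighted average: sum of (value * weight) divided by sum of weights.
weighted_avg = sum(a * b for a, b in zip(field_a, field_b)) / sum(field_b)

# Variance as the square of the (population) standard deviation.
variance = pstdev(field_a) ** 2

print(weighted_avg, variance)
```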
Examples 
• T-Score 
• Calculate a t-score where ## is the value and all values in your sample are stored in field_a. 
div( add( const_num(##), neg( mean(field_a) ) ), 
div( stddev(field_a), pow( count(field_a), const_num(.5) ) ) )
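
The same calculation written out in Python may make the nesting easier to follow: the numerator is the value minus the sample mean, and the denominator is the standard deviation divided by the square root of the count. The `t_score` helper and the sample data are hypothetical, and the population standard deviation is assumed here because the slides do not specify which form `stddev` uses.

```python
import math
from statistics import mean, pstdev


def t_score(value, sample):
    """(value - mean) / (stddev / sqrt(count)), mirroring the expression above."""
    standard_error = pstdev(sample) / math.sqrt(len(sample))
    return (value - mean(sample)) / standard_error


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical field_a values
print(t_score(6.0, sample))
```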
How We Use It 
• Segment, aggregate and analyze 
financial data quickly 
• Aggregate time series data across 
multiple fields to render charts 
• Created flexible diagnostic tools/visualizations 
to analyze Solr performance
Future Plans 
• Multi-shard support 
• Pivot Facet Support 
• Statistics on Multi-value fields 
• To support unique() 
• Filter result set based upon calculated statistics 
• Generalize facet implementation
Links and Questions? 
Analytics Component 
https://issues.apache.org/jira/browse/SOLR-5302 
More About Bloomberg 
http://www.bloomberglabs.com/
