Datafari - Building an Open Source
Enterprise Search Solution from
Popular Building Blocks
CEDRIC ULMER
FRANCE LABS
II-SDV
25/04/17
Datafari
So what is Datafari?
• « Packaged solution » to analyse and search for documents and data
• Can index heterogeneous data formats from multiple sources
• Federated search interface
• Apache v2 licence
II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution from Popular Building Blocks
Why Datafari ?
Choice of the Apache Solr and Elasticsearch technologies (more about this later...)
Three ways to meet a customer's requirements:
• Use a packaged solution available on the market from a 3rd party
• Start from Apache Solr or Elasticsearch (or others)
  • Develop and gather the necessary components for each customer's needs
  • Ensure « production-ready » material: docs, processes, tests
• Create our own packaged solution (yeah!)
Why Datafari ?
Problems with 3rd party proprietary solutions:
• Black box
• Roadmap not clear
• Resilience (bankruptcy, acquisition…)
Problems with 3rd party open source solutions:
• Lack of technical documentation
• Difficulty setting up an understandable debug environment
• Delays in updating the embedded components, in particular Solr or ES
• License issues (mostly viral ones)
• Lack of resilience from the makers
=> This required us to develop our own solution to better address our customers' needs
Why Datafari
Idea:
• Gather the best of both worlds:
  • The “packaged” aspect of existing solutions
    • Many functionalities
    • All in one
  • The flexibility of a solution based on Solr and ES
  • All of that with an Apache v2 licence ☺
• Focus on Enterprise Search:
  • Admin for search experts
  • Admin for search administrators
  • Eased AD/LDAP management
  • Search and data analytics
Based on 4 building blocks:
• Apache Solr
  • The heart of the search engine
• Apache ManifoldCF
  • Crawling documents
• AjaxFranceLabs
  • UI
• Elasticsearch
  • Data analytics
[Architecture diagram, Datafari 3.1: data sources (CMS, DB, fileshares, web) and security (AD, LDAP); Apache ManifoldCF 2.5 (crawler service, authorization service) with PostgreSQL; Apache Solr 5.5 (document index, statistics index); Datafari Search / Admin on Apache Tomcat 7; ELK; Cassandra (user management).]
Apache Solr
Lucene based Full text search engine
Apache Top Level project
Large community (users/devs)
Efficient/Reliable
Scalable
• High availability
• Queries
• Index volume
Apache Solr
Java web application
REST APIs over XML/HTTP
• Indexing
• Querying
Caching
Web admin interface
Configuration through XML config files or APIs
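To make the querying side of the HTTP API concrete, here is a minimal sketch that builds a request URL for Solr's standard `/select` handler. The host, the core name `documents`, and the facet field names are assumptions for illustration only; they are not taken from Datafari's actual configuration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, text, facet_fields=(), rows=10):
    """Build a URL for Solr's /select query handler with optional field facets."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", name) for name in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host, core and field names.
url = build_select_url(
    "http://localhost:8983/solr",
    "documents",
    "drilling report",
    facet_fields=["language", "extension"],
)
print(url)
```

Sending this URL with any HTTP client should return a JSON response from a running Solr instance, provided the core and fields exist there.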
Apache Lucene/Solr – Some refs
Apache Solr for Datafari
Search core of Datafari
Preconfigured index for rich documents
• Language detection
• Standard facets
• Autocomplete
• Spellchecker
Indexing user queries
• Enables analytics on search users' behavior
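As an illustration of how standard facets reach a client, the sketch below reads field facets out of a Solr JSON response. With Solr's default JSON settings, each facet field comes back as a flat list alternating value and count. The sample response and the `language` field are invented for the example.

```python
def facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Invented sample shaped like a Solr JSON response with facet=true.
sample = {
    "response": {"numFound": 3, "docs": []},
    "facet_counts": {"facet_fields": {"language": ["en", 2, "fr", 1]}},
}

print(facet_counts(sample, "language"))  # {'en': 2, 'fr': 1}
```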
Apache ManifoldCF
Framework for data crawling
Management of incremental crawling
Authorization management
Programmable crawls (time windows, loads, regex…)
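The idea behind incremental crawling and time windows can be sketched in a few lines. This is a conceptual illustration only, not ManifoldCF's API: the function names, the version strings, and the file names are all hypothetical.

```python
def plan_incremental_crawl(previous, current):
    """Compare per-document version strings between two crawls.

    previous and current map a document id to a version string
    (for example a modification date). Returns (to_index, to_delete).
    """
    to_index = sorted(doc for doc, version in current.items()
                      if previous.get(doc) != version)
    to_delete = sorted(doc for doc in previous if doc not in current)
    return to_index, to_delete


def in_crawl_window(hour, start, end):
    """True if hour falls in [start, end), allowing windows that wrap past midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


previous = {"a.pdf": "v1", "b.docx": "v1", "c.txt": "v1"}
current = {"a.pdf": "v1", "b.docx": "v2", "d.xlsx": "v1"}

print(plan_incremental_crawl(previous, current))
print(in_crawl_window(23, 22, 6))
```

Only changed or new documents are sent for indexing, and documents that disappeared from the source are scheduled for deletion, which is what keeps repeated crawls cheap.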
Apache ManifoldCF
Many off-the-shelf connectors:
• FileShare (Samba)
• JDBC
• Website
• Alfresco
• CMIS
• SharePoint
• Mail
• Dropbox
• LDAP/AD
Apache ManifoldCF for Datafari
Manages data crawling
Manages authentication
Preconfigured integration with our Solr
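One common way to apply crawled authorization data at search time is security trimming: each indexed document carries allow and deny tokens, and results are filtered against the tokens of the user who is searching. The sketch below shows that idea under simplified assumptions; the field names, token strings, and documents are hypothetical and do not reflect Datafari's or ManifoldCF's actual schema.

```python
def is_visible(doc, user_tokens):
    """A document is visible if the user holds an allow token and no deny token."""
    allowed = bool(user_tokens & set(doc["allow"]))
    denied = bool(user_tokens & set(doc["deny"]))
    return allowed and not denied


def trim_results(docs, user_tokens):
    """Keep only the documents the user may see."""
    return [doc["id"] for doc in docs if is_visible(doc, user_tokens)]


# Hypothetical documents and group tokens.
docs = [
    {"id": "report.pdf", "allow": ["geoscience"], "deny": []},
    {"id": "salaries.xlsx", "allow": ["hr"], "deny": []},
    {"id": "draft.docx", "allow": ["geoscience"], "deny": ["contractors"]},
]

print(trim_results(docs, {"geoscience", "contractors"}))  # ['report.pdf']
```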
Datafari Search
Front-End
User UI
• AjaxFranceLabs
Authentication
Interactions with Solr (SolrJ)
Indexing user queries
Admin UI
• Solr
• ManifoldCF
• Statistics
AjaxFranceLabs
Inspired by AjaxSolr
Javascript/Ajax client
Provides several components:
• Manager: backend connection
• Widgets
• Graphical/Logical components
• (Advanced) Search
• Facet
• Geolocation (based on OpenStreetMap)
[Diagram: in the browser, the Datafari Search UI with its Manager and widgets (SearchBarWidget, ResultWidget, FacetWidget) communicates via Ajax with the Datafari Search Servlet on the Datafari server.]
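The manager/widget split can be sketched as follows: a manager owns the backend connection and hands each response to every registered widget. AjaxFranceLabs itself is a JavaScript library; this Python version only illustrates the pattern. The class and method names echo the diagram but are assumptions, not the library's real API, and the backend is a stub.

```python
class Manager:
    """Owns the backend connection and dispatches responses to widgets."""

    def __init__(self, search_backend):
        self.search_backend = search_backend  # callable: query -> response dict
        self.widgets = []

    def add_widget(self, widget):
        self.widgets.append(widget)

    def search(self, query):
        response = self.search_backend(query)
        return [widget.render(response) for widget in self.widgets]


class ResultWidget:
    def render(self, response):
        return "Results: " + ", ".join(doc["title"] for doc in response["docs"])


class FacetWidget:
    def __init__(self, field):
        self.field = field

    def render(self, response):
        counts = response["facets"][self.field]
        return f"{self.field}: " + ", ".join(f"{k} ({v})" for k, v in counts.items())


def fake_backend(query):
    """Stub standing in for the search servlet; returns invented data."""
    return {
        "docs": [{"title": "Well log 12"}, {"title": "Core sample"}],
        "facets": {"source": {"fileshare": 2}},
    }


manager = Manager(fake_backend)
manager.add_widget(ResultWidget())
manager.add_widget(FacetWidget("source"))
print(manager.search("well"))
```

Keeping the connection in one place means a new widget only has to know how to render a response, not how to reach the server.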
Use case 1 – Oil and Gas
Sources:
• SharePoint
• Documentum
• Fileshare
• DB
Volume: 28 TB
Users: Geoscientists
Use case 2 – Nuclear
Sources:
• Fileshare
• Oracle
• DB
Volume: 15 M docs
Users: Maintenance operators
Demo!
Technical Roadmap (1/2)
New advanced search
Solr 6
Graphical SolrCloud management
Always more documentation
Annotator
Technical roadmap (2/2)
New languages
Consolidation
Unit test framework
More dashboards in ELK
Learning-to-Rank
Where can I find Datafari
Main hub: http://www.datafari.com/en
Source code available on GitHub:
• https://code.google.com/p/datafari/
Install packages for Debian 7 and Windows available on:
• www.datafari.com
Forum:
• https://groups.google.com/forum/#!forum/datafari
Documentation on Confluence
• Technical and functional
Tickets and releases on Jira
Want to follow Datafari ?
@francelabs
#datafari
francelabs
francelabs
Become a Datafarian ! ☺
We are always open to suggestions
• “Reorganise your docs…”
Contribution
• What about a German version ?!
• UI widgets ?
Most important: your use cases and usage feedback
CONTACT
Don’t hesitate to reach out to us for any info
Our corporate website: www.francelabs.com
Email: contact@francelabs.com
Tel: 09 72 43 72 85
Fax: 09 72 29 28 14
