October 13-16, 2016 • Austin, TX
Properly integrate ManifoldCF with Solr
Aurélien MAZOYER
Search Expert, Co-founder, France Labs
Apache ManifoldCF
o Agenda
• Overview of ManifoldCF
• Our scenario: find files on a file share
• In real life
Apache ManifoldCF
o Overview
• Connector Framework
• Incremental crawling
• Handle authorization
• Configuration via REST API and UI
Apache ManifoldCF
o History
• Based on the "Connector Framework" developed by Karl Wright for the MetaCarta Appliance
• Donated to the Apache Software Foundation in 2009
• May 2012: out of incubation
• Current version: 2.2 (August 2015)
Connectors gone wild
o Different connectors for:
• Content repositories
• Web, Wiki, DB, Email, RSS, CMIS, Alfresco…
• But also Windows Share, SharePoint, Dropbox…
• Authorities
• LDAP, AD, CMIS…
• Output
• Solr, Elasticsearch, OSS…
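Big picture
[Diagram: ManifoldCF architecture. The daemon agent runs repository connectors (Conn. 1 … Conn. N) towards the repositories (Wiki, DB, … Repository N) and output connectors (Conn. 1, Conn. 2, … Conn. N) towards the outputs (Solr, Elasticsearch, … Repository N). The ManifoldCF authority service runs authority connectors (Conn. 1, Conn. 2, … Conn. N) towards the authorities (OpenLDAP, … Authority N). The ManifoldCF UI and the ManifoldCF API are shown alongside.]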
Roles of components
o Daemon agent
• Java process
• Runs repository and output connectors
• Runs data crawling jobs
Roles of components
o Authority service
• Web application
• Run authority connectors
• Get security tokens for a specific user
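As a rough sketch of how a front end might use this service, the snippet below builds a request URL for a user and parses a plain-text response into security tokens. The `UserACLs` path, the `username` parameter and the `TOKEN:` / `AUTHORIZED:` line prefixes are assumptions about a typical ManifoldCF deployment; the host, port and sample response are invented for illustration and should be checked against your own installation.

```python
from urllib.parse import urlencode

# Assumed base URL of the authority service web application (hypothetical host and port).
AUTHORITY_BASE = "http://localhost:8345/mcf-authority-service"


def build_user_acls_url(username, base=AUTHORITY_BASE):
    """Build the URL that asks the authority service for a user's tokens."""
    return f"{base}/UserACLs?{urlencode({'username': username})}"


def parse_tokens(response_text):
    """Keep only the lines that carry a security token (assumed 'TOKEN:' prefix)."""
    tokens = []
    for line in response_text.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


# Invented sample response: one status line followed by two tokens.
sample = "AUTHORIZED:ad\nTOKEN:ad:S-1-5-32-544\nTOKEN:ad:S-1-5-21-1-2-3-1118\n"
print(build_user_acls_url("jdoe@example.com"))
print(parse_tokens(sample))
```

The parsing is kept separate from the HTTP call so that it can be exercised without a running authority service.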
Component
o ManifoldCF UI
[Diagram: Output Connection, Repo Connection and Crawl Job, linked with the cardinality labels 1…1, 1…*, 1…*, 1…*]
That’s it.
API Configuration
o API
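To give an idea of what scripting against the API could look like, here is a minimal sketch that only builds request URLs. The base URL, the resource names (`jobs`, `jobstatuses`, `start`) and the job id are assumptions for illustration; the ManifoldCF API documentation is the reference for the resources that actually exist in your version.

```python
from urllib.parse import quote

# Assumed base URL of the API web application (hypothetical host and port).
API_BASE = "http://localhost:8345/mcf-api-service/json"


def api_url(resource, identifier=None, base=API_BASE):
    """Build a REST-style URL: <base>/<resource>[/<identifier>]."""
    url = f"{base}/{resource}"
    if identifier is not None:
        url += "/" + quote(str(identifier), safe="")
    return url


# Requests a deployment script might send (resource names are assumptions).
planned_requests = [
    ("GET", api_url("jobs")),                    # read job definitions
    ("GET", api_url("jobstatuses")),             # read job statuses
    ("PUT", api_url("start", "1444462827664")),  # ask a job to start, by id
]
for method, url in planned_requests:
    print(method, url)
```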
Test it!
o For testing purposes:
• java -jar start.jar
• All-in-one process
• Embedded database (HSQL)
Taking MCF to production
Multi-process deployment
o 3 web applications in a servlet container
• mcf-crawler-ui
• mcf-authority-service
• mcf-api-service
o Daemon agent
o Database
• PostgreSQL
o Synchronization: local filesystem or distributed (ZooKeeper)
Search files with security: Solr + MCF
o Our scenario
• File share using Active Directory
• Search with Solr
• With security constraints
Security model: Solr + MCF
o Authorization
• Early Binding
• Index documents with ACLs
• Compute authorization at runtime
o Authentication
• Not handled by Solr/ManifoldCF
• Front-end application should authenticate user
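The early-binding idea can be illustrated with a small sketch: each indexed document carries allow and deny tokens, and at query time the result set is filtered against the tokens of the authenticated user. The data model and the "deny wins over allow" rule below are simplifying assumptions for illustration, not a description of the actual plugin logic.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """A document as stored in the index, with its ACL tokens (hypothetical model)."""
    doc_id: str
    allow_tokens: set = field(default_factory=set)
    deny_tokens: set = field(default_factory=set)


def is_visible(doc, user_tokens):
    """Assumed rule: any matching deny token hides the doc, else one allow token is needed."""
    if doc.deny_tokens & user_tokens:
        return False
    return bool(doc.allow_tokens & user_tokens)


def filter_results(docs, user_tokens):
    """Return the ids of the documents the user is allowed to see."""
    return [d.doc_id for d in docs if is_visible(d, user_tokens)]


docs = [
    IndexedDoc("budget.xlsx", allow_tokens={"S-finance"}),
    IndexedDoc("menu.pdf", allow_tokens={"S-everyone"}),
    IndexedDoc("hr.docx", allow_tokens={"S-everyone"}, deny_tokens={"S-interns"}),
]
print(filter_results(docs, {"S-everyone", "S-interns"}))
```

Because the ACLs are stored at indexing time, a permission change on the repository only becomes visible to search after the next crawl.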
Search files with security: Solr + MCF
Phase 1: Indexing
[Diagram: in the ManifoldCF daemon agent, the JCIFS connector (repository side) crawls documents with ACLs from the Windows Share, and the Solr connector (output side) sends docs and ACLs to the Solr Extracting Handler. The ManifoldCF authority service with its AD connector (authorities side, backed by AD) and the Solr MCF plugin are also shown.]
Search files with security: Solr + MCF
Phase 2: Searching
[Diagram: the front end sends an authenticated search to Solr. The Solr MCF plugin gets the user access token from the ManifoldCF authority service (AD connector, backed by AD), filters docs based on ACLs and user info, and authorized results are returned. The daemon agent with its JCIFS and Solr connectors, the Extracting Handler and the Windows Share are also shown.]
Configure Solr + MCF
o ManifoldCF side
o 4 connections and 1 job
• Create Windows Share connection
• Create Solr connection
• Create Active Directory connection
• Create Authority Group connection
• Create a crawling Job
Component
[Diagram: relationships between Authority Group, Authority Connection, Output Connection, Repo Connection and Crawl Job. Cardinality labels shown: 0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*]
Component
[Diagram: the components of our scenario: AD Group, AD Connection, Windows Share Connection, Solr Connection, Crawl Job]
Configure Solr + MCF
o Front-end side
o Authentication
• For Tomcat
• JNDI Tomcat Realm
• TomcatSPNEGO
Configure Solr + MCF
o Solr side
o Modify schema.xml
• Add fields for security tokens
o Modify solrconfig.xml
• Add MCF Solr Plugin (query parser)
o And don’t forget to protect the Solr instance :-P
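Once the plugin is declared, a search request has to tell it who the authenticated user is. The sketch below builds the request parameters a front end might send. The parser name `manifoldCFSecurity` and the `AuthenticatedUserName` parameter are assumptions about how the plugin is commonly configured; in many setups the filter query is declared once in the search handler rather than sent by the client.

```python
from urllib.parse import urlencode


def secured_query_params(user_query, username, parser_name="manifoldCFSecurity"):
    """Build Solr request parameters that route the query through the security parser."""
    return {
        "q": user_query,
        "fq": "{!" + parser_name + "}",     # filter query handled by the (assumed) MCF parser
        "AuthenticatedUserName": username,  # user already authenticated by the front end
    }


params = secured_query_params("budget report", "jdoe@example.com")
print(urlencode(params))
```

Since the user name is passed as a plain parameter, anyone who can reach Solr directly could impersonate another user, which is why the Solr instance itself must be protected.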
Configure Solr + MCF
o Leverage the Solr Extracting handler
• Based on Apache Tika
• MIME type detection
• Embedded parsing libraries
• Supported formats:
• MS Office (OLE2 and OOXML)
• OpenDocument
• PDF
• Audio/video/image files
• OCR now available thanks to Tika 1.7 (and Tesseract)
o Extraction can now also be done directly in MCF!
Component
[Diagram: relationships between Authority Group, Authority Connection, Output Connection, Repo Connection, Crawl Job and Transformation Connection. Cardinality labels shown: 0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*, 0…*, 1…*]
Crawling principle
o Crawling model
• Incremental model
• Continuous model
[Figure: Phase 1 and Phase 2 of a crawl. Source: ManifoldCF in Action, Chapter 1 (Karl Wright)]
Incremental crawling of file share
o Incremental crawling not so easy with some repositories:
[Illustration: the Windows Share connector (JCIFS) asks the Windows Share: "Uhuuu, file share, what's new since last time we met?" The Windows Share answers: "Errkkk…"]
Incremental crawling of file share: Solr + MCF
o Phase 1: Discovery/Indexing (depth first) on the Windows Share
For each start path entry
    Fetch SMB file attributes
    If file is a directory and matches inclusion regex
        List files in SMB directory
        For each file (same steps, depth first)
    If file is a regular file and matches inclusion regex
        Check ingeststatus entry in crawler DB
        If no entry or the version attribute is different
            Fetch file content
            Push file to Solr
            Update ingeststatus entry in DB
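The discovery and indexing phase can be sketched as a small depth-first walk. This is an illustration under stated assumptions, not the connector's code: the share is an in-memory tree where directories are dicts and files are `(version, content)` tuples, `push` stands in for fetching the content and sending it to Solr, and `ingest_status` plays the role of the crawler database table.

```python
import re


def crawl(path, node, include, ingest_status, push):
    """Depth-first discovery; re-index a file only when its version changed."""
    if not include.search(path):          # inclusion regex applies to files and directories
        return
    if isinstance(node, dict):            # directory: list its entries and recurse
        for name, child in node.items():
            crawl(f"{path}/{name}", child, include, ingest_status, push)
        return
    version, _content = node              # regular file: (version, content)
    if ingest_status.get(path) == version:
        return                            # same version as last crawl: skip
    push(path)                            # stands in for "fetch content and push to Solr"
    ingest_status[path] = version         # record what was ingested


# Hypothetical share: directories are dicts, files are (version, content) tuples.
share = {"docs": {"a.txt": ("v1", "hello"), "b.tmp": ("v1", "scratch")}}
include = re.compile(r"^share(/docs(/.*\.txt)?)?$")

status, pushed = {}, []
crawl("share", share, include, status, pushed.append)  # first crawl indexes a.txt
crawl("share", share, include, status, pushed.append)  # second crawl finds nothing new
print(pushed)
```

The second call still walks the whole tree and reads every file's attributes; only the content fetch and the push to Solr are skipped, which is where the incremental saving comes from.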
Incremental crawling of file share
o What is an ingeststatus database entry?
o Simplified version:
DOCURI                          LAST_INGEST          LAST_VERSION
protocol://REPO_HOST/Doc1.docx  10.09.2015 18:21:04  Doc1_Version1
protocol://REPO_HOST/Doc2.docx  10.09.2015 19:21:04  Doc2_Version1
o LastVersion?
• Here, computed from lastModified and ACLs on the file, for example:
+S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/TestFile.txt+1444462827664:16Y
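To show why the version is built from both the modification time and the ACLs, here is a minimal sketch of such a version string. The layout (`allow=…;deny=…;modified=…`) is invented for readability and is not the encoding the Windows Share connector actually uses; the point is only that a change in either input yields a different string.

```python
def version_string(last_modified_ms, allow_sids, deny_sids=()):
    """Build a comparable version string from ACLs and modification time (made-up layout)."""
    allow = "+".join(sorted(allow_sids))   # sorted so that SID order does not matter
    deny = "+".join(sorted(deny_sids))
    return f"allow={allow};deny={deny};modified={last_modified_ms}"


before = version_string(1444462827664, ["S-1-5-18", "S-1-5-32-544"])
after_acl_change = version_string(1444462827664, ["S-1-5-18"])
after_edit = version_string(1444462999999, ["S-1-5-18", "S-1-5-32-544"])
print(before != after_acl_change, before != after_edit)
```

Including the ACLs means that a pure permission change, which leaves the file content and timestamp untouched, still triggers re-indexing.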
Incremental crawling of file share: Solr + MCF
o Phase 2: Deleting unreachable documents
For each crawler DB entry not reached in Phase 1
    Send delete command to Solr
    Update crawler database
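The clean-up phase can be sketched in a few lines, assuming the crawler keeps a set of the URIs it reached during the discovery phase. The names below are hypothetical, and `send_delete` merely stands in for a delete command sent to Solr.

```python
def delete_unreachable(ingest_status, reached, send_delete):
    """Remove every known document that the last crawl did not reach."""
    for uri in sorted(set(ingest_status) - set(reached)):
        send_delete(uri)            # stands in for a delete command sent to Solr
        del ingest_status[uri]      # update the crawler database


status = {"file://host/share/a.txt": "v1", "file://host/share/old.txt": "v1"}
deleted = []
delete_unreachable(status, {"file://host/share/a.txt"}, deleted.append)
print(deleted, sorted(status))
```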
How to see what happened
o Search History
o Monitoring
• Job Status
• Notification Connections
How to see what happened
o Search History
o History
• Simple History
• Maximum Activity
• Maximum Bandwidth
• Result Histogram
o Status
• Document Status
• Queue Status
Performance issue
o Find the bottleneck
• Crawled repository
• Network
• Solr
• MCF database
• MCF configuration
Handle performance issue
o Specific connector’s configuration
• Throttling
• Max JVM connections
o Can improve speed / limit impact on crawled repository
o Very specific to the repository
Handle performance issue
o Job settings
o Size limit of ingested documents
o Use regex to remove some extensions from crawl
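These two job settings amount to a simple predicate applied to each candidate file. The sketch below uses made-up values (a 10 MB limit and a few excluded extensions); in ManifoldCF itself they are set in the job configuration rather than in code.

```python
import re

# Hypothetical settings: skip archives and executables, cap ingested files at 10 MB.
EXCLUDED = re.compile(r"\.(iso|zip|exe)$", re.IGNORECASE)
MAX_BYTES = 10 * 1024 * 1024


def should_ingest(path, size_bytes, excluded=EXCLUDED, max_bytes=MAX_BYTES):
    """Return True when the file passes both the size limit and the extension filter."""
    return size_bytes <= max_bytes and not excluded.search(path)


for path, size in [("report.docx", 120_000), ("backup.ZIP", 5_000), ("video.mp4", 800_000_000)]:
    print(path, should_ingest(path, size))
```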
Investigate errors
• Increase connector’s log level
• Read MCF simple history
• Thread Dump
Common errors in file crawling
o Crawler account rights
o Exotic files
o Very biiiiiiig files
o JCIFS errors
o Solr connector timeout
When to use ManifoldCF?
q = crawled_environment:heterogeneous
OR scenario:intranet
OR security:mandatory
References
o ManifoldCF documentation
https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
o ManifoldCF in Action (K. Wright)
https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
o Securing Solr documents with MCF (K. Wright)
http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
o France Labs blog posts:
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/
Datafari
[Screenshots: Search and Admin interfaces]
o Intranet “ready to play” search solution
• Apache License
o Embeds:
• Solr
• ManifoldCF
o And other cool stuff:
• Admin and responsive search UI
• User Management
• Banana for user behavior analysis
• Tesseract OCR
• A funny zebra
• Etc…
www.datafari.com
aurelien.mazoyer@francelabs.com
@francelabs
www.francelabs.com


Editor's Notes

  • #3: Start: 0:00. End: 1:05. Hi, thank you. Me: Aurélien MAZOYER, co-founder of France Labs, an open source company based in France. We offer consulting on search technologies and their ecosystem. Datafari, an intranet search solution: say a few words about Datafari at the end of the talk. Topic. Survey: how many of you have ever used ManifoldCF?
  • #4: Start: 1:05. End: 1:40. 3 parts: an overview of MCF; a case study on the integration of MCF with Solr in order to search files; what happens with MCF.
  • #5: Start 1:40 to 2:45. CF stands for Connector Framework. That means it is a tool that helps you connect to heterogeneous repositories. Push the data to your favorite search engine. Keep it synchronized. Take access rights into account to perform authenticated searches. Provides a complete UI and REST API.
  • #6: Start 2:45 to 3:15. Karl Wright, when he worked for MetaCarta. Apache Top Level Project since 2012. Active project: last release this summer.
  • #7: Start 3:15 to 4:25. Plenty of connectors included in ManifoldCF. What is called You can write your own (ManifoldCF in Action gives you the best practices for writing your own connector). Domain controller, such as an Active Directory. Search engine.
  • #8: Start 4:25 to 5:03. Contains different components. You can see the different connectors for the interaction with the external world. Administration interface. Talk about it in a few slides. You cannot see here the underlying database, the backbone of the solution.
  • #9: Start 5:03 to 5:30. Actually does the crawling job.
  • #10: Start 5:30 to 6:26. You add the username as a parameter and it provides the security tokens for a specific user. For example, it gives the SID of the user in Active Directory, and the SIDs of all the groups they belong to.
  • #11: Start 6:26 to 7:14. Also a web application. Administrates MCF. To begin, you will have to create a crawl job. Start the job. Once you are done, you are able to start your crawl.
  • #12: Start 7:14 to 8:00. What can be done in the admin. Put new config. Send commands. Respects REST standards.
  • #13: Start 8:00 to 8:49. Very simple to test it. Extract the binary distribution, open the example directory. Not unfamiliar to Solr users. Not the recommended way to run it in production (mainly because of the HSQL database).
  • #14: Start 8:49 to 10:00. The components we described run in different processes. The database is very important. One of the recommended databases. Synchronize via a local folder on the machine or with ZooKeeper.
  • #15: Start 10:00 to 11:07. Here is our scenario. Let's imagine an intranet. Users who authenticate against AD. They put their files on shared folders. You have access rights on folders based on the user. But specific permissions for some users. Of course it is a mess, so they need a good search engine to find their documents. Quite simple, not very unusual, but it can be a nightmare if you don't have the right tool. We are here in a fully proprietary environment. But we will see that MCF and Solr can deal with it.
  • #16: Start 11:07 to 12:00. A few words. Authorization, when a user runs a Solr query. Neither Solr nor MCF will do this job. Up to the front-end application.
  • #17: Start 12:00 to 12:34. Go back to the big picture. Step 1: the JCIFS connector fetches documents with their access controls and pushes them to the Solr Extracting Handler.
  • #18: Start 12:34 to 13:00. Step 2: the front end sends an authenticated query. Retrieves the security tokens linked to the current user. Then runs a normal search and filters the result set with the help of the document access control list and the user security tokens.
  • #19: Start 13:00 to 13:44 How can we actually implement that. ON the mcfside Windows share connection. Some few step to do (download last version of JCIFS library, uncomment the windows share line in the connectors config file)
  • #20: Start 13:44 to 14:03 Authority connector should be belong to an authority group
  • #21: Start 14:03 to 14:18 That’s it for manifold
  • #22: Start 14:18 to 15:00 Told you Front end is in charge of the authentication LDAP protocol to authenticate TomcatSPNEGO (Active directory). Spénégo : use single sign on
  • #23: Start 15:00 to 15:57 Add fields that will contains the access control list of the document Declare the MCF plugin Configure the endpoint of the authority service Add a filter query that uses this plugin in your search handler This is for the search handler
  • #24: Start 15:57 to 16:30 For the update handler: it is a default extracting handler that integrates Apache Tika. As a reminder, since Solr 5, the extracting handler can run Tesseract to extract content from images. Solr can do this job.
  • #25: Start 16:30 to 17:22 In new versions of ManifoldCF, it can also be done on the MCF side. In fact, there is a processing pipeline: you can do field mapping but also Tika extraction. Perfect if you don’t want to send big files over the network.
  • #26: Start 17:22 to 18:11 Now we will try to understand what is going on under the hood during our crawl. These two crawling models are available with ManifoldCF. To avoid indexing ... Discover new documents, remove old ones.
  • #27: Start 18:11 to 18:33 Some repositories work well with incremental crawling. Others don’t. Unfortunately, our Windows share won’t be able to answer
  • #28: Start 18:33 to 20:00 Therefore, the JCIFS connector. How does the Windows share connector handle incremental crawling? If it is a file ... The next slide is about the version attribute. Fetch from the Windows share.
  • #29: Start 20:00 to 20:50 For each document: last version. Depends on the repository.
  • #30: Start 20:50 to 21:04 This was for step 1. We can repeat these 2 steps in order to keep our data synchronized. We have covered how to configure this and we’ve described how it works under the hood. Now it is in production mode and you want to be sure of what is going on.
  • #31: Start 21:04 to 22:33 A lot of information is available, through the UI or the API. Send an alert if something went wrong or just if the crawl is finished.
  • #32: Start 22:33 to 23:00 You also have a tab that shows you a history of all the different activities. Document status: for example, if you want to see whether a document has already been ingested in the current crawl. Maximum bandwidth will give you information on crawling performance.
  • #33: Start 23:00 to 23:16 Unfortunately, sometimes you are facing performance issues. The obvious one: a crawled repository that is overloaded. It can be because of the network: you should capture packets with Wireshark. The Solr server: for example, if the autocommit frequency is too high. The MCF database is an important component; be sure that you followed the best practices in the documentation.
  • #34: Start 23:16 to 24:25 Maybe it is because of the configuration of your connector. Two main parameters can have an impact on performance. Throttling: fixing a hard limit on document fetching (useful if you are doing web crawling and don’t want to be banned by the webmaster). Max connections that will be made to the system: it can be a good idea to increase this value if we want to do web crawling. But a Windows share won’t work very well with a lot of connections, so in our scenario we should use a small value.
  • #35: Start 24:25 to 25:15 In the job settings, you can filter the documents that you want to index. For an intranet file share, you probably don’t want to index the latest Star Wars movie that an employee wanted to share with their colleagues.
  • #36: Start 25:15 to 26:00 Those were some examples of performance issues. But unfortunately, it can be even worse: you can face errors. If you are facing errors, a thread dump can give you information.
  • #37: Start 26:00 to 28:08 One common problem is when the account you use for the crawl doesn’t have sufficient rights. It must be able to read everything and to read the ACLs of each file. It may need special rights, such as Print Operator. As we just saw, we can use an exclusion regex or a size limit. Also be sure to set ignore Tika exception in Solr. JCIFS errors, linked or not to network issues: timeouts. Sometimes they can be solved by increasing the JCIFS timeout. Sometimes you may have to increase the Solr timeout. Big processing.
  • #38: Start 28:08 to 28:45 That is what can happen in real life. To conclude: for massive web crawling, Nutch is the best tool for you. Then, go for it.
  • #39: Start 28:45 to 29:18 Here are some references. That is now freely available. You can have a look at our blog posts, which show you how to run through the different steps that I covered in the file search scenario I described.
  • #40: Start 29:18 to 29:40 If you are too lazy to integrate Solr and ManifoldCF by yourself
  • #41: Start 29:40 to 30:00 Thank you very much for your attention. I will be pleased to answer any questions you may have.
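
The query-time filtering described in the notes for slides #18 and #23 (match the user's security tokens against the access control list stored with each document) can be illustrated with a minimal Python sketch. This is not the ManifoldCF Solr plugin itself: the `IndexedDoc` shape, the function names, the token strings, and the simplified "deny wins over allow" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    """A search hit with the access tokens stored at crawl time (hypothetical shape)."""
    doc_id: str
    allow_tokens: frozenset
    deny_tokens: frozenset = frozenset()


def is_visible(doc, user_tokens):
    """Assumed rule: any matching deny token hides the document;
    otherwise at least one allow token must match."""
    if doc.deny_tokens & user_tokens:
        return False
    return bool(doc.allow_tokens & user_tokens)


def filter_results(hits, user_tokens):
    """Keep only the hits that the user's security tokens grant access to."""
    tokens = frozenset(user_tokens)
    return [hit for hit in hits if is_visible(hit, tokens)]


# Hypothetical documents and tokens, standing in for what a crawl would index.
hits = [
    IndexedDoc("budget.xlsx", frozenset({"S-finance"})),
    IndexedDoc("handbook.pdf", frozenset({"S-everyone"})),
    IndexedDoc("review.docx", frozenset({"S-everyone"}), frozenset({"S-alice"})),
]

# Tokens that an authority service might return for one user (made up here).
alice_tokens = {"S-everyone", "S-alice"}
visible_ids = [hit.doc_id for hit in filter_results(hits, alice_tokens)]
print(visible_ids)
```

In the setup described in the talk, this check runs inside Solr as a filter query, so the front end only has to pass the authenticated user name along with the search.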