Angel Borroy
Tom Page
10th June 2020
(Re)Indexing
Large Repositories
Agenda
(Re)Indexing Large Repositories
• Alfresco SOLR in a Nutshell
• Indexing Process
• Indexing Scenarios
• When to Re-Index
• Deployment Alternatives
• Demo time without downtime!
• Benchmark Review
• Improvements in 1.4.2
• Future Improvements
• Recap
Alfresco SOLR
Alfresco SOLR in a Nutshell
SOLR 6 is used in Alfresco to perform two main processes:
• Indexing (or tracking) metadata, permissions and content from Alfresco Repository
• Returning results from search queries supporting several syntaxes (AFTS, CMIS)
Indexing process: asynchronous
Searching process: synchronous, with permission checks; supported syntaxes are AFTS and CMIS
Eventual consistency
SOLR indexes the information after the database has committed the transaction, so there is a short period of time when not all the documents are available in the SOLR Index. We call this eventual consistency, as SOLR will eventually catch up with the Repository.
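Eventual consistency can be observed by polling until the index has caught up with the repository. The following is a minimal sketch in Python, assuming two caller-supplied functions that return the document counts on each side; `repo_count` and `index_count` are hypothetical placeholders, not Alfresco or SOLR APIs.

```python
import time
from typing import Callable


def wait_until_indexed(
    repo_count: Callable[[], int],
    index_count: Callable[[], int],
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Poll until the index reports at least as many documents as the repository.

    Both callables are hypothetical placeholders supplied by the caller.
    Returns True if the index caught up before the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if index_count() >= repo_count():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A test that creates content and then searches for it straight away would typically wrap the search in a wait like this rather than assume the document is immediately visible.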
Alfresco SOLR in a Nutshell
Alfresco SOLR Storage
By default two SOLR cores are created, one for live documents (alfresco) and one for removed documents (archive).
Each core includes the following storage folders:
• Default SOLR Index files in the solrhome/<core>/index folder
• Alfresco customized Content Store in the contentstore folder
• This folder includes a cached copy of Repository content and metadata
• Content Store will be removed in Search Services 2.0
These folders are populated by the Indexing Process
Indexing process
● Each tracker is fired asynchronously according to a cron expression: alfresco.cron or alfresco.*.tracker.cron
● Transactions and ACL Change Sets are processed in batches of Nodes or ACLs
● Batches are split to be executed in parallel by Workers
● However, Content Tracker recovers text from content nodes one by one
● Commit Tracker writes the changes from the different Trackers to SOLR Index "eventually"
>> Cascade Tracker is not running when indexing from scratch
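The batching and worker model described above can be sketched in Python. This is an illustration only, not the Search Services implementation: `split_into_batches`, `index_in_parallel` and the `index_batch` callback are hypothetical names, and a thread pool stands in for the tracker Workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_batches(node_ids: Sequence[int], batch_size: int) -> List[List[int]]:
    """Split a list of node IDs into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(node_ids[i:i + batch_size]) for i in range(0, len(node_ids), batch_size)]


def index_in_parallel(
    node_ids: Sequence[int],
    batch_size: int,
    max_workers: int,
    index_batch: Callable[[List[int]], int],
) -> int:
    """Hand each batch to a worker; index_batch returns how many nodes it indexed."""
    batches = split_into_batches(node_ids, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the batch order; the sum is the total number of nodes indexed.
        return sum(pool.map(index_batch, batches))
```

The batch size and the number of workers are the two knobs here; the text extraction done by the Content Tracker, being one node at a time, would not benefit from this split.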
Indexing scenarios
1. When updating the repository using applications or bulk ingestion processes, the transactions will include a long list of nodes to be indexed
2. When using Alfresco Share to create new content, there will be more transactions but every transaction will include a small list of nodes to be indexed
3. When setting the permission level for every node in a hierarchy manually, the ACL Change Sets will include a long list of ACLs to be indexed
4. When using the default Alfresco permissions design, the ACL Change Sets will include a small list of ACLs to be indexed
5. When using complex document formats, the Transformation Service will require additional resources
6. When using large documents, the SOLR Index will require additional storage
Indexing scenarios
Controlling what to index
• Content can be excluded from SOLR Index by configuration
solrcore.properties > alfresco.index.transformContent=false
https://docs.alfresco.com/search-community/concepts/solrcore-properties-file.html
Add this setting (alfresco.index.transformContent=false) to the archive core by default!
• Some nodes can be excluded from SOLR Index by using the Index Control aspect
cm:indexControl > cm:isIndexed :: false, metadata and content are not indexed
cm:indexControl > cm:isContentIndexed :: false, content is not indexed
https://docs.alfresco.com/community/concepts/admin-indexes.html
• Some properties can be excluded from SOLR Index by design in the Content Model
<property>
<index enabled="false"/>
</property>
https://docs.alfresco.com/community/references/dev-extension-points-content-model-define-and-deploy.html
When to re-index
The re-indexing process can take from a few hours to a few days in large repositories.
Full re-index
• When upgrading to a major Search Services release, like 2.0
• When the SOLR Index has been corrupted, due to technical reasons
• When breaking changes are introduced in common custom Content Models
Partial re-index
• This process can also take some time, depending on the number of documents to be re-indexed, but it will take less than a full re-index
• When incremental changes are introduced in a Content Model, a partial re-index can be triggered by using the SOLR REST API
http://localhost:8983/solr/admin/cores?action=reindex&query=TYPE:person
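The partial re-index request can be built programmatically. A minimal sketch in Python that reproduces the URL shown on the slide; the function name `build_reindex_url` is hypothetical, and only the `action=reindex` and `query` parameters from the slide are used.

```python
from urllib.parse import urlencode


def build_reindex_url(query: str, base_url: str = "http://localhost:8983/solr") -> str:
    """Build the admin URL that asks SOLR to re-index the documents matching query."""
    # Keep ':' unescaped so that TYPE:person stays readable, as on the slide.
    params = urlencode({"action": "reindex", "query": query}, safe=":")
    return f"{base_url}/admin/cores?{params}"


# To actually send the request (assuming a reachable, unsecured SOLR endpoint):
#   import urllib.request
#   urllib.request.urlopen(build_reindex_url("TYPE:person"))
```

Characters other than ':' are percent-encoded, so queries containing quotes or spaces remain valid in the URL.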
Deployment alternatives
https://docs.alfresco.com/sie/concepts/solr-shard-overview.html
https://docs.alfresco.com/sie/concepts/solr-replication.html
https://docs.alfresco.com/sie/tasks/solr-install.html
Installation alternatives
• Using the ZIP Distribution file
https://docs.alfresco.com/search-community/concepts/solr-install-config.html
• Using Docker or Docker Compose
https://github.com/Alfresco/SearchServices/tree/master/search-services
https://github.com/Alfresco/acs-community-deployment/tree/master/docker-compose
https://github.com/Alfresco/alfresco-docker-installer
• Using Kubernetes
https://github.com/Alfresco/acs-community-deployment/tree/master/helm/alfresco-content-services-community
Deployment for Re-Indexing
Deployment schema to minimize downtime in re-indexing processes
> When using a different SOLR version, configure Alfresco Repository to use the new SOLR server *
> When using the same SOLR version, the INDEX folder can be used directly
* Upgrading from SOLR 4 to SOLR 6 is not allowed when using Alfresco CE 6.2.0-ga (thanks for raising this @AFaust) >> SEARCH-2289
Alfresco Repository Indexing Configuration
When configuring an Alfresco Node to perform the re-indexing process, there are some services you can switch off depending on your requirements:
• Scheduled Jobs can be disabled, as they will be run by the Alfresco instance in the live service
https://docs.alfresco.com/6.2/concepts/scheduled-jobs.html
• Some ACS features can be disabled
https://docs.alfresco.com/6.2/concepts/maincomponents-disable.html
• Additional subsystems (apart from Search or Transformation) can be disabled, such as Activities, Audit, Email, …
https://docs.alfresco.com/6.2/concepts/subsystem-categories.html
Don't make a copy of your Alfresco Repository production configuration and press the start button!
Monitoring
Profiling
• Using VisualVM or YourKit Java Profiler for the JVMs (Repository, SOLR)
• Using the pg_stat_statements extension or some other DB tool
https://hub.alfresco.com/t5/alfresco-content-services-blog/alfresco-6-profiling-with-docker/ba-p/295846
https://github.com/aborroy/alfresco-6-profiling
Monitoring
• Using Prometheus with Grafana (Repository, SOLR)
https://hub.alfresco.com/t5/alfresco-content-services-blog/monitoring-alfresco-solr-with-prometheus-and-grafana/ba-p/294157
https://github.com/aborroy/alfresco-solr-monitoring
Demo time without downtime!
• Live Docker Compose environment running with around 4,000 text documents indexed
• Using YourKit Java Profiler to monitor Repository performance
• Starting a new Search Services 2.0 server locally to start indexing the repository
• Once the Search Services 2.0 index is up to date, change the Solr hostname value from the Admin Web Console or modify alfresco-global.properties
Search Services 2.0 is not released yet!
http://127.0.0.1:8083/solr/alfresco/select?indent=on&q=TEXT:[* TO *]&wt=json
http://127.0.0.1:8983/solr/alfresco/select?indent=on&q=TEXT:[* TO *]&wt=json
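The two queries above can be used to check whether the new server has caught up with the current one before switching the Repository over. A minimal sketch in Python, assuming the standard Solr JSON response layout (`response.numFound`); the helper names are hypothetical, and fetching the response bodies is left to the caller.

```python
import json
from urllib.parse import urlencode


def build_count_url(base_url: str, core: str = "alfresco") -> str:
    """Build the select URL used in the demo for the given SOLR base URL."""
    params = urlencode({"indent": "on", "q": "TEXT:[* TO *]", "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"


def num_found(response_body: str) -> int:
    """Extract the number of matching documents from a Solr JSON response."""
    return json.loads(response_body)["response"]["numFound"]


def indexes_match(current_body: str, new_body: str) -> bool:
    """True when both servers report the same number of documents with text."""
    return num_found(current_body) == num_found(new_body)
```

Matching counts are only a coarse signal; they say nothing about permissions or metadata being identical on both servers.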
Benchmark Review
1 Billion Documents Review (2015)
• Review of the 1 billion document benchmark (November 2015)
• 10 repository nodes (Alfresco 5.1), 20 Solr 4 nodes (Alfresco Index Server)
• Indexed 1 billion documents in 5 days
How Alfresco powered a 1.2 Billion document deployment on Amazon Web Services
Improvements in 1.4.2
1.2 Billion Baseline Plan (2020)
• Customer-sponsored benchmark to measure the performance of the system with the customer's configuration
• Goal: 1.2 billion documents indexed into Search Services
• 20 instances, each containing a single shard (DB_ID_RANGE based sharding)
Performance considerations
Bottlenecks:
• Database (getChildAssocs)
• Transformers (when using large documents)
• Network (when using large metadata/content)
• Time spent processing data for other shards
Baseline Results
• Estimated completion in 21 days
DB_ID_RANGE Sharding
• Does not require specifying total number of shards in advance
• Index can continue to grow with repository
See https://docs.alfresco.com/search-enterprise/concepts/solr-shard-approaches.html
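The idea behind range-based sharding can be sketched as a lookup from a node's database ID to the shard that owns its range. This Python sketch is illustrative only: the range boundaries, the half-open interval convention and the function name `shard_for` are assumptions, not the Search Services configuration format.

```python
from typing import List, Optional, Tuple

# Each shard owns a half-open range of database IDs: start <= db_id < end (assumed convention).
ShardRange = Tuple[int, int]


def shard_for(db_id: int, ranges: List[ShardRange]) -> Optional[int]:
    """Return the index of the shard whose range contains db_id, or None if no shard covers it yet."""
    for shard, (start, end) in enumerate(ranges):
        if start <= db_id < end:
            return shard
    return None


ranges: List[ShardRange] = [(0, 100), (100, 200)]
# A new shard can be appended later for higher IDs without touching existing shards.
ranges.append((200, 300))
```

Because existing ranges never change, growing the repository only means adding a shard for the next range, which is why the total number of shards does not have to be fixed in advance.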
Cascade Tracking
Time spent processing transactions for other shards
• With DB_ID_RANGE sharding we know that only a range of transactions is relevant
• Skip transactions when using DB_ID_RANGE
• To support path queries we sometimes need to update data on multiple shards from a single change
• Option to disable cascade tracking
Reduce Database Access and Network Usage
• Reduce amount of data requested
• Remove unused calls to getChildAssocs
• Compress communication where appropriate
• Add option to compress content transfer
[Figure: a request for all metadata of a node compared with a request for only fields X, Y and Z; plain content ("Lorem ipsum dolor sit amet, consectetur adipiscing elit...") compared with its compressed bytes (78 9c 05 c1 81 09 c0 30 08 04 c0 ...)]
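The effect of compressing content transfer can be illustrated with Python's standard `zlib` module. This is a sketch under the assumption that a deflate-style compression is acceptable for illustration; the slides do not state which algorithm Search Services uses. The bytes `78 9c` shown in the figure are the header that zlib emits at its default compression level.

```python
import zlib


def compress_payload(text: str) -> bytes:
    """Compress a text payload with zlib at the default compression level."""
    return zlib.compress(text.encode("utf-8"))


def decompress_payload(data: bytes) -> str:
    """Reverse compress_payload."""
    return zlib.decompress(data).decode("utf-8")


# Repetitive text, such as extracted document content, compresses well.
payload = "Lorem ipsum dolor sit amet, consectetur adipiscing elit... " * 50
compressed = compress_payload(payload)
```

The saving depends on the content: plain text and metadata usually shrink considerably, while already-compressed binary formats gain little.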
Overview of Improvements in 1.4.2
• Search Services 1.4.2 (and Insight Engine 1.4.2)
• ACS Repository 6.2 Enterprise
• No ACS Community release containing this yet
• However, you can use an existing ACS with the jars from https://github.com/aborroy/solr-performance-services-repo
Reindex of 1.2b documents in 10 days (6 repo nodes, 20 search nodes)
Search Services 1.3.0: 150 documents/second
Search Services 1.4.2: 1200-3500 documents/second* (depending on the number of shards, size of documents, etc.)
* Depending on exact configuration
(NB: not yet validated on the production system)
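The figures above can be related to each other with simple arithmetic. A minimal sketch in Python, assuming the quoted rates are sustained overall throughput, which the slides do not state explicitly; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(documents: int, days: float) -> float:
    """Average documents/second needed to index `documents` in `days`."""
    return documents / (days * SECONDS_PER_DAY)


def days_needed(documents: int, docs_per_second: float) -> float:
    """Days required to index `documents` at a constant rate."""
    return documents / docs_per_second / SECONDS_PER_DAY


TOTAL = 1_200_000_000  # 1.2 billion documents

rate_for_10_days = average_rate(TOTAL, 10)   # about 1389 documents/second
days_at_150 = days_needed(TOTAL, 150)        # about 92.6 days
days_at_1200 = days_needed(TOTAL, 1200)      # about 11.6 days
days_at_3500 = days_needed(TOTAL, 3500)      # about 4.0 days
```

Under that assumption, a 10-day re-index corresponds to an average of roughly 1,389 documents/second, which falls inside the 1200-3500 range quoted for 1.4.2.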
Future Improvements
Future Improvements - Coming in 2.0.0
• Schema Simplification
• Smaller index
• Removing Duplicate Fields
• Smaller communication
• Improved Trackers
• Less duplication with large transactions
• New tracker parallelism option
• Content Store Removal
• Reduced disk usage
• Less duplication
• Better usage of Solr optimisations
• Adds potential to use other Solr features
Improved Trackers - Testing
Scenarios datasets
• 100,000 documents created with 100,000 transactions
• 100,000 documents created with 1 transaction
• Changing the path for 100,000 documents
• 200,000 ACLs created with 200,000 ACL change sets
Parameters investigated
• The existing *BatchSize parameters
• The new *MaxParallelism parameters
• These change the number of workers assigned to the tracker. They use a ForkJoinPool, and can impact the resources available to other processes
Hotspot calculation
• Increasing the Transaction Batch Size for nodes and ACLs improves performance until the maximum for your deployment is reached. Beyond that point, increasing this batch size makes no difference to performance
• Increasing the Node Batch Size can improve performance while you are below the right number for your deployment. Beyond that point, increasing this batch size penalises performance
• Increasing the maximum number of Parallel Threads improved performance until the maximum number for our deployment was reached. However, in a real-world deployment it may be useful to use a lower number to avoid impacting other processes.
[Chart axes: Duration (ms), #]
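Finding the hotspot for a parameter amounts to measuring the same workload with several candidate values and keeping the fastest. A minimal sketch in Python of such a sweep; `sweep`, `best_value` and the `run` callback are hypothetical names, and this is not the harness used for the tests described here.

```python
import time
from typing import Callable, Dict, Iterable


def sweep(values: Iterable[int], run: Callable[[int], None]) -> Dict[int, float]:
    """Time run(value) once per candidate value; durations are in milliseconds."""
    durations: Dict[int, float] = {}
    for value in values:
        start = time.perf_counter()
        run(value)
        durations[value] = (time.perf_counter() - start) * 1000.0
    return durations


def best_value(durations: Dict[int, float]) -> int:
    """Return the candidate with the shortest duration (first one wins on a tie)."""
    return min(durations, key=lambda value: durations[value])
```

In practice each candidate should be run several times against the same dataset, since a single timing is noisy, and the chosen value may be deliberately lower than the fastest one to leave resources for other processes.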
Content Store Removal
• Solr Content store removal will reduce disk usage and simplify replication
[Figures: the Solr Content Store; replication of the index across Solr nodes]
Recap
When to re-index
• When upgrading to major Search Services releases
How to re-index
• Running some small tests to ensure the performance of the indexing process before running it in production
• Indexing from scratch with the upgraded Repository
• Indexing in a parallel deployment
How to measure
• Profiling
• Monitoring
Thank you!
Relevant works
https://nathanmcminn.com/2017/01/11/alfresco-and-solr-search-reindexing-and-index-cluster-size/
https://www.slideshare.net/JosePortillo26/jose-portillo-dev-con-presentation-1138
https://www.slideshare.net/angelborroy/2019-dev-con115angelborroy
https://blog.xenit.eu/blog/ethias-sharding
https://hub.alfresco.com/t5/alfresco-content-services-blog/large-repository-upgrades/ba-p/287877
https://hub.alfresco.com/t5/alfresco-content-services-blog/scaling-search-with-db-id-range/ba-p/287900
https://www.alfresco.com/technical-whitepaper/alfresco-content-services-solr-deployment-options
https://www.alfresco.com/technical-whitepaper/alfresco-content-services-solr-deployment-example-aws
https://docs.alfresco.com/6.2/concepts/upgrade-path.html

(Re)Indexing Large Repositories in Alfresco

  • 1. Angel Borroy Tom Page 10th June 2020 (Re)Indexing Large Repositories
  • 2. 22 Agenda (Re)Indexing Large Repositories • Alfresco SOLR in a Nutshell • Indexing Process • Indexing Scenarios • When to Re-Index • Deployment Alternatives • Demo time without downtime! • Benchmark Review • Improvements in 1.4.2 • Future Improvements • Recap Alfresco SOLR
  • 3. 3 Alfresco SOLR in a Nutshell SOLR 6 is used in Alfresco to perform two main processes: • Indexing (or tracking) metadata, permissions and content from Alfresco Repository • Returning results from search queries supporting several syntaxes (AFTS, CMIS) Indexing process Asynchronous
  • 4. 4 Searching process Eventual consistency SOLR is indexing the information after the database has committed the transaction, so there is a short period of time when not all the documents are available in SOLR Index. We call this eventual consistency, as SOLR will catch up with Repository eventually. Syntax AFTS CMIS Alfresco SOLR in a Nutshell Permission Checks Synchronous
  • 5. 5 Alfresco SOLR in a Nutshell Alfresco SOLR Storage By default two SOLR cores are created, one for the living documents (alfresco) and one for the removed documents (archive). Each core includes following storage folders: • Default SOLR Index files in the solrhome/<core>/index folder • Alfresco customized Content Store in the contentstore folder • This folder includes a cached copy of Repository content and metadata • Content Store will be removed in Search Services 2.0 “These folders are populated by the Indexing Process
  • 6. 6 Indexing process ● Each tracker is fired asynchronously according to a cron expression: alfresco.cron or alfresco.*.tracker.cron ● Transactions and ACL Change Sets are processed in batches of Nodes or ACLs ● Batches are split to be executed in parallel by Workers ● However, Content Tracker recovers text from content nodes one by one ● Commit Tracker writes the changes from the different Trackers to SOLR Index "eventually" >> Cascade Tracker is not running when indexing from scratch
  • 7. 7 Indexing scenarios 1. When updating the repository using applications or bulk ingestion processes, the transactions will include a long list of nodes to be indexed 2. When using Alfresco Share to create new content, there will be more transactions but every transaction will include a small list of nodes to be indexed 3. When setting the permission level for every node in a hierarchy manually, the ACL Change Sets will include a long list of ACLs to be indexed 4. When using default Alfresco permissions design, the ACL Change Sets will include a small list of ACLs to be indexed 5. When using complex format of documents, Transformation Service will require additional resources 6. When using large documents, SOLR Index will require additional storage
  • 8. 8 Indexing scenarios Controlling what to index • Content can be excluded from SOLR Index by configuration solrcore.properties > alfresco.index.transformContent=false https://docs.alfresco.com/search-community/concepts/solrcore-properties-file.html • Some nodes can be excluded from SOLR Index by using the Index Control aspect cm:indexControl > cm:isIndexed :: false, metadata and content is not indexed cm:indexControl > cm:isContentIndexed :: false, content is not indexed https://docs.alfresco.com/community/concepts/admin-indexes.html • Some properties can be excluded from SOLR Index by design in the Content Model <property> <index enabled=”false”/> </property> https://docs.alfresco.com/community/references/dev-extension-points-content-model-define-and-deploy.html Add this setting to archive core by default!
  • 9. 9 Re-indexing process can take some time, from a few hours to a few days, in large repositories. Full re-index • When upgrading to a major Search Services release, like 2.0 • When the SOLR Index has been corrupted, due to technical reasons • When breaking changes are introduced in common custom Content Models Partial re-index • This process could also take some time, depending on the amount of documents to be re-indexed. But it will take less than a full re-index • When incremental changes are introduced in a Content Model, partial reindexation can be fired by using the SOLR REST API http://localhost:8983/solr/admin/cores?action=reindex&query=TYPE:person When to re-index
  • 11. 11 • Using the ZIP Distribution file https://docs.alfresco.com/search-community/concepts/solr-install-config.html • Using Docker or Docker Compose https://github.com/Alfresco/SearchServices/tree/master/search-services https://github.com/Alfresco/acs-community-deployment/tree/master/docker-compose https://github.com/Alfresco/alfresco-docker-installer • Using Kubernetes https://github.com/Alfresco/acs-community-deployment/tree/master/helm/alfresco-content-services-community Installing alternatives
  • 12. 12 Deployment schema to minimize downtime in re-indexing processes > When using different SOLR version, configure Alfresco Repository to use the new SOLR server * > When using the same SOLR version, INDEX folder can be used directly * Upgrading from SOLR 4 to SOLR 6 is not allowed when using Alfresco CE 6.2.0-ga (thanks for raising this @AFaust) >> SEARCH-2289 Deployment for Re-Indexing
  • 13. 13 When configuring an Alfresco Node to perform the reindexing process, there are some services you can switch off depending on your requirements: • Scheduled Jobs can be disabled, as they will be run by the Alfresco instance in the living service https://docs.alfresco.com/6.2/concepts/scheduled-jobs.html • Some ACS features can be disabled https://docs.alfresco.com/6.2/concepts/maincomponents-disable.html • Additional subsystems (apart from Search or Transformation) can be disabled https://docs.alfresco.com/6.2/concepts/subsystem-categories.html • Activities • Audit • Email • … “Don’t make a copy of your Alfresco Repository production configuration and press the start button! Alfresco Repository Indexing Configuration
  • 14. 14 Monitoring • Profiling: • Using VisualVM or YourKit Java Profiler for the JVMs (Repository, SOLR) • Using the pg_stat_statements extension or some other DB tool • https://hub.alfresco.com/t5/alfresco-content-services-blog/alfresco-6-profiling-with-docker/ba-p/295846 • https://github.com/aborroy/alfresco-6-profiling • Monitoring: • Using Prometheus with Grafana (Repository, SOLR) • https://hub.alfresco.com/t5/alfresco-content-services-blog/monitoring-alfresco-solr-with-prometheus-and-grafana/ba-p/294157 • https://github.com/aborroy/alfresco-solr-monitoring
  • 16. 16 Demo time without downtime! • Live Docker Compose environment running with around 4,000 text documents indexed • Using YourKit Java Profiler to monitor Repository performance • Starting a new Search Services 2.0 server locally to start indexing the repository • Once the Search Services 2.0 index is up to date, change the Solr hostname value from the Admin Web Console or modify alfresco-global.properties • Note: Search Services 2.0 is not released yet! • Query URLs: http://127.0.0.1:8083/solr/alfresco/select?indent=on&q=TEXT:[* TO *]&wt=json , http://127.0.0.1:8983/solr/alfresco/select?indent=on&q=TEXT:[* TO *]&wt=json
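One way to decide when the new index has caught up is to compare document counts from the two query URLs on the demo slide. The following is a minimal sketch, not part of the original demo: the function names are hypothetical, it assumes both Solr servers are reachable over plain HTTP without authentication, and it relies only on the standard `response.numFound` field of a Solr `select` response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(base_url: str, core: str = "alfresco") -> str:
    """Build a select URL that asks only for the hit count (rows=0)."""
    params = urlencode({"q": "TEXT:[* TO *]", "rows": 0, "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"


def num_found(response_body: str) -> int:
    """Extract the hit count from a Solr JSON select response."""
    return json.loads(response_body)["response"]["numFound"]


def fetch_count(base_url: str) -> int:
    """Query one Solr server (network call; assumes no authentication)."""
    with urlopen(count_url(base_url), timeout=10) as resp:
        return num_found(resp.read().decode("utf-8"))


def caught_up(live_count: int, new_count: int) -> bool:
    """The new index is a candidate for switch-over once it holds as many documents."""
    return new_count >= live_count


# Example (requires both servers from the slide to be running):
#   caught_up(fetch_count("http://127.0.0.1:8083"), fetch_count("http://127.0.0.1:8983"))
```

A matching count is only a rough signal; it does not prove that ACLs or metadata are fully indexed.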
  • 18. 18 1 Billion Documents Review (2015) • Review of the 1 billion document benchmark (November 2015) • 10 repository nodes (Alfresco 5.1), 20 Solr 4 nodes (Alfresco Index Server) • Indexed 1b documents in 5 days • See: "How Alfresco powered a 1.2 Billion document deployment on Amazon Web Services"
  • 20. 20 1.2 Billion Baseline Plan (2020) • Customer-sponsored benchmark to measure the performance of the system with their configuration • Goal: 1.2b documents indexed into Search Services • 20 instances, each containing a single shard (DB_ID_RANGE based sharding)
  • 21. 21 Performance considerations • Bottlenecks: • Database (getChildAssocs) • Transformers (when using large documents) • Network (when using large metadata/content) • Time spent processing data for other shards
  • 22. 22 Baseline Results • Estimated completion in 21 days
  • 24. 24 DB_ID_RANGE Sharding • Does not require specifying the total number of shards in advance • Index can continue to grow with the repository • See https://docs.alfresco.com/search-enterprise/concepts/solr-shard-approaches.html
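The reason the shard count need not be fixed in advance can be illustrated with a small lookup. This is a sketch under stated assumptions: the range boundaries are invented for illustration, and treating each range as start-inclusive and end-exclusive is an assumption, not something the slides state.

```python
from typing import Dict, Optional, Tuple

# Hypothetical ranges: shard id -> (first DB id, one past the last DB id).
SHARD_RANGES: Dict[int, Tuple[int, int]] = {
    0: (0, 60_000_000),
    1: (60_000_000, 120_000_000),
}


def shard_for(db_id: int, ranges: Dict[int, Tuple[int, int]]) -> Optional[int]:
    """Return the shard whose range contains db_id, or None if no shard covers it yet."""
    for shard, (start, end) in ranges.items():
        if start <= db_id < end:
            return shard
    return None


# Growing the repository: add a shard for the next range of DB ids.
# Nodes already assigned to shards 0 and 1 keep their assignment.
grown = dict(SHARD_RANGES)
grown[2] = (120_000_000, 180_000_000)
```

Because a node's shard depends only on its own DB id, adding shard 2 does not move any existing document, unlike hash-based schemes that depend on the total shard count.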
  • 27. 27 Time spent processing transactions for other shards • With DB_ID_RANGE sharding we know that only a range of transactions is relevant • Skip transactions when using DB_ID_RANGE • To support path queries we sometimes need to update data on multiple shards from a single change • Option to disable cascade tracking
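The idea of skipping work that belongs to other shards can be sketched as a filter. This is a deliberate simplification and not the Search Services implementation: the `Transaction` type is hypothetical, and it assumes each transaction already lists the DB ids of the nodes it touched.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Transaction:
    """Hypothetical stand-in for a repository transaction."""
    txn_id: int
    node_db_ids: Tuple[int, ...]


def relevant_transactions(
    txns: Iterable[Transaction], shard_range: Tuple[int, int]
) -> Iterator[Transaction]:
    """Yield only transactions touching a node inside this shard's DB id range.

    shard_range is (start, end) with start inclusive and end exclusive (an assumption).
    """
    start, end = shard_range
    for txn in txns:
        if any(start <= node_id < end for node_id in txn.node_db_ids):
            yield txn


txns = [
    Transaction(1, (10, 11)),
    Transaction(2, (5_000,)),
    Transaction(3, (12, 6_000)),
]
kept = [t.txn_id for t in relevant_transactions(txns, (0, 1_000))]
```

With 20 shards, each shard would otherwise inspect every transaction even though most of them carry no nodes for it.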
  • 28. 28 Reduce Database Access and Network Usage • Reduce amount of data requested • Remove unused calls to getChildAssocs • Compress communication where appropriate • Add option to compress content transfer [Illustrations: the request "Please give me all metadata for the node" replaced by "Please give me: X, Y, Z"; the plain text "Lorem ipsum dolor sit amet, consectetur adipiscing elit..." replaced by the compressed bytes "78 9c 05 c1 81 09 c0 30 08 04 c0 ..."]
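The byte sequence shown on the slide begins with `78 9c`, which is the usual header of a zlib stream at the default compression level. The sketch below illustrates the general effect of compressing text content before transfer; it uses Python's standard `zlib` module and does not reproduce how Search Services and the Repository actually negotiate compression.

```python
import zlib


def compress_payload(text: str) -> bytes:
    """Compress UTF-8 text with zlib at the default compression level."""
    return zlib.compress(text.encode("utf-8"))


def decompress_payload(data: bytes) -> str:
    """Reverse compress_payload."""
    return zlib.decompress(data).decode("utf-8")


# Repetitive text content compresses well; tiny payloads may not shrink at all.
sample = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20
packed = compress_payload(sample)

print(len(sample.encode("utf-8")), "bytes before,", len(packed), "bytes after")
print("header:", packed[:2].hex(" "))
```

Compression trades CPU time for network bandwidth, which is why the slide offers it as an option "where appropriate" rather than as a default.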
  • 29. 29 Overview of Improvements in 1.4.2 • Search Services 1.4.2 (and Insight Engine 1.4.2) • ACS Repository 6.2 Enterprise • No ACS Community release contains this yet • However, an existing ACS can be used with the jars from https://github.com/aborroy/solr-performance-services-repo • Reindex of 1.2b documents in 10 days (6 repo nodes, 20 search nodes) • Search Services 1.3.0: 150 documents/second • Search Services 1.4.2: 1200-3500 documents/second* (depending on the number of shards, size of documents, etc.) • * Depending on exact configuration (NB: not yet validated on the production system)
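The quoted rates and durations can be related with simple arithmetic. The sketch below treats every rate as a sustained aggregate rate for the whole deployment, which is an assumption; the slides do not say whether the figures are per node or aggregate.

```python
SECONDS_PER_DAY = 86_400


def days_to_index(documents: float, docs_per_second: float) -> float:
    """Days needed to index `documents` at a constant rate."""
    return documents / docs_per_second / SECONDS_PER_DAY


def average_rate(documents: float, days: float) -> float:
    """Average documents/second implied by indexing `documents` in `days`."""
    return documents / (days * SECONDS_PER_DAY)


TOTAL = 1.2e9  # 1.2 billion documents

for rate in (150, 1200, 3500):
    print(f"{rate} docs/s -> {days_to_index(TOTAL, rate):.1f} days")
print(f"10 days -> {average_rate(TOTAL, 10):.0f} docs/s")
```

Under that assumption, 1.2b documents in 10 days averages roughly 1,389 documents/second, which sits inside the 1200-3500 range quoted for 1.4.2, while 150 documents/second would take about 93 days.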
  • 31. 31 Future Improvements - Coming in 2.0.0 • Schema Simplification • Smaller index • Removing Duplicate Fields • Smaller communication • Improved Trackers • Less duplication with large transactions • New tracker parallelism option • Content Store Removal • Reduced disk usage • Less duplication • Better usage of Solr optimisations • Adds potential to use other Solr features
  • 32. 32 Improved Trackers - Testing • Scenario datasets: • 100,000 documents created with 100,000 transactions • 100,000 documents created with 1 transaction • Changing the path for 100,000 documents • 200,000 ACLs created with 200,000 ACL change sets • Parameters investigated: • The existing *BatchSize parameters • The new *MaxParallelism parameters • These change the number of workers assigned to the tracker. They use a ForkJoinPool and can impact the resources available to other processes
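The interaction between a batch size and a parallelism limit can be sketched as follows. This is an analogy only: Search Services is written in Java and uses a ForkJoinPool, whereas this sketch uses Python's `ThreadPoolExecutor`, and `index_batch` is a hypothetical stand-in for the real indexing work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_batches(items: Sequence[int], batch_size: int) -> List[List[int]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def index_batch(batch: List[int]) -> int:
    """Hypothetical worker: pretend to index a batch and report how many items it held."""
    return len(batch)


def run_tracker(items: Sequence[int], batch_size: int, max_parallelism: int) -> int:
    """Process all batches with at most max_parallelism concurrent workers."""
    batches = split_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        return sum(pool.map(index_batch, batches))


indexed = run_tracker(list(range(1000)), batch_size=100, max_parallelism=4)
```

A larger batch size means fewer, heavier units of work, while a higher parallelism limit means more of them run at once and compete with other processes for the same CPU.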
  • 33. 33 Improved Trackers - Testing • Hotspot calculation: • Increasing the Transaction Batch Size for nodes and ACLs improves performance until the maximum for your deployment is reached. Beyond that point, increasing the batch size brings no further change • Increasing the Node Batch Size can improve performance while you are below the right number for your deployment. Beyond that point, increasing the batch size penalises performance • Increasing the maximum number of Parallel Threads improved performance until the maximum for our deployment was reached. However, in a real-world deployment it may be useful to use a lower number to avoid impacting other processes [Chart: duration (ms) per configuration]
  • 35. 35 Content Store Removal • Solr Content Store removal will reduce disk usage and simplify replication [Diagrams: the Solr Content Store; replication of the index across Solr nodes]
  • 37. 37 Recap • When to re-index: • When upgrading to major Search Services releases • How to re-index: • Running some small tests to check the performance of the indexing process before running it in production • Indexing from scratch with the upgraded Repository • Indexing in a parallel deployment • How to measure: • Profiling • Monitoring
  • 39. 39 Relevant works • https://nathanmcminn.com/2017/01/11/alfresco-and-solr-search-reindexing-and-index-cluster-size/ • https://www.slideshare.net/JosePortillo26/jose-portillo-dev-con-presentation-1138 • https://www.slideshare.net/angelborroy/2019-dev-con115angelborroy • https://blog.xenit.eu/blog/ethias-sharding • https://hub.alfresco.com/t5/alfresco-content-services-blog/large-repository-upgrades/ba-p/287877 • https://hub.alfresco.com/t5/alfresco-content-services-blog/scaling-search-with-db-id-range/ba-p/287900 • https://www.alfresco.com/technical-whitepaper/alfresco-content-services-solr-deployment-options • https://www.alfresco.com/technical-whitepaper/alfresco-content-services-solr-deployment-example-aws • https://docs.alfresco.com/6.2/concepts/upgrade-path.html