eolivelli
Contributor

@eolivelli eolivelli commented Mar 10, 2023

Motivation

ManagedLedgerImpl eagerly retains a cache of all the BookKeeper ReadHandles.
With offloaded ReadHandles this behaves like a memory leak: each BlobStoreBackedInputStreamImpl retains a 1 MB DirectBuffer, so for a topic with terabytes of data and thousands of ledgers, anything that opens all the ledgers leads to direct memory OOM errors.
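
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes exactly one 1 MB direct buffer per cached offloaded handle, as described above; the class name, the method name, and the ledger count of 5,000 are hypothetical and chosen only for illustration.

```java
public class OffloadedReadBufferEstimate {
    // Assumption from the description: each cached offloaded read handle
    // retains one direct buffer of about 1 MB.
    static final long BUFFER_BYTES_PER_HANDLE = 1024L * 1024L;

    /** Direct memory retained if every one of the given handles stays cached. */
    static long retainedDirectBytes(long openOffloadedHandles) {
        return openOffloadedHandles * BUFFER_BYTES_PER_HANDLE;
    }

    public static void main(String[] args) {
        long handles = 5_000; // hypothetical number of offloaded ledgers opened on one topic
        long bytes = retainedDirectBytes(handles);
        System.out.printf("%d handles retain about %d MiB of direct memory%n",
                handles, bytes / (1024 * 1024));
    }
}
```

Under these assumptions, a few thousand cached handles already account for several GiB of direct memory, which can exceed the direct memory available to a broker.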

Modifications

Add a new background activity that evicts inactive offloaded ReadHandles from memory and releases the memory they retain.

The feature is controlled by the new configuration option managedLedgerInactiveOffloadedLedgerEvictionTimeSeconds=600

Unfortunately, this fix cannot fully prevent an OOM error, because there is no global count or limit on the memory retained by such handles. It mitigates the problem by automatically releasing unused ledger handles.
The default value, 10 minutes, is very conservative, but it should work with real-world ledgers.

The worst-case scenario is a topic with tens of thousands of small ledgers and a consumer that reads the topic from the beginning. In this case the broker will probably open the ReadHandles faster than the eviction process can release them.
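
The idea of time-based eviction can be sketched as follows. This is a minimal illustration, not the ManagedLedgerImpl code from this PR: the class `InactiveHandleEvictor`, its methods, and the injectable clock are all hypothetical, and `AutoCloseable` stands in for an offloaded read handle. A real implementation would run `evictInactive()` from a scheduled background task and would have to cope with reads that are still in flight on a handle being evicted.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not the code changed by this PR. */
public class InactiveHandleEvictor {
    /** A cached handle plus the time it was last used. */
    private static final class Entry {
        final AutoCloseable handle;
        volatile long lastAccessMillis;

        Entry(AutoCloseable handle, long now) {
            this.handle = handle;
            this.lastAccessMillis = now;
        }
    }

    private final Map<Long, Entry> cache = new ConcurrentHashMap<>();
    private final long evictionTimeMillis;
    private final LongSupplier clockMillis; // injectable so the sketch can be tested without sleeping

    public InactiveHandleEvictor(long evictionTimeSeconds, LongSupplier clockMillis) {
        this.evictionTimeMillis = evictionTimeSeconds * 1000L;
        this.clockMillis = clockMillis;
    }

    /** Returns the cached handle for a ledger, opening it if needed, and marks it as used. */
    public AutoCloseable getOrOpen(long ledgerId, LongFunction<AutoCloseable> opener) {
        long now = clockMillis.getAsLong();
        Entry entry = cache.computeIfAbsent(ledgerId, id -> new Entry(opener.apply(id), now));
        entry.lastAccessMillis = now;
        return entry.handle;
    }

    /** Closes and removes every handle that has been idle for at least the eviction time. */
    public int evictInactive() {
        long now = clockMillis.getAsLong();
        int evicted = 0;
        Iterator<Entry> it = cache.values().iterator();
        while (it.hasNext()) {
            Entry entry = it.next();
            if (now - entry.lastAccessMillis >= evictionTimeMillis) {
                it.remove();
                try {
                    entry.handle.close(); // releases the buffer held by the handle
                } catch (Exception e) {
                    // A real implementation would log the failure here.
                }
                evicted++;
            }
        }
        return evicted;
    }

    public int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        long[] fakeNow = {0};
        InactiveHandleEvictor evictor = new InactiveHandleEvictor(600, () -> fakeNow[0]);
        evictor.getOrOpen(1L, id -> () -> System.out.println("closed ledger " + id));

        fakeNow[0] = 599_000; // still inside the 600 s window, nothing is evicted
        System.out.println("evicted: " + evictor.evictInactive());

        fakeNow[0] = 600_000; // idle for the full window, the handle is closed
        System.out.println("evicted: " + evictor.evictInactive());
    }
}
```

The sketch also shows the limitation noted above: nothing bounds `size()` between eviction runs, so a reader that opens handles faster than they go idle can still accumulate buffers.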

Verifying this change

This change added tests.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: eolivelli#22

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 10, 2023
Member

@lhotari lhotari left a comment

LGTM, good work @eolivelli

@codelipenghui codelipenghui added this to the 3.0.0 milestone Mar 13, 2023
@codelipenghui
Contributor

@eolivelli We should start with a proposal, since this introduces a new behavior to Pulsar as well as a new configuration option.

Contributor

@codelipenghui codelipenghui left a comment

We already have the invalidateReadHandle method to invalidate read handles. Why should we introduce time-based invalidation for offloaded read handles? Or, if the change is meant to introduce time-based read handle invalidation, why can't it also apply to BookKeeper read handles?

@poorbarcode
Contributor

Since we will start the RC version of 3.0.0 on 2023-04-11, I will change the label/milestone of PRs that have not been merged.

  • PRs of type feature are deferred to 3.1.0
  • PRs of type fix are deferred to 3.0.1

So I am moving this PR to 3.0.1

@poorbarcode poorbarcode modified the milestones: 3.0.0, 3.1.0 Apr 10, 2023
@github-actions

The pr had no activity for 30 days, mark with Stale label.

@dave2wave
Member

@eolivelli Is this something for the near term?

@github-actions github-actions bot removed the Stale label Jul 17, 2023
@Technoboy- Technoboy- modified the milestones: 3.1.0, 3.2.0 Jul 31, 2023
@github-actions

The pr had no activity for 30 days, mark with Stale label.

@poorbarcode
Contributor

@eolivelli @lhotari

If you are not working on this PR anymore, I will take over this fix tomorrow.

@lhotari
Member

lhotari commented May 21, 2025

/pulsarbot rerun-failure-checks

@lhotari
Member

lhotari commented May 21, 2025

@eolivelli @lhotari

If you are not working on this PR anymore, I will take over this fix tomorrow.

@poorbarcode Your review comments have been addressed. Please review again.

@lhotari lhotari dismissed poorbarcode’s stale review May 30, 2025 15:09

Dismissing request for changes to be able to run CI. I've replied to the request.

@lhotari
Member

lhotari commented May 30, 2025

/pulsarbot rerun-failure-checks

@lhotari
Member

lhotari commented Jun 2, 2025

@poorbarcode I'm merging this. You can create follow-up PRs to improve the situation after this has been merged if you find that important.

@lhotari lhotari merged commit a1a2b36 into apache:master Jun 2, 2025
53 checks passed
lhotari added a commit that referenced this pull request Jun 2, 2025
lhotari added a commit that referenced this pull request Jun 2, 2025
lhotari added a commit that referenced this pull request Jun 2, 2025
nodece pushed a commit to nodece/pulsar that referenced this pull request Jun 18, 2025
@lhotari
Member

lhotari commented Aug 25, 2025

Another related fix to prevent OOM errors is #24655
