Details
Sub-task
Status: Resolved
Major
Resolution: Fixed
Description
An expire-group-metadata operation generates tombstone records, updates the `groups` state, decrements the group size counters, and then writes to the log. If a `__consumer_offsets` partition reassignment occurs, the operation fails. The `groups` state is reverted to an earlier snapshot, but the classic group size counters are not, so the metrics drift out of sync with the actual group counts. The same applies to every unsuccessful write operation that alters the `groups` state.
The issue is exacerbated because the expire-group-metadata operation can be retried multiple times until the partition is fully unloaded, and each failed attempt decrements the counters again.
The solution is to make the counters a timeline data structure (`TimelineLong`) as well, so that a failed write operation reverts the counters along with the `groups` state.
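
The idea of a revertible counter can be illustrated with a minimal, self-contained sketch. It does not use Kafka's actual `TimelineLong` or `SnapshotRegistry`; the classes `RevertibleCounter` and `ExpireGroupsSketch`, their methods, and the epoch numbering are hypothetical names chosen only for illustration. The sketch assumes a snapshot is taken before the in-memory state is changed and that a failed write reverts to that snapshot.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a timeline counter; NOT Kafka's TimelineLong.
final class RevertibleCounter {
    // One entry per snapshot: {epoch, counter value at that epoch}, newest on top.
    private final Deque<long[]> snapshots = new ArrayDeque<>();
    private long value;

    long get() { return value; }
    void increment() { value++; }
    void decrement() { value--; }

    // Remember the current value under the given epoch (epochs are assumed to increase).
    void snapshot(long epoch) {
        snapshots.push(new long[] {epoch, value});
    }

    // Roll back to the value recorded at `epoch`, dropping any later snapshots.
    void revertTo(long epoch) {
        while (!snapshots.isEmpty() && snapshots.peek()[0] > epoch) {
            snapshots.pop();
        }
        if (snapshots.isEmpty() || snapshots.peek()[0] != epoch) {
            throw new IllegalArgumentException("No snapshot for epoch " + epoch);
        }
        value = snapshots.peek()[1];
    }
}

// Hypothetical simulation of one expire-group-metadata attempt.
final class ExpireGroupsSketch {
    // Decrements the counter for each expired group, then reverts if the
    // (simulated) log write fails. Returns the counter value afterwards.
    static long expire(RevertibleCounter numGroups, long epoch,
                       int expiredGroups, boolean writeSucceeds) {
        numGroups.snapshot(epoch);
        for (int i = 0; i < expiredGroups; i++) {
            numGroups.decrement();
        }
        if (!writeSucceeds) {
            numGroups.revertTo(epoch);
        }
        return numGroups.get();
    }
}
```

With a plain `long` counter, each failed attempt in this simulation would leave the decrements in place. With the revertible counter, repeated failed attempts leave the value unchanged, and only a successful write makes the decrement stick.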