KAFKA-18569

New consumer close may wait on unneeded FindCoordinator


Details

    Description

      A flaky test revealed that closing the new consumer may block waiting for an unsent FindCoordinator request to go out, even after the commit/leaveGroup stages of close are done.

      This can happen because CoordinatorRequestManager.poll keeps attempting FindCoordinator whenever the coordinator is unknown, even during consumer close. The window is after the consumer has completed the commit/leave attempts (the only steps in close that require a coordinator) and before the network shutdown that stops polling the managers, here:

      https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java#L1343 

      If the unneeded FindCoordinator is generated and the brokers are down (as can happen in the flaky test), the consumer waits for that request unnecessarily here: https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerNetworkThread.java#L327

      I expect we shouldn't block close on a FindCoordinator request once the consumer has already completed the commit/leave attempts.

      One option is to "signal close" to the CoordinatorRequestManager after consumer.close completes the commit/leave attempts, so that it does not generate any more requests on poll (similar to what is already done for the CommitRequestManager with the CommitOnCloseEvent):

      https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java#L1335 
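
      A rough sketch of the idea, not the actual Kafka code: CoordinatorManagerSketch, signalClose and the String request type are hypothetical stand-ins for illustration only. It just shows poll returning no FindCoordinator once close has been signaled, assuming the real manager could take a similar flag.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names exist in the Kafka code base.
class CloseSketch {

    // Hypothetical, heavily simplified stand-in for CoordinatorRequestManager.
    static class CoordinatorManagerSketch {
        private boolean coordinatorKnown = false;
        private boolean closing = false;

        // Hypothetical hook, called once close has finished its commit/leave attempts.
        void signalClose() {
            closing = true;
        }

        // Returns the requests this manager wants to send on the current poll.
        List<String> poll() {
            List<String> requests = new ArrayList<>();
            if (closing) {
                // Nothing left in close needs a coordinator, so generate no request.
                return requests;
            }
            if (!coordinatorKnown) {
                requests.add("FindCoordinator");
            }
            return requests;
        }
    }

    public static void main(String[] args) {
        CoordinatorManagerSketch manager = new CoordinatorManagerSketch();
        System.out.println(manager.poll()); // prints [FindCoordinator]
        manager.signalClose();
        System.out.println(manager.poll()); // prints []
    }
}
```

      With no new FindCoordinator queued after the signal, the final wait for unsent requests in close would have nothing extra to wait on when the brokers are down.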

      This fix should make it possible to enable this test reliably for the new consumer:

      https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/core/src/test/scala/integration/kafka/api/ConsumerBounceTest.scala#L404

      Without the fix, the test is flaky (fails locally after a few repeated runs, fails in CI).

       

            People

              frankvicky TengYao Chi
              lianetm Lianet Magrans