Describe the bug
We are getting an "Unexpected error during batch delivery" following an upgrade to Kafka 3.7.
We are using Aiven's platform for Kafka and we upgraded to Kafka version 3.7. However, as soon as the upgrade finished, we noticed that the Transactional Producers are issuing the following errors:
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py", line 154, in _sender_routine
    task.result()
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py", line 333, in _do_txn_offset_commit
    return (await handler.do(node_id))
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py", line 379, in do
    retry_backoff = self.handle_response(resp)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py", line 619, in handle_response
    raise error_type()
aiokafka.errors.StaleLeaderEpochCodeError: [Error 13] StaleLeaderEpochCodeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/redacted/workers.py", line 148, in restartable_loop
    await self.fn()
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/redacted/transactional_processor.py", line 177, in run
    await producer.send_offsets_to_transaction({topic_partition: commit_offsets},
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/producer.py", line 577, in send_offsets_to_transaction
    await asyncio.shield(fut)
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py", line 167, in _sender_routine
    raise KafkaError("Unexpected error during batch delivery")
aiokafka.errors.KafkaError: KafkaError: Unexpected error during batch delivery

2024-05-06T16:26:52+0000 | ERROR | Task exception was never retrieved
future: <Task finished name='Task-129748' coro=<Sender._send_produce_req() done, defined at /root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py:246> exception=KeyError(<aiokafka.producer.message_accumulator.MessageBatch object at 0xffff7900fd10>)>
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py", line 259, in _send_produce_req
    await handler.do(node_id)
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/sender.py", line 740, in do
    self._sender._message_accumulator.reenqueue(batch)
  File "/root/.cache/pypoetry/virtualenvs/redacted-9TtSrW0h-py3.11/lib/python3.11/site-packages/aiokafka/producer/message_accumulator.py", line 378, in reenqueue
    self._pending_batches.remove(batch)
KeyError: <aiokafka.producer.message_accumulator.MessageBatch object at 0xffff7900fd10>
The Producer seemingly still keeps going and any further attempts to send messages on the transaction end up with:
We are using a Transactional Producer with acks set to all and with a set transactional.id.
Environment (please complete the following information):
aiokafka 0.10.0
Kafka 3.7 hosted on the Aiven Platform
Reproducible example
I have reproduced below the code we are using to issue transactional messages using the producer.
async def run(self):
    log_info("PROCESSOR_STARTUP", processor="TransactionalProcessor", group_id=self.__group_id)
    consumer = await self.__get_consumer()
    try:
        while self.__running:
            messages = await consumer.getmany(timeout_ms=5000)
            for topic_partition, messages_in_partition in messages.items():
                try:
                    if messages_in_partition:
                        producer = await self.__get_producer(topic_partition.topic, topic_partition.partition)
                        commit_offsets = messages_in_partition[-1].offset + 1
                        records_to_write = []
                        for record in messages_in_partition:
                            processor_result = self.__processor(
                                Record(topic=record.topic, partition=record.partition, offset=record.offset,
                                       received_at=datetime.fromtimestamp(record.timestamp / 1000.0),
                                       key=record.key, value=record.value))
                            if inspect.isawaitable(processor_result):
                                processor_result = await processor_result
                            records_to_write.extend(processor_result)
                        async with producer.transaction():
                            for record in records_to_write:
                                await producer.send(record.topic, value=record.value, key=record.key)
                            await producer.send_offsets_to_transaction({topic_partition: commit_offsets},
                                                                       self.__group_id)
                except ProducerFenced:
                    # This occurs when someone else takes over the processing of this topic partition. We simply
                    # close the producer, if any, and continue with the next partition
                    await self.__clear_producer(topic_partition.topic, topic_partition.partition)
                    log_info("PRODUCER_FENCED", topic=topic_partition.topic, partition=topic_partition.partition,
                             group_id=self.__group_id)
                except OutOfOrderSequenceNumber:
                    # This occurs if a fatal error occurred earlier and therefore the producer does not know the
                    # sequence number for its next message. Discard the producer, if any, and continue with the
                    # next partition.
                    await self.__clear_producer(topic_partition.topic, topic_partition.partition)
                    log_info("PRODUCER_DESYNC", topic=topic_partition.topic, partition=topic_partition.partition,
                             group_id=self.__group_id)
    finally:
        await self.__clear_clients()
        log_info("PROCESSOR_CLEANUP", processor="TransactionalProcessor", group_id=self.__group_id)
After a couple of days trying to get to the bottom of this, we have resolved the issue by changing the order of sending the offsets and sending the records within the transaction, like so:
Given that this has solved the problem, it suggests that there is some sort of race condition within the library that is only triggered when using Kafka 3.7.
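The reordered snippet is not reproduced in this thread, so here is a minimal sketch of what sending the offsets before the records could look like. It assumes the same `producer.transaction()`, `send_offsets_to_transaction` and `send` calls used in the original example. The helper name `write_in_transaction` and its parameters are hypothetical, and the exact shape of the actual fix is an assumption.

```python
from typing import Any, Iterable


async def write_in_transaction(producer: Any, topic_partition: Any, commit_offsets: int,
                               group_id: str, records: Iterable[Any]) -> None:
    """Hypothetical helper: commit consumer offsets first, then send the records.

    `producer` is expected to behave like a transactional AIOKafkaProducer.
    """
    async with producer.transaction():
        # Offsets go to the transaction before any record is sent,
        # which is the reverse of the order in the original example.
        await producer.send_offsets_to_transaction({topic_partition: commit_offsets}, group_id)
        for record in records:
            await producer.send(record.topic, value=record.value, key=record.key)
```

In the `run` loop from the original report, this would replace the body of the `async with producer.transaction():` block, called with `topic_partition`, `commit_offsets`, `self.__group_id` and `records_to_write`.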