On 22 March 2023, from 07:45 to 08:10 UTC and from 09:30 to 09:40 UTC, clients may have experienced a high error rate on API calls for item updates or matchings. An abnormal share of these calls returned a 400 or 500 status code. Some inputs nonetheless created events in our traceability systems, and those events may have appeared as 'waiting' until 14:00 UTC.
Within our item data processing module, the queueing service became saturated with an abnormal number of unclosed connections, which prevented it from writing incoming data correctly. Our Cloud engineers traced the cause to the units that process incoming calls: each unit was opening multiple connections instead of a single one and was not closing them properly. Under the very high load of the European morning, these connections saturated the queueing service.
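To illustrate the failure mode described above, the following is a minimal sketch, not our production code. It assumes a hypothetical `FakeQueueConnection` class standing in for a real queueing-service client, and a hypothetical `ProcessingUnit` that reuses one connection for all incoming calls and always closes it on exit.

```python
import threading


class FakeQueueConnection:
    """Hypothetical stand-in for a queueing-service connection."""

    open_count = 0  # connections currently open, across all instances

    def __init__(self):
        FakeQueueConnection.open_count += 1
        self.closed = False
        self.sent = []

    def send(self, message):
        if self.closed:
            raise RuntimeError("connection is closed")
        self.sent.append(message)

    def close(self):
        # Closing twice is harmless and only decrements the counter once.
        if not self.closed:
            self.closed = True
            FakeQueueConnection.open_count -= 1


class ProcessingUnit:
    """Hypothetical unit: one shared connection, closed when the unit exits."""

    def __init__(self, connect=FakeQueueConnection):
        self._connect = connect
        self._conn = None
        self._lock = threading.Lock()

    def _connection(self):
        # Open a connection lazily and reuse it for every later call.
        with self._lock:
            if self._conn is None or self._conn.closed:
                self._conn = self._connect()
            return self._conn

    def handle_call(self, payload):
        self._connection().send(payload)

    def close(self):
        with self._lock:
            if self._conn is not None:
                self._conn.close()
                self._conn = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs even if handle_call raised, so the connection never leaks.
        self.close()
        return False


with ProcessingUnit() as unit:
    for i in range(1000):
        unit.handle_call({"item": i})
    print(FakeQueueConnection.open_count)  # open connections while the unit runs

print(FakeQueueConnection.open_count)  # open connections after the unit exits
```

In this sketch the number of open connections stays constant regardless of how many calls a unit handles, so a load peak increases message volume without increasing the connection count on the queueing service.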
Once the issue was identified, our Cloud Engineering team monitored the service around the clock and applied regular corrective actions to keep the service bus from saturating, until the hotfix was approved and deployed to all units in the production environments on 23 March around 08:00 UTC. All blocked events have since been closed. Please be assured that calls that received a 200 response were processed correctly and in due time.
In order to improve the overall reliability and to avoid the occurrence of similar incidents in the future:
On behalf of SES-imagotag, the whole Cloud team would like to apologize for the impact this incident may have had. Please be assured that we are working continuously to improve the quality of the platform and the user experience.