[Europe] Slowness in items' data integration
Incident Report for VUSION Cloud
Postmortem

Incoming calls high error rate

What happened?

On 22 March 2023, between 07:45 and 08:10 UTC and again between 09:30 and 09:40 UTC, clients may have experienced a high error rate when sending API calls for item updates or matchings. An abnormal share of these API calls returned a 400 or 500 status code. Some inputs nonetheless created events in our traceability systems, and those events could be seen as 'waiting' until 14:00 UTC.

What went wrong and why?

Within our items' data processing module, our queueing service became saturated with an abnormal number of unclosed connections, which left it unable to write incoming data correctly. Our Cloud engineers identified the cause in the units that process incoming calls: they were opening multiple connections instead of a single one and were not closing those connections properly. Under the very high load of the European morning, those connections saturated our queueing service.
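The failure mode described above can be illustrated with a minimal sketch. It does not reproduce our production code: `FakeQueueService`, `FakeConnection`, `leaky_handler` and `shared_handler` are hypothetical names, and the connection limit is an arbitrary assumption chosen for illustration.

```python
import contextlib


class FakeQueueService:
    """Hypothetical stand-in for a queueing service with a connection limit."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open_connections = 0
        self.messages = []

    def connect(self):
        # Refuse new connections once the (assumed) limit is reached.
        if self.open_connections >= self.max_connections:
            raise ConnectionError("queueing service saturated")
        self.open_connections += 1
        return FakeConnection(self)


class FakeConnection:
    """Hypothetical connection that frees its slot when closed."""

    def __init__(self, service):
        self._service = service
        self._closed = False

    def send(self, message):
        if self._closed:
            raise ConnectionError("connection already closed")
        self._service.messages.append(message)

    def close(self):
        # Closing is idempotent and releases the slot exactly once.
        if not self._closed:
            self._closed = True
            self._service.open_connections -= 1


def leaky_handler(service, payloads):
    """Opens one connection per payload and never closes it."""
    for payload in payloads:
        connection = service.connect()
        connection.send(payload)


def shared_handler(service, payloads):
    """Reuses a single connection and always closes it, even on error."""
    with contextlib.closing(service.connect()) as connection:
        for payload in payloads:
            connection.send(payload)
```

In this sketch, `leaky_handler` exhausts a three-connection service after three payloads, whereas `shared_handler` writes every payload and leaves no connection open.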

Upon acknowledging this issue, our Cloud Engineering team put round-the-clock monitoring in place, with regular curative actions to prevent the service bus from saturating, until the hot-fix was green-lit and pushed to all units in production environments on March 23rd around 08:00 UTC. All blocked events have been closed. Please be assured that calls that received a 200 response were processed properly and in due time.

How are we making incidents like this less likely or less impactful?

To improve overall reliability and to avoid similar incidents in the future:

  1. The number of open connections to the queueing service will now be monitored during the development, quality and staging phases.
  2. The deployment process will be improved to identify similar behaviours sooner.

 

On behalf of SES-imagotag, the whole Cloud team would like to apologize for the impact this incident may have had. Please be assured that we are working continuously to improve the quality of the platform and the user experience.

Posted Apr 04, 2023 - 14:23 UTC

Resolved
This incident has been resolved.
Posted Mar 22, 2023 - 17:02 UTC
Update
Dear customers,

The incident is now closed. All inputs that were successfully received have been processed. The events' statuses have been updated and are available via the API or the Manager UI.
Please make sure that all data inputs that were not properly received, meaning that they did not get a 200 status code, have been retried on your side.
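For clients who automate such retries, a minimal sketch follows. It is an illustration under stated assumptions, not an official client: `send_with_retry` and its `send` callable (which is assumed to return the HTTP status code of one API call) are hypothetical names, and the attempt count and delays are arbitrary defaults.

```python
import time


def send_with_retry(send, payload, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(payload) until it returns 200 or attempts run out.

    `send` is a hypothetical callable returning an HTTP status code.
    Returns the last status code observed.
    """
    status = None
    for attempt in range(max_attempts):
        status = send(payload)
        if status == 200:
            return status
        # Back off exponentially before the next attempt: 1 s, 2 s, 4 s, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

A caller should still check the returned status: if it is not 200 after the last attempt, the input was not accepted and needs to be resubmitted later.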

A report will be provided as soon as the root-cause analysis has been completed.

On behalf of SES-imagotag, we deeply apologize for any inconvenience this may have caused and appreciate your patience and understanding during the resolution process. Rest assured that our Cloud team is working continuously to provide the best experience.
Posted Mar 22, 2023 - 16:59 UTC
Monitoring
Dear customers,

We regret to inform you that a portion of our European customer base is currently encountering delays in the item update process. Our dedicated team of Cloud Engineers has identified this performance issue and is actively working to resolve it as swiftly as possible.

Please rest assured that we are committed to providing a seamless experience for all our users and are diligently addressing this matter. We apologize for any inconvenience this may have caused and appreciate your patience and understanding during this time.

We will keep you informed of any updates or improvements regarding this situation. In the meantime, should you have any questions or require further assistance, please do not hesitate to reach out to our Customer Support team.

Thank you for choosing our services and for your continued support.
Posted Mar 22, 2023 - 12:58 UTC
This incident affected: Europe (VUSION Cloud API - Europe).