Slow integrations and matchings

Incident Report for VUSION Cloud

Postmortem

What happened?

On March 18th at 9:00 AM UTC, a new release of the VusionCloud platform was deployed in the US environment. The update was completed within the scheduled maintenance window, and initial routine checks, including matching processes and item integration, were successful. VusionCloud engineers closely monitored the system until the maintenance window closed at 11:00 AM UTC.

However, a few hours later, delays began accumulating in the matching processes and item data ingestion. The issue was escalated to VusionCloud engineers at around 3:00 PM UTC. The team promptly investigated and identified the new release as the likely cause. To restore normal operations, the faulty version was rolled back at approximately 3:30 PM UTC, and operations were fully back to normal by 4:30 PM UTC.

What went wrong, and why?

The March release of VusionCloud included an update to the library responsible for reading and processing messages from queueing services, as the library version currently used in production is scheduled for deprecation. To keep all our production libraries up to date, well maintained, and able to deliver the best security and performance, we decided to upgrade to the latest version.

However, this update changed the library’s configuration parameters, and our configuration was not correctly adapted to the new version. The change altered the library’s default behavior, preventing the application from scaling properly under increased load starting at 12:30 PM UTC. As a result, messages began to accumulate, leading to processing delays.
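For illustration only, here is a minimal sketch of how this kind of failure can happen when a library renames or re-defaults its configuration parameters. The parameter names, defaults, and code below are assumptions made for the example, not the actual library or settings used by VusionCloud.

```python
# Hypothetical illustration: parameter names and defaults are assumptions,
# not the actual library used by VusionCloud.
from dataclasses import dataclass


@dataclass
class ConsumerConfig:
    # In the (hypothetical) old library version, concurrency was controlled by
    # "max_concurrent_messages". In the new version, the equivalent setting is
    # "prefetch_count", and leaving it unset falls back to a low fixed value.
    prefetch_count: int = 1            # new default: effectively serial processing
    max_auto_scale_instances: int = 1  # autoscaling disabled unless set explicitly


def build_config(overrides: dict) -> ConsumerConfig:
    """Apply deployment overrides; keys the new version no longer knows are dropped silently."""
    known = {k: v for k, v in overrides.items() if k in ConsumerConfig.__dataclass_fields__}
    return ConsumerConfig(**known)


# The deployment kept the *old* parameter name, so the override is ignored
# and the consumer runs with the new, much lower default.
old_style_overrides = {"max_concurrent_messages": 32}
config = build_config(old_style_overrides)
print(config)  # ConsumerConfig(prefetch_count=1, max_auto_scale_instances=1)
```

In a scenario like this, the application keeps running without errors; it simply processes messages far more slowly than intended once load increases.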

Once our engineers identified the issue, they promptly rolled back the release and subsequently confirmed that the update was the root cause. System performance began improving immediately after the rollback at 3:30 PM UTC, and all services were fully restored by 4:30 PM UTC.

How did we respond?

The first delays appeared at 12:30 PM UTC, when the application failed to scale effectively under the rising load. Because the delays accumulated gradually and did not trigger any exceptions or errors, our proactive alerts stayed below their detection thresholds. Messages continued to be processed successfully, but processing times kept increasing.
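To illustrate the detection gap, the sketch below contrasts an alert based only on errors with one based on processing latency. The metric names and thresholds are hypothetical and do not reflect our actual monitoring rules.

```python
# Sketch only: thresholds and metric names are hypothetical, not VusionCloud's
# real monitoring configuration.
from dataclasses import dataclass


@dataclass
class Sample:
    errors_per_min: float
    p95_processing_seconds: float


ERROR_THRESHOLD = 5.0        # fires only on failures
LATENCY_THRESHOLD_S = 30.0   # fires on growing processing delay


def error_alert(s: Sample) -> bool:
    return s.errors_per_min > ERROR_THRESHOLD


def latency_alert(s: Sample) -> bool:
    return s.p95_processing_seconds > LATENCY_THRESHOLD_S


# Messages keep succeeding (no errors), but each sample takes longer to process.
timeline = [Sample(0.0, 12.0), Sample(0.0, 45.0), Sample(0.0, 180.0)]
for t in timeline:
    print(error_alert(t), latency_alert(t))
# error_alert never fires; latency_alert fires from the second sample onward.
```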

Our engineering team became aware of the issue around 3:00 PM UTC, when it was escalated internally by our support team. Given the impact on performance, they allotted a fixed amount of time to search for a potential fix before proceeding with a rollback. As no immediate solution was found, they initiated the rollback at 3:30 PM UTC to minimize disruption to the production environment.

Performance began improving immediately after the rollback, with 80% of the delays resolved within 20 minutes. By 4:30 PM UTC, all services had returned to normal.

Once services stabilized, the team conducted a thorough investigation and quickly confirmed the root cause of the incident: the upgrade to the new library version.

How are we making incidents like this less likely or less impactful?

We remain committed to reducing the likelihood of such incidents and minimizing their impact.

  • To strengthen our monitoring, we will reinforce post-maintenance checks and closely track system behavior during the first peak activity periods of the day.
  • Additionally, we will significantly lower alert thresholds to ensure they are triggered at the first signs of delays, allowing for faster detection and response.
  • To prevent similar occurrences, we will enhance our automated testing to cover these specific production load conditions, so that each release is validated under comparable operational scenarios before deployment.
  • Finally, a fix will be deployed to ensure our configuration is fully compatible with the updated library’s default behavior.

Thank you for your continued support and understanding.

Posted Mar 20, 2025 - 15:30 UTC

Resolved

The rollback fixed the issue, and the situation is back to normal.
We will investigate the root cause and provide a fix.
Thank you for your understanding.
Posted Mar 18, 2025 - 16:24 UTC

Update

We are continuing to monitor for any further issues.
Posted Mar 18, 2025 - 16:13 UTC

Monitoring

The rollback appears to be effective. Monitoring is still ongoing.
Posted Mar 18, 2025 - 16:07 UTC

Identified

The issue has been identified as most likely originating from this morning's release. The version has been rolled back to the previous one.
Posted Mar 18, 2025 - 15:59 UTC

Investigating

Dear customers,

We are currently experiencing slow integrations and matchings. Delays can be observed in VUSION Manager.
We are investigating the issue.

Thank you for your understanding.
Posted Mar 18, 2025 - 15:25 UTC
This incident affected: Americas (VUSION Manager - Americas, VUSION Cloud API - Americas).