On March 18th at 9:00 AM UTC, a new release of the VusionCloud platform was deployed in the US environment. The update was completed within the scheduled maintenance window, and initial routine checks, including matching processes and item integration, were successful. VusionCloud engineers closely monitored the system until the maintenance window closed at 11:00 AM UTC.
However, a few hours later, delays began accumulating in the matching processes and item data ingestion. The issue was escalated to VusionCloud engineers at around 3:00 PM UTC. The team promptly investigated and identified the new release as the likely cause. To restore normal operations, the faulty version was rolled back at approximately 3:30 PM UTC, and all services were fully back to normal by 4:30 PM UTC.
The March release of VusionCloud included an update to the library responsible for reading and processing messages from queueing services, as the library currently used in production is scheduled for deprecation. To keep all our production libraries up to date, well maintained, and able to deliver the best security and performance, we decided to upgrade to the latest version.
However, this update introduced changes to the library’s configuration parameters that were not correctly implemented. The new version also altered the library’s default behavior, which prevented the application from scaling properly when load increased at 12:30 PM UTC. As a result, messages began to accumulate, leading to processing delays.
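To illustrate this kind of failure in general terms only (the snippet below is a simplified, hypothetical sketch; VusionCloud's actual library, parameter names, and code are not shown here), a configuration key that is renamed or ignored after an upgrade can silently cap a consumer's concurrency, so throughput stays flat while load rises and a backlog builds without any errors:

```python
# Hypothetical illustration: a consumer library upgrade where the old
# concurrency key is no longer read, so the consumer falls back to a
# low default and cannot keep up with rising load.

from concurrent.futures import ThreadPoolExecutor
import time

# Assumed behavior: the old key "max_concurrent_messages" is ignored by the
# new version, which only reads "max_concurrency" and defaults to 1.
NEW_DEFAULT_CONCURRENCY = 1

def build_consumer_pool(config: dict) -> ThreadPoolExecutor:
    # The deployment still passes the old key, which the new version ignores.
    workers = config.get("max_concurrency", NEW_DEFAULT_CONCURRENCY)
    return ThreadPoolExecutor(max_workers=workers)

def process_message(message: str) -> None:
    time.sleep(0.1)  # simulate per-message processing work

if __name__ == "__main__":
    config = {"max_concurrent_messages": 16}  # old key, silently ignored
    pool = build_consumer_pool(config)        # ends up with a single worker
    start = time.time()
    list(pool.map(process_message, [f"msg-{i}" for i in range(50)]))
    # With 1 worker this takes ~5 s; with 16 workers it would finish in well
    # under 1 s. No message fails, but the backlog grows as load increases.
    print(f"processed 50 messages in {time.time() - start:.1f}s")
```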
Once our engineers identified the new release as the likely cause, they promptly rolled it back; the subsequent investigation confirmed that the library update was indeed the root cause. System performance began improving immediately after the rollback at 3:30 PM UTC, and all services were fully restored by 4:30 PM UTC.
The first delays appeared at 12:30 PM UTC, when the application failed to scale effectively under the rising load. Because the delays accumulated gradually and did not trigger any exceptions or errors, they stayed below the thresholds of our proactive alerts. Messages were still processed successfully, but processing times kept increasing.
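For illustration only (the rule names and thresholds below are hypothetical and do not reflect our production alerting configuration), the gap between an error-based alert and a latency-based alert during a gradual slowdown looks like this:

```python
# Hypothetical sketch: an error-count rule never fires while processing stays
# error-free, whereas a rule on processing latency catches the gradual slowdown.

from dataclasses import dataclass

@dataclass
class MinuteSample:
    errors: int                  # failed messages in the minute
    p95_latency_seconds: float   # 95th-percentile processing time

ERROR_THRESHOLD = 5              # assumed error-based rule
LATENCY_THRESHOLD_SECONDS = 60   # assumed latency-based rule

def error_alert(sample: MinuteSample) -> bool:
    return sample.errors > ERROR_THRESHOLD

def latency_alert(sample: MinuteSample) -> bool:
    return sample.p95_latency_seconds > LATENCY_THRESHOLD_SECONDS

if __name__ == "__main__":
    # Latency climbs steadily while errors stay at zero, as during the incident.
    timeline = [MinuteSample(errors=0, p95_latency_seconds=10 * (i + 1))
                for i in range(12)]
    for minute, sample in enumerate(timeline):
        print(minute, "error_alert:", error_alert(sample),
              "latency_alert:", latency_alert(sample))
    # error_alert never fires; latency_alert fires once p95 exceeds 60 s.
```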
Our engineering team became aware of the issue around 3:00 PM UTC, when it was escalated internally by our support team. Given the impact on performance, the team allocated a fixed amount of time to look for a forward fix before proceeding with a rollback. As no immediate solution was found, the rollback was initiated at 3:30 PM UTC to minimize disruption to the production environment.
Performance began improving immediately after the rollback, with 80% of the delays resolved within 20 minutes. By 4:30 PM UTC, all services had returned to normal.
Once services stabilized, the team conducted a thorough investigation and confirmed the root cause of the incident: the update to the new library and the configuration changes it introduced.
We remain committed to reducing the likelihood of such incidents and minimizing their impact.
Thank you for your continued support and understanding.