On October 17th, beginning at 8:48 AM UTC, users encountered difficulties accessing Vusion Manager and experienced errors when calling VusionCloud APIs, similar to the incident that occurred on September 19th, 2024. To restore stability, VusionCloud engineers initiated a manual fail-over at 9:02 AM UTC. The situation improved briefly, but services did not recover fully. To fully resolve the issue, VusionCloud engineers substantially increased the database capacity.
At 8:47 AM UTC on October 17th, the Users Management database started to experience degraded response times, similar to the September 19th incident. This Azure-induced slowdown triggered an increase in parallel sessions, resulting in service disruption at 8:48 AM UTC. The temporary improvement achieved through a manual fail-over by VusionCloud engineers was short-lived, because the backlog of waiting sessions built up during the database slowdown continued to place very high demand on the database. In addition to the fail-over, VusionCloud engineers substantially increased the database capacity.
Automatic alerts detected the issue at 8:54 AM UTC, prompting our team to prioritize restoring system stability. Engineers quickly identified the high database load as the cause. A manual fail-over was initiated at 9:02 AM UTC, providing temporary relief. However, because the database slowdown persisted and the backlog of waiting sessions kept the database under high load, the VusionCloud team substantially increased the database capacity, fully resolving the issue by 9:55 AM UTC.
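For illustration only, the sketch below shows the general kind of latency probe that can surface this type of database degradation within minutes, similar in spirit to the automatic alerts described above. The threshold, sample window, and probe query are assumptions for the example and do not reflect VusionCloud's actual alerting configuration.

```python
import time
import statistics

# Illustrative values only; real alerting thresholds and probes will differ.
LATENCY_THRESHOLD_MS = 500   # alert when p95 latency exceeds this
SAMPLE_WINDOW = 60           # number of recent samples to evaluate

def measure_query_latency(run_query) -> float:
    """Time a lightweight health-check query, in milliseconds."""
    start = time.perf_counter()
    run_query()              # e.g. a trivial SELECT against the Users Management database
    return (time.perf_counter() - start) * 1000

def check_and_alert(samples: list[float], send_alert) -> None:
    """Raise an alert when the 95th-percentile latency breaches the threshold."""
    if len(samples) < 20:
        return               # not enough data for a stable percentile
    p95 = statistics.quantiles(samples[-SAMPLE_WINDOW:], n=20)[-1]
    if p95 > LATENCY_THRESHOLD_MS:
        send_alert(f"Users database degraded: p95 latency {p95:.0f} ms")
```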
We recognize that this issue has recurred in recent incidents and are implementing additional measures to address it more effectively.
In response to the incidents last month and this month, we have further strengthened session management to reduce strain on the database. We have also substantially upgraded the database server's capacity and are introducing continuous monitoring to identify and resolve potential issues before they affect performance.
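The exact session-management changes are not detailed in this report. As a hedged illustration of the general technique, the sketch below caps parallel database sessions with a bounded connection pool so that a slow database produces fast, visible failures rather than an unbounded backlog of waiting sessions. The SQLAlchemy parameters, values, and connection string are assumptions for the example, not VusionCloud's configuration.

```python
from sqlalchemy import create_engine

# Illustrative limits and DSN only; the production service uses different values.
engine = create_engine(
    "postgresql+psycopg2://app:secret@users-db:5432/users",
    pool_size=20,        # steady-state sessions allowed against the database
    max_overflow=10,     # short bursts above the steady-state limit
    pool_timeout=5,      # fail fast after 5 s instead of queuing indefinitely
    pool_pre_ping=True,  # discard dead connections left over from a fail-over
)
```

Bounding the pool and timing out waiting requests limits how much load queued sessions can add to an already degraded database, which is the compounding effect observed during this incident.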
Our goal is to provide a more resilient system to ensure higher availability and stability.