VusionCloud is unavailable

Incident Report for VUSION Cloud

Postmortem

What happened?

On October 17th, beginning at 8:48 AM UTC, users had difficulty accessing Vusion Manager and received errors when calling VusionCloud APIs, similar to the incident that occurred on September 19th, 2024. To restore stability, VusionCloud engineers initiated a manual failover at 9:02 AM UTC. The situation improved briefly, but services did not fully recover. To resolve the situation, VusionCloud engineers drastically increased the database capacity.

What went wrong, and why?

At 8:47 AM UTC on October 17th, the Users Management database started to experience degraded response times, similar to the September 19th incident. This Azure-induced slowdown triggered an increase in parallel sessions, resulting in service disruption at 8:48 AM UTC. The manual failover performed by VusionCloud engineers brought only temporary improvement, because the sessions left waiting by the database slowdown generated very high demand as soon as capacity became available. In addition to the failover, VusionCloud engineers drastically increased the database capacity.
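
The session pile-up described above is consistent with Little's law: the average number of in-flight sessions equals the arrival rate multiplied by the average response time, so a slower database means more parallel sessions even when traffic is unchanged. A minimal sketch in Python, using purely hypothetical numbers (the request rate and latencies below are illustrative assumptions, not measurements from this incident):

```python
def concurrent_sessions(arrival_rate_per_s: float, response_time_s: float) -> float:
    """Little's law: average in-flight sessions = arrival rate x response time."""
    return arrival_rate_per_s * response_time_s


# Hypothetical figures for illustration only, not taken from the incident.
REQUESTS_PER_SECOND = 200.0
NORMAL_LATENCY_S = 0.05
DEGRADED_LATENCY_S = 2.0

normal = concurrent_sessions(REQUESTS_PER_SECOND, NORMAL_LATENCY_S)
degraded = concurrent_sessions(REQUESTS_PER_SECOND, DEGRADED_LATENCY_S)
print(f"normal: {normal:.0f} sessions, degraded: {degraded:.0f} sessions")
```

With the same request rate, the number of parallel sessions grows in proportion to the response time, which may explain why a failover alone brings only brief relief while the underlying slowdown persists.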

How did we respond?

Automatic alerts detected the issue at 8:54 AM UTC, prompting our team to prioritize restoring system stability. Engineers quickly identified the high database load as the cause. A manual failover was initiated at 9:02 AM UTC, providing temporary relief. However, because the database slowdown persisted and the waiting user sessions kept the database under high load, the VusionCloud team greatly increased the database capacity, fully resolving the issue by 9:55 AM UTC.

How are we making incidents like this less likely or less impactful?

We recognize the recurrence of recent incidents and are implementing additional measures to address these issues more effectively.

In response to the incidents last month and this month, we have further strengthened session management to reduce strain on the database. Additionally, we have substantially upgraded the database server’s capacity, and are introducing continuous monitoring to identify and resolve potential issues before they affect performance.

Our goal is to provide a more resilient system to ensure higher availability and stability.

Posted Oct 29, 2024 - 08:13 UTC

Resolved

Dear customers,

The issue has been resolved and VusionCloud is now operational.
Thank you for your patience.
Posted Oct 17, 2024 - 13:05 UTC

Monitoring

Dear customers,
We have identified and resolved the issue, and we are now actively monitoring the situation to ensure stability.

Thank you for your patience.
Posted Oct 17, 2024 - 10:00 UTC

Investigating

Dear customers,

We are currently experiencing a service interruption affecting VusionManager. Our teams are fully mobilized and focused on restoring service and ensuring business continuity for our customers.

We will keep you informed as the situation evolves.
Thank you for your trust.
Posted Oct 17, 2024 - 08:53 UTC
This incident affected: Europe (VUSION Manager - Europe, VUSION Cloud API - Europe).