Misleading ESL status display

Incident Report for VUSION Cloud

Resolved

Introduction
In our relentless pursuit of continuous improvement and innovation, we have launched a series of ambitious initiatives over the past months aimed at fortifying various facets of our platform. Our foremost goals have been to improve availability, stability, performance, and overall user experience, all while handling growth and preparing for future demand.

Of particular significance is our unwavering commitment to introducing multi-regional active-active capabilities that will undoubtedly benefit our existing and future customers. This initiative holds the promise of delivering a substantial boost in our cloud's performance, elevating availability to unprecedented levels, and reinforcing our disaster recovery capabilities. These enhancements are poised to empower all VUSION Cloud clients with a more robust and resilient infrastructure, ensuring their critical operations remain unimpeded and their data is safeguarded.

This strategic undertaking represents a pivotal step in our journey toward providing a state-of-the-art platform that not only meets your needs today but anticipates and addresses your evolving requirements in the future.

What Happened?

The September update of VUSION Cloud included substantial performance improvements, including the distribution of data processing across regions and zones.

Towards the end of September, we received feedback concerning misleading ESL statuses. Some status update messages were not processed promptly or in the correct order. Consequently, labels appeared stuck in a 'synchronizing' or 'unreachable' state, even though the data transmissions had been executed as expected and without delays.
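To illustrate how out-of-order processing can leave a label showing a stale status, here is a minimal sketch in Python. It is not the VUSION Cloud implementation: the `StatusUpdate` and `LabelStatusStore` names, the per-label `sequence` counter, and the 'synchronized' status are assumptions made for illustration only. The sketch shows one common safeguard, which is to ignore any update that is older than the one already recorded.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StatusUpdate:
    label_id: str
    status: str
    sequence: int  # hypothetical per-label counter, higher means newer


class LabelStatusStore:
    """Keeps only the newest status per label, whatever the arrival order."""

    def __init__(self) -> None:
        self._latest: Dict[str, StatusUpdate] = {}

    def apply(self, update: StatusUpdate) -> bool:
        """Record the update unless a newer or equal one is already stored."""
        current = self._latest.get(update.label_id)
        if current is not None and update.sequence <= current.sequence:
            return False  # late or duplicate message, displayed status unchanged
        self._latest[update.label_id] = update
        return True

    def status_of(self, label_id: str) -> Optional[str]:
        current = self._latest.get(label_id)
        return current.status if current else None


store = LabelStatusStore()
store.apply(StatusUpdate("esl-1", "synchronized", sequence=2))
# The older 'synchronizing' message arrives late and is ignored.
store.apply(StatusUpdate("esl-1", "synchronizing", sequence=1))
print(store.status_of("esl-1"))  # prints "synchronized"
```

Without the sequence check, the late 'synchronizing' message would overwrite the newer status, and the label would appear stuck even though the transmission had succeeded.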

Upon identifying this issue, our support team diligently categorized these cases and escalated them to our R&D team, who promptly provided fixes that were deployed over the first half of October. Our analysis confirmed that these issues had been mitigated. However, labels that were in 'synchronizing' status and had not received an update in the preceding days still showed the incorrect initial status. To correct the remaining faulty statuses, our engineers initiated refresh tasks, starting on the 10th of October, for all labels that needed an update. The refreshes were completed by the 13th of October.

We want to emphasize that no data or transmissions were lost or delayed, and no action was or is required from our customers.

How Are We Making Incidents Like This Less Likely or Less Impactful?

To prevent incidents like the one we faced, we have introduced several measures:
1. Enhanced Testing: We have added numerous end-to-end test cases to our development and quality assurance pipelines. Our test teams are now placing special emphasis on verifying these cases to ensure rigorous testing.
2. Automated Alerts: We have implemented a number of automatic alerts to closely monitor label statuses and detect any similar issues early.
3. Improved Support Escalation: We have identified areas for improvement in our support escalation processes to expedite the resolution of such incidents.
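
As an illustration of the kind of check an automated alert on label statuses could perform, the following Python sketch flags labels that have remained in a 'synchronizing' or 'unreachable' state longer than a threshold. The function name `find_stale_labels`, the one-hour threshold, the data layout, and the 'synchronized' status are assumptions for illustration, not a description of the actual monitoring.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def find_stale_labels(
    labels: Dict[str, Tuple[str, datetime]],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # assumed threshold
) -> List[str]:
    """Return IDs of labels that stayed in a suspect status longer than max_age."""
    suspect_statuses = {"synchronizing", "unreachable"}
    return [
        label_id
        for label_id, (status, last_update) in labels.items()
        if status in suspect_statuses and now - last_update > max_age
    ]


now = datetime(2023, 10, 10, 12, 0, tzinfo=timezone.utc)
labels = {
    "esl-1": ("synchronizing", now - timedelta(hours=5)),    # stuck
    "esl-2": ("synchronizing", now - timedelta(minutes=5)),  # still in progress
    "esl-3": ("synchronized", now - timedelta(days=2)),      # healthy
}
stale = find_stale_labels(labels, now)
print(stale)  # prints ['esl-1']
```

The same selection could also feed a refresh task, so that only labels whose displayed status is likely outdated are refreshed.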

We acknowledge that change is an integral part of progress. While we endeavor to make transitions as seamless as possible, we understand that there may be unforeseen challenges, especially when updating live systems. Your satisfaction and success are our utmost priorities, and we remain dedicated to collaborating closely with you to ensure smooth and successful updates.

We appreciate your patience and understanding as we work together to adapt and grow. Rest assured, you can expect unwavering commitment from our team to provide you with the best-in-class IoT platform in the industry.

Posted Oct 19, 2023 - 21:14 UTC