Affected
Major outage from 8:00 AM to 7:30 PM
- Resolved
There may have been some lag or unresponsiveness while we disabled the workaround code we used to assist in recovery, but the issue is now considered resolved.
The root cause was a network micro-cut that happened around 07:00 AM UTC. It caused a large number of shards to disconnect and reconnect at the same time, which in turn corrupted part of our cache: 240 shards (out of a total of 704) had incorrect data cached. This cache contains data that is critical for the bot's operation (which guilds Vexera is in, the channels in those guilds, the permissions, the members, etc.). Missing and invalid data prevented the bot from working correctly in some guilds.
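To check whether a given guild sat on one of the affected shards, a guild ID can be mapped to its shard. The sketch below assumes Vexera uses Discord's standard gateway sharding formula, `(guild_id >> 22) % num_shards`; the function names and the example shard IDs are hypothetical and are not taken from Vexera's code.

```python
from typing import Iterable, Set

TOTAL_SHARDS = 704  # total shard count stated in this report


def shard_for_guild(guild_id: int, num_shards: int = TOTAL_SHARDS) -> int:
    """Return the shard a guild maps to under Discord's standard sharding formula."""
    return (guild_id >> 22) % num_shards


def guilds_on_affected_shards(guild_ids: Iterable[int], affected: Set[int]) -> Set[int]:
    """Keep only the guilds whose shard is in the affected set."""
    return {g for g in guild_ids if shard_for_guild(g) in affected}


if __name__ == "__main__":
    # Hypothetical affected shard IDs, for illustration only.
    affected_shards = {3, 5}
    sample_guilds = [5 << 22, 6 << 22, (704 + 3) << 22]
    print(guilds_on_affected_shards(sample_guilds, affected_shards))
```

A lookup like this lets a user report, which names a guild, be tied back to a specific shard to inspect.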
The issue, which was first identified by the support team based on user reports, wasn't correctly escalated to the developers for remediation. Everyone who runs the bot is based in Europe, and by that time of day we were already at our workplaces. The issue finally reached someone with access to our servers at 02:17 PM UTC. They tried basic recovery actions without success and, being at work, were not able to do much more.
The actual investigation finally started around 05:00 PM UTC, and the first remediation steps were carried out at 05:10 PM UTC. Once the 240 affected shards had been identified, we started a rolling restart of them. This was completed by 06:40 PM UTC, at which point almost all guilds were fixed. We then dug into our logs to find and fix the few remaining guilds that were still in an inconsistent state. At 07:30 PM UTC, we believed everything was back to normal, which our metrics confirmed.
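For readers unfamiliar with the term, a rolling restart brings shards back in small batches rather than all at once, so that a mass reconnect like the one that triggered the incident is not repeated. The following is only a minimal sketch of the idea, not our actual tooling: `restart`, `is_healthy`, and the batch size of 16 are hypothetical placeholders.

```python
from typing import Callable, Iterable, List


def make_batches(shard_ids: Iterable[int], size: int) -> List[List[int]]:
    """Split the shard IDs into consecutive batches of at most `size` shards."""
    ids = sorted(shard_ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def rolling_restart(
    shard_ids: Iterable[int],
    restart: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    batch_size: int = 16,  # hypothetical value
) -> List[int]:
    """Restart shards batch by batch and return the shards that are still unhealthy."""
    still_broken: List[int] = []
    for batch in make_batches(shard_ids, batch_size):
        for shard_id in batch:
            restart(shard_id)
        # Check the batch before moving on, so only a few shards reconnect at a time.
        still_broken.extend(s for s in batch if not is_healthy(s))
    return still_broken


if __name__ == "__main__":
    restarted: List[int] = []
    # Stand-in callbacks: record the restart and pretend shard 7 stays unhealthy.
    leftover = rolling_restart(range(240), restarted.append, lambda s: s != 7)
    print(len(restarted), leftover)
```

The returned list corresponds to the leftovers that need manual follow-up, similar to the few guilds we fixed by hand after the restart.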
The time it took to fix this issue is unacceptable, and there is clearly a lack of proper escalation and handling processes for major outages that happen during workdays. We will see how we can improve that.
- Monitoring
As of now, all guilds should have recovered. We are keeping a close watch on our health metrics while our cache finishes warming up.
For transparency, previous updates contained an error: the outage was not caused by an issue with our voice service (although early investigation had pointed to this service as a possible cause). We will share more information once we are sure that everything is fixed.
- Identified
We have identified the cause of the issue and are working on fixing the impacted voice service, as well as the dashboard.
Vexera Pro has been fixed and is back up & running!
Edit: We are currently restarting all affected shards.
- Investigating
Vexera is currently affected by three separate issues:
One of our voice components, specifically "voice-manager-1", is currently unresponsive. Some servers may experience issues playing music.
The dashboard returns a "403 Forbidden" error once you select a server in your server list. Configuring a server through the dashboard is not possible at this time.
Vexera Pro is unresponsive to all commands.
We're investigating the issues above and will update once further information becomes available. Thank you for your patience.