flespi noc (eu)
flespi eu region NOC
#eu: downtime started, error: Simulated telematics device timed out waiting for recently added command. Usually this indicates the problem with flespi Telematics Gateway.
#eu: downtime ended, period: 254 second(s)
#eu: downtime started, error: Simulated telematics device connected to the channel, sent the packet with message, but channel didn't replied to it. Usually this indicates the problem with flespi Telematics Gateway.
#eu: downtime ended, period: 761 second(s)
#eu: downtime started, error: Failed to perform https://flespi.io GET request. Usually this indicates either flespi eu datacenter network uplink connection problem or when the platform is in the maintenance mode.
#eu: downtime ended, period: 50 second(s)
#eu: downtime started, error: Failed to receive messages posted by the simulated device with GET /gw/channels/XXX/messages REST API call within 5 seconds. It usually means that flespi storage system is either shutdowned for maintenance or currently operating under high load and some database operations may be delayed.
#eu: downtime ended, period: 157 second(s)
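
For readers who want to tally the reported periods, here is a minimal sketch that sums the "downtime ended" lines in the format shown above. The function name and the regular expression are illustrative assumptions and not part of any flespi tooling.

```python
import re

# Matches the "downtime ended" lines as they appear in this channel.
# The pattern is an assumption based on the messages above.
ENDED_RE = re.compile(
    r"^#(?P<region>\w+): downtime ended, period: (?P<seconds>\d+) second\(s\)$"
)

def total_downtime_seconds(lines):
    """Sum the reported periods of all 'downtime ended' lines."""
    total = 0
    for line in lines:
        match = ENDED_RE.match(line.strip())
        if match:
            total += int(match.group("seconds"))
    return total

sample = [
    "#eu: downtime ended, period: 254 second(s)",
    "#eu: downtime ended, period: 761 second(s)",
    "#eu: downtime ended, period: 50 second(s)",
    "#eu: downtime ended, period: 157 second(s)",
]
print(total_downtime_seconds(sample))  # 1222
```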
At the moment both datacenters are operational.

The storage system mirror in the failed datacenter requires synchronization of around 80 TB of data. We are gradually activating this process and hope to complete it by tomorrow. During this process flespi may be slow to respond in certain situations, and short downtimes may also occur. Please keep in mind that this is a controlled, data-heavy process.

A short summary of the downtimes: one of our datacenters was completely powered off, with all servers in all racks deactivated. The services in the active datacenter had difficulty handling the accumulated load and responding to NOC test nodes within the expected timeout. So even while downtime was being reported, flespi was fully operational, just slow to respond.
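
The difference between "down" and "slow" from the monitor's point of view can be sketched as follows. This is a simplified illustration, not the actual NOC implementation: the function name is hypothetical, and the 5-second default is taken from the messages-check alert in this channel (other checks may use different timeouts).

```python
from typing import Optional

def check_status(latency_s: Optional[float], timeout_s: float = 5.0) -> str:
    """Classify one probe result from the monitor's point of view.

    latency_s is how long the service took to answer,
    or None if it never answered at all.
    """
    if latency_s is None:
        return "down: no response"
    if latency_s > timeout_s:
        # The service did answer, but too late for the probe,
        # so the monitor still reports downtime.
        return "down: responded after timeout"
    return "up"

print(check_status(0.4))   # up
print(check_status(7.5))   # down: responded after timeout
print(check_status(None))  # down: no response
```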

During the investigation and network rerouting we had disconnections from another datacenter, but that was just a network reconfiguration issue.
We also had a split-brain condition in the routers of the failed datacenter, which we were unable to detect before switching traffic back, so it was also the cause of some failures for 50% of requests. We detected and fixed it within minutes.

All in all, a lot of problems happened, and our team solved them one by one while keeping flespi available. Our monthly uptime (https://flespi.com/status#uptime) has already dropped to a historically low level of 99.82%, which means that our Enterprise and Ultimate users will enjoy 30% and 70% discounts on their next flespi bill (https://flespi.com/pricing#sla).
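
To put the 99.82% figure in perspective, here is a rough conversion of a monthly uptime percentage into minutes of downtime. It assumes a 30-day month and a simple time-based calculation; the status page may compute uptime differently, and the function name is illustrative.

```python
def downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Convert a monthly uptime percentage into minutes of downtime."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0

# 0.18% of a 30-day month (43,200 minutes)
print(round(downtime_minutes(99.82), 2))  # 77.76
```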

Later we will investigate each step and each problem that occurred and consider how to improve our platform to prevent this in the future. And of course we will add more spare resources to handle even peak load, just in case.
#eu: downtime started, error: Failed to receive messages posted by the simulated device with GET /gw/channels/XXX/messages REST API call within 5 seconds. It usually means that flespi storage system is either shutdowned for maintenance or currently operating under high load and some database operations may be delayed.
We are now experiencing around a minute of latency in the telematics data feed for some devices due to data synchronization between datacenters. This is part of a controlled process.
All other services (REST, MQTT and so on) are unaffected.
#eu: downtime ended, period: 1133 second(s)
#eu: downtime started, error: DELETE /channels/connection operation triggered by webhook failed. Usually this indicates the problem within webhooks subsystem.
#eu: downtime ended, period: 83 second(s)
We had a failure of one of the servers running the MQTT Broker service, which caused a 2-minute delay in webhooks processing and was detected by our NOC uptime-checking nodes as downtime. Other services were either unaffected or had hiccups shorter than 30 seconds.

We are now investigating how a server failure can affect the webhooks system throughput and will definitely improve this soon.
#eu: downtime started, error: Failed to perform https://flespi.io/gw/xxx GET request. Usually this indicates either flespi telematics hub REST API overload or when the hub is in the maintenance mode.
#eu: downtime ended, period: 258 second(s)
#eu: downtime started, error: Failed to perform https://flespi.io/gw/xxx GET request. Usually this indicates either flespi telematics hub REST API overload or when the hub is in the maintenance mode.
#eu: downtime ended, period: 504 second(s)
Downtime Explanation - May 27, 2025

During today's scheduled datacenter switch replacement maintenance, we experienced a network issue that caused service interruption.

An engineer on site inadvertently disconnected one of the fiber links from the secondary gateway. This caused the secondary gateway to lose LAN connectivity while still retaining its WAN connection and attempting to assume the primary routing role. This triggered a "split brain" situation where the secondary gateway tried to handle all traffic despite having no functional network connection due to the loose fiber connector.

Service has been restored by operating on the primary gateway only. The maintenance work is continuing as planned.
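
The split-brain condition described above can be modeled with a small sketch. It is a toy model under stated assumptions, not the actual gateway configuration: it assumes the secondary learns about the primary through a heartbeat carried over the LAN, and all names (GatewayView, naive_should_take_over, guarded_should_take_over) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GatewayView:
    """What the secondary gateway can observe about its own state."""
    lan_up: bool             # can it reach the internal network?
    wan_up: bool             # can it reach the outside network?
    primary_heartbeat: bool  # does it still hear the primary (assumed to be sent over LAN)?

def naive_should_take_over(view: GatewayView) -> bool:
    # Takes over whenever the primary goes silent, even if the silence
    # is caused by the secondary's own broken LAN link.
    return not view.primary_heartbeat

def guarded_should_take_over(view: GatewayView) -> bool:
    # One possible guard: only take over if this gateway could
    # actually forward traffic in both directions.
    return not view.primary_heartbeat and view.lan_up and view.wan_up

# The situation from the incident: LAN link lost, WAN still connected.
incident = GatewayView(lan_up=False, wan_up=True, primary_heartbeat=False)
print(naive_should_take_over(incident))    # True
print(guarded_should_take_over(incident))  # False
```

In this model the naive rule promotes a gateway that cannot reach the LAN, which matches the behavior described in the explanation. The guarded rule is only one of several possible mitigations.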