Hey folks!
Unfortunately, roughly 2 hours ago, lemm.ee went offline. The cause was our load balancer: it suddenly decided that all of our servers had become unhealthy, despite all health checks responding successfully when I requested them directly. In such cases, the load balancer stops serving all requests, effectively meaning that lemm.ee is unreachable for all users. I am still not sure what exactly caused the issue, but I will try to investigate more over the weekend.
For now, we have partially recovered, and I am continuing to work on remaining issues. Hopefully we will be back to 100% very soon. Sorry for the inconvenience!
We appreciate what you do, hero
I was wondering what was going on; status.lemm.ee said the server was OK but federation was broken. Thank you for fixing it
Sorry for the delay in updating the status page - I had actually gone out for lunch just a few minutes before the downtime started, so I didn’t even realize anything was up until I was back at my computer about 45 minutes later 💀
No need to apologise. Still a better response time than some of the professionals I work with ;-)
I survived the July 18th lemm.ee downtime, and all I got was this lousy comment.
All is forgiven, thank you for running this lovely instance
Thanks for your great work and transparency!
Thanks for the quick fix! What did you have to do to get the load balancer working again?
For now, I just redeployed all of our servers completely, but as I don’t know the actual root cause of the issue yet, I’m still investigating to figure out if anything more is needed.
Nginx? I had an nginx LB shit itself yesterday. Luckily it auto-recovered and I had HA but just weird it happened.
Actually, we’re using Hetzner’s cloud load balancer for lemm.ee. But if this issue repeats in the near future, then I will definitely consider setting up something else.
haproxy is where it’s at!
It’s probably a managed haproxy in Hetzner’s case.
I’d like to speak to a manager /s
Typically when this happens, the issue is on the LB itself. Maybe its own network had issues?
Would it be in bad taste to blame Russia?
Yeah, but it could have been China, India, Iran, or maybe even North Korea. There are a lot of places that think disrupting the rest of the world will get them somewhere.
Sometimes, downtimes are awesome. Get off your machine and spend time with your family, folks!
I thought the entire Lemmy network was down, because status.lemm.ee was saying our instance was fine while federation wasn’t working with any other instance. lol
o7
Thank goodness! Hopefully discovering these vulnerabilities and protecting against them will help keep Lemmy alive when the big dogs come in to sweep us away! (Worst fears)
Seriously, your professionalism in handling the situation and in reporting it is fantastic.
It’s totally above and beyond anything we should expect for a service powered by donations!
Thank you!