Cheating a bit by writing this before everything is fully back, but it should be soon, hopefully. First of all, I am sorry about the extended downtime. Not at all what I would have wanted, and I have tried everything in my power to speed it up to get y'all back to posting.
We had a user who backfilled 20k+ records in the last 24 hours, and it caused us to get rate limited by Bluesky's relay (the thing that takes our posts and shows them on bsky.app). This is not the user's fault, but mine. I found out today that PDS rate limits are not on by default (they are now); you have to set the env variable PDS_RATE_LIMITS_ENABLED to turn on API limits. With that set, a user could not generate enough records to cause the relay to rate limit us. The user was backfilling an account with their Twitter history, and it was nothing malicious. I want to reiterate that it's not their fault: I should have had the proper protections in place so that the user saw the rate limits and their script was stopped before it affected everyone. That is now set, and this should not happen again. I am deeply sorry and hope this is the last of our growing pains as a PDS.
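For anyone writing a backfill script against a PDS with rate limits turned on, here is a minimal sketch of how a client could pace itself from the response headers. It assumes the server sends `ratelimit-remaining` and `ratelimit-reset` headers, with the reset given as a Unix timestamp in seconds; check what your PDS actually returns. The function name `seconds_to_wait` and the `min_remaining` threshold are made up for illustration.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str],
                    now: Optional[float] = None,
                    min_remaining: int = 1) -> float:
    """Return how long a client should pause before its next write.

    Assumption: responses carry `ratelimit-remaining` and `ratelimit-reset`
    headers, with the reset given as a Unix timestamp in seconds.
    """
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {k.lower(): v for k, v in headers.items()}
    remaining = lowered.get("ratelimit-remaining")
    reset = lowered.get("ratelimit-reset")
    if remaining is None or reset is None:
        return 0.0  # No rate limit info, so nothing to base a pause on.
    try:
        remaining_n = int(remaining)
        reset_ts = float(reset)
    except ValueError:
        return 0.0  # Unparseable headers are treated like missing ones.
    if remaining_n >= min_remaining:
        return 0.0  # Still have budget in this window.
    now = time.time() if now is None else now
    # Budget is used up: wait until the window resets (never a negative wait).
    return max(0.0, reset_ts - now)
```

A backfill loop could call `time.sleep(seconds_to_wait(response.headers))` after each write, so it slows down on its own instead of hammering the PDS until something upstream gives out.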
The extent of the outage now depends on how long Bluesky's relay takes to catch up with the current firehose events emitted from the PDS.
A timeline of events
16:10 UTC: a user reports something is off and I notice it as well. Posts are delayed and not always going through
16:20 UTC: Troubleshooting. Found the account and saw that they were backfilling their atproto account with bsky records matching their Twitter history (again, not their fault; the PDS should have been giving them rate limit headers and stopping it long ago).
16:40 UTC: I wrote up this GitHub issue asking if someone could look into it, since a similar issue had happened on another PDS (not that user's fault either).
17:14 UTC: Reached out to the user who had created the account and asked if they could stop the script while we figured out what was going on. They were very nice and apologetic; it was not malicious at all.
18:05 UTC: Made use of the com.atproto.admin.sendEmail endpoint for the first time, which let me email everyone the following message. If you did not get it, you may want to check that you have an email on your account so you can be reached if something like this happens again. The script, if you ever need it.
18:23 UTC: Bryan was kind enough to reply to the GitHub issue and let me know he had upped the rate of firehose events for us, which would allow us to get going faster.
18:30ish UTC: I had dug some more and found that API rate limits do not appear to be on by default like I expected (they are now). Set PDS_RATE_LIMITS_ENABLED and rebooted the server. This does not really help us now, since it's the events coming from the firehose via the relay that are limited atm, but it stops this from happening again.
20:00 UTC: I tried a takedown on the account that had created the records, in the hope that a takedown would stop the backlog for that account. Didn't work. I assume the relay was so backlogged that it wouldn't see the takedown till much later.
23:00 UTC: Writing this up while waiting on the backlog on the relay to catch up and let everyone post again. sorry gang :(
00:24 UTC: About to post this now before we're out of relay hell.
02:38 UTC: We have made it through relay hell. Sorry again everyone. Rate limits are in place so this doesn't happen again, and they even made a rule about us.
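For other PDS admins curious about the com.atproto.admin.sendEmail call from the 18:05 entry, here is a rough sketch of what building one such request could look like. This is not the script mentioned above, just an illustration. It assumes the endpoint lives at `/xrpc/com.atproto.admin.sendEmail`, accepts HTTP Basic auth as user `admin` with the PDS admin password, and takes `recipientDid`, `senderDid`, `subject`, and `content` fields; verify all of that against the lexicon before relying on it. The helper name `build_send_email_request` is hypothetical.

```python
import base64
import json
import urllib.request


def build_send_email_request(pds_host, admin_password, recipient_did,
                             sender_did, subject, content):
    """Build (but do not send) a POST to com.atproto.admin.sendEmail."""
    # Assumption: admin endpoints accept HTTP Basic auth as user "admin".
    token = base64.b64encode(f"admin:{admin_password}".encode()).decode()
    # Assumption: these field names match the endpoint's input schema.
    body = {
        "recipientDid": recipient_did,
        "senderDid": sender_did,
        "subject": subject,
        "content": content,
    }
    return urllib.request.Request(
        url=f"https://{pds_host}/xrpc/com.atproto.admin.sendEmail",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To email everyone, you would loop over the account DIDs on the PDS, build one request per account, and pass each to `urllib.request.urlopen`. Building the request separately from sending it makes it easy to print and double check before anything goes out.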