Update [16/10/2021]: We’ve added why the outages have been happening.
Following our Diablo 2: Resurrected server outage and login issues story a couple of days ago, players from all over the world are really making their frustrations heard.
The outages are caused by problems with the game’s authentication servers, which began after Blizzard patched the Xbox, PlayStation, and Nintendo Switch versions of the game on 6th October.
[#D2R] As part of our continued investigation into the issues over the past few days, our team will be actively monitoring and reacting to the situation during peak play times and there may be periods where logins or game creation are limited
— Blizzard CS – The Americas (@BlizzardCS) October 13, 2021
We continue working to resolve the login issues in #D2R as soon as possible. We apologize for the inconvenience caused by today's outages.
— Blizzard CS – The Americas (@BlizzardCS) October 11, 2021
Players who experienced these outages on the US, Asia, and UK servers are taking to social media to demand refunds. Here are but a few examples:
There’s even a #RefundD2R hashtag doing the rounds on Twitter.
Damn guys, login problems are like once a day.. #RefundD2R
— Samuel Giudice (@SamuelGiudice) October 12, 2021
#refundD2R same day, same time, same channel. This is unbelievable. 5 days in this shit
— Alex C – Moonspell (@alexcasjim) October 13, 2021
We hope to hear from Blizzard about this escalating issue soon. In the meantime, perhaps it’s time you revisited Diablo 3, which is rock-solid stable at this point?
Why Is This Happening?
Blizzard posted a lengthy official update as to why the server outages have been happening:
“Our server outages have not been caused by a singular issue; we are solving each problem as they arise, with both mitigating solves and longer-term architectural changes. A small number of players have experienced character progression loss–moving forward, any loss due to a server crash should be limited to several minutes.
This is not a complete solve to us, and we are continuing to work on this issue. Our team, with the help of others at Blizzard, are working to bring the game experience to a place that feels good for everyone.
We’re going to get a little bit into the weeds here with some engineering specifics, but we hope that overall this helps you understand why these outages have been occurring and what we’ve been doing to address each instance, as well as how we’re investigating the overall root cause. Let’s start at the beginning.
In staying true to the original game, we kept a lot of legacy code. However, one legacy service in particular is struggling to keep up with modern player behavior.
This service, with some upgrades from the original, handles critical pieces of game functionality, namely game creation/joining, updating/reading/filtering game lists, verifying game server health, and reading characters from the database to ensure your character can participate in whatever it is you’re filtering for. Importantly, this service is a singleton, which means we can only run one instance of it in order to ensure all players are seeing the most up-to-date and correct game list at all times. We did optimize this service in many ways to conform to more modern technology, but as we previously mentioned, a lot of our issues stem from game creation.
We mention “modern player behavior” because it’s an interesting point to think about. In 2001, there wasn’t nearly as much content on the internet around how to play Diablo II “correctly” (Baal runs for XP, Pindleskin/Ancient Sewers/etc for magic find, etc). Today, however, a new player can look up any number of amazing content creators who can teach them how to play the game in different ways, many of them including lots of database load in the form of creating, loading, and destroying games in quick succession. Though we did foresee this–with players making fresh characters on fresh servers, working hard to get their magic-finding items–we vastly underestimated the scope we derived from beta testing.
Additionally, overall, we were saving too often to the global database: There is no need to do this as often as we were. We should really be saving you to the regional database, and only saving you to the global database when we need to unlock you–this is one of the mitigations we have put in place. Right now we are writing code to change how we do this entirely, so we will almost never be saving to the global database, which will significantly reduce the load on that server, but that is an architecture redesign which will take some time to build, test, then implement.”
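The save pattern Blizzard describes above, writing frequently to a regional database and touching the global database only when a character is unlocked, can be sketched roughly like this. All class, method, and field names here are hypothetical illustrations, not Blizzard's actual code:

```python
class CharacterStore:
    """Toy model of the save pattern: write character state to the
    regional database often, and write to the global database only
    when the character is unlocked. Names are hypothetical."""

    def __init__(self):
        self.regional_db = {}  # fast, written on every routine save
        self.global_db = {}    # authoritative, written rarely

    def save(self, char_id, state):
        # Routine saves hit only the regional database, keeping
        # load off the shared global one.
        self.regional_db[char_id] = state

    def unlock(self, char_id):
        # On unlock, flush the latest regional state to the global
        # database so other regions see a consistent character.
        if char_id in self.regional_db:
            self.global_db[char_id] = self.regional_db[char_id]

store = CharacterStore()
store.save("sorc_1", {"level": 42})
store.save("sorc_1", {"level": 43})  # still regional-only
store.unlock("sorc_1")               # a single global write
```

The point of the redesign, as Blizzard puts it, is to make that global write the rare case rather than the default.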
Long story short: the old game’s legacy code is causing most of the problems. Here’s how Blizzard is currently solving them, chief among the fixes being rate limits and queues:
Rate limiting: We are limiting the number of operations to the database around creating and joining games, and we know this is being felt by a lot of you. For example, for those of you doing Pindleskin runs, you’ll be in and out of a game and creating a new one within 20 seconds. In this case, you will be rate limited at a point. When this occurs, the error message will say there is an issue communicating with game servers: this is not an indicator that game servers are down in this particular instance, it just means you have been rate limited to reduce load temporarily on the database, in the interest of keeping the game running. We can assure you this is just mitigation for now–we do not see this as a long-term fix.
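Rate limiting of this kind, where fast back-to-back game creations eventually get rejected, is commonly implemented with a token bucket. Here's a minimal sketch; the capacity and refill rate are illustrative numbers, not Blizzard's real limits:

```python
class TokenBucket:
    """Simple token-bucket rate limiter: each game-create request
    costs one token; tokens refill at a fixed rate over time.
    Capacity and refill rate below are illustrative only."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then spend one token if any
        # are available.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # A rejected request is where the player would see the
        # misleading "issue communicating with game servers" error.
        return False

# Allows a burst of 3 creates, then sustains roughly one per 20 seconds.
bucket = TokenBucket(capacity=3, refill_per_sec=0.05)
```

A player doing rapid Pindleskin runs burns through the burst allowance quickly and then gets throttled to the sustained rate, which matches the behavior Blizzard describes.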
Login Queue Creation: This past weekend was a series of problems, not the same problem over and over again. Due to a revitalized playerbase, the addition of multiple platforms, and other problems associated with scaling, we may continue to run into small problems. To diagnose and address them swiftly, we need to make sure the “herding”–large numbers of players logging in simultaneously–stops. To address this, we have people working on a login queue, much like you may have experienced in World of Warcraft. This will keep the population at the safe level we have at the time, so we can monitor where the system is straining and address it before it brings the game down completely. Each time we fix a strain, we’ll be able to increase the population caps. This login queue has already been partially implemented on the backend (right now, it looks like a failed authentication in the client) and should be fully deployed in the coming days on PC, with console to follow after.
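A login queue of the sort described keeps the concurrent population under a cap and admits waiting players as slots free up. A minimal sketch, with a hypothetical cap value (the real service would tune it as strain points are fixed):

```python
from collections import deque

class LoginQueue:
    """Toy login queue: admit players until the population cap is
    reached, queue the rest in order, and backfill from the queue as
    players log out. The cap here is illustrative only."""

    def __init__(self, cap):
        self.cap = cap
        self.online = set()
        self.waiting = deque()

    def login(self, player):
        if len(self.online) < self.cap:
            self.online.add(player)
            return "online"
        self.waiting.append(player)
        return f"queued at position {len(self.waiting)}"

    def logout(self, player):
        self.online.discard(player)
        # Admit the longest-waiting player into the freed slot.
        if self.waiting and len(self.online) < self.cap:
            self.online.add(self.waiting.popleft())

q = LoginQueue(cap=2)
q.login("a")
q.login("b")
print(q.login("c"))  # queued at position 1
q.logout("a")        # "c" is admitted automatically
```

The value of the queue is that load arrives gradually instead of as a "herd", so engineers can watch where the system strains before raising the cap.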
Breaking out critical pieces of functionality into smaller services: This work is both partially in progress for things we can tackle in less than a day (some have been completed already this week) and also planned for larger projects, like new microservices (for example, a GameList service that is only responsible for providing the game list to players). Once critical functionality has been broken down, we can look into scaling up our game management services, which will reduce the amount of load.
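To illustrate the microservice split, here is a toy stand-alone GameList service: it only stores and filters the public game list, so list reads no longer hit the singleton that handles creation, joining, and health checks. All names and fields are hypothetical:

```python
class GameListService:
    """Toy GameList microservice: responsible solely for publishing
    and filtering the game list. Because it is read-mostly and holds
    no other critical state, it could be replicated and scaled
    independently of the game-management singleton. Field names are
    hypothetical."""

    def __init__(self):
        self.games = {}  # game_id -> metadata

    def publish(self, game_id, name, players, max_players):
        self.games[game_id] = {"name": name, "players": players,
                               "max_players": max_players}

    def remove(self, game_id):
        self.games.pop(game_id, None)

    def list_games(self, name_filter=""):
        # Case-insensitive substring filtering over published games.
        return [g for g in self.games.values()
                if name_filter.lower() in g["name"].lower()]

svc = GameListService()
svc.publish(1, "Baal-run-01", 7, 8)
svc.publish(2, "pindle-mf", 1, 8)
print(len(svc.list_games("baal")))  # 1
```

Carving off read-heavy responsibilities like this is what lets the remaining game-management service shrink to a load it can actually sustain.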