This past weekend, we achieved one of our largest and most sought-after milestones: Guilds and Boss Strike functionality. While the visible features were substantial, much of the real complexity and time in this update went into the backend changes that accompanied the release, and those changes unfortunately led to the issues experienced immediately afterward. In this post, I’d like to provide some transparency about what happened, the root causes of the problems some players ran into, and how we are addressing them and preventing a repeat in the future.

Over the past month, we received messages from several former users, banned for ToS violations, who threatened DDoS attacks against our service. As a result, we began rapidly migrating our remaining endpoints to Cloudflare, since our current provider does not have the infrastructure required to mitigate attacks of this nature. This was a monumental task: moving these services to Cloudflare meant ripping out our battle-hardened MadronaID API backend and rewriting it in a Cloudflare-compatible technology, and MadronaID interacts with many critical, interconnected parts of our service. We ultimately took a massive portion of our codebase, written over the past two years, rewrote it, and deployed it on a new platform in the span of just a few weeks to support this transition.
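
For the technically curious, the sketch below shows roughly what a “Cloudflare-compatible” endpoint looks like, assuming Cloudflare Workers in TypeScript. The route, binding name, and logic are purely illustrative and are not taken from the actual MadronaID code.

```typescript
// Minimal, illustrative Cloudflare Worker (module syntax).
// Types like KVNamespace come from @cloudflare/workers-types.
// The SESSIONS binding and the /session/validate route are hypothetical.
export interface Env {
  SESSIONS: KVNamespace; // example KV binding for session tokens
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Example route: validate a session token stored in Workers KV.
    if (url.pathname === "/session/validate") {
      const token = request.headers.get("Authorization") ?? "";
      const session = await env.SESSIONS.get(token);
      return session
        ? new Response("ok", { status: 200 })
        : new Response("unauthorized", { status: 401 });
    }

    return new Response("not found", { status: 404 });
  },
};
```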

Issues that Emerged During the Release

The first issue appeared the moment the update went live: players were being unexpectedly signed out whenever any of our systems tried to refresh their logins. After an hour of troubleshooting with our vendor, we discovered the root cause was a security patch applied during our maintenance window, which left the auth system in an inconsistent state even though the same patch had no such effect on our test system. After some mitigation steps, error rates returned to normal levels, and players were able to log in and stay logged in. Unfortunately, the resulting surge of login requests exposed an oversight in our OTP system: older email providers such as Yahoo and AOL enforce extremely low hourly sending limits. Even with only hundreds of players on these email domains requesting multiple login codes in a short window, those providers began rate-limiting deliveries, blocking affected users from regaining access. This led to a snowball effect: blocked users requested even more codes, which caused even more deferred emails for those domains.
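
To make the failure mode concrete, here is a minimal sketch of the kind of per-domain throttle that helps prevent this snowball. The domains, hourly limits, and function names below are hypothetical examples, not our production configuration.

```typescript
// Per-domain budget for OTP emails, so that providers with very low
// hourly limits are not flooded by retry requests. Values are illustrative.
type DomainBudget = { limitPerHour: number; sent: number[] };

const budgets = new Map<string, DomainBudget>([
  ["yahoo.com", { limitPerHour: 20, sent: [] }],
  ["aol.com", { limitPerHour: 20, sent: [] }],
]);

const DEFAULT_LIMIT = 500; // assumed limit for providers without special handling

function canSendOtp(email: string, now = Date.now()): boolean {
  const domain = email.split("@")[1]?.toLowerCase() ?? "";
  const budget = budgets.get(domain) ?? { limitPerHour: DEFAULT_LIMIT, sent: [] };
  budgets.set(domain, budget);

  // Keep only sends from the last hour, then check the remaining budget.
  const oneHourAgo = now - 60 * 60 * 1000;
  budget.sent = budget.sent.filter((t) => t > oneHourAgo);
  if (budget.sent.length >= budget.limitPerHour) {
    return false; // tell the client to wait instead of queueing another email
  }
  budget.sent.push(now);
  return true;
}

// Example: surface a clear "try again later" message rather than
// silently sending another email the provider will defer.
if (!canSendOtp("player@yahoo.com")) {
  console.log("Please wait before requesting another login code.");
}
```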

Next, we began receiving reports of issues with our newly launched webshop, which we built in response to increasing IAP fees, in hopes of capturing more of the revenue created by Boss Strikes and reinvesting it in development. During configuration, we relied on Epic’s default regional pricing, which produced unintentional price disparities: even in regions with strong economies, items were effectively discounted by up to 50% compared to the U.S. dollar price. Because the disparity is so large, we’ve had to temporarily pull access to the shop while we work through mitigation steps.
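
As a rough illustration of the arithmetic involved, the sketch below converts a local price back to USD and flags any region whose effective discount exceeds a tolerance. All prices, exchange rates, and thresholds here are made up for the example.

```typescript
// Illustrative check only: compare each regional price against the USD
// list price and flag anything discounted past a tolerance.
const usdPrice = 9.99; // USD list price of a webshop bundle

// localPrice is in local currency; usdPerUnit converts it back to USD.
const regionalPrices: Record<string, { localPrice: number; usdPerUnit: number }> = {
  EUR: { localPrice: 4.49, usdPerUnit: 1.08 },
  GBP: { localPrice: 7.99, usdPerUnit: 1.27 },
};

const MAX_DISCOUNT = 0.15; // allow at most ~15% drift from the USD price

for (const [region, { localPrice, usdPerUnit }] of Object.entries(regionalPrices)) {
  const effectiveUsd = localPrice * usdPerUnit;
  const discount = 1 - effectiveUsd / usdPrice;
  if (discount > MAX_DISCOUNT) {
    // e.g. EUR: 4.49 * 1.08 ≈ 4.85 USD, roughly 51% below the USD price
    console.warn(`${region}: effective discount ${(discount * 100).toFixed(0)}% exceeds tolerance`);
  }
}
```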

Finally, we received reports of a couple of oversights in our guild implementation that led to some unexpected behavior. Fortunately, we were able to patch most of the issues quickly, though hotfixes are coming soon to fully smooth out the remaining wrinkles. A huge thank you to everyone who reported the issues quickly so we could resolve them before they became widespread.

Why Weren’t We Able to Mitigate These Issues Before Release?

After much internal reflection, we believe the root cause of most of these issues was a combination of rushing to hit our internal timelines and changing far too much of our tech stack in a single update. We’d previously promised at least one more update in 2025, but our initial internal timeline slipped as we refined our implementation while our attention was divided by the massive infrastructure migration. That pushed us dangerously close to the end of the year, so the pressure was on to get things wrapped up and shipped. At the same time, the DDoS threats meant we could not wait indefinitely, which further compelled us to ship the update or risk a potentially prolonged downtime.

On top of that, we had to develop several clawback functions for the various exploits we uncovered. We are deeply committed to a level playing field, so it was important to get this part right, but it added a huge time sink on top of everything else we had already committed to. What seemed like a relatively straightforward task turned into a complex implementation: accounts had to be updated correctly while we maintained detailed audit logs so we could check our work. With three monumental top-level objectives, plus countless smaller tasks interspersed with the rest of the work, the team was ultimately spread extremely thin.
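
For a sense of why the clawback work ballooned, here is the general shape of an adjustment that records an audit entry alongside every account change. The types, field names, and amounts are hypothetical and heavily simplified.

```typescript
// Simplified, hypothetical sketch: apply a clawback to an account while
// recording a detailed audit entry so the change can be reviewed later.
interface Account {
  id: string;
  nanopods: number;
}

interface AuditEntry {
  accountId: string;
  reason: string;
  before: number;
  after: number;
  timestamp: string;
}

const auditLog: AuditEntry[] = [];

function clawback(account: Account, amount: number, reason: string): Account {
  const before = account.nanopods;
  // Never take the balance below zero; any remainder would be handled separately.
  const after = Math.max(0, before - amount);
  auditLog.push({
    accountId: account.id,
    reason,
    before,
    after,
    timestamp: new Date().toISOString(),
  });
  return { ...account, nanopods: after };
}

// Example: remove 500 exploited nanopods and keep a record of the change.
const updated = clawback({ id: "acct-123", nanopods: 800 }, 500, "webshop pricing exploit");
console.log(updated, auditLog);
```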

Additionally, with so many parts of our system changing and such a limited time window, we did not have the bandwidth to re-test every possible scenario, so we took calculated risks by skipping certain validation steps and focusing on the riskiest portions of our tech stack. The parts we validated fortunately functioned as expected, but a few of the less-tested areas had unexpected issues that we had to fix after going live. Worse, some of these areas were impractical to test beforehand because of how some of our vendors are set up, which effectively forced us to test in production (a horrible place to be). Our internal QA team is excellent at uncovering issues, but they were not given the time and space necessary to fully exercise our systems.

Why Did it Take So Long to Get Everything Stabilized?

As mentioned before, this update revamped quite literally every part of our technology stack. That made it extremely difficult to isolate issues to a single subsystem and target our debugging efforts, which slowed our hotfix times. We also simply got unlucky: the authentication system errors were a fluke we had never seen in two years of working with this vendor. What would normally be an hour-long hotfix session turned into a five-hour disaster-mitigation effort, with the team working well into the morning.

Moving Forward

Yesterday's update was not at the quality level we at Madrona require for Battle Nations and our users. On behalf of the entire team at Madrona, I would like to apologize for the frustration caused by the guild issues and the login problems.

We’ve taken some time to reflect on what happened and have put together plans to mitigate the issues described above.

Firstly, regarding the guild level cost issues: we are currently testing a cleanup function that recalculates guild levels using the level costs that should have applied. Each guild will be re-leveled according to the total gold donated, with any leftover gold redeposited in the guild bank for future use. This will be done during a maintenance break, alongside an upcoming hotfix that corrects the values displayed in the client, to reduce further confusion.
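
As a simplified sketch of what that cleanup does (the cost table and numbers below are illustrative, not the real values): the function walks the corrected cost table against a guild’s total donations, assigns the level that gold actually pays for, and returns the remainder to the guild bank.

```typescript
// Hypothetical corrected cost to go from level N to N+1 (index 0 = level 1 -> 2).
const LEVEL_COSTS = [1_000, 5_000, 15_000, 40_000, 100_000];

function relevelGuild(totalDonated: number): { level: number; bankRefund: number } {
  let level = 1;
  let remaining = totalDonated;
  for (const cost of LEVEL_COSTS) {
    if (remaining < cost) break;
    remaining -= cost;
    level += 1;
  }
  // Gold that didn't buy a full level goes back to the guild bank.
  return { level, bankRefund: remaining };
}

// Example: 22,500 gold donated -> level 4, with 1,500 gold returned to the bank.
console.log(relevelGuild(22_500));
```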

Secondly, the Battle Nations Webshop will be returning with custom regional pricing. In addition, we will be taking action against users who used tools to appear in other regions and placed significant order volume to exploit the pricing disparities. This will not affect legitimate transactions, including those that granted more nanopods than they should have due to our error. If you are missing purchases from Google Play, please submit a support ticket and we will resolve it in the next few days.

Third, to improve the login experience for those using Yahoo, att.net, verizon.net, or any other email service operated by Yahoo, we are taking steps to better align with Yahoo’s email policies. While these steps should improve deliverability, they may not fully resolve the issues for providers whose limits sit significantly below industry standards; unfortunately, that is not something we control. We encourage users who are still affected to reach out to support and have their account moved to an email address from a different provider. We deeply apologize for the inconvenience.

Finally, we’ve taken significant steps to avoid putting ourselves in this position again. Fortunately, now that all of our infrastructure is protected by Cloudflare, we shouldn’t need another mass backend migration of the kind that consumed so much of our development bandwidth. We will also be far more cautious about changing so many subsystems simultaneously, opting instead for staggered releases that are easier to test and debug.

On a more positive note, as of this month, we have converted nearly all of the development team to full-time status! With significantly increased team bandwidth and the massive migration time sinks behind us, we’re now positioned to deliver the high-quality experience we continuously strive for. Thanks to your continued support, we’ve grown into a fully fledged studio, something we are deeply grateful for. We’re super excited for what's to come!

For this week and the next, we’ll be shipping out a few hotfix updates to resolve any lingering issues and ensure we’re prepared for the first Boss Strike!