An Interesting 401

I had a pretty interesting troubleshooting experience yesterday.

For work, I've been setting up some AWS CloudWatch alarms in our dev environment to demonstrate to my team how we can build a service status dashboard for our production environment.

One of the things I wanted to do was leverage the health status checks available for load balancers, so in my dev environment I set up an ALB and created a target group with our TeamCity instance in it, just for the purposes of demonstrating the technique.

So the load balancer finishes initializing and I confirm that traffic is resolving properly. But the health check is failing in the target group, with status code 401. That's weird.

I wonder if there's some aspect of AWS that I don't understand, or if I've misconfigured something and that's the root cause of the health check failure. I don't want to run into this same problem when I set up the dashboard in production, so I feel like I'd better figure this out.

I spend way too long double checking the settings on the load balancer, the target group, the instance, the instance's Windows firewall settings, the security groups, the VPC and subnets. I think I went through the entire stack 4 or 5 times, and couldn't figure out why the health check would be failing.

It's particularly confusing because there are no problems with the actual site traffic, just the health check. The health check was using the default settings and just hitting the root path, which of course I could browse to myself just fine. So weird!

I make sure there isn't some extra process on the instance listening on the port. Then I start wondering if the health check is constructing the network request in some unusual way that TeamCity doesn't know how to deal with. But if that were true, why would it respond with a 401 Unauthorized? It's an ALB and should be doing an HTTP GET, so it shouldn't be an ICMP ping type of thing. Maybe it's doing an HTTP HEAD instead of a GET and TeamCity doesn't like that?

So I install Wireshark on the TeamCity instance and start watching the network traffic. I see the health check packets come in and sure enough, 401 response. Pretty typical GET request, with a few headers:

Connection: close
User-Agent: ELB-HealthChecker/2.0
Accept-Encoding: gzip, compressed

I curl the server myself with the same request, and I get the 401 as well. Okay. I start fiddling with the headers on the curl request and keep getting 401 every time. I browse to the site in Chrome and grab the equivalent curl command from the network panel in dev tools. It's got a bunch of other headers, and that request gets a 302 to /login.html, which is the expected response and, more to the point, not a 401. One by one I remove the headers from the curl request and keep getting the 302. Until I remove the User-Agent header. Then I get a 401. Aha!

I guess I can understand TeamCity requiring a user agent. Maybe a little silly and unnecessary, but not completely out of the realm of reasonableness. But the health check request does have a User-Agent header. The ELB-HealthChecker/2.0 value is a little unusual given that most traffic will of course originate from a browser, but there's nothing invalid about it. In fact, kudos to AWS for doing the responsible thing and specifying a sensible user agent in the first place. So why does TeamCity like my browser's user agent of Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 but not ELB-HealthChecker/2.0 or no user agent at all? I start sending curl requests with various fragments of my browser's user agent, and keep getting the 302 responses.

Until I remove AppleWebKit. Then I get a 401. Seriously? Seriously!? We all know how stupid browser user agents are these days, so why would TeamCity go to the extra effort of adding code to require AppleWebKit specifically to be included in the user agent?? I tried ELB-HealthChecker/2.0 AppleWebKit and even that worked fine. 302. smdh.
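If you'd rather not do that one-token-at-a-time elimination by hand, here's a rough Python sketch of the same idea. To be clear about what's assumed: `StandInHandler` is a little local server I made up that just mimics the behavior I observed (302 when the user agent contains AppleWebKit, 401 otherwise). It is not TeamCity's code, and I have no idea what TeamCity actually does internally. The helper names `status_for` and `tokens_that_matter` are mine too. Pointing `status_for` at a real host and port should work the same way over plain HTTP.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36")


class StandInHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in that copies the observed responses, nothing more."""

    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        if "AppleWebKit" in user_agent:
            self.send_response(302)
            self.send_header("Location", "/login.html")
        else:
            self.send_response(401)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def status_for(host, port, user_agent=None):
    """GET / and return the status code; http.client does not follow redirects."""
    headers = {"Connection": "close"}
    if user_agent is not None:
        headers["User-Agent"] = user_agent
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/", headers=headers)
        return conn.getresponse().status
    finally:
        conn.close()


def tokens_that_matter(host, port, user_agent):
    """Drop one whitespace-separated token at a time; report those that change the status."""
    tokens = user_agent.split()
    baseline = status_for(host, port, user_agent)
    culprits = []
    for i, token in enumerate(tokens):
        trimmed = " ".join(tokens[:i] + tokens[i + 1:])
        if status_for(host, port, trimmed) != baseline:
            culprits.append(token)
    return culprits


def run_demo():
    """Start the stand-in on a free local port and replay the experiments from the post."""
    server = HTTPServer(("127.0.0.1", 0), StandInHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        return {
            "no_ua": status_for(host, port),
            "elb": status_for(host, port, "ELB-HealthChecker/2.0"),
            "elb_plus": status_for(host, port, "ELB-HealthChecker/2.0 AppleWebKit"),
            "culprits": tokens_that_matter(host, port, BROWSER_UA),
        }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Splitting on whitespace is crude (it chops `(KHTML, like Gecko)` into pieces), but it's enough to narrow things down to the one token that flips the response.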

I search Google and Twitter for anything about TeamCity and AppleWebKit to see if anybody has run into this before, and there's nothing. Well, I guess I can't be too surprised - probably not a lot of people trying to load balance a single TeamCity instance on AWS in the first place. My TeamCity instance is a few versions behind and I wonder if this behavior has been changed in newer releases. I'm not going to spend time installing a new version of TeamCity, but fortunately there's a cloud version now so I try it there, and sure enough it returns a 401 there too. Try it yourself!

> curl -s -o /dev/null -w "%{http_code}" "https://teamcity.jetbrains.com/" -H "User-Agent: AppleWebKit"
302

> curl -s -o /dev/null -w "%{http_code}" "https://teamcity.jetbrains.com/" -H "User-Agent: AppleWebKi"
401

So, mystery solved, I guess. One of the silliest things I've ever encountered, but at least now I can rest easy that there's not some networking security concept in AWS that I don't understand or something like that. I can move on with my CloudWatch dashboard project.

If anybody at JetBrains ends up reading this, would love for you to fill in the blanks for me on this one 🙂

Oh, and I guess I probably should have realized it would have been better to just specify /login.html as the health check endpoint in the first place. I've since done that, and now my target group health checks are green.
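I made that change by hand, but if you wanted to script it, something like the following Python sketch should be close. Assumptions worth flagging: it uses boto3 (which has to be installed separately, with credentials configured), the target group ARN is a placeholder rather than a real one, and the helper names `health_check_update` and `apply_update` are just mine for illustration.

```python
def health_check_update(target_group_arn, path="/login.html"):
    """Build the keyword arguments for elbv2 modify_target_group."""
    if not path.startswith("/"):
        raise ValueError("health check path must start with '/'")
    return {"TargetGroupArn": target_group_arn, "HealthCheckPath": path}


def apply_update(params, region_name=None):
    """Send the change to AWS. Needs boto3 and valid credentials."""
    import boto3  # third-party; imported here so the helper above works without it

    client = boto3.client("elbv2", region_name=region_name)
    return client.modify_target_group(**params)


if __name__ == "__main__":
    # Placeholder ARN: substitute your own target group's ARN before calling apply_update.
    params = health_check_update(
        "arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:targetgroup/NAME/ID"
    )
    print(params)
```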