How I Fixed AWS ELB Health Checks Failing On Flask

Posted on Sun 28 February 2021 in Tech • 3 min read

The Problem - AWS Load Balancer Health Checks Kept Failing

I've been working on a Flask application to track my workouts. In an effort to deploy this application in a way that others could sign up and use it, I decided to use containers in AWS. This is extremely overkill for what I'm doing, but I wanted to play around with ECS and get some hands-on experience doing deployments. I followed this excellent video tutorial by Alex Damiani. The general architecture is the following:

  • Flask application
  • Gunicorn as the WSGI HTTP Server in front of Flask (in the same container)
  • Amazon ECR (Elastic Container Registry) for holding the container images
  • Amazon ECS (Elastic Container Service) for running the container
  • Amazon EC2 server for hosting the deployed containers
  • Amazon RDS (Relational Database Service) for the backend PostgreSQL database
  • Amazon ELB (Elastic Load Balancer) with the HTTPS certificate to route traffic to the containers
  • Amazon Route 53 record that points to the ELB
  • GitHub Action to build the Docker container, upload it to ECR, deploy it to ECS, and drain the old container

This all worked relatively smoothly, and the Flask application worked just fine. However, I noticed that the container kept getting killed and re-deployed every few minutes. This felt sub-optimal to say the least, so I wanted to get it fixed.

What I Tried

I discovered that the ELB was never marking the container as healthy, even though the application worked. Here are some of the things I tried while troubleshooting:

  • Changed the base image for the Docker container
  • Modified Gunicorn settings: fewer workers, a different temp directory, keep-alive values, threads, and worker-class
  • Changed the Dockerfile to use ENTRYPOINT instead of CMD
  • Created a separate /health_check endpoint to see if it was just the root that was not returning properly
  • Directly exposed port 5000 in the Dockerfile
  • Ran the container commands as root instead of a new locked-down user
  • Temporarily removed Gunicorn and just ran 'flask run' directly
  • Extended all timeouts and failure settings on the ELB health check
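For reference, the Gunicorn settings mentioned above all live in a gunicorn.conf.py file (which is just Python). This is a sketch with placeholder values, not the actual settings from my deployment:

```python
# gunicorn.conf.py -- example values only, not the settings from my deployment
bind = "0.0.0.0:5000"        # match the port the container exposes
workers = 2                  # fewer workers for a small EC2 instance
threads = 4                  # threads per worker
worker_class = "gthread"     # threaded worker class
keepalive = 5                # seconds to hold idle keep-alive connections
worker_tmp_dir = "/dev/shm"  # in-memory temp dir avoids slow disk heartbeats
```

Gunicorn picks this up with something like 'gunicorn -c gunicorn.conf.py app:app'.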

The Solution - Remove The Flask SERVER_NAME Environment Variable

The key discovery came while troubleshooting: when I logged in to the EC2 server that was running these ECS containers, I got a 404 from every route. I found this by listing the containers, seeing which port was in use, then curling that port:

$ docker ps -n 5
CONTAINER ID        IMAGE                                                                                                      COMMAND                  CREATED             STATUS                PORTS                     NAMES
fa1d3d172824   "gunicorn -b…"   13 hours ago        Up 13 hours >5000/tcp   ecs-tqfit-ecs-task-12-tqfit-ecs-repo-b01ad28fd8f4b0b81700

$ curl -I

While investigating my environment variables, I discovered that when SERVER_NAME is set, Flask expects the Host header of every request to match that value — but my local testing and the ELB health checks were hitting the app by IP address. I removed the Flask SERVER_NAME environment variable from my Dockerfile and GitHub Action, and after that the health checks passed and the application still responded at its normal domain. If you're having trouble with health checks or routing to your Flask app, you may want to remove that variable!
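The failure mode is easy to reproduce with Flask's test client. This is a minimal sketch — the example.com domain, the IP, and the route are made up — but it shows why an IP-based health check gets a 404 once SERVER_NAME is set:

```python
from flask import Flask

app = Flask(__name__)
app.config["SERVER_NAME"] = "example.com"  # hypothetical domain

@app.route("/health_check")
def health_check():
    return "OK"

client = app.test_client()

# Host header matches SERVER_NAME, so routing works.
matching = client.get("/health_check", headers={"Host": "example.com"})
print(matching.status_code)  # 200

# A health check that hits the instance by IP sends a Host header
# that doesn't match SERVER_NAME, so Flask can't route it.
by_ip = client.get("/health_check", headers={"Host": "10.0.0.1"})
print(by_ip.status_code)  # 404
```

With SERVER_NAME removed from the config, both requests return 200, which is why dropping the variable fixed the ELB health checks.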