Scale to Zero

What is Scale to Zero?

While the concept behind scaling to zero is rather straightforward, it comes with some catches. In modern software development, applications tend to have a rather large footprint, which is unattractive in cloud computing due to its per-resource cost. This is why the concept of dynamically scaling an application up and down on demand was introduced into cloud-native environments.

At Gepardec we investigated the European Environmental, Social and Governance (ESG for short) policies that will affect companies in the information technology sector. Companies are obligated to report their CO2 and energy consumption. Therefore, we wanted to try something like this on a small scale: we took one of our applications and measured how much energy it consumes. After that, we tried to scale the application down to a minimum (over time) and produce a finished ESG report at the end.

Our Application

The application we used to test our scaling capabilities was Mega (link). Mega is our end-of-month time check system: every employee can check, validate and approve their time recordings so that upper management can send out the invoices to our clients.

Originally, it was developed as a monolith, and it still is one. While it runs on Quarkus and is a bit more cloud-friendly, it still lacks features like the ability to run multiple instances at the same time, fast startup times, observability, etc.

In our Learning Friday format, we didn't intend to change the current architecture but instead wanted to gather experience scaling an old, monolithic application, as our clients mostly use monolithic architectures. This limited our scaling to 0 and 1 instances.

oc idle

OpenShift ships with a simple command called oc idle. It allows you to temporarily scale a service down to zero until the next request arrives:

oc idle <service>

This is great! We can just specify all the services we want to idle, and they get unidled automatically for us when traffic arrives. However, we would also want to idle the application again after a request has been handled, and repeated idling is not supported by oc idle. There are some other limitations too:

  • It only supports the default HAProxy router
  • It only supports the default OpenShift SDN

We currently use Cilium as SDN, so we have no way to utilize this tool. Either way, we need a tool that provides us with more options.

What is KEDA?

KEDA is an acronym for "Kubernetes Event-driven Autoscaler". As the name suggests, KEDA is used to control the number of instances of an application based on generic events. These events are triggered by so-called Scalers like Kafka, MySQL, Prometheus, etc. Many tools are supported out of the box, and an External Scaler allows you to include events from other sources beyond those: https://keda.sh/docs/2.13/scalers/

It can handle different types of Kubernetes resources as long as they implement the /scale subresource, which allows the number of replicas to be set. So you are not limited to Deployments: you can even scale StatefulSets or your own CRDs with KEDA. The only thing you need is a ScaledObject resource.
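To illustrate what such a resource looks like, here is a generic sketch (not our actual configuration; the names and the Prometheus query are made up) of a ScaledObject that scales a Deployment between 0 and 1 replicas based on a Prometheus metric:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mega-scaledobject        # hypothetical name
  namespace: mega                # hypothetical application namespace
spec:
  scaleTargetRef:
    name: mega                   # the Deployment to scale
  minReplicaCount: 0             # allow scaling down to zero
  maxReplicaCount: 1             # our monolith can only run a single instance
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: 'sum(rate(http_server_requests_seconds_count{app="mega"}[2m]))'
        threshold: "1"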

Scaling based on incoming HTTP requests is not supported out of the box, as this is fairly complex: if you think about it, you need at least something that intercepts and queues requests. So how are we going to implement our solution?

The HTTP Add-on

From the creators of KEDA, there is also the KEDA HTTP Add-on. While still in beta, it implements precisely the use case we require. It provides a reverse proxy (the Interceptor Proxy) through which requests are forwarded to the applications. Incoming requests create an event in an external scaler for KEDA. The HTTP Add-on ships with its own CRD, the HTTPScaledObject, which creates the required ScaledObject, HorizontalPodAutoscaler, etc. The HTTPScaledObject restricts support to Deployment resources only.
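To give an idea of what this looks like, here is a rough sketch of an HTTPScaledObject (illustrative names and host; the exact field names vary slightly between add-on versions):

apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: mega
  namespace: mega
spec:
  hosts:
    - mega.example.com           # host the Interceptor Proxy matches on
  scaleTargetRef:
    name: mega                   # the Deployment to scale
    service: mega                # the Service the interceptor forwards requests to
    port: 8080
  replicas:
    min: 0                       # scale to zero when idle
    max: 1                       # at most one instance for our monolith
  scaledownPeriod: 300           # seconds without traffic before scaling back to zero (recent add-on versions)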

In a simplified sense, this is basically how the HTTP Add-on works. It is not entirely accurate, because the interceptor does not notify the external scaler; rather, the external scaler polls the interceptor to check whether it has pending requests. Still, in our opinion it gives a good overview of how information and requests flow with the HTTP Add-on.

… startup times

MEGA takes some time to start up, which impacts all cold requests (the requests that trigger unidling). In some cases, a request takes up to 22 seconds! We need a way to speed up the startup times.

 

Quarkus itself focuses on optimizing performance and energy efficiency. By default, Quarkus builds a fast-jar, which starts faster and consumes less memory because it indexes classes and resources to avoid lookups. We are already using Quarkus, but we need more speed.

 

We gain speed by throwing all our quality-of-life features out of the window: we build our application natively with GraalVM. This means blazingly fast startup times, but comes with really long build times, limited runtime tooling (no debugger, profiler, etc.) and limited dynamic class loading (reflection must be registered explicitly). Quarkus offers a guide to help you deal with the many limitations of native builds: https://quarkus.io/guides/writing-native-applications-tips

 

To create a running native instance of MEGA, we had to make many changes to work around GraalVM's limitations. Furthermore, compilation is really slow, making the development process a whole lot more "relaxed". An empty Quarkus application we created for comparison starts within 60 ms natively, which is really fast compared to the 900 ms the JVM build takes. However, while our JVM build compiles within 11 s, the native build takes 2 min 30 s!

Hiccups?

We deployed KEDA into a separate namespace, openshift-keda. Therefore, all of our KEDA infrastructure (operator, HTTP Add-on, Interceptor Proxy) is deployed there.

In our case, we could not manage to use Routes in the application namespace and had to move the Route definition into the KEDA namespace; ExternalServices didn't work for some reason. This makes our solution a bit unappealing, as it blurs the scoped definitions across namespaces.
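For illustration, a Route living in the KEDA namespace and pointing at the Interceptor Proxy could look roughly like this (the interceptor Service name and port are the defaults from the official Helm chart and, like the host, an assumption on our part):

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: mega
  namespace: openshift-keda                       # next to the interceptor, not next to the app
spec:
  host: mega.example.com                          # must match a host in the HTTPScaledObject
  to:
    kind: Service
    name: keda-add-ons-http-interceptor-proxy     # default interceptor Service name
  port:
    targetPort: 8080                              # the interceptor's proxy port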

 

Another hiccup was that our timeout limit was too short, so requests could not get through, which had some interesting side effects. If a timeout is reached, you do not get the error code that usually describes it: instead, the Interceptor Proxy returns 403 (we still need to verify this), whereas 408 would usually be used for timeouts.

Thoughts

Implementing Scale to Zero requires taking many things into account. It is a good fit for:

  • Small applications with short startup times
  • Applications that don't need to respond quickly right away (a slow cold request is acceptable)

 

There are many things to think about concerning your application:

  • It shouldn't schedule tasks for itself. We had been using the Quarkus Scheduler to start tasks inside the application; for Scale to Zero, we had to migrate these tasks to Kubernetes CronJobs (a sketch follows after this list).
  • It should only report ready AFTER it has done the necessary initialization. With Scale to Zero, the application has to be able to handle a request right after it starts. We had the problem that it syncs some data for authentication only after it becomes ready, so the first request failed because there were no users yet.
  • The application needs to finish any work triggered by a request within a certain, configurable scaledownPeriod, after which it is scaled back down to zero.
  • How frequently is your application requested? If the scaledownPeriod is configured too short, it can result in high CPU usage because the application has to start up too often.
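As a sketch of the first point above: a task that used to run inside the application via the Quarkus Scheduler can become a Kubernetes CronJob that triggers the application over HTTP, waking it up through the interceptor and letting it scale back down afterwards. The schedule, image and endpoint below are hypothetical:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mega-sync-users            # hypothetical name for a task formerly run in-process
  namespace: mega
spec:
  schedule: "0 5 * * *"            # once a day at 05:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.7.1
              # calling the application through the interceptor route wakes it up,
              # runs the task and lets it idle again after the scaledownPeriod
              args: ["-sf", "https://mega.example.com/api/sync"]   # hypothetical endpoint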

 

Also, there are implications concerning your cluster:

  • Cross-cutting concerns like monitoring or ingress can be affected. For example, Prometheus does not handle scraping well for applications that are absent most of the time.
  • KEDA and the HTTP Add-on might end up using more memory than your cluster would have used without them. This can be the case if you only use them for one application or a small number of smaller applications.
Written by:
Simon, Constantin