The Teruti-Lucas experiment — part 2
A serverless adventure with AWS Lambda
Check part 1 to get the context.
A new architecture based on AWS Lambda
AWS offers a generous Free Tier, and because I’m already a big fan of the whole AWS portfolio, it was an obvious choice.
Here is the first architecture draft based on AWS Lambda and many common AWS services:
AWS offers a Free Tier for AWS Lambda: 400,000 GB-seconds of compute time (up to 3.2 million seconds at the smallest memory size, 128 MB) and 1 million requests per month. More precisely, the memory size you choose for your Lambda functions determines how long they can run under the Free Tier umbrella.
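The relationship between memory size and free compute time is simple arithmetic. The sketch below assumes the published allowance of 400,000 GB-seconds per month; the class and method names are illustrative only.

```java
// Free Tier arithmetic: the monthly allowance is expressed in GB-seconds,
// so the number of free seconds depends on the configured memory size.
final class FreeTierBudget {
    // Assumed monthly Lambda Free Tier compute allowance, in GB-seconds.
    static final long FREE_GB_SECONDS = 400_000L;

    /** Free compute seconds per month for a function configured with memoryMb. */
    static long freeSeconds(int memoryMb) {
        if (memoryMb <= 0) {
            throw new IllegalArgumentException("memoryMb must be positive");
        }
        // GB-seconds divided by memory in GB; multiply first to stay in integer math.
        return FREE_GB_SECONDS * 1024L / memoryMb;
    }

    public static void main(String[] args) {
        for (int mb : new int[] {128, 256, 512, 1024}) {
            System.out.println(mb + " MB -> " + freeSeconds(mb) + " free seconds per month");
        }
    }
}
```

At 128 MB this gives the 3.2 million seconds quoted above; at 512 MB the budget shrinks to 800,000 seconds.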
AWS SNS offers a Free Tier too: the first 1 million monthly requests are free. A topic will provide an asynchronous interaction with my lambda. This is very convenient because I don’t want to block my box while it waits for requests to complete.
My lambda will run every time a notification pops up in the topic.
Event-Driven Architecture is an extremely powerful tool when you need to decouple things: every developer should know its salient characteristics.
Unfortunately, AWS Lambda has no built-in asynchronous support. It’s obviously a long-awaited feature, which is why one or two SNS topics might help.
Crafting the lambda
Coding a lambda in Java is a simple process, but deploying and testing it is not as straightforward as you might think.
The code in itself is quite simple and it is easily built with Maven.
Once you have a fat JAR, you have to deploy it to the AWS Lambda service in the region of your choice. My JAR was 22 megabytes.
Uploading such a JAR without premium access to the chosen AWS region, such as the AWS Direct Connect service, is pretty slow. I suppose the S3 Transfer Acceleration service might help here, but I would like to use my Free Tier as much as possible.
Additionally, you have no direct control over the AWS Lambda service: you can only produce and collect traces and then assess what went wrong. There is no remote debugging here, nor any easy way to get deep access to your running code.
Every time you want to test your lambda after a code update you must upload the fat JAR again: TDD is mandatory to shorten the development lifecycle to a bearable minimum.
AWS offers a new approach to tackling the issue: SAM Local (Beta).
I have not used it because my code was pretty simple: the lambda is a basic adapter between my client and the WMS server. It does not perform anything spectacular, though AWS Lambda’s underlying infrastructure does.
Data and the public cloud conundrum
My first prototype was capable of processing one region, the Bourgogne, in twenty minutes on a decent box. The lambda version will produce the same maps and store them in a private S3 bucket.
But wait a second: this dataset is not meant to be public, nor is it eligible for the public cloud.
Additionally, a private S3 bucket could become public inadvertently; it’s almost a trend these days.
Don’t use real data when you don’t have to: a fake, derived dataset is a proper dataset when it fits the needs
For the sake of data protection, I will use a fake, derived dataset with the same number of points and the same map requirements, which makes no difference to the WMS service.
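Deriving such a dataset can be as simple as shifting every source point by a random offset. This is a minimal sketch under assumptions: the `Point` type, the `derive` method and the `maxOffset` parameter are hypothetical, and the real data model is not shown in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Builds a fake dataset with the same number of points as the source.
final class FakeDataset {

    /** Hypothetical 2D point; coordinates are in whatever unit the source uses. */
    static final class Point {
        final double x;
        final double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns a new list in which each point is shifted on both axes by a
     * random offset whose magnitude never exceeds maxOffset.
     */
    static List<Point> derive(List<Point> source, double maxOffset, Random random) {
        List<Point> derived = new ArrayList<>(source.size());
        for (Point p : source) {
            // nextDouble() is in [0, 1), so each offset lies in [-maxOffset, maxOffset).
            double dx = (random.nextDouble() * 2 - 1) * maxOffset;
            double dy = (random.nextDouble() * 2 - 1) * maxOffset;
            derived.add(new Point(p.x + dx, p.y + dy));
        }
        return derived;
    }
}
```

Whether a given offset protects the data well enough depends on the dataset, so the value of `maxOffset` is a decision to make with the data owner.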
One last principle:
Use your own credentials to consume the relevant services; don’t act on your company’s behalf when you are not asked to do so
Testing the lambda at scale and failing
We should always draw a map of our dependencies and ask ourselves what the SLA is for each dependency.
This is especially relevant with AWS Lambda’s economics. Indeed, you pay for the compute capacity you burn.
Having your requests hang forever, throttled by an external dependency (here the WMS service), is the last thing you want, unless you are willing to burn your Free Tier credit very rapidly.
Every lambda has a maximum execution time allowed and the lower it is the better
The shot failed miserably, but that’s always a chance to learn from stupid mistakes free of charge, for now. Every time a notification was sent, AWS Lambda triggered my lambda, but the WMS service could not deal with the workload and throttled my requests, which in turn throttled my lambda instances or froze the ones that were running.
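One way to avoid paying for a hanging dependency is to bound each outgoing call with its own timeout, well below the lambda’s maximum execution time. The sketch below is not the code I ran: `callWithTimeout` is a hypothetical helper, and the sleeping task in `main` only stands in for a throttled WMS request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Fails fast when an external dependency does not answer in time.
final class BoundedCall {

    /**
     * Runs the call on the given executor and waits at most timeoutMs.
     * On timeout the task is cancelled and the TimeoutException is rethrown.
     */
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> call, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker thread; the task only stops if it honors interruption.
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for a WMS request that hangs for 5 seconds.
            callWithTimeout(executor, () -> { Thread.sleep(5_000); return "map"; }, 200);
        } catch (TimeoutException e) {
            System.out.println("WMS call timed out, failing fast");
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A real HTTP client usually has its own connect and read timeouts, which are preferable when available; the executor-based wrapper is only a generic fallback.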
A simpler architecture
A few moments later, I came up with a new architecture: a simpler one, capable of controlling the request flow.
Now, my box is running a simple Java agent with a pool of threads. Each thread is responsible for a direct execution of the lambda function.
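The agent boils down to a fixed-size thread pool, where the pool size caps the number of lambda executions in flight. Here is a minimal sketch, assuming a hypothetical `invoker` function that stands in for the direct, synchronous invocation of the lambda; `runBatch` and its parameters are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Client-side agent that controls the request flow with a thread pool.
final class LambdaAgent {

    /**
     * Calls invoker once per request using at most `concurrency` threads
     * and returns the results in request order.
     */
    static <R, S> List<S> runBatch(List<R> requests, int concurrency, Function<R, S> invoker)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Callable<S>> tasks = new ArrayList<>();
            for (R request : requests) {
                tasks.add(() -> invoker.apply(request));
            }
            List<S> results = new ArrayList<>();
            // invokeAll blocks until every task has completed and keeps the task order.
            for (Future<S> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each thread waits for its own invocation to return, the WMS service never sees more concurrent requests than the pool size allows.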
One might argue about the absence of a VPC in this architecture: this is still an open question for me. The lambda could run in its own VPC and could use a VPC endpoint to connect to S3 in the same region. I might try it one day.
Performance and costs
Serverless, as its name tells us, implies there are seemingly no servers to manage: it’s supposed to be mainly a development experience with a simple, though sometimes tedious, deployment process. This is a kind of magic.
Unfortunately, magic is not of our world. There are “things” to provision in order to serve the needed compute capacity. This step is known as a cold start; it’s a widely discussed topic in the Serverless community.
A cold start generally occurs when:
- Your lambda is executed for the first time
- Your lambda is in an idle state: remember, AWS Lambda is a managed service in a public cloud. AWS has its own requirements and might detach the compute capacity from your lambda to serve other customers
- Your lambda’s configuration has been updated: you have just changed the memory footprint or anything else (the binary, for instance)
In any case, a cold start harms performance, especially in Java, and might provoke timeout failures, which in turn increase costs. You’ll find nice articles on Medium which delve into this hot topic. For instance, you should have a look at this one from Yan Cui.
I found out that I needed to give AWS Lambda a hint about the size of my swarm of lambda instances.
For instance, my batch needed 50 concurrent lambda executions because the WMS service started to behave badly when this threshold was exceeded. Warming up AWS Lambda with just one call was not enough; it worked better when I triggered 50 concurrent executions several times.
Here are my findings:
- The first round was a full cold start: the first invocation took 1,648 ms, the second 78 ms, but the 10th took 567 ms, the 20th 1,645 ms and the 50th 3,779 ms!
- Then I ran the same round many times: the lambda execution times ranged from 79 ms to 889 ms. It looked like my swarm of lambda instances was ready!
Finally, I added a dedicated attribute named warmingUp to my request model to inform the lambda about this preparation step. The lambda returns as soon as possible in this specific case. That’s clearly a non-functional requirement, like the correlation ID which I added too.
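The warm-up short circuit can be sketched as follows. Only the warmingUp attribute and the correlation ID come from my request model; the `pointId` field, the response type and the `renderAndStore` function (a stand-in for the WMS call and the S3 upload) are hypothetical and kept free of any AWS dependency.

```java
import java.util.function.Function;

// Handler that returns immediately when the request is only a warm-up call.
final class WarmUpAwareHandler {

    /** Request model; pointId is a hypothetical payload field. */
    static final class MapRequest {
        final String correlationId;
        final boolean warmingUp;
        final String pointId;

        MapRequest(String correlationId, boolean warmingUp, String pointId) {
            this.correlationId = correlationId;
            this.warmingUp = warmingUp;
            this.pointId = pointId;
        }
    }

    /** Hypothetical response that echoes the correlation ID for tracing. */
    static final class MapResponse {
        final String correlationId;
        final String status;

        MapResponse(String correlationId, String status) {
            this.correlationId = correlationId;
            this.status = status;
        }
    }

    // Stand-in for the real work; returns the key of the stored object.
    private final Function<MapRequest, String> renderAndStore;

    WarmUpAwareHandler(Function<MapRequest, String> renderAndStore) {
        this.renderAndStore = renderAndStore;
    }

    MapResponse handle(MapRequest request) {
        if (request.warmingUp) {
            // The instance is provisioned by the call itself, so skip the real work.
            return new MapResponse(request.correlationId, "WARM");
        }
        String key = renderAndStore.apply(request);
        return new MapResponse(request.correlationId, "STORED " + key);
    }
}
```

Sending a round of such warm-up requests at the target concurrency, several times, is what provisions the swarm before the real batch starts.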
Every time I launched my batch, I ran the warming-up process beforehand to provision the swarm. Here are the figures for a single launch in eu-west-1:
- One region and four districts
- 2,463 false points derived from the source data (a random offset)
- 12,315 PNG files (A6 300 DPI, 1754x1448) stored in a private S3 bucket
- 35.6 gigabytes
- Around 22 minutes per run
- A warming-up process with a concurrency level of 50
- 512 megabytes per lambda instance
Let’s see the impact on my Free Tier:
One may notice the absence of an S3 line in the Free Tier’s table. I noticed it too, but I found the line in the bill explorer.
One run cost me $0.05, which is pretty cheap.
The lambda architecture ran twice as fast as the first prototype, and the lambda stored the maps in S3. By contrast, the first prototype stored maps on the local drive, which was not a proper way to store or share data.
S3 enables many use cases, sharing in particular: the printing company that needs to download the maps, other institutions involved, and so on.
S3 is able to store objects as large as 5 terabytes and it’s an exabyte-scale service, which means that it will easily store all the maps. Thanks to lifecycle policies, it could push the previous year’s maps to Glacier and reduce the overall storage cost.
Additionally, S3 is a premium storage service which empowers users with many features: 99.999999999% durability, encryption with your own keys, security standards and compliance certifications, and so on.
The new frontier of everything?
Playing with AWS Lambda was great fun and quite informative. It is clearly an easy way to get compute capacity for a very modest amount of money.
The AWS Lambda SDK is well crafted, as is SAM, although this stack has a few downsides.
Firstly, you have to manage a compute capacity by new and subtle means: it’s the cold start paradigm.
In spite of the heuristic approach implemented in the service and even if you don’t operate “servers” yourself, you will try to influence the underpinning system by using your own heuristic method. It might look like a game of cat and mouse sometimes. The lambda execution is documented here and reveals some interesting internal mechanisms.
Secondly, the developer experience is still a work in progress.
AWS made some strong announcements at re:Invent 2017, including a lambda-friendly version of Cloud9, although the whole environment seems to imply a strong binding to many AWS services. Some companies might not be able to adopt a new and dedicated set of tools or increase their reliance on the AWS ecosystem.
Thirdly, AWS Lambda might not be for you if you are looking for low latency and constant throughput.
You don’t manage the infrastructure and it’s shared with other customers.
Finally, AWS Lambda made a terrific impression on me.
The economics of AWS Lambda are impossible to ignore and the lack of heavyweight operational tasks is appealing.
I assume the industry will find a way to rapidly standardize an enriched programming model and a runtime, like the big providers did for containerization.
For instance, it would be useful to have an API to clearly tell the FaaS platform about a workload’s characteristics; it could then run a perfect warm-up instead of a blind heuristic method.
The cloud architecture delivered what I expected. AWS Lambda is a great service though there is a learning curve and some tricky subtleties.
Developing applications for a public cloud can be frustrating and sometimes slow. There is a long feedback loop because we code locally and deploy remotely on an infrastructure we don’t own or manage. The alternative, such as the Cloud9 service, looks like a big step towards complete vendor lock-in. I might give SAM Local a try next time.
For me, this is one of the biggest issues, and it might push developers towards the lightest stacks and towards more sophisticated and neutral IDEs. Java 9 should bring some improvements on the Java side thanks to its new features.
Additionally, it leads to best practices, such as TDD, being enforced, which is clearly a good thing: debugging features such as X-Ray and logs should be used as a last resort.
The Free Tier enables rich experiments and might be sufficient for real production workloads. Anyway, AWS Lambda’s economics are a cost killer that will be very hard for most companies to ignore, because their competitors might not.
Finally, I wish the industry would be wise enough to offer some level of standardization supported by the big three: AWS, Google, Microsoft.