Run LLMs on AWS Lambda: Scalable Serverless Inference Made Easy

Nikolay Penkov · December 26, 2024

Why Use Serverless?

AWS Lambda is a serverless computing service that offers auto-scaling, cost-effectiveness, and ease of maintenance, making it attractive for businesses of all sizes.

One of the most exciting applications of AWS Lambda is its role in enabling serverless inference of ML models. This innovative approach empowers businesses to harness the potential of AI and ML without the need for complex infrastructure management.

In this post, we’ll see how AWS Lambda can help us deploy Llama 2 for serverless inference.

The GitHub repo containing the code for this tutorial is available here: https://github.com/penkow/llama-lambda


AWS Lambda lets us deploy our code without setting up and hosting a server. Beyond fast deployment, this brings several other benefits for developers. The three main advantages for our use case are:


  1. Scalability: Serverless platforms automatically handle the scaling of your application. They can quickly adjust the allocated resources based on the load, so your application is able to handle sudden spikes in traffic without manual intervention.
  2. Cost efficiency: Traditional server-based setups require you to provision and pay for fixed resources, even if they’re not fully utilized. Serverless allocates cloud resources on demand and releases them when there is no traffic. This way you pay only for the compute resources you actually use.
  3. Reduced Operations Overhead: Serverless platforms hide the complexity of the infrastructure management. You don’t have to worry about provisioning, patching, or maintaining servers. This reduces the operational overhead and allows developers to focus more on writing code.

These three points are exactly what we need for a scalable and cost-efficient deployment of Llama 2.

Is a High-Grade GPU Required for Llama 2?

This question probably comes to mind, considering that AWS Lambda runs on CPUs and doesn’t provide GPU access. Vanilla LLMs are resource-hungry and generally require high-end GPUs to run. However, thanks to quantization, we can now also run them locally on CPUs.

In our case, llama.cpp is the solution to this problem: it is a C++ port of the inference code that runs quantized versions of the model efficiently on CPUs. We are going to use it through its Python bindings, llama-cpp-python.

To use llama.cpp you will need a quantized version of the model, which you can either create yourself or download pre-quantized, for example from TheBloke on Hugging Face.
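If you go the pre-quantized route, you can fetch the GGUF model file programmatically with the huggingface_hub package. Below is a minimal sketch; the repository and file names are examples and may differ for the model variant you pick:

# Sketch: download a pre-quantized GGUF model file into ./model/
# (repo_id and filename are example values; adjust them to the model you want)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",   # example repository
    filename="llama-2-7b-chat.Q4_K_M.gguf",    # example quantization level
    local_dir="./model",                       # folder the Dockerfile copies the model from
)
print(f"Model downloaded to: {model_path}")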

For this tutorial you will need:

  • An active AWS subscription (which could incur cost)
  • Docker
  • AWS CLI (configured for your AWS subscription)
  • (Optional) Terraform (for automatic deployment)

Assuming that you have everything installed and configured, let’s get started.

Creating an AWS Lambda Image

AWS Lambda provides the option to upload your code as a zip archive or to provide it as a Docker image. The former method has a deployment package size limit of 250 MB, which is a problem given the model size and the additional libraries we need. The latter is also limited, but Docker images can be up to 10 GB in size, which covers our needs. In the following, we will create a Docker image that contains the code, the required libraries, and the Llama 2 model itself.

Inference code

Save the following code as app.py:

from llama_cpp import Llama
import os

MODEL_NAME = os.environ['MODEL_NAME']

llm = Llama(model_path=f"./model/{MODEL_NAME}")

def handler(event, context):
    prompt = event['prompt']
    max_tokens = int(event['max_tokens'])
    output = llm(prompt, max_tokens=max_tokens, echo=True)
    return output

Here we load the model, which is selected via the MODEL_NAME environment variable. We are going to set this variable when we build the image, so that the model selection stays configurable. We expose only the max_tokens input parameter, but the model has additional parameters that are important and should be configured to match your goals (see the sketch below). Get in touch with me if you want to find out more.
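For example, llama-cpp-python lets you configure the context window and CPU thread count when the model is loaded, as well as sampling parameters per request. Here is a sketch of how the handler above could be extended; the parameter values are illustrative, not recommendations:

from llama_cpp import Llama
import os

MODEL_NAME = os.environ['MODEL_NAME']

# n_ctx sets the context window size, n_threads the number of CPU threads used
llm = Llama(model_path=f"./model/{MODEL_NAME}", n_ctx=2048, n_threads=6)

def handler(event, context):
    prompt = event['prompt']
    max_tokens = int(event['max_tokens'])
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=0.7,  # lower values make the output more deterministic
        top_p=0.95,       # nucleus sampling threshold
        echo=True,        # include the prompt in the returned text
    )
    return output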

Dockerfile

Save this configuration as Dockerfile (without extension)

FROM public.ecr.aws/lambda/python:3.8

RUN yum install -y \
    autoconf \
    automake \
    cmake \
    gcc \
    gcc-c++ \
    libtool \
    make \
    nasm \
    pkgconfig

WORKDIR ${LAMBDA_TASK_ROOT}

RUN pip3 install llama-cpp-python

# Specify the name of your quantized model here
ENV MODEL_NAME=**your quantized model here**

# Copy the inference code
COPY app.py ${LAMBDA_TASK_ROOT}

# Copy the model file
COPY ./model/${MODEL_NAME} ${LAMBDA_TASK_ROOT}/model/${MODEL_NAME}

# Set the CMD to your handler
CMD [ "app.handler" ]

Important: The Dockerfile will try to copy the file specified in the MODEL_NAME variable from a folder called model in the same directory as the Dockerfile. Make sure you follow this convention or the image build will fail.
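For reference, the project layout should look roughly like this (the model file name below is just an example and must match MODEL_NAME):

.
├── Dockerfile
├── app.py
└── model/
    └── llama-2-7b-chat.Q4_K_M.gguf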

Build and deploy

Note: In the following steps, replace <your-region>, <account-id>, and <your-iam-role-arn> with your own values.

Create an AWS ECR Repository to host the AWS Lambda Image

ECR is short for Elastic Container Registry and, as the name suggests, it is where Docker images are stored in AWS. To store an image, we first have to create a repository. Let’s create one called llama-lambda using the AWS CLI:

aws ecr create-repository --repository-name llama-lambda

Build the image

To build the image, run the following commands in the folder containing your Dockerfile. First, authenticate Docker against the ECR registry you just created:

aws ecr get-login-password --region <your-region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<your-region>.amazonaws.com

Now you can build and tag your image:

docker build -t llama-lambda .
docker tag llama-lambda:latest <account-id>.dkr.ecr.<your-region>.amazonaws.com/llama-lambda:latest
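Optionally, you can sanity-check the image locally before pushing it. The AWS Lambda base images ship with the Runtime Interface Emulator, so you can run the container and send it a test event (the prompt below is just an example):

docker run -p 9000:8080 llama-lambda:latest

# In a second terminal, send a test invocation to the emulator endpoint:
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" \
  -d '{"max_tokens": "128", "prompt": "What is a Large Language Model?"}'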

Upload the image

Push the created image to ECR with the following command:

docker push <account-id>.dkr.ecr.<your-region>.amazonaws.com/llama-lambda:latest

Deploy the image as an AWS Lambda function

First, we have to create an IAM role for our function:

aws iam create-role \
  --role-name llama-lambda-role \
  --description "Role for llama-lambda function" \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "lambda.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'
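Optionally, if you want the function’s logs to show up in CloudWatch, you can additionally attach the managed basic execution policy to the role. It is not strictly required for this tutorial, but it helps with debugging:

aws iam attach-role-policy \
  --role-name llama-lambda-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole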

Next we have to find out the role ARN:

aws iam get-role --role-name llama-lambda-role

In the output JSON there should be an “Arn” field where you will see the ARN of the role.
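Alternatively, you can extract just the ARN with the CLI’s --query option:

aws iam get-role --role-name llama-lambda-role --query 'Role.Arn' --output text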

We can then create our Lambda function. Make sure you replace <your-iam-role-arn> with the ARN you retrieved:

aws lambda create-function \
  --function-name llama-lambda-function \
  --image-uri <account-id>.dkr.ecr.<your-region>.amazonaws.com/llama-lambda:latest \
  --role <your-iam-role-arn> \
  --package-type Image \
  --timeout 900 \
  --memory-size 10240

Note: We are setting the timeout to 900 seconds (15 minutes) and the memory allocation to 10 GB, which are the maximum values that AWS Lambda supports. They should be sufficient, but bear in mind that larger models may require more resources.

Invoking the AWS Lambda function

And that’s it: you can now invoke your Llama 2 AWS Lambda function with a custom prompt. Assuming that you’ve deployed the chat version of the model, here is an example invocation:

aws lambda invoke --function-name llama-lambda-function --payload '{"max_tokens": "512", "prompt": "<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>> What is a Large Language Model? [/INST]"}' response.json
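Note: if you are using AWS CLI v2, the payload is treated as base64-encoded by default, so you may need to add --cli-binary-format raw-in-base64-out to the command above. The model’s response is written to response.json, which you can open once the invocation finishes.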

Final words

In this short tutorial you’ve learned how to deploy Llama 2 on AWS Lambda for serverless inference. Running LLMs as AWS Lambda functions provides a cost-effective and scalable infrastructure with minimal configuration. The main caveat is inference speed: running Llama 2 on a CPU can lead to long inference times, depending on your prompt and the configured model context length. The advantage shows when prompts are executed in parallel and AWS Lambda scales out to handle all requests.

For use cases that don’t require real-time responses, however, this solution is a great fit thanks to its low cost and minimal deployment configuration.

In addition, AWS API Gateway can be used to expose the created Lambda function as a REST API to the internet. That is a topic for another post, but if you want to find out how to do it, don’t hesitate to reach out :).
