How to Run Llama 2 Locally on CPU and Serve it as a Docker Container: Step-by-Step Guide

Nikolay Penkov · December 25, 2024

In today’s digital landscape, large language models are becoming increasingly widespread, revolutionizing the way we interact with information and AI-driven applications. What’s more, the availability of open-source alternatives to their paid counterparts empowers enthusiasts and developers to harness the power of these models without breaking the bank.

In this post, we’ll explore one of the leading open-source models in this domain: Llama 2. In this tutorial you’ll learn how to run Llama 2 locally and how to package it as a Docker container, providing a fast and efficient deployment solution. So, let’s embark on this exciting adventure and unlock the world of large language models with Llama 2!

Llama 2

Tools and Models

Llama 2 was created by Meta and published under its community license, so you have to read and comply with the model’s Terms and Conditions before using it. You can find out more about the models on the official page: https://ai.meta.com/llama/

The Llama Model

Model size

The Llama 2 model comes in multiple sizes. You are going to see 3 versions of the model: 7B, 13B, and 70B, where B stands for billions of parameters. As you can guess, the larger the model, the better the performance, but also the greater the resource needs.

Model type

Besides the model size, you will see that there are 2 model types:

  • Llama 2: the text completion version of the model, which doesn’t have a specific prompt template
  • Llama 2 Chat: the fine-tuned version of the model, which was trained to follow instructions and act as a chatbot. This version needs a specific prompt template to perform at its best, which we are going to discuss below.

Getting the model

All Llama 2 models are available on HuggingFace. In order to access them, you will have to apply for access by accepting the terms and conditions. Let’s take Llama 2 7B Chat as an example. After opening the page you will see a form where you can apply for model access. Once your request is approved, you will be able to download the model using your HuggingFace access token.

In this tutorial we are interested in the CPU version of Llama 2. Usually, big and performant deep learning models require high-end GPUs to run. However, thanks to the excellent work of the community, we have llama.cpp, which does the magic and allows running Llama models solely on your CPU. How this magic is done will be discussed in a future post. For now, you only need to know that llama.cpp applies quantization to compress the models into the GGUF format, which reduces their size and resource needs.

Thanks to TheBloke, there are already pre-quantized models which can be used directly with the mentioned framework. Let’s get our hands dirty and download the Llama 2 7B Chat GGUF model. After opening the page, download the llama-2-7b-chat.Q2_K.gguf file, which is the most compressed version of the 7B chat model and requires the least resources.
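
If you prefer to script the download, here is a minimal sketch using the huggingface_hub package (install it with pip install huggingface_hub). The repository and file names below are taken from TheBloke’s model page; a token is only needed if the repository is gated:

from huggingface_hub import hf_hub_download

# Download the quantized GGUF file into the current directory
model_file = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q2_K.gguf",
    local_dir=".",  # save it next to your scripts
)
print(model_file)  # path of the downloaded file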

Python bindings for llama.cpp

We are going to write our code in Python, so we need a way to call llama.cpp from Python. The community has this covered with a project called llama-cpp-python, which can be installed with pip install llama-cpp-python.
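
As a quick sanity check that the bindings installed correctly, you can try importing them (a tiny sketch; the version attribute may vary between releases):

# Verify that the llama.cpp Python bindings can be imported
import llama_cpp

print(getattr(llama_cpp, "__version__", "version attribute not available"))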

Docker (Optional)

If you are going to deploy your model as a Docker container, you will need Docker running on your system. We are going to use Docker Desktop in this tutorial, so make sure it is installed.

Running the model

Let’s see how we can run the model by analyzing the following Python script:

from llama_cpp import Llama

# Put the path to the GGUF model that you've downloaded from HuggingFace here
model_path = "**path to your llama-2-7b-chat.Q2_K.gguf**"

# Create a llama model
model = Llama(model_path=model_path)

# Prompt creation
system_message = "You are a helpful assistant"
user_message = "Generate a list of 5 funny dog names"

prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""

# Model parameters
max_tokens = 100

# Run the model
output = model(prompt, max_tokens=max_tokens)

# Print the generated completion
print(output)

As you can see, the compressed model is loaded with the Python bindings library by simply passing the path to the GGUF file.

The model prompt is also very important. You can see some special tokens such as <s>, [INST] and <<SYS>>. For now, remember that they have to be present and the prompt has to follow the given template. More information on their role will be given soon.

In the prompt there are two user inputs:

  • The system message, which can be used to instill specific knowledge or constraints into the LLM. Alternatively, it can be omitted, and the model will fall back to the default system message it was trained with.

  • The user message, which is the actual user prompt. Here you define the concrete task that you want the model to perform (e.g. code generation, or generating funny dog names in our case).

Lastly, the parameter max_tokens determines how many tokens the model will generate. For now, think of tokens roughly as words, but bear in mind that this is not always the case, as some words are split into multiple tokens.
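
If you are curious how words map to tokens, here is a rough sketch (assuming the model object from the script above) that counts tokens with the tokenizer bundled in the GGUF file:

# Compare word count with token count for a sample sentence
text = "Generate a list of 5 funny dog names"
tokens = model.tokenize(text.encode("utf-8"))
print(f"{len(text.split())} words -> {len(tokens)} tokens")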

There are additional parameters that can be configured for more advanced use cases, but they will be discussed in separate posts.
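
As a teaser, here is a hedged sketch of how some of those parameters could be passed with llama-cpp-python, assuming the model_path and prompt variables from the script above; the values are purely illustrative, not recommendations:

# n_ctx sets the context window size (in tokens) when loading the model
model = Llama(model_path=model_path, n_ctx=2048)

# Sampling parameters are passed at generation time
output = model(
    prompt,
    max_tokens=100,
    temperature=0.7,  # higher values produce more random output
    top_p=0.9,        # nucleus sampling threshold
    top_k=40,         # sample only from the 40 most likely tokens
)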

Now let’s save the code as llama_cpu.py and run it with python llama_cpu.py. The script loads the model, generates a completion for the prompt, and prints the result to the console.

Serving the model with Flask

Next, let’s wrap the model in a small Flask HTTP server so that other applications can talk to it (install Flask with pip install Flask). Save the following code as llama_cpu_server.py:

from flask import Flask, request, jsonify
from llama_cpp import Llama

# Create a Flask object
app = Flask("Llama server")
model = None


@app.route('/llama', methods=['POST'])
def generate_response():
    global model

    try:
        data = request.get_json()

        # Check if the required fields are present in the JSON data
        if 'system_message' in data and 'user_message' in data and 'max_tokens' in data:
            system_message = data['system_message']
            user_message = data['user_message']
            max_tokens = int(data['max_tokens'])

            # Prompt creation
            prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""

            # Create the model if it was not previously created
            if model is None:
                # Put the path to the GGUF model that you've downloaded from HuggingFace here
                model_path = "**path to your llama-2-7b-chat.Q2_K.gguf**"

                # Create the model
                model = Llama(model_path=model_path)

            # Run the model
            output = model(prompt, max_tokens=max_tokens, echo=True)

            return jsonify(output)
        else:
            return jsonify({"error": "Missing required parameters"}), 400

    except Exception as e:
        return jsonify({"Error": str(e)}), 500


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)

Running python llama_cpu_server.py will start a local server on port 5000. You can then make a POST request with cURL as follows (or use Postman instead 🙂):

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 100
  }' \
  http://127.0.0.1:5000/llama

You can play around with different input parameters to see how the model behaves.
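
If you prefer Python over cURL, a minimal client sketch using the requests library (pip install requests) could look like this; it assumes the OpenAI-style completion dictionary that llama-cpp-python returns:

import requests

payload = {
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 100,
}

response = requests.post("http://127.0.0.1:5000/llama", json=payload)
result = response.json()

# Because the server passes echo=True, the text contains the prompt followed by the completion
print(result["choices"][0]["text"])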

Dockerizing the server

Now that we have a working Flask server that hosts our model, we can create a Docker container. For this we have to build a Docker image that contains the model and the server logic:

# Use python as base image
FROM python

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY ./llama_cpu_server.py /app/llama_cpu_server.py
COPY ./llama-2-7b-chat.Q2_K.gguf /app/llama-2-7b-chat.Q2_K.gguf

# Install the needed packages
RUN pip install llama-cpp-python
RUN pip install Flask

# Expose port 5000 outside of the container
EXPOSE 5000

# Run llama_cpu_server.py when the container launches
CMD ["python", "llama_cpu_server.py"]

Save this configuration as Dockerfile in the same folder as llama_cpu_server.py.


NOTE: Make sure that the model file llama-2-7b-chat.Q2_K.gguf and the server file llama_cpu_server.py are in the same directory as the Dockerfile. Also make sure that the model path specified in llama_cpu_server.py is relative to the llama_cpu_server.py file, or the container build and run will fail!
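
For example, since the Dockerfile copies the model into /app next to the server script and sets /app as the working directory, a relative path like this inside llama_cpu_server.py works in the container (and locally, as long as you start the script from that folder):

# Model file located in the same directory as llama_cpu_server.py
model_path = "./llama-2-7b-chat.Q2_K.gguf"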

Afterwards you can build and run the Docker container with:

docker build -t llama-cpu-server .
docker run -p 5000:5000 llama-cpu-server

The docker build command creates a Docker image from the Dockerfile, and docker run starts a container with port 5000 published on your host (i.e. reachable locally on your network). You can now make POST requests to the same endpoint as before:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 100
  }' \
  http://127.0.0.1:5000/llama

Congrats, you have your own locally hosted Llama 2 Chat model now, which you can use for any of your needs 🙌.

Running in production

This tutorial showed how to deploy Llama 2 locally as a Docker container. However, this is not production-ready code. As you saw, we are running Flask in debug mode, and we are not exposing all model parameters such as top_p, top_k, temperature, and n_ctx. The model also does not use its full chat capabilities: there is no user session, so previous context is lost after every request. Another thing to mention is that this solution is not scalable in its current form, and parallel requests may break the server.
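
To give an idea of what stateful chat could look like, here is a rough sketch that keeps a message history in memory using llama-cpp-python’s chat API (assuming your installed version provides create_chat_completion and applies the Llama 2 chat template for this GGUF model):

from llama_cpp import Llama

model = Llama(model_path="./llama-2-7b-chat.Q2_K.gguf")

# Conversation history, starting with the system message
messages = [{"role": "system", "content": "You are a helpful assistant"}]

def chat(user_message: str) -> str:
    # Append the user turn, generate a reply, and keep both in the history
    messages.append({"role": "user", "content": user_message})
    result = model.create_chat_completion(messages=messages, max_tokens=100)
    reply = result["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("Generate a list of 5 funny dog names"))
print(chat("Now pick your favourite one"))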
