In today’s digital landscape, large language models are becoming increasingly widespread, revolutionizing the way we interact with information and AI-driven applications. What’s more, the availability of open-source alternatives to their paid counterparts is empowering enthusiasts and developers to harness the power of these models without breaking the bank.
In this post, we’ll explore one of the leading open-source models in this domain: Llama 2. You’ll learn how to run Llama 2 locally and how to package it in a Docker container for fast and simple deployment. So, let’s embark on this exciting adventure and unlock the world of large language models with Llama 2!
Llama 2 was created by Meta and published with an open-source license; however, you have to read and comply with the model’s Terms and Conditions. You can find out more about the models on the official page: https://ai.meta.com/llama/
The Llama 2 model comes in multiple forms. There are 3 sizes: 7B, 13B, and 70B, where B stands for billions of parameters. As you can guess, the larger the model, the better the performance, but also the greater the resource needs.
Besides the model size, there are 2 model types:
- Base (pretrained) models, e.g. Llama 2 7B, which are plain text-completion models.
- Chat models, e.g. Llama 2 7B Chat, which are fine-tuned for dialogue and are the ones you typically want for an assistant-style application.
All Llama 2 models are available on HuggingFace. In order to access them, you have to request access by accepting the terms and conditions. Let’s take Llama 2 7B Chat as an example. After opening the page you will see a form where you can apply for model access. Once your request is approved, you will be able to download the model using your HuggingFace access token.
In this tutorial we are interested in the CPU version of Llama 2. Usually, big and performant deep learning models require high-end GPUs to run. However, thanks to the excellent work of the community, we have llama.cpp, which does the magic and allows running Llama models solely on your CPU. How this magic is done will be discussed in a future post. For now, you only need to know that llama.cpp applies a custom quantization approach that compresses the models into the GGUF format. This reduces their size and resource needs.
Thanks to TheBloke, there are already pre-made models which can be used directly with the mentioned framework. Let’s get our hands dirty and download the Llama 2 7B Chat GGUF model. After opening the page, download the llama-2-7b-chat.Q2_K.gguf file, which is the most compressed version of the 7B chat model and requires the least resources.
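If you prefer to script the download, the huggingface_hub package can fetch the file for you. The following is a minimal sketch, assuming the repository and file names on the model page have not changed; double-check them before running it.

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Download the quantized chat model from TheBloke's repository
# and return the local path of the cached file
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q2_K.gguf",
)
print(model_path)
```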
We are going to write our code in Python, so we need to use llama.cpp in a pythonic way. This was also considered by the community and exists as a project called llama-cpp-python, which can be installed with pip install llama-cpp-python.
If you are going to deploy your model as a Docker container, you will need Docker running on your system. We are going to use Docker Desktop in this tutorial, so make sure it is installed.
Let’s see how we can run the model by analyzing the following Python script:
```python
from llama_cpp import Llama

# Put the path to the GGUF model that you downloaded from HuggingFace here
model_path = "path/to/llama-2-7b-chat.Q2_K.gguf"

# Create a llama model
model = Llama(model_path=model_path)

# Prompt creation
system_message = "You are a helpful assistant"
user_message = "Generate a list of 5 funny dog names"

prompt = f"""<s>[INST]
<<SYS>>
{system_message}
<</SYS>>
{user_message}
[/INST]"""

# Model parameters
max_tokens = 100

# Run the model
output = model(prompt, max_tokens=max_tokens)

# Print the generated text
print(output["choices"][0]["text"])
```
As you can see, the compressed model is loaded with the Python bindings library by simply passing the path to the GGUF file.
The model prompt is also very important. You can see some special tokens such as `<s>`, `[INST]` and `<<SYS>>`. For now, remember that they have to be present and the prompt has to follow the given template. More information on their role will be given soon.
In the prompt there are two user inputs:
- The system message, which can be used to instill specific knowledge or constraints into the LLM. It can also be omitted, in which case the model falls back to the system message it was trained on.
- The user message, which is the actual user prompt. Here you define the concrete task that you want the model to do (e.g. code generation, or generating funny dog names in our case). The small helper below shows how these two pieces end up in the template.
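To make the template concrete, here is a small illustrative helper (build_prompt is not part of llama-cpp-python, just a name used for this sketch) that assembles the prompt from the two inputs:

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Wrap the system and user messages in the Llama 2 chat template."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_message}\n"
        "<</SYS>>\n"
        f"{user_message} [/INST]"
    )

print(build_prompt("You are a helpful assistant",
                   "Generate a list of 5 funny dog names"))
```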
Lastly, the parameter max_tokens determines how many tokens the model will generate. For now, think of tokens roughly as words, but bear in mind that this is not always the case, since some words are represented by multiple tokens.
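If you want to check this for yourself, llama-cpp-python exposes the model’s tokenizer. The snippet below is a small sketch that reuses the model object created earlier (note that tokenize expects bytes):

```python
# Compare the word count with the token count for the same text
text = "Generate a list of 5 funny dog names"
tokens = model.tokenize(text.encode("utf-8"))
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```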
There are additional parameters which can be configured for more advanced cases, but they will be discussed in separate posts.
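As a quick preview, a few of the most common sampling parameters can be passed directly to the model call; the values shown here are only illustrative defaults:

```python
output = model(
    prompt,
    max_tokens=100,
    temperature=0.7,  # randomness of the sampling
    top_p=0.9,        # nucleus sampling threshold
    top_k=40,         # sample only from the top-k most likely tokens
)
# The generated text is in the "choices" field of the output
print(output["choices"][0]["text"])
```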
Now let’s save the code as llama_cpu.py and run it with python llama_cpu.py.
Serving the model with Flask
To make the model accessible over HTTP, we can wrap it in a small Flask server that exposes a /llama endpoint. Save the following script as llama_cpu_server.py:
```python
from flask import Flask, request, jsonify
from llama_cpp import Llama

# Create a Flask object
app = Flask("Llama server")
model = None


@app.route('/llama', methods=['POST'])
def generate_response():
    global model
    try:
        data = request.get_json()

        # Check if the required fields are present in the JSON data
        if 'system_message' in data and 'user_message' in data and 'max_tokens' in data:
            system_message = data['system_message']
            user_message = data['user_message']
            max_tokens = int(data['max_tokens'])

            # Prompt creation
            prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""

            # Create the model if it was not previously created
            if model is None:
                # Path to the GGUF model downloaded from HuggingFace
                # (the relative path matches the location used in the Dockerfile)
                model_path = "./llama-2-7b-chat.Q2_K.gguf"

                # Create the model
                model = Llama(model_path=model_path)

            # Run the model
            output = model(prompt, max_tokens=max_tokens, echo=True)

            return jsonify(output)
        else:
            return jsonify({"error": "Missing required parameters"}), 400

    except Exception as e:
        return jsonify({"Error": str(e)}), 500


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
```
This will start a local server on port 5000. You can make a POST request with cURL as follows (or use Postman instead 🙂):
```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 100
  }' \
  http://127.0.0.1:5000/llama
```
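If you prefer to test the endpoint from Python instead of cURL, a sketch with the requests library (assuming the server above is running locally) looks like this:

```python
import requests

# Send the same payload as the cURL example to the /llama endpoint
response = requests.post(
    "http://127.0.0.1:5000/llama",
    json={
        "system_message": "You are a helpful assistant",
        "user_message": "Generate a list of 5 funny dog names",
        "max_tokens": 100,
    },
)
print(response.json())
```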
You can play around with different input parameters to see how the model behaves.
Dockerizing the server
Now that we have a working Flask server that hosts our model, we can create a Docker container. For this we have to build a Docker image that contains the model and the server logic:
```dockerfile
# Use Python as the base image
FROM python

# Set the working directory in the container
WORKDIR /app

# Copy the server script and the model into the container at /app
COPY ./llama_cpu_server.py /app/llama_cpu_server.py
COPY ./llama-2-7b-chat.Q2_K.gguf /app/llama-2-7b-chat.Q2_K.gguf

# Install the needed packages
RUN pip install llama-cpp-python
RUN pip install Flask

# Document that the server listens on port 5000
EXPOSE 5000

# Run llama_cpu_server.py when the container launches
CMD ["python", "llama_cpu_server.py"]
```
Save the specified configuration as “Dockerfile” (with no extension) in the same folder as llama_cpu_server.py. Note that the llama-2-7b-chat.Q2_K.gguf file has to be in that folder as well, because the Dockerfile copies it into the image.
Afterwards you can build and run the Docker container with:
```bash
docker build -t llama-cpu-server .
docker run -p 5000:5000 llama-cpu-server
```
The Dockerfile builds a Docker image, and running it starts a container with port 5000 published on your local machine. You can now make POST requests to the same endpoint as before:
```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 100
  }' \
  http://127.0.0.1:5000/llama
```
Congrats, you have your own locally hosted Llama 2 Chat model now, which you can use for any of your needs 🙌.
This tutorial showed how to deploy Llama 2 locally as a Docker container. However, this is not production-ready code. As you saw, we are running Flask in debug mode, and we are not exposing all model parameters such as top_p, top_k, temperature, and n_ctx. The model is also not using its full chat capabilities, because there is no user session: the previous context is lost after every request. Another thing to mention is that this solution is not scalable in its current form, and parallel requests may break the server.
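As a hint of how the endpoint could be extended, the fragment below sketches one way to read optional sampling parameters from the request body with fallback defaults; the parameter names follow llama-cpp-python, but treat this as a starting point rather than production code:

```python
def sampling_params(data: dict) -> dict:
    """Read optional sampling parameters from the request body, with defaults."""
    return {
        "temperature": float(data.get("temperature", 0.7)),
        "top_p": float(data.get("top_p", 0.9)),
        "top_k": int(data.get("top_k", 40)),
    }

# Inside generate_response() the call would then become:
#   output = model(prompt, max_tokens=max_tokens, **sampling_params(data))
```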