Running your own FOSS LLM
This is an attempt at an explanation for end users, presented as a 15-minute talk at the Bolzano meetup for the Italian Linux Day 2024.
- Don’t Panic: Machine Learning in 3 Minutes
- Don’t Panic: Large Language Models in 4 Minutes
- Getting to the Point: Running your own FOSS LLM in 8 Minutes
Don’t Panic: Machine Learning in 3 Minutes
Not Machine Learning
Example: we need a program that calculates the time it takes for an object to hit the ground when dropped from a specified height.
Input → Program → Output
In this case:
Height in meter → Program → Time to hit the ground in seconds
You need a programmer that understands the domain of the problem (in this case: physics). The programmer will write something like this:
import math
g = 9.81 # gravitational acceleration in m/s^2
h = float(input("height: "))
t = math.sqrt(2.0 * h / g) # yo, Galileo
print(f"{t} seconds")
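A quick sanity check of the formula: for a height of 56 m (roughly the height of the Leaning Tower of Pisa, which will come up again in a moment), t = sqrt(2 · 56 / 9.81) ≈ 3.4 seconds, which matches reality well enough if you ignore air resistance.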
Machine Learning
OK, so we all know children learn how to catch a ball by solving kinematic differential equations. Not.
There must be another way. Learning by example!
Step 1: Getting Examples
We go to the Leaning Tower of Pisa and measure drops from different heights.
Step 2: Training
Example Data → Model being trained
The trained model is a set of numbers that describe the red curve. These numbers are called parameters.
Step 3: Inference
Height in meter → Trained Model → Time to hit the ground in seconds
Inference means applying the model for some given input. We determine the value of the red curve for the given input.
So. You still need a programmer to implement the code for model training and inference. But she can rely on generic libraries and doesn’t need to have profound domain-specific knowledge.
In pseudocode, you would have something like this:
# training (do once)
model = define_model_type_and_size()
model.fit(example_data)
model.save_to_disk()
# inference (using the model)
model = load_trained_model_from_disk()
h = float(input("height: "))
t = model.predict(h)
print(f"{t} seconds")
We can now compute those free fall times without knowing anything about the physics of falling objects.
Note that if you use a model trained by somebody else, you just need the inference step.
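For the curious, here is a minimal runnable version of that pseudocode. It is only a sketch: the choice of scikit-learn and NumPy is mine (the pseudocode above doesn’t prescribe any library), and the example measurements are illustrative values that follow the free-fall formula. Saving and loading the model to disk is left out to keep it short.

import numpy as np
from sklearn.linear_model import LinearRegression

# "Example data": drop heights in meters and measured fall times in seconds.
heights = np.array([[5.0], [10.0], [20.0], [40.0], [56.0]])
times = np.array([1.01, 1.43, 2.02, 2.86, 3.38])

# Training (do once): fit a simple model. We make life easy for it by
# fitting time^2 against height, which is a linear relationship; a more
# general ML library would fit a flexible curve directly.
model = LinearRegression()
model.fit(heights, times ** 2)

# Inference (using the model): predict the fall time for a new height.
h = float(input("height: "))
t = float(np.sqrt(model.predict([[h]])[0]))
print(f"{t} seconds")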
Does this work?
Absolutely! Machine Learning techniques, neural networks in particular, can do things that we don’t know how to do with the “traditional” approach. We can even tackle problem domains so complex there has been no Galileo yet to break the problem down into basic steps!
Some examples include:
- recognize objects in images with human-level accuracy,
- translate texts,
- evaluate a position in the game of Go at the level of a world-class player.
Machine learning models can do all these and much more.
Note that for a computer everything (images, texts or positions in a game of Go) can be represented as numbers. Of course, a trained model that classifies objects in images is a “red curve” in a high-dimensional space, and it might take tens of millions of parameters to describe that particular red curve. The idea is still the same.
Don’t Panic: Large Language Models in 4 Minutes
The models
Large Language Models (LLMs) are models that have been trained on a huge amount of text. They take text as input and generate output that completes the input text.
Input text → Large Language Model → Text Completion
Texts are encoded as numbers called tokens. A token usually corresponds to a few characters, depending on the model.
As an example, if you give such a model the input “compare apples and” and ask it to predict the following two tokens, it (might) give the output “oranges”.
LLMs use deep neural networks with a fancy architecture, called the transformer.
That explains the acronym “GPT”: a Generative Pretrained Transformer.
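If you want to see raw text completion in action, here is a minimal sketch using the Hugging Face transformers library and the small GPT-2 model. Both are just my picks for illustration; the talk doesn’t depend on them.

from transformers import pipeline

# Load a small, freely available text-completion model.
generator = pipeline("text-generation", model="gpt2")

# Ask for a couple of tokens that continue the input text.
result = generator("compare apples and", max_new_tokens=2)
print(result[0]["generated_text"])  # might print "compare apples and oranges"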
So how do we get from text completion to being able to chat with an LLM?
With a stupid trick :) Here it goes:
Transcript of a dialog, where the User interacts with an Assistant named Bob.
Bob is helpful, kind, honest, good at writing, and never fails to answer the
User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: Do you know what the SFSCON is?
Bob: I am not familiar with the SFSCON.
User: And do you know what FOSDEM is?
Bob: FOSDEM is a free software developer event taking place in Brussels, Belgium.
(This is a real example made with an LLM called Llama2-13B.)
The brown text is fed into the model, but hidden from the chat user. The chat user just sees her inputs (green) and the model’s outputs (black) in successive text completions.
As a matter of fact, this gets out of sync easily, so actual chatbots are LLMs that have been further trained on chats (often called “chat” or “instruct” models to distinguish them from “base” or “foundational” models).
Chat question → Chat-LLM → Chat Answer
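To make the trick concrete, here is a rough sketch of a chat loop built on plain text completion. It talks to a local Ollama server (introduced later in this talk) through its /api/generate endpoint on the default port; the model name and the stop condition are my own choices.

import requests

transcript = (
    "Transcript of a dialog, where the User interacts with an Assistant named Bob.\n"
    "Bob is helpful, kind, honest, good at writing, and never fails to answer the\n"
    "User's requests immediately and with precision.\n\n"
    "User: Hello, Bob.\n"
    "Bob: Hello. How may I help you today?\n"
)

while True:
    # Append the user's line and let the model complete Bob's answer.
    transcript += "User: " + input("You: ") + "\nBob:"
    answer = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={
            "model": "mistral:7b",
            "prompt": transcript,
            "stream": False,
            # Stop before the model starts writing the user's next line itself.
            "options": {"stop": ["User:"]},
        },
    ).json()["response"]
    transcript += answer + "\n"
    print("Bob:" + answer)

This also shows why the trick gets out of sync easily: nothing fundamentally stops the model from writing both sides of the conversation, hence the dedicated chat models mentioned above.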
Knowledge and Reasoning
The “large” in LLM is not an exaggeration. These models have billions to hundreds of billions of parameters (!), whereas image recognition models have just tens of millions.
Speech is a harder problem than sight, as it turned out.
Humans who interact with LLMs are usually impressed with two aspects.
Knowledge
LLMs contain vast knowledge about different topics. They have been trained on huge amounts of text; they have all read Wikipedia in different languages. But, just like a human, they don’t have perfect recall. There is no verbatim copy of the texts they were trained on.
As you can see from the example above, that particular model had some knowledge about the (well-known) conference FOSDEM, but no knowledge about the (more local) conference SFSCON. This is expected and OK. Sometimes LLMs cheat and make up things they don’t recall. That’s called “hallucination”.
Reasoning
As incredible as it sounds, when people started to make larger and larger LLMs, at some point reasoning emerged in these systems. When given original problems, LLMs showed they could string together steps of reasoning to solve them.
Is this Artificial Intelligence?
Well, that depends on the definition of intelligence and is a philosophical question.
While people make fun of LLMs that trip over logic riddles, LLMs do pass original university-level exams in all kinds of domains, can summarize and translate original texts, help write source code, and we can ask them to (quickly) read a document and answer questions about it. So they are definitely useful as assistants, whether you want to call them intelligent or not.
What models are out there
Contrary to what you might think, and luckily for us, there is no monopoly on LLMs. Many different entities successfully trained LLMs in the past two years or so.
A surprisingly large number of these models are available under FOSS licenses or at least under permissive licenses.
When I say FOSS license, I don’t mean the license of source code. Source code for model training and inference is mostly FOSS anyway! For example, PyTorch is a very important piece of FOSS used almost universally to train (and run) these models.
I mean the license applied to the model itself (the architecture metadata and those billions of parameters).
The major website where these models are shared is Hugging Face ( https://huggingface.co/ ). At this point one can say that Hugging Face is for models what GitHub is for source code.
Models come in different sizes. Generally speaking the larger they are the more knowledge they contain and the better at reasoning they are.
However, the tendency is to improve training in such a way that smaller models beat larger models of a previous generation. This is good news, because it enables us to run increasingly good models locally on our own computers without needing huge computational resources.
If you are a developer, that is all you need to know. You will figure out how Hugging Face works and how to download and run the models.
However, you’re a user. What can you do?
Getting to the Point: Running your own FOSS LLM in 8 Minutes
Ollama
As a user, currently the easiest way to run an LLM locally is to use the FOSS software Ollama ( https://ollama.com/ ).
Ollama can be set up using a one-line command on Linux. It will download and install about 1 GB of software to run inference on LLMs. Ollama is able to download models from its repository for you. There is a choice of about 100 models.
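At the time of writing, that one-line command is the following (copied from the Ollama website; check https://ollama.com/download for the current version before piping anything into your shell):

curl -fsSL https://ollama.com/install.sh | sh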
You can use Ollama as a chatbot in the command line or install another component to use it in a web browser with a familiar interface.
The important point is Ollama will download the model and the model will run locally. Ollama can make use of your discrete GPU if you have one that supports ROCm (AMD) or CUDA (Nvidia), but it runs just fine on a CPU.
Ollama uses a model format where each parameter is encoded using 4 bits.
Typical model sizes you are able to run locally range from 1 billion to 70 billion parameters and take up from 1 GB to 40 GB of disk space (plus the bandwidth to download them).
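As a back-of-the-envelope check: 12 billion parameters × 4 bits (half a byte) each ≈ 6 GB, which is in the same ballpark as the 7.1 GB download size of the 12-billion-parameter model recommended below; the remainder is presumably metadata and parts of the model stored at higher precision.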
When using the model it needs to be loaded into RAM (or VRAM). Ollama gives the following requirement:
You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
In my experience, if the input to the model (all the previous chat since you started it) gets a bit longer, these values are a bit tight.
All (recent) discrete GPUs are typically fast enough to run inference at a good speed (but might not have enough VRAM).
CPU performance depends on a number of things: not only the CPU model, generation and number of cores, but also, very importantly, the RAM bandwidth! You can always use a smaller model to get faster answers. The smallest models even run on Raspberry Pis…
Interesting models
The command to download a model is
ollama pull "model-name"
There is a full list of available models in the Ollama library ( https://ollama.com/library ).
Some recommended, smaller models of different sizes, as a starting point:
If you have at least 16 GB of RAM:
→ mistral-nemo:12b
by Mistral AI (France)
Apache 2 (FOSS)
12 billion parameters, 7.1 GB size
multilingual, large context (longer chats possible), detailed knowledge of a lot of topics
https://ollama.com/library/mistral-nemo:12b
If you have at least 8 GB of RAM:
→ mistral:7b
by Mistral AI (France)
Apache 2 (FOSS)
7.25 billion parameters, 4.1 GB size
one of the best small models, multilingual, tends to switch back to English, though
https://ollama.com/library/mistral:7b
If you have only 4 GB of RAM things get a bit tricky:
→ llama3.2:3b
by Meta,
permissive license (not FOSS)
3.2 billion parameters, 2.0 GB size
multilingual, small, don’t expect detailed knowledge about anything, still good for summarizing documents
https://ollama.com/library/llama3.2:3b
If you have lots of RAM, this is currently one of the best larger models you can run locally:
→ llama3.1:70b
by Meta,
permissive license (not FOSS)
70.6 billion parameters, 40 GB size
one of the best larger models you can run locally, large context (longer chats possible)
https://ollama.com/library/llama3.1:70b
Let’s install mistral-nemo:12b
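Using the pull command from above:

ollama pull mistral-nemo:12b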
After downloading some models, you can always list which ones you have:
chris@debgui:~$ ollama list
NAME                ID              SIZE      MODIFIED
mistral:7b          f974a74358d6    4.1 GB    4 minutes ago
mistral-nemo:12b    994f3b8b7801    7.1 GB    5 minutes ago
llama3.2:3b         a80c4f17acd5    2.0 GB    11 minutes ago
To chat with a model, run:
ollama run "model-name"
Here’s an example! Let mistral-nemo:12b explain the code example from the beginning:
And here’s another one. Translate an original text (from this presentation):
One year ago, at SFSCon 2023, I gave a talk about LLMs and brought up an original high-school problem that a prominent online chatbot solved, but that all local models smaller than 100 billion parameters had trouble with. Now we have a 12-billion-parameter FOSS model that can solve the problem:
Note the usual imprecise arithmetic. LLMs cannot do arithmetic without a calculator (just like humans :) Also, there is one confused intermediate step.
All these examples were single-try. I didn’t do any cherry-picking.
Performance
LLMs are … large. They certainly require a modern computer.
In my experience, for a single user who runs the model locally and doesn’t need to input a lot of text (just short chats, no large text inputs), a modern CPU is fast enough, and you can always fall back to a smaller model, sacrificing knowledge and smartness for speed.
Ollama should be able to automatically pick the ideal number of CPU cores to use. Picking all of them is not a good idea, because the memory bandwidth can be contended by too many cores, and low-power cores or virtual cores are not really useful.
In case you suspect Ollama didn’t make a good pick, you can use this command inside the Ollama chat:
/set parameter num_thread 8
and experiment with different numbers.
Ollama can also benchmark the timings for you (just add --verbose to the ollama run command).
On my test system, a high-end AMD desktop-class processor with Debian 12, I ran Ollama in a libvirt/kvm guest with 8 cores and 32 GB RAM. This is what I got for mistral-nemo:12b:
chris@debgui:~$ ollama run mistral-nemo:12b --verbose
>>> /set parameter num_thread 8
Set parameter 'num_thread' to '8'
>>> Tell me a joke about LLMs.
Why don't LLMs ever play poker in the jungle? Too many cheetahs!
total duration: 4.775237542s
load duration: 1.592226141s
prompt eval count: 11 token(s)
prompt eval duration: 367.263ms
prompt eval rate: 29.95 tokens/s
eval count: 20 token(s)
eval duration: 2.773033s
eval rate: 7.21 tokens/s
Here, the relevant parts are:
- it processes 30 tokens per second while reading the input (prompt eval rate), which includes all the previous text in the current chat,
- it generates 7.2 tokens per second while producing the output (eval rate).
A token is about 3 characters in this case.
Unless you need to process large input texts, this is perfectly usable!
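To put the 7.2 tokens per second into perspective: at roughly 3 characters per token that is about 22 characters per second, or very roughly 3 to 4 words per second, which is in the region of normal reading speed.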
Let’s compare this to a system running with CUDA on a high-end Nvidia GPU (RTX 6000 Ada):
$ ollama run mistral-nemo:12b --verbose
>>> I'm adding a bit of text to benchmark the prompt evaluation. Ignore it. Just tell me a joke
... about LLMs?
What do you call an LLM that's bad at jokes? A pun-ishment!
total duration: 311.01282ms
load duration: 19.4231ms
prompt eval count: 29 token(s)
prompt eval duration: 28.355ms
prompt eval rate: 1022.75 tokens/s
eval count: 19 token(s)
eval duration: 220.745ms
eval rate: 86.07 tokens/s
Well, yes… Note the high prompt eval rate of 1023 tokens per second.
What about a web UI?
Say no more. Open WebUI has you covered.
If you have Docker installed, you can start it up with a single command:
docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui --restart always ghcr.io/open-webui/open-webui:main
It provides a familiar web UI that can access your local Ollama installation and models.
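With host networking as in the command above, the web interface should then be reachable at http://localhost:8080 (Open WebUI’s default port inside the container).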
Outlook
Models make mistakes, they don’t recall details accurately. They’re terrible at arithmetic. When you demand answers, they start to make things up (“hallucinations”).
So, they are a bit like students :)
Treat them like information from a random website you found with DuckDuckGo on the world wide web. Or like a junior assistant. Or like your classmate that can help you even if she’s not 100% certain about everything at all times.
They are a valuable tool. They are here to stay, and they are being built into all our devices.
As training sufficiently large models is still expensive, we rely on the good will (and strategic planning) of entities that release some of their models under FOSS licences.
There is a growing community of people that run LLMs locally and also modify them in the spirit of FOSS. That’s called fine-tuning, and it’s way less expensive than training a new model from scratch. There are lots of specialized models that are fine-tuned on specific scenarios and tasks, such as programming or role playing. If you want to dive into this community I recommend checking out Hugging Face and the LocalLLaMA Subreddit.
LLMs are being integrated into our document management systems and search infrastructure. You see chatbots and copilots integrated everywhere. These systems combine a semantic search engine that finds relevant documents with an LLM that answers your questions using those documents. That’s called retrieval-augmented generation (RAG). If you’re interested in that, check out the many FOSS RAG projects. In Bolzano, NOI Techpark has a FOSS RAG project: the Stuart Chatbot.
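To give you an idea of the RAG pattern, here is a toy sketch (not how Stuart or any production RAG system is implemented): the “retrieval” step is a naive keyword match standing in for a real semantic search engine, and the answer is generated by a local Ollama model, with the model name and endpoint being my own assumptions as before.

import requests

# A tiny "document store".
documents = [
    "Ollama downloads LLMs and runs them locally on your own computer.",
    "Open WebUI provides a browser interface on top of a local Ollama server.",
]

def retrieve(question):
    # Toy retrieval: keep documents that share words with the question.
    words = set(question.lower().split())
    return [d for d in documents if words & set(d.lower().split())]

question = "How do I get a web interface for Ollama?"
context = "\n".join(retrieve(question))
prompt = (
    "Answer the question using only the following context.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

# Ask a local model to answer from the retrieved context.
answer = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "mistral:7b", "prompt": prompt, "stream": False},
).json()["response"]
print(answer)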