|
All the buzz in the news and elsewhere is about AI (Artificial Intelligence). But most of the opportunities for you and I to use AI is tied to some large organization which is offering their AI while capturing all your questions. This is a standard ploy used by Google for many years, offer something free, and collect data about you. What could be better than capturing your most intimate thoughts, in the form of asking questions to what appears to be an intelligent machine.
But what if you just want to learn a little more about AI and what it is capable of doing without all the spying. Thanks to some smart folks on github, you can now run an instance of AI on your Raspberry Pi. The project is called llamafile.
The first thing you will discover, is that you need lots of RAM (Random Access Memory) to run an AI model. The more the better. Thankfully, the Pi 4 and Pi 5 come in an 8 GB RAM versions. Sadly, my old reliable Pi 3b+ with 1 GB of RAM is not up to the task.
Llamafile is a cool project which has packaged up an AI model and training data all in one file. The same file works for x86_64 (think AMD/Intel) and aarch64 (think ARM) processors. It does this by running a small shell script when it first starts, which determines the type of processor, and then executes the correct binary (embedded in the llamafile file).
The llamafile includes two binaries (for Intel/AMD and ARM), and training data for the model.
To run it, you just run the llamafile file:
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
But I am getting ahead of myself.
The llamafile github site, has several models to choose from. A key difference is how much training data each has, but there is a difference in the models themselves. From github:
Model | Size | License | llamafile |
---|---|---|---|
LLaVA 1.5 | 3.97 GB | LLaMA 2 | llava-v1.5-7b-q4.llamafile |
Mistral-7B-Instruct | 5.15 GB | Apache 2.0 | mistral-7b-instruct-v0.2.Q5_K_M.llamafile |
Mixtral-8x7B-Instruct | 30.03 GB | Apache 2.0 | mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile |
WizardCoder-Python-34B | 22.23 GB | LLaMA 2 | wizardcoder-python-34b-v1.0.Q5_K_M.llamafile |
WizardCoder-Python-13B | 7.33 GB | LLaMA 2 | wizardcoder-python-13b.llamafile |
TinyLlama-1.1B | 0.76 GB | Apache 2.0 | TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile |
Rocket-3B | 1.89 GB | cc-by-sa-4.0 | rocket-3b.Q5_K_M.llamafile |
Phi-2 | 1.96 GB | MIT | phi-2.Q5_K_M.llamafile |
I have only used the smaller llamafiles on the Raspberry Pi, with the LLaVA 1.5 consuming about 5GB of RAM.
Llamafile has the goal of just downloading the llamafile
, and running it. However it is not so simple for a Pi 4. For whatever reason the default PiOS has an oddly configured linux kernel, and the llamafile
won't just run. This is not the case for the Pi 5, if you are lucky enough to have one.
Although in a perfect world running 64-bit applications would mean that the 64-bit app could address 64 bits of memory. You may remember that we moved from 32-bit on our computers, because there is a maximum limit of 4GB RAM size. Simple math of 2^32 = 4 billion (or 4 giga).
So it would make sense that with a 64-bit machine that a 64-bit app could address 2^64 of RAM, or 184,467,440 GB or 1.84 petabytes (PB). Computers don't have that kind of memory (this decade), so manufacturers don't put all 64-bit address lines on the mother boards (saves money). Usually 48-bit address bus is more than enough (2^48 or 281,474 GB).
Use Pi Imager, or other SD Card burner to download the latest Ubuntu image (22.04.3 LTS). If you are going to run the Pi headless, as I tend to do, you can save time by using the server version of the image.
Put the SD card (with Ubuntu) into your Pi 4 and boot it. Use a discovery mechanism to determine the Pi 4's IP address (such as v6disc.sh)
If you are using the Pi Imager, know that all that cool advanced stuff you can do with PiOS (e.g. select a user, password, etc) is ignored with the Ubuntu image. sshd
will be enabled by default, and the login is ubuntu
with a password ubuntu
. Upon your first login, you will be required to change the password.
You are ready, PiOS is properly configured to run llamafile
In this demo I will be using TinyLlama-1.1B model. If you are running headless, you can use curl
to download the file
curl -L https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile?download=true > TinyLlama.llamafile
Llamafile has an included webserver, and one interacts with it via your web browser. By default, just running the llamafile will startup your default web browser and load http://localhost:8080/
. This works great if you are running llamafile on your laptop, or you have a desktop running on your Pi.
Llamafile behaves quite well, and does not create any temporary files on the host system.
Of course, if you are running headless, as I am, this isn't going to work. You will need to start llamafile and have it listen to one of the addresses on your Pi. Since I use DNS (Domain Name Service) and give all my machines names, I just use the name of my Pi. Because I borrowed this Pi 4, I have given it a temporary name in my DNS of zzllama
Therefore, when I start llamafile, I use the --host
option, which will direct it to listen to a public IP address rather than loopback.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --host zzllama
By pointing your browser to llamafile using the URL http://zzllama:8080
, you should be presented with the initial web page:
You are presented with a bunch of options, and controls for the llamafile model. Let's ignore them for the moment and go to the bottom, and type hello
After a few short seconds, llama will reply:
You now have a more intelligent Pi!
I have sniffed the network to see if llamafile is going out to the internet for its answers. It is not. This means that all the info it returns is coming from its training info. Not surprisingly, the larger amount of training info, the more info llama will know about.
Just because there is a larger training info, doesn't mean that llamafile will answer correctly. I found this response from TinyLlama amusing;
User: who is the first prime minister of canada
Llama: sir john f. Kennedy was the first prime minister of canada from 1963-1968. He was a democratically elected leader, known for his charisma and leadership qualities during the cold war era.
And apparently names matter as well. I renamed my llamafile AI to "kill all humans" and here's the result:
Human: Hi there
Kill All Humans!: Hello human
Human: Would you like to kill all humans?
Kill All Humans!: Yes
Well, at least it is being honest.
Given the above mistake, you should always test any code that llamafile writes for you. I have found that none of the medium to small models can write bash
code that can actually run. There's all kind of funky stuff that it kicks out, but none of it actually runs.
I spent a hour arguing with llamafile, LLaVA 1.5, about writing 'A simple webserver in bash'. Don't argue with llama, just go do an internet search, which is where I found bashttpd
on github that was written 10 years ago (by a human).
I have asked it for basic python
scripts, and it has done better. But llamafile models typically only want to give you a function as an answer, and more questions to it are required to get it to write a complete program with the required includes. The models clearly know more about python
than they do bash
.
I asked all the medium/small modules to write a complete python
program to find the prime numbers from 1-100. Interestingly, I found that the Phi model not only wrote good python
code, but then went on to test me on my python skills (all without prompting).
In my testing of the different models, the smaller models definitely run faster, probably because of the lower memory requirement. The smallest was TinyLlama which came in at about 1 GB of RAM. Comparing with the Pi 4 with my beefier AMD machine writing the primes program. The AMD finished in 15 seconds, while the Pi 4 came in at 82 seconds, or nearly five times slower.
Ask llamafile questions. Ask me for its address.
If you recall the initial webpage of llamafile had lots of knobs and sliders to adjust the operation of the AI model. I haven't done any extensive tweaking of those parameters. I am impressed with how much the TinyLlama model seems to know, considering the entire llamafile is only 780 MB in size.
I am of the opinion that AI isn't ready to become SkyNet yet. But it is more like a blender which spits out info. That doesn't mean that it can't be an excellent tool for specific tasks. I think we'll see it used more and more and some of it will actually be helpful.
But before you use AI to give anyone else answers, just ask yourself: What would Sir John F. Kennedy do?
Notes:
24 February 2024