Mistral 7B on consumer hardware

Ollama is a great tool for running quantized LLMs on consumer hardware. Once you download and install Ollama and its Python library, here is a quick example that runs Mistral 7B with a custom prompt:

# generate.py
import ollama

prompt = '''
Tell me a joke
'''

# Send the prompt to the locally running Mistral 7B model and print its reply.
response = ollama.chat(model='mistral', messages=[
    {
        'role': 'user',
        'content': prompt,
    },
])
print(response['message']['content'])

Save this as generate.py, run it from the CLI with python generate.py, and you have a simple command-line data generator. With Mistral 7B on an M1 Mac, I was able to generate around 6K seed data samples in 8 hours for a recent project.
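
As a rough illustration of what such a batch run can look like, here is a minimal sketch that loops over the same chat call and appends each completion to a file. The sample count, output path, and one-sample-per-call structure are assumptions for the example, not the exact script used for the project.

# batch_generate.py - hypothetical batch driver, assumptions noted above
import ollama

NUM_SAMPLES = 100              # assumed batch size; the project run targeted ~6K
OUTPUT_PATH = 'seed_data.txt'  # assumed output file

prompt = 'Tell me a joke'

with open(OUTPUT_PATH, 'a') as f:
    for _ in range(NUM_SAMPLES):
        response = ollama.chat(model='mistral', messages=[
            {'role': 'user', 'content': prompt},
        ])
        # Write one generated sample per line.
        f.write(response['message']['content'].replace('\n', ' ') + '\n')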

Tips

  • The prompt can include instructions for formatting the output, for example asking for pipe-separated records with a text field and a label (see the sketch after this list)
  • The output can then be parsed downstream with UNIX CLI utilities such as sed, awk, grep, and tr
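
To make the first tip concrete, here is a minimal sketch of a prompt with formatting instructions plus a small parser for the pipe-separated output. The prompt wording, the text|label schema, and the skip-malformed-lines behavior are illustrative assumptions.

# labeled_generate.py - hypothetical formatting-and-parsing example
import ollama

prompt = '''
Generate 5 short movie reviews with a sentiment label.
Output one review per line, pipe separated, in the form: text|label
The label must be either positive or negative. Do not add any other text.
'''

response = ollama.chat(model='mistral', messages=[
    {'role': 'user', 'content': prompt},
])

# Parse the pipe-separated lines; skip anything the model formatted incorrectly.
for line in response['message']['content'].splitlines():
    if '|' not in line:
        continue
    text, label = line.rsplit('|', 1)
    print(text.strip(), label.strip())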

How does this run?

Ollama runs this through llama.cpp, which lets you save quantized models in the GGML binary format so they can be executed on a much broader range of hardware, including ordinary CPUs. Ollama loads Mistral 7B in a GGML 4-bit quantized format and runs it on the CPU.
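
If you want to confirm what was actually loaded, the ollama Python library exposes a show call that reports model details such as the file format and quantization level. The exact field names can differ between library versions, so treat this as a sketch rather than a fixed API reference.

# inspect_model.py - sketch; response fields may vary by ollama library version
import ollama

info = ollama.show('mistral')

# 'details' typically includes entries such as the format, parameter size,
# and quantization level of the loaded model.
print(info['details'])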

GGUF (GPT-Generated Unified Format) is a successor to GGML and is designed to address GGML's limitations, most notably by enabling the quantization of non-Llama models. GGUF is also extensible, allowing new features to be integrated while retaining compatibility with older LLMs (Talamadupula 2024).

Can I load other models from HuggingFace?

Yes. Ollama supports loading any GGUF-format model, and the GGUF Quants tool can create a quantized model with a HuggingFace repo as input.
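
Once a GGUF file has been registered with Ollama (for example via a Modelfile whose FROM line points at the .gguf file, followed by ollama create on the CLI), it can be called from Python exactly like the Mistral example above. The model name my-model below is a placeholder.

# custom_model.py - assumes a local GGUF model was already created, e.g.:
#   echo 'FROM ./my-model.gguf' > Modelfile
#   ollama create my-model -f Modelfile
import ollama

# 'my-model' is a placeholder for the name chosen in `ollama create` above.
response = ollama.chat(model='my-model', messages=[
    {'role': 'user', 'content': 'Tell me a joke'},
])
print(response['message']['content'])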

References

Talamadupula, Kartik. 2024. “A Guide to Quantization in LLMs.” Symbl.ai. February 21, 2024. https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/.