Choosing a Large Language Model (LLM) for private or local use involves several important considerations. This post aims to guide you through the key factors to keep in mind when selecting a model that fits your needs and hardware capabilities. We will explore model characteristics, discuss how to choose the right model for your use case, and finally look at tools available for running LLMs locally on Windows and Mac. This is meant to be a starting point for anyone looking to run an LLM locally and/or privately for personal use. If you’re looking for a tutorial on how to deploy a full-fledged solution for your organization, I may write another post on that in the near future. Let’s start with some basic concepts.
In an LLM (Large Language Model), parameters are the numerical weights (values in matrices and tensors) that the neural network learns during training. They define how input tokens (words and sub-words) are transformed through the network’s layers to produce the next token’s probability. In simpler terms, parameters are the “memory” of what the model has learned from its training data: billions of tiny knobs that collectively encode grammar, facts, reasoning patterns, and style.
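To make that concrete, here is a tiny, illustrative sketch (using PyTorch purely as an example; the 4096×4096 size is arbitrary) showing how quickly those weight matrices add up: a single linear layer of that size already holds almost 17 million parameters, and modern LLMs stack dozens of such layers.

```python
# Illustrative only: count the weights in one linear layer to see how
# parameter counts add up. Requires PyTorch (pip install torch).
import torch.nn as nn

layer = nn.Linear(4096, 4096)  # one weight matrix plus a bias vector
num_params = sum(p.numel() for p in layer.parameters())
print(f"{num_params:,} parameters in a single layer")  # 16,781,312
```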
Why does the number of parameters matter?
- Capacity: More parameters generally give the model more representational power. This often improves its ability to understand complex prompts, handle long-range context, and generalize across diverse tasks.
- Resource cost: Large parameter counts require more memory (RAM/VRAM) and compute power for both training and inference (see the quick estimate after this list).
- Latency and energy: Bigger models take longer and use more energy per request. Quantization and optimized runtimes reduce this, but don’t eliminate the trend.
- Performance trade-off: Past a certain point, increasing size yields diminishing returns for many use cases; smaller fine-tuned or specialized models can outperform a huge general-purpose one for narrow tasks.
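A rough rule of thumb for the resource cost: weight memory is roughly parameter count times bytes per parameter. Here is a minimal back-of-the-envelope sketch; the 0.5 bytes per parameter for 4-bit formats is an approximation (real quantized files carry extra metadata), and actual usage is higher once you add the KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope estimate: weight memory = parameters x bytes per parameter.
# This ignores the KV cache, activations, and runtime overhead, so treat it as
# a lower bound rather than an exact requirement.

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "Q4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    return params_billions * BYTES_PER_PARAM[precision]

for size in (7, 20, 70):
    print(f"{size}B model: FP16 ~{weight_memory_gb(size, 'FP16'):.0f} GB, "
          f"Q4 ~{weight_memory_gb(size, 'Q4'):.1f} GB")
```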
All of these figures are approximations of what it takes to run a model. Every day, manufacturers come out with “AI-enabled” chips and all kinds of marketing terms that don’t mean much; look at the actual specs and make a decision. To give you an idea: I was able to run a 20B model on my M4 MacBook Pro with 36 GB of RAM, and it was perfectly usable while I kept a browser open, ran messaging apps, and used my computer as normal. I could probably run a bigger model if I wanted to.
- Small models (1B – 7B parameters)
- Medium models (13B – 30B parameters)
- Large models (65B – 70B+ parameters)
For personal use, I’d stick with small and medium models. Larger models require proper hardware and are intended to serve many users in real-world applications; that’s what you’d go for if you were implementing a solution for an organization.
Quantization is a technique used to reduce the size and computational cost of a machine-learning model by storing its numerical weights, and sometimes its activations, with lower-precision numbers.
For example, an LLM might be trained with 16- or 32-bit floating-point weights (FP16/FP32). Quantization converts them to smaller data types such as 8-bit integers (INT8) or even 4-bit formats (Q4, Q5, etc.).
This reduces memory footprint and bandwidth, allowing the model to run faster and fit on devices with limited RAM or GPU VRAM. The trade-off is usually a small loss in accuracy, though modern quantization methods are designed to minimize this impact.
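To illustrate the core idea, here is a minimal NumPy sketch of symmetric INT8 quantization applied to one weight matrix. Real quantizers (GPTQ, AWQ, llama.cpp’s K-quants, etc.) are considerably more sophisticated, but the principle is the same: map the floats onto a small integer grid plus a scale factor.

```python
# Minimal sketch of symmetric INT8 quantization of one weight matrix.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)

scale = np.abs(weights).max() / 127.0               # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale          # what the runtime computes with

print(f"FP32 size: {weights.nbytes / 1e6:.1f} MB")  # ~67 MB
print(f"INT8 size: {q.nbytes / 1e6:.1f} MB")        # ~17 MB
print(f"mean abs error: {np.abs(weights - dequantized).mean():.2e}")
```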
An “OpenAI-compatible API” means the server exposes the same REST endpoints, request/response formats, and headers as the official OpenAI API. For example /v1/chat/completions, /v1/completions, /v1/embeddings, etc. Because the interface is the same, you can use standard OpenAI client libraries (like the OpenAI Python or JS SDK) and just point them to your server’s base URL and API key. This allows you to swap between OpenAI’s hosted models and your own locally-hosted models (e.g., vLLM, Ollama, llama-cpp) without changing your application code.
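For example, here is a sketch using the official OpenAI Python SDK against a local server. It assumes Ollama is running with its OpenAI-compatible endpoint at its default address (http://localhost:11434/v1) and that the model named below has already been pulled; swap the base URL and model name for vLLM, llama.cpp’s server, or any other compatible backend.

```python
# The standard OpenAI Python SDK pointed at a local, OpenAI-compatible server.
# Assumes Ollama is running on its default port; the model name is an example.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local server instead of api.openai.com
    api_key="not-needed-locally",          # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1",  # whatever model your local server has loaded
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(response.choices[0].message.content)
```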
There are a few factors to consider when choosing a model, and size is only one of them. Here’s a general overview of what to look at.
| Model Size | Full-Precision (FP16) | Quantized (INT8 / 4-bit) | Typical Hardware Requirements |
|---|---|---|---|
| Small (1B – 7B) | ~2–14 GB | ~1–6 GB | Modern laptop with ≥16 GB RAM or a single consumer GPU (e.g., RTX 3060/4060) |
| Medium (13B – 30B) | ~26–60 GB | ~12–24 GB | High-end single GPU (24–48 GB VRAM), 2 lower-VRAM GPUs, or a workstation/server with large CPU RAM |
| Large (65B – 70B+) | ~130–140 GB | ~35–70 GB | Multi-GPU setups (e.g., 2× 80 GB A100/H100) or large-memory CPU servers; not practical for desktops |
So, is bigger better? Generally, yes. My recommendation: if you’re looking for a model that understands you better and is more capable in general, get the largest one you can fit in your hardware, keeping in mind the trade-offs covered earlier: more capability in exchange for more memory, compute, and latency.
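If you want a quick way to reason about it, here is an illustrative helper that picks the largest size tier from the table above based on the memory you can realistically dedicate to the model. The thresholds are rough: they use the upper end of the quantized ranges to leave headroom for the OS and context, and the tier labels and function name are just for this sketch.

```python
# Illustrative only: pick the largest size tier whose quantized footprint
# (upper end of the table's ranges, used as headroom) fits your memory budget.

TIERS = [  # (tier name, approx. quantized memory in GB, with headroom)
    ("Large (65B-70B+)", 70),
    ("Medium (13B-30B)", 24),
    ("Small (1B-7B)", 6),
]

def largest_tier_that_fits(available_gb: float) -> str:
    for name, needed_gb in TIERS:
        if available_gb >= needed_gb:
            return name
    return "Only the smallest quantized models will fit"

print(largest_tier_that_fits(36))  # 36 GB of unified memory -> "Medium (13B-30B)"
```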
Now that you’ve picked your model, let’s look at how to run it on your own computer. There are plenty of tools to pick from; some of the most popular are Ollama, llama.cpp, and MLX (on Apple silicon).
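If you’d rather stay in Python end to end, here is a minimal sketch using llama-cpp-python (the Python bindings for llama.cpp) to load a quantized GGUF file directly. The model path and filename are placeholders; it assumes you’ve installed llama-cpp-python and downloaded a GGUF model yourself.

```python
# Loading a quantized GGUF model with llama-cpp-python (pip install llama-cpp-python).
# The path below is a placeholder; point it at a GGUF file you've downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload as many layers as possible to the GPU, if available
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}]
)
print(result["choices"][0]["message"]["content"])
```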
Choosing an LLM for private or local use can be intimidating with all those numbers and fancy words. It shouldn’t be. It’s a matter of considering your intended task, model size, hardware constraints, performance needs, licensing, and ecosystem support. Smaller models are often enough for many use cases and are easier to run on consumer hardware, while larger models require significant resources but offer improved capabilities. If you have the chance, go for the bigger models; if not, quantization and efficient file formats help make running LLMs more accessible.
Finally, leverage available tools like Ollama, llama.cpp, and MLX to deploy models effectively on Windows and Mac platforms. Balancing these factors will help you select the best model for your specific needs and environment.
Good luck!