Choosing a Large Language Model (LLM) for private or local use involves several important considerations. This post aims to guide you through the key factors to keep in mind when selecting an LLM that fits your needs and hardware capabilities. We will explore model characteristics, discuss how to choose the right model for your use case, and finally look at tools available for running LLMs locally on Windows and Mac. This is meant to be a starting point for anyone looking to run an LLM locally and/or privately for personal use. If you’re looking for a tutorial on how to deploy a full-fledged solution for your organization, I may write another post on that in the near future. Let’s start with some basic concepts.
Model Characteristics
Parameters
In an LLM (Large Language Model), parameters are the numerical weights (values in matrices and tensors) that the neural network learns during training. They define how input tokens (words, sub-words) are transformed through the network’s layers to produce the next token’s probability. In simpler terms, parameters are the “memory” of what the model has learned from its training data: billions of tiny knobs that collectively encode grammar, facts, reasoning patterns, and style.
Why does the number of parameters matter?
Capacity: More parameters generally give the model more representational power. This often improves its ability to understand complex prompts, handle long-range context, and generalize across diverse tasks.
Resource cost: Large parameter counts require more memory (RAM/VRAM) and compute power for both training and inference.
- A 7B-parameter model can often run on a single modern GPU or even CPU (with quantization).
- A 70B-parameter model may need multiple high-end GPUs to load and serve.
Latency and energy: Bigger models take longer and use more energy per request. Quantization and optimized runtimes reduce this, but don’t eliminate the trend.
Performance trade-off: Past a certain point, increasing size yields diminishing returns for many use-cases; smaller fine-tuned or specialized models can outperform a huge general-purpose one for narrow tasks.
Resources Required
This is an approximation of what it takes to run a model. Every day, manufacturers come out with “AI-enabled” chips and all kinds of marketing terms that don’t mean much; look at the actual specs and make a decision. To give you an idea: I was able to run a 20B model on my M4 MacBook Pro with 36 GB of RAM, and it was perfectly usable, with a browser open, messaging apps running, and the rest of my usual workload. I could probably run a bigger model if I wanted to. (If you want to do the math yourself, there’s a quick estimate sketch after the size breakdown below.)
Small models (1B – 7B parameters)
- Full-precision (FP16): ~2–14 GB
- Quantized (INT8 / 4-bit GGUF): ~1–6 GB
- Fits in a modern laptop with ≥16 GB RAM or a single consumer GPU (e.g., RTX 3060/4060).
Medium models (13B – 30B parameters)
- Full-precision (FP16): ~26–60 GB
- Quantized (4-bit): ~12–24 GB
- Usually needs a high-end single GPU (24–48 GB VRAM) or 2 lower-VRAM GPUs, or a workstation/server with plenty of CPU RAM if running on CPU.
Large models (65B – 70B+ parameters)
- Full-precision (FP16): ~130–140 GB
- Quantized (4-bit): ~35–70 GB
- Typically served on multi-GPU setups (e.g., 2× 80 GB A100/H100) or large-memory CPU servers; not practical for ordinary desktops.
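If you want a quick back-of-the-envelope number for a model you’re eyeing, memory for the weights is roughly the parameter count times the bytes per parameter; anything beyond that (KV cache, runtime buffers) is extra. Here’s a minimal sketch of that math in Python. The bytes-per-format values are standard, but treat the results as lower bounds rather than exact requirements.

```python
# Back-of-the-envelope estimate: weight memory = parameters x bytes per parameter.
# One billion parameters at 1 byte each is ~1 GB, so billions x bytes/param gives GB.
# Real usage is higher (KV cache, runtime buffers), so treat these as lower bounds.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "Q4": 0.5}

def estimate_weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory needed just for the weights, in GB."""
    return params_billions * BYTES_PER_PARAM[precision]

for size in (7, 13, 70):
    for precision in ("FP16", "Q4"):
        print(f"{size}B @ {precision}: ~{estimate_weight_memory_gb(size, precision):.1f} GB")
```

Running this gives roughly 14 GB for a 7B model at FP16 and about 3.5 GB at 4-bit, which lines up with the ranges above.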
For personal use, I’d stay with small and medium models. Larger models require proper hardware and are intended for serving multiple users and production applications. This is what you’d go for if you were implementing a solution for an organization.
LLM File Formats
- GGUF (GPT-Generated Unified Format): A lightweight, binary format designed for llama.cpp and other CPU/edge-oriented runtimes. It stores weights in quantized form (e.g., Q4_K_M, Q5_K) for small memory footprints and fast inference on CPUs or modest GPUs.
- MLX: The model format used by Apple’s MLX framework for Apple Silicon (M-series) devices. It optimizes models for Metal acceleration and low-precision math on macOS/iOS, making it ideal for local LLM inference on Apple hardware.
- Safetensors: A popular, framework-agnostic tensor checkpoint format (used by Hugging Face Transformers, vLLM, etc.). It’s designed for speed and safety (no code execution on load), and usually stores full-precision weights (FP16/FP32) for server/GPU deployments.
- .pth / .pt: The legacy PyTorch checkpoint formats. They serialize weights via Python’s pickle, which can also execute code on load; still widely found for research and older models, but gradually being replaced by safetensors.
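To make this concrete, here’s a minimal sketch of loading a quantized GGUF file with the llama-cpp-python bindings. The model path is a placeholder for whatever you’ve downloaded, and constructor options can vary between versions.

```python
# Minimal sketch: load a quantized GGUF model with llama-cpp-python
# (pip install llama-cpp-python). The model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,  # context window to allocate
)

# Simple chat-style completion against the local model
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF is in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```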
Quantization
A technique used to reduce the size and computational cost of a machine-learning model by storing its numerical weights and sometimes activations with lower-precision numbers.
For example, an LLM might be trained with 16- or 32-bit floating-point weights (FP16/FP32). Quantization converts them to smaller data types such as 8-bit integers (INT8) or even 4-bit formats (Q4, Q5, etc.).
This reduces memory footprint and bandwidth, allowing the model to run faster and fit on devices with limited RAM or GPU VRAM. The trade-off is usually a small loss in accuracy, though modern quantization methods are designed to minimize this impact.
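To illustrate the idea, here’s a toy sketch of symmetric INT8 quantization of a weight matrix with NumPy. Real quantizers (Q4_K, GPTQ, AWQ, and friends) are far more sophisticated, but the core trick is the same: store small integers plus a scale factor instead of full-precision floats.

```python
import numpy as np

# Toy symmetric INT8 quantization of a weight matrix.
weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0                  # one scale for the whole tensor
q_weights = np.round(weights / scale).astype(np.int8)  # each weight now takes 1 byte
dequantized = q_weights.astype(np.float32) * scale     # reconstructed at inference time

print("max absolute error:", np.abs(weights - dequantized).max())
print("size:", weights.nbytes, "bytes ->", q_weights.nbytes, "bytes")
```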
OpenAI Compatible API
An “OpenAI-compatible API” means the server exposes the same REST endpoints, request/response formats, and headers as the official OpenAI API, for example /v1/chat/completions, /v1/completions, and /v1/embeddings. Because the interface is the same, you can use standard OpenAI client libraries (like the OpenAI Python or JS SDK) and just point them at your server’s base URL and API key. This allows you to swap between OpenAI’s hosted models and your own locally hosted models (e.g., vLLM, Ollama, llama.cpp) without changing your application code.
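As an example, here’s a minimal sketch that points the official OpenAI Python SDK at a local Ollama server, which exposes an OpenAI-compatible endpoint on /v1 by default. The base URL, model name, and placeholder API key are assumptions about your local setup.

```python
# Minimal sketch: use the standard OpenAI SDK against a local, OpenAI-compatible server.
# Base URL and model name assume a default Ollama install with llama3 already pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local server instead of api.openai.com
    api_key="not-needed-locally",          # most local servers ignore the key
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(response.choices[0].message.content)
```

Swapping back to OpenAI’s hosted models is then just a matter of changing the base URL, API key, and model name.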
Choosing the Right Model
There are a few factors to weigh when choosing a model, and size is only one of them. Here’s a general overview of what to consider.
Intended Task
- Use case: Chat/Q&A, code generation, summarization, RAG, reasoning, multimodal, etc. If all you need is text, you can save some space by choosing a text-only model instead of a multimodal one.
- Domain specialization: General-purpose vs. domain-tuned (finance, legal, medical). Folks on the internet have taken general-knowledge models and fine-tuned them on domain-specific content. (Please go to the doctor, don’t ask an LLM about your symptoms.)
- Language and modality: Human languages, programming languages, or multimodal (text-image).
Model Size & Hardware Requirements
- Parameters vs. hardware: Check whether it fits in your available GPU/CPU RAM. The download size alone doesn’t tell you how much memory the model will need once loaded, since the context (KV cache) and runtime add overhead.
- Latency requirements: Smaller models respond faster. Generation speed is measured in tokens per second; higher is better.
- Energy and cost: Bigger models are more expensive to host.
Performance vs. Efficiency Trade-off
- Quality/accuracy: Often improves with size, but has diminishing returns. It may be super smart, but if it takes 20 seconds to reply, maybe it’s not worth it.
- Quantization support: Can reduce memory and compute needs with minimal quality loss.
- Benchmarking: Look at task-specific scores (MMLU, GSM-8K, etc.) instead of just size.
Licensing & Cost Considerations
- License terms: Open-weight/open-source (e.g., Llama 3, Mistral) vs. restrictive/commercial. Open source is great; it’s the foundation of the internet we use today. Don’t be afraid to look that way.
- Deployment cost: Cloud API pay-per-token vs. self-hosted hardware and energy.
- Redistribution: If you build a product, confirm you can legally ship the model. Piracy is cool in the movies. Make sure you read the license terms.
Deployment Environment
- Target hardware: Cloud GPUs, on-prem GPUs, edge devices, Apple Silicon.
- Runtime support: Frameworks (vLLM, llama.cpp, Ollama, PyTorch) and model formats (GGUF, MLX, safetensors).
- Scaling needs: Multi-GPU, distributed inference, autoscaling.
Context Window & Features
- Context length: Long-context models (e.g., 32k–200k tokens) for RAG or document QA. (A quick token-count sketch follows this list if you want to check whether a document fits.)
- Tool-use / function-calling: If you need structured outputs or external tool invocation.
- Multilingual or multimodal support: If your use-case needs it.
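If you’re not sure whether a document fits a model’s context window, you can estimate its token count before sending it. Here’s a sketch using the tiktoken library; keep in mind that local models ship their own tokenizers, so this is only an approximation.

```python
# Rough token count to check whether text fits a model's context window.
# tiktoken implements OpenAI's tokenizers; local models tokenize differently,
# so treat this as an approximation rather than an exact count.
import tiktoken

def fits_in_context(text: str, context_window: int = 8192) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))
    print(f"~{n_tokens} tokens vs. a {context_window}-token window")
    return n_tokens <= context_window

fits_in_context("Paste your document here. " * 1000)
```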
Ecosystem & Community
- Availability of fine-tunes or adapters (LoRA, PEFT).
- Community and vendor support.
- Integration libraries: LangChain, LlamaIndex, RAG tooling, embeddings.
Quick Summary of Resource Requirements by Model Size
| Model Size | Full-Precision (FP16) | Quantized (INT8 / 4-bit) | Typical Hardware Requirements |
|---|---|---|---|
| Small (1B – 7B) | ~2–14 GB | ~1–6 GB | Modern laptop with ≥16 GB RAM or single consumer GPU (e.g., RTX 3060/4060) |
| Medium (13B – 30B) | ~26–60 GB | ~12–24 GB | High-end single GPU (24–48 GB VRAM), 2 lower-VRAM GPUs, or workstation/server with large CPU RAM |
| Large (65B – 70B+) | ~130–140 GB | ~35–70 GB | Multi-GPU setups (e.g., 2× 80 GB A100/H100) or large-memory CPU servers; not practical for desktops |
Is Bigger Better?
Generally, yes. My recommendation: if you’re looking for a model that understands you better and is, in general, more capable, get the largest one you can fit on your hardware. In any case, here are some of the trade-offs:
- Latency: Smaller models respond faster; larger models take longer per request. (The sketch after this list shows a quick way to measure tokens per second on your own setup.)
- Energy use: Bigger models consume more power; quantization helps reduce energy consumption.
- Accuracy: Larger models tend to be more accurate, but with diminishing returns beyond a certain size.
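A practical way to compare model sizes on your own hardware is to measure generation speed directly. Here’s a sketch that times a request against a local OpenAI-compatible server (the same assumed Ollama setup as earlier) and divides the generated tokens by wall-clock time.

```python
# Measure rough generation speed (tokens per second) against a local
# OpenAI-compatible server. Assumes the same local Ollama setup as above.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

start = time.time()
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a short paragraph about local LLMs."}],
)
elapsed = time.time() - start

generated = response.usage.completion_tokens  # tokens produced by the model
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```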
Tools for Running LLMs Locally
Now that you’ve picked your model, let’s look at how to run it on your own computer. There are plenty of tools to pick from. Here are some of the most popular:
- Ollama: A user-friendly platform for running LLMs locally on macOS, Windows, and Linux, providing an OpenAI-compatible API and easy model management. (See the sketch after this list for a quick way to check which models you’ve pulled.)
- llama.cpp: A lightweight C++ implementation optimized for running quantized LLaMA models on CPU, suitable for edge devices and laptops.
- MLX: Apple’s machine learning framework for Apple Silicon devices, enabling efficient local LLM inference using Metal acceleration.
- vLLM: A high-performance, production-ready, scalable inference engine designed for GPU servers, compatible with OpenAI API standards.
- Other open source tools: Various community projects support different model formats and hardware setups, enabling flexible local deployment depending on your environment. Jan and Clara are some examples.
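As a quick sanity check that your local runtime is up and running, here’s a small sketch that asks Ollama’s local REST API which models you’ve already pulled. The endpoint path and default port assume a standard Ollama install.

```python
# Quick sanity check: list the models already pulled into a local Ollama install.
# Assumes Ollama's default port (11434) and its /api/tags endpoint.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(model["name"])
```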
Conclusion
Choosing an LLM for private or local use can be intimidating with all those numbers and fancy words. It shouldn’t be. It’s a matter of considering your intended task, model size, hardware constraints, performance needs, licensing, and ecosystem support. Smaller models may just be enough for many use cases and are easier to run on consumer hardware, while larger models require significant resources but offer improved capabilities. If you have the chance, go for the bigger models. If not, quantization and efficient file formats help make running LLMs more accessible.
Finally, leverage available tools like Ollama, llama.cpp, and MLX to deploy models effectively on Windows and Mac platforms. Balancing these factors will help you select the best model for your specific needs and environment.
Good luck!

