Private AI: Self-Hosting an AI Chatbot on Your Local Network

I’m a firm believer that the future of AI is private deployments. By concentrating so much power in the hands of a few corporations we will inevitably find ourselves in a dystopian future.
Some time ago I started to wonder: what would it take to self-host my own “Claude” or “ChatGPT”?
Most of the time I use LLM chatbots as a very advanced search engine. But what I actually need it for is to help me navigate numerous documents written in legal German (as if regular German weren’t difficult enough, people invented the legal version of it). Obviously, the problem here is privacy. Under no circumstances can I trust my personal documents to a third-party.
In this blog post I’ll take you through my journey of achieving exactly this. I recently presented this content to my colleagues at work, and now I’m writing it down in my personal blog to make it publicly available.
My Requirements #
My requirements are driven by these three goals:
- Maintain my privacy
- Replicate the Claude / ChatGPT experience as closely as possible
- Re-use my already owned hardware
Let’s go through the list of the actual requirements one by one.
Privacy #
Obviously, the most important requirement here is privacy.
This means:
- No telemetry
- No spying
- No profiling
- Full control over the data that leaves my local network
Document Processing #
Second priority is document processing, which also includes OCR and RAG.
Simply speaking, I need the LLM chat to accept documents and answer my questions about them. It would be also great to manage knowledge collections which the model would be able to query automatically.
Multilingual #
Related to the previous requirement: there is no use in processing my documents unless the model supports the language they are written in.
I’m originally from Russia, I live in Germany and I work in English. To cover all of my use-cases the models would need to be multilingual and support at least these 3 languages.
Reasonable Speed and Accuracy #
Obviously, no matter how sophisticated my setup is, if it takes minutes to process every prompt – it’s unusable. As I explain further on, there are ways to balance speed and accuracy by picking the right LLM for your needs.
Multiuser Support #
I’m not the only person in our household who needs a privacy-focused LLM chatbot. So, this setup needs to support multiple users and needs to work on every device we have at home. The easiest way to achieve that is a web interface with authentication.
Runs Locally #
Everything must run on my local network.
I run a VPN server at home and my phone is always connected to it. So, no matter where I am physically, my phone is always on my local network. This means I can use this setup securely from almost anywhere in the world.
Open Source #
All of the software I’m going to run locally needs to be open source. Of course, that does not guarantee the absence of malicious code, but at least there is a chance that such code would be discovered at some point. Also, I’d like a direct feedback channel where maintainers listen and help me resolve issues should I have any. As you’ll see later, this was an important factor.
Conversational Mode (Optional) #
I’m not currently using this functionality with any of the online services, but it might be handy sometimes to have the model read something aloud or to ask a question when typing is too slow.
It’s nice to have, but not a blocker for me by any means.
Choosing Hardware #
Alright, here I’m very biased because before I started all of this, I had already bought a Mac mini M4 Pro 64GB for other purposes (music production, hosting a media library). Naturally, I would not want to buy even more hardware for just running an LLM. So, for me the choice has been already made.
However, I would argue here that a Mac mini is one of the most affordable options for self-hosting an AI chatbot at the moment.
Let’s compare current prices across multiple options. The reference configuration we’re looking for is:
- At least 32 GB of video memory (realistically 64 GB: all the models I run need 42 GB, plus whatever the OS needs)
- Let’s say a reasonable 1 TB SSD
Discrete GPU #
Building a traditional PC with a discrete GPU is not really an option here. GPUs with 32 GB of VRAM and above are quite hard to come by these days. For example, prices for the Nvidia RTX 5090 32 GB now start around $3,900 (Newegg, 5th of April 2026). Note that the search returned only a single card in stock.

You can literally buy 2 Mac minis M4 32 GB for the same money and it’s still going to be cheaper than a single 5090 32 GB. Also, it’s just a GPU, you’d still need to buy and build the rest of the PC.
To be clear, we’re not talking about performance here. Pretty much any Nvidia GPU would outperform a Mac mini. What we assess here is capability: how much money do I need to pay so the model fits into video memory without spilling to disk and becoming extremely slow? In this sense, the Mac mini outperforms discrete GPUs.
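That capability argument can be reduced to simple arithmetic. Here is a quick sketch using the prices quoted above (April 2026 figures; the metric is my own, not an industry benchmark):

```python
# The capability metric I care about: dollars per GB of model-addressable memory
# (prices taken from the comparison in this post, April 2026).
def usd_per_gb(price_usd, mem_gb):
    return round(price_usd / mem_gb, 2)

print(usd_per_gb(3900, 32))   # RTX 5090: 121.88 $/GB, and it's only a GPU
print(usd_per_gb(2199, 64))   # Mac mini M4 Pro: 34.36 $/GB, a whole computer
```

Even before buying the rest of the PC around the GPU, the Mac is several times cheaper per gigabyte of memory the model can live in.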
Chips with Unified Memory #
The solution to the limited VRAM problem is of course the unified memory architecture, which modern Macs already have as standard, but, of course, there are also options in the PC world. Namely, AMD’s Strix Halo architecture is exactly that. So, any PC based on Ryzen AI MAX+ 395 series is what you’d need for a comparable experience.
My search led to two popular options: the GMKtec EVO-X2 and the Framework Desktop.
To be fair, GMKtec does not have an option for a 512 GB SSD and if you don’t care much about the disk space you could bring the Mac’s price further down to $1,999. Apple is known for their extremely high storage fees.
Summary #
For 64 GB of unified memory and 1 TB SSD we get this (all prices are taken on the 5th of April 2026):
| Name | Chip | Memory | Price |
|---|---|---|---|
| Mac mini | M4 Pro (12 CPU / 16 GPU) | 64 GB (up to 273 GB/s) | $2,199.00 |
| GMKtec EVO-X2 | Ryzen AI Max+ 395 | 64 GB (up to 256 GB/s) | $1,999.99 |
| Framework Desktop | Ryzen AI Max+ 395 | 64 GB (up to 256 GB/s) | $2,193.00 |
I found it interesting that the Ryzen AI Max+ 395’s tech specs don’t mention the actual memory bandwidth. However, the official blog post claims a theoretical maximum of 256 GB/s, compared to 273 GB/s for the M4 Pro chip.
Also, it was surprisingly hard to find any direct performance comparison of M4 Pro against Ryzen AI MAX+ 395. Since I don’t have access to the AMD hardware, I can’t do it myself. So, I have to skip this. If you know where to find these results, contact me and I’ll put the link here.
I would expect the performance to be similar when it comes to LLMs. The power draw, on the other hand, is more than 3 times lower on the M4 Pro during peak load, according to this video:

Considering all of that and what you get with a Mac in terms of software, I personally would choose a Mac mini over a Ryzen AI Max+ 395 machine on any day. Also, the electricity is quite expensive here in Germany.
AI Chatbot Components #
Self-hosting an LLM chatbot is all about connecting existing components together and running them as services on your machine.
Let’s go through every component one by one and then I’ll show how to connect them.
Model manager #
A model manager is a program for searching and downloading AI models. There are standalone tools like hf, but a model manager is often built into the programs that also run the engines.
Engine(s) #
This is the component that actually runs your models and serves your requests to them.
Before we get to comparing the actual engines, let me clarify something important:
GGUF vs MLX #
On Hugging Face you can find models in two formats: GGUF and MLX. Let’s compare them:
| Feature | GGUF (llama.cpp) | MLX (Apple) |
|---|---|---|
| Target Hardware | Universal: CPU, NVIDIA, AMD, Apple. | Exclusive: Apple Silicon (M1–M5) only. |
| Memory Strategy | mmap + Page Faults: Swaps model parts to disk if > RAM. | Unified Memory (UMA): CPU/GPU share memory; zero-copy, no explicit swap. |
| Execution Model | Pre-compiled Kernels: Fixed C++ kernels; no graph compilation. | Lazy Graph + JIT: Builds Metal kernel graph on-the-fly; auto-fusion. |
| Performance | Scalable: Handles massive models (>70B) via disk swapping. | Fast (Small): Beats GGUF on models <22B; hits RAM wall on huge models. |
| Quantization | Granular: Explicit control (Q4_K_M, Q5_K_XL). | Dynamic: Automatic mixed precision (~4.5 bits), less user control. |
| Best Use Case | Cross-platform, massive models, or strict memory constraints. | Native Mac apps, training/finetuning, and small-to-mid models. |
TL;DR: it’s all about optimizations for Apple Silicon (like the M4 Pro) and how the memory is managed. With GGUF it’s possible to spill a part of the model to disk. With MLX you’re limited by the physical unified memory your Apple Silicon chip has.
Now that we know what GGUF and MLX are, let’s look at the actual engines. The most popular tools nowadays are:
Ollama #
Ollama is a CLI that runs GGUF models using llama.cpp as its backend.
It’s written in Go with cgo bindings to the C++ code, and it includes a model manager. Since it’s GGUF-based, it’s not really optimized for Apple Silicon, so I didn’t look into it much. I installed it, loaded a model about 25 GB in size, and at some point it became so slow it was unusable. I didn’t have such issues with MLX models of the same size.
LM Studio #
LM Studio is a proprietary GUI app that runs GGUF and MLX models.
Among other things it includes:
- Open-source CLI
- Model manager
- Desktop Chat GUI
- MCP integrations
- GGUF and MLX engines
I think it’s fair to say that this is the most popular option at the moment, but it does not satisfy my requirements:
- It’s not open-source
- Most likely has telemetry
- It’s optimized for the desktop experience; there is no official mobile client
- No multi-user support (at least for now)
The CLI is written in TypeScript; the rest is unknown.
Swama #
Swama is an open-source MLX engine written in Swift.
I had high hopes for this one and thought it would be my final choice. I expected higher performance because of the native MLX implementation, without the Python overhead you usually see in other MLX engines. However, I was disappointed by its instability: it crashed so often that it was basically unusable for me. Things might improve in the future, and I might revisit it later, but I would not call the project mature just yet.
oMLX #
oMLX is an open-source project written in Python that runs MLX models and combines multiple engines like:
- mlx-lm – to run text-only LLMs
- mlx-vlm – to run LLMs with vision (which can analyze pictures)
- mlx-embeddings – to run embeddings models for Retrieval-Augmented Generation (RAG)
- mlx-audio – to run STT and TTS models
Among other things it also includes:
- Model manager
- Admin Web UI
- MCP integrations
- Simple chat Web UI
- Benchmarking
- Paged SSD KV caching
- Recently introduced TurboQuant compression method. There is a good video explaining what it is.
This project appears very mature right away, and I was mostly satisfied with its stability. It’s also being actively developed, and every release brings fantastic improvements. I also had a wonderful experience with the maintainers: I asked for Jina reranker architecture support, and it was added within a day and made it into the next release a week later.
Having persistent caching really makes a difference, and it’s particularly noticeable in long chats when you follow up on your first prompt.
oMLX is also compatible with any Anthropic / OpenAI-compatible client (Claude Code, OpenClaw, Cursor, etc.) out of the box and gives you instructions on how to connect these tools.
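For illustration, this is roughly the request shape any OpenAI-compatible client sends; the `/v1/chat/completions` path and `Authorization` header are the usual OpenAI conventions, and the model name is the one I run (adapt both to your setup):

```python
# Sketch of an OpenAI-style chat completion request to a local engine.
import json

payload = {
    "model": "Qwen3.5-35B-A3B-8bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # stream tokens back as they are generated
}
body = json.dumps(payload)

# POST this body to http://127.0.0.1:28100/v1/chat/completions
# with the header: Authorization: Bearer <your oMLX API key>
print(body)
```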
Needless to say, oMLX was my final choice.

Web Chat Client #
If with engines you have a few choices, with the web chat client there is really only one option – Open Web UI.
I tried a few other projects like LibreChat but they can’t really compete with Open Web UI in terms of features.
Open Web UI has it all:
- Connection to any Open AI-compatible engine endpoint
- Support for agentic tools (built-in and MCP)
- Web search as an agentic tool or via the RAG pipeline
- Code execution as an agentic tool, integration with Open Terminal
- Configurable RAG pipeline: content extraction, embeddings, reranker. Multiple supported vector databases.
- STT and TTS support
- Integrating external image generation / editing models
- Knowledge base support (organizing documents into vectorized collections). Type # to select a collection for the context of your prompt
- Support for writing skills. Type $ to select a skill for your prompt
In my opinion it’s as good as the web UI for Claude or ChatGPT and perhaps even better.
Web Search #
In the context of this project web search is the implementation of the agentic tool that searches the web. At the moment, Open Web UI supports 25 different integration options to choose from.
This is tricky. My most important requirement is privacy, but how to make the web search private?
The answer is “you can’t”.
The good news is that with Open Web UI we can control whether the model has access to the web search tool via a per-chat toggle. So, it’s already a decent level of privacy when needed.
But what about those prompts when you ask the model to research something for you?
Well, you have to pick the least evil for that.
I chose Brave Search API:
- “Own, built-from-scratch index”
- “LLM-Optimized”
- “Does not profile you”
- Native support by Open Web UI
Ultimately, these claims can’t be independently verified, so take everything with a grain of salt. They do disclose though that your search queries are stored for 90 days for billing and debugging purposes. Yeah, if you say so.
You have to register on their website and get your API token. Every month you get 1,000 search queries for free; beyond that, it’s $5 for every additional 1,000 queries. Of course, you can set a monthly spend limit. Mine is set to $20. With my active use, I’m yet to exceed my monthly free allowance.
All things considered, I’d rather use this over anything based on Google or Bing. Ideally, I’d prefer to have a European index and it’s on its way but not ready just yet.
Nginx + Letʼs Encrypt (Optional) #
I don’t want to quote a tutorial on Nginx and ACME here, but it’s worth mentioning that if you’re planning to use features like microphone access (STT) or push notifications from Open Web UI, you must have a valid HTTPS connection. Otherwise, browsers block such features over insecure channels.
Here you can find the full documentation on how to setup Open Web UI with Nginx. There are some tricky aspects when it comes to caching and upload limits.
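As a starting point, here is a minimal sketch of such a reverse-proxy server block. It assumes Nginx listens on port 8443 (forwarded from 443, as in the port-forwarding setup later in this post) and Open Web UI on 127.0.0.1:8080; the hostname and certificate paths are placeholders for your own:

```nginx
server {
    listen 8443 ssl;
    server_name ai.example.home;  # placeholder hostname

    # placeholder certificate paths – point these at your Let's Encrypt files
    ssl_certificate     /var/svcuser/etc/nginx/fullchain.pem;
    ssl_certificate_key /var/svcuser/etc/nginx/privkey.pem;

    client_max_body_size 100M;  # allow larger document uploads

    location / {
        proxy_pass http://127.0.0.1:8080;
        # WebSocket support, needed for streaming chat responses
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```

The WebSocket upgrade headers are the part people most often forget; without them, streaming responses in the chat break.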
Choosing Model #
Yes, we’re finally here. We’re choosing the actual models now. Models, plural: you’ll need a separate model for each purpose.
Choices #
First, let’s figure out what we’re choosing from. How do the models differ?
Choosing a model is always a balancing act:
- Parameter count ↑ = Smarter ↑ = Memory ↑ – how vast is the knowledge
- Quantization ↓ = Accuracy ↓ = Generation Speed ↑ – like image compression, the lower the number, the lower the quality
- Capabilities:
- Instruct or thinking mode?
- What built-in tools does it support? (web-search, memory, knowledge base, etc).
- Vision – does it work with images?
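The memory side of this trade-off is easy to estimate with rough arithmetic: weights take about parameters × (bits per weight / 8) bytes, with KV cache and runtime overhead on top. A quick sketch:

```python
# Back-of-the-envelope weight memory for a quantized model.
# Real usage is somewhat higher (KV cache, runtime overhead).
def weight_gb(params_billion, bits):
    return params_billion * bits / 8  # result in GB (decimal)

print(weight_gb(14, 8))  # 14.0 -> close to the ~15 GB I see for a 14B 8-bit model
print(weight_gb(35, 4))  # 17.5 -> halving the bits halves the footprint
```

This is why dropping from 8-bit to 4-bit quantization is so tempting on memory-constrained hardware, and why it costs you accuracy.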
Main Model #
This is the model you’ll be chatting with. LLMs are text-only; VLMs are so-called multimodal models that can also consume images.
When it comes to the main model, I decided to stick with 8-bit quantization. I tried 4-bit, but it hallucinated more often than I would like.
Originally, I settled on Qwen3-14B-8bit:
- Multilingual, thinking text model
- Decent accuracy
- 15 tokens/s
- 15 GB of memory
Currently, I’m evaluating Qwen3.5-35B-A3B-8bit:
- Multilingual, thinking multimodal model
- More competent
- Mixture-of-Experts (fast but less accurate)
- 56 tokens/s
- 37 GB of memory
Mixture-of-Experts #
This Mixture-of-Experts (MoE) architecture matters a lot here.
Imagine a room with 256 specialists. In a regular model, all 256 specialists look at every token you send – slow and expensive. A MoE model does something smarter: for each token it processes, it only wakes up 9 of those 256 specialists:
- 8 “routed” experts – chosen dynamically based on what the token needs
- 1 “shared” expert – always active, no matter what – generalist
The rest of the 247 specialists just sit idle for that token – saving compute.
So, in theory you get the knowledge of a 35B parameter model at the speed of a 3B parameter model.
As of now, there are 194 MoE models available on Hugging Face, so there is something to choose from.
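The routing step above can be sketched in a few lines. This is a toy illustration, not Qwen’s real gating code: a gate scores every expert per token, and only the top-8 routed experts plus the one always-on shared expert actually run:

```python
# Toy MoE routing: pick the best-scoring experts for a single token.
import random

def route(gate_scores, top_k=8):
    """Indices of the routed experts chosen for one token."""
    ranked = sorted(range(len(gate_scores)), key=gate_scores.__getitem__, reverse=True)
    return ranked[:top_k]

random.seed(0)
scores = [random.random() for _ in range(256)]  # gate output for one token
routed = route(scores)
active = len(routed) + 1        # + the always-on shared expert
print(active, 256 - active)     # 9 experts run, 247 sit idle
```

Only the parameters of the active experts participate in the forward pass for that token, which is where the speed-up comes from.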
Model Configuration #
It’s very important to find optimal configuration parameters for your model.
Normally, the model creators provide a list of optimal configuration parameters for various use-cases; see, for example, the official Qwen3.5-35B-A3B-8bit documentation.
All of these parameters can be set on a per-model basis, either in oMLX or in Open Web UI.
Content Extractor #
This is what processes a document when you upload it and extracts the text from a PDF or DOCX file. It’s not really a model on its own, but it supports OCR (via the EasyOCR model) for scanned documents.
The goal here is to extract the text out of anything you upload.
My search on the internet led me to Docling as a leading content extraction tool.
Embeddings #
This is the model that converts the extracted text into vectors, which are stored in a vector database. When your main model needs to access a document, your query is embedded with the same model, the closest vectors are retrieved from the database, and the corresponding text is embedded into the main context of the LLM. Based on this augmented context, the LLM can generate an answer back to you. Hence the names “embeddings” and “retrieval-augmented generation”.
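The retrieval step can be sketched with toy vectors. In a real setup the vectors come from an embeddings model (hundreds of dimensions) and live in a vector database; the 3-dimensional values here are made up for illustration:

```python
# Minimal sketch of RAG retrieval: find the stored chunk whose embedding
# is closest (by cosine similarity) to the embedding of the user's query.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

chunks = {
    "The rent is due on the 3rd of each month.":        [0.9, 0.1, 0.0],
    "The tenant must repaint the walls on moving out.": [0.1, 0.8, 0.2],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of "When is rent due?"

best = max(chunks, key=lambda text: cosine(chunks[text], query_vec))
print(best)  # this chunk gets injected into the LLM's context
```

The retrieved text, not the vectors themselves, is what ends up in the model’s context window.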
Based on this blog post I decided to use jina-embeddings-v5-text-small.
Reranker (Optional) #
A reranker is a model that re-scores the search results you receive from the embeddings lookup against your prompt, so the main model gets the most relevant results on top.
I decided to combine the Jina embeddings model with their reranker – jina-reranker-v3.
Speech-To-Text + Text-To-Speech (Optional) #
Open Web UI includes a locally running Whisper for STT (which should be enough).
For TTS, it’s complicated. I experimented with a few options, including Fish, Chatterbox, Orpheus, Kokoro and Voxtral.
Fish is the current leader in how natural the voice sounds, but you’d need to give a reference audio for training. No pre-trained voices. Also, the model is quite large and takes a lot of memory.
Chatterbox and Orpheus are quite good and have pre-trained voices, but for me they didn’t work well with German and Russian.
Fun fact: when TTS models don’t support a language you ask them to speak, they exhibit very weird behavior. I tried a simple prompt:
“Hello, good day! Hallo, guten Tag! Привет, хорошего дня!”
(English, German, Russian)
And when Chatterbox tried to say this, it simply started laughing instead of speaking Russian:
Orpheus just started speaking like a drunk lady:
Generally, I have not found a model with a great multilingual support so far. Models have accents in German and Russian. However, for English they work just fine. Kokoro is quite good.
I settled on Voxtral for now. Only because of the multilingual support. However, it’s glitchy sometimes: the voice gets suddenly louder or quieter, you start hearing a background noise or some other artifacts.
No ideal solution yet, so I’ll keep looking.
Task Model (Optional) #
Open Web UI is using models for some miscellaneous tasks like:
- Chat name generation
- Chat tag generation
- Search query generation
- Follow up prompt generation
You can select a task model for these purposes in the settings of Open Web UI.
For all of these tasks it needs an LLM. My main model is fast enough, so I just run it as the task model as well; your needs may vary.
Running macOS as a Server #
Although you could simply run everything under your normal user, that won’t do if you care about security and want your services to be available even when your user has no active session on the Mac. These services need to start before you log in.
To do things right you’ll need a separate service account and launch daemons.
Service Account #
A service account is a user that has no shell and no access to the data of your regular user. Ideally, your regular user has no access to the data of the service account either, provided the file system permissions are strict (e.g. 600 on sensitive files).
To create a service account you can just use this shell script:
# Create a new group first
sudo dscl . -create /Groups/_services
sudo dscl . -create /Groups/_services PrimaryGroupID 450
# Create the user
sudo dscl . -create /Users/_svcuser
sudo dscl . -create /Users/_svcuser UserShell /usr/bin/false
sudo dscl . -create /Users/_svcuser RealName "Service Account"
sudo dscl . -create /Users/_svcuser UniqueID 451
sudo dscl . -create /Users/_svcuser PrimaryGroupID 450
sudo dscl . -create /Users/_svcuser NFSHomeDirectory /var/svcuser
# Set a password (required for launchd)
sudo passwd _svcuser
# Create home directory
sudo mkdir -p /var/svcuser
sudo chown -R _svcuser:_services /var/svcuser
# Hide from login screen
sudo dscl . -create /Users/_svcuser IsHidden 1
# Hide home folder from Finder
sudo chflags hidden /var/svcuser
From now on, if you want to execute a command on behalf of this service account you just use sudo -u _svcuser <command>.
Make sure that all the sensitive data like config files and secrets have 600 file permissions on them.
Launch Daemons #
On macOS all services are managed with the launchctl command. Every service is defined in a so-called property list file: practically an XML file that contains keys and values defining the properties of the service.
Here is an example of how I run oMLX:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.omlx.serve</string>
<key>UserName</key>
<string>_svcuser</string>
<key>GroupName</key>
<string>_services</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>-c</string>
<string>exec /opt/homebrew/bin/omlx serve --model-dir /var/svcuser/.omlx/models --host 127.0.0.1 --port 28100 --paged-ssd-cache-dir /var/svcuser/.omlx/ssd-cache --hot-cache-max-size 8GB --api-key $(cat /var/svcuser/.config/secrets/omlx)</string>
</array>
<key>EnvironmentVariables</key>
<dict>
<key>HOME</key>
<string>/var/svcuser</string>
</dict>
<key>KeepAlive</key>
<true/>
<key>RunAtLoad</key>
<true/>
<key>ProcessType</key>
<string>Interactive</string>
<key>StandardOutPath</key>
<string>/var/svcuser/logs/omlx/omlx.log</string>
<key>StandardErrorPath</key>
<string>/var/svcuser/logs/omlx/omlx.error.log</string>
</dict>
</plist>
You need to set 644 file permissions on such property list files, make them owned by root:wheel (launchd refuses to load them otherwise), and put them in /Library/LaunchDaemons, for example /Library/LaunchDaemons/com.omlx.serve.plist.
You can find the full documentation on launchd property list files by typing man 5 launchd.plist in your terminal.
It’s worth noting that you would also need to set the proper prioritization level for your services. It can be done in 2 ways:
<key>Nice</key>
<integer>-5</integer>
Lower “nice” values cause more favorable scheduling.
Or you can simply set the ProcessType like I did in my example above:
<key>ProcessType</key>
<string>Interactive</string>
If left unspecified, resource limits (CPU, I/O) are applied to the service:
- Background – limits apply
- Standard – equivalent to no ProcessType being set
- Adaptive – moves between Background and Interactive based on activity
- Interactive – no limits, critical to maintaining a responsive user experience
Port Forwarding #
Unless you run something as root (normally you should not), your service won’t be able to bind to ports 80 or 443.
I run Nginx as my service account, therefore I must forward the HTTPS port 443 to my Nginx port 8443.
On macOS this is done via pf (the packet filter), configured via the /etc/pf.conf file. Mine looks like this:
/etc/pf.conf
scrub-anchor "com.apple/*"
nat-anchor "com.apple/*"
rdr-anchor "com.apple/*"
rdr-anchor "com.svcuser/*"
dummynet-anchor "com.apple/*"
anchor "com.apple/*"
anchor "com.svcuser/*"
load anchor "com.apple" from "/etc/pf.anchors/com.apple"
load anchor "com.svcuser/portforward" from "/etc/pf.anchors/com.svcuser.portforward"
/etc/pf.anchors/com.svcuser.portforward
rdr pass on lo0 proto tcp from any to any port 443 -> 127.0.0.1 port 8443
rdr pass on en0 proto tcp from any to any port 443 -> 127.0.0.1 port 8443
The pf rules also need to be loaded at boot, which is done with another launch daemon, for example like so:
/Library/LaunchDaemons/com.svcuser.pf-portforward.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.svcuser.pf-portforward</string>
<key>ProgramArguments</key>
<array>
<string>/sbin/pfctl</string>
<string>-E</string>
<string>-f</string>
<string>/etc/pf.conf</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<false/>
<key>StandardOutPath</key>
<string>/var/svcuser/logs/nginx/pf-portforward.log</string>
<key>StandardErrorPath</key>
<string>/var/svcuser/logs/nginx/pf-portforward.error.log</string>
</dict>
</plist>
Connecting All Pieces Together #
Now we know what we need to run and how to run macOS as a server. We’re ready to connect all the components together.
Launch Daemons #
Now you need to run oMLX, Docling and Open Web UI as launch daemons.
Preparations #
Let’s create some directory structures first:
# "bin" for Python virtual environments, "data" for configs and databases, "logs" and "secrets"
sudo mkdir -p /var/svcuser/{bin,data,etc/nginx,logs/omlx,logs/docling,logs/open-webui,.config/secrets}
# setting the right ownership (IMPORTANT!)
sudo chown -R _svcuser:_services /var/svcuser
We’ll need Python 3.11 and two virtual environments to run Docling and Open Web UI as separate services. They might run just fine in a single virtual environment, but there is always a risk of dependency conflicts, and I prefer to play it safe. Currently, both Open Web UI and Docling run on Python 3.11; double-check the current requirements before proceeding.
brew install python@3.11
PYTHON311="$(brew --prefix python@3.11)/bin/python3.11"
sudo -u _svcuser "$PYTHON311" -m venv /var/svcuser/bin/venv-open-webui
sudo -u _svcuser "$PYTHON311" -m venv /var/svcuser/bin/venv-docling-serve
Note: to ensure the correct file ownership, everything you install inside the service account must be installed on behalf of the service account user by using sudo -u _svcuser. Make sure that all created directories have the right _svcuser:_services ownership and proper file permissions.
oMLX #
I find that installing oMLX with Homebrew is the simplest option. This way you don’t need to deal with an additional Python virtual environment:
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
You can find my /Library/LaunchDaemons/com.omlx.serve.plist file for oMLX as an example above.
Don’t forget to create the secret file /var/svcuser/.config/secrets/omlx with the 600 file permissions. This is the API key for the oMLX server and also your password for the admin dashboard.
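A key like this can be generated and locked down in one go. On the server you would wrap it in sudo -u _svcuser and use the real /var/svcuser/.config/secrets/omlx path; the filename here is just a placeholder:

```shell
# Generate a 64-character hex API key readable only by its owner.
umask 077                            # files created from now on default to 600
openssl rand -hex 32 > omlx.secret   # 32 random bytes -> 64 hex characters
chmod 600 omlx.secret                # explicit, in case your umask differs
```

The same recipe works for the Docling and Open Web UI secrets later in this post.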
Docling #
Docling is not on Homebrew; it’s just a Python package. We need to install it in the prepared virtual environment by running:
sudo -u _svcuser /var/svcuser/bin/venv-docling-serve/bin/pip install 'docling-serve[ui]' easyocr
Note: easyocr is required if you want your scanned documents to be processed.
To run it as a daemon you’d need a /Library/LaunchDaemons/com.svcuser.docling-serve.plist file like so:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>EnvironmentVariables</key>
<dict>
<key>DOCLING_SERVE_MAX_SYNC_WAIT</key>
<string>600</string>
<key>HOME</key>
<string>/var/svcuser</string>
<key>UVICORN_WORKERS</key>
<string>1</string>
</dict>
<key>GroupName</key>
<string>_services</string>
<key>KeepAlive</key>
<true/>
<key>Label</key>
<string>com.svcuser.docling-serve</string>
<key>ProcessType</key>
<string>Interactive</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>-c</string>
<string>export DOCLING_SERVE_API_KEY=$(cat /var/svcuser/.config/secrets/docling); exec /var/svcuser/bin/venv-docling-serve/bin/docling-serve run --host 127.0.0.1 --port 5001</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>SessionCreate</key>
<false/>
<key>StandardErrorPath</key>
<string>/var/svcuser/logs/docling/docling.error.log</string>
<key>StandardOutPath</key>
<string>/var/svcuser/logs/docling/docling.log</string>
<key>UserName</key>
<string>_svcuser</string>
</dict>
</plist>
This service also needs a secret file at /var/svcuser/.config/secrets/docling with 600 file permissions containing its API key. Just generate a long random string (e.g. openssl rand -hex 32).
Open Web UI #
We need to install the Python package in the prepared virtual environment by running:
sudo -u _svcuser /var/svcuser/bin/venv-open-webui/bin/pip install open-webui
And here is my /Library/LaunchDaemons/com.openwebui.serve.plist file for Open Web UI:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>EnvironmentVariables</key>
<dict>
<key>DATA_DIR</key>
<string>/var/svcuser/data/open-webui</string>
<key>HOME</key>
<string>/var/svcuser</string>
<key>PATH</key>
<string>/var/svcuser/bin/venv-open-webui/bin:/opt/homebrew/bin:/usr/bin:/bin</string>
<key>LOG_LEVEL</key>
<string>INFO</string>
<key>ENV</key>
<string>prod</string>
<key>GLOBAL_LOG_LEVEL</key>
<string>WARNING</string>
<key>OFFLINE_MODE</key>
<string>True</string>
<key>HF_HUB_OFFLINE</key>
<string>1</string>
</dict>
<key>KeepAlive</key>
<true/>
<key>Label</key>
<string>com.openwebui.serve</string>
<key>ProcessType</key>
<string>Adaptive</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>-c</string>
<string>export WEBUI_SECRET_KEY=$(cat /var/svcuser/.config/secrets/openwebui); exec /var/svcuser/bin/venv-open-webui/bin/open-webui serve --host 127.0.0.1 --port 8080</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>SessionCreate</key>
<false/>
<key>StandardErrorPath</key>
<string>/var/svcuser/logs/open-webui/webui.error.log</string>
<key>StandardOutPath</key>
<string>/var/svcuser/logs/open-webui/webui.log</string>
<key>UserName</key>
<string>_svcuser</string>
</dict>
</plist>
Once more you’d need a secret file at /var/svcuser/.config/secrets/openwebui with 600 permissions. It’s used for signing authorization tokens for users. Generate a very long random string.
Start The Daemons #
Now all the daemons can be started by running:
sudo launchctl load /Library/LaunchDaemons/{com.omlx.serve.plist,com.svcuser.docling-serve.plist,com.openwebui.serve.plist}
Check the logs and make sure everything is running properly.
Adding The Models #
Visit http://127.0.0.1:28100/admin/dashboard and use the generated secret as your password to open the oMLX admin dashboard.
Now you can download the models you’ve chosen: go to Models -> Downloader, search for the models and click Download.
Once the models are downloaded, go to Settings -> Model Settings and enable PIN for all the models you want to be always available without delay (otherwise they get evicted from memory after 5 minutes of idle, and it takes some time to load them back).
Then go to the configuration of your main LLM (or VLM) and set the recommended configuration parameters; here’s what I have for mine:
CTX WINDOW should be as large as you can afford (larger windows require more memory), and MAX_TOKENS caps how many tokens the model can generate in a single response.
You might also want to experiment with the advanced options later. For now, I would advise keeping them as they are.
Open WebUI Settings #
Once your main model is configured, you can go to http://127.0.0.1:8080/, where you can create the admin user and start connecting your models.
Connecting to oMLX #
First, click your user’s picture and then Admin Panel.
Go to Settings -> Connections and add a local OpenAI endpoint (in this context, “OpenAI” refers only to the API schema, not the company):
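Before wiring the endpoint into the Connections form, you can sanity-check it from a terminal. This is a sketch under assumptions: /v1/models is the standard OpenAI-style route, and I’m assuming oMLX serves the API on the same port as its dashboard — use whatever base URL you actually enter in the form:

```shell
# List the models exposed by the local OpenAI-compatible API.
# Base URL is an assumption from this setup — adjust to yours.
curl -s http://127.0.0.1:28100/v1/models | python3 -m json.tool
```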
Configuring The Models #
Now, if you go to Settings -> Models, you should see all of your models from oMLX. I suggest hiding all of them except the main model (in my case, it’s Qwen3.5-35B-A3B-8bit).
Then click on your main model. Here we need to configure a few important parameters:
First of all, set Access to Public, so other users can also chat with this model and it can be used as a task model (see “Other Settings” below).
Second, you definitely need a system prompt. System prompts are non-trivial and highly individual. For a start, what you definitely need is to declare your regional formats, for example:
**Currency & Units**:
- **Currency**: Euro (€). Symbol after amount (e.g., 12,99 €).
- **Dates**: German format DD.MM.YYYY (e.g., 07.03.2026).
- **Time**: 24-hour format (e.g., 14:30).
- **Units**: Metric only (km, m, cm, mm, kg, g, °C, ml, l, kWh). Never use imperial unless explicitly asked.
- **Paper Sizes**: DIN A standard (A4, A3).
There are plenty of resources on the internet about writing a good system prompt. You can also ask another model, like Claude, to generate a system prompt for you based on your preferences.
For example, I gave my model a personality of Gwen Stacy from Spider-Verse. It’s more fun this way. I mean, the model is called “Qwen”, this was an obvious choice.
Next, you need to click Advanced Parameters -> Show and set Function Calling to Native. This allows your model to use agentic tools such as web_search or get_current_timestamp.
At the bottom you can select which built-in tools the model has access to.
After that, don’t forget to click Save.
Adding Web Search #
Now go to Settings -> Web Search, select your provider (e.g. Brave) and put the API key in.
Be careful with the Concurrent Requests setting. Some APIs might ban you for too many concurrent requests. Therefore I set it to 1.
Adding Retrieval-Augmented Generation #
Settings -> Documents
Note: I’m setting the Relevance Threshold to 0. Perhaps I did something wrong, but with any value above 0, uploading a single document to a chat didn’t work for me: the model couldn’t retrieve any content from it and got empty results. So I kept it at 0.
Other Settings #
If you wish to experiment with STT or TTS, it’s in the Settings -> Audio section, but I’m not covering it here.
As the final thing to configure, I recommend visiting Settings -> Interface. Here you can select your task model (in my case, my main model doubles as the task model) and enable or disable the background tasks it’s used for, such as chat title generation, tag generation, follow-ups, etc.
That’s it! #
Now your model is ready for action, you can start a new chat and try it out.
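You can also exercise the model outside the UI. Here’s a sketch of a chat-completion request straight against the OpenAI-compatible endpoint; the base URL and model name are the ones from my setup, so substitute your own:

```shell
# Send a single chat message to the local model over the OpenAI-style API.
# Base URL and model name below are assumptions from this setup.
curl -s http://127.0.0.1:28100/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen3.5-35B-A3B-8bit",
        "messages": [{"role": "user", "content": "Say hello in German."}],
        "max_tokens": 64
      }'
```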
If you wish to create skills or knowledge collections, you can find them in the Workspace section under the New Chat button.
Don’t forget to read the Open WebUI documentation; it’s very detailed and well-written.
Demos #
Here I’ll show you my setup in action.
Agentic Tools #
This demo shows how my model uses agentic tools while handling my prompt “Find me a concert in Berlin next week”.
You can see how the model goes through reasoning cycles and makes decisions to use agentic tools:
- First, it requests the current timestamp so it can determine the date range for “next week”
- Then it runs web search queries
- After finding relevant results, it fetches the web page with the most relevant information
- Finally, it presents the results as a summary and recommends where I could buy tickets
Documents #
This demo shows how the model retrieves a vectorized document and queries relevant segments of it.
First, we upload a document in the chat itself:
Second, we use a previously created knowledge collection.
Skills #
Using a simple skill for text proofing which is defined like so:
**IGNORE CHAT HISTORY EXCEPT THE LAST MESSAGE**
You are a text proofing assistant. Your task is to analyze the user's text and provide corrections only when improvements are needed.
**Instructions:**
1. Read the user's text and identify:
- The language of the text
- All grammatical errors according to the detected language
- Awkward or unnatural phrasing that could be improved
2. Decide if corrections are needed:
- If changes are needed: Provide the corrected text in a code block, then list each change as a bullet point explaining what was fixed
- If no changes are needed: Output exactly: "No changes necessary."
**Output Format:**
```text
corrected text here
```
Changes made:
- Change 1 explanation
- Change 2 explanation
**Examples:**
Input: "She go to the store yesterday."
Output:
```text
She went to the store yesterday.
```
Changes made:
- "go" corrected to "went" (past tense)
Input: "The cat chased it's tail."
Output:
```text
The cat chased its tail.
```
Changes made:
- "it's" corrected to "its" (possessive, not contraction)
Input: "This looks good."
Output: No changes necessary.
**Rules:**
- Preserve original formatting and paragraph structure in the corrected text
- Do not add commentary beyond explaining grammatical changes
- Process only the user's message; disregard anything outside it.
Vision #
I uploaded a strange picture and asked to tell me what it is.
It’s very impressive that the model managed to read text that was barely visible and upside down. To be fair, the first part is actually wrong; it should be “Kurvengewedel im Herzen des Harzes”. Impressive regardless.
Conclusion #
The main question here is: “Is this really practical?”
I use this chatbot daily for:
- Research
- Simple code generation
- Document analysis
- Language tools: translation, text-proofing, etc.
For my use cases, it’s almost as good as the online models; I can’t complain. Maybe it’s not quite at the level of the commercial models when it comes to coding, but I can still occasionally generate a script with Claude. I have no problem with that. My main use case is working with personal documents.
I think it’s worth pointing out that I would have this Mac mini anyway. I use it for making music. Now it brings even more value for the same money. I get as much privacy as I can get while using AI.
So, the answer is YES. And it was totally worth spending a few evenings diving deep into the stack. I learned a lot in the process, too. I hope you learned something from this as well.