Hugging Face Integration
Integrate OpenRegister with Hugging Face Text Generation Inference (TGI) or vLLM to run Mistral and other Hugging Face models locally with an OpenAI-compatible API.
Overview
Hugging Face provides two options for running local LLMs with OpenAI-compatible APIs:
- Text Generation Inference (TGI): Official Hugging Face solution, optimized for production
- vLLM: Alternative with better throughput, full OpenAI API compatibility
Both provide:
- ✅ OpenAI-Compatible API - Drop-in replacement for OpenAI
- ✅ Privacy-First - All data stays local
- ✅ Cost-Free - No API fees
- ✅ Fast - Optimized inference engines
- ✅ Flexible - Choose any Hugging Face model
Comparison
| Feature | TGI | vLLM | Ollama |
|---|---|---|---|
| OpenAI API | ✅ Yes (v1.4.0+) | ✅ Yes | ✅ Yes (v0.1.24+) |
| Speed | ⚡⚡ Fast | ⚡⚡⚡ Very Fast | ⚡ Good |
| Models | Hugging Face | Hugging Face | Curated list |
| Setup | Medium | Medium | Easy |
| Memory | 8-16GB | 8-16GB | 8-16GB |
| Use Case | Production | High throughput | Simple setup |
Prerequisites
- Nextcloud 28+ with OpenRegister installed
- Docker and Docker Compose
- GPU recommended (8GB+ VRAM) for optimal performance
- At least 16GB RAM for larger models
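A quick way to sanity-check these prerequisites on the host:

```bash
# Docker and Compose are installed
docker --version && docker-compose --version

# An NVIDIA GPU is visible (optional, but strongly recommended)
nvidia-smi

# Enough RAM is available
free -h
```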
Quick Start
Option 1: Text Generation Inference (TGI) - Recommended
Pros:
- Official Hugging Face solution
- Well-maintained and documented
- Optimized for production
- Built-in quantization support (opt-in via --quantize)
Installation:
```bash
# Start TGI with Mistral (using the huggingface profile)
docker-compose -f docker-compose.dev.yml --profile huggingface up -d tgi-mistral

# Follow the model download (~15GB for Mistral 7B)
docker logs -f openregister-tgi-mistral
```
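Once the download finishes, you can verify the server is ready before configuring OpenRegister (assuming the default host port mapping of 8081:80):

```bash
# Returns 200 once the model is loaded and the server is ready
curl -i http://localhost:8081/health

# Shows the loaded model and server configuration
curl http://localhost:8081/info
```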
Configuration in OpenRegister:
- Navigate to Settings → OpenRegister → LLM Configuration
- Select OpenAI as provider (TGI is OpenAI-compatible)
- Configure:
  - Base URL: `http://tgi-mistral:80` (from the Nextcloud container)
  - Model: `mistral-7b-instruct`
  - API Key: `dummy` (not used for local inference)
Option 2: vLLM - Alternative
Pros:
- Faster inference
- Better throughput for multiple requests
- Full OpenAI API compatibility
- PagedAttention optimization
Installation:
```bash
# Start vLLM with Mistral (if configured)
docker-compose -f docker-compose.dev.yml --profile huggingface up -d vllm-mistral

# Follow the model download
docker logs -f openregister-vllm-mistral
```
Configuration in OpenRegister:
- Navigate to Settings → OpenRegister → LLM Configuration
- Select OpenAI as provider
- Configure:
  - Base URL: `http://vllm-mistral:8000` (from the Nextcloud container)
  - Model: `mistral-7b-instruct`
  - API Key: `dummy` (not used for local inference)
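To confirm vLLM is up and serving the expected model name, query the OpenAI-compatible models endpoint from the host:

```bash
# Should list mistral-7b-instruct among the served models
curl http://localhost:8082/v1/models
```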
Configuration Details
TGI Service Configuration
```yaml
tgi-mistral:
  image: ghcr.io/huggingface/text-generation-inference:latest
  container_name: openregister-tgi-mistral
  restart: always
  ports:
    - "8081:80"
  volumes:
    - tgi-models:/data
  environment:
    - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
    - MAX_INPUT_LENGTH=4096
    - MAX_TOTAL_TOKENS=8192
    - MAX_CONCURRENT_REQUESTS=128
  deploy:
    resources:
      limits:
        memory: 16G
      reservations:
        memory: 8G
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```
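For quick experiments outside Compose, an equivalent standalone `docker run` is sketched below; the volume and port names mirror the Compose file, so adjust them to your setup:

```bash
# Run TGI standalone with GPU access; weights are cached in the tgi-models volume
docker run --gpus all -p 8081:80 \
  -v tgi-models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.1 \
  --max-input-length 4096 \
  --max-total-tokens 8192
```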
vLLM Service Configuration
The `vllm/vllm-openai` image takes its settings as command-line arguments rather than environment variables:

```yaml
vllm-mistral:
  image: vllm/vllm-openai:latest
  container_name: openregister-vllm-mistral
  restart: always
  ports:
    - "8082:8000"
  volumes:
    - vllm-models:/root/.cache/huggingface
  command:
    - --model
    - mistralai/Mistral-7B-Instruct-v0.1
    - --served-model-name
    - mistral-7b-instruct
    - --tensor-parallel-size
    - "1"
    - --gpu-memory-utilization
    - "0.9"
```
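The standalone equivalent, for reference:

```bash
# Run vLLM's OpenAI-compatible server with GPU access
docker run --gpus all -p 8082:8000 \
  -v vllm-models:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.1 \
  --served-model-name mistral-7b-instruct \
  --gpu-memory-utilization 0.9
```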
Available Models
Recommended Models for OpenRegister
| Model | Size | Use Case | Memory Required |
|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | 7B | General purpose, RAG | 16GB |
| Mixtral-8x7B-Instruct | 47B | High quality, complex | 48GB+ |
| Llama-3-8B-Instruct | 8B | General purpose | 16GB |
| Phi-3-mini-instruct | 3.8B | Fast, lightweight | 8GB |
| Qwen2-7B-Instruct | 7B | Multilingual, code | 16GB |
Changing the Model
Edit docker-compose.dev.yml:
For TGI:
```yaml
tgi-mistral:
  environment:
    - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.2  # Change this
```
For vLLM (if configured):

```yaml
vllm-mistral:
  command:
    - --model
    - mistralai/Mistral-7B-Instruct-v0.2  # Change this
    - --served-model-name
    - mistral-7b-instruct
```
Then recreate the service so the new configuration takes effect (a plain `restart` does not pick up compose file changes):

```bash
docker-compose -f docker-compose.dev.yml --profile huggingface up -d tgi-mistral
# or
docker-compose -f docker-compose.dev.yml --profile huggingface up -d vllm-mistral
```
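Once the service is back up, confirm the new model is actually the one being served:

```bash
# TGI reports the loaded model on its info endpoint
curl http://localhost:8081/info

# vLLM lists served models on the OpenAI-compatible models endpoint
curl http://localhost:8082/v1/models
```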
API Usage
Testing the API
TGI (port 8081):
```bash
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Hello! What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
```
vLLM (port 8082):
```bash
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Hello! What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
```
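Both servers also implement OpenAI-style streaming. With `"stream": true` (and curl's `-N` to disable output buffering), tokens arrive incrementally as server-sent events:

```bash
curl -N http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Write a haiku about registers."}
    ],
    "max_tokens": 100,
    "stream": true
  }'
```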
Using with LLPhant (PHP)
LLPhant's OpenAI chat client can be pointed at TGI or vLLM through its configuration object:

```php
<?php

use LLPhant\Chat\OpenAIChat;
use LLPhant\OpenAIConfig;

// Point LLPhant's OpenAI-compatible client at the local TGI/vLLM endpoint
$config = new OpenAIConfig();
$config->apiKey = 'dummy';                  // not validated by local servers
$config->url = 'http://tgi-mistral:80/v1';  // or http://vllm-mistral:8000/v1
$config->model = 'mistral-7b-instruct';

$chat = new OpenAIChat($config);
$response = $chat->generateText('What is the capital of France?');
```
Use Cases
1. AI Chat
Enable conversational AI using local models:
- Configure TGI or vLLM
- Set OpenAI provider with local base URL
- Use chat features in OpenRegister
2. RAG (Retrieval Augmented Generation)
Answer questions using your data:
- Configure embedding model (separate from chat)
- Vectorize your objects and files
- Ask questions - AI retrieves relevant context
3. Function Calling
Use Mistral with OpenRegister's function calling:
- Search objects
- Create objects
- Update objects
- Query registers
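At the API level, function calling uses the standard OpenAI `tools` schema. Support depends on the server version and model; `search_objects` below is a hypothetical function used purely for illustration:

```bash
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Find all objects mentioning invoices"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "search_objects",
        "description": "Search objects in an OpenRegister register",
        "parameters": {
          "type": "object",
          "properties": {
            "query": {"type": "string", "description": "Free-text search query"}
          },
          "required": ["query"]
        }
      }
    }]
  }'
```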
Troubleshooting
Container Won't Start
```bash
# Check the logs
docker logs openregister-tgi-mistral
# or
docker logs openregister-vllm-mistral

# Common issue 1: port already in use
sudo lsof -i :8081  # TGI
sudo lsof -i :8082  # vLLM

# Common issue 2: insufficient memory
docker stats openregister-tgi-mistral

# Common issue 3: GPU not available
docker exec openregister-tgi-mistral nvidia-smi
```
Model Download Fails
```bash
# Check internet connectivity from inside the container
docker exec openregister-tgi-mistral ping -c 3 huggingface.co
```

For gated models, set a Hugging Face token in docker-compose.dev.yml:

```yaml
environment:
  - HUGGING_FACE_HUB_TOKEN=your_token_here
```
Connection Errors from OpenRegister
Problem: OpenRegister can't connect to TGI/vLLM.
Solutions:
- Verify the base URL uses the container name: `http://tgi-mistral:80`
- Check that both containers are on the same Docker network (see the commands below)
- Test the connection from the Nextcloud container:

```bash
docker exec <nextcloud-container> curl http://tgi-mistral:80/health
```
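To see which networks each container is attached to:

```bash
# Both containers should share at least one network
docker inspect --format '{{json .NetworkSettings.Networks}}' openregister-tgi-mistral
docker inspect --format '{{json .NetworkSettings.Networks}}' <nextcloud-container>
```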
Slow Performance
Solutions:
- Use GPU acceleration (10-100x faster)
- Choose a smaller model (3B instead of 7B)
- Increase `MAX_CONCURRENT_REQUESTS` for TGI
- Raise `--gpu-memory-utilization` for vLLM
- Quantize the model (see the sketch below)
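Quantization trades some output quality for a much smaller memory footprint. TGI exposes it through the `--quantize` launcher flag; a sketch (available schemes depend on the TGI version):

```bash
# Run Mistral 7B with bitsandbytes quantization to reduce VRAM usage
docker run --gpus all -p 8081:80 -v tgi-models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.1 \
  --quantize bitsandbytes
```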
Performance Optimization
GPU Acceleration
For best performance, use GPU:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```

Note that GPU device requests must sit under `reservations` in Compose.
Performance Gain: 10-100x faster inference with GPU
Concurrent Requests
TGI:
```yaml
environment:
  - MAX_CONCURRENT_REQUESTS=128  # Increase for more parallel requests
```
vLLM:
```yaml
command:
  - --gpu-memory-utilization
  - "0.9"  # Use 90% of GPU memory
```
Further Reading
- TGI Documentation: https://huggingface.co/docs/text-generation-inference
- vLLM Documentation: https://docs.vllm.ai
- Mistral models on Hugging Face: https://huggingface.co/mistralai
Support
For issues specific to:
- TGI setup: Check TGI Documentation
- vLLM setup: Check vLLM Documentation
- OpenRegister integration: OpenRegister GitHub issues