Performance Optimization Guide
This guide covers performance optimization strategies for OpenRegister, with a focus on chat responses and vector search operations.
Performance Monitoring
OpenRegister includes detailed timing logs to track where time is spent during chat operations:
[ChatService] Message processed - OVERALL PERFORMANCE
- contextRetrieval: Time to search vectors and build context
- historyBuilding: Time to load conversation history
- llmGeneration: Time for LLM to generate response (includes tool calls)
[ChatService] Response generated - PERFORMANCE
- toolsLoading: Time to load agent tools
- llmGeneration: Actual LLM inference time
- overhead: Time spent in framework/serialization
Checking Performance Logs
docker logs -f master-nextcloud-1 | grep "PERFORMANCE"
Optimization Strategies
1. Streaming Responses
Current Implementation: Responses can be streamed using Server-Sent Events (SSE) for improved perceived performance.
Benefits:
- Frontend shows response as it's generated
- Perceived performance improvement of 10x+
- User sees first tokens in <1s instead of waiting for complete response
Implementation:
- ChatController::sendMessage() can return an SSE stream
- Use $chat->generateStreamOfText() for Ollama
- The frontend consumes the SSE stream and displays it progressively
2. Ollama Model Selection
Model Size Impact:
- Small models (1-3B parameters): 5-10s response time
- Medium models (7-8B parameters): 20-30s response time
- Large models (70B+ parameters): 60-120s response time
Recommended Models:
# Faster models for Ollama:
docker exec openregister-ollama ollama pull llama3.2:1b # Tiny (faster)
docker exec openregister-ollama ollama pull phi3:mini # Microsoft Phi3 (fast)
docker exec openregister-ollama ollama pull gemma2:2b # Google Gemma (balanced)
Configuration: Select models in Settings → LLM Configuration → Chat Provider
3. GPU Acceleration
Current Support: GPU acceleration is supported for Ollama when configured in docker-compose.yml:
services:
ollama:
image: ollama/ollama:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
Speed improvement: 10-50x faster inference when GPU is available.
Verification:
# Check if GPU is being used
docker exec openregister-ollama nvidia-smi
4. Context Window Optimization
Current Behavior: The full conversation history plus the RAG context is sent to the LLM.
Optimization Options:
- Limit conversation history to last N messages
- Truncate RAG context to top 3-5 most relevant chunks
- Implement smart summarization for long conversations
Configuration: Adjust in buildMessageHistory():
// Limit to last 10 messages instead of all
$limit = 10;
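A minimal sketch of how such a limit could be applied inside buildMessageHistory(), assuming messages are loaded through a mapper (the mapper method shown here is hypothetical):
// Sketch: cap history at the most recent N messages (mapper API hypothetical)
private function buildMessageHistory(int $conversationId, int $limit = 10): array {
    // Fetch only the last $limit messages instead of the full conversation
    $messages = $this->messageMapper->findRecent($conversationId, $limit);

    // Chat LLM APIs expect oldest-first ordering
    return array_reverse($messages);
}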
5. Tool Call Optimization
Current Behavior: Sequential tool calls require multiple LLM roundtrips.
Example Flow:
- User: "List registers and schemas"
- LLM generates → calls list_registers → waits
- LLM receives result → calls list_schemas → waits
- Total time accumulates for each call
Optimization Strategies:
- Implement parallel tool calls where possible
- Cache tool results for the session (see the sketch below)
- Reduce tool call iterations with better prompting
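For the session cache idea, a minimal sketch using per-request memoization (the tool-runner names are assumptions, not the actual OpenRegister API):
// Sketch: memoize tool results so repeated identical calls skip execution
private array $toolResultCache = [];

private function callToolCached(string $tool, array $args): mixed {
    $key = $tool . ':' . md5(json_encode($args));
    // Reuse the earlier result when the same tool is called with the same arguments
    return $this->toolResultCache[$key] ??= $this->toolRunner->execute($tool, $args);
}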
6. Vector Search Optimization
Current Implementation: Vector search can be optimized through several methods:
Optimization Options:
- Disable RAG for agents that don't need it
- Use faster embedding models (nomic-embed-text)
- Implement vector search caching (see the sketch below)
- Index optimization in vector database
- Use PostgreSQL + pgvector for database-level vector operations
Configuration: Adjust RAG settings per agent in agent configuration.
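For the caching option, a sketch using Nextcloud's distributed cache; the vector-service call and the key scheme are assumptions:
use OCP\ICacheFactory;

// Sketch: cache vector search results per query (hypothetical service call)
$cache = $cacheFactory->createDistributed('openregister_vectorsearch');
$key = md5($agentId . '|' . $query);

$chunks = $cache->get($key);
if ($chunks === null) {
    $chunks = $this->vectorService->search($query, 5);
    $cache->set($key, $chunks, 600); // keep results for 10 minutes
}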
7. Model Keep-Alive
Current Configuration: Model keep-alive is configured in docker-compose.yml:
environment:
- OLLAMA_KEEP_ALIVE=30m # Keep model in memory
Purpose: Prevents model from unloading from memory between requests, reducing reload time.
8. Prompt Optimization
Current Approach: System prompts are optimized for clarity and conciseness.
Best Practices:
- Keep prompts concise
- Avoid redundant instructions
- Use tool descriptions instead of prompt examples
Quick Wins
Enable Performance Logging
Performance logging is already implemented - check logs to identify bottlenecks:
docker logs -f master-nextcloud-1 | grep "PERFORMANCE"
Switch to Smaller Model
In OpenRegister settings:
- Navigate to Settings → LLM Configuration → Chat Provider
- Change Chat Model from large models (llama3.2) to smaller models (phi3:mini or gemma2:2b)
Disable RAG for Fast Responses
For agents that don't need context search:
// In agent settings:
enableRag: false
Streaming Implementation
Backend:
public function sendMessageStream(int $conversationId): StreamResponse {
    $stream = function () use ($conversationId) {
        // ... setup ...
        // Emit one SSE frame per chunk as the model produces it
        foreach ($chat->generateStreamOfText($prompt) as $chunk) {
            yield "data: " . json_encode(['chunk' => $chunk]) . "\n\n";
        }
        // Signal completion so the frontend can close its EventSource
        yield "data: [DONE]\n\n";
    };
    return new StreamResponse($stream);
}
Frontend:
const eventSource = new EventSource('/api/chat/stream');
eventSource.onmessage = (event) => {
    // The backend sends a plain "[DONE]" frame when the stream ends;
    // skip JSON.parse for it, otherwise it throws a SyntaxError
    if (event.data === '[DONE]') {
        eventSource.close();
        return;
    }
    const data = JSON.parse(event.data);
    if (data.chunk) {
        appendToMessage(data.chunk);
    }
};
Response Caching
Current Support:
- Cache LLM responses for identical queries (see the sketch below)
- Cache tool call results for session
- Redis support for distributed caching
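A sketch of the query-level response cache via Nextcloud's ICacheFactory (Redis-backed when Nextcloud is configured for distributed caching); the key scheme and the non-streaming generateText() call are assumptions:
// Sketch: serve identical prompts from cache instead of re-running the LLM
$cache = $cacheFactory->createDistributed('openregister_llm');
$key = md5($agentId . '|' . $prompt);

$reply = $cache->get($key);
if ($reply === null) {
    $reply = $chat->generateText($prompt); // assumed non-streaming counterpart
    $cache->set($key, $reply, 3600);       // cache for one hour
}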
Parallel Processing
Current Capabilities:
- Run vector search and history building in parallel (see the sketch below)
- Execute independent tool calls in parallel
- Use async PHP with ReactPHP/Amp
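A sketch of the parallel phase with amphp/amp v3; note this only pays off when the underlying I/O is non-blocking (blocking PDO calls will still run sequentially), and searchVectors() is a hypothetical helper:
use function Amp\async;
use function Amp\Future\await;

// Sketch: run context retrieval and history building concurrently
[$context, $history] = await([
    async(fn () => $this->searchVectors($query)),               // hypothetical helper
    async(fn () => $this->buildMessageHistory($conversationId)), // see section 4
]);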
Monitoring
Performance metrics can be added to response headers:
$response->addHeader('X-Generation-Time', (string) round($llmTime, 2));
$response->addHeader('X-Context-Time', (string) round($contextTime, 2));
Frontend can then display: "Generated in 25.3s"
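The timings themselves can be captured with microtime() around each phase; a sketch (generateText() is an assumed non-streaming variant of the chat call):
// Sketch: measure LLM inference time and expose it to the frontend
$start = microtime(true);
$reply = $chat->generateText($prompt); // assumed non-streaming call
$llmTime = microtime(true) - $start;

$response->addHeader('X-Generation-Time', (string) round($llmTime, 2));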
Testing Performance
After each optimization, test with:
# Simple query (no tools); set BASE_URL to your Nextcloud instance first
curl -X POST "$BASE_URL/api/chat/send" -H 'Content-Type: application/json' -d '{"message": "Hello"}'
# Complex query (with tools)
curl -X POST "$BASE_URL/api/chat/send" -H 'Content-Type: application/json' -d '{"message": "List all registers"}'
# Measure total response time
time curl -X POST "$BASE_URL/api/chat/send" -H 'Content-Type: application/json' -d '{"message": "Hello"}'
Expected Performance
With optimizations applied:
- Simple query: 2-5s (down from 15-20s)
- Tool query: 8-15s (down from 50s+)
- With streaming: Perceived as <1s (first tokens immediate)
- With GPU: 0.5-2s (down from 15-20s)
Priority Order
- 🔴 Enable GPU → 10-50x speedup
- 🔴 Switch to smaller model → 2-3x speedup
- 🔴 Implement streaming → Perceived 10x+ improvement
- 🟡 Optimize context window → 1.5-2x speedup
- 🟡 Tool call optimization → 2x speedup for multi-tool queries
- 🟢 Prompt optimization → 1.2x speedup
Database Performance
For vector search operations, database choice significantly impacts performance:
- MariaDB/MySQL: Vector similarity calculated in PHP (slower)
- PostgreSQL + pgvector: Database-level vector operations (10-100x faster)
See the Database Status tile in the LLM Configuration settings for the current database status and recommendations.
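For illustration, pgvector lets the similarity ranking happen inside the database; a sketch through Nextcloud's IDBConnection (the table name, column name, and vector serialization are assumptions):
// Sketch: database-level cosine distance with pgvector's <=> operator
// $queryVector is the embedding serialized as a pgvector literal, e.g. '[0.1,0.2,...]'
$sql = 'SELECT object_id, embedding <=> CAST(:query AS vector) AS distance
        FROM oc_openregister_vectors
        ORDER BY distance ASC
        LIMIT 5';
$rows = $this->db->executeQuery($sql, ['query' => $queryVector])->fetchAll();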
Database Index Optimization
Performance Issues Identified
Issue 1: Missing Database Indexes
Problem: Object queries take up to 30 seconds on first access
Root Cause: Missing indexes on commonly searched fields (uuid, slug, name, summary, description)
Issue 2: External App Cache Miss
Problem: External app queries take 4 seconds on every request (no caching)
Root Cause: External app context detection creates a different cache key on each request, so the cache never hits
Comprehensive Database Index Migration
Created a migration with 12 strategic performance indexes; the most important ones are shown below:
-- Critical single-column indexes
CREATE INDEX objects_uuid_perf_idx ON openregister_objects (uuid);
CREATE INDEX objects_slug_perf_idx ON openregister_objects (slug);
CREATE INDEX objects_name_perf_idx ON openregister_objects (name);
CREATE INDEX objects_summary_perf_idx ON openregister_objects (summary);
CREATE INDEX objects_description_perf_idx ON openregister_objects (description);
-- Critical composite indexes
CREATE INDEX objects_main_search_idx ON openregister_objects (register, schema, organisation, published);
CREATE INDEX objects_publication_filter_idx ON openregister_objects (published, depublished, created);
CREATE INDEX objects_rbac_tenancy_idx ON openregister_objects (owner, organisation, register, schema);
CREATE INDEX objects_ultimate_perf_idx ON openregister_objects (register, schema, organisation, published, owner);
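In a Nextcloud app these indexes are added through a migration step; a minimal sketch of what one of them looks like (the class name is illustrative, not the actual migration file):
use Closure;
use OCP\DB\ISchemaWrapper;
use OCP\Migration\IOutput;
use OCP\Migration\SimpleMigrationStep;

// Sketch: one of the performance indexes expressed as a migration step
class VersionExample extends SimpleMigrationStep {
    public function changeSchema(IOutput $output, Closure $schemaClosure, array $options): ?ISchemaWrapper {
        /** @var ISchemaWrapper $schema */
        $schema = $schemaClosure();
        $table = $schema->getTable('openregister_objects');

        if (!$table->hasIndex('objects_uuid_perf_idx')) {
            $table->addIndex(['uuid'], 'objects_uuid_perf_idx');
        }

        return $schema;
    }
}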
Run the Migration:
# Apply the new indexes
docker exec -u 33 master-nextcloud-1 php /var/www/html/occ upgrade
# Verify indexes were created
docker exec -u 33 master-nextcloud-1 mysql -u nextcloud -p nextcloud \
-e 'SHOW INDEX FROM openregister_objects WHERE Key_name LIKE "%perf%";'
Expected Performance Improvements
After Index Migration:
| Endpoint | Before | After | Improvement |
|---|---|---|---|
| /api/objects/voorzieningen/product | 30s | <1s | 97% faster |
| /api/objects/voorzieningen/organisatie | 30s | <1s | 97% faster |
| General object searches | 5-15s | 0.1-0.5s | 95% faster |
After Cache Fix:
| Endpoint | Before | After | Improvement |
|---|---|---|---|
| /api/apps/opencatalogi/api/publications | 4s every time | 50ms (cache hit) | 98% faster |
Search Optimization
Database Index Optimization
Migration: lib/Migration/Version1Date20250102120000.php
Added critical indexes for search performance:
Single-Column Search Indexes
- objects_name_search_idx - Index on the name column
- objects_summary_search_idx - Index on the summary column
- objects_description_search_idx - Index on the description column
Composite Search Indexes
- objects_name_deleted_published_idx - Combined search + lifecycle filtering
- objects_summary_deleted_published_idx - Summary search + lifecycle filtering
- objects_description_deleted_published_idx - Description search + lifecycle filtering
- objects_name_register_schema_idx - Name search + register/schema filtering
- objects_summary_register_schema_idx - Summary search + register/schema filtering
- objects_name_organisation_deleted_idx - Name search + multi-tenancy filtering
- objects_summary_organisation_deleted_idx - Summary search + multi-tenancy filtering
Performance Impact: These indexes enable fast lookup on frequently searched metadata columns, reducing query times from seconds to milliseconds.
Schema Caching System
Service: lib/Service/SchemaCacheService.php
Implemented comprehensive schema caching with:
- In-memory caching for frequently accessed schemas
- Database-backed cache with configurable TTL
- Batch schema loading for multiple schemas
- Automatic cache invalidation when schemas are updated
- Cache statistics and monitoring
Performance Benefits:
- Eliminates repeated database queries for schema loading
- Reduces schema processing overhead
- Enables predictable performance for schema-dependent operations
- Supports high-concurrency scenarios
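A sketch of how a caller might consume such a service (the method names are assumptions, not the actual SchemaCacheService API):
// Sketch: read-through access to the schema cache (method names hypothetical)
$schema = $this->schemaCache->get($schemaId);
if ($schema === null) {
    $schema = $this->schemaMapper->find($schemaId);
    $this->schemaCache->set($schemaId, $schema);
}

// Batch variant: resolve several schemas at once to avoid N+1 queries
$schemas = $this->schemaCache->getMany($schemaIds);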
Optimized Search Query Execution
Handler: lib/Db/ObjectHandlers/MariaDbSearchHandler.php
Enhanced search performance with prioritized search strategy:
Search Priority Order
1. PRIORITY 1: Indexed metadata columns (name, summary, description)
   - Fastest performance using database indexes
   - Direct column access with the LOWER() function
2. PRIORITY 2: Other metadata fields (image, etc.)
   - Moderate performance with direct column access
   - No indexes, but faster than JSON search
3. PRIORITY 3: JSON object search
   - Comprehensive fallback using JSON_SEARCH()
   - Slowest, but ensures complete search coverage
Performance Impact:
- Dramatic improvement in search response times
- Leverages database indexes for common searches
- Maintains comprehensive search coverage
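A simplified sketch of the PRIORITY 1 path using Nextcloud's query builder (reduced from what the actual handler does):
// Sketch: PRIORITY 1 - case-insensitive match on the indexed metadata columns
$qb = $this->db->getQueryBuilder();
$term = $qb->createNamedParameter('%' . strtolower($search) . '%');

$qb->select('*')
   ->from('openregister_objects')
   ->where($qb->expr()->orX(
       $qb->expr()->like($qb->func()->lower('name'), $term),
       $qb->expr()->like($qb->func()->lower('summary'), $term),
       $qb->expr()->like($qb->func()->lower('description'), $term)
   ));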
Expected Performance Improvements
Search Performance
- Metadata searches: 10-50x improvement using indexes
- Full-text searches: 3-10x improvement with prioritized strategy
- Complex searches: 5-15x improvement with composite indexes
Schema Loading Performance
- Individual schemas: 5-10x improvement with caching
- Batch schema loading: 10-20x improvement with bulk operations
- Schema-dependent operations: Consistent sub-millisecond performance
Related Documentation
- Ollama Setup - Ollama configuration and GPU setup
- Vector Search Backends - Vector search backend options
- Search Optimization - Detailed search optimization guide