Semantic Search User Guide

Overview

Semantic search allows you to find information by meaning, not just by matching keywords. Instead of searching for exact words, you can search for concepts, and the system will find related content even if it uses different terminology.

Traditional search (keyword search) looks for exact word matches:

  • Search for "car" → Only finds documents containing "car"
  • Misses documents about "automobile", "vehicle", "transportation"

Semantic search understands meaning:

  • Search for "car" → Finds "car", "automobile", "vehicle", "transportation"
  • Understands that "Amsterdam" relates to "Netherlands" and "Dutch"
  • Recognizes that "budget" is related to "financial planning" and "costs"

How It Works

  1. Vectorization: Your files and objects are converted into mathematical representations (vectors) that capture their meaning
  2. Query Understanding: When you search, your query is also converted to a vector
  3. Similarity Matching: The system finds content with vectors similar to your query vector
  4. Ranking: Results are ranked by how closely they match your query's meaning
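
To make steps 3 and 4 concrete, the sketch below ranks a few pre-computed document vectors against a query vector using cosine similarity. It illustrates the general technique only; the example vectors and scores are made up, and this is not OpenRegister's internal code.

# Illustrative only: rank documents by cosine similarity to a query vector.
# In practice the vectors come from the configured embedding model (OpenAI or Ollama).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = {
    "Fleet policy": np.array([0.9, 0.1, 0.0]),       # about cars and vehicles
    "Canal maintenance": np.array([0.1, 0.8, 0.3]),  # about Amsterdam
    "Quarterly budget": np.array([0.0, 0.2, 0.9]),   # about costs
}
query_vector = np.array([0.85, 0.15, 0.05])          # e.g. the query "car"

ranked = sorted(
    ((name, cosine_similarity(query_vector, vec)) for name, vec in documents.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{score:.2f}  {name}")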

Search Modes

OpenRegister offers three search modes: hybrid, semantic, and keyword.

Hybrid Search

Combines keyword and semantic search for the best results.

Use when: You want the most accurate results (default mode)

Example:

  • Query: "project management tools"
  • Finds: Documents with exact phrase + documents about "planning software", "task coordination", "workflow systems"

Semantic Search

Pure meaning-based search using vector similarity.

Use when: You want to find conceptually similar content, even with different wording

Example:

  • Query: "reducing costs"
  • Finds: Documents about "budget optimization", "financial efficiency", "expense management"

Keyword Search

Traditional full-text search using Apache Solr.

Use when: You need exact phrase matching or very fast results

Example:

  • Query: "invoice #12345"
  • Finds: Only documents containing that exact invoice number

Using Semantic Search

From the Chat Interface

The easiest way to use semantic search is through the AI Chat:

  1. Navigate to AI Chat from the main menu
  2. Type your question in natural language
  3. The system automatically:
    • Uses hybrid search to find relevant context
    • Generates an answer based on your data
    • Cites sources with similarity scores
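
Conceptually, these three steps form a retrieval-augmented generation loop: retrieve scored sources, build a grounded prompt, and generate a cited answer. The sketch below shows only the general shape of such a loop; search_hybrid and ask_llm are placeholder helpers, not OpenRegister functions.

# Conceptual outline of the chat flow; the helpers are placeholders.
def search_hybrid(query: str, limit: int = 5) -> list[dict]:
    # Placeholder: in practice this would call the hybrid search endpoint
    # shown in the "From the Search API" section below.
    return [{"name": "Example file", "score": 0.82, "excerpt": "...relevant text..."}]

def ask_llm(prompt: str) -> str:
    # Placeholder: in practice this would call the configured LLM.
    return "A generated answer grounded in the retrieved context."

def answer_question(question: str) -> dict:
    sources = search_hybrid(question)                     # 1. find relevant context
    context = "\n\n".join(s["excerpt"] for s in sources)  # 2. build a grounded prompt
    answer = ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    return {"answer": answer, "sources": sources}         # 3. answer plus cited sources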

Example questions:

  • "What projects are related to Amsterdam?"
  • "Show me files that mention budgets over 1 million"
  • "Find information about customer complaints"
  • "What do we have about sustainability initiatives?"

From the Search API

For programmatic access, use the API endpoints:

# Semantic search
GET /apps/openregister/api/solr/search/semantic?query=your+query&limit=10

# Hybrid search
GET /apps/openregister/api/solr/search/hybrid?query=your+query&limit=10
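
As a rough illustration, the snippet below calls the hybrid endpoint with Python's requests library. The base URL, the basic-auth credentials, and the shape of the JSON response are assumptions about your own deployment; only the endpoint path and the query and limit parameters come from this guide.

# Illustrative client call; adjust the base URL and credentials for your instance.
import requests

BASE_URL = "https://nextcloud.example.org"   # assumption: your Nextcloud host
ENDPOINT = "/apps/openregister/api/solr/search/hybrid"

response = requests.get(
    BASE_URL + ENDPOINT,
    params={"query": "project management tools", "limit": 10},
    auth=("username", "app-password"),       # assumption: basic auth with an app password
    timeout=30,
)
response.raise_for_status()
print(response.json())                       # response structure depends on your version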

Understanding Results

Similarity Scores

Results include a similarity score from 0.0 to 1.0:

  • 0.9-1.0: Extremely similar (almost identical meaning)
  • 0.8-0.9: Very similar (closely related concepts)
  • 0.7-0.8: Similar (related but distinct concepts)
  • 0.6-0.7: Somewhat similar (loosely related)
  • Below 0.6: May not be relevant
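
If you consume results programmatically, a simple cut-off mirrors the ranges above. The result list and its score and title keys below are hypothetical; check your API response for the actual field names.

# Hypothetical post-filtering of search results by similarity score.
MIN_SCORE = 0.7  # keep "similar" and better, per the ranges above

results = [
    {"title": "Budget plan 2025", "score": 0.91},
    {"title": "Office party invite", "score": 0.42},
    {"title": "Cost overview Q3", "score": 0.74},
]

relevant = [r for r in results if r["score"] >= MIN_SCORE]
for r in sorted(relevant, key=lambda r: r["score"], reverse=True):
    print(f'{r["score"]:.2f}  {r["title"]}')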

Source Citations

When using the chat interface, each answer includes:

  • Source name: File name or object title
  • Source type: File or Object
  • Similarity score: How closely it matches your query
  • Text excerpt: Relevant content from the source

Click on a source to view the full document or object.

Best Practices

Writing Effective Queries

Do:

  • Use natural language: "What are our sustainability goals?"
  • Be specific about what you're looking for
  • Include key concepts: "budget planning 2025"
  • Ask questions: "How do we handle customer refunds?"

Don't:

  • Use single words without context: "budget"
  • Overload with technical operators: "budget AND (2024 OR 2025) NOT draft"
  • Expect exact phrase matching in semantic mode

Improving Results

If you're not getting good results:

  1. Try Different Phrasing:

    • Instead of "cars", try "vehicles" or "transportation"
    • Rephrase your query to emphasize different aspects
  2. Use Hybrid Mode:

    • Combines benefits of both keyword and semantic search
    • Usually gives the most accurate results
  3. Provide Feedback:

    • Use thumbs up/down in chat to help improve results
    • The system learns from your feedback
  4. Check Vectorization:

    • Go to Settings → Object Management / File Management
    • Ensure relevant content has been vectorized
    • Check stats to see vectorization progress

Configuration

For Users

Adjust Search Preferences:

  1. Open AI Chat
  2. Click Settings button
  3. Configure:
    • Search Mode: Hybrid / Semantic / Keyword
    • Number of Sources: How many documents to retrieve (1-10)
    • Search in files: Include/exclude file content
    • Search in objects: Include/exclude structured data

For Administrators

Enable Semantic Search:

  1. Go to Settings → Administration → OpenRegister

  2. Configure LLM Settings:

    • Choose embedding provider (OpenAI or Ollama)
    • Enter API key (if using OpenAI)
    • Select embedding model
  3. Vectorize Content:

    • Go to Object Management:
      • Enable vectorization
      • Select schemas to vectorize
      • Click "Start Bulk Vectorization"
    • Go to File Management:
      • Enable vectorization
      • Configure chunking strategy
      • Enable file types to process
  4. Monitor Progress:

    • Check stats in Object/File Management dialogs
    • View vectorization progress
    • Monitor vector database size

Common Use Cases

Query: "Show me projects related to renewable energy"

Results: Projects mentioning:

  • Solar power, wind energy, clean energy
  • Sustainability initiatives
  • Carbon reduction programs
  • Green infrastructure

2. Budget Analysis

Query: "What files discuss budget increases?"

Results: Files containing:

  • Financial growth, increased funding
  • Cost escalation, expanded budget
  • Resource allocation changes
  • Funding enhancements

3. Customer Feedback

Query: "Find complaints about delivery times"

Results: Documents about:

  • Shipping delays, late arrivals
  • Delivery issues, transport problems
  • Customer dissatisfaction with speed
  • Logistics challenges

4. Policy Research

Query: "What are our data privacy policies?"

Results: Documents covering:

  • GDPR compliance, data protection
  • Privacy guidelines, security measures
  • Information handling procedures
  • Confidentiality agreements

Troubleshooting

"No results found"

Possible causes:

  • Content hasn't been vectorized yet
  • Query is too specific or uses uncommon terminology
  • File types aren't enabled for processing

Solutions:

  • Check vectorization status in Object/File Management
  • Try rephrasing your query
  • Use hybrid mode for broader coverage
  • Ensure relevant file types are enabled

"Results seem irrelevant"

Possible causes:

  • Insufficient vectorized content
  • Query is too broad or ambiguous
  • Similarity threshold may be too low

Solutions:

  • Vectorize more content
  • Make your query more specific
  • Use keyword search for exact matches
  • Provide feedback (thumbs down) to help improve

"Search is slow"

Possible causes:

  • Large vector database
  • High number of sources requested
  • OpenAI API latency

Solutions:

  • Reduce number of sources in settings
  • Use keyword mode for faster results
  • Consider using local Ollama for embeddings
  • Contact administrator about database optimization

Privacy & Security

Your Data

  • Vectors are generated from your content and stored in your own database
  • When OpenAI is the embedding provider, text excerpts are sent to its API only to generate embeddings
  • With local Ollama embeddings, content never leaves your infrastructure

API Keys

  • OpenAI API keys are stored in the server-side configuration
  • Keys are used only for generating embeddings and chat responses
  • Administrators can switch to local Ollama for complete privacy

Search Privacy

  • Search queries are processed locally
  • Chat history is stored per-user
  • Conversation history is private to each user
  • Admins cannot see your chat conversations

Tips for Power Users

Combining Search Techniques

Use hybrid search with specific terms:

  • "sustainable development Amsterdam" finds both exact mentions of Amsterdam AND conceptually related sustainability content

Understanding Vector Dimensions

  • Larger models (e.g., text-embedding-3-large, 3072 dimensions) → More nuanced understanding
  • Smaller models (e.g., text-embedding-3-small, 1536 dimensions) → Faster, less storage
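
The storage trade-off is easy to estimate. Assuming vectors are stored as 32-bit floats (an assumption; your vector store may quantize or compress them), the raw size is dimensions × 4 bytes per item:

# Back-of-the-envelope storage estimate, assuming 4 bytes (float32) per dimension.
def vector_storage_mb(num_items: int, dimensions: int, bytes_per_value: int = 4) -> float:
    return num_items * dimensions * bytes_per_value / (1024 ** 2)

print(vector_storage_mb(10_000, 1536))  # ~58.6 MB for 10,000 small-model vectors
print(vector_storage_mb(10_000, 3072))  # ~117.2 MB for 10,000 large-model vectors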

Optimizing for Your Domain

  • Vectorize domain-specific documents first
  • Use consistent terminology in your organization
  • Provide feedback to improve relevance

Monitoring Costs

  • Each OpenAI embedding API call has a small cost
  • Bulk vectorization is more cost-effective than vectorizing items one at a time
  • Consider Ollama for free, local embeddings (slower but private)
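
For budgeting, you can estimate embedding spend from your expected token volume. The sketch below parameterizes the price per million tokens because rates change; the $0.02 figure in the example is a placeholder, not a quoted rate.

# Rough embedding-cost estimate; the price per million tokens is a placeholder.
def embedding_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    return total_tokens / 1_000_000 * usd_per_million_tokens

# e.g. 5,000 documents averaging 800 tokens each, at an assumed $0.02 per 1M tokens
print(round(embedding_cost_usd(5_000 * 800, 0.02), 2))  # 0.08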

FAQs

Q: How is semantic search different from regular search?
A: Regular search matches words. Semantic search understands meaning and finds conceptually similar content.

Q: Do I need to change how I search?
A: No! Just use natural language. The system handles the complexity.

Q: Can I still use exact phrase matching?
A: Yes! Use keyword mode or hybrid mode (which includes exact matching).

Q: How long does vectorization take?
A: Depends on content volume. Typically 100-500 items per minute.

Q: Does this work with all file types?
A: We support 15+ formats including PDF, Word, Excel, images (with OCR), and more.

Q: Is my data secure?
A: Yes. When OpenAI is the embedding provider, text excerpts are sent to its API only to generate embeddings, and the resulting vectors are stored on your own server. Use Ollama for complete on-premises processing.

Q: Can I search across different languages?
A: Yes! Multilingual embedding models understand multiple languages.

Q: How do I know if results are accurate?
A: Check similarity scores and source citations. Use feedback buttons to improve results.

Getting Help

  • Documentation: See /docs/ for technical details
  • Administrator: Contact your OpenRegister administrator
  • Feedback: Use thumbs up/down in chat to report issues
  • Support: Visit Conduction for enterprise support

Last updated: October 13, 2025