Text Extraction Sources: Files vs Objects

OpenRegister processes content from two distinct sources, both of which are converted into chunks for search and analysis.

Processing Paths Overview

📄 Source 1: Files

Description

Files (documents, images, spreadsheets, etc.) are processed through text extraction engines to convert binary content into searchable text.

Supported File Types

Category        Formats                     Extraction Method
Documents       PDF, DOCX, DOC, ODT, RTF    LLPhant or Dolphin
Spreadsheets    XLSX, XLS, CSV              LLPhant or Dolphin
Presentations   PPTX                        LLPhant or Dolphin
Text Files      TXT, MD, HTML, JSON, XML    LLPhant (native)
Images          JPG, PNG, GIF, WebP, TIFF   Dolphin (OCR only)
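
The table above implies a simple routing rule. The following is a minimal, illustrative sketch of such a dispatcher; the function name and exact MIME lists are assumptions for illustration, not the actual OpenRegister code:

# Illustrative only: route a MIME type to an extraction engine,
# mirroring the table above.
TEXT_NATIVE = {'text/plain', 'text/markdown', 'text/html',
               'application/json', 'application/xml'}
IMAGE_TYPES = {'image/jpeg', 'image/png', 'image/gif',
               'image/webp', 'image/tiff'}

def choose_extractor(mime_type: str, prefer_dolphin: bool = False) -> str:
    """Pick an extraction engine for a file."""
    if mime_type in IMAGE_TYPES:
        return 'dolphin'              # images always need OCR
    if mime_type in TEXT_NATIVE:
        return 'llphant'              # native text, no OCR required
    # Documents, spreadsheets, presentations: either engine works,
    # so the configured preference decides.
    return 'dolphin' if prefer_dolphin else 'llphant'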

File Processing Flow

The file processing flow varies based on the extraction mode configured. Each mode provides different timing and processing characteristics:

Extraction Modes Overview

1. Immediate Mode - Synchronous Extraction

Characteristics:

  • Direct Link: File upload and parsing logic are directly connected
  • Synchronous: Processing happens during the upload request
  • User Experience: User waits for extraction to complete
  • Use Case: When immediate text availability is critical
  • Performance: May slow down file uploads for large files

2. Background Job Mode - Delayed Extraction

Characteristics:

  • Delayed Action: Extraction happens after upload completes
  • Asynchronous: Processing runs on the background job queue, non-blocking
  • User Experience: Upload completes immediately
  • Use Case: Recommended for most scenarios (best performance)
  • Performance: No impact on upload speed

3. Cron Job Mode - Periodic Batch Processing

Characteristics:

  • Repeating Action: Periodic batch processing via scheduled jobs
  • Batch Processing: Multiple files processed together
  • User Experience: Upload completes immediately, extraction happens later
  • Use Case: When you want to control processing load and timing
  • Performance: Efficient batch processing, predictable load

4. Manual Only Mode - User-Triggered Processing

Characteristics:

  • Manual Trigger: Only processes when user explicitly triggers
  • User Control: Complete control over when extraction happens
  • Use Case: Selective processing, testing, or resource-constrained environments
  • Performance: No automatic processing overhead
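
To make the four modes concrete, here is a hypothetical dispatcher sketch; store(), extract_text(), and the two queues are stand-ins for real components, not OpenRegister's actual API:

from queue import Queue

job_queue: Queue = Queue()     # drained continuously by background workers
cron_queue: Queue = Queue()    # drained in batches on a schedule

def store(file_id: str) -> None:
    print(f'stored {file_id}')          # placeholder for persisting the upload

def extract_text(file_id: str) -> None:
    print(f'extracted {file_id}')       # placeholder for the extraction engine

def handle_upload(file_id: str, mode: str) -> None:
    store(file_id)                      # the upload itself always completes
    if mode == 'immediate':
        extract_text(file_id)           # blocks the upload request
    elif mode == 'background_job':
        job_queue.put(file_id)          # picked up seconds later
    elif mode == 'cron_job':
        cron_queue.put(file_id)         # processed in the next scheduled batch
    elif mode == 'manual_only':
        pass                            # waits for an explicit user trigger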

Detailed File Processing Flow

File Metadata Preserved

When files are processed, the following metadata is maintained:

  • Source Reference: Original file ID from Nextcloud
  • File Path: Location in Nextcloud filesystem
  • MIME Type: File format information
  • File Size: Original file size in bytes
  • Checksum: For change detection
  • Extraction Method: Which engine was used (LLPhant or Dolphin)
  • Extraction Timestamp: When text was extracted

Example: PDF Processing

Input: contract-2024.pdf (245 KB, 15 pages)

Step 1: Text Extraction
- Engine: Dolphin AI
- Time: 8.2 seconds
- Output: 12,450 characters of text

Step 2: Chunking
- Strategy: Recursive (respects paragraphs)
- Chunks created: 14
- Average chunk size: 889 characters
- Overlap: 200 characters

Step 3: Storage
- FileText entity created
- Chunks stored in chunks_json field
- Status: completed

📦 Source 2: Objects

Description

OpenRegister objects (structured data entities) are converted into text blobs by concatenating their property values. This enables full-text search across structured data.

Object-to-Text Conversion

Objects are transformed using the following rules (a minimal code sketch follows the list):

  1. Simple Properties: Direct value extraction

    { "name": "John Doe", "age": 35 }
    → 'name: John Doe age: 35'
  2. Arrays: Join with separators

    { "tags": ["urgent", "customer", "support"] }
    → 'tags: urgent, customer, support'
  3. Nested Objects: Flatten with dot notation

    { "address": { "city": "Amsterdam", "country": "NL" } }
    → 'address.city: Amsterdam address.country: NL'
  4. Special Handling: Exclude system fields

    • Ignore: id, uuid, created, updated
    • Include: User-defined properties only
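
Putting the four rules together, a flattening function might look like the following sketch (illustrative, not the actual OpenRegister implementation):

SYSTEM_FIELDS = {'id', 'uuid', 'created', 'updated'}

def object_to_text(obj: dict, prefix: str = '') -> str:
    """Flatten an object into a text blob per rules 1-4 above."""
    parts = []
    for key, value in obj.items():
        if not prefix and key in SYSTEM_FIELDS:
            continue                                     # rule 4: skip system fields
        path = f'{prefix}.{key}' if prefix else key
        if isinstance(value, dict):
            parts.append(object_to_text(value, path))    # rule 3: dot notation
        elif isinstance(value, list):
            parts.append(f'{path}: ' + ', '.join(map(str, value)))  # rule 2
        else:
            parts.append(f'{path}: {value}')             # rule 1
    return ' '.join(parts)

# object_to_text({'name': 'John Doe', 'age': 35}) → 'name: John Doe age: 35'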

Object Processing Flow

Object Metadata Preserved

When objects are processed, the following metadata is maintained:

  • Object ID: Reference to original object
  • Schema: Schema definition for context
  • Register: Register containing the object
  • Property Map: Which chunk contains which properties
  • Extraction Timestamp: When text blob was created

Example: Contact Object Processing

Input Object (Contact Schema):
{
  "id": 12345,
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "firstName": "Jane",
  "lastName": "Smith",
  "email": "jane.smith@example.com",
  "phone": "+31612345678",
  "company": {
    "name": "Acme Corp",
    "industry": "Technology"
  },
  "tags": ["vip", "partner"],
  "notes": "Important client, prefers email communication"
}

Step 1: Text Blob Creation
→ 'firstName: Jane lastName: Smith email: jane.smith@example.com
phone: +31612345678 company.name: Acme Corp
company.industry: Technology tags: vip, partner
notes: Important client, prefers email communication'

Step 2: Chunking
- Strategy: Fixed size (short enough for single chunk)
- Chunks created: 1
- Chunk size: 215 characters

Step 3: Storage
- ObjectText entity created
- Chunk stored with property mapping
- Status: completed

Common Chunking Process

Both files and objects converge at the chunking stage, where text is divided into manageable pieces.

Chunking Strategies


1. Recursive Splitting

Smart splitting that respects natural text boundaries:

Priority Order:
1. Paragraph breaks (\n\n)
2. Sentence endings (. ! ?)
3. Line breaks (\n)
4. Word boundaries (spaces)
5. Character split (fallback)

Best for: Natural language documents, articles, reports

2. Fixed Size Splitting

Mechanical splitting with overlap:

Settings:
- Chunk size: 1000 characters
- Overlap: 200 characters
- Minimum chunk: 100 characters

Best for: Structured data, code, logs
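
As an illustration of the fixed-size strategy with the settings above, here is a minimal sketch (assumed behavior; the real implementation may handle edge cases differently):

def fixed_size_chunks(text: str, size: int = 1000,
                      overlap: int = 200, minimum: int = 100) -> list[str]:
    """Split text into overlapping windows of `size` characters."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        # A tail shorter than `minimum` is already contained in the
        # previous chunk's overlap, so it can be dropped safely.
        if len(chunk) >= minimum or not chunks:
            chunks.append(chunk)
    return chunks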

Chunk Structure

Each chunk contains:

{
  "text": "The actual chunk content...",
  "start_offset": 0,
  "end_offset": 1000,
  "source_type": "file",
  "source_id": 12345,
  "language": "en",
  "language_level": "B2"
}

Enhancement Pipeline

After chunking, content can undergo optional enhancements:

1. Text Search Indexing (Solr)

Purpose: Fast keyword and phrase search across all content

Performance: ~50-200ms per query

Use Cases: Search box, filters, reporting

2. Vector Embeddings (RAG)

Purpose: Semantic search and AI context retrieval

Performance: ~200-500ms per chunk (one-time), ~100-300ms per query

Use Cases: AI chat, related content, recommendations
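
Conceptually, each chunk is embedded once into a vector and queries are answered by nearest-neighbor ranking. A minimal sketch of the ranking step, assuming embeddings have already been produced by whichever model is configured (embed() itself is not shown):

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_chunks(query_vec: np.ndarray, chunk_vecs: dict[int, np.ndarray],
                top_k: int = 5) -> list[tuple[int, float]]:
    """Return the top_k chunk ids ranked by cosine similarity to the query."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]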

3. Entity Extraction (GDPR)

Purpose: GDPR compliance, PII tracking, data subject access requests

Performance: ~100-2000ms per chunk (depending on method)

Use Cases: Compliance audits, right to erasure, data mapping
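
As a trivial illustration of the idea, the sketch below uses a regular expression to locate email addresses and their exact character offsets within a chunk; production entity extraction is typically model-based and covers many more PII types:

import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def find_emails(chunk_text: str) -> list[dict]:
    """Locate email addresses and their character offsets in a chunk."""
    return [{'entity': m.group(), 'start': m.start(), 'end': m.end()}
            for m in EMAIL_RE.finditer(chunk_text)]

# find_emails('Contact jane.smith@example.com for details')
# → [{'entity': 'jane.smith@example.com', 'start': 8, 'end': 30}]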

4. Language Detection

Purpose: Multi-language support, content filtering, translation routing

Performance: ~10-50ms per chunk (local) or ~100-200ms (API)

Use Cases: Language filters, translation, localization
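
A minimal sketch of per-chunk tagging, using the langdetect package as a stand-in for whichever detector is configured:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0          # make detection deterministic

def tag_language(chunk: dict) -> dict:
    """Attach an ISO 639-1 language code to a chunk."""
    chunk['language'] = detect(chunk['text'])
    return chunk

# tag_language({'text': 'Dit is een Nederlandse zin.'})['language'] → 'nl'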

5. Language Level Assessment

Purpose: Accessibility compliance, content simplification, readability scoring

Performance: ~20-100ms per chunk

Use Cases: Plain language compliance, educational leveling, accessibility
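
One possible approach, sketched with the textstat package and an illustrative grade-to-CEFR mapping; OpenRegister's actual scoring method is not specified here, so treat both the library choice and the thresholds as assumptions:

import textstat

def language_level(text: str) -> str:
    """Map a readability grade to a rough CEFR band (thresholds illustrative)."""
    grade = textstat.flesch_kincaid_grade(text)
    if grade < 5:
        return 'A2'
    if grade < 8:
        return 'B1'
    if grade < 11:
        return 'B2'
    if grade < 14:
        return 'C1'
    return 'C2'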

Comparison: Files vs Objects

Aspect             Files                              Objects
Input Format       Binary (PDF, DOCX, images)         Structured JSON data
Extraction         Text extraction engines required   Property value concatenation
Processing Time    Slow (2-60 seconds)                Fast (<1 second)
Complexity         High (OCR, parsing)                Low (string operations)
Chunk Count        Many (10-1000+)                    Few (1-10)
Update Frequency   Rare (files are static)            Common (objects change often)
Best For           Documents, reports, images         Structured records, metadata
GDPR Risk          High (unstructured PII)            Medium (known data structure)
Search Precision   Lower (natural language)           Higher (structured fields)
Context            Full document context              Property-level context

Combined Use Cases

Use Case 1: Customer Management

Object: Customer record
- Name, email, phone, notes
→ Chunked for search

File: Contract PDF attached to customer
- Terms, signatures, dates
→ Extracted and chunked

Search: 'payment terms for Acme Corp'
→ Finds chunks from both object and file
→ Returns unified results

Use Case 2: GDPR Data Subject Access Request

Request: 'Find all mentions of john.doe@example.com'

Step 1: Entity extraction finds email in:
- 15 chunks from 8 PDF files
- 3 chunks from 2 customer objects
- 12 chunks from 42 email messages

Step 2: Generate report with:
- All files containing email
- All objects referencing person
- All email conversations
- Exact positions in each source

Step 3: Provide data or anonymize on request

Use Case 3: Multi-Language Knowledge Base

Content Sources:
- Files: User manuals (EN, NL, DE)
- Objects: FAQ entries (EN, NL)
- Emails: Support conversations (mixed)

Processing:
1. All sources → Chunks
2. Language detection → Tag each chunk
3. Vector embeddings → Enable semantic search

User Search (in Dutch):
→ System filters to NL chunks
→ Semantic search across files + objects + emails
→ Returns relevant content in user's language

Configuration

Enabling File Processing

Settings → OpenRegister → File Configuration

Extract Text From: [All Files / Specific Folders / Object Files]
Text Extractor: [LLPhant / Dolphin]
Extraction Mode: [Immediate / Background Job / Cron Job / Manual Only]
Chunking Strategy: [Recursive / Fixed Size]

Extraction Mode Selection Guide

Immediate Mode:

  • ✅ Use when: Text must be available immediately after upload
  • ✅ Best for: Small files, critical workflows, real-time search requirements
  • ⚠️ Consider: May slow down uploads for large files
  • 📊 Performance: Synchronous processing during upload

Background Job Mode (Recommended):

  • ✅ Use when: You want fast uploads with async processing
  • ✅ Best for: Most production scenarios, large files, high-volume uploads
  • ⚠️ Consider: Text may not be immediately available (typically seconds to minutes delay)
  • 📊 Performance: Non-blocking, optimal for user experience

Cron Job Mode:

  • ✅ Use when: You want to control processing load and timing
  • ✅ Best for: Batch processing, predictable resource usage, scheduled maintenance windows
  • ⚠️ Consider: Text extraction happens at scheduled intervals (default: every 15 minutes)
  • 📊 Performance: Efficient batch processing, predictable system load

Manual Only Mode:

  • ✅ Use when: You want complete control over when extraction happens
  • ✅ Best for: Testing, selective processing, resource-constrained environments
  • ⚠️ Consider: Requires manual intervention to trigger extraction
  • 📊 Performance: No automatic processing overhead

Enabling Object Processing

Settings → OpenRegister → Text Analysis

Enable Object Text Extraction: [Yes / No]
Include Properties: [Select which properties to extract]
Chunking Strategy: [Recursive / Fixed Size]

Enabling Enhancements

Settings → OpenRegister → Text Analysis

☑ Text Search Indexing (Solr)
☑ Vector Embeddings (RAG)
☑ Entity Extraction (GDPR)
☑ Language Detection
☑ Language Level Assessment

Performance Recommendations

For File-Heavy Workloads

  • Use Background Job or Cron Job mode for optimal performance
  • Enable Dolphin for images/complex PDFs
  • Use recursive chunking for better quality
  • Enable selective enhancements (not all at once)
  • Configure appropriate batch sizes for cron mode

For Object-Heavy Workloads

  • Use immediate processing (objects are small)
  • Enable fixed-size chunking (faster)
  • Always enable language detection (fast on short text)
  • Enable entity extraction for compliance

For Mixed Workloads

  • Background processing for files
  • Immediate processing for objects
  • Use recursive chunking for both
  • Enable all enhancements selectively per schema

API Examples

Search Across Both Sources

GET /api/search?q=contract%20terms&sources=files,objects

Response:

{
  "results": [
    {
      "source_type": "file",
      "source_id": 12345,
      "file_name": "contract-2024.pdf",
      "chunk_index": 3,
      "text": "...payment terms are net 30...",
      "score": 0.95
    },
    {
      "source_type": "object",
      "source_id": 67890,
      "schema": "customers",
      "property": "notes",
      "text": "...special contract terms agreed...",
      "score": 0.87
    }
  ]
}

Get All Chunks for a File

GET /api/files/12345/chunks

Get All Chunks for an Object

GET /api/objects/67890/chunks
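
For completeness, a minimal client-side sketch of calling these endpoints with Python's requests library; the base URL and credentials are placeholders:

import requests

BASE = 'https://nextcloud.example.com/apps/openregister'    # placeholder

def search(query: str, sources: str = 'files,objects') -> dict:
    resp = requests.get(f'{BASE}/api/search',
                        params={'q': query, 'sources': sources},
                        auth=('user', 'app-password'))       # placeholder auth
    resp.raise_for_status()
    return resp.json()

for hit in search('contract terms')['results']:
    print(hit['source_type'], hit['score'], hit['text'][:60])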

Conclusion

OpenRegister's dual-source text extraction system provides:

  • Comprehensive Coverage: Search across files AND structured data
  • Unified Processing: Same chunking and enhancement pipeline
  • Flexible Configuration: Enable features per source type
  • GDPR Compliance: Track entities from all sources
  • Intelligent Search: Semantic and keyword search across everything

By processing both files and objects into a common chunk format, OpenRegister creates a truly unified content search and analysis platform.

