Text Extraction Technical Documentation
📚 Feature Documentation: See Text Extraction, Vectorization & Named Entity Recognition for user-facing documentation and overview.
Overview
OpenRegister's Text Extraction Service converts content from various sources (files, objects, emails, calendar items) into searchable text chunks. The service uses a handler-based architecture to support multiple source types and extraction methods.
Architecture
Handler-Based Design
Location: lib/Service/TextExtraction/
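Each source type is served by its own handler class in this directory. The handlers described in this document (FileHandler, ObjectHandler) expose the same three methods used by TextExtractionService; a plausible shared contract is sketched below (the interface name is an assumption, not existing code):

```php
<?php
// Hypothetical handler contract. The method signatures match the calls made
// by TextExtractionService::extractSourceText(); the interface name is assumed.
interface TextExtractionHandlerInterface
{
    /** Extract plain text from the source identified by $sourceId. */
    public function extract(int $sourceId, array $sourceMeta): string;

    /** Owner (user ID) of the source, or null if unknown. */
    public function getOwner(int $sourceId, array $sourceMeta): ?string;

    /** Organisation the source belongs to, or null if unknown. */
    public function getOrganisation(int $sourceId, array $sourceMeta): ?string;
}
```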
Service Flow
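At a high level the service chains three calls: extract the source text, split it into chunks, and persist the chunks. A minimal sketch using the TextExtractionService methods documented below (variable names and wiring are illustrative):

```php
// Illustrative flow only; error handling and dependency injection omitted.
$payload = $textExtractionService->extractSourceText('file', $fileId, $sourceMeta);

$chunks = $textExtractionService->textToChunks($payload, [
    'strategy' => 'RECURSIVE_CHARACTER',
]);

$textExtractionService->persistChunksForSource(
    'file',
    $fileId,
    $chunks,
    $payload['owner'],
    $payload['organisation'],
    time(),
    $payload
);
```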
Database Schema
Chunk Entity
Table: oc_openregister_chunks
CREATE TABLE oc_openregister_chunks (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
uuid VARCHAR(255) NOT NULL UNIQUE,
source_type VARCHAR(50) NOT NULL,
source_id BIGINT NOT NULL,
text_content TEXT NOT NULL,
start_offset INT NOT NULL,
end_offset INT NOT NULL,
chunk_index INT NOT NULL,
checksum VARCHAR(64),
language VARCHAR(10),
language_level VARCHAR(20),
language_confidence DECIMAL(3,2),
detection_method VARCHAR(50),
indexed_in_solr BOOLEAN NOT NULL DEFAULT FALSE,
vectorized BOOLEAN NOT NULL DEFAULT FALSE,
owner VARCHAR(255),
organisation VARCHAR(255),
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_source (source_type, source_id),
INDEX idx_checksum (checksum),
INDEX idx_language (language),
INDEX idx_owner (owner),
INDEX idx_organisation (organisation)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Key Fields:
- source_type: 'file', 'object', 'email', 'calendar'
- source_id: ID of the source entity
- checksum: SHA256 hash of the source text for change detection
- text_content: The actual chunk text
- start_offset / end_offset: Character positions in the original text
- chunk_index: Sequential chunk number (0-based)
Entity Class
Location: lib/Db/Chunk.php
class Chunk extends Entity implements JsonSerializable
{
protected ?string $uuid = null;
protected ?string $sourceType = null;
protected ?int $sourceId = null;
protected ?string $textContent = null;
protected int $startOffset = 0;
protected int $endOffset = 0;
protected int $chunkIndex = 0;
protected ?string $checksum = null;
protected ?string $language = null;
protected ?string $languageLevel = null;
protected ?float $languageConfidence = null;
protected ?string $detectionMethod = null;
protected bool $indexedInSolr = false;
protected bool $vectorized = false;
protected ?string $owner = null;
protected ?string $organisation = null;
protected ?DateTime $createdAt = null;
protected ?DateTime $updatedAt = null;
}
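For reference, chunks for a given source can be read back through ChunkMapper (its findBySource method is the one used by the service layer below); the getters follow the entity fields above, and the variable names are illustrative:

```php
// Fetch all chunks for a file and print a short summary of each.
$chunks = $chunkMapper->findBySource('file', $fileId);

foreach ($chunks as $chunk) {
    printf(
        "#%d [%d-%d] %s\n",
        $chunk->getChunkIndex(),
        $chunk->getStartOffset(),
        $chunk->getEndOffset(),
        substr($chunk->getTextContent(), 0, 80)
    );
}
```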
Handlers
FileHandler
Location: lib/Service/TextExtraction/FileHandler.php
Supported Formats:
- Documents: PDF, DOCX, DOC, ODT, RTF
- Spreadsheets: XLSX, XLS, CSV
- Presentations: PPTX
- Text Files: TXT, MD, HTML, JSON, XML
- Images: JPG, PNG, GIF, WebP, TIFF (via OCR)
Extraction Methods:
- LLPhant: Local PHP-based extraction
- Dolphin: AI-powered extraction with OCR
- Native: Direct text reading for plain text files
Implementation:
public function extract(int $sourceId, array $sourceMeta): string
{
    // Resolve the file node from its ID.
    $nodes = $this->rootFolder->getById($sourceId);
    if (empty($nodes)) {
        throw new \OCP\Files\NotFoundException('File not found: ' . $sourceId);
    }
    $file = $nodes[0];

    // Extract based on MIME type.
    $mimeType = $sourceMeta['mimetype'] ?? 'unknown';

    // TODO: Implement actual extraction logic (dispatch on $mimeType)
    return $file->getContent();
}
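The TODO above is where format dispatch belongs. A sketch of what that dispatch could look like; extractPdf(), extractDocx() and extractWithOcr() are illustrative helper names, not existing methods:

```php
// Hypothetical MIME-type dispatch inside FileHandler.
private function extractByMimeType(\OCP\Files\File $file, string $mimeType): string
{
    return match (true) {
        $mimeType === 'application/pdf' => $this->extractPdf($file),
        $mimeType === 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
            => $this->extractDocx($file),
        str_starts_with($mimeType, 'image/') => $this->extractWithOcr($file),
        str_starts_with($mimeType, 'text/') => $file->getContent(),
        default => $file->getContent(),
    };
}
```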
ObjectHandler
Location: lib/Service/TextExtraction/ObjectHandler.php
Process:
- Load object from database
- Extract schema and register information
- Flatten nested object structures
- Concatenate property values with context
- Add metadata (UUID, version, organization)
Implementation:
public function extract(int $sourceId, array $sourceMeta): string
{
$object = $this->objectMapper->find($sourceId);
return $this->convertObjectToText($object);
}
private function convertObjectToText(ObjectEntity $object): string
{
$textParts = [];
$textParts[] = "Object ID: " . $object->getUuid();
// Add schema info
if ($object->getSchema() !== null) {
$schema = $this->schemaMapper->find($object->getSchema());
$textParts[] = "Type: " . ($schema->getTitle() ?? $schema->getName());
}
// Extract object data
$objectData = $object->getObject();
if (is_array($objectData)) {
$textParts[] = "Content: " . $this->extractTextFromArray($objectData);
}
return implode("\n", $textParts);
}
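convertObjectToText() relies on extractTextFromArray(), which is not shown above. A plausible recursive implementation (an assumption about the actual code) flattens nested data into "key: value" lines:

```php
// Hypothetical: flatten nested object data into searchable "key: value" lines.
private function extractTextFromArray(array $data, string $prefix = ''): string
{
    $parts = [];

    foreach ($data as $key => $value) {
        $label = $prefix === '' ? (string) $key : $prefix . '.' . $key;

        if (is_array($value)) {
            $parts[] = $this->extractTextFromArray($value, $label);
        } elseif (is_scalar($value) && $value !== '') {
            $parts[] = $label . ': ' . $value;
        }
    }

    return implode("\n", $parts);
}
```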
Chunking Strategies
Recursive Character Splitting (Recommended)
Priority Order:
- Paragraph breaks (\n\n)
- Sentence endings (. ! ?)
- Line breaks (\n)
- Commas and semicolons
- Word boundaries (spaces)
- Character split (fallback)
Best for: Natural language documents, articles, reports
Configuration:
$chunks = $textExtractionService->textToChunks(['text' => $text], [
    'chunk_size' => 1000,
    'chunk_overlap' => 200,
    'strategy' => 'RECURSIVE_CHARACTER'
]);
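To illustrate the strategy (this is not the service's actual chunkRecursive() code): split on the highest-priority separator first and only fall back to finer separators for pieces that are still too long. The simplified sketch below skips the re-merging of small pieces and the overlap handling that the real implementation needs:

```php
// Simplified recursive character splitter, for illustration only.
function splitRecursive(string $text, int $chunkSize, array $separators = ["\n\n", "\n", '. ', ' ', '']): array
{
    if (strlen($text) <= $chunkSize) {
        return [$text];
    }

    $separator = array_shift($separators);
    if ($separator === null || $separator === '') {
        // Last resort: hard character split.
        return str_split($text, $chunkSize);
    }

    $chunks = [];
    foreach (explode($separator, $text) as $piece) {
        if (strlen($piece) > $chunkSize) {
            // Piece still too long: recurse with the next (finer) separator.
            $chunks = array_merge($chunks, splitRecursive($piece, $chunkSize, $separators));
        } elseif ($piece !== '') {
            $chunks[] = $piece;
        }
    }

    return $chunks;
}
```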
Fixed Size Splitting
Settings:
- Chunk size: 1000 characters (default)
- Overlap: 200 characters (default)
- Minimum chunk: 100 characters
Best for: Structured data, code, logs
Configuration:
$chunks = $textExtractionService->textToChunks(['text' => $text], [
    'chunk_size' => 1000,
    'chunk_overlap' => 200,
    'strategy' => 'FIXED_SIZE'
]);
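With the defaults (1000-character chunks, 200-character overlap), each new chunk starts 800 characters after the previous one. A minimal sketch of the idea (not the service's actual chunkFixedSize() code):

```php
// Simplified fixed-size splitter with overlap, for illustration only.
function splitFixedSize(string $text, int $chunkSize = 1000, int $overlap = 200): array
{
    $step = max(1, $chunkSize - $overlap); // 1000 - 200 = 800 characters per step
    $chunks = [];

    for ($offset = 0; $offset < strlen($text); $offset += $step) {
        $chunk = substr($text, $offset, $chunkSize);

        // Honour the 100-character minimum, but always keep the first chunk.
        if ($offset === 0 || strlen($chunk) >= 100) {
            $chunks[] = $chunk;
        }
    }

    return $chunks;
}
```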
Supported File Formats
Text Documents
| Format | Max Size | Processing Time | Notes |
|---|---|---|---|
| .txt | 100MB | < 1s | UTF-8, ISO-8859-1, Windows-1252 |
| .md | 50MB | < 1s | Preserves structure |
| .html | 20MB | 1-3s | Strips scripts/styles |
PDF Documents
| Type | Max Size | Processing Time | Libraries |
|---|---|---|---|
| Text PDF | 100MB | 2-10s | Smalot PdfParser, pdftotext |
| Scanned PDF (OCR) | 50MB | 10-60s | Tesseract OCR |
Requirements for OCR:
# Install Tesseract
sudo apt-get install tesseract-ocr
# Add language packs as needed (Dutch and German shown here)
sudo apt-get install tesseract-ocr-nld tesseract-ocr-deu
Microsoft Office
| Format | Max Size | Processing Time | Libraries |
|---|---|---|---|
| .docx | 50MB | 2-5s | PhpOffice/PhpWord |
| .xlsx | 30MB | 3-10s | PhpOffice/PhpSpreadsheet |
| .pptx | 50MB | 2-5s | ZipArchive + XML |
Images (OCR)
| Format | Max Size | Processing Time | Requirements |
|---|---|---|---|
| JPG, PNG, GIF, BMP, TIFF | 20MB | 5-15s/page | Tesseract OCR |
Best Practices:
- Use high-resolution scans (300 DPI ideal)
- Ensure text is legible and not skewed
- Black text on white background works best
Data Formats
| Format | Max Size | Processing Time | Notes |
|---|---|---|---|
| .json | 20MB | 1-2s | Recursive extraction |
| .xml | 20MB | 1-3s | Tag names and content |
Service Implementation
TextExtractionService
Location: lib/Service/TextExtractionService.php
Key Methods:
/**
* Extract text from a source using appropriate handler.
*/
public function extractSourceText(
string $sourceType,
int $sourceId,
array $sourceMeta
): array {
$handler = $this->getHandler($sourceType);
$text = $handler->extract($sourceId, $sourceMeta);
$checksum = hash('sha256', $text);
return [
'source_type' => $sourceType,
'source_id' => $sourceId,
'text' => $text,
'checksum' => $checksum,
'owner' => $handler->getOwner($sourceId, $sourceMeta),
'organisation' => $handler->getOrganisation($sourceId, $sourceMeta),
];
}
/**
* Split text into chunks.
*/
public function textToChunks(array $payload, array $options = []): array {
$chunkSize = $options['chunk_size'] ?? self::DEFAULT_CHUNK_SIZE;
$chunkOverlap = $options['chunk_overlap'] ?? self::DEFAULT_CHUNK_OVERLAP;
$strategy = $options['strategy'] ?? self::RECURSIVE_CHARACTER;
// Apply chunking strategy
$chunks = match($strategy) {
self::FIXED_SIZE => $this->chunkFixedSize($payload['text'], $chunkSize, $chunkOverlap),
self::RECURSIVE_CHARACTER => $this->chunkRecursive($payload['text'], $chunkSize, $chunkOverlap),
default => $this->chunkRecursive($payload['text'], $chunkSize, $chunkOverlap)
};
// Map to chunk entities with checksum
return array_map(function($index, $chunkText) use ($payload, $chunkSize) {
return [
'text_content' => $chunkText,
'chunk_index' => $index,
'start_offset' => $index * $chunkSize,
'end_offset' => ($index * $chunkSize) + strlen($chunkText),
'checksum' => $payload['checksum'] ?? null,
];
}, array_keys($chunks), $chunks);
}
/**
* Persist chunks to database.
*/
public function persistChunksForSource(
string $sourceType,
int $sourceId,
array $chunks,
?string $owner,
?string $organisation,
int $sourceTimestamp,
array $payload
): void {
// Delete existing chunks for this source
$this->chunkMapper->deleteBySource($sourceType, $sourceId);
// Create new chunks
foreach ($chunks as $chunkData) {
$chunk = new Chunk();
$chunk->setUuid(Uuid::v4()->toString());
$chunk->setSourceType($sourceType);
$chunk->setSourceId($sourceId);
$chunk->setTextContent($chunkData['text_content']);
$chunk->setStartOffset($chunkData['start_offset']);
$chunk->setEndOffset($chunkData['end_offset']);
$chunk->setChunkIndex($chunkData['chunk_index']);
$chunk->setChecksum($chunkData['checksum'] ?? null);
$chunk->setOwner($owner);
$chunk->setOrganisation($organisation);
$this->chunkMapper->insert($chunk);
}
}
Change Detection
The system uses SHA256 checksums to detect content changes:
// Calculate checksum from extracted text
$checksum = hash('sha256', $extractedText);
// Check if source is up-to-date
public function isSourceUpToDate(
int $sourceId,
string $sourceType,
int $sourceTimestamp,
bool $forceReExtract
): bool {
if ($forceReExtract === true) {
return false;
}
// Get existing chunks
$existingChunks = $this->chunkMapper->findBySource($sourceType, $sourceId);
if (empty($existingChunks)) {
return false;
}
// Check if checksum matches
$existingChecksum = $existingChunks[0]->getChecksum();
$currentChecksum = $this->calculateSourceChecksum($sourceId, $sourceType);
return $existingChecksum === $currentChecksum;
}
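calculateSourceChecksum() is not shown above. A hypothetical implementation would re-extract the source text through the matching handler and hash it, mirroring how extractSourceText() computes its checksum (this is an assumption about the actual code):

```php
// Hypothetical: recompute the checksum by re-extracting the source text.
private function calculateSourceChecksum(int $sourceId, string $sourceType): string
{
    $handler = $this->getHandler($sourceType);
    $text = $handler->extract($sourceId, []);

    return hash('sha256', $text);
}
```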
Benefits:
- Avoids unnecessary re-extraction
- Efficient change detection
- Automatic updates on content modification
Performance
Processing Times
| File Type | Size | Extraction Time | Chunking Time |
|---|---|---|---|
| Text (.txt) | < 1MB | < 1s | 50ms |
| PDF (text) | < 5MB | 1-3s | 100ms |
| PDF (OCR) | < 5MB | 10-60s | 100ms |
| DOCX | < 5MB | 1-2s | 100ms |
| XLSX | < 5MB | 2-5s | 150ms |
| Images (OCR) | < 5MB | 5-15s | 50ms |
Bulk Processing
- Batch Size: 100 items per batch (configurable)
- Parallel Processing: Supported via background jobs (see the sketch after this list)
- Progress Tracking: Real-time status updates
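A sketch of how bulk queuing could look, assuming $jobList is an injected OCP\BackgroundJob\IJobList and that FileTextExtractionJob (named elsewhere in this document) accepts a file ID argument; the argument key is an assumption:

```php
use OCP\BackgroundJob\IJobList;

// Queue extraction jobs in batches of 100 file IDs.
$batches = array_chunk($fileIds, 100);

foreach ($batches as $batch) {
    foreach ($batch as $fileId) {
        $jobList->add(FileTextExtractionJob::class, ['fileId' => $fileId]);
    }
}
```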
API Endpoints
Text Extraction
POST /api/files/{fileId}/extract
POST /api/objects/{objectId}/extract
GET /api/chunks?source_type=file&source_id={id}
GET /api/chunks/{chunkId}
Chunking
POST /api/chunks/chunk
Content-Type: application/json
{
"source_type": "file",
"source_id": 12345,
"options": {
"chunk_size": 1000,
"chunk_overlap": 200,
"strategy": "RECURSIVE_CHARACTER"
}
}
Error Handling
Common Errors
| Error | Cause | Solution |
|---|---|---|
| File too large | Exceeds format limit | Reduce size or increase limit |
| Format not supported | Unrecognized format | Enable format or convert file |
| Extraction failed | Corrupted file | Verify file integrity |
| OCR failed | Tesseract not installed | Install Tesseract OCR |
Recovery
- Failed extractions can be retried via API
- Error messages stored for debugging
- Automatic retry on content update
Processing Pipeline
Step-by-Step Flow
┌─────────────────────────────────────────────────────────┐
│ 1. File Upload │
│ - User uploads file to Nextcloud │
│ - File stored in data directory │
│ - Upload completes immediately (non-blocking) │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 2. Background Job Queued │
│ - FileChangeListener detects new/updated file │
│ - Queues FileTextExtractionJob asynchronously │
│ - User request completes without delay │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Background Processing (non-blocking) │
│ - Job runs in background (typically within seconds) │
│ - Check MIME type and validate format │
│ - Use format-specific extractor │
│ - Handle encoding issues │
│ - Clean/normalize text │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 4. Document Chunking │
│ - Apply selected strategy │
│ - Create overlapping chunks │
│ - Preserve metadata │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 5. Storage │
│ - Store chunks in database │
│ - Calculate and store checksum │
│ - Link chunks to source │
└─────────────────────────────────────────────────────────┘
Event Listener
Location: lib/Listener/FileChangeListener.php
Purpose: Automatically process files on upload/update.
Events:
- NodeCreatedEvent - File uploaded
- NodeWrittenEvent - File updated
Behavior:
- Checks if extraction is needed (checksum comparison)
- Triggers text extraction automatically via background job
- Updates extraction status
- Full error handling and logging
Registration: The listener is registered in Application.php via the service container.
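For reference, registering such a listener in a Nextcloud app's Application::register() typically looks like the following; the exact wiring and namespaces in OpenRegister's Application.php may differ:

```php
use OCA\OpenRegister\Listener\FileChangeListener;
use OCP\AppFramework\Bootstrap\IRegistrationContext;
use OCP\Files\Events\Node\NodeCreatedEvent;
use OCP\Files\Events\Node\NodeWrittenEvent;

public function register(IRegistrationContext $context): void
{
    // Listen for new and updated files so extraction can be queued.
    $context->registerEventListener(NodeCreatedEvent::class, FileChangeListener::class);
    $context->registerEventListener(NodeWrittenEvent::class, FileChangeListener::class);
}
```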
File Size Limits
Default Limits
| Format | Default Max Size | Configurable | Reason |
|---|---|---|---|
| Text/Markdown | 100MB | Yes | Memory-efficient |
| HTML | 20MB | Yes | DOM parsing overhead |
| PDF (text) | 100MB | Yes | Direct extraction |
| PDF (OCR) | 50MB | Yes | Processing intensive |
| Office Docs | 30-50MB | Yes | Library limitations |
| Images | 20MB | Yes | OCR memory usage |
| JSON/XML | 20MB | Yes | Parsing complexity |
Modifying Limits
Edit lib/Service/SolrFileService.php:
// File size limits (in bytes)
private const MAX_FILE_SIZE_TEXT = 104857600; // 100MB
private const MAX_FILE_SIZE_PDF = 104857600; // 100MB
private const MAX_FILE_SIZE_OFFICE = 52428800; // 50MB
private const MAX_FILE_SIZE_IMAGE = 20971520; // 20MB
Warning: Increasing limits may cause:
- Memory exhaustion
- Slow processing
- Timeouts on large files
Dependencies
PHP Libraries (Installed via Composer)
{
"smalot/pdfparser": "^2.0", // PDF text extraction
"phpoffice/phpword": "^1.0", // Word document processing
"phpoffice/phpspreadsheet": "^1.0" // Excel processing
}
System Commands
| Command | Purpose | Installation |
|---|---|---|
| pdftotext | PDF extraction fallback | apt-get install poppler-utils |
| tesseract | OCR for images/scanned PDFs | apt-get install tesseract-ocr |
Checking Dependencies
# Check if pdftotext is available
which pdftotext
# Check Tesseract version
tesseract --version
# Test OCR
tesseract test-image.png output
Troubleshooting
Debugging File Processing
Enable Debug Logging:
- Go to Settings → File Management
- Enable "Detailed Logging"
- Check logs in data/nextcloud.log or via Docker: docker logs nextcloud-container
Look for:
[TextExtractionService] Processing file: document.pdf
[TextExtractionService] Extraction method: pdfParser
[TextExtractionService] Extracted text length: 45678
[TextExtractionService] Created 12 chunks
Performance Issues
If processing is slow:
- Check file size and format
- Monitor memory usage: docker stats
- Reduce chunk size to decrease processing time
- Use faster extraction methods when possible
File Access Issues
If you see 'file not found' or 'failed to open stream' errors:
The system uses asynchronous background jobs to process files:
- Non-blocking uploads: File uploads complete immediately without waiting for text extraction
- Background processing: Text extraction runs in background jobs (typically within seconds)
- Path filtering: Only OpenRegister files are processed
- Automatic retries: Failed extractions are automatically retried by the background job system
To check background job status:
# View pending background jobs
docker exec -u 33 <nextcloud-container> php occ background-job:list
# Check logs for extraction job status
docker logs <nextcloud-container> | grep FileTextExtractionJob
Quality Issues
If text extraction is poor:
- Verify source file quality
- For scanned documents, ensure 300+ DPI
- Use native PDF over scanned when possible
- Test with different OCR languages
- Check for corrupted files
Best Practices
For Administrators
✅ Do:
- Enable only needed file formats
- Set reasonable size limits
- Monitor storage growth
- Schedule bulk processing off-hours
- Keep dependencies updated
❌ Don't:
- Process unnecessary file types
- Set extremely high size limits
- Skip dependency checks
- Run bulk processing during peak hours
For Users
✅ Do:
- Use text-based PDFs when possible
- Provide high-quality scans (300 DPI)
- Use consistent file naming
- Organize files in logical folders
❌ Don't:
- Upload password-protected files (won't extract)
- Use low-resolution scans
- Mix unrelated content in single file
- Rely on OCR for perfect accuracy
FAQ
Q: Can I process password-protected files?
A: No, password-protected files cannot be extracted. Remove password first.
Q: How accurate is OCR?
A: 90-98% accuracy for good quality scans (300 DPI, clear text). Lower for poor scans.
Q: Can I process files retroactively?
A: Yes! Use bulk extraction via API or admin interface.
Q: Do I need to re-process files after changing chunk strategy?
A: Yes, existing chunks won't update automatically. Re-extract to apply new strategy.
Q: What happens to old chunks when I re-process?
A: Old chunks are replaced with new ones. Checksums ensure only changed content is re-processed.
Q: Can I see extracted text before chunking?
A: Check logs with debug mode enabled. Text is logged before chunking.
Migration
From FileText to Chunks
The old FileText entity stored chunks in a chunks_json column. The migration moves them to the dedicated Chunk table:
// Migration: Version1Date20251118000000
// Drops deprecated openregister_file_texts and openregister_object_texts tables
// Migration: Version1Date20251117000000
// Adds checksum column to openregister_chunks table
Migration Strategy:
- Create the openregister_chunks table
- Migrate existing chunks from chunks_json to the new table
- Update services to use ChunkMapper
- Drop the old tables after verification (see the sketch below)
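A sketch of what the table-drop step (the last item above) could look like, using Nextcloud's SimpleMigrationStep; the class body is illustrative, not the actual migration file:

```php
use Closure;
use OCP\DB\ISchemaWrapper;
use OCP\Migration\IOutput;
use OCP\Migration\SimpleMigrationStep;

class Version1Date20251118000000 extends SimpleMigrationStep
{
    public function changeSchema(IOutput $output, Closure $schemaClosure, array $options): ?ISchemaWrapper
    {
        /** @var ISchemaWrapper $schema */
        $schema = $schemaClosure();

        // Drop the deprecated tables once their chunks have been migrated.
        foreach (['openregister_file_texts', 'openregister_object_texts'] as $table) {
            if ($schema->hasTable($table)) {
                $schema->dropTable($table);
            }
        }

        return $schema;
    }
}
```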
Related Documentation
Feature Documentation
- Text Extraction, Vectorization & Named Entity Recognition - Unified feature documentation
- Text Extraction Sources - Source-specific details (files vs objects)
- Enhanced Text Extraction - GDPR and language features
Technical Documentation
- Vectorization Technical Documentation - Vector embedding implementation
- Named Entity Recognition Technical Documentation - NER implementation