
Text Extraction Refactoring Summary

Overview

Text extraction has been refactored to be completely independent of SOLR. Previously, text extraction was tightly coupled with SOLR indexing, which created unnecessary dependencies and made the extraction process less flexible.

Changes Made

1. TextExtractionService Enhancement

File: lib/Service/TextExtractionService.php

Added Chunking Logic

  • Moved chunking from SolrFileService to TextExtractionService
  • Added constants for chunk configuration:
    • DEFAULT_CHUNK_SIZE: 1000 characters
    • DEFAULT_CHUNK_OVERLAP: 200 characters
    • MIN_CHUNK_SIZE: 100 characters
    • MAX_CHUNKS_PER_FILE: 1000 chunks
    • Chunking strategies: RECURSIVE_CHARACTER, FIXED_SIZE
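
A minimal sketch of how these values could be declared on TextExtractionService; the constant names mirror the list above, while the visibility and the strategy identifiers' string values are assumptions:

    class TextExtractionService {
        // Chunk sizing (values from the list above).
        public const DEFAULT_CHUNK_SIZE    = 1000; // target characters per chunk
        public const DEFAULT_CHUNK_OVERLAP = 200;  // characters shared between adjacent chunks
        public const MIN_CHUNK_SIZE        = 100;  // smallest chunk worth keeping
        public const MAX_CHUNKS_PER_FILE   = 1000; // safety cap per file

        // Chunking strategies (string values are assumed).
        public const STRATEGY_RECURSIVE_CHARACTER = 'recursive_character';
        public const STRATEGY_FIXED_SIZE          = 'fixed_size';
    }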

New Chunking Methods

  • chunkDocument() - Main chunking interface
  • chunkFixedSize() - Fixed-size chunking with overlap
  • chunkRecursive() - Smart chunking that tries different separators (paragraphs, sentences, words)
  • recursiveSplit() - Recursive text splitting algorithm
  • cleanText() - Text normalization before chunking
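
To make the fixed-size strategy concrete, here is an illustrative, standalone version of chunkFixedSize(); the actual method in TextExtractionService may differ in signature and edge-case handling:

    /**
     * Fixed-size chunking with overlap (illustrative sketch).
     *
     * @return array<int, array{text: string, start_offset: int, end_offset: int}>
     */
    function chunkFixedSize(string $text, int $size = 1000, int $overlap = 200): array
    {
        $chunks = [];
        $length = mb_strlen($text);
        $step   = max(1, $size - $overlap);          // advance by size minus overlap

        for ($start = 0; $start < $length; $start += $step) {
            $piece    = mb_substr($text, $start, $size);
            $chunks[] = [
                'text'         => $piece,
                'start_offset' => $start,
                'end_offset'   => $start + mb_strlen($piece),
            ];
            if (count($chunks) >= 1000) {            // MAX_CHUNKS_PER_FILE safety cap
                break;
            }
        }

        return $chunks;
    }

With the default 1000/200 configuration the window advances 800 characters per chunk, so the 416,657-character test document below would yield roughly 520 fixed-size chunks, the same order of magnitude as the 551 chunks reported for the recursive strategy.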

Removed SOLR Dependency

  • Removed GuzzleSolrService from constructor
  • Removed automatic SOLR indexing after extraction
  • Text extraction now stores chunks independently
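
Schematically, the constructor change looks like the sketch below; GuzzleSolrService is the dependency that was removed, while the remaining dependencies (FileTextMapper, LoggerInterface) and their namespaces are assumptions used for illustration:

    use OCA\OpenRegister\Db\FileTextMapper; // assumed mapper class
    use Psr\Log\LoggerInterface;

    class TextExtractionService {
        // Before: __construct(GuzzleSolrService $solrService, FileTextMapper $mapper, LoggerInterface $logger)
        // After: no SOLR client; extraction only depends on its own persistence and logging.
        public function __construct(
            private FileTextMapper $fileTextMapper,
            private LoggerInterface $logger,
        ) {
        }
    }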

2. FileText Entity Update

File: lib/Db/FileText.php

Added New Property

  • chunks_json (LONGTEXT) - Stores JSON-encoded array of text chunks
  • Each chunk contains:
    {
      "text": "chunk content",
      "start_offset": 0,
      "end_offset": 1000
    }
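
A sketch of the corresponding entity property, assuming the usual Nextcloud Entity pattern that maps the chunksJson property to the chunks_json column; the getChunks() helper is illustrative and not necessarily part of the actual class:

    use OCP\AppFramework\Db\Entity;

    class FileText extends Entity {
        /** @var string|null JSON-encoded array of chunks (chunks_json column) */
        protected $chunksJson;

        public function __construct() {
            $this->addType('chunksJson', 'string');
        }

        /** Decode chunks_json back into an array of chunk arrays. */
        public function getChunks(): array {
            $decoded = json_decode($this->chunksJson ?? '[]', true);
            return is_array($decoded) ? $decoded : [];
        }
    }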

3. Database Migration

File: lib/Migration/Version1Date20251107170000.php

  • Created migration to add chunks_json column to oc_openregister_file_texts table
  • Migration ran successfully on 2025-11-07
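
For reference, a Nextcloud migration step that adds such a column typically looks like the sketch below; the class name matches the file above, but the exact column options used in the real migration are an assumption:

    use Closure;
    use OCP\DB\ISchemaWrapper;
    use OCP\DB\Types;
    use OCP\Migration\IOutput;
    use OCP\Migration\SimpleMigrationStep;

    class Version1Date20251107170000 extends SimpleMigrationStep {
        public function changeSchema(IOutput $output, Closure $schemaClosure, array $options): ?ISchemaWrapper {
            /** @var ISchemaWrapper $schema */
            $schema = $schemaClosure();

            // Table name without the "oc_" prefix, which Nextcloud adds automatically.
            $table = $schema->getTable('openregister_file_texts');
            if (!$table->hasColumn('chunks_json')) {
                // TEXT with no length limit typically maps to LONGTEXT on MySQL/MariaDB.
                $table->addColumn('chunks_json', Types::TEXT, ['notnull' => false]);
            }

            return $schema;
        }
    }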

4. GuzzleSolrService Update

File: lib/Service/GuzzleSolrService.php

Updated indexFiles() Method

  • Before: Generated chunks on-the-fly during SOLR indexing
  • After: Reads pre-generated chunks from chunks_json field
  • Benefits:
    • SOLR indexing is faster (no need to re-chunk)
    • Consistent chunking across all services
    • Chunks can be reused for embeddings, search, etc.
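
An illustrative fragment of how indexFiles() can consume the stored chunks; the $fileText and $fileId variables and the getChunksJson() accessor are assumptions, while chunks_json is the field described above:

    // Read the chunks produced during extraction instead of re-chunking here.
    $chunks = json_decode($fileText->getChunksJson() ?? '[]', true) ?: [];

    $documents = [];
    foreach ($chunks as $index => $chunk) {
        $documents[] = [
            'id'           => $fileId . '_chunk_' . $index,
            'content'      => $chunk['text'],
            'start_offset' => $chunk['start_offset'],
            'end_offset'   => $chunk['end_offset'],
        ];
    }
    // $documents is then sent to SOLR's update endpoint in a single batch request.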

Benefits of the Refactoring

1. Separation of Concerns

  • Text Extraction: Handles file parsing and chunking
  • SOLR Service: Only responsible for indexing pre-chunked data
  • Each service has a single, well-defined responsibility

2. Improved Performance

  • Chunks are generated once during extraction
  • SOLR indexing no longer needs to re-chunk text
  • Chunks are cached in the database for reuse

3. Flexibility

  • Chunks can be used by multiple services:
    • SOLR for full-text search
    • Vector databases for embeddings
    • AI services for text analysis
    • Export/download features
  • Easy to adjust chunking strategies without affecting SOLR

4. Better Data Integrity

  • Chunks are stored with the extracted text
  • No risk of different services generating inconsistent chunks
  • Chunks are versioned with the file (re-extraction updates chunks)

Test Results

File Tested: Nextcloud_Server_Administration_Manual.pdf (8.8 MB)

Results:

  • ✅ PDF extraction successful
  • ✅ Text extracted: 416,657 characters
  • ✅ Chunks generated: 551 chunks
  • ✅ Average chunk size: ~756 characters
  • ✅ Chunks stored in chunks_json field
  • ✅ SOLR indexing uses pre-generated chunks

Migration Notes

For Existing Installations

  1. Run Migration: The migration adds the chunks_json column automatically
  2. Re-extract Files: Existing files need to be re-extracted to generate chunks
  3. SOLR Re-index: After re-extraction, re-index the files in SOLR so they use the new chunks

For New Installations

  • No migration needed - new installations include chunking from the start