Enhanced Text Extraction Implementation Plan

Overview

This document outlines the implementation plan for adding entity extraction, language detection, and language level assessment to OpenRegister's text extraction system, including GDPR entity tracking capabilities.

Documentation Created

Enhanced Text Extraction & GDPR Entity Tracking
- Complete feature documentation
- Processing methods (local, external services, LLM, hybrid)
- GDPR entity register design
- Language detection and assessment
- Preparing for anonymization
- API endpoints
Text Extraction Sources: Files vs Objects
- Visual separation of file and object processing paths
- Detailed flow diagrams for each source type
- Comparison and combined use cases
- Configuration options
Text Extraction Database Entities
- Complete database schema
- Entity relationship diagrams
- PHP entity classes
- Migration strategy
- Performance considerations

Key Features Added to Documentation

1. Two Processing Paths

File Path

File Upload → Text Extraction (LLPhant/Dolphin) → Complete Text → Chunks

Object Path

Object Creation → Property Values → Text Blob → Chunks

Both paths converge at chunks, which can then undergo:

Text search indexing (Solr)
Vector embeddings (RAG)
Entity extraction (GDPR)
Language detection
Language level assessment
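
The object path above (property values, then a text blob, then chunks) can be sketched as follows. This is an illustrative Python sketch, not the planned PHP service: the function names are hypothetical, and fixed-size windows are used here for simplicity rather than the recursive strategy shown as the default in the settings. The size and overlap defaults mirror the values in the configuration mockup.

```python
from typing import Any, Iterator


def flatten_values(value: Any) -> Iterator[str]:
    """Yield every non-empty scalar inside a nested object as a string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from flatten_values(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten_values(item)
    elif value is not None and value != "":
        yield str(value)


def object_to_text(obj: dict) -> str:
    """Concatenate property values into one text blob, one value per line."""
    return "\n".join(flatten_values(obj))


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Keeping the start and end offsets on every chunk is what later allows entity positions to be stored relative to a chunk.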

2. GDPR Entity Register

Two new entities:

Entity: Stores unique entities (persons, emails, organizations)
- UUID, type, value, category
- Detection timestamp and metadata
- Supports deduplication
EntityRelation: Links entities to chunk positions
- Entity ID + Chunk ID
- Precise character positions
- Confidence score and detection method
- Anonymization tracking
- Context for verification
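
A minimal sketch of how the two entities could relate, assuming deduplication is keyed on type plus a case-insensitive value. The field names, the `EntityRegister` helper, and the default category are hypothetical and written in Python for illustration; the planned PHP entity classes may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class Entity:
    type: str                          # e.g. "person", "email", "organization"
    value: str
    category: str = "personal_data"    # hypothetical default category
    uuid: str = field(default_factory=lambda: str(uuid4()))
    detected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class EntityRelation:
    entity_uuid: str
    chunk_id: int
    position_start: int
    position_end: int
    confidence: float
    detection_method: str
    context: str = ""
    anonymized: bool = False


class EntityRegister:
    """In-memory stand-in for the entity and relation tables."""

    def __init__(self) -> None:
        self.entities: dict[tuple[str, str], Entity] = {}
        self.relations: list[EntityRelation] = []

    def record(self, type_: str, value: str, chunk_id: int, start: int, end: int,
               confidence: float, method: str, context: str = "") -> Entity:
        # Deduplicate: one Entity per (type, normalized value), many relations.
        key = (type_, value.strip().casefold())
        entity = self.entities.get(key)
        if entity is None:
            entity = Entity(type=type_, value=value.strip())
            self.entities[key] = entity
        self.relations.append(EntityRelation(
            entity.uuid, chunk_id, start, end, confidence, method, context))
        return entity
```

Each detection adds a relation, while the entity itself is created only once, so occurrence counts can be derived from the relations.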

Prepared for anonymization:

Precise position tracking
Consistent replacement values
Reversible anonymization
Metadata preservation
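
The following sketch shows why precise positions and consistent replacement values matter. It assumes non-overlapping spans and a hypothetical `[ENTITY_n]` placeholder format; neither is specified in the plan.

```python
def anonymize(text: str, occurrences: list[tuple[int, int, str]]) -> tuple[str, dict[str, str]]:
    """Replace each (start, end, entity_key) span with a stable placeholder.

    Returns the anonymized text and a placeholder -> original value mapping.
    Assumes the spans do not overlap.
    """
    placeholders: dict[str, str] = {}
    mapping: dict[str, str] = {}
    for start, end, key in sorted(occurrences):
        if key not in placeholders:
            # The same entity always receives the same placeholder.
            placeholders[key] = f"[ENTITY_{len(placeholders) + 1}]"
            mapping[placeholders[key]] = text[start:end]
    result = text
    # Replace from right to left so earlier offsets stay valid.
    for start, end, key in sorted(occurrences, reverse=True):
        result = result[:start] + placeholders[key] + result[end:]
    return result, mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Reverse the anonymization using the stored mapping."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Reversal only works while the mapping is retained, so a real implementation would need to store it as securely as the entities themselves.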

3. Language Detection & Assessment

Chunk entity enhancements:

language field: ISO 639-1 codes (e.g., 'en', 'nl', 'de')
language_level field: Reading level (e.g., 'B2', 'Grade 8', '65')
language_confidence field: Detection confidence (0.0-1.0)
detection_method field: How it was detected

Use cases:

Multi-language content management
Accessibility compliance (plain language)
Content routing by language
Readability assessment

4. Multiple Processing Methods

All enhancements support three methods, plus a recommended hybrid mode that combines them:

Local Algorithms: Fast, privacy-friendly, no external dependencies
External Services: Specialized APIs (Presidio, NLDocs, Dolphin)
LLM Processing: Context-aware, handles ambiguity
Hybrid (Recommended): Multiple methods with confidence scoring
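
One possible reading of "multiple methods with confidence scoring" is a weighted vote, sketched below for language detection. The function name, the weighting scheme, and the 0.7 default threshold (taken from the settings mockup) are assumptions, not the plan's specified algorithm.

```python
from collections import defaultdict
from typing import Optional


def combine_detections(
    results: list[tuple[str, str, float]],
    weights: Optional[dict[str, float]] = None,
    threshold: float = 0.7,
) -> Optional[tuple[str, float]]:
    """Combine (method, label, confidence) results into one decision.

    Each method votes for its label with weight * confidence. The winning
    label is accepted only if its share of the total weight reaches the
    threshold; otherwise None is returned.
    """
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    total_weight = 0.0
    for method, label, confidence in results:
        weight = weights.get(method, 1.0)
        scores[label] += weight * confidence
        total_weight += weight
    if not scores or total_weight == 0:
        return None
    label, score = max(scores.items(), key=lambda item: item[1])
    combined = score / total_weight
    return (label, combined) if combined >= threshold else None
```

With this scheme, disagreement between methods lowers the combined score, so a chunk on which the methods disagree stays unlabelled rather than receiving a low-quality label.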

5. Extended Chunking Support

Email chunking:

Segment by headers, body, signature, attachments
Preserve sender/recipient as entities
Link chunks to email threads
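
A rough sketch of email segmentation using Python's standard `email` package. The `chunk_email` name and the returned dictionary shape are hypothetical, and the signature split relies on the conventional "-- " delimiter line, which many real messages omit.

```python
from email import message_from_string
from email.policy import default
from email.utils import getaddresses


def chunk_email(raw: str) -> dict:
    """Split a raw RFC 5322 message into header, body, and signature segments."""
    msg = message_from_string(raw, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    # "-- " on its own line is the conventional signature delimiter.
    main, _, signature = body.partition("\n-- \n")
    addresses = getaddresses(msg.get_all("From", []) + msg.get_all("To", []))
    return {
        "headers": {name: str(msg[name]) for name in ("From", "To", "Subject") if msg[name]},
        "body": main.strip(),
        "signature": signature.strip(),
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
        # Sender and recipients, ready to be stored as entities.
        "participants": [addr for _, addr in addresses if addr],
    }
```

Thread linking is left out here; it would typically build on the Message-ID, In-Reply-To, and References headers.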

Chat message chunking:

Process individual messages with context
Track conversation participants as entities
Maintain threading
Include previous messages for coherent search

Database Changes Required

New Tables

oc_openregister_object_texts: Text blobs from objects
oc_openregister_chunks: Individual chunks (migrated from chunks_json)
oc_openregister_entities: GDPR entity register
oc_openregister_entity_relations: Entity-to-chunk mappings

Updated Tables

oc_openregister_file_texts: No changes required (already has chunks_json)

A future migration (Phase 3) will move chunks from the chunks_json column to the dedicated chunks table for better querying.

Implementation Phases

Phase 1: Database Schema (Week 1)

Tasks:

Create migration for new tables
Create PHP entity classes
Create mapper classes
Add unit tests for entities

Deliverables:

lib/Migration/Version1DateXXXXXXXX.php
lib/Db/ObjectText.php
lib/Db/Chunk.php
lib/Db/GdprEntity.php
lib/Db/EntityRelation.php
Corresponding mapper classes

Phase 2: Object Text Extraction (Week 2)

Tasks:

Create ObjectTextExtractionService
Integrate with SaveObject event
Property value concatenation logic
Chunking for objects
Add configuration settings

Deliverables:

lib/Service/ObjectTextExtractionService.php
Integration with existing SaveObject flow
Settings UI for object extraction
Unit tests

Phase 3: Chunk Migration (Week 3)

Tasks:

Create ChunkService
Migrate FileText chunks_json to Chunk table
Background job for migration
Update services to use Chunk entity
Maintain backward compatibility

Deliverables:

lib/Service/ChunkService.php
lib/BackgroundJob/MigrateChunksJob.php
Updated TextExtractionService
Migration status tracking
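
The migration step can be illustrated with an in-memory SQLite database. Table names are shortened, and every column other than chunks_json is an assumption, including the `chunks_migrated` flag used for status tracking and the JSON shape (a list of objects with a `text` key). The real migration would run as a Nextcloud background job in PHP.

```python
import json
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE file_texts (
    id INTEGER PRIMARY KEY,
    chunks_json TEXT,
    chunks_migrated INTEGER DEFAULT 0
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    source_type TEXT NOT NULL,
    source_id INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    text_content TEXT NOT NULL
);
"""


def migrate_chunks(conn: sqlite3.Connection, batch_size: int = 100) -> int:
    """Copy one batch of JSON chunks into the chunks table.

    Returns the number of chunk rows inserted. The chunks_json column is
    left untouched so older code paths keep working.
    """
    rows = conn.execute(
        "SELECT id, chunks_json FROM file_texts WHERE chunks_migrated = 0 LIMIT ?",
        (batch_size,),
    ).fetchall()
    inserted = 0
    for file_text_id, chunks_json in rows:
        chunks = json.loads(chunks_json or "[]")
        conn.executemany(
            "INSERT INTO chunks (source_type, source_id, chunk_index, text_content) "
            "VALUES ('file', ?, ?, ?)",
            [(file_text_id, index, chunk["text"]) for index, chunk in enumerate(chunks)],
        )
        conn.execute("UPDATE file_texts SET chunks_migrated = 1 WHERE id = ?", (file_text_id,))
        inserted += len(chunks)
    conn.commit()
    return inserted
```

Running the function repeatedly until it returns 0 drains the backlog in batches, which fits a recurring background job.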

Phase 4: Language Detection (Week 4)

Tasks:

Create LanguageDetectionService
Implement local algorithm (lingua or similar)
Implement API integration (optional)
Implement LLM integration (optional)
Add background job for batch processing
Add configuration UI

Deliverables:

lib/Service/LanguageDetectionService.php
lib/BackgroundJob/DetectLanguageJob.php
Settings UI for language detection
Unit tests
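
To give a feel for what a local detection method returns, here is a deliberately naive stopword-overlap detector. It is a stand-in technique, not lingua and not the planned algorithm, and the tiny word lists are illustrative assumptions. The output shape (ISO 639-1 code plus a confidence between 0.0 and 1.0) matches the chunk fields described earlier.

```python
import re
from typing import Optional

# Tiny illustrative stopword lists; a real detector needs far more data.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "nl": {"de", "het", "een", "en", "van", "is", "dat", "niet"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def detect_language(text: str) -> tuple[Optional[str], float]:
    """Return (ISO 639-1 code, confidence) based on stopword hits."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = {lang: sum(word in stopwords for word in words)
            for lang, stopwords in STOPWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return None, 0.0
    best = max(hits, key=hits.get)
    # Confidence is the winning language's share of all stopword hits.
    return best, hits[best] / total
```

Short chunks produce few hits and therefore unstable confidence values, which is one reason to apply a confidence threshold before storing the result.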

Phase 5: Language Level Assessment (Week 5)

Tasks:

Create LanguageLevelService
Implement readability formulas (Flesch-Kincaid, etc.)
Implement API integration (optional)
Implement LLM integration (optional)
Add background job
Add configuration UI

Deliverables:

lib/Service/LanguageLevelService.php
lib/BackgroundJob/AssessLanguageLevelJob.php
Settings UI for level assessment
Unit tests
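
As an example of a formula-based method, the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below uses a crude vowel-group heuristic for syllables, which is an assumption that only roughly fits English; other languages would need their own formula or coefficients.

```python
import re
from typing import Optional


def count_syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> Optional[float]:
    """Return the Flesch-Kincaid grade level, or None for empty input."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(word) for word in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Very simple text can score below zero. Mapping the result to another scale such as CEFR would require a separate, explicitly chosen conversion.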

Phase 6: Entity Extraction (Week 6-7)

Tasks:

Create EntityExtractionService
Implement regex patterns (local)
Implement Presidio integration (optional)
Implement LLM integration (optional)
Entity deduplication logic
EntityRelation creation
Background job for batch processing
Add configuration UI

Deliverables:

lib/Service/EntityExtractionService.php
lib/BackgroundJob/ExtractEntitiesJob.php
Settings UI for entity extraction
Unit tests
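
A sketch of the local, regex-based extraction step. The patterns and fixed confidence values are illustrative assumptions and far from complete; the point is the output shape, which carries the character positions, confidence, detection method, and context that an EntityRelation needs.

```python
import re

# Illustrative patterns with a hypothetical fixed confidence per type.
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), 0.95),
    "ip_address": (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), 0.80),
}


def extract_local(text: str, context_window: int = 100) -> list[dict]:
    """Find pattern matches in a chunk and return them ordered by position."""
    found = []
    for entity_type, (pattern, confidence) in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({
                "type": entity_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
                "confidence": confidence,
                "method": "regex",
                # Surrounding text kept for manual verification.
                "context": text[max(0, match.start() - context_window):
                                match.end() + context_window],
            })
    return sorted(found, key=lambda item: item["start"])
```

Persons, organizations, and locations are hard to capture with patterns alone, which is where the optional Presidio or LLM methods would come in.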

Phase 7: GDPR Register UI (Week 8)

Tasks:

Create EntityController
Create Vue components for entity list
Entity details view
Occurrence list
GDPR report generation
Export functionality
Search and filtering

Deliverables:

lib/Controller/EntityController.php
src/views/gdpr/EntitiesIndex.vue
src/views/gdpr/EntityDetails.vue
src/modals/gdpr/GdprReportModal.vue
API endpoints

Phase 8: Email & Chat Chunking (Week 9)

Tasks:

Create EmailChunkingService
Create ChatChunkingService
Integration with Mail app (if available)
Integration with Talk app (if available)
Special handling for email metadata
Conversation threading

Deliverables:

lib/Service/EmailChunkingService.php
lib/Service/ChatChunkingService.php
Event listeners for Mail/Talk
Unit tests

Phase 9: Testing & Documentation (Week 10)

Tasks:

Integration tests for all services
Performance testing
API documentation updates
User documentation updates
Admin guide for GDPR features
Video tutorials (optional)

Deliverables:

Full test coverage
Updated API documentation
User guides
Admin documentation

Phase 10: Deployment & Monitoring (Week 11)

Tasks:

Beta deployment
Monitor background jobs
Performance tuning
Bug fixes
Collect user feedback
Production deployment

Configuration Structure

Settings → OpenRegister → Text Analysis

┌─ Text Extraction ─────────────────────────────┐
│ ☑ Enable Object Text Extraction              │
│ ☑ Enable File Text Extraction                │
│                                                │
│ Chunking Strategy: [Recursive ▼]              │
│ Chunk Size: [1000] characters                 │
│ Chunk Overlap: [200] characters               │
└────────────────────────────────────────────────┘

┌─ Language Detection ──────────────────────────┐
│ ☑ Enable Language Detection                   │
│                                                │
│ Detection Method: [Hybrid ▼]                  │
│   • Local Algorithm                           │
│   • External API (optional)                   │
│   • LLM (optional)                            │
│                                                │
│ Confidence Threshold: [0.70] (0.0-1.0)        │
└────────────────────────────────────────────────┘

┌─ Language Level Assessment ───────────────────┐
│ ☑ Enable Language Level Assessment            │
│                                                │
│ Assessment Method: [Formula ▼]                │
│ Scale: [CEFR ▼]                               │
└────────────────────────────────────────────────┘

┌─ Entity Extraction (GDPR) ────────────────────┐
│ ☑ Enable Entity Extraction                    │
│                                                │
│ Extraction Method: [Hybrid ▼]                 │
│   • Local Patterns: ☑ Enabled                 │
│   • Presidio API: ☐ Enabled (API key req.)    │
│   • LLM: ☑ Enabled                            │
│                                                │
│ Entity Types to Detect:                       │
│   ☑ Persons                                   │
│   ☑ Email Addresses                           │
│   ☑ Phone Numbers                             │
│   ☑ Organizations                             │
│   ☑ Locations                                 │
│   ☑ Dates of Birth                            │
│   ☐ ID Numbers                                │
│   ☐ Bank Accounts                             │
│   ☐ IP Addresses                              │
│                                                │
│ Confidence Threshold: [0.80] (0.0-1.0)        │
│ Context Window: [100] characters              │
│                                                │
│ [View GDPR Register] [Generate Report]        │
└────────────────────────────────────────────────┘

┌─ Vector Embeddings (RAG) ─────────────────────┐
│ ☑ Enable Vectorization                        │
│                                                │
│ Embedding Model: [OpenAI text-embedding-3 ▼]  │
│ Vector Backend: [Solr ▼]                      │
└────────────────────────────────────────────────┘

┌─ Processing ──────────────────────────────────┐
│ Background Job Interval: [5] minutes          │
│ Batch Size: [100] chunks per job              │
│                                                │
│ [Process Pending Chunks Now]                  │
│ [Reprocess All Chunks]                        │
└────────────────────────────────────────────────┘

┌─ Statistics ──────────────────────────────────┐
│ Total Chunks: 145,782                         │
│ Languages Detected: 8                         │
│ Entities Found: 2,341                         │
│ Pending Processing: 234                       │
│                                                │
│ Top Languages:                                │
│   • English: 98,452 chunks (67.5%)            │
│   • Dutch: 42,119 chunks (28.9%)              │
│   • German: 5,211 chunks (3.6%)               │
└────────────────────────────────────────────────┘

API Endpoints

Chunks

GET  /api/chunks
GET  /api/chunks/{id}
POST /api/chunks/{id}/analyze
GET  /api/chunks/languages
GET  /api/chunks/levels
POST /api/chunks/batch-analyze

Entities (GDPR)

GET  /api/entities
GET  /api/entities/{id}
GET  /api/entities/{id}/occurrences
POST /api/entities/{id}/anonymize
GET  /api/gdpr/report
POST /api/gdpr/export

Object Text

GET  /api/object-texts
GET  /api/object-texts/{id}
POST /api/objects/{id}/extract-text

Service Architecture

TextExtractionService (existing)
  ├─ FileTextExtractionService (existing)
  └─ ObjectTextExtractionService (new)

ChunkService (new)
  ├─ createChunksFromFile()
  ├─ createChunksFromObject()
  ├─ migrateFromJson()
  └─ getChunksBySource()

EnhancementService (new)
  ├─ LanguageDetectionService
  │   ├─ detectLocal()
  │   ├─ detectApi()
  │   └─ detectLlm()
  ├─ LanguageLevelService
  │   ├─ assessFormula()
  │   ├─ assessApi()
  │   └─ assessLlm()
  └─ EntityExtractionService
      ├─ extractLocal()
      ├─ extractPresidio()
      ├─ extractLlm()
      └─ createEntityRelations()

GdprService (new)
  ├─ generateReport()
  ├─ findEntityOccurrences()
  ├─ anonymizeEntity()
  └─ exportGdprData()

Background Jobs

MigrateChunksJob: Migrate chunks from JSON to table
ProcessChunksJob: Apply enhancements to pending chunks
DetectLanguageJob: Batch language detection
AssessLanguageLevelJob: Batch level assessment
ExtractEntitiesJob: Batch entity extraction
UpdateEntityStatsJob: Update entity occurrence counts

Performance Targets

Object text extraction: <100ms per object
Chunk creation: <50ms per 100KB text
Language detection (local): <10ms per chunk
Language level (formula): <20ms per chunk
Entity extraction (local): <100ms per chunk
GDPR report generation: <5s for 10,000 entities

Testing Strategy

Unit Tests: All services and entities
Integration Tests: End-to-end flows
Performance Tests: Background job processing
Load Tests: 10,000+ files and objects
API Tests: All endpoints
UI Tests: GDPR register interface

Security Considerations

Access Control: GDPR register admin-only
Encryption: Entities encrypted at rest
Audit Trail: Log all entity access
Data Minimization: Only extract necessary entities
Retention: Configurable entity retention periods
Export: Secure GDPR data export

Compliance

GDPR: Complete entity tracking for data subject requests
Right to Erasure: Prepared for anonymization
Data Mapping: Know where all PII exists
Audit Trail: Complete access logging
Retention: Configurable data retention

Next Steps

Review this implementation plan with stakeholders
Prioritize phases based on business needs
Allocate resources (developers, QA, etc.)
Set up development environment
Create feature branch
Begin Phase 1 implementation

Questions for Stakeholders

Which entity types are most important for initial release?
Should we integrate with external services (Presidio, etc.) or start with local only?
What is the target timeline for GDPR compliance?
Are there specific languages to prioritize for detection?
Should email/chat chunking be in first release or later?
What is the performance budget for background job processing?
Are there existing GDPR workflows to integrate with?

Success Metrics

Coverage: 100% of files and objects chunked
Accuracy: >90% entity detection accuracy
Performance: <5min to process 1000 files
Adoption: GDPR register used for data subject requests
Compliance: Pass GDPR audit
User Satisfaction: Positive feedback on search quality

Conclusion

This enhanced text extraction system provides OpenRegister with:

✅ Unified processing for files and objects
✅ GDPR compliance with entity tracking
✅ Language detection and assessment
✅ Prepared for anonymization
✅ Extended support for emails and chats
✅ Flexible processing methods (local, API, LLM)
✅ Comprehensive documentation
✅ Clear implementation roadmap

The system is designed to be implemented incrementally, with each phase delivering value independently while building toward the complete feature set.
