Enhanced Text Extraction Implementation Plan
Overview
This document outlines the implementation plan for adding entity extraction, language detection, and language level assessment to OpenRegister's text extraction system, including GDPR entity tracking capabilities.
Documentation Created
- Enhanced Text Extraction & GDPR Entity Tracking
  - Complete feature documentation
  - Processing methods (local, external services, LLM, hybrid)
  - GDPR entity register design
  - Language detection and assessment
  - Preparing for anonymization
  - API endpoints
- Text Extraction Sources: Files vs Objects
  - Visual separation of file and object processing paths
  - Detailed flow diagrams for each source type
  - Comparison and combined use cases
  - Configuration options
- Text Extraction Database Entities
  - Complete database schema
  - Entity relationship diagrams
  - PHP entity classes
  - Migration strategy
  - Performance considerations
Key Features Added to Documentation
1. Two Processing Paths
File Path
File Upload → Text Extraction (LLPhant/Dolphin) → Complete Text → Chunks
Object Path
Object Creation → Property Values → Text Blob → Chunks
Both paths converge at chunks, which can then undergo:
- Text search indexing (Solr)
- Vector embeddings (RAG)
- Entity extraction (GDPR)
- Language detection
- Language level assessment
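The convergence of the two paths can be illustrated with a minimal Python sketch. It is not OpenRegister's PHP implementation: the `Chunk` fields, the function names, and the fixed-size character windows are all assumptions made for illustration, and the real chunking strategy may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    source_type: str  # "file" or "object" (hypothetical field)
    source_id: str
    index: int
    text: str


def split_into_chunks(text: str, size: int = 200) -> List[str]:
    """Cut text into fixed-size character windows (a stand-in strategy)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunks_from_file(file_id: str, extracted_text: str, size: int = 200) -> List[Chunk]:
    """File path: text already extracted from the file -> chunks."""
    parts = split_into_chunks(extracted_text, size)
    return [Chunk("file", file_id, i, part) for i, part in enumerate(parts)]


def chunks_from_object(object_id: str, properties: Dict[str, object], size: int = 200) -> List[Chunk]:
    """Object path: property values -> text blob -> chunks."""
    # Assumption: only non-empty string values contribute to the blob.
    blob = "\n".join(v for v in properties.values() if isinstance(v, str) and v)
    parts = split_into_chunks(blob, size)
    return [Chunk("object", object_id, i, part) for i, part in enumerate(parts)]
```

Because both functions return the same `Chunk` shape, downstream steps such as indexing or entity extraction would not need to know which path produced a chunk.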
2. GDPR Entity Register
Two new entities:
- Entity: Stores unique entities (persons, emails, organizations)
  - UUID, type, value, category
  - Detection timestamp and metadata
  - Supports deduplication
- EntityRelation: Links entities to chunk positions
  - Entity ID + Chunk ID
  - Precise character positions
  - Confidence score and detection method
  - Anonymization tracking
  - Context for verification
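The relationship between the two entities can be sketched in Python as follows. This is an illustrative model only, not the PHP entity classes: the field names, the `EntityRegister` helper, the case-insensitive deduplication key, and the 20-character context window are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Tuple
from uuid import uuid4


@dataclass
class Entity:
    type: str        # e.g. "person", "email", "organization"
    value: str
    category: str    # hypothetical values; the plan does not enumerate them
    uuid: str = field(default_factory=lambda: str(uuid4()))
    detected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict = field(default_factory=dict)


@dataclass
class EntityRelation:
    entity_uuid: str
    chunk_id: str
    position_start: int   # inclusive character offset within the chunk
    position_end: int     # exclusive character offset within the chunk
    confidence: float
    detection_method: str
    context: str = ""
    anonymized: bool = False


class EntityRegister:
    """In-memory stand-in for the register: one Entity per unique (type, value)."""

    def __init__(self) -> None:
        self._entities: Dict[Tuple[str, str], Entity] = {}
        self.relations: List[EntityRelation] = []

    def entities(self) -> List[Entity]:
        return list(self._entities.values())

    def record(self, type_: str, value: str, category: str, chunk_id: str,
               chunk_text: str, start: int, end: int,
               confidence: float, method: str) -> Entity:
        if not 0 <= start < end <= len(chunk_text):
            raise ValueError("position outside chunk")
        # Assumption: values are deduplicated case-insensitively.
        key = (type_, value.strip().lower())
        entity = self._entities.get(key)
        if entity is None:
            entity = Entity(type_, value.strip(), category)
            self._entities[key] = entity
        # Keep a little surrounding text so a reviewer can verify the match.
        context = chunk_text[max(0, start - 20):end + 20]
        self.relations.append(EntityRelation(
            entity.uuid, chunk_id, start, end, confidence, method, context))
        return entity
```

Recording the same email address from two chunks would yield one `Entity` and two `EntityRelation` rows, which is the deduplication behaviour the register design describes.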
Prepared for anonymization:
- Precise position tracking
- Consistent replacement values
- Reversible anonymization
- Metadata preservation
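How position tracking, consistent replacement values, and reversibility fit together can be sketched as below. This is a hedged illustration rather than the planned implementation: the `anonymize` and `restore` functions, the span tuple layout, and the placeholder format are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]          # (start, end, entity id), end exclusive
LogEntry = Tuple[int, str, str]      # (offset in output, placeholder, original text)


def anonymize(text: str, spans: List[Span],
              placeholders: Dict[str, str]) -> Tuple[str, List[LogEntry]]:
    """Replace each span with its entity's placeholder and log what was replaced."""
    pieces: List[str] = []
    log: List[LogEntry] = []
    cursor = 0       # position in the original text
    out_len = 0      # length of the output built so far
    for start, end, entity_id in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans")
        # Same entity id -> same placeholder, so replacements stay consistent.
        placeholder = placeholders[entity_id]
        pieces.append(text[cursor:start])
        out_len += start - cursor
        log.append((out_len, placeholder, text[start:end]))
        pieces.append(placeholder)
        out_len += len(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), log


def restore(anonymized: str, log: List[LogEntry]) -> str:
    """Undo anonymize() using its log; works right to left so offsets stay valid."""
    out = anonymized
    for pos, placeholder, original in reversed(log):
        out = out[:pos] + original + out[pos + len(placeholder):]
    return out
```

In this sketch the log is what makes the operation reversible, so it would need the same protection as the original personal data.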
3. Language Detection & Assessment
Chunk entity enhancements:
- language field: ISO 639-1 codes (e.g., 'en', 'nl', 'de')
- language_level field: Reading level (e.g., 'B2', 'Grade 8', '65')
- language_confidence field: Detection confidence (0.0-1.0)
- detection_method field: How it was detected
Use cases:
- Multi-language content management
- Accessibility compliance (plain language)
- Content routing by language
- Readability assessment
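As an illustration of the chunk language fields and the content-routing use case, here is a minimal Python sketch. The `ChunkLanguage` class, the validation rules, the `route_by_language` helper, the "undetermined" bucket, and the 0.8 threshold are assumptions; only the four field names come from the plan.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChunkLanguage:
    chunk_id: str
    language: Optional[str] = None              # ISO 639-1 code, e.g. "nl"
    language_level: Optional[str] = None        # free-form, e.g. "B2" or "Grade 8"
    language_confidence: Optional[float] = None  # 0.0-1.0
    detection_method: Optional[str] = None

    def __post_init__(self) -> None:
        lang = self.language
        if lang is not None and not (
            len(lang) == 2 and lang.isascii() and lang.isalpha() and lang.islower()
        ):
            raise ValueError("language must be a two-letter lowercase code")
        conf = self.language_confidence
        if conf is not None and not 0.0 <= conf <= 1.0:
            raise ValueError("language_confidence must be between 0.0 and 1.0")


def route_by_language(chunks: List[ChunkLanguage],
                      min_confidence: float = 0.8) -> Dict[str, List[str]]:
    """Group chunk ids by language; low-confidence chunks go to 'undetermined'."""
    routes: Dict[str, List[str]] = defaultdict(list)
    for chunk in chunks:
        confident = (
            chunk.language is not None
            and (chunk.language_confidence or 0.0) >= min_confidence
        )
        routes[chunk.language if confident else "undetermined"].append(chunk.chunk_id)
    return dict(routes)
```

Keeping `language_level` as free text reflects the mixed scales in the examples above (CEFR levels, grade levels, numeric scores); a stricter design might store the scale alongside the value.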