Enhanced Text Extraction & GDPR Entity Tracking

Overview

OpenRegister provides a comprehensive text extraction, chunking, and analysis pipeline that processes both files and objects into searchable, analyzable chunks. The system supports multiple processing paths and optional enhancement steps including entity extraction for GDPR compliance, language detection, and language level assessment.

Core Concepts

Sources: Files vs Objects

The text extraction system processes two distinct types of sources:

1. File Processing

Files (PDF, DOCX, images, etc.) are converted into text through extraction services:

LLPhant: Local PHP-based extraction for standard documents
Dolphin: AI-powered extraction with OCR support for complex documents and images

2. Object Processing

OpenRegister objects (structured data) are converted into text blobs:

Property values are concatenated into a searchable text representation
Maintains references to source object and properties
Enables full-text search across structured data
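
As a rough illustration of this conversion, the sketch below flattens an object's properties into text while remembering which property each piece came from. The function names (`object_to_text`, `build_text_blob`) and the dotted property-path format are assumptions made for this example, not OpenRegister's actual implementation.

```python
from typing import Any, List, Tuple


def object_to_text(obj: dict, prefix: str = "") -> List[Tuple[str, str]]:
    """Flatten nested properties into (property_path, text) pairs."""
    pairs: List[Tuple[str, str]] = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the property path.
            pairs.extend(object_to_text(value, path))
        elif isinstance(value, (list, tuple)):
            for index, item in enumerate(value):
                item_path = f"{path}[{index}]"
                if isinstance(item, dict):
                    pairs.extend(object_to_text(item, item_path))
                elif item is not None:
                    pairs.append((item_path, str(item)))
        elif value is not None:
            # Empty (None) properties contribute no searchable text.
            pairs.append((path, str(value)))
    return pairs


def build_text_blob(obj: dict) -> str:
    """Concatenate all property values into one searchable text blob."""
    return "\n".join(text for _, text in object_to_text(obj))
```

Keeping the property path next to each value is what allows a search hit in the blob to be traced back to its source property.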

The Chunking Process

Both files and objects are split into chunks: smaller, manageable pieces of text optimized for processing.

Chunk Storage

Chunks are stored with comprehensive metadata:

Original text content
Start/end offsets
Source reference (file ID or object ID)
Language (e.g., 'en', 'nl', 'de')
Language Level (e.g., 'A1', 'B2', 'C1' or reading difficulty score)

Enhancement Pipeline

Once chunks are created, they can undergo multiple optional enhancement processes, such as entity extraction, language detection, and language level assessment.

Processing Methods

Each enhancement can be performed by different methods based on requirements:

1. Local Algorithms

Advantages: Privacy-friendly, no external dependencies, fast
Use Cases: Language detection, basic entity patterns
Examples:
- Regular expressions for email/phone detection
- Statistical language detection libraries
- Readability formulas (Flesch-Kincaid, etc.)

2. External Services

Advantages: Specialized accuracy, maintained by experts
Use Cases: Complex entity recognition, advanced NLP
Examples:
- Presidio: Microsoft's PII detection service
- NLDocs: Dutch document analysis
- Dolphin: Advanced text analysis

3. LLM/AI Processing

Advantages: Context-aware, handles ambiguity, multi-task
Use Cases: Complex entities, context-dependent detection
Examples:
- GPT-4 for nuanced entity extraction
- Custom prompts for domain-specific entities
- Combined analysis (language + entities + sentiment)

4. Hybrid Approach (Recommended for Critical Applications)

Strategy: Multiple methods with confidence scoring
Example Flow:
1. Local regex detects potential email: confidence 0.6
2. Presidio confirms: confidence 0.85
3. LLM validates in context: confidence 0.95
4. Final decision: Entity confirmed with 0.95 confidence
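
The example flow above can be sketched as a chain of detectors that runs from cheapest to most expensive and stops once the confidence is high enough. This is an illustrative sketch only: the `hybrid_confidence` function, the `accept_at` threshold, and the rule of keeping the highest score seen so far are assumptions, and the detectors are stubs returning the fixed scores from the flow rather than real regex, Presidio, or LLM calls.

```python
from typing import Callable, List, Optional, Tuple

# A detector takes (candidate, context) and returns a confidence in [0.0, 1.0],
# or None when it has no opinion about the candidate.
Detector = Callable[[str, str], Optional[float]]


def hybrid_confidence(candidate: str, context: str,
                      detectors: List[Tuple[str, Detector]],
                      accept_at: float = 0.9) -> dict:
    """Run detectors in order and stop as soon as the confidence is sufficient."""
    best = 0.0
    used = []
    for name, detect in detectors:
        score = detect(candidate, context)
        if score is None:
            continue
        used.append(name)
        best = max(best, score)  # assumption: keep the highest score seen
        if best >= accept_at:
            break  # confident enough; skip the more expensive methods
    if not used:
        method = None
    elif len(used) == 1:
        method = used[0]
    else:
        method = "hybrid"
    return {"confidence": best, "methods": used, "detectionMethod": method}


# Stub detectors that return the fixed scores from the example flow.
EXAMPLE_DETECTORS = [
    ("regex", lambda candidate, context: 0.6),
    ("presidio", lambda candidate, context: 0.85),
    ("llm", lambda candidate, context: 0.95),
]

result = hybrid_confidence(
    "john.doe@example.com",
    "Contact John Doe at john.doe@example.com",
    EXAMPLE_DETECTORS,
)
```

Lowering `accept_at` trades accuracy for cost: with a threshold of 0.8 the chain in this sketch would stop after the second detector and never call the LLM.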

GDPR Entity Register

The system maintains a comprehensive register of all personal data found in your content.

Entity Types

Entity Entity

Represents a unique piece of personally identifiable information:

id: Primary key
uuid: Unique identifier for API access
type: Type of entity (person, organization, email, etc.)
value: The actual entity value (e.g., 'john.doe@example.com')
category: Classification (PII, sensitive PII, business data)
detectedAt: First detection timestamp
updatedAt: Last occurrence timestamp
metadata: Additional context (normalized forms, aliases)

Entity Knowledge Graph

Entities can be linked to each other. Contact information (phone, email, address) can belong to persons or organizations. This creates a knowledge graph that helps with:

Complete GDPR profiles (all contact info per person)
Better anonymization planning
Contact information consolidation
Entity deduplication

Note: We track contact info ownership, NOT employment or family relationships.

Entity Relation Entity

Links entities to specific chunk locations:

id: Primary key
entityId: Reference to Entity
chunkId: Reference to Chunk
positionStart: Character position where entity starts in chunk
positionEnd: Character position where entity ends in chunk
confidence: Detection confidence (0.0 - 1.0)
detectionMethod: How it was found ('regex', 'presidio', 'llm', 'hybrid')
context: Surrounding text for verification
anonymized: Whether this occurrence has been anonymized
anonymizedValue: Replacement value if anonymized
createdAt: Detection timestamp
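
To make the field list concrete, here is a hedged sketch of how one occurrence record could be built from a detection, including the surrounding context. The `EntityRelation` dataclass mirrors the fields above (the database-assigned `id` is left out); the `make_relation` helper, its `window` parameter, and the end-exclusive position convention are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EntityRelation:
    entityId: int
    chunkId: int
    positionStart: int          # inclusive character offset within the chunk
    positionEnd: int            # exclusive character offset within the chunk
    confidence: float           # 0.0 - 1.0
    detectionMethod: str        # 'regex', 'presidio', 'llm', or 'hybrid'
    context: str
    anonymized: bool = False
    anonymizedValue: Optional[str] = None
    createdAt: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def make_relation(chunk_text: str, entity_id: int, chunk_id: int,
                  start: int, end: int, confidence: float,
                  method: str, window: int = 20) -> EntityRelation:
    """Validate a detection and capture `window` characters of context on each side."""
    if not 0 <= start < end <= len(chunk_text):
        raise ValueError("positions must lie inside the chunk")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    context = chunk_text[max(0, start - window):min(len(chunk_text), end + window)]
    return EntityRelation(entity_id, chunk_id, start, end, confidence, method, context)
```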

Entity Link Entity (Entity-to-Entity Relationships)

Links entities to each other to create a knowledge graph:

id: Primary key
sourceEntityId: Source entity in the relationship
targetEntityId: Target entity in the relationship
relationshipType: Type of relationship
- 'belongs_to' - Contact info (phone/email/address) belongs to person or organization
- 'associated_with' - Generic association for uncertain relationships
- 'primary_contact' - Marks primary contact method
- 'alternate_contact' - Marks alternate contact method
confidence: Relationship confidence (0.0 - 1.0)
detectionMethod: How detected ('proximity', 'pattern', 'llm', 'manual', 'inference')
metadata: Additional context about the relationship
createdAt: When relationship was detected

Example Relationships:

Contact Info → belongs_to → Person or Organization:
- Phone(+31612345678) → belongs_to → Person(John Doe)
- Email(john@company.com) → belongs_to → Person(John Doe)
- Email(info@acme.com) → belongs_to → Organization(Acme Corp)
- Address(123 Main St) → belongs_to → Person(John Doe)
- Address(456 Business Park) → belongs_to → Organization(Acme Corp)
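
The following sketch shows how 'belongs_to' links like the ones above could be followed to collect all contact information for one owner. It works on plain in-memory dictionaries; the `contact_profile` function and the sample IDs are hypothetical and only the link field names come from this document.

```python
from collections import defaultdict


def contact_profile(entities: dict, links: list, owner_id: int) -> dict:
    """Group the values of all entities that 'belong_to' the given owner by type."""
    profile = defaultdict(list)
    for link in links:
        # Contact info is the source of the link; the person or organization is the target.
        if link["relationshipType"] == "belongs_to" and link["targetEntityId"] == owner_id:
            source = entities[link["sourceEntityId"]]
            profile[source["type"]].append(source["value"])
    return dict(profile)


# Hypothetical sample data based on the example relationships.
entities = {
    1: {"type": "person", "value": "John Doe"},
    2: {"type": "phone", "value": "+31612345678"},
    3: {"type": "email", "value": "john@company.com"},
    4: {"type": "organization", "value": "Acme Corp"},
    5: {"type": "email", "value": "info@acme.com"},
}
links = [
    {"sourceEntityId": 2, "targetEntityId": 1, "relationshipType": "belongs_to"},
    {"sourceEntityId": 3, "targetEntityId": 1, "relationshipType": "belongs_to"},
    {"sourceEntityId": 5, "targetEntityId": 4, "relationshipType": "belongs_to"},
]
```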

Preparing for Anonymization

The system is designed to support future anonymization features:

1. Precise Location Tracking

Every entity relation records exact character positions, enabling surgical replacement:

Original chunk: 'Contact John Doe at john.doe@example.com for details.'
Entity: type=PERSON, value='John Doe'
Relation: positionStart=8, positionEnd=16
Anonymization: Replace positions 8-16 with '[PERSON_1]'
Result: 'Contact [PERSON_1] at john.doe@example.com for details.'
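
A minimal sketch of position-based replacement follows, assuming `positionEnd` is exclusive and that occurrences within a chunk do not overlap. The `anonymize_chunk` function is hypothetical. Replacements are applied from the end of the chunk toward the start so that earlier offsets stay valid after each substitution.

```python
def anonymize_chunk(text: str, relations: list) -> str:
    """Replace each recorded occurrence with its anonymized value."""
    result = text
    # Work right-to-left: replacing a later span never shifts an earlier one.
    for rel in sorted(relations, key=lambda r: r["positionStart"], reverse=True):
        start, end = rel["positionStart"], rel["positionEnd"]
        result = result[:start] + rel["anonymizedValue"] + result[end:]
    return result


chunk = "Contact John Doe at john.doe@example.com for details."
relations = [
    {"positionStart": 8, "positionEnd": 16, "anonymizedValue": "[PERSON_1]"},
    {"positionStart": 20, "positionEnd": 40, "anonymizedValue": "[EMAIL_42]"},
]
anonymized = anonymize_chunk(chunk, relations)
```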

2. Consistency Across Documents

The Entity table ensures the same person/email is anonymized consistently:

Entity UUID: '550e8400-...'
Value: 'john.doe@example.com'
Anonymized to: '[EMAIL_42]' (same across all occurrences)
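
One way to get this consistency is to key placeholders on the entity's UUID instead of on each occurrence. The sketch below is an assumption about how such a mapping could work; the `PlaceholderRegistry` class, its per-type counters, and the sample UUID strings are hypothetical.

```python
class PlaceholderRegistry:
    """Hands out one stable placeholder per entity UUID."""

    def __init__(self):
        self._by_entity = {}   # entity UUID -> placeholder
        self._counters = {}    # entity type label -> last number issued

    def placeholder_for(self, entity_uuid: str, entity_type: str) -> str:
        if entity_uuid not in self._by_entity:
            label = entity_type.upper()
            self._counters[label] = self._counters.get(label, 0) + 1
            self._by_entity[entity_uuid] = f"[{label}_{self._counters[label]}]"
        # Repeated lookups return the placeholder issued the first time.
        return self._by_entity[entity_uuid]


registry = PlaceholderRegistry()
first = registry.placeholder_for("uuid-a", "email")
again = registry.placeholder_for("uuid-a", "email")
other = registry.placeholder_for("uuid-b", "email")
person = registry.placeholder_for("uuid-c", "person")
```

Because the mapping lives in one place, the same structure could also serve as the secured lookup needed for reversible anonymization.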

3. Reversible Anonymization

The system can maintain a secure mapping for authorized users:

Anonymized View: 'Contact [PERSON_1] at [EMAIL_42]'
Admin View: 'Contact John Doe at john.doe@example.com'

4. Metadata Preservation

Context and confidence scores help validate anonymization:

{
  "entity": "john.doe@example.com",
  "anonymized": "[EMAIL_42]",
  "confidence": 0.95,
  "method": "hybrid",
  "context": "...Contact John Doe at john.doe@example.com for...",
  "verified": true
}

Language Detection & Assessment

Language Detection

Identifies the primary language of each chunk:

Methods:

Local Algorithm: Fast statistical detection (e.g., lingua, language-detect)
External Service: Cloud-based detection (e.g., Google Cloud Translation API)
LLM: Context-aware language identification

Storage: ISO 639-1 codes ('en', 'nl', 'de', 'fr', etc.)

Use Cases:

Route to appropriate language models
Filter search results by language
Trigger language-specific processing
Compliance with language requirements

Language Level Assessment

Determines reading difficulty or proficiency level:

Methods:

Readability Formulas: Flesch-Kincaid, SMOG, Coleman-Liau
External Services: Specialized readability APIs
LLM Analysis: Context-aware difficulty assessment

Scales:

CEFR: A1, A2, B1, B2, C1, C2 (language proficiency)
Reading Level: Grade 1-12, College, Graduate
Numeric Score: Custom difficulty score (0-100)

Storage: String field in chunk entity (e.g., 'B2', 'Grade 8', '65')

Use Cases:

Content accessibility evaluation
Educational material classification
Simplified language recommendations
Regulatory compliance (plain language laws)
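
As an example of the readability-formula method, the sketch below computes the Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The formula is calibrated for English, and the vowel-group syllable counter is a crude heuristic assumed for this illustration; other languages would need a different formula or syllable model.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Approximate U.S. school grade level needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

The resulting number maps onto the Reading Level scale; converting it to CEFR would require a separate, assumed mapping.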

Extended Chunking Support

Email Chunking

Emails are chunked with special handling:

Metadata Stored:

Email sender/recipients (as entities for GDPR)
Subject line
Timestamp
Email thread ID
Attachment references

Chat Message Chunking

Chat messages (Talk app) are chunked with conversation context:

Metadata Stored:

Conversation ID
Message sender (as entity for GDPR)
Timestamp
Reply-to references
Participant list

Context Preservation:

Include previous 2-3 messages in chunk context for coherent search
Maintain conversation threading
Link related chunks
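
The context-preservation idea can be sketched as follows: each message becomes a chunk that also carries the text of the few messages before it. The `chunk_with_context` function, the message dictionary keys, and the `context_size` parameter are assumptions for illustration.

```python
def chunk_with_context(messages: list, index: int, context_size: int = 3) -> dict:
    """Build a chunk for messages[index] that includes the preceding messages."""
    current = messages[index]
    # Slice never goes below 0, so early messages simply get less context.
    previous = messages[max(0, index - context_size):index]
    return {
        "conversationId": current["conversationId"],
        "sender": current["sender"],
        "text": current["text"],
        "context": [m["text"] for m in previous],
        "contextMessageIds": [m["id"] for m in previous],
    }


# Hypothetical conversation used to exercise the function.
conversation = [
    {"id": i, "conversationId": "room-1", "sender": f"user{i % 2}", "text": f"message {i}"}
    for i in range(5)
]
chunk = chunk_with_context(conversation, index=4, context_size=2)
```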

Processing Flow Example

Complete File Processing Flow

Configuration Options

Text Extraction Settings

Located in Settings → OpenRegister → File Configuration:

Extraction Scope: None, All Files, Specific Folders, Object Files
Extraction Engine: LLPhant or Dolphin
Chunking Strategy: Recursive (smart) or Fixed Size
Chunk Size: Default 1000 characters
Chunk Overlap: Default 200 characters
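
To show how chunk size and overlap interact, here is a minimal sketch of the Fixed Size strategy using the defaults above (the Recursive strategy is not shown). The `chunk_fixed` function and the `startOffset`/`endOffset` key names are assumptions; each chunk starts `size - overlap` characters after the previous one.

```python
def chunk_fixed(text: str, size: int = 1000, overlap: int = 200) -> list:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be smaller than size")
    chunks = []
    step = size - overlap
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"text": text[start:end], "startOffset": start, "endOffset": end})
        if end == len(text):
            break  # the last chunk reached the end of the text
        start += step
    return chunks
```

With the defaults, a 2,500-character text yields three chunks covering offsets 0-1000, 800-1800, and 1600-2500.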

Enhancement Settings

Located in Settings → OpenRegister → Text Analysis:

Language Detection

Enable Language Detection: Yes/No
Detection Method: Local Algorithm / External Service / LLM / Hybrid
Confidence Threshold: Minimum confidence to store result (0.0-1.0)

Language Level Assessment

Enable Level Assessment: Yes/No
Assessment Method: Formula / Service / LLM
Scale: CEFR / Reading Level / Numeric Score

Entity Extraction (GDPR)

Enable Entity Extraction: Yes/No
Extraction Method: Local Patterns / Presidio / LLM / Hybrid
Entity Types: Select which types to detect
Confidence Threshold: Minimum confidence (0.0-1.0)
Context Window: Characters to store around entity

Vector/RAG Support

Enable Vectorization: Yes/No
Embedding Model: Select model
Vector Backend: Solr / Dedicated Vector DB

API Endpoints

Chunk Management

GET /api/chunks
  - List all chunks with filters
  - Filters: sourceType, sourceId, language, languageLevel

GET /api/chunks/{id}
  - Get single chunk with all metadata

POST /api/chunks/{id}/analyze
  - Trigger analysis (language, level, entities)
  - Body: { "features": ["language", "level", "entities"] }

Entity Register (GDPR)

GET /api/entities
  - List all detected entities
  - Filters: type, category, dateRange

GET /api/entities/{id}
  - Get entity details with all occurrences

GET /api/entities/{id}/occurrences
  - List all chunks containing this entity

GET /api/entities/{id}/links
  - Get all entities linked to this entity
  - Shows relationships (phone belongs to person, etc.)

POST /api/entities/{id}/anonymize
  - Mark entity for anonymization
  - Body: { "replacement": "[PERSON_1]" }

GET /api/gdpr/report
  - Generate GDPR compliance report
  - Shows all PII found in system

GET /api/gdpr/profile/{entityId}
  - Complete GDPR profile for a person
  - Includes all contact info linked to the person

Entity Links (Knowledge Graph)

GET /api/entity-links?sourceId={id}
  - Get all links from an entity

GET /api/entity-links?targetId={id}
  - Get all links to an entity

POST /api/entity-links
  - Create entity relationship
  - Body: { "sourceEntityId": 1, "targetEntityId": 2, "relationshipType": "belongs_to" }

GET /api/entities/{id}/graph
  - Get knowledge graph around an entity
  - Shows all related entities and relationships

Language Analysis

GET /api/chunks/languages
  - Get language distribution statistics

GET /api/chunks/levels
  - Get language level distribution

POST /api/chunks/batch-analyze
  - Analyze multiple chunks
  - Body: { "chunkIds": [1,2,3], "features": ["language"] }
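
As a hedged client-side illustration, the sketch below builds (but does not send) a request for the batch-analyze endpoint using only the Python standard library. The base URL is a placeholder, the `build_batch_analyze_request` helper is hypothetical, and authentication is omitted because this document does not describe it.

```python
import json
from urllib.request import Request

# Placeholder: replace with the URL under which your instance exposes the API.
BASE_URL = "https://nextcloud.example.com/apps/openregister"


def build_batch_analyze_request(chunk_ids, features, base_url: str = BASE_URL) -> Request:
    """Prepare a POST request for /api/chunks/batch-analyze without sending it."""
    payload = {"chunkIds": list(chunk_ids), "features": list(features)}
    return Request(
        url=f"{base_url}/api/chunks/batch-analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_batch_analyze_request([1, 2, 3], ["language"])
```

Passing the prepared request to `urllib.request.urlopen` would send it; the response format is not specified here, so it is left out of the sketch.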

Use Cases

1. GDPR Compliance Audit

Scenario: Organization needs to identify all personal data.

Process:

Enable entity extraction with hybrid method
Process all existing files and objects
System detects entities and their relationships:
- Persons
- Email addresses (linked to persons)
- Phone numbers (linked to persons)
- Organizations (with their own contact information)
Generate GDPR report showing:
- All persons with complete profiles
- All contact information per person
- Organizations and their contact information
- All document locations
Review and verify entities and relationships
Prepare comprehensive data subject access requests

Example Output:

Data Subject: John Doe

Contact Information:
- Phone: +31612345678 (primary)
- Phone: +31687654321 (alternate)
- Email: john.doe@example.com (primary)
- Email: j.doe@company.com (alternate)
- Address: 123 Main St, Amsterdam

Found in:
- 15 documents
- 45 chunks
- 3 email threads

2. Multi-Language Content Management

Scenario: Organization has content in multiple languages.

Process:

Enable language detection
Process all content
Filter searches by language
Route to appropriate language models for AI
Generate language distribution reports

3. Accessibility Compliance

Scenario: Government agency must ensure plain language.

Process:

Enable language level assessment
Set target level (e.g., B1 / Grade 8)
Process all public-facing documents
Identify documents exceeding target complexity
Flag for simplification
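
The comparison against a target level can be sketched as below for the CEFR scale. The `sources_to_simplify` function, the chunk dictionary keys, and the rule that a single too-difficult chunk flags the whole source are assumptions made for this example.

```python
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def exceeds_target(level: str, target: str) -> bool:
    """True when `level` is more difficult than `target` on the CEFR scale."""
    return CEFR_ORDER.index(level) > CEFR_ORDER.index(target)


def sources_to_simplify(chunks: list, target: str = "B1") -> list:
    """Return the IDs of sources with at least one chunk above the target level."""
    flagged = set()
    for chunk in chunks:
        level = chunk.get("languageLevel")
        # Chunks without a CEFR level (unassessed or another scale) are skipped.
        if level in CEFR_ORDER and exceeds_target(level, target):
            flagged.add(chunk["sourceId"])
    return sorted(flagged)


# Hypothetical chunk metadata.
sample_chunks = [
    {"sourceId": 10, "languageLevel": "A2"},
    {"sourceId": 10, "languageLevel": "C1"},
    {"sourceId": 11, "languageLevel": "B1"},
    {"sourceId": 12, "languageLevel": None},
    {"sourceId": 13, "languageLevel": "B2"},
]
```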

4. Intelligent Search with Context

Scenario: Users need semantic search across emails and documents.

Process:

Enable chunking for files, objects, and emails
Generate embeddings for vector search
Enable language detection for result filtering
User searches: 'budget discussion with John'
System finds:
- Email chunks mentioning budget + John
- Document chunks about budget meetings
- Chat messages discussing budget
Results ranked by relevance and language preference

Performance Considerations

Processing Times

Chunking: ~50-100ms per 100KB of text
Language Detection: ~10-50ms per chunk (local) or ~100-200ms (API)
Language Level: ~20-100ms per chunk (formula) or ~200-500ms (LLM)
Entity Extraction: ~100-300ms per chunk (patterns) or ~500-2000ms (LLM)

Recommendations

Batch Processing: Process enhancements in background jobs
Selective Enhancement: Only enable features you need
Confidence Thresholds: Higher thresholds = fewer false positives but may miss entities
Hybrid Approach: Use fast local methods first, then validate with LLM only when uncertain

Database Schema

Updated Chunks Table

ALTER TABLE oc_openregister_file_texts ADD COLUMN language VARCHAR(10) DEFAULT NULL;
ALTER TABLE oc_openregister_file_texts ADD COLUMN language_level VARCHAR(20) DEFAULT NULL;

-- Note: Chunks currently stored in chunks_json field
-- Future migration will separate into dedicated chunk table

New Entity Table

CREATE TABLE oc_openregister_entities (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    uuid VARCHAR(255) NOT NULL UNIQUE,
    type VARCHAR(50) NOT NULL,
    value TEXT NOT NULL,
    category VARCHAR(50) NOT NULL,
    metadata JSON,
    detected_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    INDEX idx_type (type),
    INDEX idx_category (category),
    INDEX idx_detected (detected_at)
);

New Entity Relation Table

CREATE TABLE oc_openregister_entity_relations (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    entity_id BIGINT NOT NULL,
    chunk_id BIGINT NOT NULL,
    position_start INT NOT NULL,
    position_end INT NOT NULL,
    confidence DECIMAL(3,2) NOT NULL,
    detection_method VARCHAR(50) NOT NULL,
    context TEXT,
    anonymized BOOLEAN DEFAULT FALSE,
    anonymized_value VARCHAR(255),
    created_at DATETIME NOT NULL,
    FOREIGN KEY (entity_id) REFERENCES oc_openregister_entities(id) ON DELETE CASCADE,
    -- Requires the dedicated chunk table (oc_openregister_chunks) from the planned migration
    FOREIGN KEY (chunk_id) REFERENCES oc_openregister_chunks(id) ON DELETE CASCADE,
    INDEX idx_entity (entity_id),
    INDEX idx_chunk (chunk_id),
    INDEX idx_confidence (confidence),
    INDEX idx_anonymized (anonymized)
);

Security & Privacy

Data Protection

Entity Storage: Entities are stored encrypted at rest
Access Control: GDPR register access requires admin privileges
Audit Trail: All entity access is logged
Retention: Configurable retention periods for entity data

External Service Considerations

When using external services (Presidio, Dolphin, etc.):

Data Transmission: Consider data sensitivity before sending to external APIs
Compliance: Ensure services comply with GDPR, HIPAA, or relevant regulations
Contracts: Establish data processing agreements
Fallback: Always have local method fallback if service unavailable

Future Enhancements

Anonymization Engine: Automatic replacement of entities in source documents
Pseudonymization: Consistent fake data generation for testing
Consent Tracking: Link entities to user consent records
Right to Erasure: Automated entity removal across all chunks
Cross-Reference Detection: Link related entities (e.g., person + their email)
Sentiment Analysis: Per-chunk sentiment scoring
Topic Modeling: Automatic categorization of chunks by topic
Summarization: LLM-generated summaries of long documents

Conclusion

The enhanced text extraction system provides a comprehensive foundation for:

Searchability: Full-text and semantic search
Compliance: GDPR entity tracking and future anonymization
Accessibility: Language and readability analysis
Intelligence: AI-ready chunked content for RAG and analysis

By processing both files and objects into a unified chunk structure, OpenRegister enables powerful cross-content search, analysis, and compliance features while maintaining flexibility in processing methods and preparing for future capabilities.

📚 See Also:

Text Extraction, Vectorization & Named Entity Recognition - Unified documentation covering text extraction, vectorization, and NER in one place
Text Extraction Sources - Detailed source-specific documentation
