Entity Relationship Model - Updated Structure
Overview
This document describes the updated entity relationship model. The changes, made in response to user feedback, give GDPR entity management a more efficient and intuitive structure.
Key Changes
1. BelongsTo Field on Entity (Many-to-One)
Instead of: Separate EntityLink table for all relationships (many-to-many)
Now: Direct belongs_to_entity_id field on each entity (many-to-one)
Rationale: Most contact information belongs to exactly ONE person or organization, making a foreign key more efficient than a join table.
2. Role Field on EntityRelation
Added: role field to track entity context for GDPR compliance
Purpose: Determine if an entity occurrence requires anonymization:
- public_figure: May not require anonymization
- employee: In official capacity, may not require anonymization
- private_individual: Always requires anonymization
- customer, contractor, author, recipient, mentioned: Context-dependent (anonymized by default unless the context shows otherwise)
3. Source Tracking on EntityRelation
Added: file_id, object_id, email_id fields to track ultimate source
Purpose:
- Chunks may change over time (re-chunking, updates)
- GDPR requests need original source documents
- Anonymization must trace back to original files
- Audit trails require source document references
4. Source Tracking on Chunk
Clarified: source_type and source_id track the ultimate source, not intermediate text entities
Values:
- source_type: 'file', 'object', 'mail', 'chat'
- source_id: The actual file/object/email/chat ID (not the text extraction record ID)
Updated Entity Structure
Entity Table Schema
CREATE TABLE oc_openregister_entities (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
uuid VARCHAR(255) NOT NULL UNIQUE,
type VARCHAR(50) NOT NULL,
value TEXT NOT NULL,
category VARCHAR(50) NOT NULL,
belongs_to_entity_id BIGINT, -- NEW: Direct parent reference
metadata JSON,
owner VARCHAR(255),
organisation VARCHAR(255),
detected_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_type (type),
INDEX idx_category (category),
INDEX idx_belongs_to (belongs_to_entity_id), -- NEW: Index for parent lookups
FOREIGN KEY (belongs_to_entity_id) REFERENCES oc_openregister_entities(id) ON DELETE SET NULL
);
EntityRelation Table Schema
CREATE TABLE oc_openregister_entity_relations (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
entity_id BIGINT NOT NULL,
chunk_id BIGINT NOT NULL,
role VARCHAR(50), -- NEW: Entity role/context
file_id BIGINT, -- NEW: Original file source
object_id BIGINT, -- NEW: Original object source
email_id BIGINT, -- NEW: Original email source
position_start INT NOT NULL,
position_end INT NOT NULL,
confidence DECIMAL(3,2) NOT NULL,
detection_method VARCHAR(50) NOT NULL,
context TEXT,
anonymized BOOLEAN NOT NULL DEFAULT FALSE,
anonymized_value VARCHAR(255),
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
INDEX idx_role (role), -- NEW: Query by role
INDEX idx_file (file_id), -- NEW: Find entities in file
INDEX idx_object (object_id), -- NEW: Find entities in object
INDEX idx_email (email_id), -- NEW: Find entities in email
FOREIGN KEY (entity_id) REFERENCES oc_openregister_entities(id) ON DELETE CASCADE,
FOREIGN KEY (chunk_id) REFERENCES oc_openregister_chunks(id) ON DELETE CASCADE
);
Relationship Pattern
Parent-Child (BelongsTo)
Person: John Doe (id=1)
├─ Phone: +31612345678 (id=2, belongs_to_entity_id=1)
├─ Phone: +31687654321 (id=3, belongs_to_entity_id=1)
├─ Email: john.doe@example.com (id=4, belongs_to_entity_id=1)
├─ Email: j.doe@company.com (id=5, belongs_to_entity_id=1)
└─ Address: 123 Main St, Amsterdam (id=6, belongs_to_entity_id=1)
Organization: Acme Corp (id=10)
├─ Phone: +31201234567 (id=11, belongs_to_entity_id=10)
├─ Email: info@acme.com (id=12, belongs_to_entity_id=10)
├─ Email: sales@acme.com (id=13, belongs_to_entity_id=10)
└─ Address: 456 Business Park (id=14, belongs_to_entity_id=10)
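The parent-child layout above can be rebuilt from flat rows with a single pass over belongs_to_entity_id. The following is a minimal sketch, not the application's actual code: the rows are hypothetical in-memory arrays, the function name groupChildrenByParent is illustrative, and the type values 'phone' and 'organization' are assumptions (only 'person' and 'email' appear in the queries in this document).

```php
<?php
declare(strict_types=1);

// Hypothetical rows mirroring the id, type, value and
// belongs_to_entity_id columns of oc_openregister_entities.
// The 'phone' and 'organization' type strings are assumptions.
$rows = [
    ['id' => 1,  'type' => 'person',       'value' => 'John Doe',             'belongs_to_entity_id' => null],
    ['id' => 2,  'type' => 'phone',        'value' => '+31612345678',         'belongs_to_entity_id' => 1],
    ['id' => 4,  'type' => 'email',        'value' => 'john.doe@example.com', 'belongs_to_entity_id' => 1],
    ['id' => 10, 'type' => 'organization', 'value' => 'Acme Corp',            'belongs_to_entity_id' => null],
    ['id' => 12, 'type' => 'email',        'value' => 'info@acme.com',        'belongs_to_entity_id' => 10],
];

/**
 * Group child entity values under the id of the parent they belong to.
 *
 * @param array<int, array<string, mixed>> $rows
 * @return array<int, string[]> parent id => child values
 */
function groupChildrenByParent(array $rows): array
{
    $children = [];
    foreach ($rows as $row) {
        $parentId = $row['belongs_to_entity_id'];
        if ($parentId === null) {
            continue; // top-level entity (person or organization)
        }
        $children[$parentId][] = $row['value'];
    }
    return $children;
}

$tree = groupChildrenByParent($rows);
print_r($tree);
```

Because each child stores exactly one parent id, the grouping needs no join table lookup, which is the efficiency argument made for the BelongsTo field.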
Query Patterns
Get all contact info for a person:
SELECT * FROM oc_openregister_entities
WHERE belongs_to_entity_id = 1;
Get parent entity for a phone number:
SELECT parent.* FROM oc_openregister_entities phone
JOIN oc_openregister_entities parent ON phone.belongs_to_entity_id = parent.id
WHERE phone.id = 2;
Find all persons with a specific email domain:
SELECT DISTINCT parent.* FROM oc_openregister_entities email
JOIN oc_openregister_entities parent ON email.belongs_to_entity_id = parent.id
WHERE email.type = 'email'
AND email.value LIKE '%@acme.com'
AND parent.type = 'person';
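From PHP, the email domain query would normally be run with a bound parameter rather than a literal pattern. The sketch below is only an illustration under stated assumptions: an in-memory SQLite database (via the pdo_sqlite extension) stands in for the real database, the table is reduced to four columns, and findPersonsByEmailDomain is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

// Assumption: in-memory SQLite stands in for the production database,
// using a reduced version of the entities table.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec(
    'CREATE TABLE oc_openregister_entities (
        id INTEGER PRIMARY KEY,
        type TEXT NOT NULL,
        value TEXT NOT NULL,
        belongs_to_entity_id INTEGER REFERENCES oc_openregister_entities(id)
    )'
);

$insert = $pdo->prepare(
    'INSERT INTO oc_openregister_entities (id, type, value, belongs_to_entity_id)
     VALUES (?, ?, ?, ?)'
);
$seed = [
    [1, 'person', 'John Doe', null],
    [5, 'email', 'j.doe@company.com', 1],
    [10, 'organization', 'Acme Corp', null], // type string is an assumption
    [12, 'email', 'info@acme.com', 10],
];
foreach ($seed as $row) {
    $insert->execute($row);
}

/**
 * Return the names of persons owning an email address in the given domain.
 *
 * @return string[]
 */
function findPersonsByEmailDomain(PDO $pdo, string $domain): array
{
    $stmt = $pdo->prepare(
        "SELECT DISTINCT parent.value
         FROM oc_openregister_entities email
         JOIN oc_openregister_entities parent
           ON email.belongs_to_entity_id = parent.id
         WHERE email.type = 'email'
           AND email.value LIKE :pattern
           AND parent.type = 'person'"
    );
    // The domain is bound as a parameter, never concatenated into the SQL.
    $stmt->execute(['pattern' => '%@' . $domain]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$companyPersons = findPersonsByEmailDomain($pdo, 'company.com');
$acmePersons = findPersonsByEmailDomain($pdo, 'acme.com');
print_r($companyPersons);
print_r($acmePersons);
```

In this sample data the acme.com address belongs to an organization, so the parent.type filter excludes it and the second call returns no rows.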
Role-Based Anonymization
Example Detection
Text: "Contact John Doe (Sales Manager) at +31612345678"
Entities Detected:
1. Person: John Doe (id=1)
2. Phone: +31612345678 (id=2, belongs_to_entity_id=1)
EntityRelation Records:
1. entity_id=1, chunk_id=123, role='employee', position_start=8, position_end=16
2. entity_id=2, chunk_id=123, role='employee', position_start=36, position_end=48
Anonymization Decision:
- John Doe: role='employee' → May not require anonymization (business context)
- Phone: role='employee' → May not require anonymization (business contact)
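The position_start and position_end values in the records above can be derived from the chunk text itself. The following is a minimal sketch, assuming zero-based offsets with an exclusive end (the convention that matches John Doe at 8 to 16). locateEntity is a hypothetical helper, and it uses byte offsets, so multibyte text would need the mb_* string functions instead.

```php
<?php
declare(strict_types=1);

/**
 * Locate the first occurrence of a detected value inside chunk text.
 * Assumption: offsets are zero-based and position_end is exclusive.
 *
 * @return array{position_start: int, position_end: int}|null
 */
function locateEntity(string $chunkText, string $value): ?array
{
    $start = strpos($chunkText, $value);
    if ($start === false) {
        return null; // value does not occur in this chunk
    }
    return [
        'position_start' => $start,
        'position_end'   => $start + strlen($value),
    ];
}

$text = 'Contact John Doe (Sales Manager) at +31612345678';
$person = locateEntity($text, 'John Doe');
$phone = locateEntity($text, '+31612345678');

printf("person: %d-%d\n", $person['position_start'], $person['position_end']);
printf("phone: %d-%d\n", $phone['position_start'], $phone['position_end']);
```

With an exclusive end, substr($text, $start, $end - $start) returns exactly the detected value, which keeps anonymization replacements simple.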
Example: Private Individual
Text: "Please forward this to Jane Smith (customer) at jane.smith@gmail.com"
Entities Detected:
1. Person: Jane Smith (id=20)
2. Email: jane.smith@gmail.com (id=21, belongs_to_entity_id=20)
EntityRelation Records:
1. entity_id=20, chunk_id=456, role='customer', position_start=23, position_end=33
2. entity_id=21, chunk_id=456, role='customer', position_start=48, position_end=68
Anonymization Decision:
- Jane Smith: role='customer' → Requires anonymization (private individual)
- Email: role='customer' → Requires anonymization (personal email)
Source Tracking Benefits
Scenario: File Re-chunking
Initial State:
- File: contract.pdf (file_id=100)
- FileText: id=200 (extraction record)
- Chunks: Created from file_id=100
- Entities found in chunks
After Re-chunking:
- File: contract.pdf (file_id=100) - UNCHANGED
- FileText: id=200 - Updated with new chunks
- Chunks: NEW chunks created
- EntityRelations: Still reference file_id=100 (original source)
Benefit: Can always trace entities back to original file, even after re-chunking
Scenario: GDPR Request
Request: 'Give me all documents containing John Doe'
Query:
SELECT DISTINCT
er.file_id,
er.object_id,
er.email_id,
er.role,
er.created_at
FROM oc_openregister_entities e
JOIN oc_openregister_entity_relations er ON e.id = er.entity_id
WHERE e.value = 'John Doe' AND e.type = 'person';
Result:
file_id | object_id | email_id | role               | created_at
--------|-----------|----------|--------------------|-----------------
100     | NULL      | NULL     | employee           | 2024-01-15 10:30
NULL    | 500       | NULL     | mentioned          | 2024-02-20 14:15
NULL    | NULL      | 300      | recipient          | 2024-03-10 09:00
150     | NULL      | NULL     | private_individual | 2024-04-05 16:45
Interpretation:
- contract.pdf (file 100): John Doe as employee
- Customer record (object 500): John Doe mentioned
- Email thread (email 300): John Doe as recipient
- Personal letter (file 150): John Doe as private individual (requires anonymization)
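The interpretation step can be sketched in PHP by resolving each result row to its single populated source column and applying the same default as requiresAnonymization() in the EntityRelation class. This is an illustrative sketch only: resolveSource and buildReport are hypothetical helpers, and the rows are hard-coded from the result table above rather than fetched from a database.

```php
<?php
declare(strict_types=1);

// Rows copied from the result table above (created_at omitted).
$rows = [
    ['file_id' => 100,  'object_id' => null, 'email_id' => null, 'role' => 'employee'],
    ['file_id' => null, 'object_id' => 500,  'email_id' => null, 'role' => 'mentioned'],
    ['file_id' => null, 'object_id' => null, 'email_id' => 300,  'role' => 'recipient'],
    ['file_id' => 150,  'object_id' => null, 'email_id' => null, 'role' => 'private_individual'],
];

/**
 * Return the first populated source column as a type/id pair.
 *
 * @return array{type: string, id: int}|null
 */
function resolveSource(array $row): ?array
{
    $columns = ['file' => 'file_id', 'object' => 'object_id', 'email' => 'email_id'];
    foreach ($columns as $type => $column) {
        if ($row[$column] !== null) {
            return ['type' => $type, 'id' => $row[$column]];
        }
    }
    return null;
}

/**
 * Build one report line per row. Roles other than public_figure and
 * employee are flagged for anonymization, mirroring the class default.
 */
function buildReport(array $rows): array
{
    $nonPrivateRoles = ['public_figure', 'employee'];
    $report = [];
    foreach ($rows as $row) {
        $source = resolveSource($row);
        if ($source === null) {
            continue; // no traceable source document
        }
        $report[] = [
            'source' => $source['type'] . ':' . $source['id'],
            'role' => $row['role'],
            'requires_anonymization' => !in_array($row['role'], $nonPrivateRoles, true),
        ];
    }
    return $report;
}

$report = buildReport($rows);
print_r($report);
```

Under this default the context-dependent roles (mentioned, recipient) are also flagged, so a reviewer would clear them explicitly rather than have them slip through.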
PHP Entity Class Updates
GdprEntity Class
class GdprEntity extends Entity implements JsonSerializable
{
protected ?int $belongsToEntityId = null; // NEW
/**
* Get all child entities (contact info) for this entity
*
* @param EntityMapper $mapper
* @return GdprEntity[]
*/
public function getChildren(EntityMapper $mapper): array
{
return $mapper->findByBelongsTo($this->id);
}
/**
* Get the parent entity this belongs to
*
* @param EntityMapper $mapper
* @return GdprEntity|null
*/
public function getParent(EntityMapper $mapper): ?GdprEntity
{
if ($this->belongsToEntityId === null) {
return null;
}
return $mapper->find($this->belongsToEntityId);
}
}
EntityRelation Class
class EntityRelation extends Entity implements JsonSerializable
{
protected ?string $role = null; // NEW
protected ?int $fileId = null; // NEW
protected ?int $objectId = null; // NEW
protected ?int $emailId = null; // NEW
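// Role constants
public const ROLE_PUBLIC_FIGURE = 'public_figure';
public const ROLE_EMPLOYEE = 'employee';
public const ROLE_PRIVATE_INDIVIDUAL = 'private_individual';
public const ROLE_CUSTOMER = 'customer';
// Remaining context-dependent roles from the role list
public const ROLE_CONTRACTOR = 'contractor';
public const ROLE_AUTHOR = 'author';
public const ROLE_RECIPIENT = 'recipient';
public const ROLE_MENTIONED = 'mentioned';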
/**
* Check if this entity occurrence requires anonymization
*
* @return bool
*/
public function requiresAnonymization(): bool
{
$nonPrivateRoles = [
self::ROLE_PUBLIC_FIGURE,
self::ROLE_EMPLOYEE,
];
// Strict comparison: a null or unknown role never matches and falls through to the default
if (in_array($this->role, $nonPrivateRoles, true)) {
return false;
}
return true; // Default: require anonymization
}
/**
* Get source type and ID
*
* @return array{type: string, id: int}|null
*/
public function getSource(): ?array
{
// Compare against null explicitly instead of relying on truthiness
if ($this->fileId !== null) return ['type' => 'file', 'id' => $this->fileId];
if ($this->objectId !== null) return ['type' => 'object', 'id' => $this->objectId];
if ($this->emailId !== null) return ['type' => 'email', 'id' => $this->emailId];
return null;
}
}
Migration from EntityLink Table
Current State (Before)
EntityLink table with many-to-many relationships:
- source_entity_id
- target_entity_id
- relationship_type ('belongs_to', 'associated_with', etc.)