Named Entity Recognition (NER) & Natural Language Processing (NLP)
Overview
OpenRegister uses Named Entity Recognition (NER) and Natural Language Processing (NLP) techniques to automatically identify and extract sensitive information from documents. The results support GDPR compliance, data classification, and intelligent search.
What is NLP?
Natural Language Processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and process human language in a meaningful way.
Core NLP Capabilities in OpenRegister
- Text Extraction: Convert files (PDF, DOCX, images) into machine-readable text
- Language Detection: Identify the language of content (English, Dutch, German, etc.)
- Language Level Assessment: Determine reading difficulty (A1-C2 CEFR levels, Flesch-Kincaid scores)
- Named Entity Recognition (NER): Identify persons, organizations, locations, dates, etc.
- Text Chunking: Split documents into semantic units for better processing
- Text Classification: Categorize documents by type, topic, or sensitivity
- Sentiment Analysis: Understand emotional tone (future feature)
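To make the text chunking task more concrete, the following is a minimal sketch that splits text at sentence boundaries and packs whole sentences into chunks up to a maximum length. The function name `chunkText` and the 200-character default are illustrative assumptions, not OpenRegister's actual implementation, and the sentence splitter is a simple punctuation heuristic.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pack whole sentences into chunks of at most
 * $maxLength characters. A single sentence longer than $maxLength
 * is kept as its own (oversized) chunk rather than being cut.
 */
function chunkText(string $text, int $maxLength = 200): array
{
    // Split on whitespace that follows ., ! or ? so the punctuation
    // stays attached to its sentence.
    $sentences = preg_split('/(?<=[.!?])\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $chunks = [];
    $current = '';
    foreach ($sentences as $sentence) {
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;
        if ($current !== '' && mb_strlen($candidate) > $maxLength) {
            // Adding this sentence would overflow: close the current chunk.
            $chunks[] = $current;
            $current = $sentence;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}
```

Splitting on sentence boundaries rather than at fixed offsets keeps names and addresses from being cut in half, which matters when the chunks are later passed to an NER engine.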
What is NER?
Named Entity Recognition (NER) is a specific NLP task that locates and classifies named entities (proper nouns and important terms) in text into predefined categories.
Entity Categories
OpenRegister recognizes these entity types for GDPR compliance:
| Entity Type | Description | Examples | GDPR Category |
|---|---|---|---|
| PERSON | Individual names | John Doe, Jane Smith | Personal Data |
| EMAIL | Email addresses | john@example.com | Personal Data |
| PHONE | Phone numbers | +31 6 12345678 | Personal Data |
| ADDRESS | Physical addresses | 123 Main St, Amsterdam | Personal Data |
| ORGANIZATION | Company/org names | Acme Corporation | Business Data |
| LOCATION | Geographic locations | Amsterdam, Netherlands | Contextual Data |
| DATE | Dates and times | 2025-01-15, January 15th | Temporal Data |
| IBAN | Bank account numbers | NL91 ABNA 0417 1643 00 | Sensitive PII |
| SSN | Social security numbers | 123-45-6789 | Sensitive PII |
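The mapping in the table above can be sketched as a simple lookup. This is an illustration only: the constant `GDPR_CATEGORIES`, the helper `groupByGdprCategory`, and the `Unclassified` fallback are hypothetical names, and the keys are assumed to match the Entity Type column (with `EMAIL` assumed for the email row).

```php
<?php
declare(strict_types=1);

// Illustrative lookup mirroring the entity table; not OpenRegister's actual data model.
const GDPR_CATEGORIES = [
    'PERSON'       => 'Personal Data',
    'EMAIL'        => 'Personal Data',
    'PHONE'        => 'Personal Data',
    'ADDRESS'      => 'Personal Data',
    'ORGANIZATION' => 'Business Data',
    'LOCATION'     => 'Contextual Data',
    'DATE'         => 'Temporal Data',
    'IBAN'         => 'Sensitive PII',
    'SSN'          => 'Sensitive PII',
];

/**
 * Hypothetical helper: group detected entity values by GDPR category.
 * Each entity is assumed to be an array with 'type' and 'value' keys.
 */
function groupByGdprCategory(array $entities): array
{
    $grouped = [];
    foreach ($entities as $entity) {
        // Unknown types fall into an explicit bucket instead of being dropped.
        $category = GDPR_CATEGORIES[$entity['type']] ?? 'Unclassified';
        $grouped[$category][] = $entity['value'];
    }

    return $grouped;
}
```

Grouping by category makes it straightforward to treat `Sensitive PII` findings differently from contextual data such as locations or dates.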
NER Process Flow
NER Implementation Options
OpenRegister supports multiple NER engines to balance accuracy, privacy, and infrastructure requirements:
1. MITIE PHP Library (Local - Basic Setup)
MITIE (MIT Information Extraction) is an open-source NER library from MIT that runs entirely locally.
Recommended for:
- Development environments
- Privacy-sensitive deployments with no external API calls
- Basic entity recognition needs
- Low-resource environments
Advantages:
- ✅ No external dependencies
- ✅ Complete privacy (all processing local)
- ✅ No API costs
- ✅ Fast processing
- ✅ Works offline
Limitations:
- ⚠️ Lower accuracy than cloud services
- ⚠️ Requires PHP extension compilation
- ⚠️ Limited language support
- ⚠️ Pattern-based detection (regex + ML models)
Installation:
# Build and install MITIE from source
git clone https://github.com/mit-nlp/MITIE.git
cd MITIE
mkdir build && cd build
cmake ..
cmake --build . --config Release --target install
# Enable the PHP extension (requires root; adjust 8.1 to your installed PHP version)
echo "extension=mitie.so" > /etc/php/8.1/mods-available/mitie.ini
phpenmod mitie
Usage Example:
use OCA\OpenRegister\Service\NerService;
// MITIE will detect entities using local models
$nerService = $this->container->get(NerService::class);
$entities = $nerService->extractEntities($text, 'mitie');
foreach ($entities as $entity) {
    echo "Found {$entity['type']}: {$entity['value']} (confidence: {$entity['confidence']})\n";
}
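Because each result carries a confidence score, callers will often want to discard weak matches before acting on them. The sketch below assumes the entity shape used in the usage example (`type`, `value`, `confidence`); the helper name `filterByConfidence` and the 0.8 default threshold are hypothetical and would need tuning per engine.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: keep entities at or above a confidence threshold,
 * ordered from most to least confident.
 */
function filterByConfidence(array $entities, float $minConfidence = 0.8): array
{
    // Entities without a confidence score are treated as 0.0.
    $kept = array_filter(
        $entities,
        static fn (array $e): bool => ($e['confidence'] ?? 0.0) >= $minConfidence
    );

    // usort() also reindexes the array, so the result has keys 0..n-1.
    usort(
        $kept,
        static fn (array $a, array $b): int => ($b['confidence'] ?? 0.0) <=> ($a['confidence'] ?? 0.0)
    );

    return $kept;
}
```

A lower threshold favors recall (fewer missed entities), while a higher one favors precision (fewer false positives); for GDPR screening, missing personal data is usually the more costly error.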
Detection Methods:
- Pattern matching (regex) for emails, phones, IBANs
- Statistical ML models for persons and organizations
- Dictionary-based location detection
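The pattern-matching step can be sketched in plain PHP, independent of MITIE. The example below is a simplified illustration, not the patterns OpenRegister ships: the function names `isValidIban` and `extractPatternEntities` are hypothetical, the regular expressions are deliberately loose, and phone numbers are omitted for brevity. IBAN candidates are additionally verified with the standard mod-97 checksum to reduce false positives.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: validate an IBAN using the mod-97 checksum. */
function isValidIban(string $iban): bool
{
    $iban = strtoupper((string) preg_replace('/\s+/', '', $iban));
    if (!preg_match('/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/', $iban)) {
        return false;
    }

    // Move the first four characters to the end, then read letters as A=10 ... Z=35.
    $rearranged = substr($iban, 4) . substr($iban, 0, 4);
    $remainder = 0;
    foreach (str_split($rearranged) as $char) {
        $value = ctype_digit($char) ? (int) $char : ord($char) - 55;
        // Fold one character at a time so the number never exceeds integer range.
        $shift = $value < 10 ? 10 : 100;
        $remainder = ($remainder * $shift + $value) % 97;
    }

    return $remainder === 1;
}

/** Hypothetical helper: find email addresses and IBANs with regular expressions. */
function extractPatternEntities(string $text): array
{
    // Simplified patterns for illustration; production rules should be stricter.
    $patterns = [
        'EMAIL' => '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/',
        'IBAN'  => '/\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b/',
    ];

    $entities = [];
    foreach ($patterns as $type => $pattern) {
        preg_match_all($pattern, $text, $matches);
        foreach ($matches[0] as $value) {
            if ($type === 'IBAN' && !isValidIban($value)) {
                continue; // Drop look-alikes that fail the checksum.
            }
            $entities[] = ['type' => $type, 'value' => $value];
        }
    }

    return $entities;
}
```

Pairing a permissive pattern with a checksum is a common design for structured identifiers: the regex finds candidates cheaply, and the checksum rejects strings that merely look like an IBAN.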
2. Microsoft Presidio (Production - Recommended)
Presidio is Microsoft's open-source framework for detecting and anonymizing personally identifiable information (PII).
Recommended for:
- ✅ Production deployments
- ✅ GDPR compliance requirements
- ✅ Multi-language support needed
- ✅ High accuracy requirements
Advantages:
- ✅ High accuracy (90-98% precision)
- ✅ Multi-language support (50+ languages)
- ✅ PII-specific focus (built for GDPR/CCPA)
- ✅ Anonymization built-in
- ✅ Regular updates and improvements