Presidio Integration
Integrate OpenRegister with Microsoft Presidio Analyzer for advanced entity extraction and PII (Personally Identifiable Information) detection in support of GDPR compliance.
Overview
Presidio Analyzer is Microsoft's open-source PII detection service that provides:
- ✅ High Accuracy: 90-95% precision for PII detection (see Accuracy Comparison)
- ✅ Multi-language Support: 50+ languages including Dutch
- ✅ GDPR Compliance: Designed to support GDPR/CCPA requirements
- ✅ Self-hosted: Run locally in Docker
- ✅ Extensible: Add custom recognizers
Prerequisites
- Nextcloud 28+ with OpenRegister installed
- Docker and Docker Compose
- At least 2GB RAM
- Presidio Analyzer container (included in docker-compose.yml)
Quick Start
Step 1: Start Presidio Container
Presidio is included in the docker-compose configuration:
# Start all services including Presidio
docker-compose up -d
# Or specifically start Presidio
docker-compose up -d presidio-analyzer
Step 2: Verify Presidio is Running
# Check health
curl http://localhost:5001/health
# Expected response (plain text from the official analyzer image):
# Presidio Analyzer service is up
Step 3: Configure OpenRegister
Presidio is preconfigured once the container is running. To review or adjust the settings, open the OpenRegister settings:
Settings → OpenRegister → Text Analysis → Entity Extraction
- Method: Select "Presidio"
- Presidio URL: http://presidio-analyzer:5001 (from the Nextcloud container)
- Default Language: Select your primary language (e.g., Dutch)
- Supported Languages: Select languages to support
Configuration Details
Presidio Service Configuration
presidio-analyzer:
  image: mcr.microsoft.com/presidio-analyzer:latest
  container_name: openregister-presidio-analyzer
  restart: always
  ports:
    - "5001:5001"
  environment:
    - GRPC_PORT=5001
    - LOG_LEVEL=INFO
    # Multi-language support including Dutch
    - PRESIDIO_ANALYZER_LANGUAGES=en,nl,de,fr,es
  deploy:
    resources:
      limits:
        memory: 2G
      reservations:
        memory: 512M
  healthcheck:
    test: ["CMD-SHELL", "curl -f http://localhost:5001/health || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 3
Accessing Presidio
Important: Docker Container Communication
- ✅ From Nextcloud container: http://presidio-analyzer:5001
- ✅ From host machine: http://localhost:5001
- ❌ NOT: http://localhost:5001 (from the Nextcloud container; use the container name instead)
Supported Entity Types
Presidio detects the following entity types:
Personal Information
- PERSON: Person names
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers
- NRP: Nationality, religious, or political group (this is not a national identification number; detecting Dutch BSN numbers may require a custom recognizer)
Financial Information
- CREDIT_CARD: Credit card numbers
- IBAN_CODE: International bank account numbers
- SWIFT_CODE: SWIFT codes
Location Information
- LOCATION: Geographic locations
- IP_ADDRESS: IP addresses
- URL: Web URLs
Organization Information
- ORGANIZATION: Organization names
Date and Time
- DATE_TIME: Dates and timestamps
Medical and Identity Documents
- MEDICAL_LICENSE: Medical license numbers
- US_PASSPORT: US passport numbers
Custom Entities
- Add custom recognizers for domain-specific entities
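As a sketch of what a domain-specific check might involve, the function below validates a candidate Dutch BSN with the standard 11-test (elfproef). The function name `isValidBsn` is hypothetical, and this is PHP-side validation of a candidate string, not a Presidio recognizer; wiring it into a custom recognizer is left open here.

```php
<?php
// Hypothetical helper: checks a 9-digit BSN candidate with the 11-test (elfproef).
function isValidBsn(string $candidate): bool
{
    // Keep digits only, so separators such as dots or spaces are tolerated.
    $digits = preg_replace('/\D/', '', $candidate);

    // This sketch only accepts the 9-digit form and rejects the all-zero number.
    if (strlen($digits) !== 9 || $digits === '000000000') {
        return false;
    }

    // Weights 9..2 for the first eight digits, -1 for the final digit.
    $weights = [9, 8, 7, 6, 5, 4, 3, 2, -1];
    $sum = 0;
    foreach ($weights as $i => $weight) {
        $sum += $weight * (int) $digits[$i];
    }

    // A valid number has a weighted sum divisible by 11.
    return $sum % 11 === 0;
}

var_dump(isValidBsn('111222333')); // bool(true)
var_dump(isValidBsn('123456789')); // bool(false)
```

A checksum like this can reduce false positives compared with a bare nine-digit pattern, since most random digit strings fail the test.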
Use Cases
1. GDPR Compliance
Automatically detect and track PII in documents:
use OCA\OpenRegister\Service\NerService;
$nerService = $this->container->get(NerService::class);
// Extract entities from Dutch text
$dutchText = "Jan de Vries woont in Amsterdam en zijn telefoonnummer is 06-12345678.";
$entities = $nerService->extractEntities($dutchText, 'presidio', [
    'language' => 'nl'
]);
foreach ($entities as $entity) {
    echo "Type: {$entity['type']}\n";
    echo "Value: {$entity['value']}\n";
    echo "Confidence: {$entity['confidence']}\n";
    echo "Position: {$entity['start']}-{$entity['end']}\n\n";
}
Output:
Type: PERSON
Value: Jan de Vries
Confidence: 0.85
Position: 0-12
Type: LOCATION
Value: Amsterdam
Confidence: 0.85
Position: 22-31
Type: PHONE_NUMBER
Value: 06-12345678
Confidence: 0.95
Position: 58-69
2. Data Subject Access Requests
Generate GDPR reports showing all PII for a person:
// Find all entities for a specific person
$personEntities = $nerService->findEntitiesByValue('Jan de Vries', 'presidio');
// Generate GDPR report
$report = $gdprService->generateDataSubjectReport($personEntities);
3. Automatic Anonymization
Anonymize detected PII in documents:
// Extract entities
$entities = $nerService->extractEntities($text, 'presidio');
// Anonymize text
$anonymizedText = $anonymizationService->anonymize($text, $entities);
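The source does not show how `$anonymizationService` works internally. A minimal sketch of one common approach is to replace each detected span with a placeholder, working from the end of the text backwards so earlier offsets remain valid. The function name `anonymizeText` and the `<TYPE>` placeholder format are assumptions, and the sketch assumes spans do not overlap.

```php
<?php
// Hypothetical helper: replaces each detected span with a "<TYPE>" placeholder.
// Assumes entities carry 'type', 'start' and 'end' (character offsets) and do not overlap.
function anonymizeText(string $text, array $entities): string
{
    // Process the last entity first so that earlier offsets stay valid.
    usort($entities, fn(array $a, array $b): int => $b['start'] <=> $a['start']);

    foreach ($entities as $entity) {
        $before = mb_substr($text, 0, $entity['start'], 'UTF-8');
        $after = mb_substr($text, $entity['end'], null, 'UTF-8');
        $text = $before . '<' . $entity['type'] . '>' . $after;
    }

    return $text;
}

$text = 'Jan de Vries woont in Amsterdam en zijn telefoonnummer is 06-12345678.';
$entities = [
    ['type' => 'PERSON', 'start' => 0, 'end' => 12],
    ['type' => 'LOCATION', 'start' => 22, 'end' => 31],
    ['type' => 'PHONE_NUMBER', 'start' => 58, 'end' => 69],
];

echo anonymizeText($text, $entities), "\n";
// <PERSON> woont in <LOCATION> en zijn telefoonnummer is <PHONE_NUMBER>.
```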
API Usage
Direct API Calls
Test Presidio directly:
# Analyze text for PII
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Jan de Vries woont in Amsterdam",
    "language": "nl",
    "entities": ["PERSON", "LOCATION"]
  }'
Response (abridged; the official analyzer returns a bare JSON array of results):
[
  {
    "entity_type": "PERSON",
    "start": 0,
    "end": 12,
    "score": 0.85,
    "recognition_metadata": {
      "recognizer_name": "SpacyRecognizer"
    }
  },
  {
    "entity_type": "LOCATION",
    "start": 22,
    "end": 31,
    "score": 0.85,
    "recognition_metadata": {
      "recognizer_name": "SpacyRecognizer"
    }
  }
]
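Raw analyzer results contain offsets and scores but not the matched text, whereas the GDPR Compliance example prints `type`, `value`, `confidence`, `start`, and `end`. The sketch below shows one way such a mapping could be done. The function name `mapPresidioResults` is hypothetical, and the sketch accepts a result list either as a bare JSON array or wrapped in an `entities` key, since how OpenRegister's `NerService` actually performs this step is not shown in this document.

```php
<?php
// Hypothetical helper: converts raw analyzer results into the
// type/value/confidence/start/end shape used in the examples.
function mapPresidioResults(string $text, array $response): array
{
    // Accept either a bare list of results or a list under an "entities" key.
    $results = $response['entities'] ?? $response;

    $entities = [];
    foreach ($results as $result) {
        $start = (int) $result['start'];
        $end = (int) $result['end'];
        $entities[] = [
            'type' => $result['entity_type'],
            // Offsets count characters, so use the multibyte-safe substring.
            'value' => mb_substr($text, $start, $end - $start, 'UTF-8'),
            'confidence' => (float) $result['score'],
            'start' => $start,
            'end' => $end,
        ];
    }

    return $entities;
}

$text = 'Jan de Vries woont in Amsterdam';
$json = '[{"entity_type":"PERSON","start":0,"end":12,"score":0.85},'
      . '{"entity_type":"LOCATION","start":22,"end":31,"score":0.85}]';

$mapped = mapPresidioResults($text, json_decode($json, true));
echo $mapped[0]['value'], "\n"; // Jan de Vries
echo $mapped[1]['value'], "\n"; // Amsterdam
```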
Integration with Text Extraction Pipeline
Presidio integrates with OpenRegister's text extraction pipeline and is configured as described below.
Configuration
PHP Configuration
// config/ner_config.php
return [
    'ner_enabled' => true,
    'ner_method' => 'presidio', // Use Presidio for production
    'presidio' => [
        'analyzer_url' => 'http://presidio-analyzer:5001',
        'default_language' => 'nl', // Default to Dutch
        'languages' => ['nl', 'en'], // Support Dutch and English
        'score_threshold' => 0.6, // Minimum confidence score
        'entities' => [
            'PERSON',
            'EMAIL_ADDRESS',
            'PHONE_NUMBER',
            'IBAN_CODE',
            'LOCATION',
            'ORGANIZATION',
            'NRP', // Nationality, religious, or political group (not BSN)
        ]
    ]
];
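To make the effect of `score_threshold` and `entities` concrete, the sketch below applies both settings to a list of extracted entities on the PHP side. Whether OpenRegister filters client-side or forwards these values to Presidio is not stated here, so treat the function `filterEntities` as a hypothetical illustration of the intended semantics.

```php
<?php
// Hypothetical helper: keeps entities that meet the configured threshold
// and whose type appears in the configured entity list.
function filterEntities(array $entities, array $presidioConfig): array
{
    $threshold = $presidioConfig['score_threshold'] ?? 0.0;
    $allowed = $presidioConfig['entities'] ?? [];

    $kept = array_filter(
        $entities,
        function (array $entity) use ($threshold, $allowed): bool {
            // An empty list is treated as "no type restriction" in this sketch.
            $typeOk = $allowed === [] || in_array($entity['type'], $allowed, true);
            return $typeOk && $entity['confidence'] >= $threshold;
        }
    );

    // Re-index so the result is a plain list again.
    return array_values($kept);
}

$config = ['score_threshold' => 0.6, 'entities' => ['PERSON', 'LOCATION']];
$found = [
    ['type' => 'PERSON', 'confidence' => 0.85],
    ['type' => 'LOCATION', 'confidence' => 0.40], // below the threshold
    ['type' => 'URL', 'confidence' => 0.90],      // type not configured
];

$filtered = filterEntities($found, $config);
echo count($filtered), "\n"; // 1
```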
Automatic Language Detection
If your documents contain a mix of languages, detect the language first:
// Detect language
$language = $nerService->detectLanguage($text);
// Use detected language for entity extraction
$entities = $nerService->extractEntities($text, 'presidio', [
    'language' => $language
]);
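A detected language may not be one that Presidio is configured to handle. The sketch below falls back to the default language in that case. The function name `resolveLanguage` is hypothetical, and the return format of `detectLanguage` (for example `nl` versus `nl-NL`) is an assumption, so the code normalizes to a two-letter lowercase code.

```php
<?php
// Hypothetical helper: maps a detected language onto a supported one,
// falling back to the configured default.
function resolveLanguage(?string $detected, array $supported, string $default): string
{
    if ($detected === null || trim($detected) === '') {
        return $default;
    }

    // Normalize values such as "NL" or "nl-NL" to "nl".
    $code = strtolower(substr(trim($detected), 0, 2));

    return in_array($code, $supported, true) ? $code : $default;
}

echo resolveLanguage('nl-NL', ['nl', 'en'], 'nl'), "\n"; // nl
echo resolveLanguage('EN', ['nl', 'en'], 'nl'), "\n";    // en
echo resolveLanguage('fr', ['nl', 'en'], 'nl'), "\n";    // nl
```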
Accuracy Comparison
| Method | Precision | Recall | F1 Score | Speed |
|---|---|---|---|---|
| Presidio | 90-95% | 85-92% | 87-93% | ⚡⚡ Medium |
| MITIE (Local) | 75-85% | 70-80% | 72-82% | ⚡⚡⚡ Fast |
| LLM (GPT-4) | 92-98% | 90-95% | 91-96% | ⚡ Slow |
Definitions:
- Precision: Percentage of detected entities that are correct (low false positives)
- Recall: Percentage of actual entities that were detected (low false negatives)
- F1 Score: Harmonic mean of precision and recall (a single score that balances both; it is not the same as accuracy)
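The definitions above can be computed from counts of true positives, false positives, and false negatives. The sketch below uses made-up counts purely for illustration; they are not measurements of Presidio, and the function name `nerMetrics` is hypothetical.

```php
<?php
// Hypothetical helper: computes precision, recall and F1 from raw counts.
function nerMetrics(int $truePositives, int $falsePositives, int $falseNegatives): array
{
    $detected = $truePositives + $falsePositives; // everything the method reported
    $actual = $truePositives + $falseNegatives;   // everything that was really there

    $precision = $detected > 0 ? $truePositives / $detected : 0.0;
    $recall = $actual > 0 ? $truePositives / $actual : 0.0;

    // Harmonic mean of precision and recall; defined as 0 when both are 0.
    $sum = $precision + $recall;
    $f1 = $sum > 0 ? 2 * $precision * $recall / $sum : 0.0;

    return ['precision' => $precision, 'recall' => $recall, 'f1' => $f1];
}

// Illustrative counts: 90 correct detections, 10 false alarms, 15 misses.
$metrics = nerMetrics(90, 10, 15);
printf("%.3f %.3f %.3f\n", $metrics['precision'], $metrics['recall'], $metrics['f1']);
// 0.900 0.857 0.878
```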
Troubleshooting
Container Won't Start
# Check logs
docker logs openregister-presidio-analyzer
# Common issues:
# 1. Port 5001 already in use
sudo lsof -i :5001
# 2. Insufficient memory
docker stats openregister-presidio-analyzer
# 3. Language models not downloaded
docker exec openregister-presidio-analyzer ls /app/models
Low Accuracy
Solutions:
- Specify the correct language: 'language' => 'nl' for Dutch
- Adjust the score threshold: a lower threshold yields more detections, at the cost of more false positives
- Add custom recognizers for domain-specific entities
- Use hybrid approach with multiple methods
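The source does not define the hybrid approach. One possible interpretation is to run several methods and merge their results, keeping the higher-confidence entity when two spans overlap. The sketch below illustrates that interpretation only; the function name `mergeEntityLists` and the sample data are hypothetical.

```php
<?php
// Hypothetical helper: merges entity lists from several methods. When two
// spans overlap, the entity with the higher confidence is kept.
function mergeEntityLists(array ...$lists): array
{
    $all = array_merge(...$lists);

    // Highest confidence first, so stronger detections claim their span first.
    usort($all, fn(array $a, array $b): int => $b['confidence'] <=> $a['confidence']);

    $kept = [];
    foreach ($all as $candidate) {
        $overlaps = false;
        foreach ($kept as $existing) {
            // Two half-open ranges overlap when each starts before the other ends.
            if ($candidate['start'] < $existing['end'] && $existing['start'] < $candidate['end']) {
                $overlaps = true;
                break;
            }
        }
        if (!$overlaps) {
            $kept[] = $candidate;
        }
    }

    // Return the surviving entities in document order.
    usort($kept, fn(array $a, array $b): int => $a['start'] <=> $b['start']);

    return $kept;
}

$fromPresidio = [
    ['type' => 'PERSON', 'start' => 0, 'end' => 12, 'confidence' => 0.85],
    ['type' => 'LOCATION', 'start' => 22, 'end' => 31, 'confidence' => 0.85],
];
$fromMitie = [
    ['type' => 'PERSON', 'start' => 0, 'end' => 3, 'confidence' => 0.60],
    ['type' => 'ORGANIZATION', 'start' => 40, 'end' => 50, 'confidence' => 0.70],
];

$merged = mergeEntityLists($fromPresidio, $fromMitie);
echo count($merged), "\n"; // 3
```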
Connection Errors from OpenRegister
Problem: OpenRegister can't connect to Presidio.
Solutions:
- Verify that the analyzer URL uses the container name: http://presidio-analyzer:5001
- Check that the containers are on the same Docker network
- Test connection from Nextcloud container:
docker exec <nextcloud-container> curl http://presidio-analyzer:5001/health
Slow Processing
Solutions:
- Process in batches
- Use async processing for large documents
- Cache results for repeated text
- Adjust timeout settings
Performance Optimization
Batch Processing
Process multiple chunks efficiently:
// Process multiple chunks
$chunks = [$chunk1, $chunk2, $chunk3];
$allEntities = $nerService->extractEntitiesBatch($chunks, 'presidio', [
    'language' => 'nl',
    'async' => true
]);
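How the chunks are produced is not shown in the example above. A minimal sketch, assuming a simple character budget per chunk, is to split on whitespace where possible and record each chunk's offset so entity positions can be mapped back to the full document. The names `chunkText` and `shiftEntities` are hypothetical.

```php
<?php
// Hypothetical helper: splits text into chunks of at most $maxChars characters,
// preferring to break after a space, and records each chunk's character offset.
function chunkText(string $text, int $maxChars): array
{
    if ($maxChars < 1) {
        throw new InvalidArgumentException('maxChars must be at least 1');
    }

    $chunks = [];
    $length = mb_strlen($text, 'UTF-8');
    $offset = 0;

    while ($offset < $length) {
        $piece = mb_substr($text, $offset, $maxChars, 'UTF-8');

        // If more text follows, end the chunk after its last space when there is one.
        if ($offset + $maxChars < $length) {
            $lastSpace = mb_strrpos($piece, ' ', 0, 'UTF-8');
            if ($lastSpace !== false && $lastSpace > 0) {
                $piece = mb_substr($piece, 0, $lastSpace + 1, 'UTF-8');
            }
        }

        $chunks[] = ['offset' => $offset, 'text' => $piece];
        $offset += mb_strlen($piece, 'UTF-8');
    }

    return $chunks;
}

// Hypothetical helper: shifts chunk-relative positions to document positions.
function shiftEntities(array $entities, int $offset): array
{
    return array_map(function (array $entity) use ($offset): array {
        $entity['start'] += $offset;
        $entity['end'] += $offset;
        return $entity;
    }, $entities);
}

$chunks = chunkText('Jan de Vries woont in Amsterdam', 15);
echo count($chunks), "\n"; // 3
```

Splitting on spaces can still separate a multi-word name across two chunks, so a production chunker would likely split on sentence boundaries or use overlapping chunks.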
Caching
Cache entity extraction results:
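// Cache entities for repeated text
$cacheKey = md5($text);
$entities = $cache->get($cacheKey);
// Compare against null (assuming the cache returns null on a miss, as
// Nextcloud's ICache does), so that an empty result list is also a cache hit
if ($entities === null) {
    $entities = $nerService->extractEntities($text, 'presidio');
    $cache->set($cacheKey, $entities, 3600); // Cache for 1 hour
}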
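A key derived from the text alone returns the same cached entities regardless of the extraction method or language used. The sketch below builds a key that also covers those inputs. The function name `nerCacheKey` and the `ner_` prefix are hypothetical.

```php
<?php
// Hypothetical helper: builds a cache key from the text, the extraction
// method and the options, so different settings do not share cache entries.
function nerCacheKey(string $text, string $method, array $options = []): string
{
    // Sort options so that their order does not change the key.
    ksort($options);

    // A NUL separator keeps the three parts from running into each other.
    $payload = $method . "\0" . json_encode($options) . "\0" . $text;

    return 'ner_' . hash('sha256', $payload);
}

$dutchKey = nerCacheKey('Jan de Vries', 'presidio', ['language' => 'nl']);
$englishKey = nerCacheKey('Jan de Vries', 'presidio', ['language' => 'en']);
var_dump($dutchKey === $englishKey); // bool(false)
```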
Further Reading
- Presidio Setup Guide
- Entity Extraction Concepts
- Text Extraction Enhanced
- Presidio Official Documentation
Support
For issues specific to:
- Presidio setup: Check Presidio Setup Guide
- Entity extraction: See Entity Extraction Concepts
- OpenRegister integration: OpenRegister GitHub issues
- Presidio issues: Check Presidio GitHub