Presidio Analyzer Setup for Dutch Language Support

Overview

Presidio Analyzer is Microsoft's open-source PII detection and Named Entity Recognition (NER) service. This guide covers setting up Presidio with Dutch language support for OpenRegister.

Why Presidio?

Presidio Analyzer is recommended for production GDPR compliance because:

✅ High Accuracy: 90-98% precision for entity detection
✅ Multi-language Support: 50+ languages including Dutch
✅ GDPR-Focused: Built specifically for PII detection
✅ Self-Hosted: Run on your own infrastructure (privacy-first)
✅ Extensible: Add custom recognizers for domain-specific entities
✅ Active Development: Maintained by Microsoft with regular updates

Dutch Language Support

Default Language Model

The Presidio Analyzer Docker image includes English (en) language models by default. For Dutch language support, the container automatically downloads the required spaCy Dutch model on first startup.

Required Models

For Dutch NER, Presidio uses:

nl_core_news_sm - Small Dutch spaCy model (43MB, fast, good accuracy)
nl_core_news_md - Medium Dutch model (optional, 90MB, better accuracy)
nl_core_news_lg - Large Dutch model (optional, 545MB, best accuracy)

The small model is sufficient for most use cases and is automatically downloaded.

Docker Compose Configuration

The docker-compose.yml already includes Presidio Analyzer with multi-language support:

presidio-analyzer:
  image: mcr.microsoft.com/presidio-analyzer:latest
  container_name: openregister-presidio-analyzer
  restart: always
  ports:
    - "5001:5001"
  environment:
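    # Listening port inside the container (current images read PORT; GRPC_PORT belonged to the older gRPC-based releases)
    - PORT=5001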
    - LOG_LEVEL=INFO
    # Multi-language support (Dutch included)
    - PRESIDIO_ANALYZER_LANGUAGES=en,nl,de,fr,es
  deploy:
    resources:
      limits:
        memory: 2G
      reservations:
        memory: 512M
  healthcheck:
    test: ["CMD-SHELL", "curl -f http://localhost:5001/health || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 30s

Starting Presidio

1. Start the Service

# Navigate to OpenRegister directory
cd /path/to/apps-extra/openregister

# Start only the Presidio Analyzer service
docker-compose up -d presidio-analyzer

# Or start everything
docker-compose up -d

2. First Startup (Dutch Model Download)

On first startup, Presidio will automatically download the Dutch language model:

# Watch the logs to see model download
docker-compose logs -f presidio-analyzer

You should see output like:

Downloading Dutch model...
Collecting nl-core-news-sm
  Downloading nl_core_news_sm-3.6.0.tar.gz (43 MB)
Successfully installed nl-core-news-sm-3.6.0
✔ Download and installation successful

Note: This happens automatically. The download takes 1-3 minutes depending on your internet connection.
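Because the first start can take a few minutes, a deployment script may want to wait for the analyzer instead of failing on the first refused connection. The following is a minimal sketch, assuming the health endpoint is reachable at http://localhost:5001/health as configured in this guide; the function names and retry defaults are hypothetical and not part of Presidio or OpenRegister.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:5001/health", timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = http_probe,
    attempts: int = 20,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Calling `wait_until_healthy()` with the defaults polls for a little over three minutes, which covers the download window mentioned above. The probe and sleep functions are injectable so the loop can be exercised without a running container.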

3. Verify Installation

# Check if Presidio is running
curl http://localhost:5001/health

# A healthy analyzer answers with HTTP 200; the stock image returns the plain text:
# Presidio Analyzer service is up

# Test Dutch entity detection
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Jan de Vries woont in Amsterdam en zijn email is jan.devries@example.nl",
    "language": "nl"
  }'

Expected response:

[
  {
    "entity_type": "PERSON",
    "start": 0,
    "end": 12,
    "score": 0.85,
    "analysis_explanation": {
      "recognizer": "SpacyRecognizer",
      "pattern_name": null,
      "pattern": null,
      "original_score": 0.85,
      "score": 0.85,
      "textual_explanation": null,
      "score_context_improvement": 0,
      "supportive_context_word": "",
      "validation_result": null
    }
  },
  {
    "entity_type": "LOCATION",
    "start": 22,
    "end": 31,
    "score": 0.85
  },
  {
    "entity_type": "EMAIL_ADDRESS",
    "start": 49,
    "end": 71,
    "score": 0.95
  }
]
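The start and end values are character offsets into the submitted text, with end being exclusive. A small sketch of how a client could turn them back into the matched strings is shown below; the result list is trimmed to the fields used here and the helper name is hypothetical.

```python
text = "Jan de Vries woont in Amsterdam en zijn email is jan.devries@example.nl"

# Results shaped like the /analyze response; only the fields used here are kept.
results = [
    {"entity_type": "PERSON", "start": 0, "end": 12, "score": 0.85},
    {"entity_type": "LOCATION", "start": 22, "end": 31, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 49, "end": 71, "score": 0.95},
]


def entity_values(text: str, results: list[dict]) -> list[tuple[str, str]]:
    """Pair each entity type with the substring its offsets point at."""
    return [(r["entity_type"], text[r["start"]:r["end"]]) for r in results]


for entity_type, value in entity_values(text, results):
    print(f"{entity_type}: {value}")
```

Slicing the original text this way is also a quick sanity check: if a slice does not contain the expected value, the offsets and the text have drifted apart (for example after trimming or re-encoding the input).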

Advanced Configuration

Using a Larger Dutch Model (Better Accuracy)

If you need higher accuracy for Dutch text, use the medium or large model:

Option 1: Custom Dockerfile

Create docker/Dockerfile.presidio-nl:

FROM mcr.microsoft.com/presidio-analyzer:latest

# Install larger Dutch model
RUN python -m spacy download nl_core_news_md

# Or for best accuracy (large model)
# RUN python -m spacy download nl_core_news_lg

Update docker-compose.yml:

presidio-analyzer:
  build:
    context: .
    dockerfile: docker/Dockerfile.presidio-nl
  container_name: openregister-presidio-analyzer
  # ... rest of configuration

Option 2: Volume with Pre-downloaded Models

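# Download the model into a host directory.
# --target is passed through to pip so the package lands in the mounted /models
# directory rather than in the throwaway container's own site-packages.
docker run --rm -v "$(pwd)/presidio-models:/models" \
  mcr.microsoft.com/presidio-analyzer:latest \
  python -m spacy download nl_core_news_md --target /models

# Mount in docker-compose.yml.
# The site-packages path depends on the Python version inside the image; verify it before relying on this mount.
presidio-analyzer:
  volumes:
    - ./presidio-models/nl_core_news_md:/usr/local/lib/python3.9/site-packages/nl_core_news_md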

Configuring Recognized Entity Types

By default, Presidio detects these entity types in Dutch:

Entity Type	Description	Example
PERSON	Person names	Jan de Vries, Maria van der Berg
LOCATION	Places/cities	Amsterdam, Rotterdam, Nederland
ORGANIZATION	Companies/orgs	KPN, Gemeente Amsterdam
EMAIL_ADDRESS	Email addresses	jan@example.nl
PHONE_NUMBER	Phone numbers	+31 6 12345678, 06-12345678
IBAN_CODE	Bank accounts	NL91 ABNA 0417 1643 00
NRP	Nationality, religious or political group (not BSN)	Nederlander, katholiek
DATE_TIME	Dates/times	15 januari 2025
URL	Web addresses	https://example.nl
IP_ADDRESS	IP addresses	192.168.1.1

Dutch-Specific Patterns

Presidio includes Dutch-specific recognizers:

BSN (Burgerservicenummer): Dutch citizen service number, 9 digits (e.g. 123456782)
Dutch phone numbers: Multiple formats (+31 6, 06-, etc.)
Dutch IBANs: NL-prefixed bank accounts
Dutch addresses: Street patterns common in Netherlands
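A nine-digit pattern alone matches many numbers that are not BSNs, so a BSN recognizer usually pairs the pattern with the eleven-test (elfproef) checksum. The sketch below shows that checksum in isolation; it is not taken from Presidio's source, and the function name is hypothetical.

```python
import re

# Weights for the BSN eleven-test: the last digit is weighted -1.
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def is_valid_bsn(candidate: str) -> bool:
    """Check a 9-digit string against the BSN eleven-test (elfproef)."""
    if not re.fullmatch(r"[0-9]{9}", candidate):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(candidate, BSN_WEIGHTS))
    # All zeros also sums to a multiple of 11; rejecting it is an assumption of this sketch.
    return total % 11 == 0 and candidate != "000000000"
```

For the example value used in this guide, `is_valid_bsn("123456782")` returns True, while changing the last digit to 9 makes it fail. A check like this could serve as the validation step of a custom pattern recognizer to reduce false positives.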

Performance Considerations

Memory Requirements

Model Size	Memory Usage	Performance	Accuracy
sm (small)	100-200MB	Fast (50-100ms/doc)	Good (85-90%)
md (medium)	200-400MB	Medium (100-200ms/doc)	Better (88-93%)
lg (large)	500MB-1GB	Slow (200-400ms/doc)	Best (90-95%)

Recommendation: Use the small model for development and most production use cases. Upgrade to the medium or large model only if accuracy is insufficient.

Processing Speed

Average processing time for Dutch text:

Short text (100 chars): 20-50ms
Medium text (1000 chars): 50-150ms
Long text (10000 chars): 200-500ms

Tip: Process text in chunks of ~1000 characters for optimal performance.
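One way to apply this tip is to split the document at whitespace and remember where each chunk started, so entity positions can be mapped back to the full text afterwards. The following is a sketch under those assumptions; both helper names are hypothetical.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pairs no longer than max_chars.

    Chunks break after the last whitespace inside the window when there is one,
    so single words and numbers are less likely to be cut in half.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for a whitespace break inside the window; keep the hard cut if none exists.
            split_at = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split_at > start:
                end = split_at + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def shift_offsets(results: list[dict], offset: int) -> list[dict]:
    """Map chunk-relative start/end positions back to the full document."""
    return [{**r, "start": r["start"] + offset, "end": r["end"] + offset} for r in results]
```

Each chunk would be sent to the analyzer separately and its results passed through `shift_offsets` with the chunk's offset. A multi-word name can still straddle a chunk boundary; overlapping the chunks slightly is one possible mitigation and is not shown here.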

Integration with OpenRegister

Configuration in OpenRegister

Configure Presidio in your OpenRegister settings:

// config/ner_config.php
return [
    'ner_enabled' => true,
    'ner_method' => 'presidio',  // Use Presidio for production
    
    'presidio' => [
        'analyzer_url' => 'http://presidio-analyzer:5001',
        'default_language' => 'nl',  // Default to Dutch
        'languages' => ['nl', 'en'],  // Support Dutch and English
        'score_threshold' => 0.6,     // Minimum confidence score
        'entities' => [
            'PERSON',
            'EMAIL_ADDRESS',
            'PHONE_NUMBER',
            'IBAN_CODE',
            'LOCATION',
            'ORGANIZATION',
            'NRP',  // Nationality, religious or political group (not BSN)
        ]
    ]
];
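To make the effect of score_threshold and entities concrete, the sketch below applies the same two settings to a list of analyzer results on the client side. It is an illustration in Python rather than OpenRegister's PHP implementation, and the helper name is hypothetical. The analyzer's /analyze endpoint also accepts score_threshold and entities fields, so the filtering can be done server-side instead.

```python
from typing import Iterable, Optional


def filter_results(
    results: Iterable[dict],
    score_threshold: float = 0.6,
    entities: Optional[Iterable[str]] = None,
) -> list[dict]:
    """Keep results at or above the threshold whose type is in the allow-list."""
    allowed = set(entities) if entities is not None else None
    return [
        r
        for r in results
        if r["score"] >= score_threshold and (allowed is None or r["entity_type"] in allowed)
    ]
```

With the threshold at 0.6, a PERSON hit scored 0.85 is kept and a hit scored 0.4 is dropped. Raising the threshold trades recall for precision, so it is worth checking against representative Dutch samples before settling on a value.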

Using Presidio in PHP

use OCA\OpenRegister\Service\NerService;

// Initialize NER service
$nerService = $this->container->get(NerService::class);

// Extract entities from Dutch text
$dutchText = "Jan de Vries woont in Amsterdam en zijn telefoonnummer is 06-12345678.";

$entities = $nerService->extractEntities($dutchText, 'presidio', [
    'language' => 'nl'
]);

foreach ($entities as $entity) {
    echo "Type: {$entity['type']}\n";
    echo "Value: {$entity['value']}\n";
    echo "Confidence: {$entity['confidence']}\n";
    echo "Position: {$entity['start']}-{$entity['end']}\n\n";
}

Output:

Type: PERSON
Value: Jan de Vries
Confidence: 0.85
Position: 0-12

Type: LOCATION
Value: Amsterdam
Confidence: 0.85
Position: 22-31

Type: PHONE_NUMBER
Value: 06-12345678
Confidence: 0.95
Position: 58-69

Automatic Language Detection

If your documents mix languages, detect the language first:

// Detect language
$language = $nerService->detectLanguage($text);

// Use detected language for entity extraction
$entities = $nerService->extractEntities($text, 'presidio', [
    'language' => $language
]);
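A language detector can return codes the analyzer has no model for, or regional variants such as nl-NL. The sketch below shows one possible normalisation and fallback step, written in Python for illustration; the function name is hypothetical and the supported list simply mirrors the languages setting shown in the configuration above.

```python
from typing import Optional, Sequence


def choose_language(
    detected: Optional[str],
    supported: Sequence[str] = ("nl", "en"),
    default: str = "nl",
) -> str:
    """Reduce a detected code such as 'nl-NL' to a supported two-letter code, else the default."""
    if not detected:
        return default
    code = detected.strip().lower().replace("_", "-").split("-")[0]
    return code if code in supported else default
```

For example, `choose_language("nl-NL")` yields "nl", and an unsupported code such as "de" falls back to the default. Whether to fall back or skip the document entirely is a policy choice that depends on how costly missed entities are.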

Troubleshooting

Presidio Container Won't Start

# Check logs
docker logs openregister-presidio-analyzer

# Common issues:
# 1. Port 5001 already in use
sudo lsof -i :5001

# 2. Insufficient memory
# Increase memory limit in docker-compose.yml

Dutch Model Download Fails

# Manually download Dutch model
docker exec -it openregister-presidio-analyzer \
  python -m spacy download nl_core_news_sm

# Verify model is installed
docker exec -it openregister-presidio-analyzer \
  python -m spacy info nl_core_news_sm

Low Accuracy for Dutch Text

Solutions:

Upgrade to a larger model (see Advanced Configuration)
Lower the confidence threshold in the configuration
Add custom recognizers for domain-specific terms
Use a hybrid approach with multiple NER methods

Connection Errors from Nextcloud

# Test connectivity from Nextcloud container
docker exec nextcloud curl http://presidio-analyzer:5001/health

# If this fails, check that both services are on the same network
docker network ls
docker network inspect openregister_default

Custom Dutch Recognizers

For domain-specific Dutch entities (e.g., Dutch postcode patterns):

Create custom recognizer: custom_recognizers/nl_postcode_recognizer.yaml

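- name: nl_postcode
  supported_language: nl
  # Entity label reported in results; the name NL_POSTCODE is illustrative
  supported_entity: NL_POSTCODE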
  patterns:
    - name: dutch_postcode
      regex: '\b[1-9][0-9]{3}\s?[A-Z]{2}\b'
      score: 0.85
  context:
    - postcode
    - adres
    - woonplaats

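Load custom recognizers (this assumes an analyzer build that accepts recognizer uploads over HTTP; on the stock image, /recognizers only lists the registered recognizers):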

docker exec openregister-presidio-analyzer \
  curl -X POST http://localhost:5001/recognizers \
  -H "Content-Type: application/yaml" \
  --data-binary @custom_recognizers/nl_postcode_recognizer.yaml
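An alternative that needs no upload step is to send the pattern along with each request through the ad_hoc_recognizers field of /analyze. The sketch below builds such a request body and checks the postcode regex locally first. The entity label NL_POSTCODE, the sample sentence and the function name are illustrative assumptions.

```python
import json
import re

# Same pattern as in the YAML definition: four digits (no leading zero), optional space, two capitals.
POSTCODE_REGEX = r"\b[1-9][0-9]{3}\s?[A-Z]{2}\b"


def build_postcode_request(text: str) -> dict:
    """Build an /analyze request body that carries the postcode recognizer inline."""
    return {
        "text": text,
        "language": "nl",
        "ad_hoc_recognizers": [
            {
                "name": "nl_postcode",
                "supported_language": "nl",
                "supported_entity": "NL_POSTCODE",  # illustrative label
                "patterns": [
                    {"name": "dutch_postcode", "regex": POSTCODE_REGEX, "score": 0.85}
                ],
                "context": ["postcode", "adres", "woonplaats"],
            }
        ],
    }


sample = "Het adres is Damrak 1, 1012 LG Amsterdam."
print(re.findall(POSTCODE_REGEX, sample))
print(json.dumps(build_postcode_request(sample), indent=2))
```

The resulting JSON can be posted to /analyze in the same way as the earlier curl examples. Testing the regex with `re` beforehand catches pattern mistakes without a round trip to the container.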

Testing Dutch Entity Detection

Test Script

#!/bin/bash
# test-dutch-ner.sh

echo "Testing Dutch entity detection..."

# Test 1: Person and location
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Jan de Vries woont in Amsterdam.",
    "language": "nl"
  }' | jq

# Test 2: Email and phone
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Neem contact op via jan@example.nl of bel 06-12345678.",
    "language": "nl"
  }' | jq

# Test 3: IBAN
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Maak het bedrag over naar NL91 ABNA 0417 1643 00.",
    "language": "nl"
  }' | jq

# Test 4: Organization
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "De Gemeente Amsterdam heeft een nieuwe website gelanceerd.",
    "language": "nl"
  }' | jq

echo "Tests completed!"

Run tests:

chmod +x test-dutch-ner.sh
./test-dutch-ner.sh
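The shell script prints responses for manual inspection. For an automated check, the same requests can be issued from Python's standard library and the response compared against the entity types expected for each sentence. This is a sketch assuming the analyzer is reachable at http://localhost:5001 as elsewhere in this guide; all function names are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(
    text: str, language: str = "nl", base_url: str = "http://localhost:5001"
) -> urllib.request.Request:
    """Prepare a POST to /analyze with a JSON body."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, language: str = "nl") -> list:
    """Send the request and decode the JSON list of results (needs a running analyzer)."""
    with urllib.request.urlopen(build_analyze_request(text, language), timeout=10) as response:
        return json.load(response)


def missing_entity_types(results: list, expected: set) -> set:
    """Return the expected entity types that do not appear in the results."""
    found = {r["entity_type"] for r in results}
    return set(expected) - found
```

A test case would then assert that `missing_entity_types(analyze("Jan de Vries woont in Amsterdam."), {"PERSON", "LOCATION"})` is empty. Comparing entity types rather than exact scores keeps the check stable across model versions.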

Resource Management

Monitor Presidio

# Check resource usage
docker stats openregister-presidio-analyzer

# Check health
curl http://localhost:5001/health

# View logs
docker logs -f openregister-presidio-analyzer --tail 100

Restart Presidio

# Restart service
docker-compose restart presidio-analyzer

# Or rebuild (if Dockerfile changed)
docker-compose up -d --build presidio-analyzer

Production Checklist

Before deploying to production:

Presidio Analyzer is running and healthy
Dutch model is installed (verify with test query)
Memory limits are appropriate (2GB recommended)
Health checks are working
Confidence threshold is configured (0.6-0.8 recommended)
Logging level is set to INFO (not DEBUG)
Custom recognizers are loaded (if needed)
Integration tests pass with Dutch text samples
Performance is acceptable (<200ms per 1000 chars)

Alternative: MITIE for Development

For development without Docker dependencies, use MITIE (local PHP library):

// Development/testing with MITIE
$entities = $nerService->extractEntities($text, 'mitie');

// Production with Presidio
$entities = $nerService->extractEntities($text, 'presidio', [
    'language' => 'nl'
]);

See NER & NLP Concepts for MITIE setup.

Related Documentation

NER & NLP Concepts - Understanding entity recognition
Docker Setup - Complete Docker development setup guide
Text Extraction Enhanced - Complete extraction pipeline
Entity Relationships - GDPR entity data model

Summary: The default Presidio Analyzer setup automatically supports Dutch language. No additional configuration is required beyond the docker-compose setup. The Dutch spaCy model downloads automatically on first startup.
