
Files

What are Files in Open Register?

In Open Register, Files are binary data attachments that can be associated with objects. They extend the system beyond structured data to include documents, images, videos, and other file types that are essential for many applications.

Files in Open Register are:

  • Securely stored and managed
  • Associated with specific objects
  • Versioned alongside their parent objects
  • Accessible through a consistent API
  • Integrated with Nextcloud's file management capabilities

Attaching Files to Objects

Files can be attached to objects in several ways:

  1. Integrated Uploads: Files can be uploaded directly within object POST/PUT operations using multipart/form-data, base64-encoded content, or URL references
  2. Schema-defined file properties: When a schema includes properties of type 'file', these are automatically handled during object creation or updates
  3. Direct API attachment: Files can be added to an object after creation using the file attachment API endpoints
  4. Base64 encoded content: Files can be included in object data as base64-encoded strings
  5. URL references: External files can be referenced by URL and will be downloaded and stored locally

Integrated File Uploads

OpenRegister supports integrated file uploads directly within object POST/PUT operations, providing a unified approach to handling structured data (objects) and unstructured data (files) together.

Upload Methods

1. Multipart/Form-Data (Recommended)

Use Case: Uploading files from web forms or file inputs

Example:

POST /index.php/apps/openregister/api/registers/documents/schemas/document/objects
Content-Type: multipart/form-data

title=Annual Report 2024
attachment=@report.pdf
thumbnail=@cover.jpg

JavaScript Example:

// Build a multipart body. For FormData, the browser sets the
// Content-Type header (including the boundary) automatically.
const formData = new FormData();
formData.append('title', 'Annual Report 2024');
formData.append('attachment', fileInput.files[0]);
formData.append('thumbnail', thumbnailInput.files[0]);

fetch('/index.php/apps/openregister/api/registers/documents/schemas/document/objects', {
  method: 'POST',
  body: formData,
  headers: {
    'Authorization': 'Bearer YOUR_TOKEN'
  }
})
  .then(response => response.json())
  .then(data => console.log('Created:', data));

Why this is recommended:

  • ✅ Most efficient: No encoding overhead, files transferred directly
  • ✅ Preserves metadata: Original filename and MIME type are maintained
  • ✅ No guessing: Extension and filename are exactly as uploaded
  • ✅ Best file quality: No conversion or inference errors
  • ✅ Low memory footprint: Can stream directly from disk to disk
  • ✅ Fastest method: Direct transfer without intermediate conversions

2. Base64-Encoded Files

Use Case: Embedding files in JSON payloads, API integrations

Data URI Format:

{
"title": "Screenshot",
"image": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA..."
}

Plain Base64 Format:

{
"title": "Document",
"attachment": "JVBERi0xLjQKJeLjz9MKMyAwIG9iago8PC9MZW5ndGggMj..."
}

Note: Base64 encoding increases file size by approximately 33% and original filenames are lost. Use only for small files (< 100 KB) or when multipart is not possible.
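
The 33% figure follows from how base64 works: every 3 input bytes become 4 output characters. A small sketch of that arithmetic, assuming standard padded base64 (the function names are illustrative, not part of OpenRegister):

```javascript
// Length of the padded base64 text for `byteLength` raw bytes:
// each group of up to 3 bytes is encoded as 4 characters.
function base64EncodedSize(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Relative size increase as a fraction (about 0.33 for larger inputs).
function base64Overhead(byteLength) {
  if (byteLength === 0) return 0;
  return base64EncodedSize(byteLength) / byteLength - 1;
}

const rawBytes = 100 * 1024; // the 100 KB guideline from the note above
console.log(base64EncodedSize(rawBytes));          // 136536
console.log(base64Overhead(rawBytes).toFixed(3));  // "0.333"
```

A data URI adds its `data:<mime>;base64,` prefix on top of this, and the JSON body must be held in memory while it is parsed, which is why the note limits base64 to small files.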

3. URL References

Use Case: Referencing remote files, importing from external sources

Example:

{
"title": "External Document",
"attachment": "https://example.com/files/document.pdf",
"logo": "https://cdn.example.com/images/logo.png"
}

Note: URL references are slower as the server must download the file from the external URL. Use only for trusted sources or migration scenarios.

4. Mixed Upload Methods

You can combine all three methods in a single request:

POST /index.php/apps/openregister/api/registers/documents/schemas/document/objects
Content-Type: multipart/form-data

title=Complete Package
mainDocument=@contract.pdf
signature=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA...
reference=https://example.com/terms.pdf
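
Because one request can carry data URIs, plain base64 strings, and URLs side by side, the receiver has to tell them apart. The sketch below shows one way such a classification could look; `classifyFileValue` and its rules are illustrative assumptions, not OpenRegister's actual detection logic.

```javascript
// Classify a string submitted for a file property.
// Hypothetical helper: the rules below are assumptions for illustration.
function classifyFileValue(value) {
  // Data URI: data:<mime>;base64,<payload>
  const dataUri = /^data:([^;,]+);base64,(.+)$/.exec(value);
  if (dataUri) {
    return { kind: 'data-uri', mimeType: dataUri[1], payload: dataUri[2] };
  }
  // Remote reference that the server would download.
  if (/^https?:\/\//i.test(value)) {
    return { kind: 'url', url: value };
  }
  // Plain base64: base64 alphabet only, padded to a multiple of 4 characters.
  if (value.length > 0 && value.length % 4 === 0 && /^[A-Za-z0-9+/]+={0,2}$/.test(value)) {
    return { kind: 'base64', payload: value };
  }
  return { kind: 'unknown' };
}

console.log(classifyFileValue('data:image/png;base64,iVBORw0KGgo=').kind); // "data-uri"
console.log(classifyFileValue('https://example.com/terms.pdf').kind);      // "url"
console.log(classifyFileValue('JVBERi0xLjQK').kind);                       // "base64"
```

Note that a short plain word such as `Document` also passes the plain-base64 test in this sketch. That ambiguity is one practical reason to prefer the data URI format, which states the MIME type explicitly.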

Array of Files

Files can be uploaded as arrays:

Schema:

{
  "properties": {
    "attachments": {
      "type": "array",
      "items": {
        "type": "file"
      }
    }
  }
}

Upload:

{
  "title": "Multi-File Document",
  "attachments": [
    "data:application/pdf;base64,JVBERi0xLjQKJeL...",
    "https://example.com/file2.pdf",
    "data:image/png;base64,iVBORw0KGgo..."
  ]
}

Update Operations

File properties work the same way with PUT/PATCH operations:

PUT /index.php/apps/openregister/api/registers/documents/schemas/document/objects/abc-123
Content-Type: multipart/form-data

title=Updated Document
attachment=@new-version.pdf

Note: Updating a file property replaces the previous file.

Error Handling

Invalid MIME Type

{
"error": "File at attachment has invalid type 'application/zip'. Allowed types: application/pdf, application/msword"
}

File Too Large

{
"error": "File at attachment exceeds maximum size (10485760 bytes). File size: 15728640 bytes"
}

Upload Error

{
"error": "Failed to read uploaded file for field 'attachment'"
}

URL Download Failure

{
"error": "Unable to fetch file from URL: https://example.com/missing.pdf"
}
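
Type and size errors can often be caught before the upload starts. The following is a sketch of a client-side pre-check that produces messages in the same shape as the server errors above; `validateFile` is a hypothetical helper, and the `allowedTypes` and `maxSize` option names are assumed to match the schema's file property options.

```javascript
// Pre-check a file against a property's constraints before uploading.
// Returns an error message, or null when no problem is found.
// Hypothetical helper; the server remains the authority on validation.
function validateFile(field, file, { allowedTypes = [], maxSize = Infinity } = {}) {
  if (allowedTypes.length > 0 && !allowedTypes.includes(file.type)) {
    return `File at ${field} has invalid type '${file.type}'. Allowed types: ${allowedTypes.join(', ')}`;
  }
  if (file.size > maxSize) {
    return `File at ${field} exceeds maximum size (${maxSize} bytes). File size: ${file.size} bytes`;
  }
  return null;
}

const rules = {
  allowedTypes: ['application/pdf', 'application/msword'],
  maxSize: 10485760, // 10 * 1024 * 1024 bytes
};
console.log(validateFile('attachment', { type: 'application/zip', size: 2048 }, rules));
console.log(validateFile('attachment', { type: 'application/pdf', size: 15728640 }, rules));
```

In a browser, a `File` object from a file input already exposes `type` and `size`, so it can be passed in directly. The check only saves a round trip; it does not replace server-side validation.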

Backward Compatibility

Existing file endpoints remain unchanged:

  • POST /api/objects/{register}/{schema}/{id}/files
  • GET /api/objects/{register}/{schema}/{id}/files
  • DELETE /api/objects/{register}/{schema}/{id}/files/{fileId}

Both approaches work and can be used interchangeably.

Performance Comparison

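| Method    | Speed   | File Size   | Metadata  | Use Case                       |
|-----------|---------|-------------|-----------|--------------------------------|
| Multipart | Fastest | Original    | Preserved | ✅ Recommended for all uploads |
| Base64    | Medium  | +33% larger | Lost      | ⚠️ Small files only (< 100 KB) |
| URL       | Slowest | Original    | Preserved | 🐌 External imports only       |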

Best Practices

  1. ✅ ALWAYS use Multipart for user uploads

    • Users expect filenames to be preserved
    • Prevents confusion about generic filenames
  2. ⚠️ Base64 only for APIs

    • When API client doesn't support multipart
    • Document that filenames will be lost
    • Always use data URI format with MIME type
  3. 🐌 URLs only for trusted sources

    • Use timeout limits (max 30 seconds)
    • Validate content-length headers upfront
    • Implement retry logic
  4. 📝 Document your choice

    • If using base64 or URL, explain why
    • Make users aware of trade-offs
  5. 🧪 Test performance

    • Measure upload times in production
    • Monitor failure rates for URL downloads

File Metadata and Tagging

Each file attachment includes rich metadata:

  • Basic properties (name, size, type, extension)
  • Creation and modification timestamps
  • Access and download URLs
  • Checksum for integrity verification
  • Custom tags for categorization

Tagging System

Files can be tagged with both simple labels and key-value pairs:

  • Tags with a colon (':') are treated as key-value pairs and can be used for advanced filtering and organization
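
As an illustration of the colon convention, the sketch below separates a tag list into simple labels and key-value pairs. Splitting on the first colon is an assumption (it keeps values such as URLs intact), and `parseTags` is a hypothetical helper rather than an OpenRegister function.

```javascript
// Split tags into simple labels and key-value pairs.
// Assumption: the key is everything before the FIRST colon.
function parseTags(tags) {
  const labels = [];
  const pairs = {};
  for (const tag of tags) {
    const idx = tag.indexOf(':');
    if (idx === -1) {
      labels.push(tag);
    } else {
      pairs[tag.slice(0, idx).trim()] = tag.slice(idx + 1).trim();
    }
  }
  return { labels, pairs };
}

const parsed = parseTags(['invoice', 'department:legal', 'year:2024']);
console.log(parsed.labels); // [ 'invoice' ]
console.log(parsed.pairs);  // { department: 'legal', year: '2024' }
```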

Version Control

The system maintains file versions by:

  • Tracking file modifications with timestamps
  • Preserving checksums to detect changes
  • Integrating with the object audit trail system
  • Supporting file restoration from previous versions

Security and Access Control

File attachments inherit the security model of their parent objects:

  • Files are stored in Nextcloud with appropriate permissions
  • Share links can be generated for controlled external access
  • Access is managed through the OpenRegister user and group system
  • Files are associated with the OpenRegister application user for consistent permissions

File Operations

The system supports the following operations on file attachments:

  • Retrieving Files
  • Updating Files
  • Deleting Files

File Preview and Rendering

The system leverages Nextcloud's preview capabilities for supported file types:

  • Images are displayed as thumbnails
  • PDFs can be previewed in-browser
  • Office documents can be viewed with compatible apps
  • Preview URLs are generated for easy embedding

Integration with Object Lifecycle

File attachments are fully integrated with the object lifecycle:

  • When objects are created, their file folders are automatically provisioned
  • When objects are updated, file references are maintained
  • When objects are deleted, associated files can be optionally preserved or removed
  • File operations are recorded in the object's audit trail

Technical Implementation

The file attachment system is implemented through two main service classes:

  • FileService: Handles low-level file operations, folder management, and Nextcloud integration
  • ObjectService: Provides high-level methods for attaching, retrieving, and managing files in the context of objects

These services work together to provide a seamless file management experience within the OpenRegister application.

File Structure

  • id (integer): Unique identifier of the file in Nextcloud
  • uuid (string): Unique identifier for the file
  • filename (string): Name of the file
  • downloadUrl (string, URI): Direct download URL for the file
  • shareUrl (string, URI): URL to access the file via share link
  • accessUrl (string, URI): URL to access the file
  • extension (string): File extension
  • checksum (string): ETag hash for file versioning
  • source (integer): Source identifier
  • userId (string): ID of the user who owns the file
  • base64 (string): Base64 encoded content of the file
  • filePath (string): Full path to the file in Nextcloud
  • created (string, date-time): ISO 8601 timestamp when file was first shared
  • updated (string, date-time): ISO 8601 timestamp of last modification


How Files are Stored

Open Register provides flexible storage options for files:

1. Nextcloud Storage

By default, files are stored in Nextcloud's file system, leveraging its robust file management capabilities, including:

  • Access control
  • Versioning
  • Encryption
  • Collaborative editing

2. External Storage

For larger deployments or specialized needs, files can be stored in:

  • Object storage systems (S3, MinIO)
  • Content delivery networks
  • Specialized document management systems

3. Database Storage

Small files can be stored directly in the database for simplicity and performance.

File Features

1. Versioning

Files maintain version history, allowing you to:

  • Track changes over time
  • Revert to previous versions
  • Compare different versions

2. Access Control

Files inherit access control from their parent objects, ensuring consistent security:

  • Users who can access an object can access its files
  • Additional file-specific permissions can be applied
  • Permissions can be audited

3. Metadata

Files support rich metadata to provide context and improve searchability:

  • Standard metadata (creation date, size, type)
  • Custom metadata specific to your application
  • Extracted metadata (e.g., EXIF data from images)

4. Preview Generation

Open Register can generate previews for common file types:

  • Thumbnails for images
  • PDF previews
  • Document previews

5. Content Extraction

For supported file types, content can be extracted for indexing and search:

  • Text extraction from documents
  • OCR for scanned documents and images
  • Metadata extraction

Enhanced Text Extraction

OpenRegister now includes enhanced text extraction with entity tracking (GDPR), language detection, and language level assessment. See Enhanced Text Extraction & GDPR Entity Tracking for details.

Asynchronous Processing: Text extraction happens in the background after file upload, ensuring:

  • Fast uploads: File uploads complete without waiting for text extraction to finish
  • Non-blocking: Users don't experience delays during file operations
  • Reliable: Background jobs automatically handle retries for failed extractions
  • Resource-efficient: Processing happens when resources are available

Text Extraction Options:

OpenRegister supports two text extraction engines:

  1. LLPhant (Default) - PHP-based extraction:

    • ✓ Native support: TXT, MD, HTML, JSON, XML, CSV
    • ○ Library support: PDF, DOCX, DOC, XLSX, XLS (requires PhpOffice, PdfParser)
    • ⚠️ Limited: PPTX, ODT, RTF
    • ✗ No support: Image files (JPG, PNG, GIF, WebP)
    • Best for: Privacy-conscious environments, regular documents
    • Cost: Free (included)
  2. Dolphin AI - Advanced AI-powered extraction:

    • ✓ All document formats with superior quality
    • ✓ OCR for scanned documents and images (JPG, PNG, GIF, WebP)
    • ✓ Advanced table extraction
    • ✓ Formula recognition
    • ✓ Multi-language OCR
    • Best for: Complex documents, scanned materials, images with text
    • Cost: API subscription required

Extraction Scope Options:

  • None: Text extraction disabled
  • All files: Extract from all uploaded files
  • Files in folders: Extract only from files in specific folders
  • Files attached to objects: Extract only from files linked to objects (recommended)

Typical Processing Times:

  • Text files: < 1 second
  • PDFs (LLPhant): 2-10 seconds
  • PDFs (Dolphin): 3-15 seconds
  • Large documents or OCR: 10-60 seconds
  • Images with OCR (Dolphin): 5-20 seconds

You can configure text extraction in Settings → File Configuration. Check extraction status in the file's metadata after upload.

Technical Implementation

Background Job Processing:

Text extraction uses Nextcloud's background job system for reliable, async processing:

  1. File Upload - User uploads a file
  2. Job Queuing - 'FileChangeListener' automatically queues 'FileTextExtractionJob'
  3. Job Execution - Background job system processes the file when resources are available
  4. Text Extraction - Selected extractor (LLPhant or Dolphin) processes the file
  5. Chunking - Text is automatically split into chunks with overlap (1000 chars per chunk, 200 char overlap)
  6. Storage - Extracted text and chunks stored in 'FileText' entity for reuse
  7. Completion - Status updated to 'completed' or 'failed'

Note: Text extraction is now fully independent of SOLR. Chunks are generated during extraction and stored in the database, making them reusable for SOLR indexing, vector embeddings, AI processing, or any other service that needs chunked text.
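
The chunking step can be pictured as a sliding window over the extracted text. Below is a minimal sketch using the documented sizes (1000 characters per chunk, 200 characters of overlap); the function name and the `text`/`start`/`end` field names are assumptions, and the actual implementation may split differently, for example at sentence boundaries.

```javascript
// Split text into fixed-size chunks; consecutive chunks share `overlap` characters.
// Offsets count JavaScript string positions (UTF-16 code units).
function chunkText(text, chunkSize = 1000, overlap = 200) {
  if (overlap >= chunkSize) {
    throw new RangeError('overlap must be smaller than chunkSize');
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ text: text.slice(start, end), start, end });
    if (end === text.length) break; // the window reached the end of the text
  }
  return chunks;
}

const chunks = chunkText('a'.repeat(2500));
console.log(chunks.map(c => [c.start, c.end])); // [ [0, 1000], [800, 1800], [1600, 2500] ]
```

With these defaults the window advances 800 characters at a time, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.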

File Type Compatibility Matrix:

LLPhant Support:

  • Native (TXT, MD, HTML, JSON, XML, CSV) - Perfect quality, very fast
  • Library (PDF, DOCX, DOC, XLSX, XLS) - Good quality, medium speed
  • ⚠️ Limited (PPTX, ODT, RTF) - Basic text only, use Dolphin for better results
  • No Support (JPG, PNG, GIF, WebP) - Requires Dolphin with OCR

Dolphin AI Support:

  • ✓ All formats with superior quality
  • ✓ OCR for scanned documents and images
  • ✓ Table extraction with structure preserved
  • ✓ Formula recognition (LaTeX format)
  • ✓ Multi-language support
  • ✓ Layout understanding (multi-column, etc.)

OCR-Specific Use Cases (Dolphin only):

  1. Document Digitization - Scanning paper archives into searchable text
  2. Receipt Processing - Photo receipts from mobile devices
  3. Screenshot Analysis - Extract text from application screenshots
  4. Infographic Text - Extract text from images with embedded text
  5. Historical Documents - Digitize old scanned materials

Quality Requirements for OCR:

  • Minimum: 150 DPI resolution
  • Recommended: 300+ DPI
  • Clear, high-contrast images
  • Minimal blur or distortion
  • Properly oriented (not rotated)

Extraction Configuration Options:

Configure in Settings → File Configuration:

  1. Text Extractor Selection:

    • LLPhant (default) - Local, free, privacy-friendly
    • Dolphin - Advanced AI, requires API key
  2. Extraction Scope:

    • None - Disabled
    • All files - Every uploaded file
    • Files in folders - Specific folders only
    • Files attached to objects - Only object attachments (recommended)
  3. Extraction Mode:

    • Background (default) - Async via background jobs
    • Immediate - Synchronous during upload (slower)
    • Manual - Triggered by admin action only
  4. Enabled File Types:

    • Select which file extensions to process
    • Different for LLPhant vs Dolphin
    • Enable OCR formats (images) only if using Dolphin

Integration Tests:

The file text extraction system includes comprehensive integration tests:

# Run file extraction tests
vendor/bin/phpunit tests/Integration/FileTextExtractionIntegrationTest.php

# Test cases covered:
# - File upload queues background job
# - Background job execution completes
# - Text extraction end-to-end with content verification
# - Multiple file format support (TXT, MD, JSON)
# - Extraction metadata recording (status, method, timestamps)

Monitoring Extraction:

Check extraction status via logs:

# Watch extraction progress (2>&1 also captures log lines written to stderr)
docker logs -f nextcloud-container 2>&1 | grep FileTextExtractionJob

# Check for errors
docker logs nextcloud-container 2>&1 | grep 'extraction failed'

# View extraction statistics
# Settings → File Configuration → Statistics section

Files Management Page

The Files page provides a centralized view of all files tracked in the text extraction system.

Accessing the Files Page:

Navigate to Files in the main menu to view all files with their extraction status.

Features:

  1. File List Table:

    • File name and path
    • File type and size
    • Extraction status (Pending, Processing, Completed, Failed)
    • Number of text chunks created
    • Last extraction timestamp
  2. Status Indicators:

    • 🟠 Pending: File discovered but not yet extracted
    • 🔵 Processing: Extraction in progress
    • 🟢 Completed: Successfully extracted
    • 🔴 Failed: Extraction error occurred
  3. File Actions:

    • Retry: Re-extract failed files
    • View Error: See detailed error message for failed extractions
  4. Pagination:

    • Browse through large file lists (50 files per page)
    • Navigate between pages
  5. Refresh:

    • Update the list to see latest extraction status

Use Cases:

  • Monitor extraction progress across all files
  • Identify and retry failed extractions
  • View error details for troubleshooting
  • Verify which files have been processed

Core File Extraction API:

OpenRegister provides dedicated API endpoints for file text extraction (moved from settings to core functionality):

  • GET /api/files - List all tracked files with extraction status
  • GET /api/files/{id} - Get single file extraction information
  • POST /api/files/{id}/extract - Extract text from specific file
  • POST /api/files/extract - Extract all pending files (batch processing)
  • POST /api/files/retry-failed - Retry all failed extractions
  • GET /api/files/stats - Get extraction statistics

Smart Re-Extraction:

The system automatically detects when files need re-extraction by comparing:

  • File modification time ('mtime' from Nextcloud's 'oc_filecache')
  • Last extraction time ('extractedAt' from 'oc_openregister_file_texts')

If 'mtime > extractedAt', the file is re-extracted to ensure content is up-to-date.
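
The comparison itself is simple. A sketch of the rule follows, assuming both values are Unix timestamps in seconds and that a file with no recorded extraction time always needs extraction (that second rule is an assumption, not stated in the source).

```javascript
// Decide whether a tracked file needs (re-)extraction.
// Assumptions: both values are Unix timestamps in seconds;
// a missing extractedAt means the file was never extracted.
function needsReExtraction({ mtime, extractedAt }) {
  if (extractedAt === null || extractedAt === undefined) {
    return true;
  }
  return mtime > extractedAt;
}

console.log(needsReExtraction({ mtime: 1700000500, extractedAt: 1700000000 })); // true
console.log(needsReExtraction({ mtime: 1700000000, extractedAt: 1700000500 })); // false
```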

File Tracking Table:

Extracted text and metadata are stored in 'oc_openregister_file_texts' with:

  • 'file_id' - Links to Nextcloud's 'oc_filecache' table
  • 'extraction_status' - pending, processing, completed, failed
  • 'extractedAt' - Timestamp of last extraction
  • 'text_content' - Full extracted text
  • 'text_length' - Character count
  • 'chunked' - Whether text has been chunked
  • 'chunk_count' - Number of chunks created
  • 'chunks_json' - JSON array of text chunks with offsets (new in v0.2.7)
  • 'extraction_method' - LLPhant or Dolphin
  • Plus SOLR indexing and vectorization tracking

Chunking Details: Each chunk in 'chunks_json' contains the chunk text, start offset, and end offset. This allows for precise text retrieval and consistent chunking across all services.

Working with Files

Uploading Files

Files can be uploaded and attached to objects:

POST /api/objects/{id}/files
Content-Type: multipart/form-data

file: [binary data]
metadata: {"author": "Legal Department", "securityLevel": "confidential"}

Retrieving Files

You can download a file:

GET /api/files/{id}

Or get file metadata:

GET /api/files/{id}/metadata

Listing Files for an Object

You can retrieve all files associated with an object:

GET /api/objects/files/{objectId}

Updating Files

Files can be updated in two ways:

1. Update File Content

Upload a new version of the file:

PUT /api/objects/{register}/{schema}/{objectId}/files/{fileId}
Content-Type: application/json

{
  "content": "[base64 encoded content or raw content]",
  "tags": ["tag1", "tag2"]
}

2. Update Metadata Only

Update only the file metadata (tags) without changing content:

PUT /api/objects/{register}/{schema}/{objectId}/files/{fileId}
Content-Type: application/json

{
  "tags": ["updated-tag1", "updated-tag2"]
}

Note: The 'content' parameter is optional. If omitted, only the metadata will be updated without modifying the file content itself.

Deleting Files

Files can be deleted when no longer needed:

DELETE /api/files/{id}

File Relationships

Files have important relationships with other core concepts:

Files and Objects

  • Files are attached to objects
  • An object can have multiple files
  • Files inherit permissions from their parent object
  • Files are versioned alongside their parent object

Files and Schemas

  • Schemas can define expectations for file attachments
  • File validation can be specified in schemas (allowed types, max size)
  • Schemas can define required file attachments

Files and Registers

  • Registers can be configured with different file storage options
  • File storage policies can be defined at the register level
  • Registers can have quotas for file storage

Use Cases

1. Document Management

Attach important documents to business objects:

  • Contracts to customer records
  • Invoices to order records
  • Specifications to product records

2. Media Management

Store and manage media assets:

  • Product images
  • Marketing materials
  • Training videos

3. Evidence Collection

Maintain evidence for regulatory or legal purposes:

  • Compliance documentation
  • Audit evidence
  • Legal case files

4. Technical Documentation

Manage technical documents:

  • User manuals
  • Technical specifications
  • Installation guides

Advanced File Features

1. Auto-Share Configuration

File properties can be configured to automatically share uploaded files publicly. This is useful for assets that need to be accessible without authentication, such as product images or public documents.

Configuration via UI

When editing a schema in the OpenRegister UI:

  1. Select a property with type 'file' or 'array' with items type 'file'
  2. In the property actions menu, expand the 'File Configuration' section
  3. Check the 'Auto-Share Files' checkbox
  4. Save the schema

Files uploaded to this property will now be automatically publicly shared.

Configuration via API

In your schema definition, add the 'autoPublish' option to file properties:

{
  "properties": {
    "productImage": {
      "type": "file",
      "autoPublish": true,
      "allowedTypes": ["image/jpeg", "image/png"],
      "maxSize": 5242880
    }
  }
}

When 'autoPublish' is set to 'true', files uploaded to this property will automatically:

  • Create a public share link
  • Set the 'published' timestamp
  • Generate a public 'accessUrl' and 'downloadUrl'

Important: Property-Level vs Schema-Level autoPublish

⚠️ Don't confuse these two different 'autoPublish' settings:

1. Property-Level autoPublish (this section):

{
  "properties": {
    "productImage": {
      "type": "file",
      "autoPublish": true // ← Controls if FILES are published
    }
  }
}

Controls whether files uploaded to this specific property are automatically shared publicly.

2. Schema-Level autoPublish (different setting):

{
  "configuration": {
    "autoPublish": true // ← Controls if OBJECTS are published
  }
}

Controls whether the object entity itself is published (has nothing to do with file sharing).

These are completely separate settings with different purposes. Setting one does NOT affect the other.

Example Response

{
  "id": "12345",
  "title": "Product A",
  "productImage": {
    "id": 789,
    "title": "product-a.jpg",
    "accessUrl": "https://your-domain.com/index.php/s/AbCdEfG123",
    "downloadUrl": "https://your-domain.com/index.php/s/AbCdEfG123/download",
    "published": "2024-01-15T10:30:00+00:00",
    "size": 245678,
    "type": "image/jpeg"
  }
}

2. Authenticated File Access

Files that are not publicly shared still have 'accessUrl' and 'downloadUrl' properties, but these URLs require authentication. This allows frontend applications to:

  • Display file previews for logged-in users
  • Provide download links that work within authenticated sessions
  • Maintain security while offering convenient access

Authenticated URLs

Non-shared files return URLs with the following format:

  • Access URL: /index.php/core/preview?fileId={fileId}&x=1920&y=1080&a=1
  • Download URL: /index.php/apps/openregister/api/files/{fileId}/download

These URLs require the user to be authenticated to Nextcloud.
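
A frontend that only knows a file ID can assemble these URLs from the documented patterns. The sketch below does so; `buildAuthenticatedUrls` and its `baseUrl` parameter are illustrative assumptions. In practice the `accessUrl` and `downloadUrl` values returned by the API should be preferred over rebuilding them.

```javascript
// Build the authenticated preview and download URLs for a non-shared file,
// following the URL patterns listed above. Hypothetical helper.
function buildAuthenticatedUrls(baseUrl, fileId) {
  const base = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  const id = encodeURIComponent(String(fileId));
  return {
    accessUrl: `${base}/index.php/core/preview?fileId=${id}&x=1920&y=1080&a=1`,
    downloadUrl: `${base}/index.php/apps/openregister/api/files/${id}/download`,
  };
}

const urls = buildAuthenticatedUrls('https://your-domain.com/', 456);
console.log(urls.accessUrl);
// https://your-domain.com/index.php/core/preview?fileId=456&x=1920&y=1080&a=1
console.log(urls.downloadUrl);
// https://your-domain.com/index.php/apps/openregister/api/files/456/download
```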

Example Response (Non-Shared File)

{
  "attachment": {
    "id": 456,
    "title": "confidential-report.pdf",
    "accessUrl": "https://your-domain.com/index.php/core/preview?fileId=456&x=1920&y=1080&a=1",
    "downloadUrl": "https://your-domain.com/index.php/apps/openregister/api/files/456/download",
    "published": null,
    "size": 1234567,
    "type": "application/pdf"
  }
}

3. Logo/Image Metadata from File Properties

When a schema is configured to extract metadata fields like 'image' or 'logo' from file properties, the system automatically extracts the public share URL (or authenticated URL if not shared) and stores it in the object metadata.

Configuration

{
  "properties": {
    "logo": {
      "type": "file",
      "allowedTypes": ["image/png", "image/jpeg"],
      "autoPublish": true
    }
  },
  "configuration": {
    "objectImageField": "logo"
  }
}

Result

The object's '@self.image' field will contain the share URL:

{
  "id": "12345",
  "title": "Company A",
  "logo": {
    "id": 789,
    "accessUrl": "https://your-domain.com/index.php/s/XyZ789",
    "type": "image/png"
  },
  "@self": {
    "name": "Company A",
    "image": "https://your-domain.com/index.php/s/XyZ789"
  }
}

This makes it easy to display company logos, product images, or other visual metadata in listings and search results.

4. File Deletion via API

Files can be deleted by setting the file property to 'null' (for single file properties) or an empty array (for array file properties).

Single File Deletion

PUT /api/objects/{register}/{schema}/{id}
Content-Type: application/json

{
  "title": "Updated Title",
  "attachment": null
}

This will:

  • Delete the file from Nextcloud storage
  • Remove the file record from the database
  • Set the 'attachment' property to 'null' in the object data

File Array Deletion

PUT /api/objects/{register}/{schema}/{id}
Content-Type: application/json

{
  "title": "Updated Gallery",
  "images": []
}

This will:

  • Delete all files in the array from Nextcloud storage
  • Remove all file records from the database
  • Set the 'images' property to an empty array in the object data

Use Cases

  • Privacy Compliance: Remove sensitive files upon user request
  • Storage Management: Clean up unused files
  • Data Lifecycle: Remove temporary or expired files
  • Error Correction: Remove incorrectly uploaded files

Executable File Blocking

OpenRegister automatically blocks executable files from being uploaded for security reasons. This prevents malicious code execution and protects your Nextcloud instance.

What is Blocked

Blocked File Types

Windows Executables

  • .exe, .bat, .cmd, .com, .msi, .scr
  • .vbs, .vbe, .js, .jse, .wsf, .wsh
  • .ps1 (PowerShell), .dll

Unix/Linux Executables

  • .sh, .bash, .csh, .ksh, .zsh
  • .run, .bin, .app
  • .deb, .rpm (package files)

Scripts & Code

  • .php, .phtml, .php3, .php4, .php5, .phps, .phar
  • .py, .pyc, .pyo, .pyw (Python)
  • .pl, .pm, .cgi (Perl)
  • .rb, .rbw (Ruby)
  • .jar, .war, .ear, .class (Java)

Containers & Packages

  • .appimage, .snap, .flatpak
  • .dmg, .pkg, .command (macOS)
  • .apk (Android)

Binary Formats

  • .elf, .out, .o, .so, .dylib

Detection Methods

OpenRegister uses multiple layers of detection:

1. File Extension Check

Checks the file extension against a blacklist of dangerous extensions.

2. Magic Bytes Detection

Checks the first bytes of the file content for executable signatures:

  • MZ - Windows PE/EXE files
  • \x7FELF - Linux/Unix ELF executables
  • #!/bin/sh - Shell scripts
  • #!/bin/bash - Bash scripts
  • <?php - PHP scripts
  • \xCA\xFE\xBA\xBE - Java class files
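
To make the magic-bytes idea concrete, the sketch below compares the start of a byte buffer against the signatures listed above. It is an illustration under assumptions: the names are hypothetical, and it only matches at offset 0, whereas the actual check may scan further into the file.

```javascript
// Convert an ASCII string to an array of byte values.
const ascii = (s) => Array.from(s, (c) => c.charCodeAt(0));

// Signatures from the list above, expressed as byte sequences.
const SIGNATURES = [
  { name: 'Windows PE/EXE', bytes: ascii('MZ') },
  { name: 'ELF executable', bytes: [0x7f, ...ascii('ELF')] },
  { name: 'Shell script', bytes: ascii('#!/bin/sh') },
  { name: 'Bash script', bytes: ascii('#!/bin/bash') },
  { name: 'PHP script', bytes: ascii('<?php') },
  { name: 'Java class file', bytes: [0xca, 0xfe, 0xba, 0xbe] },
];

// Return the name of the first signature the buffer starts with, or null.
function detectExecutableSignature(bytes) {
  for (const sig of SIGNATURES) {
    const matches =
      bytes.length >= sig.bytes.length &&
      sig.bytes.every((b, i) => bytes[i] === b);
    if (matches) return sig.name;
  }
  return null;
}

console.log(detectExecutableSignature(Uint8Array.from([0x4d, 0x5a, 0x90, 0x00]))); // "Windows PE/EXE"
console.log(detectExecutableSignature(Uint8Array.from(ascii('%PDF-1.4'))));        // null
```

Because the check looks at content rather than the filename, renaming `malware.exe` to `document.txt` does not change the result. Short signatures such as `MZ` can also match harmless text files that happen to start with those letters, so a real implementation has to weigh false positives.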

3. MIME Type Validation

Blocks dangerous MIME types:

  • application/x-executable
  • application/x-dosexec
  • application/x-msdownload
  • application/x-sh
  • application/x-php
  • text/x-shellscript
  • And more...

Default Behavior: Blocked

By default, ALL executable files are blocked.

POST /api/registers/docs/schemas/document/objects
{
"title": "My Document",
"attachment": "script.sh" // ❌ BLOCKED!
}

Response:

{
"error": "File at attachment is an executable file (.sh). Executable files are blocked for security reasons. Allowed formats: documents, images, archives, data files."
}

If you absolutely need to allow executables (e.g., for a software repository), you can set allowExecutables: true in your schema:

{
"properties": {
"softwarePackage": {
"type": "file",
"allowExecutables": true, // ⚠️ DANGEROUS!
"allowedTypes": ["application/x-deb"] // Still enforce MIME type
}
}
}

⚠️ WARNING: Only use allowExecutables: true if:

  • You absolutely trust the source of files
  • Users are administrators only
  • You have other security measures in place (virus scanning, sandboxing)
  • You understand the security risks

Examples

✅ Safe Files (Allowed by Default)

# Documents
curl -X POST '/api/registers/docs/schemas/document/objects' \
-F 'title=Report' \
-F 'attachment=@report.pdf' # ✅ OK

# Images
curl -X POST '/api/registers/docs/schemas/document/objects' \
-F 'title=Photo' \
-F 'image=@photo.jpg' # ✅ OK

# Archives
curl -X POST '/api/registers/docs/schemas/document/objects' \
-F 'title=Data' \
-F 'data=@archive.zip' # ✅ OK (ZIPs are allowed unless they're JARs)

❌ Blocked Files (Default)

# Windows executable
curl -X POST '/api/registers/docs/schemas/document/objects' \
-F 'title=Software' \
-F 'file=@program.exe' # ❌ BLOCKED

# Shell script
curl -X POST '/api/registers/docs/schemas/document/objects' \
-F 'title=Script' \
-F 'file=@setup.sh' # ❌ BLOCKED

# PHP script
curl -X POST '/api/registers/docs/schemas/document/objects' \
-F 'title=Code' \
-F 'file=@index.php' # ❌ BLOCKED

🎭 Bypassing Attempts (Also Blocked!)

OpenRegister detects renamed executables:

# Renamed EXE to TXT - Still blocked by magic bytes!
mv malware.exe document.txt
curl -X POST '/api/.../' -F 'file=@document.txt' # ❌ BLOCKED (magic bytes: MZ)

# PHP file renamed to JPG - Still blocked!
mv shell.php image.jpg
curl -X POST '/api/.../' -F 'file=@image.jpg' # ❌ BLOCKED (detects <?php)

Schema Configuration

Basic File Upload (Executables Blocked)

{
  "slug": "document",
  "properties": {
    "title": {
      "type": "string"
    },
    "attachment": {
      "type": "file",
      "allowedTypes": ["application/pdf", "application/msword"],
      "maxSize": 10485760 // 10MB
      // allowExecutables defaults to false
    }
  }
}

Security Recommendations

1. Keep Executables Blocked (Default)

DO:

  • ✅ Use the default behavior (block executables)
  • ✅ Only allow documents, images, archives
  • ✅ Combine with virus scanning (ClamAV)

DON'T:

  • ❌ Set allowExecutables: true unless absolutely necessary
  • ❌ Allow untrusted users to upload files to executable-allowed schemas
  • ❌ Assume file extensions are safe

2. Layer Your Security

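Even with executable blocking in place, add further safeguards such as virus scanning (for example ClamAV) and sandboxing, and restrict who can upload to schemas that allow executables.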

3. Monitor and Log

All blocked uploads are logged:

# Check logs for blocked attempts
docker logs master-nextcloud-1 2>&1 | grep "Executable file upload blocked"

Performance Impact

Minimal!

  • Extension check: < 0.1ms
  • Magic bytes check: < 1ms (only checks first 1KB)
  • MIME type check: < 0.1ms

Total overhead: ~1-2ms per file upload

Frequently Asked Questions

Q: Can I upload ZIP files? A: ✅ Yes! ZIP files (.zip) are allowed by default. Only executable ZIPs like JARs are blocked.

Q: What about JavaScript files (.js)? A: ❌ Blocked by default (can be executed in browsers). Use JSON or TXT for data.

Q: Can I upload Python notebooks (.ipynb)? A: ✅ Yes! .ipynb is JSON format, not an executable. Allowed by default.

Q: What if I need to share code files? A: Use:

  • Text files (.txt) with code inside
  • Archives (.zip or .tar.gz) containing code
  • Git repositories
  • Dedicated code hosting (GitHub, GitLab)

Q: Does this protect against all malware? A: No! This blocks known executable formats only. Malicious documents (PDFs with exploits, Office macros) are not detected by this check and need virus scanning, for example with ClamAV.

Best Practices

  1. Define File Types: Establish clear guidelines for what file types are allowed
  2. Set Size Limits: Define appropriate size limits for different file types
  3. Use Metadata: Add relevant metadata to improve searchability and context
  4. Consider Storage: Choose appropriate storage backends based on file types and access patterns
  5. Implement Retention Policies: Define how long files should be kept
  6. Plan for Backup: Ensure files are included in backup strategies
  7. Consider Performance: Optimize file storage for your access patterns
  8. Use Auto-Publish Wisely: Only enable property-level 'autoPublish' for files that should be publicly accessible. Remember: property 'autoPublish' (file sharing) is different from schema 'autoPublish' (object publishing)
  9. Document File Deletion: Maintain audit trails when files are deleted for compliance
  10. Handle Authentication: Use authenticated URLs for sensitive files
  11. Keep Executables Blocked: Use the default executable blocking behavior unless absolutely necessary
  12. Layer Security: Combine executable blocking with virus scanning, since neither measure is sufficient on its own

Conclusion

Files in Open Register bridge the gap between structured data and unstructured content, providing a comprehensive solution for managing all types of information in your application. With advanced features like auto-sharing, authenticated access, metadata extraction, and flexible deletion options, Open Register creates a unified system where all your data—structured and unstructured—works together seamlessly.


Technical Architecture

This section visualizes the architecture and data flow of the file handling system.

[Figure: File Upload and Processing Flow]

[Figure: File Property Processing Pipeline]

[Figure: Text Extraction Process]

[Figure: File Storage Architecture]

[Figure: File Type Compatibility Matrix]

[Figure: File Text Extraction Settings]

File Chunking for Solr

Note: As of v0.2.7, chunking happens during text extraction, not during Solr indexing. Chunks are stored in the database and reused.
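As an illustration of what chunking at extraction time could look like, the sketch below splits extracted text into fixed-size, overlapping chunks. The chunk size, the overlap, and the function name `chunkText` are assumptions, since this page does not state OpenRegister's actual chunking strategy. The field names `chunk_index` and `chunk_text` are borrowed from the search result fields used in the search example on this page.

```javascript
// Sketch only: fixed-size character chunks with overlap (sizes are assumptions).
function chunkText(text, chunkSize = 1000, overlap = 100) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('Require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const chunks = [];
  const step = chunkSize - overlap; // each chunk starts this far after the previous one
  for (let start = 0, index = 0; start < text.length; start += step, index++) {
    chunks.push({ chunk_index: index, chunk_text: text.slice(start, start + chunkSize) });
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Chunks are computed once and can then be stored and reused by any consumer.
const chunks = chunkText('abcdefghij', 4, 1);
console.log(chunks.map((c) => c.chunk_text));
```

The overlap keeps a little shared context between neighboring chunks, so a phrase that straddles a chunk boundary can still match in search.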

Performance Characteristics

File Upload Performance:

Small files (<1MB):     ~100-200ms
Medium files (1-10MB):  ~500ms-2s
Large files (10-100MB): ~2-10s
Very large (>100MB):    ~10-60s

Text Extraction Performance:

LLPhant:
- TXT/MD/HTML: <1s (instant)
- PDF (10 pages): 2-5s (library parsing)
- DOCX: 3-8s (library parsing)
- Images: N/A (not supported)

Dolphin AI:
- TXT/MD/HTML: 1-2s (API latency)
- PDF (10 pages): 5-10s (AI processing)
- DOCX: 4-8s (AI processing)
- Images (OCR): 5-15s (OCR + AI)

Chunking and Indexing:

Text chunking:  <100ms for 100KB text (now part of extraction)
Solr indexing:  ~50-200ms per document (reads pre-chunked data)
Batch indexing: ~500ms for 100 chunks (faster with pre-chunked data)

Note: Since v0.2.7, chunking is performed once during text extraction and stored in the database. This makes Solr indexing faster and allows chunks to be reused for vector embeddings, AI processing, or any other service that needs chunked text.

Code Examples

Processing File Upload

use OCA\OpenRegister\Service\FileService;

// $fileService is assumed to be a FileService instance (for example, injected
// by the app container); $object is the object the file is attached to.

// Create file from base64
$fileMetadata = $fileService->createFile(
    objectEntity: $object,
    fileData: [
        'content' => 'data:image/jpeg;base64,/9j/4AAQ...',
        'tags' => ['profile', 'avatar'],
    ]
);

// Create file from URL
$fileMetadata = $fileService->createFile(
    objectEntity: $object,
    fileData: [
        'url' => 'https://example.com/document.pdf',
        'tags' => ['imported', 'external'],
    ]
);

// Access file metadata
$fileId = $fileMetadata['id'];
$shareLinkUrl = $fileMetadata['accessUrl'];
$downloadUrl = $fileMetadata['downloadUrl'];
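The `content` value in the base64 example above is a data URI that carries both the MIME type and the encoded bytes. The sketch below shows one way such a value could be split apart on the client before sending, for instance to check the size or type up front. `parseDataUri` is a hypothetical helper and not part of the FileService API.

```javascript
// Hypothetical helper: splits a base64 data URI into MIME type and raw bytes.
function parseDataUri(dataUri) {
  const match = /^data:([^;,]+);base64,(.*)$/s.exec(dataUri);
  if (!match) {
    throw new Error('Not a base64 data URI');
  }
  const [, mimeType, payload] = match;
  return { mimeType, bytes: Buffer.from(payload, 'base64') };
}

// '/9j/4AAQ' is the start of a base64-encoded JPEG, as in the example above.
const { mimeType, bytes } = parseDataUri('data:image/jpeg;base64,/9j/4AAQ');
console.log(mimeType, bytes.length, bytes.toString('hex'));
```

The decoded bytes begin with `ff d8 ff`, the JPEG signature, which is the kind of content marker a magic-bytes check looks at instead of trusting the declared MIME type.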

Text Extraction

use OCA\OpenRegister\Service\FileTextExtractionService;

// $extractionService and $fileTextMapper are assumed to be available
// (for example, injected by the app container).

// Extract text from file
$extractionService->extractText($fileId);

// Get extraction status
$fileText = $fileTextMapper->findByFileId($fileId);
$status = $fileText->getExtractionStatus(); // 'pending', 'processing', 'completed', 'failed'
$text = $fileText->getTextContent();

// Manually queue extraction
$extractionService->queueExtraction($fileId);

Searching File Content

// Search across file content in Solr
$results = $solrService->searchFiles([
    '_search' => 'contract terms',
    'mime_type' => 'application/pdf',
    '_limit' => 20,
]);

// Access chunk results
foreach ($results['hits'] as $hit) {
    $fileId = $hit['file_id'];
    $chunkIndex = $hit['chunk_index'];
    $text = $hit['chunk_text'];
    $highlighted = $hit['highlighted_text'];
}
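Because each hit is a chunk rather than a whole file, a single file can appear several times in the results. A consumer will often want to regroup the hits per file. The sketch below assumes the hits have been returned to a JavaScript client with the same field names as above (`file_id`, `chunk_index`, `chunk_text`). The function `groupHitsByFile` is hypothetical.

```javascript
// Hypothetical helper: groups chunk-level hits by file and orders them by chunk.
function groupHitsByFile(hits) {
  const byFile = new Map();
  for (const hit of hits) {
    if (!byFile.has(hit.file_id)) {
      byFile.set(hit.file_id, []);
    }
    byFile.get(hit.file_id).push(hit);
  }
  // Restore reading order within each file.
  for (const chunks of byFile.values()) {
    chunks.sort((a, b) => a.chunk_index - b.chunk_index);
  }
  return byFile;
}

const sampleHits = [
  { file_id: 7, chunk_index: 2, chunk_text: 'termination clause' },
  { file_id: 3, chunk_index: 0, chunk_text: 'contract terms' },
  { file_id: 7, chunk_index: 0, chunk_text: 'parties to the contract' },
];

for (const [fileId, chunks] of groupHitsByFile(sampleHits)) {
  console.log(fileId, chunks.map((c) => c.chunk_index));
}
```

A `Map` preserves the order in which files first appear, so if the hits arrive sorted by relevance, the best-matching file stays first.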

Testing

# Run file handling tests
vendor/bin/phpunit tests/Service/FileServiceTest.php

# Test text extraction
vendor/bin/phpunit tests/Service/FileTextExtractionServiceTest.php

# Test specific scenarios
vendor/bin/phpunit --filter testBase64FileUpload
vendor/bin/phpunit --filter testTextExtraction
vendor/bin/phpunit --filter testFileChunking

# Integration tests
vendor/bin/phpunit tests/Integration/FileIntegrationTest.php

Test Coverage:

  • File upload (base64, URL, file object)
  • File property processing
  • Text extraction (LLPhant, Dolphin)
  • Chunking and Solr indexing
  • File deletion
  • Share link generation
  • Auto-tagging