Archiving and Metadata Classification - Feature Summary
Overview
Complete feature documentation has been created for an Archiving and Metadata Classification system that builds on the chunk-based text extraction pipeline.
Status: 📝 Documentation Complete - NOT YET IMPLEMENTED
What is This Feature?
An intelligent classification and metadata extraction system for all content types (documents, objects, emails, chats) that:
- Classifies content using two approaches:
  - Constructive: User selects from curated taxonomy lists
  - Suggestive: AI proposes new categories/themes
- Extracts metadata automatically:
  - Keywords and search terms
  - Themes and topics
  - Document properties
  - Temporal information
Documentation Location
📄 Archiving and Metadata Classification
Key Concepts
Two Classification Approaches
1. Constructive Classification (Controlled Vocabulary)
User Action → Select Taxonomy → Select Category → Apply
Characteristics:
- Predefined categories
- Controlled vocabulary
- Consistent organization
- Manual or AI-assisted
Example Taxonomies:
- Document Types (Contracts, Policies, Reports)
- Content Themes (Technology, Business, HR)
- Records Management (Retention schedules)
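As a rough illustration of the constructive flow (select taxonomy, select category, apply), the sketch below checks a chosen category against a controlled vocabulary before recording it. The `Taxonomy` and `Classification` classes, their fields, and the `apply_classification` helper are hypothetical names chosen for this example; the feature is not implemented yet, so none of this reflects an actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Taxonomy:
    """A controlled vocabulary: a name plus its allowed categories (hypothetical shape)."""
    name: str
    categories: set[str] = field(default_factory=set)


@dataclass(frozen=True)
class Classification:
    """Links one chunk to one category of one taxonomy (hypothetical shape)."""
    chunk_id: int
    taxonomy: str
    category: str
    method: str = "manual"


def apply_classification(chunk_id: int, taxonomy: Taxonomy, category: str) -> Classification:
    """Reject categories outside the controlled vocabulary, otherwise record the link."""
    if category not in taxonomy.categories:
        raise ValueError(f"{category!r} is not a category of taxonomy {taxonomy.name!r}")
    return Classification(chunk_id=chunk_id, taxonomy=taxonomy.name, category=category)


document_types = Taxonomy("Document Types", {"Contracts", "Policies", "Reports"})
result = apply_classification(42, document_types, "Contracts")
```

Rejecting unknown categories at this point is what keeps the vocabulary controlled; new labels would enter only through the suggestive path.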
2. Suggestive Classification (AI-Powered)
AI Analysis → Generate Suggestions → User Review → Approve/Reject
Characteristics:
- AI-discovered themes
- Dynamic categories
- Requires approval
- Can be promoted to taxonomy
Example Suggestions:
- "API Integration" (confidence: 89%)
- "Cloud Security" (confidence: 76%)
- "Performance Optimization" (confidence: 82%)
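The review step of the suggestive flow could look roughly like the sketch below, which uses the three example suggestions above with confidences expressed as fractions. The `Suggestion` class, the `Status` values, and the `review` and `review_queue` helpers are assumptions made for illustration, not part of any existing code.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Suggestion:
    label: str
    confidence: float  # 0.0 to 1.0
    status: Status = Status.PENDING


def review(suggestion: Suggestion, approve: bool) -> Suggestion:
    """Record a reviewer's decision; only pending suggestions can be reviewed."""
    if suggestion.status is not Status.PENDING:
        raise ValueError(f"{suggestion.label!r} was already reviewed")
    suggestion.status = Status.APPROVED if approve else Status.REJECTED
    return suggestion


def review_queue(suggestions: list[Suggestion]) -> list[Suggestion]:
    """Pending suggestions, highest confidence first."""
    pending = [s for s in suggestions if s.status is Status.PENDING]
    return sorted(pending, key=lambda s: s.confidence, reverse=True)


suggestions = [
    Suggestion("API Integration", 0.89),
    Suggestion("Cloud Security", 0.76),
    Suggestion("Performance Optimization", 0.82),
]
queue = review_queue(suggestions)
review(queue[0], approve=True)  # approves "API Integration"
```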
Metadata Extraction
Automatic extraction of:
- Keywords: Important terms (TF-IDF, NER, LLM)
- Themes: High-level topics (Topic modeling, clustering)
- Search Terms: How users might search for this content
- Properties: Structured metadata (dates, authors, versions)
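To make the TF-IDF option for keywords concrete, here is a minimal standard-library sketch. It uses the plain `tf * ln(N / df)` weighting with no smoothing, which is one common variant and not necessarily the one the feature would adopt; the `tokenize` and `tfidf_keywords` functions and the sample chunks are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(chunks: list[str], top_n: int = 3) -> list[list[str]]:
    """Return the top_n highest-scoring terms for each chunk."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    # Document frequency: in how many chunks does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_chunks = len(chunks)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_chunks / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score descending, then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


chunks = [
    "the contract defines the retention schedule for the contract parties",
    "the api gateway handles the api authentication",
    "the report summarises the quarterly results",
]
keywords = tfidf_keywords(chunks, top_n=1)
```

Terms that occur in every chunk, such as "the" here, receive a score of zero and fall to the bottom of the ranking, which is why no stop-word list is needed in this small example.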
Database Schema
4 New Tables
- oc_openregister_classifications
  - Links chunks to taxonomy categories
  - Stores confidence and method
  - Multi-tenant (owner, organisation)
- oc_openregister_taxonomies
  - Stores taxonomy definitions
  - Hierarchical structures
  - Global or organization-specific
- oc_openregister_suggestions
  - AI-generated classification suggestions
  - Pending user review
  - Confidence scores
- oc_openregister_metadata
  - Extracted metadata (keywords, themes, etc.)
  - Linked to chunks and sources
  - Method and confidence tracking
All tables include multi-tenancy fields (owner, organisation).
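The following sketch shows one way the four tables might be laid out, using an in-memory SQLite database so that it runs anywhere. Only the table names and the `owner` and `organisation` fields come from this summary; every other column name, type, and default is an assumption, and the target database engine is not specified here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE oc_openregister_taxonomies (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES oc_openregister_taxonomies(id),  -- hierarchy
    owner TEXT,
    organisation TEXT  -- NULL is used here to mean a global taxonomy (assumption)
);
CREATE TABLE oc_openregister_classifications (
    id INTEGER PRIMARY KEY,
    chunk_id INTEGER NOT NULL,
    taxonomy_id INTEGER NOT NULL REFERENCES oc_openregister_taxonomies(id),
    confidence REAL,
    method TEXT,
    owner TEXT,
    organisation TEXT
);
CREATE TABLE oc_openregister_suggestions (
    id INTEGER PRIMARY KEY,
    chunk_id INTEGER NOT NULL,
    label TEXT NOT NULL,
    confidence REAL,
    status TEXT NOT NULL DEFAULT 'pending',
    owner TEXT,
    organisation TEXT
);
CREATE TABLE oc_openregister_metadata (
    id INTEGER PRIMARY KEY,
    chunk_id INTEGER,
    source_id INTEGER,
    kind TEXT NOT NULL,   -- e.g. keyword, theme, search_term, property
    value TEXT NOT NULL,
    confidence REAL,
    method TEXT,
    owner TEXT,
    organisation TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```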
Integration with Existing Features
1. Text Extraction Pipeline
File/Object → Chunks → [NEW] Classification + Metadata Extraction
Classification and metadata extraction run after chunking and reuse the existing chunk infrastructure.
2. GDPR Entity Tracking
Entities can become metadata:
- Person names → Keywords
- Organizations → Themes
- Locations → Properties
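The entity-to-metadata mapping above can be expressed as a small lookup table. In this sketch the entity type names, the record fields, and the `entities_to_metadata` helper are illustrative assumptions; the entity tracking feature may represent entities differently.

```python
# Mapping taken from the list above; the type names are illustrative.
ENTITY_TO_METADATA = {
    "person": "keyword",
    "organization": "theme",
    "location": "property",
}


def entities_to_metadata(entities: list[dict]) -> list[dict]:
    """Convert detected entities into metadata records, skipping unmapped types."""
    records = []
    for entity in entities:
        kind = ENTITY_TO_METADATA.get(entity["type"])
        if kind is None:
            continue  # e.g. phone numbers have no metadata counterpart in this sketch
        records.append({"kind": kind, "value": entity["value"], "method": "entity"})
    return records


detected = [
    {"type": "person", "value": "Jane Doe"},
    {"type": "location", "value": "Amsterdam"},
    {"type": "phone", "value": "555-0100"},
]
metadata_records = entities_to_metadata(detected)
```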
3. Search Enhancement
Classifications and metadata improve search:
- Filter by category
- Boost by theme relevance
- Faceted navigation
- Related content suggestions
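Category filtering and faceted navigation both reduce to simple operations over classified documents, as the sketch below suggests. The document dictionaries and the `filter_by_category` and `facet_counts` helpers are made up for the example; a real search backend would compute facets itself.

```python
from collections import Counter

documents = [
    {"id": 1, "category": "Contracts", "themes": ["Legal", "HR"]},
    {"id": 2, "category": "Policies", "themes": ["HR"]},
    {"id": 3, "category": "Contracts", "themes": ["Technology"]},
]


def filter_by_category(docs: list[dict], category: str) -> list[dict]:
    """Keep only documents classified under the given category."""
    return [d for d in docs if d["category"] == category]


def facet_counts(docs: list[dict], key: str) -> dict[str, int]:
    """Count values of a field; list-valued fields contribute each element once."""
    counts = Counter()
    for doc in docs:
        value = doc[key]
        counts.update(value if isinstance(value, list) else [value])
    return dict(counts)


category_facets = facet_counts(documents, "category")
contract_theme_facets = facet_counts(filter_by_category(documents, "Contracts"), "themes")
```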
4. Vector Search (RAG)
Metadata improves retrieval for AI answers:
- Filter vectors by classification
- Include metadata in context
- Theme-based retrieval
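Filtering vectors by classification can be pictured as a pre-filter followed by a similarity ranking. The sketch below uses two-dimensional toy vectors and cosine similarity; the chunk layout and the `retrieve` function are assumptions for illustration and say nothing about the vector store actually in use.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], chunks: list[dict], category: str, top_k: int = 2) -> list[dict]:
    """Keep only chunks with the given classification, then rank by similarity."""
    candidates = [c for c in chunks if category in c["classifications"]]
    ranked = sorted(candidates, key=lambda c: cosine(query, c["vector"]), reverse=True)
    return ranked[:top_k]


chunks = [
    {"id": 1, "vector": [1.0, 0.0], "classifications": {"Contracts"}},
    {"id": 2, "vector": [0.9, 0.1], "classifications": {"Reports"}},
    {"id": 3, "vector": [0.0, 1.0], "classifications": {"Contracts"}},
]
hits = retrieve([1.0, 0.0], chunks, "Contracts")
```

Chunk 2 is close to the query but is never considered, because the classification filter runs before the similarity ranking.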
User Interface Components
1. Classification Panel
- Display current classifications
- Add/remove classifications
- Bulk classification
- History tracking
2. Suggestion Review Panel
- View pending AI suggestions
- Approve/reject with one click
- Promote to taxonomy
- Bulk actions
3. Metadata Display
- Show extracted keywords, themes
- Edit metadata manually
- View confidence scores
- Method transparency
4. Taxonomy Manager
- Create/edit taxonomies
- Hierarchical editor
- Import/export
- Global vs organization scope
API Endpoints
Classifications
GET /api/classifications
POST /api/classifications
DELETE /api/classifications/{id}
POST /api/classifications/bulk
Suggestions
GET /api/suggestions?status=pending
POST /api/suggestions/{id}/review
POST /api/suggestions/bulk-approve
Taxonomies
GET /api/taxonomies
POST /api/taxonomies
PUT /api/taxonomies/{id}
DELETE /api/taxonomies/{id}
GET /api/taxonomies/{id}/export
Metadata
GET /api/metadata?source_id=123
PUT /api/metadata/{id}
POST /api/metadata/extract
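A client call to the suggestion review route might be assembled as in the sketch below, which builds the request without sending it. The base URL and the JSON body shape are assumptions; this summary lists only the routes, not their payloads.

```python
import json
from urllib.request import Request

BASE_URL = "https://example.com"  # hypothetical host


def build_review_request(suggestion_id: int, approve: bool) -> Request:
    """Build (but do not send) a POST to /api/suggestions/{id}/review."""
    # The JSON body shape is an assumption; the summary lists only the route.
    body = json.dumps({"action": "approve" if approve else "reject"}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/api/suggestions/{suggestion_id}/review",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_review_request(17, approve=True)
```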
Use Cases
1. Legal Document Management
- Classify contracts by type
- Extract parties, dates, jurisdictions
- Apply retention schedules
- Compliance tracking
2. Knowledge Base Organization
- AI discovers documentation themes
- Automatic categorization
- Improved search
- Dynamic taxonomy evolution
3. Email Archiving
- Classify emails (Business, HR, Legal, IT)
- Extract sender, recipient, subject
- Apply retention policies
- GDPR compliance
4. Multi-Language Content
- Language-aware classification
- Localized taxonomies
- Cross-language themes
- Better UX per language
5. Research Document Analysis
- Discover research themes
- Extract concepts and keywords
- Cluster similar papers
- Knowledge graph generation
Multi-Tenancy
All entities fully support multi-tenancy:
- owner field: User ID
- organisation field: Organisation UUID
- Inherited from source content
- Automatic filtering by access rights
- Organization-level taxonomies
- Data isolation guaranteed
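The two multi-tenancy rules above, inheritance from the source and filtering by access rights, could be sketched as follows. The record layout, the helper names, and the rule that membership of an organisation grants visibility are assumptions for this example; short strings stand in for the organisation UUIDs.

```python
def inherit_tenancy(source: dict, record: dict) -> dict:
    """Copy owner and organisation from the source content onto a derived record."""
    return {**record, "owner": source["owner"], "organisation": source["organisation"]}


def visible_to(records: list[dict], user_id: str, organisation_ids: set[str]) -> list[dict]:
    """Return records the user owns or that belong to one of their organisations."""
    return [
        r for r in records
        if r["owner"] == user_id or r["organisation"] in organisation_ids
    ]


records = [
    {"id": 1, "owner": "alice", "organisation": "org-a"},
    {"id": 2, "owner": "bob", "organisation": "org-b"},
    {"id": 3, "owner": "carol", "organisation": "org-a"},
]
visible = visible_to(records, user_id="alice", organisation_ids={"org-a"})
keyword = inherit_tenancy(records[1], {"kind": "keyword", "value": "retention"})
```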
Configuration
Settings panel includes:
Classification Settings
- Enable constructive/suggestive/both
- Confidence thresholds
- Auto-approve settings
- Suggestion methods
Metadata Extraction Settings
- Enable/disable by type
- Extraction methods
- Algorithm parameters
- Min confidence scores
Processing Settings
- On upload vs background
- Batch sizes
- Job intervals
- Manual triggers
Performance Characteristics
- Keyword extraction: 50-200 ms per chunk
- Theme extraction: 500-2000 ms per document (LLM)
- Classification suggestion: 200-1000 ms per chunk
- Metadata extraction: 100-500 ms per chunk
Storage (10,000 documents)
- Classifications: ~6 MB
- Suggestions: ~10 MB
- Metadata: ~10 MB
- Taxonomies: ~250 KB
- Total: ~26 MB
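The storage figures above can be extrapolated with simple arithmetic. The sketch below assumes that classifications, suggestions, and metadata grow linearly with the number of documents while the taxonomy table stays constant; both are assumptions for illustration, not measured behaviour.

```python
# Estimates from the list above, in MB, for the 10,000-document baseline.
ESTIMATES_MB = {"classifications": 6, "suggestions": 10, "metadata": 10, "taxonomies": 0.25}
BASELINE_DOCUMENTS = 10_000


def projected_storage_mb(documents: int) -> float:
    """Scale per-document tables linearly; keep the taxonomy table constant (assumption)."""
    factor = documents / BASELINE_DOCUMENTS
    per_document = sum(v for k, v in ESTIMATES_MB.items() if k != "taxonomies")
    return per_document * factor + ESTIMATES_MB["taxonomies"]


baseline_mb = projected_storage_mb(10_000)
million_docs_mb = projected_storage_mb(1_000_000)
```

Under these assumptions the baseline works out to 26.25 MB, matching the rounded total of about 26 MB, and one million documents would need roughly 2.6 GB.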
AI/LLM Integration
Methods Supported
- Topic Modeling (Unsupervised)
  - LDA, NMF algorithms
  - Probability distribution over topics
- LLM-Based Analysis (Supervised)
  - Prompt-based theme extraction
  - Structured output with confidence
- Clustering (Unsupervised)
  - Vector similarity clustering
  - Content grouping
- Hybrid (Recommended)
  - Combine multiple methods
  - Confidence voting
  - Best accuracy
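Confidence voting in the hybrid approach is not specified further in this summary. One plausible reading is sketched below: each method proposes labels with a confidence, a label's combined score is its average over all methods (a missing proposal counts as zero), and labels backed by too few methods are dropped. The `vote` function, the `min_methods` parameter, and the sample scores are all hypothetical.

```python
from collections import defaultdict


def vote(method_outputs: dict[str, dict[str, float]], min_methods: int = 2) -> dict[str, float]:
    """Average each label's confidence over all methods.

    A method that did not propose a label contributes 0 to that label.
    Labels proposed by fewer than min_methods methods are dropped.
    """
    totals = defaultdict(float)
    support = defaultdict(int)
    for scores in method_outputs.values():
        for label, confidence in scores.items():
            totals[label] += confidence
            support[label] += 1
    n_methods = len(method_outputs)
    return {
        label: totals[label] / n_methods
        for label in totals
        if support[label] >= min_methods
    }


outputs = {
    "topic_model": {"Cloud Security": 0.6, "API Integration": 0.9},
    "llm": {"Cloud Security": 0.9, "Billing": 0.7},
    "clustering": {"Cloud Security": 0.75, "API Integration": 0.6},
}
combined = vote(outputs)
```

Here "Billing" is discarded because only one method proposed it, and "API Integration" is penalised for the method that did not propose it.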
Future Enhancements
- Auto-Classification: Classify based on similar content
- Smart Suggestions: Learn from user feedback
- Cross-Reference: Link classifications across documents
- Visualization: Knowledge graphs, theme evolution
- Export/Import: Share taxonomies
- Templates: Pre-built taxonomies
- Validation Rules: Ensure consistency
- Bulk Operations: Reclassify multiple items
Diagrams Included
The documentation includes 7 Mermaid diagrams:
- Sources → Classification & Metadata flow (TB)
- Classification schema class diagram
- Suggestion workflow sequence diagram
- Complete processing pipeline flowchart
- UI mockups (4 panels: Classification, Suggestions, Metadata, Taxonomy Manager)
All fully editable in markdown source.
Implementation Considerations
Phase 1: Database Schema
- Create 4 new tables
- Add multi-tenancy fields
- Create indexes
Phase 2: Classification Service
- Constructive classification logic
- Taxonomy management
- Category assignment
Phase 3: Suggestion Engine
- AI integration (LLM/clustering)
- Confidence scoring
- Deduplication
Phase 4: Metadata Extraction
- Keyword extraction (TF-IDF, NER)
- Theme extraction (topic modeling, LLM)
- Search term generation
- Property extraction
Phase 5: User Interface
- Classification panel
- Suggestion review
- Metadata display
- Taxonomy manager
Phase 6: API
- All CRUD endpoints
- Bulk operations
- Export/import
Phase 7: Integration
- Connect to chunk pipeline
- Search enhancement
- RAG context enrichment
Phase 8: Testing & Deployment
- Unit tests
- Integration tests
- Performance testing
- User acceptance testing
Security & Compliance
- Access Control: User/organization-based
- Data Isolation: Multi-tenant safe
- Audit Trail: All classification changes logged
- GDPR: Metadata can include entity references
- Approval Workflow: Admin review for suggestions
Dependencies
Existing Features Required
- ✅ Chunk system (text extraction)
- ✅ Multi-tenancy infrastructure
- ✅ Background job system
New Dependencies
- Topic modeling library (e.g., Gensim)
- TF-IDF implementation
- LLM API access (OpenAI, etc.)
- Clustering algorithms (scikit-learn)
Optional Integrations
- External taxonomy services
- Knowledge graph systems
- Visualization libraries
Benefits
For Users
- ✅ Better content organization
- ✅ Easier discovery
- ✅ Automatic categorization
- ✅ Improved search results
For Administrators
- ✅ Centralized taxonomy management
- ✅ AI-assisted classification
- ✅ Compliance tracking
- ✅ Usage analytics
For Organizations
- ✅ Knowledge management
- ✅ Information governance
- ✅ Regulatory compliance
- ✅ Operational efficiency
Comparison with Entity Tracking
| Feature | Entity Tracking (GDPR) | Classification & Metadata |
|---|---|---|
| Purpose | Find PII for compliance | Organize and discover content |
| Focus | Persons, emails, phones | Categories, themes, keywords |
| Approach | Detection (what exists) | Assignment (what it means) |
| User Input | Minimal (review) | Active (select categories) |
| AI Role | Detection assistant | Suggestion engine |
| Compliance | GDPR, privacy laws | Records management |
| Output | Entity register | Taxonomy, metadata |
Complementary: Both work together for complete content intelligence.
Documentation Quality
The feature documentation includes:
✅ Complete concept explanation
✅ Two classification approaches detailed
✅ Database schema with SQL
✅ 7 Mermaid diagrams
✅ UI mockups in ASCII
✅ Complete API specification
✅ 5 detailed use cases
✅ Multi-tenancy fully covered
✅ Performance characteristics
✅ Integration points identified
✅ Implementation phases outlined
✅ Security and compliance addressed
Next Steps
Before Implementation
- Review with Stakeholders
  - Validate classification approaches
  - Confirm taxonomy requirements
  - Agree on AI methods
- Prioritize Features
  - Constructive vs suggestive first?
  - Which metadata types first?
  - UI vs API priority?
- Technical Decisions
  - LLM provider selection
  - Topic modeling approach
  - Taxonomy storage format
- Design Decisions
  - Default taxonomies to include
  - UI placement and flow
  - Admin vs user capabilities
Implementation Order
Recommended: Implement after text extraction and entity tracking are stable.
Reason: Builds on chunk infrastructure, complements entity tracking.
Timeline: 8-10 weeks after text extraction completion.
Questions for Stakeholders
- Should we prioritize constructive or suggestive classification?
- What taxonomies are most important (legal, technical, business)?
- Do you have existing taxonomy standards to import?
- What LLM provider should we use for suggestions?
- Should taxonomies be managed centrally or by organization?
- What metadata is most valuable for your use case?
- Should we auto-apply high-confidence suggestions (>85%)?
- How should we handle multi-language taxonomies?
Conclusion
Complete feature documentation has been created for an intelligent archiving and metadata classification system that:
✅ Provides flexible classification (constructive + suggestive)
✅ Extracts rich metadata automatically
✅ Fully multi-tenant and secure
✅ Integrates with existing chunk pipeline
✅ Enhances search and discovery
✅ Complements GDPR entity tracking
✅ Includes complete database schema
✅ Defines all API endpoints
✅ Specifies UI components
✅ Identifies implementation phases
Status: Ready for stakeholder review and prioritization.
Do NOT implement yet - this is documentation only for planning purposes.