# Indexed Entities Design This document describes the design for indexed entity types in Implexus: Category, Catalogue, and Index. These entities use pre-built indices for fast lookups instead of scanning all documents. ## Overview ### Problem Statement The current implementation of virtual entities has performance issues: | Entity | Current Behavior | Problem | |--------|-----------------|---------| | Category | Scans all documents, evaluates expression per document | O(n) per query | | Index | Scans all documents, does substring matching | O(n) per query | | Catalogue | Does not exist | Need key-based groupings | ### Solution: Pre-built Indices All indexed entities will: 1. **Build indices at creation time** - Scan existing documents and populate index structures 2. **Update indices on changes** - Via hooks when documents are created, updated, or deleted 3. **Use fast key lookups** - O(1) or O(log n) queries instead of O(n) scans ### Entity Type Semantics ```mermaid graph TB subgraph Entity Types Container[Container - Arbitrary children] Document[Document - Properties] Category[Category - Predicate filter] Catalogue[Catalogue - Key groupings] Index[Index - Text search] end subgraph Category Behavior Category -->|contains| DocCat[Documents matching predicate] end subgraph Catalogue Behavior Catalogue -->|has child| Group1[Key Group: john] Catalogue -->|has child| Group2[Key Group: jane] Group1 -->|contains| Doc1[Post by john] Group2 -->|contains| Doc2[Post by jane] end subgraph Index Behavior Index -->|search| Result[Search Result Container] Result -->|contains| Doc3[Matching documents] end ``` ## Entity Definitions ### Container (Unchanged) - **Purpose**: Arbitrary child storage like a filesystem folder - **Children**: Any entity type, explicitly added - **Implementation**: Current implementation is correct, no changes needed ### Category (Redesigned) - **Purpose**: Contains documents matching a boolean predicate expression - **Type Label**: Filters documents by application-defined type - **Expression**: Boolean expression (e.g., `!draft`, `status == "active"`, `age > 18`) - **Children**: Documents where predicate evaluates to `true` - **Example**: Category at `/posts/active` with expression `!draft` returns all posts where `draft == false` ### Catalogue (New Entity Type) - **Purpose**: Groups documents by unique values of an expression - **Type Label**: Filters documents by application-defined type - **Expression**: Value expression (e.g., `author`, `category`, `status`) - **Children**: Virtual containers keyed by expression result - **Example**: Catalogue at `/posts/by-author` with expression `author`: - `/posts/by-author/john` → all posts where `author == "john"` - `/posts/by-author/jane` → all posts where `author == "jane"` - **Semantics**: Similar to `Dictionary>` or `GROUP BY` in SQL ### Index (Redesigned) - **Purpose**: Full-text search on a field with SQL LIKE-style patterns - **Type Label**: Filters documents by application-defined type - **Expression**: Field name to index (e.g., `content`, `title`) - **Children**: Search results as virtual containers - **Patterns Supported**: - `*world*` - contains "world" - `hello*` - starts with "hello" - `*goodbye` - ends with "goodbye" - **Configuration**: Case sensitivity option (default: insensitive) ## Storage Schema ### Key Prefixes All index data uses scoped keys to prevent collisions between entities: ``` index:::: ``` ### Type Label Index To efficiently find all documents of a given type, maintain a global type index: ``` typeidx: → [doc_path1, doc_path2, ...] ``` **Storage**: `typeidx:User` → `["/users/john", "/users/jane", ...]` ### Category Storage Schema Category stores document paths that match the predicate: ``` cat::members → [doc_path1, doc_path2, ...] cat::config → {type_label: "...", expression: "..."} ``` **Example**: ``` cat:/posts/active:members → ["/posts/post1", "/posts/post3"] cat:/posts/active:config → {type_label: "Post", expression: "!draft"} ``` ### Catalogue Storage Schema Catalogue stores groupings keyed by expression result: ``` catl::group: → [doc_path1, doc_path2, ...] catl::keys → ["key1", "key2", ...] catl::config → {type_label: "...", expression: "..."} ``` **Example**: ``` catl:/posts/by-author:group:john → ["/posts/post1", "/posts/post3"] catl:/posts/by-author:group:jane → ["/posts/post2"] catl:/posts/by-author:keys → ["john", "jane"] catl:/posts/by-author:config → {type_label: "Post", expression: "author"} ``` ### Index Storage Schema (Hierarchical N-gram) The Index uses a hierarchical n-gram structure for arbitrary-length pattern matching: #### Primary Trigram Index ``` idx::tri: → [doc_path1, doc_path2, ...] ``` **Example**: Text "hello world" produces trigrams: `hel`, `ell`, `llo`, `lo `, `o w`, ` wo`, `wor`, `orl`, `rld` ``` idx:/posts/search:tri:hel → ["/posts/post1", "/posts/post2"] idx:/posts/search:tri:ell → ["/posts/post1"] ... ``` #### Bigram Reverse Index Maps bigrams to trigrams containing them (for 2-character queries): ``` idx::bi: → [trigram1, trigram2, ...] ``` **Example**: ``` idx:/posts/search:bi:he → ["hel"] idx:/posts/search:bi:el → ["hel", "ell"] idx:/posts/search:bi:ll → ["ell", "llo"] ``` #### Unigram Reverse Index Maps characters to bigrams starting with them (for 1-character queries): ``` idx::uni: → [bigram1, bigram2, ...] ``` **Example**: ``` idx:/posts/search:uni:h → ["he"] idx:/posts/search:uni:e → ["el"] ``` #### Document Content Cache Stores the indexed field value for each document (used for verification and rebuilds): ``` idx::doc: → "indexed field value" ``` #### Configuration ``` idx::config → {type_label: "...", expression: "...", case_sensitive: false} ``` ### Index Size Analysis | Index Type | Entries Per Doc | Stores | Relative Size | |------------|-----------------|--------|---------------| | Trigram→Docs | O(L) where L = text length | Doc paths | Primary cost | | Bigram→Trigrams | O(L) | Strings only | Small | | Unigram→Bigrams | O(unique chars) | Strings only | Tiny | | Doc Content | 1 | Field value | Moderate | **Example**: For 10,000 documents averaging 1KB text: - Trigram index: ~10M entries (document paths) - Bigram reverse: ~2,500 entries (just strings) - Unigram reverse: ~50 entries (just strings) The reverse indexes have minimal overhead since they store short strings, not document paths. ## Query Algorithms ### Category Query ``` INPUT: Category at path P OUTPUT: All documents matching predicate 1. Load config to get type_label and expression 2. Lookup cat:P:members → [doc_paths] 3. For each doc_path, instantiate Document entity 4. Return documents ``` **Complexity**: O(k) where k = number of matching documents ### Catalogue Query ``` INPUT: Catalogue at path P, optional key K OUTPUT: If K provided: documents with that key value Otherwise: list of available keys # Get specific group get_child(K): 1. Lookup catl:P:group:K → [doc_paths] 2. Return VirtualContainer with documents # List available keys child_names: 1. Lookup catl:P:keys → [key1, key2, ...] 2. Return keys ``` **Complexity**: O(1) for key lookup, O(k) for listing groups ### Index Query ``` INPUT: Index at path P, search pattern S OUTPUT: Documents matching pattern parse_pattern(S): if S starts with * and ends with *: return CONTAINS(S[1:-1]) if S starts with *: return ENDS_WITH(S[1:]) if S ends with *: return STARTS_WITH(S[:-1]) return EXACT(S) search(pattern): terms = parse_pattern(pattern) if terms.type == CONTAINS and len(terms.value) >= 3: # Use trigram index trigrams = extract_trigrams(terms.value) doc_sets = [lookup idx:P:tri:t for t in trigrams] candidates = intersect(doc_sets) return [d for d in candidates if d contains terms.value] if terms.type == CONTAINS and len(terms.value) == 2: # Use bigram→trigram→docs bigram = terms.value trigrams = lookup idx:P:bi:bigram doc_sets = [lookup idx:P:tri:t for t in trigrams] candidates = union(doc_sets) return [d for d in candidates if d contains terms.value] if terms.type == CONTAINS and len(terms.value) == 1: # Use unigram→bigram→trigram→docs char = terms.value bigrams = lookup idx:P:uni:char trigrams = union([lookup idx:P:bi:b for b in bigrams]) doc_sets = [lookup idx:P:tri:t for t in trigrams] candidates = union(doc_sets) return [d for d in candidates if d contains terms.value] if terms.type == STARTS_WITH: # Prefix search using first trigram prefix = terms.value first_tri = prefix[0:3] candidates = lookup idx:P:tri:first_tri return [d for d in candidates if d starts_with prefix] if terms.type == ENDS_WITH: # Suffix search using last trigram suffix = terms.value last_tri = suffix[-3:] candidates = lookup idx:P:tri:last_tri return [d for d in candidates if d ends_with suffix] ``` **Complexity**: - 3+ char patterns: O(k) where k = candidates from trigram intersection - 1-2 char patterns: O(k) with more indirection but still indexed - Prefix/suffix: O(k) using boundary trigrams ## Hook/Notification System ### Overview The hook system notifies indexed entities when documents are created, modified, or deleted so they can update their indices. ```mermaid sequenceDiagram participant App as Application participant Engine as Engine participant Doc as Document participant Hook as HookManager participant Cat as Category participant Catl as Catalogue participant Idx as Index App->>Engine: create_document with type_label Engine->>Doc: create Engine->>Hook: entity_created(doc) Hook->>Hook: find_interested_entities(type_label) loop For each interested entity Hook->>Cat: on_document_created(doc) Hook->>Catl: on_document_created(doc) Hook->>Idx: on_document_created(doc) end ``` ### Hook Registration Indexed entities register interest in specific type labels: ```vala interface IndexHook { // The type label this entity is interested in public abstract string type_label { get; } // The entity path for index scoping public abstract EntityPath entity_path { get; } // Called when a document is created public abstract void on_document_created(Entity doc) throws Error; // Called when a document is modified public abstract void on_document_modified(Entity doc, Set changed_properties) throws Error; // Called when a document is deleted public abstract void on_document_deleted(EntityPath doc_path) throws Error; } ``` ### HookManager Implementation ```vala class HookManager { // type_label → list of registered hooks private Dictionary> _hooks_by_type; // Register a hook for a type label public void register_hook(IndexHook hook) { var hooks = _hooks_by_type.get_or_default(hook.type_label, new List()); hooks.add(hook); } // Unregister a hook public void unregister_hook(IndexHook hook) { var hooks = _hooks_by_type.get(hook.type_label); if (hooks != null) { hooks.remove(hook); } } // Notify hooks of document creation public void notify_created(Entity doc) { var hooks = _hooks_by_type.get(doc.type_label); if (hooks != null) { foreach (var hook in hooks) { hook.on_document_created(doc); } } } // Notify hooks of document modification public void notify_modified(Entity doc, Set changed_properties) { var hooks = _hooks_by_type.get(doc.type_label); if (hooks != null) { foreach (var hook in hooks) { hook.on_document_modified(doc, changed_properties); } } } // Notify hooks of document deletion public void notify_deleted(EntityPath doc_path, string type_label) { var hooks = _hooks_by_type.get(type_label); if (hooks != null) { foreach (var hook in hooks) { hook.on_document_deleted(doc_path); } } } } ``` ### Engine Integration The Engine emits signals that the HookManager listens to: ```vala class EmbeddedEngine : Object, Engine { private HookManager _hooks; // Connect signals to hook manager construct { _hooks = new HookManager(); this.entity_created.connect((entity) => { if (entity.entity_type == EntityType.DOCUMENT) { _hooks.notify_created(entity); } }); this.entity_modified.connect((entity) => { if (entity.entity_type == EntityType.DOCUMENT) { _hooks.notify_modified(entity, entity.modified_properties); } }); this.entity_deleted.connect((path, type_label) => { _hooks.notify_deleted(path, type_label); }); } } ``` ## Index Update Logic ### Category Index Update ```vala class Category : AbstractEntity, IndexHook { public void on_document_created(Entity doc) { update_index_for_document(doc, true); } public void on_document_modified(Entity doc, Set changed_properties) { // Re-evaluate predicate and update index update_index_for_document(doc, evaluate_predicate(doc)); } public void on_document_deleted(EntityPath doc_path) { remove_from_index(doc_path); } private void update_index_for_document(Entity doc, bool should_include) { var members_key = "cat:%s:members".printf(_path.to_string()); var current_members = load_member_set(members_key); if (should_include) { current_members.add(doc.path.to_string()); } else { current_members.remove(doc.path.to_string()); } save_member_set(members_key, current_members); } private bool evaluate_predicate(Entity doc) { // Evaluate the boolean expression against the document return ExpressionEvaluator.evaluate_boolean(_expression, doc.properties); } } ``` ### Catalogue Index Update ```vala class Catalogue : AbstractEntity, IndexHook { public void on_document_created(Entity doc) { update_index_for_document(doc); } public void on_document_modified(Entity doc, Set changed_properties) { // Check if the grouped expression changed if (changed_properties.contains(_expression)) { // Get old value and remove from old group var old_value = get_old_value(doc); remove_from_group(old_value, doc.path); // Add to new group update_index_for_document(doc); } } public void on_document_deleted(EntityPath doc_path) { // Need to look up the value before removing var value = get_indexed_value(doc_path); remove_from_group(value, doc_path); } private void update_index_for_document(Entity doc) { var value = evaluate_expression(doc); if (value != null) { add_to_group(value, doc.path); } } private void add_to_group(string key, EntityPath doc_path) { var group_key = "catl:%s:group:%s".printf(_path.to_string(), key); var members = load_member_set(group_key); members.add(doc_path.to_string()); save_member_set(group_key, members); // Also update keys list if this is a new key var keys_key = "catl:%s:keys".printf(_path.to_string()); var keys = load_key_set(keys_key); if (!keys.contains(key)) { keys.add(key); save_key_set(keys_key, keys); } } } ``` ### Index Entity Update (N-gram) ```vala class Index : AbstractEntity, IndexHook { private bool _case_sensitive; public void on_document_created(Entity doc) { index_document(doc); } public void on_document_modified(Entity doc, Set changed_properties) { if (changed_properties.contains(_expression)) { // Re-index with new content reindex_document(doc); } } public void on_document_deleted(EntityPath doc_path) { unindex_document(doc_path); } private void index_document(Entity doc) { var content = get_field_value(doc); if (content == null) return; var normalized = _case_sensitive ? content : content.down(); var doc_path_str = doc.path.to_string(); // Store content cache var cache_key = "idx:%s:doc:%s".printf(_path.to_string(), doc_path_str); storage.set(cache_key, normalized); // Index trigrams var trigrams = extract_trigrams(normalized); foreach (var tri in trigrams) { add_to_trigram_index(tri, doc_path_str); } // Update bigram reverse index var bigrams = extract_bigrams(normalized); foreach (var bi in bigrams) { add_to_bigram_reverse(bi, trigrams); } // Update unigram reverse index var unigrams = extract_unique_chars(normalized); foreach (var uni in unigrams) { add_to_unigram_reverse(uni, bigrams); } } private void unindex_document(EntityPath doc_path) { var doc_path_str = doc_path.to_string(); var cache_key = "idx:%s:doc:%s".printf(_path.to_string(), doc_path_str); // Get content to unindex var content = storage.get(cache_key); if (content == null) return; var trigrams = extract_trigrams(content); foreach (var tri in trigrams) { remove_from_trigram_index(tri, doc_path_str); } // Note: We don't remove from reverse indexes as they're just string mappings // and don't reference specific documents storage.delete(cache_key); } private void reindex_document(Entity doc) { unindex_document(doc.path); index_document(doc); } } ``` ## API Changes ### EntityType Enum Add `CATALOGUE` to the entity type enumeration: ```vala public enum EntityType { CONTAINER, DOCUMENT, CATEGORY, CATALOGUE, // NEW INDEX; public bool is_indexed() { return this == CATEGORY || this == CATALOGUE || this == INDEX; } } ``` ### Entity Interface Changes ```vala public interface Entity : Object { // ... existing members ... // New method for indexed entities public abstract void rebuild_index() throws EngineError; // Configuration for case sensitivity (Index only) public abstract bool case_sensitive { get; } } ``` ### Container Interface Changes Add method to create Catalogue: ```vala public interface Entity : Object { // ... existing methods ... // New factory method for Catalogue public abstract Entity? create_catalogue( string name, string type_label, string expression ) throws EngineError; // Index creation with options public abstract Entity? create_index_with_options( string name, string type_label, string expression, IndexOptions options ) throws EngineError; } public class IndexOptions : Object { public bool case_sensitive { get; set; default = false; } } ``` ### Engine Interface Changes ```vala public interface Engine : Object { // ... existing members ... // Get all indexed entities for a type label public abstract Invercargill.Enumerable get_indexes_for_type(string type_label); // Rebuild all indices (maintenance operation) public abstract void rebuild_all_indices() throws EngineError; // New signal with type_label for deleted entities public signal void entity_deleted(EntityPath path, string type_label); } ``` ### Storage Interface Changes ```vala public interface Storage : Object { // ... existing methods ... // Index operations public abstract void add_to_index(string index_key, string doc_path) throws StorageError; public abstract void remove_from_index(string index_key, string doc_path) throws StorageError; public abstract Invercargill.Enumerable get_index_members(string index_key) throws StorageError; // Type index operations public abstract void register_document_type(EntityPath doc_path, string type_label) throws StorageError; public abstract void unregister_document_type(EntityPath doc_path, string type_label) throws StorageError; public abstract Invercargill.Enumerable get_documents_by_type(string type_label) throws StorageError; // Catalogue operations public abstract void catalogue_add_to_group(EntityPath catalogue_path, string key, EntityPath doc_path) throws StorageError; public abstract void catalogue_remove_from_group(EntityPath catalogue_path, string key, EntityPath doc_path) throws StorageError; public abstract Invercargill.Enumerable catalogue_get_keys(EntityPath catalogue_path) throws StorageError; public abstract Invercargill.Enumerable catalogue_get_group(EntityPath catalogue_path, string key) throws StorageError; // N-gram index operations public abstract void ngram_add_trigram(EntityPath index_path, string trigram, EntityPath doc_path) throws StorageError; public abstract void ngram_remove_trigram(EntityPath index_path, string trigram, EntityPath doc_path) throws StorageError; public abstract Invercargill.Enumerable ngram_get_documents(EntityPath index_path, string trigram) throws StorageError; public abstract void ngram_add_bigram_reverse(EntityPath index_path, string bigram, string trigram) throws StorageError; public abstract Invercargill.Enumerable ngram_get_trigrams_for_bigram(EntityPath index_path, string bigram) throws StorageError; public abstract void ngram_add_unigram_reverse(EntityPath index_path, char unigram, string bigram) throws StorageError; public abstract Invercargill.Enumerable ngram_get_bigrams_for_unigram(EntityPath index_path, char unigram) throws StorageError; } ``` ## Implementation Notes ### Index Building on Creation When an indexed entity is created, it must build its index from existing documents: ```vala public override Entity? create_category(string name, string type_label, string expression) throws EngineError { // ... create entity metadata ... var category = new Category(_engine, child_path, type_label, expression); // Build initial index category.build_initial_index(); // Register with hook manager _engine.hooks.register_hook(category); _engine.entity_created(category); return category; } ``` ### Expression Evaluation The expression system needs to support: 1. **Boolean expressions** (Category): `!draft`, `status == "active"`, `age > 18` 2. **Value expressions** (Catalogue): `author`, `category`, `status.year` 3. **Field access** (Index): `content`, `title` Consider using or extending Invercargill's expression system if available. ### Transaction Safety Index updates should be atomic with document changes: ```vala public void set_entity_property(string name, Element value) throws EngineError { var tx = _engine.begin_transaction(); try { // Update property _properties.set(name, value); save_properties(); // Notify hooks (which update indices) _engine.notify_modified(this, new Set.from_values(name)); tx.commit(); } catch (Error e) { tx.rollback(); throw e; } } ``` ### Performance Considerations 1. **Batch Updates**: When importing many documents, consider a bulk index build mode 2. **Lazy Index Building**: For large datasets, build indices asynchronously 3. **Index Compaction**: Periodically compact index storage to remove fragmentation 4. **Caching**: Cache frequently accessed index data in memory ### Error Handling ```vala public enum IndexError { // Expression parsing failed EXPRESSION_ERROR, // Index corruption detected CORRUPT_INDEX, // Index build failed BUILD_FAILED, // Hook registration failed HOOK_ERROR; } ``` ## Migration Path ### From Current Implementation 1. **Add CATALOGUE to EntityType enum** 2. **Create Catalogue entity class** 3. **Update Storage interface with index methods** 4. **Implement HookManager in Engine** 5. **Refactor Category to use indexed storage** 6. **Refactor Index to use n-gram storage** 7. **Update Container to support create_catalogue** 8. **Add index rebuild utility** ### Backward Compatibility - Existing databases will not have index data - On first access, indexed entities should detect missing indices and rebuild - Provide a migration tool to pre-build indices for large databases ## Summary | Entity | Purpose | Index Type | Query Complexity | |--------|---------|------------|------------------| | Container | Arbitrary children | None (explicit) | O(k) | | Document | Properties | None | O(1) | | Category | Boolean filter | Member set | O(k) | | Catalogue | Key groupings | Key→members map | O(1) per key | | Index | Text search | Hierarchical n-gram | O(k) | Where k = number of results/candidates, not total documents.