11-Indexed-Entities.md 25 KB

Indexed Entities Design

This document describes the design for indexed entity types in Implexus: Category, Catalogue, and Index. These entities use pre-built indices for fast lookups instead of scanning all documents.

Overview

Problem Statement

The current implementation of virtual entities has performance issues:

Entity Current Behavior Problem
Category Scans all documents, evaluates expression per document O(n) per query
Index Scans all documents, does substring matching O(n) per query
Catalogue Does not exist Need key-based groupings

Solution: Pre-built Indices

All indexed entities will:

  1. Build indices at creation time - Scan existing documents and populate index structures
  2. Update indices on changes - Via hooks when documents are created, updated, or deleted
  3. Use fast key lookups - O(1) or O(log n) queries instead of O(n) scans

Entity Type Semantics

graph TB
    subgraph Entity Types
        Container[Container - Arbitrary children]
        Document[Document - Properties]
        Category[Category - Predicate filter]
        Catalogue[Catalogue - Key groupings]
        Index[Index - Text search]
    end
    
    subgraph Category Behavior
        Category -->|contains| DocCat[Documents matching predicate]
    end
    
    subgraph Catalogue Behavior
        Catalogue -->|has child| Group1[Key Group: john]
        Catalogue -->|has child| Group2[Key Group: jane]
        Group1 -->|contains| Doc1[Post by john]
        Group2 -->|contains| Doc2[Post by jane]
    end
    
    subgraph Index Behavior
        Index -->|search| Result[Search Result Container]
        Result -->|contains| Doc3[Matching documents]
    end

Entity Definitions

Container (Unchanged)

  • Purpose: Arbitrary child storage like a filesystem folder
  • Children: Any entity type, explicitly added
  • Implementation: Current implementation is correct, no changes needed

Category (Redesigned)

  • Purpose: Contains documents matching a boolean predicate expression
  • Type Label: Filters documents by application-defined type
  • Expression: Boolean expression (e.g., !draft, status == "active", age > 18)
  • Children: Documents where predicate evaluates to true
  • Example: Category at /posts/active with expression !draft returns all posts where draft == false

Catalogue (New Entity Type)

  • Purpose: Groups documents by unique values of an expression
  • Type Label: Filters documents by application-defined type
  • Expression: Value expression (e.g., author, category, status)
  • Children: Virtual containers keyed by expression result
  • Example: Catalogue at /posts/by-author with expression author:
    • /posts/by-author/john → all posts where author == "john"
    • /posts/by-author/jane → all posts where author == "jane"
  • Semantics: Similar to Dictionary<TKey, List<TItem>> or GROUP BY in SQL

Index (Redesigned)

  • Purpose: Full-text search on a field with SQL LIKE-style patterns
  • Type Label: Filters documents by application-defined type
  • Expression: Field name to index (e.g., content, title)
  • Children: Search results as virtual containers
  • Patterns Supported:
    • *world* - contains "world"
    • hello* - starts with "hello"
    • *goodbye - ends with "goodbye"
  • Configuration: Case sensitivity option (default: insensitive)

Storage Schema

Key Prefixes

All index data uses scoped keys to prevent collisions between entities:

index:<entity_type>:<entity_path>:<data_type>:<key>

Type Label Index

To efficiently find all documents of a given type, maintain a global type index:

typeidx:<type_label> → [doc_path1, doc_path2, ...]

Storage: typeidx:User["/users/john", "/users/jane", ...]

Category Storage Schema

Category stores document paths that match the predicate:

cat:<entity_path>:members → [doc_path1, doc_path2, ...]
cat:<entity_path>:config → {type_label: "...", expression: "..."}

Example:

cat:/posts/active:members → ["/posts/post1", "/posts/post3"]
cat:/posts/active:config → {type_label: "Post", expression: "!draft"}

Catalogue Storage Schema

Catalogue stores groupings keyed by expression result:

catl:<entity_path>:group:<key_value> → [doc_path1, doc_path2, ...]
catl:<entity_path>:keys → ["key1", "key2", ...]
catl:<entity_path>:config → {type_label: "...", expression: "..."}

Example:

catl:/posts/by-author:group:john → ["/posts/post1", "/posts/post3"]
catl:/posts/by-author:group:jane → ["/posts/post2"]
catl:/posts/by-author:keys → ["john", "jane"]
catl:/posts/by-author:config → {type_label: "Post", expression: "author"}

Index Storage Schema (Hierarchical N-gram)

The Index uses a hierarchical n-gram structure for arbitrary-length pattern matching:

Primary Trigram Index

idx:<entity_path>:tri:<trigram> → [doc_path1, doc_path2, ...]

Example: Text "hello world" produces trigrams: hel, ell, llo, lo, o w, wo, wor, orl, rld

idx:/posts/search:tri:hel → ["/posts/post1", "/posts/post2"]
idx:/posts/search:tri:ell → ["/posts/post1"]
...

Bigram Reverse Index

Maps bigrams to trigrams containing them (for 2-character queries):

idx:<entity_path>:bi:<bigram> → [trigram1, trigram2, ...]

Example:

idx:/posts/search:bi:he → ["hel"]
idx:/posts/search:bi:el → ["hel", "ell"]
idx:/posts/search:bi:ll → ["ell", "llo"]

Unigram Reverse Index

Maps characters to bigrams starting with them (for 1-character queries):

idx:<entity_path>:uni:<char> → [bigram1, bigram2, ...]

Example:

idx:/posts/search:uni:h → ["he"]
idx:/posts/search:uni:e → ["el"]

Document Content Cache

Stores the indexed field value for each document (used for verification and rebuilds):

idx:<entity_path>:doc:<doc_path> → "indexed field value"

Configuration

idx:<entity_path>:config → {type_label: "...", expression: "...", case_sensitive: false}

Index Size Analysis

Index Type Entries Per Doc Stores Relative Size
Trigram→Docs O(L) where L = text length Doc paths Primary cost
Bigram→Trigrams O(L) Strings only Small
Unigram→Bigrams O(unique chars) Strings only Tiny
Doc Content 1 Field value Moderate

Example: For 10,000 documents averaging 1KB text:

  • Trigram index: ~10M entries (document paths)
  • Bigram reverse: ~2,500 entries (just strings)
  • Unigram reverse: ~50 entries (just strings)

The reverse indexes have minimal overhead since they store short strings, not document paths.

Query Algorithms

Category Query

INPUT: Category at path P
OUTPUT: All documents matching predicate

1. Load config to get type_label and expression
2. Lookup cat:P:members → [doc_paths]
3. For each doc_path, instantiate Document entity
4. Return documents

Complexity: O(k) where k = number of matching documents

Catalogue Query

INPUT: Catalogue at path P, optional key K
OUTPUT: If K provided: documents with that key value
        Otherwise: list of available keys

# Get specific group
get_child(K):
1. Lookup catl:P:group:K → [doc_paths]
2. Return VirtualContainer with documents

# List available keys
child_names:
1. Lookup catl:P:keys → [key1, key2, ...]
2. Return keys

Complexity: O(1) for key lookup, O(k) for listing groups

Index Query

INPUT: Index at path P, search pattern S
OUTPUT: Documents matching pattern

parse_pattern(S):
  if S starts with * and ends with *:
    return CONTAINS(S[1:-1])
  if S starts with *:
    return ENDS_WITH(S[1:])
  if S ends with *:
    return STARTS_WITH(S[:-1])
  return EXACT(S)

search(pattern):
  terms = parse_pattern(pattern)
  
  if terms.type == CONTAINS and len(terms.value) >= 3:
    # Use trigram index
    trigrams = extract_trigrams(terms.value)
    doc_sets = [lookup idx:P:tri:t for t in trigrams]
    candidates = intersect(doc_sets)
    return [d for d in candidates if d contains terms.value]
  
  if terms.type == CONTAINS and len(terms.value) == 2:
    # Use bigram→trigram→docs
    bigram = terms.value
    trigrams = lookup idx:P:bi:bigram
    doc_sets = [lookup idx:P:tri:t for t in trigrams]
    candidates = union(doc_sets)
    return [d for d in candidates if d contains terms.value]
  
  if terms.type == CONTAINS and len(terms.value) == 1:
    # Use unigram→bigram→trigram→docs
    char = terms.value
    bigrams = lookup idx:P:uni:char
    trigrams = union([lookup idx:P:bi:b for b in bigrams])
    doc_sets = [lookup idx:P:tri:t for t in trigrams]
    candidates = union(doc_sets)
    return [d for d in candidates if d contains terms.value]
  
  if terms.type == STARTS_WITH:
    # Prefix search using first trigram
    prefix = terms.value
    first_tri = prefix[0:3]
    candidates = lookup idx:P:tri:first_tri
    return [d for d in candidates if d starts_with prefix]
  
  if terms.type == ENDS_WITH:
    # Suffix search using last trigram
    suffix = terms.value
    last_tri = suffix[-3:]
    candidates = lookup idx:P:tri:last_tri
    return [d for d in candidates if d ends_with suffix]

Complexity:

  • 3+ char patterns: O(k) where k = candidates from trigram intersection
  • 1-2 char patterns: O(k) with more indirection but still indexed
  • Prefix/suffix: O(k) using boundary trigrams

Hook/Notification System

Overview

The hook system notifies indexed entities when documents are created, modified, or deleted so they can update their indices.

sequenceDiagram
    participant App as Application
    participant Engine as Engine
    participant Doc as Document
    participant Hook as HookManager
    participant Cat as Category
    participant Catl as Catalogue
    participant Idx as Index
    
    App->>Engine: create_document with type_label
    Engine->>Doc: create
    Engine->>Hook: entity_created(doc)
    Hook->>Hook: find_interested_entities(type_label)
    loop For each interested entity
        Hook->>Cat: on_document_created(doc)
        Hook->>Catl: on_document_created(doc)
        Hook->>Idx: on_document_created(doc)
    end

Hook Registration

Indexed entities register interest in specific type labels:

interface IndexHook {
    // The type label this entity is interested in
    public abstract string type_label { get; }
    
    // The entity path for index scoping
    public abstract EntityPath entity_path { get; }
    
    // Called when a document is created
    public abstract void on_document_created(Entity doc) throws Error;
    
    // Called when a document is modified
    public abstract void on_document_modified(Entity doc, Set<string> changed_properties) throws Error;
    
    // Called when a document is deleted
    public abstract void on_document_deleted(EntityPath doc_path) throws Error;
}

HookManager Implementation

class HookManager {
    // type_label → list of registered hooks
    private Dictionary<string, List<IndexHook>> _hooks_by_type;
    
    // Register a hook for a type label
    public void register_hook(IndexHook hook) {
        var hooks = _hooks_by_type.get_or_default(hook.type_label, new List());
        hooks.add(hook);
    }
    
    // Unregister a hook
    public void unregister_hook(IndexHook hook) {
        var hooks = _hooks_by_type.get(hook.type_label);
        if (hooks != null) {
            hooks.remove(hook);
        }
    }
    
    // Notify hooks of document creation
    public void notify_created(Entity doc) {
        var hooks = _hooks_by_type.get(doc.type_label);
        if (hooks != null) {
            foreach (var hook in hooks) {
                hook.on_document_created(doc);
            }
        }
    }
    
    // Notify hooks of document modification
    public void notify_modified(Entity doc, Set<string> changed_properties) {
        var hooks = _hooks_by_type.get(doc.type_label);
        if (hooks != null) {
            foreach (var hook in hooks) {
                hook.on_document_modified(doc, changed_properties);
            }
        }
    }
    
    // Notify hooks of document deletion
    public void notify_deleted(EntityPath doc_path, string type_label) {
        var hooks = _hooks_by_type.get(type_label);
        if (hooks != null) {
            foreach (var hook in hooks) {
                hook.on_document_deleted(doc_path);
            }
        }
    }
}

Engine Integration

The Engine emits signals that the HookManager listens to:

class EmbeddedEngine : Object, Engine {
    private HookManager _hooks;
    
    // Connect signals to hook manager
    construct {
        _hooks = new HookManager();
        this.entity_created.connect((entity) => {
            if (entity.entity_type == EntityType.DOCUMENT) {
                _hooks.notify_created(entity);
            }
        });
        this.entity_modified.connect((entity) => {
            if (entity.entity_type == EntityType.DOCUMENT) {
                _hooks.notify_modified(entity, entity.modified_properties);
            }
        });
        this.entity_deleted.connect((path, type_label) => {
            _hooks.notify_deleted(path, type_label);
        });
    }
}

Index Update Logic

Category Index Update

class Category : AbstractEntity, IndexHook {
    
    public void on_document_created(Entity doc) {
        update_index_for_document(doc, true);
    }
    
    public void on_document_modified(Entity doc, Set<string> changed_properties) {
        // Re-evaluate predicate and update index
        update_index_for_document(doc, evaluate_predicate(doc));
    }
    
    public void on_document_deleted(EntityPath doc_path) {
        remove_from_index(doc_path);
    }
    
    private void update_index_for_document(Entity doc, bool should_include) {
        var members_key = "cat:%s:members".printf(_path.to_string());
        var current_members = load_member_set(members_key);
        
        if (should_include) {
            current_members.add(doc.path.to_string());
        } else {
            current_members.remove(doc.path.to_string());
        }
        
        save_member_set(members_key, current_members);
    }
    
    private bool evaluate_predicate(Entity doc) {
        // Evaluate the boolean expression against the document
        return ExpressionEvaluator.evaluate_boolean(_expression, doc.properties);
    }
}

Catalogue Index Update

class Catalogue : AbstractEntity, IndexHook {
    
    public void on_document_created(Entity doc) {
        update_index_for_document(doc);
    }
    
    public void on_document_modified(Entity doc, Set<string> changed_properties) {
        // Check if the grouped expression changed
        if (changed_properties.contains(_expression)) {
            // Get old value and remove from old group
            var old_value = get_old_value(doc);
            remove_from_group(old_value, doc.path);
            
            // Add to new group
            update_index_for_document(doc);
        }
    }
    
    public void on_document_deleted(EntityPath doc_path) {
        // Need to look up the value before removing
        var value = get_indexed_value(doc_path);
        remove_from_group(value, doc_path);
    }
    
    private void update_index_for_document(Entity doc) {
        var value = evaluate_expression(doc);
        if (value != null) {
            add_to_group(value, doc.path);
        }
    }
    
    private void add_to_group(string key, EntityPath doc_path) {
        var group_key = "catl:%s:group:%s".printf(_path.to_string(), key);
        var members = load_member_set(group_key);
        members.add(doc_path.to_string());
        save_member_set(group_key, members);
        
        // Also update keys list if this is a new key
        var keys_key = "catl:%s:keys".printf(_path.to_string());
        var keys = load_key_set(keys_key);
        if (!keys.contains(key)) {
            keys.add(key);
            save_key_set(keys_key, keys);
        }
    }
}

Index Entity Update (N-gram)

class Index : AbstractEntity, IndexHook {
    
    private bool _case_sensitive;
    
    public void on_document_created(Entity doc) {
        index_document(doc);
    }
    
    public void on_document_modified(Entity doc, Set<string> changed_properties) {
        if (changed_properties.contains(_expression)) {
            // Re-index with new content
            reindex_document(doc);
        }
    }
    
    public void on_document_deleted(EntityPath doc_path) {
        unindex_document(doc_path);
    }
    
    private void index_document(Entity doc) {
        var content = get_field_value(doc);
        if (content == null) return;
        
        var normalized = _case_sensitive ? content : content.down();
        var doc_path_str = doc.path.to_string();
        
        // Store content cache
        var cache_key = "idx:%s:doc:%s".printf(_path.to_string(), doc_path_str);
        storage.set(cache_key, normalized);
        
        // Index trigrams
        var trigrams = extract_trigrams(normalized);
        foreach (var tri in trigrams) {
            add_to_trigram_index(tri, doc_path_str);
        }
        
        // Update bigram reverse index
        var bigrams = extract_bigrams(normalized);
        foreach (var bi in bigrams) {
            add_to_bigram_reverse(bi, trigrams);
        }
        
        // Update unigram reverse index
        var unigrams = extract_unique_chars(normalized);
        foreach (var uni in unigrams) {
            add_to_unigram_reverse(uni, bigrams);
        }
    }
    
    private void unindex_document(EntityPath doc_path) {
        var doc_path_str = doc_path.to_string();
        var cache_key = "idx:%s:doc:%s".printf(_path.to_string(), doc_path_str);
        
        // Get content to unindex
        var content = storage.get(cache_key);
        if (content == null) return;
        
        var trigrams = extract_trigrams(content);
        foreach (var tri in trigrams) {
            remove_from_trigram_index(tri, doc_path_str);
        }
        
        // Note: We don't remove from reverse indexes as they're just string mappings
        // and don't reference specific documents
        
        storage.delete(cache_key);
    }
    
    private void reindex_document(Entity doc) {
        unindex_document(doc.path);
        index_document(doc);
    }
}

API Changes

EntityType Enum

Add CATALOGUE to the entity type enumeration:

public enum EntityType {
    CONTAINER,
    DOCUMENT,
    CATEGORY,
    CATALOGUE,  // NEW
    INDEX;
    
    public bool is_indexed() {
        return this == CATEGORY || this == CATALOGUE || this == INDEX;
    }
}

Entity Interface Changes

public interface Entity : Object {
    // ... existing members ...
    
    // New method for indexed entities
    public abstract void rebuild_index() throws EngineError;
    
    // Configuration for case sensitivity (Index only)
    public abstract bool case_sensitive { get; }
}

Container Interface Changes

Add method to create Catalogue:

public interface Entity : Object {
    // ... existing methods ...
    
    // New factory method for Catalogue
    public abstract Entity? create_catalogue(
        string name,
        string type_label,
        string expression
    ) throws EngineError;
    
    // Index creation with options
    public abstract Entity? create_index_with_options(
        string name,
        string type_label,
        string expression,
        IndexOptions options
    ) throws EngineError;
}

public class IndexOptions : Object {
    public bool case_sensitive { get; set; default = false; }
}

Engine Interface Changes

public interface Engine : Object {
    // ... existing members ...
    
    // Get all indexed entities for a type label
    public abstract Invercargill.Enumerable<Entity> get_indexes_for_type(string type_label);
    
    // Rebuild all indices (maintenance operation)
    public abstract void rebuild_all_indices() throws EngineError;
    
    // New signal with type_label for deleted entities
    public signal void entity_deleted(EntityPath path, string type_label);
}

Storage Interface Changes

public interface Storage : Object {
    // ... existing methods ...
    
    // Index operations
    public abstract void add_to_index(string index_key, string doc_path) throws StorageError;
    public abstract void remove_from_index(string index_key, string doc_path) throws StorageError;
    public abstract Invercargill.Enumerable<string> get_index_members(string index_key) throws StorageError;
    
    // Type index operations
    public abstract void register_document_type(EntityPath doc_path, string type_label) throws StorageError;
    public abstract void unregister_document_type(EntityPath doc_path, string type_label) throws StorageError;
    public abstract Invercargill.Enumerable<EntityPath> get_documents_by_type(string type_label) throws StorageError;
    
    // Catalogue operations
    public abstract void catalogue_add_to_group(EntityPath catalogue_path, string key, EntityPath doc_path) throws StorageError;
    public abstract void catalogue_remove_from_group(EntityPath catalogue_path, string key, EntityPath doc_path) throws StorageError;
    public abstract Invercargill.Enumerable<string> catalogue_get_keys(EntityPath catalogue_path) throws StorageError;
    public abstract Invercargill.Enumerable<EntityPath> catalogue_get_group(EntityPath catalogue_path, string key) throws StorageError;
    
    // N-gram index operations
    public abstract void ngram_add_trigram(EntityPath index_path, string trigram, EntityPath doc_path) throws StorageError;
    public abstract void ngram_remove_trigram(EntityPath index_path, string trigram, EntityPath doc_path) throws StorageError;
    public abstract Invercargill.Enumerable<EntityPath> ngram_get_documents(EntityPath index_path, string trigram) throws StorageError;
    
    public abstract void ngram_add_bigram_reverse(EntityPath index_path, string bigram, string trigram) throws StorageError;
    public abstract Invercargill.Enumerable<string> ngram_get_trigrams_for_bigram(EntityPath index_path, string bigram) throws StorageError;
    
    public abstract void ngram_add_unigram_reverse(EntityPath index_path, char unigram, string bigram) throws StorageError;
    public abstract Invercargill.Enumerable<string> ngram_get_bigrams_for_unigram(EntityPath index_path, char unigram) throws StorageError;
}

Implementation Notes

Index Building on Creation

When an indexed entity is created, it must build its index from existing documents:

public override Entity? create_category(string name, string type_label, string expression) throws EngineError {
    // ... create entity metadata ...
    
    var category = new Category(_engine, child_path, type_label, expression);
    
    // Build initial index
    category.build_initial_index();
    
    // Register with hook manager
    _engine.hooks.register_hook(category);
    
    _engine.entity_created(category);
    return category;
}

Expression Evaluation

The expression system needs to support:

  1. Boolean expressions (Category): !draft, status == "active", age > 18
  2. Value expressions (Catalogue): author, category, status.year
  3. Field access (Index): content, title

Consider using or extending Invercargill's expression system if available.

Transaction Safety

Index updates should be atomic with document changes:

public void set_entity_property(string name, Element value) throws EngineError {
    var tx = _engine.begin_transaction();
    try {
        // Update property
        _properties.set(name, value);
        save_properties();
        
        // Notify hooks (which update indices)
        _engine.notify_modified(this, new Set.from_values(name));
        
        tx.commit();
    } catch (Error e) {
        tx.rollback();
        throw e;
    }
}

Performance Considerations

  1. Batch Updates: When importing many documents, consider a bulk index build mode
  2. Lazy Index Building: For large datasets, build indices asynchronously
  3. Index Compaction: Periodically compact index storage to remove fragmentation
  4. Caching: Cache frequently accessed index data in memory

Error Handling

public enum IndexError {
    // Expression parsing failed
    EXPRESSION_ERROR,
    // Index corruption detected
    CORRUPT_INDEX,
    // Index build failed
    BUILD_FAILED,
    // Hook registration failed
    HOOK_ERROR;
}

Migration Path

From Current Implementation

  1. Add CATALOGUE to EntityType enum
  2. Create Catalogue entity class
  3. Update Storage interface with index methods
  4. Implement HookManager in Engine
  5. Refactor Category to use indexed storage
  6. Refactor Index to use n-gram storage
  7. Update Container to support create_catalogue
  8. Add index rebuild utility

Backward Compatibility

  • Existing databases will not have index data
  • On first access, indexed entities should detect missing indices and rebuild
  • Provide a migration tool to pre-build indices for large databases

Summary

Entity Purpose Index Type Query Complexity
Container Arbitrary children None (explicit) O(k)
Document Properties None O(1)
Category Boolean filter Member set O(k)
Catalogue Key groupings Key→members map O(1) per key
Index Text search Hierarchical n-gram O(k)

Where k = number of results/candidates, not total documents.