# Indexed Entities Design

This document describes the design for indexed entity types in Implexus: Category, Catalogue, and Index. These entities use pre-built indices for fast lookups instead of scanning all documents.

## Overview

### Problem Statement

The current implementation of virtual entities has performance issues:

| Entity | Current Behavior | Problem |
|--------|-----------------|---------|
| Category | Scans all documents, evaluates expression per document | O(n) per query |
| Index | Scans all documents, does substring matching | O(n) per query |
| Catalogue | Does not exist | Need key-based groupings |

### Solution: Pre-built Indices

All indexed entities will:

1. **Build indices at creation time** - Scan existing documents and populate index structures
2. **Update indices on changes** - Via hooks when documents are created, updated, or deleted
3. **Use fast key lookups** - O(1) or O(log n) queries instead of O(n) scans

### Entity Type Semantics

```mermaid
graph TB
    subgraph Entity Types
        Container[Container - Arbitrary children]
        Document[Document - Properties]
        Category[Category - Predicate filter]
        Catalogue[Catalogue - Key groupings]
        Index[Index - Text search]
    end
    
    subgraph Category Behavior
        Category -->|contains| DocCat[Documents matching predicate]
    end
    
    subgraph Catalogue Behavior
        Catalogue -->|has child| Group1[Key Group: john]
        Catalogue -->|has child| Group2[Key Group: jane]
        Group1 -->|contains| Doc1[Post by john]
        Group2 -->|contains| Doc2[Post by jane]
    end
    
    subgraph Index Behavior
        Index -->|search| Result[Search Result Container]
        Result -->|contains| Doc3[Matching documents]
    end
```

## Entity Definitions

### Container (Unchanged)

- **Purpose**: Arbitrary child storage like a filesystem folder
- **Children**: Any entity type, explicitly added
- **Implementation**: Current implementation is correct, no changes needed

### Category (Redesigned)

- **Purpose**: Contains documents matching a boolean predicate expression
- **Type Label**: Filters documents by application-defined type
- **Expression**: Boolean expression (e.g., `!draft`, `status == "active"`, `age > 18`)
- **Children**: Documents where predicate evaluates to `true`
- **Example**: Category at `/posts/active` with expression `!draft` returns all posts where `draft == false`

### Catalogue (New Entity Type)

- **Purpose**: Groups documents by unique values of an expression
- **Type Label**: Filters documents by application-defined type
- **Expression**: Value expression (e.g., `author`, `category`, `status`)
- **Children**: Virtual containers keyed by expression result
- **Example**: Catalogue at `/posts/by-author` with expression `author`:
  - `/posts/by-author/john` → all posts where `author == "john"`
  - `/posts/by-author/jane` → all posts where `author == "jane"`
- **Semantics**: Similar to `Dictionary<TKey, List<TItem>>` or `GROUP BY` in SQL

### Index (Redesigned)

- **Purpose**: Full-text search on a field with SQL LIKE-style patterns
- **Type Label**: Filters documents by application-defined type
- **Expression**: Field name to index (e.g., `content`, `title`)
- **Children**: Search results as virtual containers
- **Patterns Supported**:
  - `*world*` - contains "world"
  - `hello*` - starts with "hello"
  - `*goodbye` - ends with "goodbye"
- **Configuration**: Case sensitivity option (default: insensitive)

## Storage Schema

### Key Prefixes

All index data uses scoped keys to prevent collisions between entities:

```
index:<entity_type>:<entity_path>:<data_type>:<key>
```

### Type Label Index

To efficiently find all documents of a given type, maintain a global type index:

```
typeidx:<type_label> → [doc_path1, doc_path2, ...]
```

**Storage**: `typeidx:User` → `["/users/john", "/users/jane", ...]`

### Category Storage Schema

Category stores document paths that match the predicate:

```
cat:<entity_path>:members → [doc_path1, doc_path2, ...]
cat:<entity_path>:config → {type_label: "...", expression: "..."}
```

**Example**:
```
cat:/posts/active:members → ["/posts/post1", "/posts/post3"]
cat:/posts/active:config → {type_label: "Post", expression: "!draft"}
```

### Catalogue Storage Schema

Catalogue stores groupings keyed by expression result:

```
catl:<entity_path>:group:<key_value> → [doc_path1, doc_path2, ...]
catl:<entity_path>:keys → ["key1", "key2", ...]
catl:<entity_path>:config → {type_label: "...", expression: "..."}
```

**Example**:
```
catl:/posts/by-author:group:john → ["/posts/post1", "/posts/post3"]
catl:/posts/by-author:group:jane → ["/posts/post2"]
catl:/posts/by-author:keys → ["john", "jane"]
catl:/posts/by-author:config → {type_label: "Post", expression: "author"}
```

### Index Storage Schema (Hierarchical N-gram)

The Index uses a hierarchical n-gram structure for arbitrary-length pattern matching:

#### Primary Trigram Index

```
idx:<entity_path>:tri:<trigram> → [doc_path1, doc_path2, ...]
```

**Example**: Text "hello world" produces trigrams: `hel`, `ell`, `llo`, `lo `, `o w`, ` wo`, `wor`, `orl`, `rld`

```
idx:/posts/search:tri:hel → ["/posts/post1", "/posts/post2"]
idx:/posts/search:tri:ell → ["/posts/post1"]
...
```

#### Bigram Reverse Index

Maps bigrams to trigrams containing them (for 2-character queries):

```
idx:<entity_path>:bi:<bigram> → [trigram1, trigram2, ...]
```

**Example**:
```
idx:/posts/search:bi:he → ["hel"]
idx:/posts/search:bi:el → ["hel", "ell"]
idx:/posts/search:bi:ll → ["ell", "llo"]
```

#### Unigram Reverse Index

Maps characters to bigrams starting with them (for 1-character queries):

```
idx:<entity_path>:uni:<char> → [bigram1, bigram2, ...]
```

**Example**:
```
idx:/posts/search:uni:h → ["he"]
idx:/posts/search:uni:e → ["el"]
```

#### Document Content Cache

Stores the indexed field value for each document (used for verification and rebuilds):

```
idx:<entity_path>:doc:<doc_path> → "indexed field value"
```

#### Configuration

```
idx:<entity_path>:config → {type_label: "...", expression: "...", case_sensitive: false}
```

### Index Size Analysis

| Index Type | Entries Per Doc | Stores | Relative Size |
|------------|-----------------|--------|---------------|
| Trigram→Docs | O(L) where L = text length | Doc paths | Primary cost |
| Bigram→Trigrams | O(L) | Strings only | Small |
| Unigram→Bigrams | O(unique chars) | Strings only | Tiny |
| Doc Content | 1 | Field value | Moderate |

**Example**: For 10,000 documents averaging 1KB text:
- Trigram index: ~10M entries (document paths)
- Bigram reverse: ~2,500 entries (just strings)
- Unigram reverse: ~50 entries (just strings)

The reverse indexes have minimal overhead since they store short strings, not document paths.

## Query Algorithms

### Category Query

```
INPUT: Category at path P
OUTPUT: All documents matching predicate

1. Load config to get type_label and expression
2. Lookup cat:P:members → [doc_paths]
3. For each doc_path, instantiate Document entity
4. Return documents
```

**Complexity**: O(k) where k = number of matching documents

### Catalogue Query

```
INPUT: Catalogue at path P, optional key K
OUTPUT: If K provided: documents with that key value
        Otherwise: list of available keys

# Get specific group
get_child(K):
1. Lookup catl:P:group:K → [doc_paths]
2. Return VirtualContainer with documents

# List available keys
child_names:
1. Lookup catl:P:keys → [key1, key2, ...]
2. Return keys
```

**Complexity**: O(1) for key lookup, O(k) for listing groups

### Index Query

```
INPUT: Index at path P, search pattern S
OUTPUT: Documents matching pattern

parse_pattern(S):
  if S starts with * and ends with *:
    return CONTAINS(S[1:-1])
  if S starts with *:
    return ENDS_WITH(S[1:])
  if S ends with *:
    return STARTS_WITH(S[:-1])
  return EXACT(S)

search(pattern):
  terms = parse_pattern(pattern)
  
  if terms.type == CONTAINS and len(terms.value) >= 3:
    # Use trigram index
    trigrams = extract_trigrams(terms.value)
    doc_sets = [lookup idx:P:tri:t for t in trigrams]
    candidates = intersect(doc_sets)
    return [d for d in candidates if d contains terms.value]
  
  if terms.type == CONTAINS and len(terms.value) == 2:
    # Use bigram→trigram→docs
    bigram = terms.value
    trigrams = lookup idx:P:bi:bigram
    doc_sets = [lookup idx:P:tri:t for t in trigrams]
    candidates = union(doc_sets)
    return [d for d in candidates if d contains terms.value]
  
  if terms.type == CONTAINS and len(terms.value) == 1:
    # Use unigram→bigram→trigram→docs
    char = terms.value
    bigrams = lookup idx:P:uni:char
    trigrams = union([lookup idx:P:bi:b for b in bigrams])
    doc_sets = [lookup idx:P:tri:t for t in trigrams]
    candidates = union(doc_sets)
    return [d for d in candidates if d contains terms.value]
  
  if terms.type == STARTS_WITH:
    # Prefix search using first trigram
    prefix = terms.value
    first_tri = prefix[0:3]
    candidates = lookup idx:P:tri:first_tri
    return [d for d in candidates if d starts_with prefix]
  
  if terms.type == ENDS_WITH:
    # Suffix search using last trigram
    suffix = terms.value
    last_tri = suffix[-3:]
    candidates = lookup idx:P:tri:last_tri
    return [d for d in candidates if d ends_with suffix]
```

**Complexity**:
- 3+ char patterns: O(k) where k = candidates from trigram intersection
- 1-2 char patterns: O(k) with more indirection but still indexed
- Prefix/suffix: O(k) using boundary trigrams

## Hook/Notification System

### Overview

The hook system notifies indexed entities when documents are created, modified, or deleted so they can update their indices.

```mermaid
sequenceDiagram
    participant App as Application
    participant Engine as Engine
    participant Doc as Document
    participant Hook as HookManager
    participant Cat as Category
    participant Catl as Catalogue
    participant Idx as Index
    
    App->>Engine: create_document with type_label
    Engine->>Doc: create
    Engine->>Hook: entity_created(doc)
    Hook->>Hook: find_interested_entities(type_label)
    loop For each interested entity
        Hook->>Cat: on_document_created(doc)
        Hook->>Catl: on_document_created(doc)
        Hook->>Idx: on_document_created(doc)
    end
```

### Hook Registration

Indexed entities register interest in specific type labels:

```vala
interface IndexHook {
    // The type label this entity is interested in
    public abstract string type_label { get; }
    
    // The entity path for index scoping
    public abstract EntityPath entity_path { get; }
    
    // Called when a document is created
    public abstract void on_document_created(Entity doc) throws Error;
    
    // Called when a document is modified
    public abstract void on_document_modified(Entity doc, Set<string> changed_properties) throws Error;
    
    // Called when a document is deleted
    public abstract void on_document_deleted(EntityPath doc_path) throws Error;
}
```

### HookManager Implementation

```vala
class HookManager {
    // type_label → list of registered hooks
    private Dictionary<string, List<IndexHook>> _hooks_by_type;
    
    // Register a hook for a type label
    public void register_hook(IndexHook hook) {
        var hooks = _hooks_by_type.get_or_default(hook.type_label, new List());
        hooks.add(hook);
    }
    
    // Unregister a hook
    public void unregister_hook(IndexHook hook) {
        var hooks = _hooks_by_type.get(hook.type_label);
        if (hooks != null) {
            hooks.remove(hook);
        }
    }
    
    // Notify hooks of document creation
    public void notify_created(Entity doc) {
        var hooks = _hooks_by_type.get(doc.type_label);
        if (hooks != null) {
            foreach (var hook in hooks) {
                hook.on_document_created(doc);
            }
        }
    }
    
    // Notify hooks of document modification
    public void notify_modified(Entity doc, Set<string> changed_properties) {
        var hooks = _hooks_by_type.get(doc.type_label);
        if (hooks != null) {
            foreach (var hook in hooks) {
                hook.on_document_modified(doc, changed_properties);
            }
        }
    }
    
    // Notify hooks of document deletion
    public void notify_deleted(EntityPath doc_path, string type_label) {
        var hooks = _hooks_by_type.get(type_label);
        if (hooks != null) {
            foreach (var hook in hooks) {
                hook.on_document_deleted(doc_path);
            }
        }
    }
}
```

### Engine Integration

The Engine emits signals that the HookManager listens to:

```vala
class EmbeddedEngine : Object, Engine {
    private HookManager _hooks;
    
    // Connect signals to hook manager
    construct {
        _hooks = new HookManager();
        this.entity_created.connect((entity) => {
            if (entity.entity_type == EntityType.DOCUMENT) {
                _hooks.notify_created(entity);
            }
        });
        this.entity_modified.connect((entity) => {
            if (entity.entity_type == EntityType.DOCUMENT) {
                _hooks.notify_modified(entity, entity.modified_properties);
            }
        });
        this.entity_deleted.connect((path, type_label) => {
            _hooks.notify_deleted(path, type_label);
        });
    }
}
```

## Index Update Logic

### Category Index Update

```vala
class Category : AbstractEntity, IndexHook {
    
    public void on_document_created(Entity doc) {
        update_index_for_document(doc, true);
    }
    
    public void on_document_modified(Entity doc, Set<string> changed_properties) {
        // Re-evaluate predicate and update index
        update_index_for_document(doc, evaluate_predicate(doc));
    }
    
    public void on_document_deleted(EntityPath doc_path) {
        remove_from_index(doc_path);
    }
    
    private void update_index_for_document(Entity doc, bool should_include) {
        var members_key = "cat:%s:members".printf(_path.to_string());
        var current_members = load_member_set(members_key);
        
        if (should_include) {
            current_members.add(doc.path.to_string());
        } else {
            current_members.remove(doc.path.to_string());
        }
        
        save_member_set(members_key, current_members);
    }
    
    private bool evaluate_predicate(Entity doc) {
        // Evaluate the boolean expression against the document
        return ExpressionEvaluator.evaluate_boolean(_expression, doc.properties);
    }
}
```

### Catalogue Index Update

```vala
class Catalogue : AbstractEntity, IndexHook {
    
    public void on_document_created(Entity doc) {
        update_index_for_document(doc);
    }
    
    public void on_document_modified(Entity doc, Set<string> changed_properties) {
        // Check if the grouped expression changed
        if (changed_properties.contains(_expression)) {
            // Get old value and remove from old group
            var old_value = get_old_value(doc);
            remove_from_group(old_value, doc.path);
            
            // Add to new group
            update_index_for_document(doc);
        }
    }
    
    public void on_document_deleted(EntityPath doc_path) {
        // Need to look up the value before removing
        var value = get_indexed_value(doc_path);
        remove_from_group(value, doc_path);
    }
    
    private void update_index_for_document(Entity doc) {
        var value = evaluate_expression(doc);
        if (value != null) {
            add_to_group(value, doc.path);
        }
    }
    
    private void add_to_group(string key, EntityPath doc_path) {
        var group_key = "catl:%s:group:%s".printf(_path.to_string(), key);
        var members = load_member_set(group_key);
        members.add(doc_path.to_string());
        save_member_set(group_key, members);
        
        // Also update keys list if this is a new key
        var keys_key = "catl:%s:keys".printf(_path.to_string());
        var keys = load_key_set(keys_key);
        if (!keys.contains(key)) {
            keys.add(key);
            save_key_set(keys_key, keys);
        }
    }
}
```

### Index Entity Update (N-gram)

```vala
class Index : AbstractEntity, IndexHook {
    
    private bool _case_sensitive;
    
    public void on_document_created(Entity doc) {
        index_document(doc);
    }
    
    public void on_document_modified(Entity doc, Set<string> changed_properties) {
        if (changed_properties.contains(_expression)) {
            // Re-index with new content
            reindex_document(doc);
        }
    }
    
    public void on_document_deleted(EntityPath doc_path) {
        unindex_document(doc_path);
    }
    
    private void index_document(Entity doc) {
        var content = get_field_value(doc);
        if (content == null) return;
        
        var normalized = _case_sensitive ? content : content.down();
        var doc_path_str = doc.path.to_string();
        
        // Store content cache
        var cache_key = "idx:%s:doc:%s".printf(_path.to_string(), doc_path_str);
        storage.set(cache_key, normalized);
        
        // Index trigrams
        var trigrams = extract_trigrams(normalized);
        foreach (var tri in trigrams) {
            add_to_trigram_index(tri, doc_path_str);
        }
        
        // Update bigram reverse index
        var bigrams = extract_bigrams(normalized);
        foreach (var bi in bigrams) {
            add_to_bigram_reverse(bi, trigrams);
        }
        
        // Update unigram reverse index
        var unigrams = extract_unique_chars(normalized);
        foreach (var uni in unigrams) {
            add_to_unigram_reverse(uni, bigrams);
        }
    }
    
    private void unindex_document(EntityPath doc_path) {
        var doc_path_str = doc_path.to_string();
        var cache_key = "idx:%s:doc:%s".printf(_path.to_string(), doc_path_str);
        
        // Get content to unindex
        var content = storage.get(cache_key);
        if (content == null) return;
        
        var trigrams = extract_trigrams(content);
        foreach (var tri in trigrams) {
            remove_from_trigram_index(tri, doc_path_str);
        }
        
        // Note: We don't remove from reverse indexes as they're just string mappings
        // and don't reference specific documents
        
        storage.delete(cache_key);
    }
    
    private void reindex_document(Entity doc) {
        unindex_document(doc.path);
        index_document(doc);
    }
}
```

## API Changes

### EntityType Enum

Add `CATALOGUE` to the entity type enumeration:

```vala
public enum EntityType {
    CONTAINER,
    DOCUMENT,
    CATEGORY,
    CATALOGUE,  // NEW
    INDEX;
    
    public bool is_indexed() {
        return this == CATEGORY || this == CATALOGUE || this == INDEX;
    }
}
```

### Entity Interface Changes

```vala
public interface Entity : Object {
    // ... existing members ...
    
    // New method for indexed entities
    public abstract void rebuild_index() throws EngineError;
    
    // Configuration for case sensitivity (Index only)
    public abstract bool case_sensitive { get; }
}
```

### Container Interface Changes

Add method to create Catalogue:

```vala
public interface Entity : Object {
    // ... existing methods ...
    
    // New factory method for Catalogue
    public abstract Entity? create_catalogue(
        string name,
        string type_label,
        string expression
    ) throws EngineError;
    
    // Index creation with options
    public abstract Entity? create_index_with_options(
        string name,
        string type_label,
        string expression,
        IndexOptions options
    ) throws EngineError;
}

public class IndexOptions : Object {
    public bool case_sensitive { get; set; default = false; }
}
```

### Engine Interface Changes

```vala
public interface Engine : Object {
    // ... existing members ...
    
    // Get all indexed entities for a type label
    public abstract Invercargill.Enumerable<Entity> get_indexes_for_type(string type_label);
    
    // Rebuild all indices (maintenance operation)
    public abstract void rebuild_all_indices() throws EngineError;
    
    // New signal with type_label for deleted entities
    public signal void entity_deleted(EntityPath path, string type_label);
}
```

### Storage Interface Changes

```vala
public interface Storage : Object {
    // ... existing methods ...
    
    // Index operations
    public abstract void add_to_index(string index_key, string doc_path) throws StorageError;
    public abstract void remove_from_index(string index_key, string doc_path) throws StorageError;
    public abstract Invercargill.Enumerable<string> get_index_members(string index_key) throws StorageError;
    
    // Type index operations
    public abstract void register_document_type(EntityPath doc_path, string type_label) throws StorageError;
    public abstract void unregister_document_type(EntityPath doc_path, string type_label) throws StorageError;
    public abstract Invercargill.Enumerable<EntityPath> get_documents_by_type(string type_label) throws StorageError;
    
    // Catalogue operations
    public abstract void catalogue_add_to_group(EntityPath catalogue_path, string key, EntityPath doc_path) throws StorageError;
    public abstract void catalogue_remove_from_group(EntityPath catalogue_path, string key, EntityPath doc_path) throws StorageError;
    public abstract Invercargill.Enumerable<string> catalogue_get_keys(EntityPath catalogue_path) throws StorageError;
    public abstract Invercargill.Enumerable<EntityPath> catalogue_get_group(EntityPath catalogue_path, string key) throws StorageError;
    
    // N-gram index operations
    public abstract void ngram_add_trigram(EntityPath index_path, string trigram, EntityPath doc_path) throws StorageError;
    public abstract void ngram_remove_trigram(EntityPath index_path, string trigram, EntityPath doc_path) throws StorageError;
    public abstract Invercargill.Enumerable<EntityPath> ngram_get_documents(EntityPath index_path, string trigram) throws StorageError;
    
    public abstract void ngram_add_bigram_reverse(EntityPath index_path, string bigram, string trigram) throws StorageError;
    public abstract Invercargill.Enumerable<string> ngram_get_trigrams_for_bigram(EntityPath index_path, string bigram) throws StorageError;
    
    public abstract void ngram_add_unigram_reverse(EntityPath index_path, char unigram, string bigram) throws StorageError;
    public abstract Invercargill.Enumerable<string> ngram_get_bigrams_for_unigram(EntityPath index_path, char unigram) throws StorageError;
}
```

## Implementation Notes

### Index Building on Creation

When an indexed entity is created, it must build its index from existing documents:

```vala
public override Entity? create_category(string name, string type_label, string expression) throws EngineError {
    // ... create entity metadata ...
    
    var category = new Category(_engine, child_path, type_label, expression);
    
    // Build initial index
    category.build_initial_index();
    
    // Register with hook manager
    _engine.hooks.register_hook(category);
    
    _engine.entity_created(category);
    return category;
}
```

### Expression Evaluation

The expression system needs to support:

1. **Boolean expressions** (Category): `!draft`, `status == "active"`, `age > 18`
2. **Value expressions** (Catalogue): `author`, `category`, `status.year`
3. **Field access** (Index): `content`, `title`

Consider using or extending Invercargill's expression system if available.

### Transaction Safety

Index updates should be atomic with document changes:

```vala
public void set_entity_property(string name, Element value) throws EngineError {
    var tx = _engine.begin_transaction();
    try {
        // Update property
        _properties.set(name, value);
        save_properties();
        
        // Notify hooks (which update indices)
        _engine.notify_modified(this, new Set.from_values(name));
        
        tx.commit();
    } catch (Error e) {
        tx.rollback();
        throw e;
    }
}
```

### Performance Considerations

1. **Batch Updates**: When importing many documents, consider a bulk index build mode
2. **Lazy Index Building**: For large datasets, build indices asynchronously
3. **Index Compaction**: Periodically compact index storage to remove fragmentation
4. **Caching**: Cache frequently accessed index data in memory

### Error Handling

```vala
public enum IndexError {
    // Expression parsing failed
    EXPRESSION_ERROR,
    // Index corruption detected
    CORRUPT_INDEX,
    // Index build failed
    BUILD_FAILED,
    // Hook registration failed
    HOOK_ERROR;
}
```

## Migration Path

### From Current Implementation

1. **Add CATALOGUE to EntityType enum**
2. **Create Catalogue entity class**
3. **Update Storage interface with index methods**
4. **Implement HookManager in Engine**
5. **Refactor Category to use indexed storage**
6. **Refactor Index to use n-gram storage**
7. **Update Container to support create_catalogue**
8. **Add index rebuild utility**

### Backward Compatibility

- Existing databases will not have index data
- On first access, indexed entities should detect missing indices and rebuild
- Provide a migration tool to pre-build indices for large databases

## Summary

| Entity | Purpose | Index Type | Query Complexity |
|--------|---------|------------|------------------|
| Container | Arbitrary children | None (explicit) | O(k) |
| Document | Properties | None | O(1) |
| Category | Boolean filter | Member set | O(k) |
| Catalogue | Key groupings | Key→members map | O(1) per key |
| Index | Text search | Hierarchical n-gram | O(k) |

Where k = number of results/candidates, not total documents.