public_documentation-processors

Processors in Evie's Library are catalog-specific components that determine how files of each type (HTML, PDF, Markdown, Docx, audio) are extracted, chunked, and stored for AI-driven retrieval. Most processors support configurable rules, such as chunking by heading level, exclusion patterns, or element-specific processing, so that each chunk keeps the meaning of its source content and responses can draw on precise passages.

Core Functionality

Processors handle file ingestion by splitting content into chunks—smaller, semantically coherent units that improve retrieval accuracy. This chunking process is governed by catalog-level settings (e.g., min/max chunk sizes) and processor-specific rules. For example, HTML processors can exclude navigation elements or target specific tags (<p>, <h1>), while PDF and audio processors rely on AI to automate extraction without manual configuration.
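As an illustration of how min/max chunk sizes might shape the output, the sketch below merges undersized chunks into a neighbour and splits oversized ones. It is a minimal sketch and not Evie's actual algorithm: the function name normalize_chunks, the use of character counts as the size unit, and the merge-then-split order are all assumptions made for the example.

```python
def normalize_chunks(chunks, min_size, max_size):
    """Merge chunks shorter than min_size, then split chunks longer than max_size.

    Sizes are measured in characters (an assumption for this sketch).
    """
    merged = []
    for chunk in chunks:
        # If the previous chunk is still too short, grow it instead of starting a new one.
        if merged and len(merged[-1]) < min_size:
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)

    # A short trailing chunk has no successor, so fold it into its predecessor.
    if len(merged) > 1 and len(merged[-1]) < min_size:
        last = merged.pop()
        merged[-1] = merged[-1] + "\n" + last

    # Hard-split anything that is still larger than the maximum.
    result = []
    for chunk in merged:
        for start in range(0, len(chunk), max_size):
            result.append(chunk[start:start + max_size])
    return result


if __name__ == "__main__":
    for piece in normalize_chunks(["ab", "cdefgh", "ij"], min_size=4, max_size=6):
        print(repr(piece))
```

A production chunker would more likely split at sentence or paragraph boundaries than at a fixed character offset; the fixed offset simply keeps the sketch short.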

File Type Support and Limits

Evie's Library currently supports five primary file types: HTML, PDF, Markdown, Docx, and audio (mp3/mp4/ogg). Each processor is bound to a single catalog and cannot be shared between catalogs. Files exceeding 50 MB are rejected. Sub-file types allow finer control: structurally different documents within the same catalog, such as quarterly reports and project definitions, can each be given their own processing rules.
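The type and size rules above can be pictured as a pre-upload check. This is a hedged sketch only: the extension-to-type mapping, the function name classify_file, and the reading of 50 MB as 50 × 1024 × 1024 bytes are assumptions, not documented behaviour of Evie's Library.

```python
from pathlib import Path

# Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 50 * 1024 * 1024

# Assumption: hypothetical mapping from file extension to the five supported types.
EXTENSION_TO_TYPE = {
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".md": "markdown",
    ".docx": "docx",
    ".mp3": "audio",
    ".mp4": "audio",
    ".ogg": "audio",
}


def classify_file(filename, size_bytes):
    """Return the processor type for a file, or raise ValueError if it would be rejected."""
    suffix = Path(filename).suffix.lower()
    file_type = EXTENSION_TO_TYPE.get(suffix)
    if file_type is None:
        raise ValueError(f"unsupported file type: {suffix or '(no extension)'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError(f"{filename} exceeds the 50 MB limit")
    return file_type


if __name__ == "__main__":
    print(classify_file("q3-report.pdf", 2_000_000))
```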

Configuration Options

Processors offer two tiers of configuration: general (applicable to most types) and type-specific. General options include:

Chunking Heading Level (1–6; default: 2): Splits content at the specified heading depth.
Chunking Patterns: Uses regular expressions to force splits (e.g., \bProfile\b to start a new chunk at each customer profile).

Type-specific options vary: HTML processors define included/excluded elements (e.g., article vs. footer), while Docx processors control formatting (e.g., list styles, image handling).
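To make the two general options concrete, the sketch below splits Markdown-style text at headings and then forces extra splits at regex matches. It rests on assumptions: headings are written with leading # characters, a heading level of N means "split at every heading of depth 1 through N", and a pattern split happens immediately before each match. The function names are hypothetical and the product's real behaviour may differ.

```python
import re


def split_on_headings(text, heading_level=2):
    """Start a new chunk at each Markdown heading of depth <= heading_level."""
    heading = re.compile(r"^#{1,%d}\s" % heading_level)
    chunks, current = [], []
    for line in text.splitlines():
        # A qualifying heading closes the chunk collected so far.
        if heading.match(line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def split_on_patterns(chunk, patterns):
    """Force a split immediately before every match of any pattern."""
    if not patterns:
        return [chunk]
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    starts = sorted({0} | {m.start() for m in combined.finditer(chunk)})
    bounds = starts + [len(chunk)]
    # Drop pieces that contain only whitespace.
    return [chunk[a:b] for a, b in zip(bounds, bounds[1:]) if chunk[a:b].strip()]


if __name__ == "__main__":
    doc = "# Title\nintro\n## A\nalpha\n### A1\ndeep\n## B\nbeta"
    print(split_on_headings(doc, heading_level=2))
    print(split_on_patterns("Profile Ann likes tea. Profile Bob likes jam.", [r"\bProfile\b"]))
```

With a heading level of 2, the level-3 heading in the sample stays inside its parent chunk, which is the trade-off the setting controls: lower values give fewer, larger chunks.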

Processor Types and Workflows

Automagic HTML Processor: AI-driven extraction with optional human-language instructions (html_custom_instructions).
HTML Processor: Manual control via included/excluded elements, tags, and classes (e.g., html_excluded_classes: ["sidebar"]).
PDF/Markdown Processors: Fully automated; no configuration required.
Docx Processor: Configurable for comments, headers, tables, and images (e.g., image_handling: "placeholder").
Audio Processor: Transcribes mp3/mp4/ogg files without additional setup.
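The manual HTML processor's exclusion rules can be pictured with Python's standard html.parser module. This is a sketch under stated assumptions, not the product's implementation: html_excluded_classes comes from the text above, while html_excluded_elements, the class ExcludingTextExtractor, and the function extract_text are hypothetical names. The sketch also assumes well-formed HTML in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ExcludingTextExtractor(HTMLParser):
    """Collect text while skipping subtrees that match excluded tags or classes."""

    def __init__(self, excluded_tags=(), excluded_classes=()):
        super().__init__()
        self.excluded_tags = set(excluded_tags)
        self.excluded_classes = set(excluded_classes)
        self.skip_depth = 0  # > 0 while inside an excluded subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside an excluded subtree
            return
        classes = set((dict(attrs).get("class") or "").split())
        if tag in self.excluded_tags or classes & self.excluded_classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html, config):
    """Return the text fragments that survive the exclusion rules in config."""
    parser = ExcludingTextExtractor(
        config.get("html_excluded_elements", ()),  # hypothetical key
        config.get("html_excluded_classes", ()),   # key named in the text above
    )
    parser.feed(html)
    parser.close()
    return parser.parts


if __name__ == "__main__":
    page = ('<article><h1>Guide</h1><div class="sidebar nav"><p>Menu</p></div>'
            '<p>Body<br>text</p><footer>Legal</footer></article>')
    config = {"html_excluded_elements": ["footer"], "html_excluded_classes": ["sidebar"]}
    print(extract_text(page, config))
```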

Management and Tuning

Processors can be created, modified, or tuned after deployment. Changes do not apply retroactively: existing documents must be reprocessed manually for the new settings to take effect. Implementation teams can access tuning logs for diagnostics; these logs are not visible to end users. Best practices include documenting configurations, testing with representative files, and periodically reviewing settings so they keep pace with evolving document structures.
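Because existing documents keep their old chunks until they are reprocessed, it can help to track which configuration each document was last processed with. The sketch below shows one possible bookkeeping approach and is purely illustrative: the fingerprinting scheme, the processed_with field, the chunking_heading_level key, and both function names are assumptions and not features of Evie's Library.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a stable hash of a processor configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stale_documents(documents, current_config):
    """List the names of documents whose recorded fingerprint differs from the current one."""
    current = config_fingerprint(current_config)
    return [doc["name"] for doc in documents if doc.get("processed_with") != current]


if __name__ == "__main__":
    old = {"chunking_heading_level": 2}  # hypothetical setting name
    new = {"chunking_heading_level": 3}
    docs = [
        {"name": "handbook.html", "processed_with": config_fingerprint(old)},
        {"name": "faq.html", "processed_with": config_fingerprint(new)},
    ]
    print(stale_documents(docs, new))
```

The resulting list is a candidate set for manual reprocessing after a configuration change.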

Extensibility

The framework supports future expansions, including new processor types, metadata fields, and configuration options. The modular design ensures backward compatibility while accommodating emerging file formats or processing needs.