Document Processing Pipeline

When a document is uploaded to Queria, it passes through a processing pipeline that turns a static file into content the AI can search, analyze and cite. The process is fully automatic and is designed to preserve as much of each supported format's content and structure as possible.

Process overview

Original file
      |
      v
[1] Upload and validation
      |
      v
[2] Parsing and text extraction
      |
      v
[3] Linguistic analysis and metadata
      |
      v
[4] Intelligent chunking
      |
      v
[5] Vector embedding
      |
      v
[6] Multi-modal indexing
      |
      v
Document ready for search

Each phase is monitored, and the document moves through states visible to the user: Uploaded, Processing, Ready, or Error if a phase fails.
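
The four user-visible states can be modeled as a small state machine. The sketch below is illustrative only: the state names come from the text above, while the allowed transitions and the `advance` helper are assumptions, not Queria's actual implementation.

```python
from enum import Enum


class DocumentState(Enum):
    UPLOADED = "Uploaded"
    PROCESSING = "Processing"
    READY = "Ready"
    ERROR = "Error"


# Assumed transitions: a document only moves forward, and any
# non-terminal state can end in Error.
ALLOWED_TRANSITIONS = {
    DocumentState.UPLOADED: {DocumentState.PROCESSING, DocumentState.ERROR},
    DocumentState.PROCESSING: {DocumentState.READY, DocumentState.ERROR},
    DocumentState.READY: set(),
    DocumentState.ERROR: set(),
}


def advance(current: DocumentState, target: DocumentState) -> DocumentState:
    """Return the target state, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

Keeping the transitions in one table makes it easy to reject impossible updates, such as a Ready document going back to Processing.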

Phase 1: Upload and validation

On file arrival, the system runs a series of checks:

Format verification: file type is validated both by extension and by binary content analysis (magic bytes). Supported formats: PDF, DOCX, PPTX, XLSX, CSV, images (PNG, JPG, TIFF, WEBP) and other text formats.
Size limits: verification that the file fits within the limits configured for the organization.
Integrity: check that the file is not corrupted or truncated.
Deduplication: identification of documents already present in the system to avoid duplication.
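
The checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the signature table covers only a few formats, deduplication is assumed to compare SHA-256 content hashes (the text does not say how duplicates are detected), and `sniff_format`, `content_fingerprint` and `validate_upload` are hypothetical names.

```python
import hashlib
from typing import Optional

# Well-known file signatures. DOCX, PPTX and XLSX are ZIP containers,
# so they share the ZIP signature and would need a further look inside.
MAGIC_BYTES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"PK\x03\x04": "zip-container",
}

# What each extension is expected to look like at the byte level.
EXTENSION_KIND = {
    "pdf": "pdf", "png": "png", "jpg": "jpg", "jpeg": "jpg",
    "docx": "zip-container", "pptx": "zip-container", "xlsx": "zip-container",
}


def sniff_format(data: bytes) -> Optional[str]:
    """Identify the format from the leading bytes, or None if unknown."""
    for signature, kind in MAGIC_BYTES.items():
        if data.startswith(signature):
            return kind
    return None


def content_fingerprint(data: bytes) -> str:
    """Hash of the file content, used here as the deduplication key."""
    return hashlib.sha256(data).hexdigest()


def validate_upload(filename: str, data: bytes, max_bytes: int,
                    known_fingerprints: set) -> list:
    """Return the list of failed checks (empty when the upload is valid)."""
    problems = []
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    expected = EXTENSION_KIND.get(extension)
    if expected is None or sniff_format(data) != expected:
        problems.append("format")
    if len(data) > max_bytes:
        problems.append("size")
    if content_fingerprint(data) in known_fingerprints:
        problems.append("duplicate")
    return problems
```

Checking the extension against the byte signature catches a file whose extension was changed, which an extension-only check would accept.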

The original file is stored securely and persistently, regardless of subsequent processing.

Phase 2: Parsing and text extraction

Each format is handled by a specialized parser that extracts as much structural information as possible.

PDF

PDF documents receive differentiated treatment:

Native text PDFs: direct text extraction with structure preservation (headings, paragraphs, lists).
Scanned PDFs: automatic OCR activation when the system detects pages without selectable text.
Mixed PDFs: handling of documents that contain both native and scanned pages, applying OCR only where needed.
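
The per-page decision can be sketched as follows, assuming the text layer of each page has already been extracted by a PDF library. The 20-character threshold and the function names are illustrative assumptions, not Queria's actual detection rule.

```python
def pages_needing_ocr(page_texts: list, min_chars: int = 20) -> list:
    """Return zero-based indexes of pages with too little selectable text."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def classify_pdf(page_texts: list, min_chars: int = 20) -> str:
    """Label a PDF as native, scanned or mixed from its per-page text."""
    scanned = pages_needing_ocr(page_texts, min_chars)
    if not scanned:
        return "native"
    if len(scanned) == len(page_texts):
        return "scanned"
    return "mixed"
```

For a mixed document, only the indexes returned by `pages_needing_ocr` would be sent to the OCR engine; the other pages keep their native text.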

Office documents

DOCX: content extraction that respects the document hierarchy (heading levels, paragraphs, tables, bulleted lists). Document metadata (author, date, revision) is preserved.
PPTX: text extraction from each slide, preserving the slide order.
XLSX/CSV: tabular processing with column header recognition. Each row is treated as an information unit with the context of its header.
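
To illustrate the idea of a row carrying the context of its header, here is a minimal sketch for CSV input using Python's standard `csv` module. The `header: value` text layout and the `rows_as_units` name are assumptions made for the example.

```python
import csv
import io


def rows_as_units(csv_text: str) -> list:
    """Turn each data row into a text unit that repeats the column headers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    units = []
    for row in reader:
        units.append("; ".join(f"{header}: {value}" for header, value in row.items()))
    return units
```

With the input `code,description` followed by the row `A1,Pump`, the resulting unit is `code: A1; description: Pump`, so the value `A1` stays interpretable even when the row is retrieved on its own.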

Images

Images are processed by the OCR engine with advanced AI-based text recognition. The system supports scanned documents, photos of documents, screenshots and any image containing text.

Phase 3: Linguistic analysis and metadata

The extracted text is analyzed to enrich its metadata:

Language detection: automatic identification of the document language to optimize subsequent processing.
Metadata extraction: title, author, creation date, number of pages, original format.
Thematic classification: when configured, automatic assignment of the document to one or more thematic categories.

Phase 4: Intelligent chunking

The text is segmented into fragments (chunks) optimized for semantic search. This is one of the most critical steps of the whole pipeline: chunks that are too small lose context, while chunks that are too large dilute meaning.

Adaptive segmentation

The system uses a chunking strategy that adapts to document structure:

Target size: about 2000 characters per chunk, calibrated to balance semantic precision and contextual richness.
Overlap: 15% of content is shared between adjacent chunks so that information at chunk boundaries is not lost.
Respect for structure: the system tries to align chunk boundaries with natural document boundaries (end of paragraph, end of section, end of list).
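
The three parameters above can be combined in a simple paragraph-aware chunker. This is a sketch under assumptions, not Queria's actual algorithm: paragraphs are taken to be separated by blank lines, the overlap is cut at character level, and a paragraph longer than the target is kept whole.

```python
def chunk_text(text: str, target: int = 2000, overlap_ratio: float = 0.15) -> list:
    """Pack paragraphs into chunks of about `target` characters with overlap."""
    overlap = int(target * overlap_ratio)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > target:
            # Close the chunk at a paragraph boundary.
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{paragraph}" if tail else paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the default values, each new chunk starts with the last 300 characters of the previous one (15% of 2000), so a sentence that straddles a boundary is still retrievable from one of the two chunks.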

Specialized chunking for legal documents

Regulatory documents need dedicated treatment. The legal chunker:

Recognizes the article-paragraph-letter structure typical of Italian legislation
Keeps each article as a coherent unit
Preserves cross-references between articles
Handles partitions into titles, chapters and sections
Assigns a specific type to each chunk: article, section, paragraph, clause, heading, list, table
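
A simplified sketch of article-level splitting is shown below. It assumes that each article starts on its own line with `Art. N` (optionally followed by `-bis`, `-ter` or `-quater`) and that cross-references are written as `art. N` or `articolo N`; real legislation has more variants, and text before the first article is ignored here. The regular expressions and the `split_articles` name are illustrative assumptions.

```python
import re

# Article heading at the start of a line, e.g. "Art. 12" or "Art. 2-bis".
ARTICLE_HEADING = re.compile(r"^Art\.\s*(\d+(?:-(?:bis|ter|quater))?)", re.MULTILINE)
# Mention of another article inside the text, e.g. "dall'art. 1".
REFERENCE = re.compile(r"\b(?:art\.|articolo)\s*(\d+(?:-(?:bis|ter|quater))?)", re.IGNORECASE)


def split_articles(text: str) -> list:
    """Split a legal text into one chunk per article, with its cross-references."""
    matches = list(ARTICLE_HEADING.finditer(text))
    chunks = []
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        body = text[match.start():end].strip()
        number = match.group(1)
        # Keep references to other articles, not the article's own heading.
        references = sorted({n for n in REFERENCE.findall(body) if n != number})
        chunks.append({
            "type": "article",
            "number": number,
            "text": body,
            "references": references,
        })
    return chunks
```

Each article stays a single unit, and the `references` field records which other articles it mentions so that they can be retrieved together.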

Preservation of tables and lists

Tables are treated as indivisible units: a table is never split between two chunks. The same principle applies to bulleted and numbered lists, kept intact to preserve sequential context.

Phase 5: Vector embedding

Each chunk is transformed into a numerical representation (vector) capturing its semantic meaning.

Dense vectors

The embedding model generates 1024-dimensional vectors representing the overall meaning of the text. Two fragments with similar meaning will have close vectors in the multidimensional space, even if they use completely different words.
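
Closeness between two dense vectors is commonly measured with cosine similarity. The text does not state which metric Queria uses, so the sketch below is an assumption that simply shows how the comparison works in principle.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The score depends only on the direction of the vectors, not on their length, so a short fragment and a long one on the same topic can still score close to 1.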

Sparse vectors

In parallel, a sparse representation (BM25) is generated, capturing the importance of single keywords. This representation excels at retrieving specific terms: codes, acronyms, proper nouns, article numbers.
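
For reference, the standard Okapi BM25 scoring function can be written in a few lines. This is a generic sketch, not Queria's indexing code: documents are assumed to be pre-tokenized, and `k1 = 1.5` and `b = 0.75` are commonly used defaults rather than values taken from the product.

```python
import math
from collections import Counter


def bm25_scores(query: list, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each tokenized document against the query terms with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization penalizes long documents.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Rare terms receive a high inverse document frequency, which is why this representation is effective for codes, acronyms and article numbers that appear in only a few chunks.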

Indexed metadata

Along with the vectors, structured information enabling filtering is stored:

Source document title
Document date
Author
Original format
Chunk type (article, paragraph, table)
Owning organization
Topic/category

Phase 6: Multi-modal indexing

The vectorized chunks are indexed in a search system that supports three complementary modes:

Semantic similarity search: finds chunks with meaning similar to the user query, regardless of the words used.
Keyword search: finds chunks containing specific terms mentioned in the query.
Structured filter search: narrows results by date, format, category, author or any indexed metadata.

The three modes can be freely combined. A typical search uses semantic similarity and keywords in a hybrid way, with structured filters to narrow the scope.
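
One common way to merge a semantic ranking and a keyword ranking is reciprocal rank fusion (RRF). The text does not say how Queria combines the two, so the sketch below is an assumption chosen for illustration; `k = 60` is the constant usually used with RRF, and `filter_by_metadata` is a hypothetical helper standing in for structured filters.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of chunk ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


def filter_by_metadata(chunk_ids: list, metadata: dict, required: dict) -> list:
    """Keep only chunks whose metadata matches every required field."""
    return [
        chunk_id
        for chunk_id in chunk_ids
        if all(metadata[chunk_id].get(key) == value for key, value in required.items())
    ]
```

A chunk that ranks well in both lists rises to the top, while a chunk found by only one mode is still kept in the fused result.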

OCR capabilities

Queria's OCR engine goes beyond simple character recognition:

Automatic detection: the system autonomously identifies pages needing OCR, without user intervention.
Multilingual support: text recognition in Italian, English and other European languages.
AI-enhanced recognition: the vision AI model interprets visual content with contextual understanding, improving accuracy on low-quality documents.
Table extraction from images: tables in scanned documents are recognized and converted into a structured format.
Two-pass recognition with Markdown output: for complex documents, the system runs a first recognition pass and a second structuring pass, producing Markdown output with properly formatted tables.

Sync from external sources

Queria is not limited to manual upload. The system supports automatic document acquisition from remote sources:

Cloud storage

Connection to cloud storage services with periodic automatic sync. The system monitors configured folders and automatically imports new or updated documents.

Network folders (SMB/CIFS)

For organizations using internal file servers, Queria connects directly to shared folders via the SMB/CIFS protocol, also through VPN tunnels. Monitoring occurs at a configurable frequency.

Incremental updates

Sync is incremental: only new or modified files are re-processed. Files removed from the source are marked accordingly in the system. This approach minimizes processing load and keeps the document base up to date.
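
The comparison between two sync runs can be sketched as a diff of two snapshots. The example assumes each snapshot maps a file path to a fingerprint such as a content hash or a modification time; the text does not specify how changes are detected, and `plan_sync` is a hypothetical name.

```python
def plan_sync(previous: dict, current: dict) -> dict:
    """Compare two path-to-fingerprint snapshots and list what changed."""
    return {
        "new": sorted(path for path in current if path not in previous),
        "modified": sorted(
            path for path in current
            if path in previous and current[path] != previous[path]
        ),
        "removed": sorted(path for path in previous if path not in current),
    }
```

Only the paths listed under `new` and `modified` would go through the pipeline again; unchanged files are skipped entirely.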

Document generation (DocGen)

Beyond analysis, Queria can generate new professional documents from extracted data.

Generation pipeline

Template: the user selects a document model from those available for their organization.
Data extraction: the system automatically extracts the needed information from source documents. The approach is hybrid: structured extraction for tabular data (XLSX) and AI for unstructured text.
Schema validation: the extracted data is verified against a schema defining mandatory fields and expected formats.
Interactive filling: if information is missing, the system presents the fields to complete manually.
Generation: the DOCX document is produced with professional formatting, headings, tables and Italian-standard layout.
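
The schema validation and interactive filling steps can be illustrated with a small sketch. The schema layout, the field names and the `validate_extracted` function are hypothetical; they only show how mandatory fields and expected formats might be checked before generation.

```python
import re

# Hypothetical schema: which fields are mandatory and what they must look like.
REPORT_SCHEMA = {
    "client_name": {"required": True, "pattern": None},
    "report_date": {"required": True, "pattern": r"\d{4}-\d{2}-\d{2}"},
    "notes": {"required": False, "pattern": None},
}


def validate_extracted(data: dict, schema: dict) -> dict:
    """Return the mandatory fields that are missing and the fields that are malformed."""
    missing, malformed = [], []
    for field, rule in schema.items():
        value = data.get(field)
        if value is None or value == "":
            if rule["required"]:
                missing.append(field)
            continue
        pattern = rule.get("pattern")
        if pattern is not None and not re.fullmatch(pattern, value):
            malformed.append(field)
    return {"missing": missing, "malformed": malformed}
```

The fields reported as missing or malformed are the ones that would be presented to the user for manual completion before the document is generated.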

Typical use cases

Generation of analysis reports starting from technical documents
Filling of standardized forms with data extracted from multiple sources
Creation of summary documents from sets of related documents
Production of structured comparisons between documents

The generated document is a standard DOCX file, editable with any compatible text editor and ready for professional distribution.
