Document Processing Pipeline
When a document is uploaded to Queria, it passes through a processing pipeline that turns a static file into content that AI can search, analyze and cite. The process is fully automatic and is designed to handle every supported document format while preserving as much of the original structure as possible.
Process overview
Original file
|
v
[1] Upload and validation
|
v
[2] Parsing and text extraction
|
v
[3] Linguistic analysis and metadata
|
v
[4] Intelligent chunking
|
v
[5] Vector embedding
|
v
[6] Multi-modal indexing
|
v
Document ready for search
Each phase is monitored, and the document moves through states visible to the user: Uploaded, Processing, Ready, or Error if something goes wrong.
Phase 1: Upload and validation
On file arrival, the system runs a series of checks:
- Format verification: file type is validated both by extension and by binary content analysis (magic bytes). Supported formats: PDF, DOCX, PPTX, XLSX, CSV, images (PNG, JPG, TIFF, WEBP) and other text formats.
- Size limits: verification that the file fits within the limits configured for the organization.
- Integrity: check that the file is not corrupted or truncated.
- Deduplication: identification of documents already present in the system to avoid duplication.
The original file is stored securely and persistently, regardless of subsequent processing.
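The format and deduplication checks above can be illustrated with a minimal sketch. It is not Queria's implementation: the function names are hypothetical, the signature table covers only a few of the supported formats, and using a SHA-256 content hash for deduplication is an assumption (the text does not say how duplicates are identified).

```python
import hashlib
from typing import Optional

# Well-known file signatures (magic bytes). DOCX, PPTX and XLSX are all ZIP
# containers, so the signature alone cannot tell them apart.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"II*\x00": "tiff",
    b"MM\x00*": "tiff",
    b"PK\x03\x04": "zip-container",
}
ZIP_BASED_EXTENSIONS = {"docx", "pptx", "xlsx"}
EXTENSION_ALIASES = {"jpeg": "jpg", "tif": "tiff"}


def sniff_format(data: bytes) -> Optional[str]:
    """Guess the format from the leading bytes, or return None if unknown."""
    for signature, name in MAGIC_SIGNATURES.items():
        if data.startswith(signature):
            return name
    # WEBP is a RIFF container with the tag "WEBP" at offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def extension_matches_content(filename: str, data: bytes) -> bool:
    """Check that the extension agrees with the sniffed binary format.

    Plain-text formats such as CSV have no signature and are out of scope
    for this sketch, so they are reported as not matching.
    """
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    ext = EXTENSION_ALIASES.get(ext, ext)
    sniffed = sniff_format(data)
    if sniffed == "zip-container":
        return ext in ZIP_BASED_EXTENSIONS
    return sniffed is not None and sniffed == ext


def content_fingerprint(data: bytes) -> str:
    """Assumed deduplication key: identical bytes give an identical digest."""
    return hashlib.sha256(data).hexdigest()
```

A file named report.pdf whose content starts with the ZIP signature would be rejected by extension_matches_content, which is the kind of mismatch the double check is meant to catch.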
Phase 2: Parsing and text extraction
Each format is handled by a specialized parser that extracts as much structural information as possible.
PDF
PDF documents receive differentiated treatment:
- Native text PDFs: direct text extraction with structure preservation (headings, paragraphs, lists).
- Scanned PDFs: automatic OCR activation when the system detects pages without selectable text.
- Mixed PDFs: handling of documents that contain both native and scanned pages, applying OCR only where needed.
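The per-page OCR decision could look roughly like the sketch below. The Page type, the function names and the 20-character threshold are all assumptions made for illustration; the text only states that OCR is triggered for pages without selectable text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    number: int
    native_text: str  # text read from the page's text layer, "" if there is none


# Hypothetical threshold: pages with fewer characters than this are treated
# as having no usable text layer.
MIN_NATIVE_CHARS = 20


def pages_needing_ocr(pages: List[Page], min_chars: int = MIN_NATIVE_CHARS) -> List[int]:
    """Return the numbers of the pages that should be sent to OCR."""
    return [p.number for p in pages if len(p.native_text.strip()) < min_chars]


def classify_pdf(pages: List[Page]) -> str:
    """Label a PDF as 'native', 'scanned' or 'mixed' from its pages."""
    needing = pages_needing_ocr(pages)
    if not needing:
        return "native"
    if len(needing) == len(pages):
        return "scanned"
    return "mixed"
```

In a mixed document only the page numbers returned by pages_needing_ocr would go through OCR, while the remaining pages keep their natively extracted text.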
Office documents
- DOCX: content extraction that respects the document hierarchy (heading levels, paragraphs, tables, bulleted lists). Document metadata (author, date, revision) is preserved.
- PPTX: text extraction from each slide, preserving the slide order.
- XLSX/CSV: tabular processing with column header recognition. Each row is treated as an information unit with the context of its header.
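To show what "each row with the context of its header" can mean in practice, here is a small sketch for CSV input using Python's standard csv module. The function name and the "header: value" output format are assumptions, not the format Queria actually produces.

```python
import csv
import io
from typing import List


def rows_as_units(csv_text: str) -> List[str]:
    """Turn each data row into a self-describing text unit.

    Every value is paired with its column header, so a row remains
    understandable even when it is retrieved without the rest of the table.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    units = []
    for row in reader:
        units.append("; ".join(f"{header}: {value}" for header, value in row.items()))
    return units


sample = "code,description\nA1,Valve\nB2,Pump\n"
units = rows_as_units(sample)
# units[0] is "code: A1; description: Valve"
```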
Images
Images are processed by the OCR engine with advanced AI-based text recognition. The system supports scanned documents, photos of documents, screenshots and any image containing text.
Phase 3: Linguistic analysis and metadata
The extracted text is analyzed to enrich its metadata:
- Language detection: automatic identification of the document language to optimize subsequent processing.
- Metadata extraction: title, author, creation date, number of pages, original format.
- Thematic classification: when configured, automatic assignment of the document to one or more thematic categories.
Phase 4: Intelligent chunking
The text is segmented into fragments (chunks) optimized for semantic search. This is one of the most critical steps of the whole pipeline: chunks that are too small lose context, while chunks that are too large dilute meaning.
Adaptive segmentation
The system uses a chunking strategy that adapts to document structure:
- Target size: about 2000 characters per chunk, calibrated to balance semantic precision and contextual richness.
- Overlap: adjacent chunks share 15% of their content, so that information near a chunk boundary is not lost.
- Structure awareness: the system tries to align chunk boundaries with natural document boundaries (end of paragraph, end of section, end of list).
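The three parameters above can be combined in a simplified chunker like the following. It is a sketch under stated assumptions: paragraphs are separated by blank lines, the overlap is computed as 15% of the target size (300 characters for a 2000-character target), and a paragraph longer than the target is kept whole rather than split. The function name is hypothetical.

```python
from typing import List


def chunk_text(text: str, target_size: int = 2000, overlap_ratio: float = 0.15) -> List[str]:
    """Group paragraphs into chunks of roughly target_size characters.

    Chunk boundaries always fall between paragraphs. Each new chunk starts
    with the tail of the previous one, so adjacent chunks overlap.
    """
    overlap = int(target_size * overlap_ratio)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > target_size:
            chunks.append(current)
            # Carry the last `overlap` characters into the next chunk.
            tail = current[-overlap:] if overlap else ""
            current = f"{tail}\n\n{paragraph}" if tail else paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With three 900-character paragraphs and the default settings, the first two paragraphs fit in one chunk and the third starts a second chunk that begins with the last 300 characters of the first.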
Specialized chunking for legal documents
Regulatory documents need dedicated treatment. The legal chunker:
- Recognizes the article-paragraph-letter structure typical of Italian legislation
- Keeps each article as a coherent unit
- Preserves cross-references between articles
- Handles partitions into titles, chapters and sections
- Assigns a specific type to each chunk: article, section, paragraph, clause, heading, list, table
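As a rough illustration of article-level splitting, the sketch below cuts a text at lines that begin with "Art. N" and records references to other articles. The regular expressions, the dictionary layout and the function name are assumptions; real legislation has many more heading variants, and any text before the first article is ignored here.

```python
import re
from typing import Dict, List

# Article headings such as "Art. 1" or "Art. 2-bis" at the start of a line.
ARTICLE_RE = re.compile(r"^Art\.\s*(\d+(?:-\w+)?)", re.MULTILINE)
# Cross-references written as "articolo 3" inside the body text.
REFERENCE_RE = re.compile(r"\barticolo\s+(\d+)", re.IGNORECASE)


def split_articles(text: str) -> List[Dict[str, object]]:
    """Split a legal text into one chunk per article."""
    matches = list(ARTICLE_RE.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.start():end].strip()
        chunks.append({
            "type": "article",
            "number": match.group(1),
            "text": body,
            # Article numbers mentioned in the body, kept as cross-references.
            "references": sorted(set(REFERENCE_RE.findall(body)), key=int),
        })
    return chunks


sample = (
    "Art. 1\n1. La presente legge disciplina il trattamento dei documenti.\n"
    "Art. 2\n1. Ai fini dell'articolo 1 si intende per:\na) documento: ogni atto scritto.\n"
)
articles = split_articles(sample)
# articles[1]["number"] is "2" and articles[1]["references"] is ["1"]
```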
Preservation of tables and lists
Tables are treated as indivisible units: a table is never split between two chunks. The same principle applies to bulleted and numbered lists, kept intact to preserve sequential context.
Phase 5: Vector embedding
Each chunk is transformed into a numerical representation (vector) capturing its semantic meaning.
Dense vectors
The embedding model generates 1024-dimensional vectors representing the overall meaning of the text. Two fragments with similar meaning will have close vectors in the multidimensional space, even if they use completely different words.
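"Close" vectors are usually compared with a similarity measure. The sketch below assumes cosine similarity, which is a common choice for embeddings; the text does not say which measure Queria uses, and the tiny 3-dimensional vectors stand in for real 1024-dimensional ones.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Toy vectors: the first two point in similar directions, the third does not.
contract_termination = [0.9, 0.1, 0.0]
ending_an_agreement = [0.8, 0.2, 0.1]
pump_maintenance = [0.0, 0.1, 0.9]
```

Here cosine_similarity(contract_termination, ending_an_agreement) is much higher than cosine_similarity(contract_termination, pump_maintenance), which mirrors how two differently worded fragments with similar meaning end up near each other.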
Sparse vectors
In parallel, a sparse representation (BM25) is generated, capturing the importance of single keywords. This representation excels at retrieving specific terms: codes, acronyms, proper nouns, article numbers.
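For readers unfamiliar with BM25, the following sketch scores documents against a query with the standard Okapi BM25 formula. The whitespace tokenizer and the parameter values k1=1.5 and b=0.75 are common defaults chosen for illustration, not settings taken from Queria.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query: str, documents: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not documents:
        return []
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query for a code such as ABC123 scores above zero only for documents that contain that exact token, which is why this representation complements dense vectors for codes, acronyms and article numbers.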
Indexed metadata
Along with the vectors, structured information enabling filtering is stored:
- Source document title
- Document date
- Author
- Original format
- Chunk type (article, paragraph, table)
- Owning organization
- Topic/category
Phase 6: Multi-modal indexing
The vectorized chunks are indexed in a search system that supports three complementary modes:
- Semantic similarity search: finds chunks with meaning similar to the user query, regardless of the words used.
- Keyword search: finds chunks containing specific terms mentioned in the query.
- Structured filter search: narrows results by date, format, category, author or any indexed metadata.
The three modes can be freely combined. A typical search uses semantic similarity and keywords in a hybrid way, with structured filters to narrow the scope.
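One way to combine the modes is sketched below. It assumes reciprocal rank fusion (RRF) to merge the semantic and keyword rankings and exact-match metadata filters; the text does not specify how Queria fuses results, so both the technique and the function names are illustrative.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk ids into one list.

    Each chunk earns 1 / (k + rank) from every list in which it appears,
    so chunks ranked well by both modes rise to the top.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


def hybrid_search(
    semantic_ranking: List[str],
    keyword_ranking: List[str],
    metadata: Dict[str, Dict[str, str]],
    filters: Dict[str, str],
) -> List[str]:
    """Fuse the two rankings, then keep only chunks matching every filter."""
    fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
    return [
        chunk_id
        for chunk_id in fused
        if all(metadata[chunk_id].get(field) == value for field, value in filters.items())
    ]


metadata = {
    "c1": {"format": "pdf"},
    "c2": {"format": "docx"},
    "c3": {"format": "pdf"},
    "c4": {"format": "pdf"},
}
results = hybrid_search(["c1", "c2", "c3"], ["c2", "c4", "c1"], metadata, {"format": "pdf"})
# results is ["c1", "c4", "c3"]: c2 ranks first after fusion but is filtered out
```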
OCR capabilities
Queria's OCR engine goes beyond simple character recognition:
- Automatic detection: the system autonomously identifies pages needing OCR, without user intervention.
- Multilingual support: text recognition in Italian, English and other European languages.
- AI-enhanced recognition: a vision AI model interprets visual content with contextual understanding, improving accuracy on low-quality documents.
- Table extraction from images: tables in scanned documents are recognized and converted into a structured format.
- Two-pass with Markdown output: for complex documents, the system runs a first recognition pass and a second structuring pass, producing Markdown output with properly formatted tables.
Sync from external sources
Queria is not limited to manual upload. The system supports automatic document acquisition from remote sources:
Cloud storage
Connection to cloud storage services with periodic automatic sync. The system monitors configured folders and automatically imports new or updated documents.
Network folders (SMB/CIFS)
For organizations using internal file servers, Queria connects directly to shared folders via the SMB/CIFS protocol, including over VPN tunnels. Folders are monitored at a configurable frequency.
Incremental updates
Sync is incremental: only new or modified files are re-processed. Files removed from the source are marked accordingly in the system. This approach minimizes processing load and keeps the document base up to date.
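The incremental logic can be pictured as a comparison between what the system already knows and what the source currently contains. In this sketch each file is identified by its path and a fingerprint (for example a content hash or modification time); the choice of fingerprint and the function name are assumptions.

```python
from typing import Dict, List, Tuple


def plan_sync(known: Dict[str, str], remote: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare the last known state with the current remote listing.

    Both arguments map a file path to a fingerprint. Returns the paths to
    (re-)process and the paths to mark as removed.
    """
    # New files are absent from `known`; modified files have a different fingerprint.
    to_process = sorted(path for path, fp in remote.items() if known.get(path) != fp)
    # Files that disappeared from the source are only marked, not re-processed.
    removed = sorted(path for path in known if path not in remote)
    return to_process, removed


known = {"contracts/a.pdf": "v1", "contracts/b.pdf": "v1", "old/c.docx": "v1"}
remote = {"contracts/a.pdf": "v1", "contracts/b.pdf": "v2", "contracts/d.pdf": "v1"}
to_process, removed = plan_sync(known, remote)
# to_process is ["contracts/b.pdf", "contracts/d.pdf"], removed is ["old/c.docx"]
```

The unchanged file contracts/a.pdf appears in neither list, which is what keeps the processing load low on each sync cycle.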
Document generation (DocGen)
Beyond analysis, Queria can generate new professional documents starting from extracted data:
Generation pipeline
- Template: the user selects a document model from those available for their organization.
- Data extraction: the system automatically extracts the needed information from source documents. The approach is hybrid: structured extraction for tabular data (XLSX) and AI for unstructured text.
- Schema validation: the extracted data is verified against a schema defining mandatory fields and expected formats.
- Interactive filling: if information is missing, the system presents the fields to complete manually.
- Generation: the DOCX document is produced with professional formatting, headings, tables and Italian-standard layout.
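The schema validation and interactive filling steps can be sketched as follows. The schema layout, field names and regular-expression patterns are hypothetical; the point is only that validation separates fields the user must still supply from fields whose extracted value has the wrong format.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical schema: each field declares whether it is mandatory and
# the pattern its value is expected to match.
SCHEMA = {
    "client_name": {"required": True, "pattern": r".+"},
    "report_date": {"required": True, "pattern": r"\d{4}-\d{2}-\d{2}"},
    "notes": {"required": False, "pattern": r".*"},
}


def validate(data: Dict[str, str], schema: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Return (missing mandatory fields, fields with an unexpected format)."""
    missing, invalid = [], []
    for field, rule in schema.items():
        value = data.get(field)
        if value is None or value == "":
            if rule["required"]:
                missing.append(field)  # to be completed manually by the user
            continue
        if not re.fullmatch(rule["pattern"], value):
            invalid.append(field)
    return missing, invalid


extracted = {"client_name": "Acme S.r.l.", "report_date": "12/03/2024"}
missing, invalid = validate(extracted, SCHEMA)
# missing is [], invalid is ["report_date"] because the date is not in YYYY-MM-DD form
```

Fields returned in the first list correspond to what the interactive filling step would present to the user before generation proceeds.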
Typical use cases
- Generation of analysis reports starting from technical documents
- Filling of standardized forms with data extracted from multiple sources
- Creation of summary documents from sets of related documents
- Production of structured comparisons between documents
The generated document is a standard DOCX file, editable with any compatible text editor and ready for professional distribution.