
Ingestion DSL

As of v3.5.0, document ingestion is canvas-native as well. The pipeline that takes a user-uploaded PDF all the way to vectorized chunks in Qdrant is no longer an imperative TypeScript service: it is a canvas with purpose = INGESTION, executed by the DSL engine.

This enables advanced per-tenant customizations, role-aware ingestion and a gradual migration from legacy pipelines.

What changes

| Aspect | LEGACY (v3.1.x) | DSL V2 (v3.5.0+) |
| --- | --- | --- |
| Parser | Single PDF pipeline | MIME auto-router, Docling + Unstructured as peers |
| Chunking | 14 domain-specific chunkers in code | 5 role-based chunkers + configurable domain presets |
| Document roles | Not modeled | TRUTH / FORMAT / RULES / OPERATIONAL / EXAMPLES |
| Cleanup | Manual re-embedding | RecordManager with 4 modes (UPSERT, FULL, INCREMENTAL, SCOPED) |
| Resume | No | Pipeline snapshot + checkpoint |
| Multi-sink | Only Qdrant + Postgres | Qdrant + Postgres + OperationalData + RecordManager as nodes |
| Connectors | Only upload + scrape | V2-ready for email, cloud storage, SharePoint |

Document roles

Every ingested document is classified into one or more roles, which determine how the pipeline treats it:

| Role | Typical examples | Treatment |
| --- | --- | --- |
| TRUTH | Manuals, interpreted regulations, policies, books | Semantic chunking by paragraph/article |
| FORMAT | Templates, forms, schemas | Structure extraction + placeholders |
| RULES | Decisions, regulations, judgments | Chunking by article/clause with citations |
| OPERATIONAL | Price lists, tables, structured data | Row-level extraction, OperationalData sink |
| EXAMPLES | Case studies, scenarios, FAQs | Chunking by Q&A pair or scenario block |

Classification is automatic (role_classifier node) but can be forced by the admin at upload time or via per-topic override. A document can have multiple roles (e.g. a judgment is TRUTH + RULES).

For the operational guide on roles in retrieval see docs/ingestion-dsl/DOCUMENT_ROLES_GUIDE.md in the internal repo.

TRUTH

Authoritative, narrative, reference content. It is the "default" role for most company documents.

  • Examples: operational manuals, interpreted regulations, company policies, white papers, books.
  • Chunker: chunker_truth -- semantic segmentation by paragraphs and sections, alignment with natural boundaries, 15% overlap.
  • Extracted metadata: title, sections, language, confidence level, optional summary.
  • In retrieval: high base weight, cited as [N] with preview of the original paragraph.
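To make the 15% overlap concrete, here is a minimal sketch. It assumes fixed-size character windows, which is a simplification: chunker_truth segments on paragraph and section boundaries, and the function name and sizes below are hypothetical.

```typescript
// Hypothetical helper, not the chunker_truth implementation: it only shows
// how an overlap percentage translates into text shared by adjacent chunks.
function chunkWithOverlap(
  text: string,
  chunkSize: number,
  overlapPercentage: number,
): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlapPercentage < 0 || overlapPercentage >= 100) {
    throw new Error("overlapPercentage must be in [0, 100)");
  }
  const overlap = Math.floor((chunkSize * overlapPercentage) / 100);
  const step = chunkSize - overlap; // always > 0 because overlap < chunkSize
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

// 200 characters, windows of 100 with 15% overlap: each window starts 85 characters after the previous one.
const sampleText = "a".repeat(100) + "b".repeat(100);
const overlappedChunks = chunkWithOverlap(sampleText, 100, 15);
```

With these numbers, the last 15 characters of one chunk are repeated at the start of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.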

FORMAT

Documents that show a structure to reproduce rather than carry authoritative information.

  • Examples: contract templates, blank forms, schemas, layouts of standard documents.
  • Chunker: chunker_format -- extracts the structure (fields, sections, placeholders) and indexes it separately from "filler text".
  • Extracted metadata: list of fields/placeholders, document type, template language.
  • In retrieval: mainly used by Document Generation to populate new documents. Rarely cited in chat (unless the question is explicitly "how do I fill form X?").

RULES

Prescriptive documents: they establish what can or cannot be done, define obligations, sanctions, deadlines, applicable regulations. The system treats them in a special way because the precision of the regulatory reference is critical.

  • Examples: judgments (Supreme Court, Courts of Appeal, administrative tribunals), decrees, EU regulations, law articles, orders, administrative decisions, Revenue Agency circulars, internal company regulations with binding effect.
  • Chunker: chunker_rules -- recognizes the article - paragraph - letter - number structure typical of Italian legislation and EU regulations. Keeps each article as a coherent unit, preserves cross-references (e.g. "pursuant to art. 12 par. 3 letter b") and indexes maxims and dispositives in judgments separately.
  • Assigned chunk type: ARTICLE, CLAUSE, SECTION, MASSIMA, DISPOSITIVO -- finer granularity than TRUTH.
  • Extracted metadata:
    • documentType -- e.g. JUDGMENT, LEGISLATIVE_DECREE, EU_REGULATION, TAX_CIRCULAR.
    • articleNumber, paragraphNumber, letter -- structured identifiers for precise citation.
    • dateInForce, dateRepealed -- temporal validity (if detected).
    • issuingAuthority -- e.g. "Italian Revenue Agency", "Supreme Court Section I", "EU Parliament".
    • citedNorms[] -- other norms cited in the text (links to related RULES chunks).
  • In retrieval:
    • Explicit priority over TRUTH when the question concerns obligations, sanctions, compliance.
    • Precise citation: the answer cites art. 5 par. 2 LD 231/2001, not just the document title.
    • Automatic validity filter: rules with dateRepealed < today are excluded by default (the filter can be disabled for historical searches).
    • Compatibility with external sources: an Italian norm ingested internally can be linked to the version in Legal Sources (Normattiva).
  • Confidentiality: typically public or internal. Internal judgments (e.g. arbitrations) may be confidential.
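The automatic validity filter described above can be sketched as follows. The chunk shape and the function name are assumptions made for illustration; only the dateInForce and dateRepealed fields and the "dateRepealed < today" rule come from the text.

```typescript
// Hypothetical chunk shape: only the two date fields are taken from the metadata list above.
interface RulesChunk {
  id: string;
  dateInForce?: string; // ISO date
  dateRepealed?: string; // ISO date; absent when still in force or not detected
}

// Excludes rules repealed before `today`, unless the caller disables the filter
// (for example for historical searches).
function filterByValidity(
  chunks: RulesChunk[],
  today: Date,
  includeRepealed = false,
): RulesChunk[] {
  if (includeRepealed) return chunks;
  return chunks.filter((chunk) => {
    if (chunk.dateRepealed === undefined) return true;
    return new Date(chunk.dateRepealed).getTime() >= today.getTime();
  });
}

const rulesChunks: RulesChunk[] = [
  { id: "in-force", dateInForce: "2001-07-04" },
  { id: "repealed", dateInForce: "1990-01-01", dateRepealed: "2020-01-01" },
  { id: "repeal-scheduled", dateInForce: "2015-01-01", dateRepealed: "2030-01-01" },
];
const referenceDate = new Date("2024-06-01T00:00:00Z");
const currentlyValid = filterByValidity(rulesChunks, referenceDate);
const historical = filterByValidity(rulesChunks, referenceDate, true);
```

Chunks without a detected dateRepealed are kept in this sketch, on the assumption that a missing date means the rule is still in force.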

Practical example

Question: "Can I dismiss a sick employee?" -> The system prefers RULES documents (Civil Code art. 2110, related Labor Section Cass. judgments) over an internal HR circular marked TRUTH. The answer cites art. 2110 c.c. and Cass. n. 12345/2023 with the text of the dispositive, not a summary.

Multi-role TRUTH + RULES

A judgment has two aspects: the motivation (TRUTH -- legal reasoning, context) and the dispositive + maxim (RULES -- established rule). The classifier assigns both roles and the two chunkers produce distinct chunks that coexist in the same document.

OPERATIONAL

Tabular or structured documents with data that makes sense "per row" more than "per free text".

  • Examples: price lists, supplier/customer master data, product sheets, balance sheets in tabular format, reconciliations, time sheets, KPI dashboards.
  • Chunker: chunker_operational -- one row = one chunk. Preserves the key-value of the row (e.g. product=Alpha, price=120.00, availability=in stock).
  • Dedicated sink: sink_operational writes rows to OperationalDataRow (Postgres) enabling structured queries (filter, sum, group) alongside text retrieval.
  • In retrieval: the planner recognizes aggregative intents ("what is the total revenue?", "how many products under 100 euros?") and routes them to a SQL query over operational data instead of the LLM.
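A minimal sketch of the "one row = one chunk" idea, under stated assumptions: the types and the rowsToChunks helper are hypothetical, and the aggregation is done in memory here, whereas the text describes a SQL query over OperationalDataRow.

```typescript
type OperationalRow = Record<string, string | number>;

// Hypothetical chunk shape: the structured fields are kept next to a text rendering.
interface OperationalChunk {
  rowIndex: number;
  fields: OperationalRow; // preserved key-values, usable for structured queries
  text: string; // "key=value" rendering, usable for text retrieval
}

function rowsToChunks(rows: OperationalRow[]): OperationalChunk[] {
  return rows.map((fields, rowIndex) => ({
    rowIndex,
    fields,
    text: Object.entries(fields)
      .map(([key, value]) => `${key}=${value}`)
      .join(", "),
  }));
}

const priceList: OperationalRow[] = [
  { product: "Alpha", price: "120.00", availability: "in stock" },
  { product: "Beta", price: "80.00", availability: "out of stock" },
];
const operationalChunks = rowsToChunks(priceList);

// An aggregative intent ("how many products under 100 euros?") is answered
// from the structured fields, not from the text rendering.
const productsUnder100 = operationalChunks.filter(
  (chunk) => Number(chunk.fields["price"]) < 100,
).length;
```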

EXAMPLES

Demonstrative documents that illustrate how to apply a concept, procedure or rule.

  • Examples: case studies, application scenarios, company FAQs, solved exercises, support knowledge bases.
  • Chunker: chunker_examples -- one Q&A pair or one complete scenario = one chunk. Preserves the integrity of a single example case.
  • In retrieval: used to enrich the answer with a concrete example when the question allows it (e.g. "I have a case similar to..."). Cited with a dedicated sourceType.

Collections (Document Groups / "Raccolte")

Collections (DocumentGroup, "Raccolte" in the IT UI) are named sets of documents that share a common short-to-medium-term purpose -- a case, project, customer, dossier, audit, migration. They are designed to organize ingestion and consultation across Topics, without having to create new Topics for every initiative.

Difference from Topics

| Aspect | Topic | Collection |
| --- | --- | --- |
| Purpose | Stable taxonomy (e.g. "Contracts", "HR", "Legal") | Temporary or working aggregations (e.g. "Rossi Case 2026", "Q1 Audit") |
| Typical cardinality | 3-15 per company | Tens or hundreds, created and retired as needed |
| Lifetime | Long, rarely deleted | OPEN -> CLOSED -> archived |
| Permissions | Fine-grained RBAC per role | Inherits ACL from contained documents |
| AI customization | Dedicated system prompt | No dedicated prompt |
| Identified in chat | Topic selector at the top | @collection mention filter in input |
| DB model | Topic (with ragPresetId, systemPrompt) | DocumentGroup (with topicId, status) |

In practice: a document belongs to one or more Topics (stable category) and optionally to one or more Collections (working aggregation). Topic answers "what kind of document is it?", Collection answers "for which case/project was it uploaded?".

Collection status

A Collection has a status field with two values:

| Status | Meaning | Effect in chat |
| --- | --- | --- |
| OPEN | Active work | Included in @collection suggestions, priority retrieval when filtered |
| CLOSED | Work closed, archive | Excluded from suggestions, accessible only when explicitly recalled |

Closing a Collection does not delete the documents: they remain indexed and searchable but exit the user's "working set".

Collections page (/raccolte)

Admins find a dedicated CRUD page under Ingestion > Collections:

  • Collections table: name, associated Topic, number of documents, status, owner, last modified, actions.
  • Toolbar: search by name, filter by company (SYSTEM_ADMIN only), + New collection button.
  • Detail Sheet (on row click): header with metadata + nested documents table (filename, topic, status, "Remove from collection" action) + "Add documents" footer with multi-select picker.

Row menu actions: Open, Rename, Change status (OPEN/CLOSED), Delete.

When to assign a document to a Collection

Three entry points:

  1. Single wizard (Confirm step) -- "Collection" combobox -- existing or + New.
  2. Bulk profile (step 2) -- same combobox, applied to all files in the run; Path-rules overrides do not touch the Collection (it's a single value per job).
  3. Post-bulk Review Queue -- inline editable column, or bulk-edit bar to assign many rows together.

A document can belong to multiple Collections at the same time (many-to-many relation). This is common for reused documents (e.g. a framework contract used both in "Rossi Case 2026" and "Bianchi Case 2026").

Chat usage

In chat the user can:

  • Type @collection to trigger autocomplete of accessible OPEN Collections: the conversation gets filtered to those documents.
  • Combine Topic + Collection: e.g. Topic = Contracts, Collection = Rossi Case 2026 -> retrieval restricted to contracts of that case.
  • Ask Collection-wide summaries: e.g. "summarize the status of Rossi Case 2026" -> the system builds a synthesis over the Collection's documents only.

Typical lifecycle

Case opened
   |
   v
+ New collection "Rossi Case 2026"
   |
   v
Bulk ingest 200 files (customer + correspondence + acts)
   |   Bulk profile -> Collection = "Rossi Case 2026"
   v
Review Queue: fix topic on ~10 files
   |
   v
Active work: chat filtered @collection-rossi-case
   |
   v
Case closed
   |
   v
status -> CLOSED (searchable archive, out of working set)

API endpoints

| Method | Endpoint | Purpose |
| --- | --- | --- |
| GET | /api/document-groups | List Collections (filters by status, search, owner) |
| POST | /api/document-groups | Create Collection |
| PATCH | /api/document-groups/:id | Rename, change status, associated Topic |
| DELETE | /api/document-groups/:id | Delete (documents stay, only lose the Collection link) |
| POST | /api/document-groups/:id/documents | Add documents |
| DELETE | /api/document-groups/:id/documents/:docId | Remove a document |
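As an illustration of how a client might address these endpoints, the sketch below builds request descriptors without sending them. The query parameter names (status, search, owner) and the PATCH body shape are assumptions inferred from the table and from the DocumentGroup status field; they are not a documented contract.

```typescript
type CollectionStatus = "OPEN" | "CLOSED";

// Hypothetical filter and request shapes, introduced only for this example.
interface ListFilters {
  status?: CollectionStatus;
  search?: string;
  owner?: string;
}

interface ApiRequest {
  method: "GET" | "POST" | "PATCH" | "DELETE";
  path: string;
  body?: unknown;
}

function listCollections(filters: ListFilters = {}): ApiRequest {
  const params: string[] = [];
  for (const key of ["status", "search", "owner"] as const) {
    const value = filters[key];
    if (value !== undefined) params.push(`${key}=${encodeURIComponent(value)}`);
  }
  const query = params.length > 0 ? `?${params.join("&")}` : "";
  return { method: "GET", path: `/api/document-groups${query}` };
}

function closeCollection(id: string): ApiRequest {
  return {
    method: "PATCH",
    path: `/api/document-groups/${encodeURIComponent(id)}`,
    body: { status: "CLOSED" }, // assumed body shape
  };
}

function removeDocument(id: string, docId: string): ApiRequest {
  return {
    method: "DELETE",
    path: `/api/document-groups/${encodeURIComponent(id)}/documents/${encodeURIComponent(docId)}`,
  };
}

const listOpenRossi = listCollections({ status: "OPEN", search: "Rossi Case" });
const closeRequest = closeCollection("grp_1");
const removeRequest = removeDocument("grp_1", "doc_9");
```

Each descriptor can then be passed to whatever HTTP client the caller already uses; the identifiers grp_1 and doc_9 are placeholders.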

Feature flag

The sidebar shows Collections only if the tenant has the feature active:

```bash
ENABLE_DOCUMENT_GROUPS=true
```

When the flag is disabled, the API stays functional (existing data is not lost), but the nav entries and the Collection fields in the wizards and the Review Queue disappear.

Best practices

  1. One Collection = one case / project -- avoid using Collections for further taxonomy (that's what Topics are for).
  2. Close concluded Collections -- an active user's working set should have 5-20 OPEN Collections, not hundreds.
  3. Descriptive names -- Rossi Case 2026 is better than R-2026-01. The chat uses the name as a hint for the planner.
  4. Collections don't replace Topics -- a Collection without a Topic is an orphan: always assign the appropriate Topic at ingestion time.

Anatomy of an INGESTION pipeline

+--------+    +--------+    +--------------+    +----------+    +----------+
| source | -> | parser | -> |role_classify | -> | chunker  | -> | embedder |
+--------+    +--------+    +--------------+    +----------+    +----------+
                                                                     |
                                                                     v
                                              +---------+----------+-------------+
                                              | qdrant  | postgres | operational |
                                              | chunks  | document | data        |
                                              +---------+----------+-------------+
                                                              |
                                                              v
                                                    +------------------+
                                                    |  record manager  |
                                                    |  (cleanup state) |
                                                    +------------------+

Source nodes

  • source_upload -- from user upload (web form, drag&drop, API).
  • source_inchat -- documents attached to a chat (temporary).
  • source_scrape -- web pages from URL.
  • source_ocr -- images with OCR (GLM-OCR via vLLM 8004).
  • (V2) source_email, source_gdrive, source_sharepoint -- structure ready, implementation upcoming.

Parser

  • parser_auto -- automatic router by MIME type.
  • parser_docling -- for PDF/DOCX/PPTX/XLSX/HTML/images. Extracts sections, tables, reading order.
  • parser_unstructured -- for EML/MSG/EPUB/ODT/RTF, audio (V2 transcripts).

Both parser_docling and parser_unstructured delegate to the ingestion-bridge microservice (a Python FastAPI service running in Docker).
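The routing done by parser_auto can be pictured with a short sketch. The function name and the MIME table are assumptions: the table maps the file formats listed above to their standard media types, and audio is left out because it is deferred to V2.

```typescript
type ParserNode = "parser_docling" | "parser_unstructured";

// Standard media types for the formats listed above (images are matched by prefix).
const DOCLING_MIME = new Set<string>([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", // DOCX
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", // PPTX
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", // XLSX
  "text/html",
]);

const UNSTRUCTURED_MIME = new Set<string>([
  "message/rfc822", // EML
  "application/vnd.ms-outlook", // MSG
  "application/epub+zip", // EPUB
  "application/vnd.oasis.opendocument.text", // ODT
  "application/rtf", // RTF
]);

// Hypothetical router: returns undefined for unknown types, because the
// fallback behavior is not described in this page.
function routeParser(mimeType: string): ParserNode | undefined {
  const semicolon = mimeType.indexOf(";");
  const bare = semicolon === -1 ? mimeType : mimeType.slice(0, semicolon);
  const mime = bare.trim().toLowerCase(); // drop parameters such as "; charset=utf-8"
  if (mime.startsWith("image/") || DOCLING_MIME.has(mime)) return "parser_docling";
  if (UNSTRUCTURED_MIME.has(mime)) return "parser_unstructured";
  return undefined;
}
```

For example, routeParser("application/pdf") selects parser_docling, while an EML message with media type message/rfc822 goes to parser_unstructured.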

Classifier and chunker

  • role_classifier -- LLM-based, outputs one or more roles + confidence.
  • chunker_truth, chunker_format, chunker_rules, chunker_operational, chunker_examples -- each with domain presets (legal, medical, financial, HR, ...).

Metadata extractors

Specialized LLM nodes: extract_dates, extract_entities, extract_summary, extract_topics, extract_grade, extract_confidentiality. Run in parallel where possible.

Sinks

  • sink_qdrant -- vectorizes with bge-m3 and writes chunks.
  • sink_metadata -- updates Document on Postgres.
  • sink_operational -- for OPERATIONAL role, writes tabular rows to OperationalDataRow.
  • sink_record_manager -- tracks content and identity hashes for incremental cleanup.
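How identity and content hashes can drive an incremental cleanup is sketched below. This is a generic illustration of the technique, not the sink_record_manager implementation: the record shape, the function name and the three-way plan are assumptions, and the four RecordManager modes are not modeled.

```typescript
// Hypothetical record: the identity hash says which chunk it is,
// the content hash says what the chunk currently contains.
interface TrackedRecord {
  identityHash: string;
  contentHash: string;
}

interface CleanupPlan {
  toWrite: string[]; // new or changed chunks: embed and write again
  unchanged: string[]; // same identity and content: skip re-embedding
  toDelete: string[]; // present in the previous run only: remove from the sinks
}

function planIncrementalCleanup(
  previous: TrackedRecord[],
  current: TrackedRecord[],
): CleanupPlan {
  const previousContent = new Map<string, string>();
  for (const record of previous) {
    previousContent.set(record.identityHash, record.contentHash);
  }

  const plan: CleanupPlan = { toWrite: [], unchanged: [], toDelete: [] };
  const seen = new Set<string>();
  for (const record of current) {
    seen.add(record.identityHash);
    if (previousContent.get(record.identityHash) === record.contentHash) {
      plan.unchanged.push(record.identityHash);
    } else {
      plan.toWrite.push(record.identityHash);
    }
  }
  for (const record of previous) {
    if (!seen.has(record.identityHash)) plan.toDelete.push(record.identityHash);
  }
  return plan;
}

const cleanupPlan = planIncrementalCleanup(
  [
    { identityHash: "a", contentHash: "h1" },
    { identityHash: "b", contentHash: "h2" },
    { identityHash: "c", contentHash: "h3" },
  ],
  [
    { identityHash: "a", contentHash: "h1" },
    { identityHash: "b", contentHash: "h2-edited" },
    { identityHash: "d", contentHash: "h4" },
  ],
);
```

In this example chunk a is skipped, b and d are written, and c is removed, which is the behavior that makes manual re-embedding unnecessary.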

Ingestion modes

To manage migration and A/B testing, every company has a Company.ingestionMode field:

| Mode | Behavior |
| --- | --- |
| LEGACY | Old pipeline (no role, single parser). Default for companies created before v3.5.0 |
| SHADOW | LEGACY in foreground, V2 in parallel. Generates a comparison report, no user impact |
| V2 | Full cutover to canvas DSL |
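The three modes can be read as a small dispatch table. The sketch below is an assumption about how a caller might interpret Company.ingestionMode; the ExecutionPlan type and the planExecution function are hypothetical.

```typescript
type IngestionMode = "LEGACY" | "SHADOW" | "V2";
type Pipeline = "legacy" | "v2";

// Hypothetical plan: which pipeline serves the user, and which one (if any)
// runs only to feed the comparison report.
interface ExecutionPlan {
  foreground: Pipeline;
  shadow?: Pipeline;
}

function planExecution(mode: IngestionMode): ExecutionPlan {
  switch (mode) {
    case "LEGACY":
      return { foreground: "legacy" };
    case "SHADOW":
      // Users keep seeing LEGACY results; V2 runs alongside for comparison only.
      return { foreground: "legacy", shadow: "v2" };
    case "V2":
      return { foreground: "v2" };
  }
}

const shadowPlan = planExecution("SHADOW");
```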

Recommended migration

  1. Set SHADOW on a pilot tenant.
  2. Watch the shadow_comparison_report for ~1 week (precision, recall, latency).
  3. Move to V2 when the numbers are on par or better.
  4. Remember: changing mode does not re-vectorize existing documents. To apply the new chunkers to old documents you need an explicit reingest.

Mode change via SQL:

```sql
UPDATE "Company" SET "ingestionMode" = 'V2' WHERE id = '<companyId>';
```

Or from UI: Company settings > Ingestion > Mode.

Wizard, Bulk and Path-rules

The Ingestion Hub (/documents/upload) is the starting point to load documents into Queria. There are two intent modes, both backed by the same DSL pipeline: only the UX flow changes.

Single mode -- the 3-step AI Wizard

Designed for 1-10 manually uploaded files. The page mounts the UploadWizardPage (3 steps) with AI-assisted classification:

  1. Upload -- drag & drop or file selection. Source: local, cloud, network, URL.
  2. Classify -- for each file the system proposes a "suggested role" card (e.g. RULES -- score 0.87) with predicted topic, sector and language. The user confirms or corrects. Multi-role: multiple roles can be selected when the document is hybrid (e.g. policy + checklist -> TRUTH + RULES).
  3. Confirm -- summary, visibility choice, ingestion job kickoff.

The wizard is the only mode that explicitly shows the classifier confidence score before launch. Useful when individual documents are critical (legal, healthcare) and pre-launch review is acceptable.

When to use it

1 to about 10 files. Small-volume companies start here by default. Above 50 files the system automatically suggests Bulk mode.

Bulk mode -- for cloud, network, URL lists

Designed to ingest dozens or thousands of files from a common source. Three steps:

  1. Source selection -- choose one of the four cards: File (mass drag&drop), Cloud (Drive / S3 / Azure / OneDrive), Network (SMB / NFS), URL list (CSV with N urls).
  2. Ingestion profile -- a single form with defaults applied to all selected files:
    • Topic (multi-select, required if Visibility is "Only assigned topics").
    • Collection (combobox -- existing or "+ New").
    • Document role -- TRUTH / FORMAT / RULES / OPERATIONAL / EXAMPLES / Auto AI (default).
    • Sector -- preset list or Auto AI.
    • Language -- auto-detect or ISO.
    • Visibility -- All / Only assigned topics / Admins only.
    • Path-rules (advanced, accordion closed by default -- see below).
  3. Job in progress -- LiveJobProgress via SSE: per-file status (queued / parsed / vectorized / failed). As soon as one file is complete you can enter the Review Queue without waiting for the job to finish.

When to use it

When importing from an entire source (e.g. a whole Drive folder, an SMB mount, a JSONL dump). From 50 files up the system prefers this mode.

Path-rules

Path-rules are per-file overrides defined inside the bulk ingestion profile. Each rule is a row { glob_pattern, override_topic, override_role }.

Examples:

| Pattern | Override topic | Override role |
| --- | --- | --- |
| gdrive:/Legal/Judgments/**/*.pdf | Case law | RULES |
| gdrive:/Legal/**/*.pdf | Contracts | TRUTH |
| smb://share/PriceLists/**/*.xlsx | Price lists | OPERATIONAL |
| **/FAQ*.md | Help center | EXAMPLES |

Rules are evaluated top-down per file at job execution time. The first match wins; if none matches, profile defaults apply (including Auto AI for topic/role/sector).
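First-match-wins evaluation can be sketched as below. The glob dialect is an assumption ("**/" spans zero or more directories, "*" stays within one path segment), and the PathRule field names and helper functions are hypothetical; the engine's actual matcher may differ.

```typescript
interface PathRule {
  globPattern: string;
  overrideTopic?: string;
  overrideRole?: string;
}

// Minimal glob translation, assumed dialect:
//   "**/" -> zero or more directories, "**" -> anything, "*" -> anything except "/".
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*";
      i += 1;
    } else {
      // Escape characters that are special in regular expressions.
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Rules are scanned top-down; the first matching rule wins.
function firstMatch(path: string, rules: PathRule[]): PathRule | undefined {
  return rules.find((rule) => globToRegExp(rule.globPattern).test(path));
}

const pathRules: PathRule[] = [
  { globPattern: "gdrive:/Legal/Judgments/**/*.pdf", overrideTopic: "Case law", overrideRole: "RULES" },
  { globPattern: "gdrive:/Legal/**/*.pdf", overrideTopic: "Contracts", overrideRole: "TRUTH" },
  { globPattern: "**/FAQ*.md", overrideTopic: "Help center", overrideRole: "EXAMPLES" },
];

const judgmentRule = firstMatch("gdrive:/Legal/Judgments/2023/cass.pdf", pathRules);
const contractRule = firstMatch("gdrive:/Legal/NDA/acme.pdf", pathRules);
const unmatched = firstMatch("gdrive:/HR/policy.docx", pathRules);
```

The judgment path also matches the broader gdrive:/Legal/**/*.pdf pattern, so it only receives the RULES override because the more specific rule is listed first. A file with no matching rule (unmatched above) falls back to the profile defaults.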

When to use them:

  • Cloud or network sources with a folder structure that already encodes the company taxonomy.
  • One-shot migrations where you want to preserve a path -> topic mapping without correcting each file in Review.
  • Mixed sources where different sub-paths should land in different topics/roles.

Best practice:

  • Order rules from most specific to most generic (first one wins).
  • Document rules in the Collection name or topic changelog: future readers must understand the rationale.
  • For a few isolated files, don't write rules: better to edit afterwards in the Review Queue (see below).

Review Queue (post-bulk)

After a bulk job, the Review Queue (/documents/upload?mode=bulk&job=<id>) shows all files of the run in a DataTable:

| Column | Editable / behavior |
| --- | --- |
| Selection checkbox | -- |
| Filename | preview on click (side Sheet) |
| Status | success / warning / error |
| Topic | inline (multi-select) |
| Collection | inline |
| Sector | inline |
| Visibility | inline |
| AI confidence | progress 0-100 (filterable) |

Quick toolbar:

  • Search-by-filename (300ms debounce).
  • 3 preset chips: Sector = OTHER, AI confidence < 70%, Missing topic -- to quickly catch critical files.
  • CSV export of the selection.
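The three preset chips amount to simple row predicates. In the sketch below the ReviewRow shape and chip identifiers are hypothetical, and combining active chips with AND is an assumption about the toolbar's behavior.

```typescript
// Hypothetical row shape for the Review Queue table.
interface ReviewRow {
  filename: string;
  topics: string[];
  sector: string;
  aiConfidence: number; // 0-100
}

type Chip = "SECTOR_OTHER" | "LOW_CONFIDENCE" | "MISSING_TOPIC";

const chipPredicates: Record<Chip, (row: ReviewRow) => boolean> = {
  SECTOR_OTHER: (row) => row.sector === "OTHER",
  LOW_CONFIDENCE: (row) => row.aiConfidence < 70,
  MISSING_TOPIC: (row) => row.topics.length === 0,
};

// Assumption: a row is shown only if it satisfies every active chip.
function applyChips(rows: ReviewRow[], active: Chip[]): ReviewRow[] {
  return rows.filter((row) => active.every((chip) => chipPredicates[chip](row)));
}

const reviewRows: ReviewRow[] = [
  { filename: "policy.pdf", topics: ["HR"], sector: "HR", aiConfidence: 92 },
  { filename: "scan-001.pdf", topics: [], sector: "OTHER", aiConfidence: 55 },
  { filename: "ruling.pdf", topics: ["Legal"], sector: "LEGAL", aiConfidence: 64 },
];

const lowConfidence = applyChips(reviewRows, ["LOW_CONFIDENCE"]);
const lowConfidenceNoTopic = applyChips(reviewRows, ["LOW_CONFIDENCE", "MISSING_TOPIC"]);
```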

Bulk-edit bar: appears when you select >= 1 row. Lets you apply topic / sector / visibility / collection to all selected files in a single PATCH (/api/documents/bulk-update). Optimistic UI with Sonner error toast on failure.

The ingestion job does not wait for review: files flow through the pipeline as they are uploaded. The Review Queue is for correcting, after the fact, the topic/role/sector of files with low confidence or files that required a fallback. Changes to these metadata fields do not require a reingest (they are Qdrant payload + Postgres updates).

Reingest

To apply pipeline changes to already vectorized documents:

  • Single document: page Documents > [doc] > Reingest.
  • Bulk: script src/scripts/reingest-company-onebyone.ts (template available).
  • Payload backfill (no reingest, metadata only): idempotent scripts backfill-qdrant-*.ts to stamp new fields (sector, role, grade, aisummary) without re-embedding.

Status and monitoring

Every ingestion job produces:

  • IngestionJob on Postgres with state (RUNNING, COMPLETED, FAILED, RESUMED).
  • PipelineSnapshot for crash resume.
  • IngestionRecord (RecordManager) for deterministic cleanup.

Admin dashboard: Ingestion > Live dashboard shows current jobs, queues, errors, latency per node.

Composability per tenant

An admin can:

  • Add custom extractors -- e.g. a node that extracts the Italian ATECO code from company documents.
  • Disable classifier -- forcing everything as TRUTH for simple cases.
  • Change embedder -- e.g. for advanced multilingual (V2).
  • Add custom sinks -- e.g. mirror to an external data lake.

All without modifying the backend: just duplicate the ingestion canvas and edit it.

Preset migration

Legacy presets had chunking, retrieval, reranker, etc. sections. Today they only keep chunking (chunkSize, overlapPercentage). Everything else is in the canvas. See Slim RAG Presets for the new preset schema.

V1 limits

  • Audio/video parser deferred to V2.
  • Email/SharePoint/GDrive connectors not yet active in canvases (architecture ready, implementation upcoming).
  • Graph ingestion (entity extraction + community detection) is permanently excluded from V1 because of its high LLM cost.
  • OCR for low-quality scans remains less precise than the Docling native parser: for heavily scanned archives consider upstream OCR improvement.

Queria v3.5.0 -- Canvas-native Ingestion DSL

Queria - Document Intelligence with Cog-RAG