Ingestion DSL

As of v3.5.0, document ingestion is canvas-native too. The pipeline that takes a user-uploaded PDF all the way to vectorized chunks in Qdrant is no longer an imperative TypeScript service: it is a canvas with purpose = INGESTION, executed by the DSL engine.

This enables advanced per-tenant customizations, role-aware ingestion and a gradual migration from legacy pipelines.

What changes

Aspect	LEGACY (v3.1.x)	DSL V2 (v3.5.0+)
Parser	Single PDF pipeline	MIME auto-router, Docling + Unstructured as peers
Chunking	14 domain-specific chunkers in code	5 role-based chunkers + configurable domain presets
Document roles	Not modeled	TRUTH / FORMAT / RULES / OPERATIONAL / EXAMPLES
Cleanup	Manual re-embedding	RecordManager with 4 modes (UPSERT, FULL, INCREMENTAL, SCOPED)
Resume	No	Pipeline snapshot + checkpoint
Multi-sink	Only Qdrant + Postgres	Qdrant + Postgres + OperationalData + RecordManager as nodes
Connectors	Only upload + scrape	V2-ready for email, cloud storage, SharePoint

Document roles

Every ingested document is classified into one or more roles, which determine how the pipeline treats it:

Role	Typical examples	Treatment
TRUTH	Manuals, regulations, policies, books	Semantic chunking by paragraph/article
FORMAT	Templates, forms, schemas	Structure extraction + placeholders
RULES	Decisions, regulations, judgments	Chunking by article/clause with citations
OPERATIONAL	Price lists, tables, structured data	Row-level extraction, `OperationalData` sink
EXAMPLES	Case studies, scenarios, FAQs	Chunking by Q&A pair or scenario block

Classification is automatic (role_classifier node) but can be forced by the admin at upload time or via per-topic override. A document can have multiple roles (e.g. a judgment is TRUTH + RULES).
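
The precedence between forced roles and automatic classification can be sketched as follows. This is a minimal illustration, not the engine's code: the `resolveRoles` helper, the `ClassifierOutput` and `RoleOverrides` shapes, the 0.5 confidence threshold, and the relative priority of the upload-time override over the per-topic override are all assumptions. Only the five role names and the existence of the two override paths come from the text.

```typescript
type DocumentRole = "TRUTH" | "FORMAT" | "RULES" | "OPERATIONAL" | "EXAMPLES";

// Hypothetical shape of one prediction returned by a role classifier.
interface ClassifierOutput {
  role: DocumentRole;
  confidence: number; // 0..1
}

// Hypothetical container for the two override paths mentioned in the text.
interface RoleOverrides {
  uploadRoles?: DocumentRole[]; // forced by the admin at upload time
  topicRoles?: DocumentRole[]; // per-topic override
}

// Assumed precedence: upload override, then topic override, then classifier.
function resolveRoles(
  predicted: ClassifierOutput[],
  overrides: RoleOverrides = {},
  minConfidence = 0.5, // illustrative threshold, not a documented default
): DocumentRole[] {
  if (overrides.uploadRoles?.length) return overrides.uploadRoles.slice();
  if (overrides.topicRoles?.length) return overrides.topicRoles.slice();
  const roles = predicted
    .filter((p) => p.confidence >= minConfidence)
    .map((p) => p.role);
  // Drop duplicates while keeping the classifier's ordering.
  return roles.filter((role, index) => roles.indexOf(role) === index);
}

// A judgment predicted as both TRUTH and RULES keeps both roles.
const judgmentRoles = resolveRoles([
  { role: "TRUTH", confidence: 0.81 },
  { role: "RULES", confidence: 0.87 },
  { role: "EXAMPLES", confidence: 0.12 },
]);
```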

For the operational guide on roles in retrieval see docs/ingestion-dsl/DOCUMENT_ROLES_GUIDE.md in the internal repo.

TRUTH

Authoritative, narrative, reference content. It is the "default" role for most company documents.

Examples: operational manuals, interpreted regulations, company policies, white papers, books.
Chunker: chunker_truth -- semantic segmentation by paragraphs and sections, alignment with natural boundaries, 15% overlap.
Extracted metadata: title, sections, language, confidence level, optional summary.
In retrieval: high base weight, cited as [N] with preview of the original paragraph.
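
As a rough idea of what paragraph-aligned chunking with overlap means, the sketch below packs whole paragraphs into chunks and repeats trailing paragraphs of the previous chunk up to about 15% of the size budget. It is not the `chunker_truth` implementation: the function name, the character-based budget and the blank-line paragraph splitting are assumptions made for illustration.

```typescript
// Packs whole paragraphs into chunks of roughly maxChars characters.
// The tail of each chunk is repeated at the start of the next one,
// limited to overlapRatio * maxChars characters (whole paragraphs only).
function chunkByParagraphs(
  text: string,
  maxChars = 1000, // illustrative budget, not a documented default
  overlapRatio = 0.15,
): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;

  for (const para of paragraphs) {
    if (current.length > 0 && size + para.length > maxChars) {
      chunks.push(current.join("\n\n"));
      // Carry over trailing paragraphs that fit in the overlap budget.
      const budget = maxChars * overlapRatio;
      const carried: string[] = [];
      let carriedSize = 0;
      for (const prev of current.slice().reverse()) {
        if (carriedSize + prev.length > budget) break;
        carried.unshift(prev);
        carriedSize += prev.length;
      }
      current = carried;
      size = carriedSize;
    }
    current.push(para);
    size += para.length;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Because only whole paragraphs are carried over, the effective overlap is at most 15% and can be zero when the last paragraph of a chunk is longer than the overlap budget.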

FORMAT

Documents that show a structure to reproduce rather than carry authoritative information.

Examples: contract templates, blank forms, schemas, layouts of standard documents.
Chunker: chunker_format -- extracts the structure (fields, sections, placeholders) and indexes it separately from "filler text".
Extracted metadata: list of fields/placeholders, document type, template language.
In retrieval: mainly used by Document Generation to populate new documents. Rarely cited in chat (unless the question is explicitly "how do I fill form X?").

RULES

Prescriptive documents: they establish what can or cannot be done, define obligations, sanctions, deadlines, applicable regulations. The system treats them in a special way because the precision of the regulatory reference is critical.

Examples: judgments (Supreme Court, Courts of Appeal, administrative tribunals), decrees, EU regulations, law articles, orders, administrative decisions, Revenue Agency circulars, internal company regulations with binding effect.
Chunker: chunker_rules -- recognizes the article - paragraph - letter - number structure typical of Italian legislation and EU regulations. Keeps each article as a coherent unit, preserves cross-references (e.g. "pursuant to art. 12 par. 3 letter b") and indexes maxims and dispositives in judgments separately.
Assigned chunk type: ARTICLE, CLAUSE, SECTION, MASSIMA, DISPOSITIVO -- finer granularity than TRUTH.
Extracted metadata:
- documentType -- e.g. JUDGMENT, LEGISLATIVE_DECREE, EU_REGULATION, TAX_CIRCULAR.
- articleNumber, paragraphNumber, letter -- structured identifiers for precise citation.
- dateInForce, dateRepealed -- temporal validity (if detected).
- issuingAuthority -- e.g. "Italian Revenue Agency", "Supreme Court Section I", "EU Parliament".
- citedNorms[] -- other norms cited in the text (links to related RULES chunks).
In retrieval:
- Explicit priority over TRUTH when the question concerns obligations, sanctions, compliance.
- Precise citation: the answer cites art. 5 par. 2 LD 231/2001 not just the document title.
- Automatic validity filter: rules with dateRepealed < today are excluded by default (the filter can be disabled for historical searches).
- Compatibility with external sources: an Italian norm ingested internally can be linked to the version in Legal Sources (Normattiva).
Confidentiality: typically public or internal. Internal judgments (e.g. arbitrations) may be confidential.
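
The automatic validity filter can be pictured as a simple predicate over the `dateRepealed` metadata. The sketch below is only an illustration: the `RulesChunk` shape, the `filterInForce` helper and the ISO `YYYY-MM-DD` date encoding are assumptions, while `dateInForce`, `dateRepealed` and `articleNumber` are the metadata names listed above.

```typescript
// Hypothetical in-memory view of a RULES chunk's temporal metadata.
interface RulesChunk {
  id: string;
  articleNumber?: string;
  dateInForce?: string; // ISO date (YYYY-MM-DD), if detected
  dateRepealed?: string; // ISO date (YYYY-MM-DD), if detected
}

// Excludes chunks with dateRepealed < today unless historical search is on.
function filterInForce(
  chunks: RulesChunk[],
  today: string,
  includeRepealed = false,
): RulesChunk[] {
  if (includeRepealed) return chunks;
  // ISO dates in YYYY-MM-DD form compare correctly as plain strings.
  return chunks.filter(
    (c) => c.dateRepealed === undefined || c.dateRepealed >= today,
  );
}
```

Chunks with no detected `dateRepealed` are kept, which matches the "if detected" caveat on the temporal metadata.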

Practical example

Question: "Can I dismiss a sick employee?" -> The system prefers RULES documents (Civil Code art. 2110, related Cass. Labor Section judgments) over an internal HR circular marked TRUTH. The answer cites art. 2110 c.c. and Cass. n. 12345/2023 with the verbatim dispositive, not a summary.

Multi-role TRUTH + RULES

A judgment has two aspects: the motivation (TRUTH -- legal reasoning, context) and the dispositive + maxim (RULES -- established rule). The classifier assigns both roles and the two chunkers produce distinct chunks that coexist in the same document.

OPERATIONAL

Tabular or structured documents whose data is meaningful row by row rather than as free text.

Examples: price lists, supplier/customer master data, product sheets, balance sheets in tabular format, reconciliations, time sheets, KPI dashboards.
Chunker: chunker_operational -- one row = one chunk. Preserves the row's key-value pairs (e.g. product=Alpha, price=120.00, availability=in stock).
Dedicated sink: sink_operational writes rows to OperationalDataRow (Postgres) enabling structured queries (filter, sum, group) alongside text retrieval.
In retrieval: the planner recognizes aggregative intents ("what is the total revenue?", "how many products under 100 euros?") and routes them to a SQL query over operational data instead of the LLM.
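
The "one row = one chunk" idea and the structured queries it enables can be sketched in a few lines. Everything here is illustrative: the `OperationalRow` type and the helper names are hypothetical, and the aggregation runs in memory, whereas the text describes a SQL query over OperationalDataRow.

```typescript
// Hypothetical in-memory representation of one tabular row.
type OperationalRow = Record<string, string | number>;

// One row becomes one chunk: keys and values are kept together as text.
function rowToChunkText(row: OperationalRow): string {
  return Object.keys(row)
    .map((key) => `${key}=${row[key]}`)
    .join(", ");
}

// Sums a numeric column, ignoring rows where the value is not a number.
function sumColumn(rows: OperationalRow[], column: string): number {
  let total = 0;
  for (const row of rows) {
    const value = row[column];
    if (typeof value === "number") total += value;
  }
  return total;
}

// Counts rows whose numeric column is strictly below a limit.
function countBelow(rows: OperationalRow[], column: string, limit: number): number {
  return rows.filter((row) => {
    const value = row[column];
    return typeof value === "number" && value < limit;
  }).length;
}

const priceList: OperationalRow[] = [
  { product: "Alpha", price: 120, availability: "in stock" },
  { product: "Beta", price: 80, availability: "in stock" },
  { product: "Gamma", price: 95.5, availability: "out of stock" },
];
```

With this sample, "how many products under 100 euros?" maps to `countBelow(priceList, "price", 100)`, which is answered from the structured rows rather than from retrieved text.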

EXAMPLES

Demonstrative documents that illustrate how to apply a concept, procedure or rule.

Examples: case studies, application scenarios, company FAQs, solved exercises, support knowledge bases.
Chunker: chunker_examples -- one Q&A pair or one complete scenario = one chunk. Preserves the integrity of a single example case.
In retrieval: used to enrich the answer with a concrete example when the question allows it (e.g. "I have a case similar to..."). Cited with a dedicated sourceType.

Collections (Document Groups / "Raccolte")

Collections (DocumentGroup, "Raccolte" in the IT UI) are named sets of documents that share a common short-to-medium-term purpose -- a case, project, customer, dossier, audit, migration. They are designed to organize ingestion and consultation across Topics, without having to create new Topics for every initiative.

Difference with Topics

Aspect	Topic	Collection
Purpose	Stable taxonomy (e.g. "Contracts", "HR", "Legal")	Temporary or working aggregations (e.g. "Rossi Case 2026", "Q1 Audit")
Typical cardinality	3-15 per company	Tens or hundreds; created and retired as work requires
Lifetime	Long, rarely deleted	OPEN -> CLOSED (archived)
Permissions	Fine-grained RBAC per role	Inherits ACL from contained documents
AI customization	Dedicated system prompt	No dedicated prompt
Identified in chat	Topic selector at the top	`@collection` mention filter in input
DB model	`Topic` (with `ragPresetId`, `systemPrompt`)	`DocumentGroup` (with `topicId`, `status`)

In practice: a document belongs to one or more Topics (stable category) and optionally to one or more Collections (working aggregation). Topic answers "what kind of document is it?", Collection answers "for which case/project was it uploaded?".

Collection status

A Collection has a status field with two values:

Status	Meaning	Effect in chat
`OPEN`	Active work	Included in `@collection` suggestions, priority retrieval when filtered
`CLOSED`	Work closed, archive	Excluded from suggestions, accessible only when explicitly recalled

Closing a Collection does not delete the documents: they remain indexed and searchable but exit the user's "working set".

Collections page (`/raccolte`)

Admins will find a dedicated CRUD page under Ingestion > Collections:

Collections table: name, associated Topic, number of documents, status, owner, last modified, actions.
Toolbar: search by name, filter by company (SYSTEM_ADMIN only), + New collection button.
Detail Sheet (on row click): header with metadata + nested documents table (filename, topic, status, "Remove from collection" action) + "Add documents" footer with multi-select picker.

Row menu actions: Open, Rename, Change status (OPEN/CLOSED), Delete.

When to assign a document to a Collection

Three entry points:

Single wizard (Confirm step) -- "Collection" combobox -- existing or + New.
Bulk profile (step 2) -- same combobox, applied to all files in the run; Path-rules overrides do not touch the Collection (it's a single value per job).
Post-bulk Review Queue -- inline editable column, or bulk-edit bar to assign many rows together.

A document can belong to multiple Collections at the same time (many-to-many relation). Frequent for reused documents (e.g. a framework contract used both in "Rossi Case 2026" and "Bianchi Case 2026").

Chat usage

In chat the user can:

Type @collection to trigger autocomplete of accessible OPEN Collections: the conversation gets filtered to those documents.
Combine Topic + Collection: e.g. Topic = Contracts, Collection = Rossi Case 2026 -> retrieval restricted to contracts of that case.
Ask Collection-wide summaries: e.g. "summarize the status of Rossi Case 2026" -> the system builds a synthesis over the Collection's documents only.
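
The autocomplete behavior behind `@collection` can be sketched as a filter over the Collections the user may access. This is an assumption-laden illustration: `CollectionSummary`, `suggestCollections` and the case-insensitive substring match are hypothetical, and access control is assumed to have been applied before the list reaches this function.

```typescript
type CollectionStatus = "OPEN" | "CLOSED";

// Hypothetical summary of a Collection as seen by the chat input.
interface CollectionSummary {
  id: string;
  name: string;
  status: CollectionStatus;
}

// Returns OPEN Collections whose name contains the text typed after "@".
function suggestCollections(
  accessible: CollectionSummary[],
  typed: string,
): CollectionSummary[] {
  const needle = typed.replace(/^@/, "").toLowerCase();
  return accessible.filter(
    (c) => c.status === "OPEN" && c.name.toLowerCase().includes(needle),
  );
}

const accessibleCollections: CollectionSummary[] = [
  { id: "g1", name: "Rossi Case 2026", status: "OPEN" },
  { id: "g2", name: "Bianchi Case 2026", status: "OPEN" },
  { id: "g3", name: "Rossi Case 2024", status: "CLOSED" },
];
```

Typing `@ros` against this sample suggests only "Rossi Case 2026": the 2024 Collection is CLOSED and therefore out of the working set, although its documents stay searchable when explicitly recalled.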

Typical lifecycle

Case opened
   |
   v
+ New collection "Rossi Case 2026"
   |
   v
Bulk ingest 200 files (customer + correspondence + acts)
   |   Bulk profile -> Collection = "Rossi Case 2026"
   v
Review Queue: fix topic on ~10 files
   |
   v
Active work: chat filtered @collection-rossi-case
   |
   v
Case closed
   |
   v
status -> CLOSED (searchable archive, out of working set)

API endpoints

Method	Endpoint	Purpose
`GET`	`/api/document-groups`	List Collections (filters by status, search, owner)
`POST`	`/api/document-groups`	Create Collection
`PATCH`	`/api/document-groups/:id`	Rename, change status, associated Topic
`DELETE`	`/api/document-groups/:id`	Delete (documents stay, only lose the Collection link)
`POST`	`/api/document-groups/:id/documents`	Add documents
`DELETE`	`/api/document-groups/:id/documents/:docId`	Remove a document
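
A typed client for these endpoints might start from small request builders like the ones below. The paths and HTTP methods are taken from the table; the `ApiCall` type, the helper names and the body fields (`name`, `documentIds`) are assumptions, since the request schemas are not documented here. `topicId` and `status` mirror the `DocumentGroup` fields mentioned earlier. The builders only describe a request and do not send it.

```typescript
// Hypothetical description of an HTTP call, independent of any HTTP client.
interface ApiCall {
  method: "GET" | "POST" | "PATCH" | "DELETE";
  path: string;
  body?: unknown;
}

const DOCUMENT_GROUPS_BASE = "/api/document-groups";

// Body field names are illustrative; check the real API schema before use.
function createCollectionCall(name: string, topicId: string): ApiCall {
  return { method: "POST", path: DOCUMENT_GROUPS_BASE, body: { name, topicId } };
}

function closeCollectionCall(id: string): ApiCall {
  return {
    method: "PATCH",
    path: `${DOCUMENT_GROUPS_BASE}/${encodeURIComponent(id)}`,
    body: { status: "CLOSED" },
  };
}

function addDocumentsCall(id: string, documentIds: string[]): ApiCall {
  return {
    method: "POST",
    path: `${DOCUMENT_GROUPS_BASE}/${encodeURIComponent(id)}/documents`,
    body: { documentIds },
  };
}

function removeDocumentCall(id: string, docId: string): ApiCall {
  return {
    method: "DELETE",
    path: `${DOCUMENT_GROUPS_BASE}/${encodeURIComponent(id)}/documents/${encodeURIComponent(docId)}`,
  };
}
```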

Feature flag

The sidebar shows Collections only if the tenant has the feature active:

```bash
ENABLE_DOCUMENT_GROUPS=true
```

When the flag is disabled, the API stays functional (existing data is not lost), but the nav entries and the Collection fields in wizards / review disappear.

Best practices

One Collection = one case / project -- avoid using Collections for further taxonomy (that's what Topics are for).
Close concluded Collections -- an active user's working set should have 5-20 OPEN Collections, not hundreds.
Descriptive names -- Rossi Case 2026 is better than R-2026-01. The chat uses the name as a hint for the planner.
Collections don't replace Topics -- a Collection without a Topic is an orphan: always assign the appropriate Topic at ingestion time.

Anatomy of an INGESTION pipeline

+--------+    +--------+    +--------------+    +----------+    +----------+
| source | -> | parser | -> |role_classify | -> | chunker  | -> | embedder |
+--------+    +--------+    +--------------+    +----------+    +----------+
                                                                     |
                                                                     v
                                              +---------+--------+-------------+
                                              | qdrant  | postgres| operational |
                                              | chunks  | document| data        |
                                              +---------+--------+-------------+
                                                              |
                                                              v
                                                    +------------------+
                                                    |  record manager  |
                                                    |  (cleanup state) |
                                                    +------------------+

Source nodes

source_upload -- from user upload (web form, drag&drop, API).
source_inchat -- documents attached to a chat (temporary).
source_scrape -- web pages from URL.
source_ocr -- images with OCR (GLM-OCR via vLLM 8004).
(V2) source_email, source_gdrive, source_sharepoint -- structure ready, implementation upcoming.

Parser

parser_auto -- automatic router by MIME type.
parser_docling -- for PDF/DOCX/PPTX/XLSX/HTML/images. Extracts sections, tables, reading order.
parser_unstructured -- for EML/MSG/EPUB/ODT/RTF, audio (V2 transcripts).

Both delegate to the ingestion-bridge microservice (a Python FastAPI service running in Docker).
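
The routing done by parser_auto can be approximated by a lookup on the MIME type. The sketch below is hypothetical: the `routeParser` function and the `null` result for unsupported types are assumptions, and the MIME strings are the standard registered types for the formats listed above, not values taken from the product.

```typescript
type ParserNode = "parser_docling" | "parser_unstructured";

// Standard MIME types for the formats listed under parser_docling.
const DOCLING_MIME_TYPES: string[] = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", // DOCX
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", // PPTX
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", // XLSX
  "text/html",
];

// Standard MIME types for the formats listed under parser_unstructured.
const UNSTRUCTURED_MIME_TYPES: string[] = [
  "message/rfc822", // EML
  "application/vnd.ms-outlook", // MSG
  "application/epub+zip", // EPUB
  "application/vnd.oasis.opendocument.text", // ODT
  "application/rtf", // RTF
];

// Returns the parser node for a MIME type, or null if none applies.
function routeParser(mimeType: string): ParserNode | null {
  // Drop parameters such as "; charset=utf-8" before matching.
  const mime = mimeType.split(";")[0].trim().toLowerCase();
  if (mime.startsWith("image/")) return "parser_docling";
  if (DOCLING_MIME_TYPES.indexOf(mime) !== -1) return "parser_docling";
  if (UNSTRUCTURED_MIME_TYPES.indexOf(mime) !== -1) return "parser_unstructured";
  return null;
}
```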

Classifier and chunker

role_classifier -- LLM-based, outputs one or more roles + confidence.
chunker_truth, chunker_format, chunker_rules, chunker_operational, chunker_examples -- each with domain presets (legal, medical, financial, HR, ...).

Metadata extractors

Specialized LLM nodes: extract_dates, extract_entities, extract_summary, extract_topics, extract_grade, extract_confidentiality. Run in parallel where possible.

Sinks

sink_qdrant -- vectorizes with bge-m3 and writes chunks.
sink_metadata -- updates Document on Postgres.
sink_operational -- for OPERATIONAL role, writes tabular rows to OperationalDataRow.
sink_record_manager -- tracks content and identity hashes for incremental cleanup.
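
To make the "pipeline as a canvas" idea concrete, the sketch below models a canvas as nodes plus edges and derives an execution order with a topological sort. The `IngestionCanvas` schema, the edge layout and the `executionOrder` helper are assumptions for illustration only and do not reflect the real canvas format. The node names are the ones listed above, and embedding is assumed to happen inside sink_qdrant, as the Sinks list suggests.

```typescript
// Hypothetical canvas schema: the real DSL format is not documented here.
interface CanvasNode { id: string; }
interface CanvasEdge { from: string; to: string; }
interface IngestionCanvas {
  purpose: "INGESTION";
  nodes: CanvasNode[];
  edges: CanvasEdge[];
}

const ingestionCanvas: IngestionCanvas = {
  purpose: "INGESTION",
  nodes: [
    { id: "source_upload" }, { id: "parser_auto" }, { id: "role_classifier" },
    { id: "chunker_truth" }, { id: "sink_qdrant" }, { id: "sink_metadata" },
    { id: "sink_record_manager" },
  ],
  edges: [
    { from: "source_upload", to: "parser_auto" },
    { from: "parser_auto", to: "role_classifier" },
    { from: "role_classifier", to: "chunker_truth" },
    { from: "chunker_truth", to: "sink_qdrant" },
    { from: "chunker_truth", to: "sink_metadata" },
    { from: "sink_qdrant", to: "sink_record_manager" },
    { from: "sink_metadata", to: "sink_record_manager" },
  ],
};

// Kahn's algorithm: a node runs once all of its upstream nodes have run.
function executionOrder(canvas: IngestionCanvas): string[] {
  const indegree = new Map<string, number>();
  for (const node of canvas.nodes) indegree.set(node.id, 0);
  for (const edge of canvas.edges) {
    indegree.set(edge.to, (indegree.get(edge.to) ?? 0) + 1);
  }
  const ready = canvas.nodes
    .filter((node) => indegree.get(node.id) === 0)
    .map((node) => node.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const edge of canvas.edges) {
      if (edge.from !== id) continue;
      const remaining = (indegree.get(edge.to) ?? 0) - 1;
      indegree.set(edge.to, remaining);
      if (remaining === 0) ready.push(edge.to);
    }
  }
  if (order.length !== canvas.nodes.length) throw new Error("canvas contains a cycle");
  return order;
}
```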

Ingestion modes

To manage migration and A/B testing, every company has a Company.ingestionMode field:

Mode	Behavior
`LEGACY`	Old pipeline (no role, single parser). Default for companies created before v3.5.0
`SHADOW`	LEGACY in foreground, V2 in parallel. Generates a comparison report, no user impact
`V2`	Full cutover to canvas DSL

Recommended migration

Set SHADOW on a pilot tenant.
Watch the shadow_comparison_report for ~1 week (precision, recall, latency).
Move to V2 when the numbers are on par or better.
Remember: changing mode does not re-vectorize existing documents. To apply the new chunkers to old documents you need an explicit reingest.

Mode change via SQL:

```sql
UPDATE "Company" SET "ingestionMode" = 'V2' WHERE id = '<companyId>';
```

Or from UI: Company settings > Ingestion > Mode.

Wizard, Bulk and Path-rules

The Ingestion Hub (/documents/upload) is the starting point to load documents into Queria. There are two intent modes, both backed by the same DSL pipeline: only the UX flow changes.

Single mode -- the 3-step AI Wizard

Designed for 1-10 manually uploaded files. The page mounts the UploadWizardPage (3 steps) with AI-assisted classification:

Upload -- drag & drop or file selection. Source: local, cloud, network, URL.
Classify -- for each file the system proposes a "suggested role" card (e.g. RULES -- score 0.87) with predicted topic, sector and language. The user confirms or corrects. Multi-role: multiple roles can be selected when the document is hybrid (e.g. policy + checklist -> TRUTH + RULES).
Confirm -- summary, visibility choice, ingestion job kickoff.

The wizard is the only mode that explicitly shows the classifier confidence score before launch. Useful when individual documents are critical (legal, healthcare) and pre-launch review is acceptable.

When to use it

1 to about 10 files. Small-volume companies start here by default. Above 50 files the system automatically suggests Bulk mode.

Bulk mode -- for cloud, network, URL lists

Designed to ingest dozens or thousands of files from a common source. Three steps:

Source selection -- choose one of the four cards: File (mass drag&drop), Cloud (Drive / S3 / Azure / OneDrive), Network (SMB / NFS), URL list (CSV with N URLs).
Ingestion profile -- a single form with defaults applied to all selected files:
- Topic (multi-select, required if Visibility is "Only assigned topics").
- Collection (combobox -- existing or "+ New").
- Document role -- TRUTH / FORMAT / RULES / OPERATIONAL / EXAMPLES / Auto AI (default).
- Sector -- preset list or Auto AI.
- Language -- auto-detect or ISO.
- Visibility -- All / Only assigned topics / Admins only.
- Path-rules (advanced, accordion closed by default — see below).
Job in progress -- LiveJobProgress via SSE: per-file status (queued / parsed / vectorized / failed). As soon as one file is complete you can enter the Review Queue without waiting for the job to finish.

When to use it

When importing from an entire source (e.g. a whole Drive folder, an SMB mount, a JSONL dump). From 50 files up the system prefers this mode.

Path-rules

Path-rules are per-file overrides defined inside the bulk ingestion profile. Each rule is a row { glob_pattern, override_topic, override_role }.

Examples:

Pattern	Override topic	Override role
`gdrive:/Legal/Judgments/**/*.pdf`	`Case law`	`RULES`
`gdrive:/Legal/**/*.pdf`	`Contracts`	`TRUTH`
`smb://share/PriceLists/**/*.xlsx`	`Price lists`	`OPERATIONAL`
`**/FAQ.md`	`Help center`	`EXAMPLES`

Rules are evaluated top-down per file at job execution time. The first match wins; if none matches, profile defaults apply (including Auto AI for topic/role/sector).
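
First-match-wins evaluation can be sketched as below. This is not the engine's matcher: the glob dialect (`**/` spans any number of folders, `*` stays within one path segment), the `globToRegExp` and `resolveOverrides` helpers and the `ProfileDefaults` shape are assumptions. The rule fields reuse the `glob_pattern`, `override_topic` and `override_role` names from the text.

```typescript
interface PathRule {
  glob_pattern: string;
  override_topic?: string;
  override_role?: string;
}

// Hypothetical subset of the bulk profile defaults.
interface ProfileDefaults {
  topic: string;
  role: string;
}

// Assumed glob dialect: "**/" = any folders, "**" = anything, "*" = one segment.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*";
      i += 2;
    } else if (glob[i] === "*") {
      source += "[^/]*";
      i += 1;
    } else {
      // Escape characters that are special in regular expressions.
      source += glob[i].replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Rules are evaluated top-down; the first match wins, else defaults apply.
function resolveOverrides(
  path: string,
  rules: PathRule[],
  defaults: ProfileDefaults,
): ProfileDefaults {
  for (const rule of rules) {
    if (globToRegExp(rule.glob_pattern).test(path)) {
      return {
        topic: rule.override_topic ?? defaults.topic,
        role: rule.override_role ?? defaults.role,
      };
    }
  }
  return defaults;
}
```

Because the first match wins, a broad pattern placed above a narrower one shadows it completely, which is why specific rules belong at the top of the list.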

When to use them:

Cloud or network sources with a folder structure that already encodes the company taxonomy.
One-shot migrations where you want to preserve a path -> topic mapping without correcting each file in Review.
Mixed sources where different sub-paths should land in different topics/roles.

Best practice:

Order rules from most specific to most generic (first one wins).
Document rules in the Collection name or topic changelog: future readers must understand the rationale.
For a few isolated files, don't write rules: better to edit afterwards in the Review Queue (see below).

Review Queue (post-bulk)

After a bulk job, the Review Queue (/documents/upload?mode=bulk&job=<id>) shows all files of the run in a DataTable:

Column	Editable
Selection checkbox	--
Filename	preview on click (side Sheet)
Status	success / warning / error
Topic	inline (multi-select)
Collection	inline
Sector	inline
Visibility	inline
AI confidence	progress 0-100 (filterable)

Quick toolbar:

Search-by-filename (300ms debounce).
3 preset chips: Sector = OTHER, AI confidence < 70%, Missing topic -- to quickly catch critical files.
CSV export of the selection.
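
The three preset chips amount to predicates over the rows of the run. The sketch below is illustrative only: the `ReviewRow` shape, the `applyChips` helper and the choice to combine active chips with AND are assumptions. The chip labels and thresholds are the ones listed above.

```typescript
// Hypothetical row of the Review Queue table.
interface ReviewRow {
  filename: string;
  topics: string[];
  sector: string;
  aiConfidence: number; // 0-100, as in the "AI confidence" column
}

type ChipName = "Sector = OTHER" | "AI confidence < 70%" | "Missing topic";

const presetChips: Record<ChipName, (row: ReviewRow) => boolean> = {
  "Sector = OTHER": (row) => row.sector === "OTHER",
  "AI confidence < 70%": (row) => row.aiConfidence < 70,
  "Missing topic": (row) => row.topics.length === 0,
};

// Keeps rows that satisfy every active chip (AND semantics are assumed).
function applyChips(rows: ReviewRow[], active: ChipName[]): ReviewRow[] {
  return rows.filter((row) => active.every((chip) => presetChips[chip](row)));
}
```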

Bulk-edit bar: appears when you select >= 1 row. Lets you apply topic / sector / visibility / collection to all selected files in a single PATCH (/api/documents/bulk-update). Optimistic UI with Sonner error toast on failure.

The ingestion job does not wait for review: files flow through the pipeline as they are uploaded. The Review Queue is for correcting, after the fact, the topic/role/sector of files that had low confidence or required a fallback. Changes to these metadata fields do not require a reingest (they are Qdrant payload + Postgres updates).

Reingest

To apply pipeline changes to already vectorized documents:

Single document: page Documents > [doc] > Reingest.
Bulk: script src/scripts/reingest-company-onebyone.ts (template available).
Payload backfill (no reingest, metadata only): idempotent scripts backfill-qdrant-*.ts to stamp new fields (sector, role, grade, aisummary) without re-embedding.

Status and monitoring

Every ingestion job produces:

IngestionJob on Postgres with state (RUNNING, COMPLETED, FAILED, RESUMED).
PipelineSnapshot for crash resume.
IngestionRecord (RecordManager) for deterministic cleanup.
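
The general technique behind hash-based cleanup can be sketched as a diff between what was recorded on the previous run and what the current run produced. This is a generic illustration, not the RecordManager's documented behavior for any of its modes: the `TrackedRecord` and `CleanupPlan` shapes and the `planIncrementalCleanup` function are hypothetical, and the hashes are passed in as precomputed strings.

```typescript
// Hypothetical record: identity says "which chunk", content says "which version".
interface TrackedRecord {
  identityHash: string;
  contentHash: string;
}

interface CleanupPlan {
  toWrite: string[]; // new or changed chunks (by identity hash)
  toDelete: string[]; // chunks no longer produced by the source
  unchanged: string[]; // chunks that can be skipped
}

function planIncrementalCleanup(
  previous: TrackedRecord[],
  incoming: TrackedRecord[],
): CleanupPlan {
  const before = new Map<string, string>();
  for (const record of previous) before.set(record.identityHash, record.contentHash);

  const plan: CleanupPlan = { toWrite: [], toDelete: [], unchanged: [] };
  const seen = new Set<string>();

  for (const record of incoming) {
    seen.add(record.identityHash);
    if (before.get(record.identityHash) === record.contentHash) {
      plan.unchanged.push(record.identityHash);
    } else {
      plan.toWrite.push(record.identityHash);
    }
  }
  for (const record of previous) {
    if (!seen.has(record.identityHash)) plan.toDelete.push(record.identityHash);
  }
  return plan;
}
```

The plan is deterministic for a given pair of inputs, so re-running the same ingestion produces no writes and no deletions.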

Admin dashboard: Ingestion > Live dashboard shows current jobs, queues, errors, latency per node.

Composability per tenant

An admin can:

Add custom extractors -- e.g. a node that extracts the Italian ATECO code from company documents.
Disable classifier -- forcing everything as TRUTH for simple cases.
Change embedder -- e.g. for advanced multilingual support (V2).
Add custom sinks -- e.g. mirror to an external data lake.

All without modifying the backend: just duplicate the ingestion canvas and edit it.

Preset migration

Legacy presets had chunking, retrieval, reranker, etc. sections. Today they only keep chunking (chunkSize, overlapPercentage). Everything else is in the canvas. See Slim RAG Presets for the new preset schema.

V1 limits

Audio/video parser deferred to V2.
Email/SharePoint/GDrive connectors not yet active in canvases (architecture ready, implementation upcoming).
Graph ingestion (entity extraction + community detection) permanently excluded in V1 for high LLM cost.
OCR on low-quality scans remains less accurate than Docling's native parsing; for heavily scanned archives, consider improving OCR upstream.

Queria v3.5.0 -- Canvas-native Ingestion DSL
