Ingestion DSL
As of v3.5.0, document ingestion is also canvas-native. The pipeline that takes a user-uploaded PDF all the way to vectorized chunks in Qdrant is no longer an imperative TypeScript service: it is a canvas with purpose = INGESTION, executed by the DSL engine.
This enables advanced per-tenant customizations, role-aware ingestion and a gradual migration from legacy pipelines.
What changes
| Aspect | LEGACY (v3.1.x) | DSL V2 (v3.5.0+) |
|---|---|---|
| Parser | Single PDF pipeline | MIME auto-router, Docling + Unstructured as peers |
| Chunking | 14 domain-specific chunkers in code | 5 role-based chunkers + configurable domain presets |
| Document roles | Not modeled | TRUTH / FORMAT / RULES / OPERATIONAL / EXAMPLES |
| Cleanup | Manual re-embedding | RecordManager with 4 modes (UPSERT, FULL, INCREMENTAL, SCOPED) |
| Resume | No | Pipeline snapshot + checkpoint |
| Multi-sink | Only Qdrant + Postgres | Qdrant + Postgres + OperationalData + RecordManager as nodes |
| Connectors | Only upload + scrape | V2-ready for email, cloud storage, SharePoint |
Document roles
Every ingested document is classified into one or more roles, which determine how the pipeline treats it:
| Role | Typical examples | Treatment |
|---|---|---|
| TRUTH | Manuals, regulations, policies, books | Semantic chunking by paragraph/article |
| FORMAT | Templates, forms, schemas | Structure extraction + placeholders |
| RULES | Decisions, regulations, judgments | Chunking by article/clause with citations |
| OPERATIONAL | Price lists, tables, structured data | Row-level extraction, OperationalData sink |
| EXAMPLES | Case studies, scenarios, FAQs | Chunking by Q&A pair or scenario block |
Classification is automatic (role_classifier node) but can be forced by the admin at upload time or via per-topic override. A document can have multiple roles (e.g. a judgment is TRUTH + RULES).
For the operational guide on roles in retrieval see docs/ingestion-dsl/DOCUMENT_ROLES_GUIDE.md in the internal repo.
TRUTH
Authoritative, narrative, reference content. It is the "default" role for most company documents.
- Examples: operational manuals, interpreted regulations, company policies, white papers, books.
- Chunker: chunker_truth -- semantic segmentation by paragraphs and sections, alignment with natural boundaries, 15% overlap.
- Extracted metadata: title, sections, language, confidence level, optional summary.
- In retrieval: high base weight, cited as [N] with preview of the original paragraph.
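To make the idea of paragraph-aligned chunks with overlap concrete, here is a minimal sketch. It is not the chunker_truth implementation: the function name chunkParagraphs, the character-based size budget and the carry-over strategy are assumptions chosen for illustration.

```typescript
// Hypothetical helper: groups whole paragraphs into chunks of at most
// `maxChars` characters. Trailing paragraphs that fit within
// `overlapRatio` of the budget are repeated at the start of the next chunk.
function chunkParagraphs(
  paragraphs: string[],
  maxChars: number,
  overlapRatio = 0.15,
): string[][] {
  const overlapBudget = Math.floor(maxChars * overlapRatio);
  const chunks: string[][] = [];
  let current: string[] = [];
  let currentSize = 0;
  let carried = 0; // paragraphs in `current` that are overlap only

  for (const paragraph of paragraphs) {
    const hasNewContent = current.length > carried;
    if (hasNewContent && currentSize + paragraph.length > maxChars) {
      chunks.push(current);
      // Walk backwards and keep whole paragraphs while they fit the overlap budget.
      const tail: string[] = [];
      let tailSize = 0;
      for (let i = current.length - 1; i >= 0; i--) {
        if (tailSize + current[i].length > overlapBudget) break;
        tail.unshift(current[i]);
        tailSize += current[i].length;
      }
      current = tail;
      currentSize = tailSize;
      carried = tail.length;
    }
    current.push(paragraph);
    currentSize += paragraph.length;
  }
  // Emit the last chunk only if it holds something beyond the carried overlap.
  if (current.length > carried) chunks.push(current);
  return chunks;
}
```

Because boundaries always fall between paragraphs, the effective overlap is at most 15% of the budget and can be zero when the last paragraph of a chunk is longer than the overlap budget.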
FORMAT
Documents that show a structure to reproduce rather than carry authoritative information.
- Examples: contract templates, blank forms, schemas, layouts of standard documents.
- Chunker: chunker_format -- extracts the structure (fields, sections, placeholders) and indexes it separately from "filler text".
- Extracted metadata: list of fields/placeholders, document type, template language.
- In retrieval: mainly used by Document Generation to populate new documents. Rarely cited in chat (unless the question is explicitly "how do I fill form X?").
RULES
Prescriptive documents: they establish what can or cannot be done, define obligations, sanctions, deadlines, applicable regulations. The system treats them in a special way because the precision of the regulatory reference is critical.
- Examples: judgments (Supreme Court, Courts of Appeal, administrative tribunals), decrees, EU regulations, law articles, orders, administrative decisions, Revenue Agency circulars, internal company regulations with binding effect.
- Chunker: chunker_rules -- recognizes the article - paragraph - letter - number structure typical of Italian legislation and EU regulations. Keeps each article as a coherent unit, preserves cross-references (e.g. "pursuant to art. 12 par. 3 letter b") and indexes maxims and dispositives in judgments separately.
- Assigned chunk type: ARTICLE, CLAUSE, SECTION, MASSIMA, DISPOSITIVO -- finer granularity than TRUTH.
- Extracted metadata:
  - documentType -- e.g. JUDGMENT, LEGISLATIVE_DECREE, EU_REGULATION, TAX_CIRCULAR.
  - articleNumber, paragraphNumber, letter -- structured identifiers for precise citation.
  - dateInForce, dateRepealed -- temporal validity (if detected).
  - issuingAuthority -- e.g. "Italian Revenue Agency", "Supreme Court Section I", "EU Parliament".
  - citedNorms[] -- other norms cited in the text (links to related RULES chunks).
- In retrieval:
  - Explicit priority over TRUTH when the question concerns obligations, sanctions, compliance.
  - Precise citation: the answer cites art. 5 par. 2 LD 231/2001, not just the document title.
  - Automatic validity filter: rules with dateRepealed < today are excluded by default (the filter can be disabled for historical searches).
  - Compatibility with external sources: an Italian norm ingested internally can be linked to the version in Legal Sources (Normattiva).
- Confidentiality: typically public or internal. Internal judgments (e.g. arbitrations) may be confidential.
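The metadata fields and the validity filter described above can be sketched as follows. The field names come from the list above; the interface shape, the ISO date strings, the function names and the normLabel parameter are assumptions for illustration, not the actual schema.

```typescript
// Assumed shape: field names follow the RULES metadata list, types are guesses.
interface RulesChunkMeta {
  documentType: string; // e.g. "LEGISLATIVE_DECREE"
  articleNumber?: string;
  paragraphNumber?: string;
  letter?: string;
  dateInForce?: string; // ISO date (YYYY-MM-DD), if detected
  dateRepealed?: string; // ISO date (YYYY-MM-DD), absent when still in force
  issuingAuthority?: string;
  citedNorms: string[];
}

// Drops chunks with dateRepealed < today unless the filter is disabled.
// ISO dates in YYYY-MM-DD form compare correctly as plain strings.
function filterInForce(
  chunks: RulesChunkMeta[],
  today: string,
  includeRepealed = false,
): RulesChunkMeta[] {
  if (includeRepealed) return chunks;
  return chunks.filter(
    (chunk) => chunk.dateRepealed === undefined || chunk.dateRepealed >= today,
  );
}

// Builds a precise citation such as "art. 5 par. 2 LD 231/2001".
// `normLabel` (the short name of the norm) is a hypothetical extra input.
function formatCitation(meta: RulesChunkMeta, normLabel: string): string {
  const parts: string[] = [];
  if (meta.articleNumber) parts.push("art. " + meta.articleNumber);
  if (meta.paragraphNumber) parts.push("par. " + meta.paragraphNumber);
  if (meta.letter) parts.push("letter " + meta.letter);
  parts.push(normLabel);
  return parts.join(" ");
}
```

Passing includeRepealed = true corresponds to disabling the filter for historical searches.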
Practical example
Question: "Can I dismiss a sick employee?" -> The system prefers RULES documents (Civil Code art. 2110, related Cass. Labor Section judgments) over an internal HR circular marked TRUTH. The answer cites art. 2110 c.c. and Cass. n. 12345/2023 with the verbatim dispositive, not a summary.
Multi-role TRUTH + RULES
A judgment has two aspects: the motivation (TRUTH -- legal reasoning, context) and the dispositive + maxim (RULES -- established rule). The classifier assigns both roles and the two chunkers produce distinct chunks that coexist in the same document.
OPERATIONAL
Tabular or structured documents whose data is meaningful row by row rather than as free text.
- Examples: price lists, supplier/customer master data, product sheets, balance sheets in tabular format, reconciliations, time sheets, KPI dashboards.
- Chunker: chunker_operational -- one row = one chunk. Preserves the key-value pairs of the row (e.g. product=Alpha, price=120.00, availability=in stock).
- Dedicated sink: sink_operational writes rows to OperationalDataRow (Postgres), enabling structured queries (filter, sum, group) alongside text retrieval.
- In retrieval: the planner recognizes aggregative intents ("what is the total revenue?", "how many products under 100 euros?") and routes them to a SQL query over operational data instead of the LLM.
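As a rough illustration of "one row = one chunk" and of why aggregative questions are better answered by a structured query, consider the sketch below. The type and function names are hypothetical, and the in-memory helpers only stand in for the SQL that would run over OperationalDataRow.

```typescript
// Hypothetical row shape: column name -> cell value.
type OperationalRow = Record<string, string | number>;

// Serializes one row as key=value pairs, in column order, to form the chunk text.
function rowToChunkText(row: OperationalRow): string {
  return Object.keys(row)
    .map((key) => `${key}=${row[key]}`)
    .join(", ");
}

// Collects the numeric cells of one column, skipping text cells.
function numericValues(rows: OperationalRow[], column: string): number[] {
  const values: number[] = [];
  for (const row of rows) {
    const value = row[column];
    if (typeof value === "number") values.push(value);
  }
  return values;
}

// Stand-in for a SQL SUM over one column.
function sumColumn(rows: OperationalRow[], column: string): number {
  return numericValues(rows, column).reduce((total, value) => total + value, 0);
}

// Stand-in for a SQL COUNT with a WHERE clause on one column.
function countWhere(
  rows: OperationalRow[],
  column: string,
  predicate: (value: number) => boolean,
): number {
  return numericValues(rows, column).filter(predicate).length;
}
```

A question like "how many products under 100 euros?" then maps to countWhere(rows, "price", (p) => p < 100), an exact computation that does not depend on which chunks the retriever happens to return.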
EXAMPLES
Demonstrative documents that illustrate how to apply a concept, procedure or rule.
- Examples: case studies, application scenarios, company FAQs, solved exercises, support knowledge bases.
- Chunker: chunker_examples -- one Q&A pair or one complete scenario = one chunk. Preserves the integrity of a single example case.
- In retrieval: used to enrich the answer with a concrete example when the question allows it (e.g. "I have a case similar to..."). Cited with a dedicated sourceType.
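The "one Q&A pair = one chunk" rule can be illustrated with a small sketch. It assumes a FAQ written with "Q:" and "A:" line prefixes, which is only one possible input format; the names QaChunk and chunkFaq are hypothetical and this is not the chunker_examples implementation.

```typescript
interface QaChunk {
  question: string;
  answer: string;
}

// Splits a FAQ text into one chunk per question/answer pair.
// Assumes each question starts with "Q:" and each answer with "A:";
// answer continuation lines are appended until the next "Q:".
function chunkFaq(text: string): QaChunk[] {
  const chunks: QaChunk[] = [];
  let question = "";
  let answerLines: string[] = [];
  let inAnswer = false;

  const flush = (): void => {
    if (question !== "" && answerLines.length > 0) {
      chunks.push({ question, answer: answerLines.join(" ").trim() });
    }
    question = "";
    answerLines = [];
    inAnswer = false;
  };

  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line.indexOf("Q:") === 0) {
      flush(); // close the previous pair before starting a new one
      question = line.slice(2).trim();
    } else if (line.indexOf("A:") === 0) {
      inAnswer = true;
      answerLines.push(line.slice(2).trim());
    } else if (inAnswer && line !== "") {
      answerLines.push(line);
    }
  }
  flush();
  return chunks;
}
```

Keeping question and answer together means a retrieved chunk is always a complete example, never an answer detached from its question.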
Collections (Document Groups / "Raccolte")
Collections (DocumentGroup, "Raccolte" in the IT UI) are named sets of documents that share a common short-to-medium-term purpose -- a case, project, customer, dossier, audit, migration. They are designed to organize ingestion and consultation across Topics, without having to create new Topics for every initiative.
Difference from Topics
| Aspect | Topic | Collection |
|---|---|---|
| Purpose | Stable taxonomy (e.g. "Contracts", "HR", "Legal") | Temporary or working aggregations (e.g. "Rossi Case 2026", "Q1 Audit") |
| Typical cardinality | 3-15 per company | Tens or hundreds, created and retired as needed |
| Lifetime | Long, rarely deleted | Open -> Closed -> archived |
| Permissions | Fine-grained RBAC per role | Inherits ACL from contained documents |
| AI customization | Dedicated system prompt | No dedicated prompt |
| Identified in chat | Topic selector at the top | @collection mention filter in input |
| DB model | Topic (with ragPresetId, systemPrompt) | DocumentGroup (with topicId, status) |
In practice: a document belongs to one or more Topics (stable category) and optionally to one or more Collections (working aggregation). Topic answers "what kind of document is it?", Collection answers "for which case/project was it uploaded?".
Collection status
A Collection has a status field with two values:
| Status | Meaning | Effect in chat |
|---|---|---|
| OPEN | Active work | Included in @collection suggestions, priority retrieval when filtered |
| CLOSED | Work closed, archive | Excluded from suggestions, accessible only when explicitly recalled |
Closing a Collection does not delete the documents: they remain indexed and searchable but exit the user's "working set".
Collections page (/raccolte)
The admin finds in Ingestion > Collections a dedicated CRUD page:
- Collections table: name, associated Topic, number of documents, status, owner, last modified, actions.
- Toolbar: search by name, filter by company (SYSTEM_ADMIN only), + New collection button.
- Detail Sheet (on row click): header with metadata + nested documents table (filename, topic, status, "Remove from collection" action) + "Add documents" footer with multi-select picker.
Row menu actions: Open, Rename, Change status (OPEN/CLOSED), Delete.
When to assign a document to a Collection
Three entry points:
- Single wizard (Confirm step) -- "Collection" combobox -- existing or + New.
- Bulk profile (step 2) -- same combobox, applied to all files in the run; Path-rules overrides do not touch the Collection (it's a single value per job).
- Post-bulk Review Queue -- inline editable column, or bulk-edit bar to assign many rows together.
A document can belong to multiple Collections at the same time (many-to-many relation). Frequent for reused documents (e.g. a framework contract used both in "Rossi Case 2026" and "Bianchi Case 2026").
Chat usage
In chat the user can:
- Type @collection to trigger autocomplete of accessible OPEN Collections: the conversation gets filtered to those documents.
- Combine Topic + Collection: e.g. Topic = Contracts, Collection = Rossi Case 2026 -> retrieval restricted to contracts of that case.
- Ask for Collection-wide summaries: e.g. "summarize the status of Rossi Case 2026" -> the system builds a synthesis over the Collection's documents only.
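Combining a Topic and a Collection amounts to a conjunctive filter on the vector search. The sketch below builds one in the must / key / match shape used by Qdrant filters. The payload keys topicId and documentGroupIds and the function name are assumptions; the real payload schema may differ.

```typescript
// Minimal subset of a Qdrant-style filter: all `must` conditions have to match.
interface FieldCondition {
  key: string;
  match: { value: string };
}

interface RetrievalFilter {
  must: FieldCondition[];
}

// Builds the filter for the current chat context. Both arguments are optional:
// with neither set, the filter is empty and retrieval is unrestricted.
function buildRetrievalFilter(topicId?: string, collectionId?: string): RetrievalFilter {
  const must: FieldCondition[] = [];
  if (topicId) {
    must.push({ key: "topicId", match: { value: topicId } }); // assumed payload key
  }
  if (collectionId) {
    // Assumed payload key holding the ids of every Collection the document belongs to.
    must.push({ key: "documentGroupIds", match: { value: collectionId } });
  }
  return { must };
}
```

Since a document can belong to several Collections, the Collection ids are modeled here as an array payload field; in Qdrant, a match on an array field succeeds when any element equals the value.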
Typical lifecycle
Case opened
|
v
+ New collection "Rossi Case 2026"
|
v
Bulk ingest 200 files (customer + correspondence + acts)
| Bulk profile -> Collection = "Rossi Case 2026"
v
Review Queue: fix topic on ~10 files
|
v
Active work: chat filtered @collection-rossi-case
|
v
Case closed
|
v
status -> CLOSED (searchable archive, out of working set)
API endpoints
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/document-groups | List Collections (filters by status, search, owner) |
| POST | /api/document-groups | Create Collection |
| PATCH | /api/document-groups/:id | Rename, change status, associated Topic |
| DELETE | /api/document-groups/:id | Delete (documents stay, only lose the Collection link) |
| POST | /api/document-groups/:id/documents | Add documents |
| DELETE | /api/document-groups/:id/documents/:docId | Remove a document |
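The endpoints in the table can be exercised with a thin client along these lines. Methods and paths are taken from the table; the request body fields (name, topicId, status, documentIds) and all function names are assumptions, so check the actual API contract before relying on them. The helpers only describe the request and leave sending it to the caller.

```typescript
interface ApiCall {
  method: "GET" | "POST" | "PATCH" | "DELETE";
  path: string;
  body?: unknown;
}

const DOCUMENT_GROUPS_BASE = "/api/document-groups";

// POST /api/document-groups (body fields are assumed)
function createCollection(collectionName: string, topicId: string): ApiCall {
  return {
    method: "POST",
    path: DOCUMENT_GROUPS_BASE,
    body: { name: collectionName, topicId, status: "OPEN" },
  };
}

// PATCH /api/document-groups/:id, used here to close a Collection
function closeCollection(id: string): ApiCall {
  return {
    method: "PATCH",
    path: `${DOCUMENT_GROUPS_BASE}/${encodeURIComponent(id)}`,
    body: { status: "CLOSED" },
  };
}

// POST /api/document-groups/:id/documents (body field is assumed)
function addDocuments(id: string, documentIds: string[]): ApiCall {
  return {
    method: "POST",
    path: `${DOCUMENT_GROUPS_BASE}/${encodeURIComponent(id)}/documents`,
    body: { documentIds },
  };
}

// DELETE /api/document-groups/:id/documents/:docId
function removeDocument(id: string, docId: string): ApiCall {
  return {
    method: "DELETE",
    path: `${DOCUMENT_GROUPS_BASE}/${encodeURIComponent(id)}/documents/${encodeURIComponent(docId)}`,
  };
}
```

Each ApiCall can be passed to fetch or any HTTP client by serializing body as JSON.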
Feature flag
The sidebar shows Collections only if the tenant has the feature active:
ENABLE_DOCUMENT_GROUPS=true
When the flag is disabled, the API stays functional (existing data is not lost) but the nav entries and Collection fields in wizards / review disappear.
Best practices
- One Collection = one case / project -- avoid using Collections for further taxonomy (that's what Topics are for).
- Close concluded Collections -- an active user's working set should have 5-20 OPEN Collections, not hundreds.
- Descriptive names -- Rossi Case 2026 is better than R-2026-01. The chat uses the name as a hint for the planner.
- Collections don't replace Topics -- a Collection without a Topic is an orphan: always assign the appropriate Topic at ingestion time.
Anatomy of an INGESTION pipeline
+--------+ +--------+ +--------------+ +----------+ +----------+
| source | -> | parser | -> |role_classify | -> | chunker | -> | embedder |
+--------+ +--------+ +--------------+ +----------+ +----------+
|
v
+---------+--------+-------------+
| qdrant | postgres| operational |
| chunks | document| data |
+---------+--------+-------------+
|
v
+------------------+
| record manager |
| (cleanup state) |
+------------------+
Source nodes
- source_upload -- from user upload (web form, drag&drop, API).
- source_inchat -- documents attached to a chat (temporary).
- source_scrape -- web pages from URL.
- source_ocr -- images with OCR (GLM-OCR via vLLM 8004).
- (V2) source_email, source_gdrive, source_sharepoint -- structure ready, implementation upcoming.
Parser
- parser_auto -- automatic router by MIME type.
- parser_docling -- for PDF/DOCX/PPTX/XLSX/HTML/images. Extracts sections, tables, reading order.
- parser_unstructured -- for EML/MSG/EPUB/ODT/RTF, audio (V2 transcripts).
parser_docling and parser_unstructured both delegate to the ingestion-bridge microservice (FastAPI, Python, Docker).
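Routing by MIME type can be pictured as a lookup table plus a prefix rule for images. The sketch below follows the format lists given for the two parsers; the table itself, the function name and the handling of unknown types are assumptions, not the parser_auto implementation.

```typescript
type ParserNode = "parser_docling" | "parser_unstructured";

// Assumed mapping, derived from the formats listed for each parser.
const MIME_TO_PARSER: Record<string, ParserNode> = {
  "application/pdf": "parser_docling",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "parser_docling", // DOCX
  "application/vnd.openxmlformats-officedocument.presentationml.presentation": "parser_docling", // PPTX
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "parser_docling", // XLSX
  "text/html": "parser_docling",
  "message/rfc822": "parser_unstructured", // EML
  "application/vnd.ms-outlook": "parser_unstructured", // MSG
  "application/epub+zip": "parser_unstructured", // EPUB
  "application/vnd.oasis.opendocument.text": "parser_unstructured", // ODT
  "application/rtf": "parser_unstructured", // RTF
};

// Returns the parser node for a MIME type, or undefined when no parser is known.
function routeParser(mimeType: string): ParserNode | undefined {
  // Strip parameters such as "; charset=utf-8" and normalize case.
  const normalized = mimeType.split(";")[0].trim().toLowerCase();
  if (normalized.indexOf("image/") === 0) return "parser_docling";
  if (Object.prototype.hasOwnProperty.call(MIME_TO_PARSER, normalized)) {
    return MIME_TO_PARSER[normalized];
  }
  return undefined;
}
```

Returning undefined for unknown types leaves the fallback policy (reject, or try a default parser) to the caller.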
Classifier and chunker
- role_classifier -- LLM-based, outputs one or more roles + confidence.
- chunker_truth, chunker_format, chunker_rules, chunker_operational, chunker_examples -- each with domain presets (legal, medical, financial, HR, ...).
Metadata extractors
Specialized LLM nodes: extract_dates, extract_entities, extract_summary, extract_topics, extract_grade, extract_confidentiality. Run in parallel where possible.
Sinks
- sink_qdrant -- vectorizes with bge-m3 and writes chunks.
- sink_metadata -- updates Document on Postgres.
- sink_operational -- for OPERATIONAL role, writes tabular rows to OperationalDataRow.
- sink_record_manager -- tracks content and identity hashes for incremental cleanup.
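Putting the node types together, an ingestion canvas can be thought of as a directed graph that the engine walks in dependency order. The sketch below uses a hypothetical canvas schema (IngestionCanvas, CanvasNode, CanvasEdge) and a standard topological sort; the real canvas format and the scheduling strategy of the DSL engine are not documented here and may differ.

```typescript
// Hypothetical canvas schema, for illustration only.
interface CanvasNode {
  id: string;
  type: string;
}

interface CanvasEdge {
  from: string;
  to: string;
}

interface IngestionCanvas {
  purpose: "INGESTION";
  nodes: CanvasNode[];
  edges: CanvasEdge[];
}

const exampleCanvas: IngestionCanvas = {
  purpose: "INGESTION",
  nodes: [
    { id: "source_upload", type: "source_upload" },
    { id: "parser_auto", type: "parser_auto" },
    { id: "role_classifier", type: "role_classifier" },
    { id: "chunker_truth", type: "chunker_truth" },
    { id: "sink_qdrant", type: "sink_qdrant" },
    { id: "sink_metadata", type: "sink_metadata" },
    { id: "sink_record_manager", type: "sink_record_manager" },
  ],
  edges: [
    { from: "source_upload", to: "parser_auto" },
    { from: "parser_auto", to: "role_classifier" },
    { from: "role_classifier", to: "chunker_truth" },
    { from: "chunker_truth", to: "sink_qdrant" },
    { from: "chunker_truth", to: "sink_metadata" },
    { from: "sink_qdrant", to: "sink_record_manager" },
    { from: "sink_metadata", to: "sink_record_manager" },
  ],
};

// Kahn's algorithm: returns node ids so that every edge points forward.
// Assumes every edge references an existing node; throws on cycles.
function executionOrder(canvas: IngestionCanvas): string[] {
  const indegree: Record<string, number> = {};
  for (const node of canvas.nodes) indegree[node.id] = 0;
  for (const edge of canvas.edges) indegree[edge.to] += 1;

  const ready = canvas.nodes.filter((n) => indegree[n.id] === 0).map((n) => n.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const edge of canvas.edges) {
      if (edge.from !== id) continue;
      indegree[edge.to] -= 1;
      if (indegree[edge.to] === 0) ready.push(edge.to);
    }
  }
  if (order.length !== canvas.nodes.length) {
    throw new Error("canvas contains a cycle");
  }
  return order;
}
```

In this example the record manager runs last because it depends on both sinks, which matches its position at the bottom of the pipeline diagram.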
Ingestion modes
To manage migration and A/B testing, every company has a Company.ingestionMode field:
| Mode | Behavior |
|---|---|
| LEGACY | Old pipeline (no role, single parser). Default for companies created before v3.5.0 |
| SHADOW | LEGACY in foreground, V2 in parallel. Generates a comparison report, no user impact |
| V2 | Full cutover to canvas DSL |
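The three modes boil down to a small dispatch: which pipeline produces the result the user sees, and whether the other one also runs for comparison. The sketch below is a synchronous simplification with hypothetical names (runByMode, ModeResult); in practice the pipelines would be asynchronous and the SHADOW run would feed the comparison report rather than be returned.

```typescript
type IngestionMode = "LEGACY" | "SHADOW" | "V2";

// A pipeline is modeled as a function from a document id to some result.
type Pipeline<T> = (docId: string) => T;

interface ModeResult<T> {
  served: T; // the result that has user impact
  shadow?: T; // the V2 result kept only for comparison (SHADOW mode)
}

function runByMode<T>(
  mode: IngestionMode,
  legacy: Pipeline<T>,
  v2: Pipeline<T>,
  docId: string,
): ModeResult<T> {
  if (mode === "LEGACY") {
    return { served: legacy(docId) };
  }
  if (mode === "SHADOW") {
    // LEGACY stays in the foreground; V2 runs alongside it for comparison only.
    return { served: legacy(docId), shadow: v2(docId) };
  }
  return { served: v2(docId) };
}
```

In SHADOW mode the served value always comes from the legacy pipeline, which is what makes the comparison free of user impact.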
Recommended migration
- Set SHADOW on a pilot tenant.
- Watch the shadow_comparison_report for ~1 week (precision, recall, latency).
- Move to V2 when the numbers are on par or better.
- Remember: changing mode does not re-vectorize existing documents. To apply the new chunkers to old documents you need an explicit reingest.
Mode change via SQL:
UPDATE "Company" SET "ingestionMode" = 'V2' WHERE id = '<companyId>';
Or from UI: Company settings > Ingestion > Mode.
Wizard, Bulk and Path-rules
The Ingestion Hub (/documents/upload) is the starting point to load documents into Queria. There are two intent modes, both backed by the same DSL pipeline: only the UX flow changes.
Single mode -- the 3-step AI Wizard
Designed for 1-10 manually uploaded files. The page mounts the UploadWizardPage (3 steps) with AI-assisted classification:
- Upload -- drag & drop or file selection. Source: local, cloud, network, URL.
- Classify -- for each file the system proposes a "suggested role" card (e.g. RULES -- score 0.87) with predicted topic, sector and language. The user confirms or corrects. Multi-role: multiple roles can be selected when the document is hybrid (e.g. policy + checklist -> TRUTH + RULES).
- Confirm -- summary, visibility choice, ingestion job kickoff.
The wizard is the only mode that explicitly shows the classifier confidence score before launch. Useful when individual documents are critical (legal, healthcare) and pre-launch review is acceptable.
When to use it
1 to about 10 files. Small-volume companies start here by default. Above 50 files the system automatically suggests Bulk mode.
Bulk mode -- for cloud, network, URL lists
Designed to ingest dozens or thousands of files from a common source. Three steps:
- Source selection -- choose one of the four cards: File (mass drag&drop), Cloud (Drive / S3 / Azure / OneDrive), Network (SMB / NFS), URL list (CSV with N urls).
- Ingestion profile -- a single form with defaults applied to all selected files:
  - Topic (multi-select, required if Visibility is "Only assigned topics").
  - Collection (combobox -- existing or "+ New").
  - Document role -- TRUTH / FORMAT / RULES / OPERATIONAL / EXAMPLES / Auto AI (default).
  - Sector -- preset list or Auto AI.
  - Language -- auto-detect or ISO.
  - Visibility -- All / Only assigned topics / Admins only.
  - Path-rules (advanced, accordion closed by default -- see below).
- Job in progress -- LiveJobProgress via SSE: per-file status (queued / parsed / vectorized / failed). As soon as one file is complete you can enter the Review Queue without waiting for the job to finish.
When to use it
When importing from an entire source (e.g. a whole Drive folder, an SMB mount, a JSONL dump). From 50 files up the system prefers this mode.
Path-rules
Path-rules are per-file overrides defined inside the bulk ingestion profile. Each rule is a row { glob_pattern, override_topic, override_role }.
Examples:
| Pattern | Override topic | Override role |
|---|---|---|
| gdrive:/Legal/Judgments/**/*.pdf | Case law | RULES |
| gdrive:/Legal/**/*.pdf | Contracts | TRUTH |
| smb://share/PriceLists/**/*.xlsx | Price lists | OPERATIONAL |
| **/FAQ*.md | Help center | EXAMPLES |
Rules are evaluated top-down per file at job execution time. The first match wins; if none matches, profile defaults apply (including Auto AI for topic/role/sector).
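First-match-wins evaluation is easy to get wrong when a generic pattern sits above a specific one, so a small sketch may help. The rule shape follows the { glob_pattern, override_topic, override_role } row described above; the glob semantics (** spans directories, * stays within one path segment) and the function names are assumptions, not the product's matcher.

```typescript
interface PathRule {
  glob_pattern: string;
  override_topic: string;
  override_role: string;
}

// Converts a glob into an anchored regular expression.
// Assumed semantics: "**/" matches zero or more directories, "**" matches
// anything, "*" matches any run of characters except "/".
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    const ch = glob[i];
    if (ch === "*" && glob[i + 1] === "*") {
      if (glob[i + 2] === "/") {
        source += "(?:.*/)?";
        i += 3;
      } else {
        source += ".*";
        i += 2;
      }
    } else if (ch === "*") {
      source += "[^/]*";
      i += 1;
    } else {
      // Escape regular expression metacharacters in literal text.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

// Evaluates rules top-down and returns the first match, or undefined so that
// the caller can fall back to the profile defaults.
function resolveOverrides(filePath: string, rules: PathRule[]): PathRule | undefined {
  for (const rule of rules) {
    if (globToRegExp(rule.glob_pattern).test(filePath)) return rule;
  }
  return undefined;
}

const exampleRules: PathRule[] = [
  { glob_pattern: "gdrive:/Legal/Judgments/**/*.pdf", override_topic: "Case law", override_role: "RULES" },
  { glob_pattern: "gdrive:/Legal/**/*.pdf", override_topic: "Contracts", override_role: "TRUTH" },
  { glob_pattern: "**/FAQ*.md", override_topic: "Help center", override_role: "EXAMPLES" },
];
```

With the Judgments rule listed first, a file under gdrive:/Legal/Judgments/ resolves to RULES; if the two Legal rules were swapped, the generic one would shadow it and every judgment would be ingested as TRUTH.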
When to use them:
- Cloud or network sources with a folder structure that already encodes the company taxonomy.
- One-shot migrations where you want to preserve a path -> topic mapping without correcting each file in Review.
- Mixed sources where different sub-paths should land in different topics/roles.
Best practice:
- Order rules from most specific to most generic (first one wins).
- Document rules in the Collection name or topic changelog: future readers must understand the rationale.
- For a few isolated files, don't write rules: better to edit afterwards in the Review Queue (see below).
Review Queue (post-bulk)
After a bulk job, the Review Queue (/documents/upload?mode=bulk&job=<id>) shows all files of the run in a DataTable:
| Column | Editable |
|---|---|
| Selection checkbox | -- |
| Filename | preview on click (side Sheet) |
| Status | success / warning / error |
| Topic | inline (multi-select) |
| Collection | inline |
| Sector | inline |
| Visibility | inline |
| AI confidence | progress 0-100 (filterable) |
Quick toolbar:
- Search-by-filename (300ms debounce).
- 3 preset chips: Sector = OTHER, AI confidence < 70%, Missing topic -- to quickly catch critical files.
- CSV export of the selection.
Bulk-edit bar: appears when you select >= 1 row. Lets you apply topic / sector / visibility / collection to all selected files in a single PATCH (/api/documents/bulk-update). Optimistic UI with Sonner error toast on failure.
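The preset chips and the bulk-edit bar combine naturally: filter the rows, select them, send one PATCH. The sketch below models that flow. The endpoint path comes from the text above; the row shape, the request body fields (documentIds, patch) and all names are assumptions about the contract.

```typescript
// Hypothetical shape of one Review Queue row.
interface ReviewRow {
  id: string;
  filename: string;
  topics: string[];
  sector: string;
  aiConfidence: number; // 0-100, as shown in the AI confidence column
}

// The three preset chips expressed as row predicates.
const presetChips = {
  sectorOther: (row: ReviewRow) => row.sector === "OTHER",
  lowConfidence: (row: ReviewRow) => row.aiConfidence < 70,
  missingTopic: (row: ReviewRow) => row.topics.length === 0,
};

interface BulkUpdatePatch {
  topics?: string[];
  sector?: string;
  visibility?: string;
  collectionId?: string;
}

interface BulkUpdateRequest {
  method: "PATCH";
  path: "/api/documents/bulk-update";
  body: { documentIds: string[]; patch: BulkUpdatePatch }; // assumed body shape
}

// Builds the single PATCH applied to every selected row.
function buildBulkUpdate(selected: ReviewRow[], patch: BulkUpdatePatch): BulkUpdateRequest {
  if (selected.length === 0) {
    throw new Error("select at least one row"); // the bar only appears with >= 1 row
  }
  return {
    method: "PATCH",
    path: "/api/documents/bulk-update",
    body: { documentIds: selected.map((row) => row.id), patch },
  };
}
```

For example, rows.filter(presetChips.lowConfidence) followed by buildBulkUpdate(selection, { sector: "LEGAL" }) reassigns the sector of every low-confidence file in one request.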
The ingestion job does not wait for review: files flow through the pipeline as they're uploaded. The Review Queue is for correcting, after the fact, the topic/role/sector of files that were classified with low confidence or that required a fallback. Changes to these metadata fields do not require reingest (they are Qdrant payload + Postgres updates).
Reingest
To apply pipeline changes to already vectorized documents:
- Single document: page Documents > [doc] > Reingest.
- Bulk: script src/scripts/reingest-company-onebyone.ts (template available).
- Payload backfill (no reingest, metadata only): idempotent scripts backfill-qdrant-*.ts to stamp new fields (sector, role, grade, aisummary) without re-embedding.
Status and monitoring
Every ingestion job produces:
- IngestionJob on Postgres with state (RUNNING, COMPLETED, FAILED, RESUMED).
- PipelineSnapshot for crash resume.
- IngestionRecord (RecordManager) for deterministic cleanup.
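To show why tracking identity and content hashes makes cleanup deterministic, the sketch below diffs the records of a previous run against the current one. It illustrates the general record-manager pattern only: the names TrackedRecord, CleanupPlan and planIncrementalCleanup are hypothetical, and the exact behavior of the UPSERT, FULL, INCREMENTAL and SCOPED modes is not specified here.

```typescript
// One tracked chunk: a stable identity plus a hash of its content.
interface TrackedRecord {
  identity: string; // e.g. document id + chunk position (assumed convention)
  contentHash: string;
}

interface CleanupPlan {
  toWrite: string[]; // new or changed chunks that need (re-)embedding
  toDelete: string[]; // chunks that no longer exist in the source
  unchanged: string[]; // chunks that can be skipped entirely
}

function planIncrementalCleanup(
  previous: TrackedRecord[],
  current: TrackedRecord[],
): CleanupPlan {
  // Prototype-free maps so that any string is safe as a key.
  const previousHashes: Record<string, string> = Object.create(null);
  for (const record of previous) previousHashes[record.identity] = record.contentHash;

  const seen: Record<string, boolean> = Object.create(null);
  const plan: CleanupPlan = { toWrite: [], toDelete: [], unchanged: [] };

  for (const record of current) {
    seen[record.identity] = true;
    if (previousHashes[record.identity] === record.contentHash) {
      plan.unchanged.push(record.identity);
    } else {
      plan.toWrite.push(record.identity);
    }
  }
  for (const record of previous) {
    if (!seen[record.identity]) plan.toDelete.push(record.identity);
  }
  return plan;
}
```

The same inputs always yield the same plan, so a resumed or repeated job does not create duplicates or leave orphaned chunks behind.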
Admin dashboard: Ingestion > Live dashboard shows current jobs, queues, errors, latency per node.
Composability per tenant
An admin can:
- Add custom extractors -- e.g. a node that extracts the Italian ATECO code from company documents.
- Disable classifier -- forcing everything as TRUTH for simple cases.
- Change embedder -- e.g. for advanced multilingual (V2).
- Add custom sinks -- e.g. mirror to an external data lake.
All without modifying the backend: just duplicate the ingestion canvas and edit it.
Preset migration
Legacy presets had chunking, retrieval, reranker, etc. sections. Today they only keep chunking (chunkSize, overlapPercentage). Everything else is in the canvas. See Slim RAG Presets for the new preset schema.
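As a minimal sketch of what slimming a preset means, the function below keeps only the chunking section with its two fields and drops everything else. The field names chunkSize and overlapPercentage come from the text above; the type names, the legacy shape and the function are assumptions, and the authoritative schema is the one in Slim RAG Presets.

```typescript
// Assumed slim schema: only the chunking section survives.
interface SlimPreset {
  chunking: {
    chunkSize: number;
    overlapPercentage: number;
  };
}

// Assumed legacy shape: chunking plus any number of other sections.
interface LegacyPreset {
  chunking: { chunkSize: number; overlapPercentage: number };
  [section: string]: unknown; // retrieval, reranker, ... now live in the canvas
}

// Copies the two chunking fields explicitly so that nothing else leaks through.
function toSlimPreset(legacy: LegacyPreset): SlimPreset {
  return {
    chunking: {
      chunkSize: legacy.chunking.chunkSize,
      overlapPercentage: legacy.chunking.overlapPercentage,
    },
  };
}
```

Copying field by field, rather than spreading the legacy object, guarantees that retrieval or reranker settings cannot survive the migration by accident.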
V1 limits
- Audio/video parser deferred to V2.
- Email/SharePoint/GDrive connectors not yet active in canvases (architecture ready, implementation upcoming).
- Graph ingestion (entity extraction + community detection) excluded from V1 altogether because of its high LLM cost.
- OCR for low-quality scans remains less precise than the Docling native parser: for heavily scanned archives consider upstream OCR improvement.
Queria v3.5.0 -- Canvas-native Ingestion DSL