Documents and Storage

Importing Documents

Select the import tool

In the Storage tab, click Import. A selector displays the available import tools, organized by category (Files, Triggers, API, Databases).

Configure the fields

Fill in the fields specific to the selected tool:
- Document name and description
- Additional settings (delimiter, format, etc.)
- Import scheduling (optional)

Fill in the metadata (optional)

Below the main fields, expand the Metadata accordion to set business attributes that travel with the document and are inherited by every chunk:

Field	Description
Source	Origin of the content (e.g. `manual`, `meeting`, `crm`). Used by quick filters and duplicate detection.
Subject	Subject or topic of the document.
Language	Document language (e.g. `pt-BR`, `en-US`).
Tags	Comma-separated free-form labels.

Each metadata field supports Magic Fill — click the sparkle icon to let AI suggest a value based on the file (its content for PDFs, otherwise its name) and the fields already filled in.

Upload the file

Drag and drop the file into the upload area or click to select. The system validates the format and size before sending.

Track processing

A progress bar starts at 1% as soon as the upload begins and advances as the document moves through the upload → storage → embedding pipeline. The bar replaces the previous spinner so the user always sees concrete progress.

When you upload a file, Magic Fill runs automatically and suggests a name, a description, the metadata fields (source, subject, language, tags) and facets. The auto-trigger waits until the facets vocabulary has fully loaded before firing, so facet suggestions always come from your organization’s controlled vocabulary. If no facets are suggested on the first upload, verify that your organization has facets configured and try triggering Magic Fill manually.

Magic Fill works with any file type, not just PDFs. For PDF files it also reads the document content; for any other format (spreadsheets, CSV, Word, images, etc.) it derives the suggestions from the file name and the fields you’ve already filled in. Uploads were never restricted by file type; only the AI suggestions were, and they now run for every file.
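
The rule for what Magic Fill reads can be sketched as below. This is only an illustration: the function name `magic_fill_context` and the shape of the returned dictionary are hypothetical and are not part of the product's documented API.

```python
from pathlib import Path


def magic_fill_context(file_name, filled_fields, pdf_text=None):
    """Return the inputs a suggestion step would see (hypothetical shape)."""
    context = {
        "file_name": file_name,
        # Only fields the user has already filled in are passed along.
        "filled_fields": {k: v for k, v in filled_fields.items() if v},
    }
    # Per the guide, the document content is read only for PDF files.
    if Path(file_name).suffix.lower() == ".pdf" and pdf_text:
        context["content"] = pdf_text
    return context
```

For a CSV or image, the context carries only the file name and the filled-in fields; for a PDF it also carries the extracted text.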

Spreadsheets: exact answers from tabular data

Spreadsheets uploaded to the knowledge base (.xlsx, .xls, .csv, .tsv, .ods) get dedicated handling: besides being indexed for search, their data becomes queryable for analytical questions.

When you ask something like “what are the total sales by segment?” or “how many units of product X were sold in Germany?”, the assistant runs a real query over the spreadsheet data — instead of guessing from text — and returns exact numbers: totals, sums, averages, counts, group-bys and filters.

- Where it works: Playground, communicators (Slack/Teams), workflows (RAG nodes) and MCP clients. The capability is the same on every surface.
- Each sheet is a queryable unit: multi-sheet workbooks are indexed per sheet, each with its own schema (columns and types).
- Multiple spreadsheets: when more than one spreadsheet is relevant, the system picks the right one or returns the result per spreadsheet. It does not add up data that is not comparable (different currencies or definitions); in that case it shows the per-source breakdown instead of a meaningless total.
- Descriptive questions (“what’s in this spreadsheet?”) get a column summary and a sample of rows.
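
To make "a real query over the spreadsheet data" concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module. The table name `sales`, its columns and its rows are invented for illustration, and the query engine the platform actually uses is not documented here.

```python
import sqlite3

# Hypothetical sheet contents; the column names are illustrative only.
rows = [
    ("Enterprise", "Germany", 10, 1200.0),
    ("Enterprise", "France", 4, 480.0),
    ("SMB", "Germany", 7, 350.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (segment TEXT, country TEXT, units INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# "What are the total sales by segment?" becomes a group-by with a sum.
totals_by_segment = dict(
    conn.execute("SELECT segment, SUM(amount) FROM sales GROUP BY segment")
)

# "How many units were sold in Germany?" becomes a filter with a sum.
units_in_germany = conn.execute(
    "SELECT SUM(units) FROM sales WHERE country = ?", ("Germany",)
).fetchone()[0]
```

Because the numbers come from an aggregate over the rows rather than from retrieved text, the answer is exact for whatever data the sheet holds.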

Storage-First Architecture

Every document — regardless of how it entered the Knowledge Base (manual upload, workflow, agent, API) — is first persisted as a storage entry and only then indexed as embeddings. This guarantees:

- A single inventory of documents visible in the listing, no matter the origin.
- Business metadata is owned by the storage row and replicated into every chunk.
- Cascade deletion: removing a storage entry automatically drops all embeddings linked to it.
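
The three guarantees can be illustrated with a small relational sketch. The table and column names (`storage`, `embeddings`, `storage_id`) are assumptions made for this example; the platform's real schema is not described in this guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE storage (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT
);
CREATE TABLE embeddings (
    id INTEGER PRIMARY KEY,
    storage_id INTEGER NOT NULL REFERENCES storage(id) ON DELETE CASCADE,
    chunk_text TEXT NOT NULL,
    source TEXT
);
""")


def index_document(name, source, chunks):
    """Persist the storage row first, then the chunks that inherit its metadata."""
    cur = conn.execute(
        "INSERT INTO storage (name, source) VALUES (?, ?)", (name, source)
    )
    storage_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO embeddings (storage_id, chunk_text, source) VALUES (?, ?, ?)",
        [(storage_id, text, source) for text in chunks],
    )
    return storage_id


def count_chunks(storage_id):
    """Count the chunks still linked to a storage entry."""
    query = "SELECT COUNT(*) FROM embeddings WHERE storage_id = ?"
    return conn.execute(query, (storage_id,)).fetchone()[0]


doc_id = index_document("Q3 notes", "meeting", ["part one", "part two"])
chunks_before = count_chunks(doc_id)
conn.execute("DELETE FROM storage WHERE id = ?", (doc_id,))
chunks_after = count_chunks(doc_id)  # the cascade removed both chunks
```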

Import Categories

Category	Description
Files	Direct document upload (CSV, PDF, DOCX, MD, etc.)
Triggers	Event-based imports
API	Data obtained from REST endpoints
Databases	Direct database connections

REST API Import

To import data from an external REST API, configure the following fields:

Field	Description
Base URL	Server address
Endpoint	Resource path
HTTP method	GET, POST, PUT, PATCH, DELETE
Authentication	None, Basic, Bearer, API Key
Parameters	Query parameters and headers
Response format	JSON, CSV, XML
Pagination	Automatic pagination configuration
Retries	Retry configuration on failure
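
As a rough picture of how these fields combine into a single request, the sketch below builds (but does not send) a request with Python's standard library. `RestImportConfig` and `build_request` are hypothetical names, only Bearer authentication is shown, and `api.example.com` is a placeholder host.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

ALLOWED_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE"}


@dataclass
class RestImportConfig:
    base_url: str
    endpoint: str
    method: str = "GET"
    bearer_token: Optional[str] = None
    query: dict = field(default_factory=dict)
    headers: dict = field(default_factory=dict)


def build_request(cfg):
    """Build (but do not send) the HTTP request described by the config."""
    method = cfg.method.upper()
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported HTTP method: {cfg.method}")
    # Join base URL and endpoint with exactly one slash between them.
    url = cfg.base_url.rstrip("/") + "/" + cfg.endpoint.lstrip("/")
    if cfg.query:
        url += "?" + urlencode(cfg.query)
    headers = dict(cfg.headers)
    if cfg.bearer_token:
        headers["Authorization"] = f"Bearer {cfg.bearer_token}"
    return Request(url, headers=headers, method=method)


example = build_request(RestImportConfig(
    base_url="https://api.example.com/",
    endpoint="/v1/items",
    bearer_token="TOKEN",
    query={"page": 1},
))
```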

Import Scheduling

Configure recurring automatic imports:

Frequency	Options
Hourly	Every N hours
Daily	Specific time
Weekly	Days of the week + time
Monthly	Day of the month + time
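
One way to read the four frequencies is as rules for computing the next run time. The sketch below is an interpretation under stated assumptions, not the scheduler's actual logic: the function `next_run` and its parameters are hypothetical, "hourly" is taken to mean N hours after the previous run, and a monthly day that does not exist in a given month is simply skipped.

```python
from datetime import datetime, time, timedelta


def next_run(after, frequency, *, every_hours=1, at=time(0, 0),
             weekdays=(), month_day=1):
    """Return the first scheduled run strictly after `after`."""
    if frequency == "hourly":
        return after + timedelta(hours=every_hours)
    matches = {
        "daily": lambda d: True,
        "weekly": lambda d: d.weekday() in weekdays,  # Monday is 0
        "monthly": lambda d: d.day == month_day,
    }[frequency]
    # Walk forward one day at a time until a day satisfies the rule.
    for offset in range(367):
        candidate = datetime.combine(after.date() + timedelta(days=offset), at)
        if candidate > after and matches(candidate):
            return candidate
    raise ValueError("no matching run within a year")
```

For example, with a last run on 31 January 2024 at 09:30, a monthly schedule on day 31 at 08:00 next fires on 31 March, because February has no 31st.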

Managing Documents

The Storage tab displays all imported documents in a table, regardless of whether they were uploaded manually or produced automatically by an agent or workflow.

Listing Columns

Column	Description
Name	Document name prefixed by the tool logo (the icon identifies the import source at a glance).
Updated at	Date of the last update.
File size	Size of the original file (formatted, e.g. `1.2 MB`).
Status	Processing state as a progress bar that advances during indexing.
Quality	Overall quality score rendered as a 5-star scale (0–10 mapped to half-stars, amber).
Actions	Sticky column on the right side — stays visible while the user scrolls the table horizontally.
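
The Quality column's mapping from a 0–10 score to a 5-star scale in half-star steps is simple arithmetic: each point is worth half a star. A minimal sketch follows; rounding fractional scores to the nearest point (half up) is an assumption, since the guide does not state the rounding rule.

```python
import math


def score_to_stars(score):
    """Map a 0-10 quality score to a 0-5 star rating in half-star steps."""
    clamped = max(0.0, min(10.0, float(score)))
    # Round half up to the nearest whole point, then halve: 1 point = half a star.
    return math.floor(clamped + 0.5) / 2
```

Under this reading, a score of 7 renders as 3.5 stars and a score of 10 as 5 stars.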

Sorting

The table is sorted by Updated at in descending order (most recent first) by default. Click a column header to sort by it; click again to toggle the direction (ascending/descending). An arrow in the header indicates the active column and direction.

Sortable columns: Name, Updated at, File size, Status and Quality. Sorting is applied on the server, so it spans the entire collection — not just the current page — and changing it returns the listing to the first page. Column resizing keeps working as usual, without triggering sorting.
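
The header-click behavior can be modeled as a small state transition. The state keys and the function `click_header` are hypothetical, and starting a newly selected column in ascending order is an assumption the guide does not confirm.

```python
SORTABLE = {"name", "updated_at", "file_size", "status", "quality"}
DEFAULT_STATE = {"sort_by": "updated_at", "direction": "desc", "page": 1}


def click_header(state, column):
    """Return the listing state after a click on a column header."""
    if column not in SORTABLE:
        return state  # e.g. the Actions column is not sortable
    if state["sort_by"] == column:
        # A second click on the active column toggles the direction.
        direction = "asc" if state["direction"] == "desc" else "desc"
    else:
        direction = "asc"  # assumed initial direction for a new column
    # Sorting runs on the server, so any change returns to the first page.
    return {"sort_by": column, "direction": direction, "page": 1}
```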

Search and Quick Filters

The search bar shares its row with the filter button and a set of quick filter cards that apply client-side:

Quick filter	Behavior
All	Default state — no client-side filter applied.
Type	Popover with the document types present on the current page (markdown, pdf, csv, etc.).
Facets	Popover with the metadata facets indexed on the documents (source, subject, language, tags).

Quick filters reset automatically when the user changes the text search or the server-side filters, keeping the listing coherent.

Additional search capabilities:

- Text search: name, description and metadata.
- Date search: filter by creation or update period.
- Status filter: filter by processing state.
- Quality and file-size filters: available through the main filter button (Category Filter).
- Column visibility: show or hide table columns.

Document Status

Status	Description
Active	Document available for querying
Completed	Processing finished
Embedded	Embeddings generated successfully
Processing	Embedding generation in progress (progress bar advancing)
Pending	Awaiting processing
Stored	File saved in the system
Partial	Partially processed
Failed	Processing error

Actions

The row actions menu (right side of each row) contains:

Action	Icon	Description
Info	`i` inside a circle	Opens the Document Details modal.
Download	Download icon	Available when the document has a file in Storage.
Delete	Trash icon	Removes the document with a confirmation dialog.

Selecting Info opens a fixed-height modal organized into four tabs:

Tab	Content
Details	Document name, description (full text), file information (size, type), origin, dates.
Metadata	All keys persisted in `storage.metadata`, including facets and processing data. JSON values (e.g. attendee lists, structured objects) are detected automatically and rendered as formatted blocks instead of raw strings. For spreadsheets, it also shows a readable Schema block (sheets, columns and types).
Quality	Quality, completeness and relevance scores; processing block with chunk method, document type, model, provider and timestamp. For spreadsheets, the chunk method is `sheet` (one unit per sheet).
Chunks	Paginated list of chunks generated for the document. Visible only to super-admin users. For spreadsheets, the tab is labeled Sheets, showing the sheet name, row count and columns.

The footer of the modal exposes Download and Delete as the primary actions, alongside Close.
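
The automatic JSON detection described for the Metadata tab could work along these lines. This is a sketch of the idea, not the product's code; `render_metadata_value` is a hypothetical helper.

```python
import json


def render_metadata_value(value):
    """Pretty-print strings holding a JSON object or array; return others as text."""
    if not isinstance(value, str):
        return str(value)
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return value  # plain text such as "crm" stays as-is
    # Scalars like "42" are valid JSON but read fine as raw strings.
    if isinstance(parsed, (dict, list)):
        return json.dumps(parsed, indent=2, ensure_ascii=False)
    return value
```

Only objects and arrays are reformatted, so an attendee list becomes an indented block while simple values such as a source name are shown unchanged.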

Idempotent Indexing

Documents produced automatically (e.g. by agents and workflows) use an upsert flow rather than a blind insert:

- A set of `upsertKeys` (typically `source` plus business identifiers such as `meeting_title`, `meeting_date`, `organizer_email`, `document_type`) is matched against existing storage entries.
- Match found → the existing storage file is overwritten in place, old embeddings are removed, and re-indexing produces a fresh set of chunks while preserving the original `storage_id` and `created_at`.
- No match → a new storage entry is created normally.
- Identical content → the duplicate is detected during indexing, the orphan storage row created in the meantime is removed, and the listing keeps the original entry untouched.

This makes re-runs of the same source safe and prevents the listing from filling up with duplicates.
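
The upsert flow can be approximated with an in-memory sketch. Several simplifications are assumptions: the `storage` dictionary stands in for the real storage table, duplicates are detected with a SHA-256 content hash checked up front (rather than by creating and then removing an orphan row), and the function name `upsert` is hypothetical.

```python
import hashlib
from itertools import count

storage = {}       # storage_id -> entry (stand-in for the storage table)
_ids = count(1)


def upsert(metadata, content, upsert_keys, now):
    """Return (storage_id, outcome) for an automatically produced document."""
    key = tuple(metadata.get(k) for k in upsert_keys)
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()

    # Identical content: keep the original entry untouched.
    for storage_id, entry in storage.items():
        if entry["digest"] == digest:
            return storage_id, "duplicate"

    # Match found: overwrite in place, preserving storage_id and created_at.
    for storage_id, entry in storage.items():
        if entry["key"] == key:
            entry.update(content=content, digest=digest, updated_at=now)
            return storage_id, "overwritten"

    # No match: create a new storage entry.
    storage_id = next(_ids)
    storage[storage_id] = {
        "key": key, "content": content, "digest": digest,
        "created_at": now, "updated_at": now,
    }
    return storage_id, "created"
```

Running the same source three times (new content, changed content, unchanged content) yields one storage entry whose `created_at` never changes.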

Chunking Strategy

Before a document is indexed, Prodgy splits it into chunks. By default the platform picks the best strategy automatically by analyzing the content — headings, speaker turns, code blocks, page breaks, length, and so on.

Workflow Knowledge Base nodes (operations Save in storage and Upsert in storage) expose an optional Chunk strategy field that lets you override this automatic choice when a specific behavior is required — for example, keeping a meeting summary as a single block instead of letting it be split by heading.

Option	Behavior
Automatic (content-based)	Default. Prodgy analyzes the content and selects the strategy. Identical to the previous behavior.
Single block	Stores the whole document as a single chunk.
By heading	Splits on Markdown / section headings.
By speaker	Splits on speaker turns (meeting transcriptions).
By page	Splits on page breaks (PDFs).
By code block	Splits on code fences.
Semantic	Groups semantically related passages.
By sentence	Splits on sentence boundaries.
Fixed size	Splits into fixed-size windows.
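
To show how the choice of strategy changes the result, here is a minimal sketch of three of the options. The strategy identifiers (`single_block`, `by_heading`, `fixed_size`), the `size` parameter and the splitting rules are assumptions for illustration; Prodgy's actual implementations are not documented here.

```python
import re


def chunk(text, strategy="single_block", size=200):
    """Split text according to a simplified version of the named strategy."""
    if strategy == "single_block":
        return [text]
    if strategy == "by_heading":
        # Split before each Markdown heading line, keeping heading and body together.
        parts = re.split(r"(?m)^(?=#{1,6} )", text)
        return [p.strip() for p in parts if p.strip()]
    if strategy == "fixed_size":
        # Non-overlapping windows of `size` characters.
        return [text[i:i + size] for i in range(0, len(text), size)]
    raise ValueError(f"unknown strategy: {strategy}")


summary = "# Decisions\nShip on Friday.\n## Risks\nVendor delay.\n"
as_one = chunk(summary, "single_block")
by_heading = chunk(summary, "by_heading")
```

With the meeting summary above, `single_block` keeps everything in one chunk, while `by_heading` produces one chunk per heading, which is the split the override is meant to avoid.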