Documents and Storage

Guide
  1. Select the import tool

    In the Storage tab, click Import. A selector displays the available import tools, organized by category (Files, Triggers, API, Databases).

  2. Configure the fields

    Fill in the fields specific to the selected tool:

    • Document name and description
    • Additional settings (delimiter, format, etc.)
    • Import scheduling (optional)

  3. Fill the metadata (optional)

    Below the main fields, expand the Metadata accordion to set business attributes that travel with the document and are inherited by every chunk:

    | Field | Description |
    | --- | --- |
    | Source | Origin of the content (e.g. manual, meeting, crm). Used by quick filters and duplicate detection. |
    | Subject | Subject or topic of the document. |
    | Language | Document language (e.g. pt-BR, en-US). |
    | Tags | Comma-separated free-form labels. |

    Each metadata field supports Magic Fill — click the sparkle icon to let AI suggest a value based on the file (its content for PDFs, otherwise its name) and the fields already filled in.

  4. Upload the file

    Drag and drop the file into the upload area or click to select. The system validates the format and size before sending.

  5. Track processing

    A progress bar starts at 1% as soon as the upload begins and advances as the document moves through the upload → storage → embedding pipeline. The bar replaces the previous spinner so the user always sees concrete progress.


Spreadsheets: exact answers from tabular data

Spreadsheets uploaded to the knowledge base (.xlsx, .xls, .csv, .tsv, .ods) get dedicated handling: besides being indexed for search, their data becomes queryable for analytical questions.

When you ask something like “what are the total sales by segment?” or “how many units of product X were sold in Germany?”, the assistant runs a real query over the spreadsheet data — instead of guessing from text — and returns exact numbers: totals, sums, averages, counts, group-bys and filters.

  • Where it works: Playground, communicators (Slack/Teams), workflows (RAG nodes) and MCP clients — the same capability across every surface.
  • Each sheet is a queryable unit: multi-sheet workbooks are indexed per sheet, each with its own schema (columns and types).
  • Multiple spreadsheets: when more than one spreadsheet is relevant, the system picks the right one or returns the result per spreadsheet. It does not add up figures that are not comparable (for example, different currencies or definitions); in that case it shows the per-source breakdown instead of a meaningless total.
  • Descriptive questions (“what’s in this spreadsheet?”) get a column summary and a sample of rows.
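
The difference between running a real query and guessing from text can be illustrated with a minimal sketch. It assumes one sheet has been loaded into an in-memory SQLite table; the table name `sales`, its columns and the sample rows are hypothetical, and the query engine Prodgy actually uses is not documented here. Grouping by currency as well as segment mirrors the idea of keeping non-comparable figures apart.

```python
import sqlite3

# Hypothetical rows standing in for one sheet of an uploaded workbook.
ROWS = [
    ("Enterprise", "Germany", "EUR", 120.0),
    ("Enterprise", "France", "EUR", 80.0),
    ("SMB", "Germany", "EUR", 50.5),
    ("SMB", "USA", "USD", 200.0),
]


def total_sales_by_segment(rows):
    """Return exact per-segment totals, kept separate per currency."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (segment TEXT, country TEXT, currency TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    # Grouping by currency avoids adding EUR and USD into one meaningless total.
    result = conn.execute(
        "SELECT segment, currency, SUM(amount) FROM sales "
        "GROUP BY segment, currency ORDER BY segment, currency"
    ).fetchall()
    conn.close()
    return result


for segment, currency, total in total_sales_by_segment(ROWS):
    print(segment, currency, total)
```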

Every document — regardless of how it entered the Knowledge Base (manual upload, workflow, agent, API) — is first persisted as a storage entry and only then indexed as embeddings. This guarantees:

  • A single inventory of documents visible in the listing, no matter the origin.
  • Business metadata is owned by the storage row and replicated into every chunk.
  • Cascade deletion: removing a storage entry automatically drops all embeddings linked to it.
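
One way to picture the storage-first model is a parent table for storage entries and a child table for embeddings linked by a cascading foreign key. The sketch below uses SQLite only for illustration; the table and column names are assumptions, not Prodgy's actual schema.

```python
import sqlite3


def build_store():
    """Create a hypothetical storage/embeddings pair with cascade deletion."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE storage (id INTEGER PRIMARY KEY, name TEXT, metadata TEXT)"
    )
    conn.execute(
        "CREATE TABLE embeddings ("
        " id INTEGER PRIMARY KEY,"
        " storage_id INTEGER NOT NULL REFERENCES storage(id) ON DELETE CASCADE,"
        " chunk_text TEXT,"
        " metadata TEXT)"
    )
    return conn


def add_document(conn, name, metadata, chunks):
    """Persist the storage row first, then one embedding row per chunk."""
    cur = conn.execute(
        "INSERT INTO storage (name, metadata) VALUES (?, ?)", (name, metadata)
    )
    storage_id = cur.lastrowid
    # The storage row owns the metadata; each chunk receives a copy of it.
    conn.executemany(
        "INSERT INTO embeddings (storage_id, chunk_text, metadata) VALUES (?, ?, ?)",
        [(storage_id, chunk, metadata) for chunk in chunks],
    )
    return storage_id


def count_embeddings(conn, storage_id):
    row = conn.execute(
        "SELECT COUNT(*) FROM embeddings WHERE storage_id = ?", (storage_id,)
    ).fetchone()
    return row[0]
```

Deleting the row from `storage` is then enough to remove every linked row in `embeddings`.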

| Category | Description |
| --- | --- |
| Files | Direct document upload (CSV, PDF, DOCX, MD, etc.) |
| Triggers | Event-based imports |
| API | Data obtained from REST endpoints |
| Databases | Direct database connections |

For external data sources via API, you can configure:

| Field | Description |
| --- | --- |
| Base URL | Server address |
| Endpoint | Resource path |
| HTTP method | GET, POST, PUT, PATCH, DELETE |
| Authentication | None, Basic, Bearer, API Key |
| Parameters | Query parameters and headers |
| Response format | JSON, CSV, XML |
| Pagination | Automatic pagination configuration |
| Retries | Retry configuration on failure |
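
As a rough illustration of how these fields combine into a single HTTP request, the sketch below builds (but does not send) a request with Python's standard library. The configuration keys (`base_url`, `endpoint`, `auth`, and so on) are hypothetical names chosen for this example, not the platform's real field identifiers; pagination and retries are left out.

```python
import base64
import urllib.parse
import urllib.request


def build_request(cfg):
    """Turn a hypothetical import configuration into a urllib Request."""
    url = cfg["base_url"].rstrip("/") + "/" + cfg["endpoint"].lstrip("/")
    if cfg.get("query"):
        url += "?" + urllib.parse.urlencode(cfg["query"])

    headers = dict(cfg.get("headers", {}))
    auth = cfg.get("auth", {"type": "none"})
    if auth["type"] == "bearer":
        headers["Authorization"] = "Bearer " + auth["token"]
    elif auth["type"] == "basic":
        raw = (auth["username"] + ":" + auth["password"]).encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode("ascii")
    elif auth["type"] == "api_key":
        # The header that carries the key varies by API, so it is configurable.
        headers[auth["header"]] = auth["key"]

    return urllib.request.Request(url, headers=headers, method=cfg.get("method", "GET"))


example = {
    "base_url": "https://api.example.com/",
    "endpoint": "/v1/orders",
    "method": "GET",
    "auth": {"type": "bearer", "token": "abc"},
    "query": {"page": 1},
}
request = build_request(example)
print(request.get_method(), request.full_url)
```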

Configure recurring automatic imports:

| Frequency | Options |
| --- | --- |
| Hourly | Every N hours |
| Daily | Specific time |
| Weekly | Days of the week + time |
| Monthly | Day of the month + time |
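
To make the frequency options concrete, here is a minimal sketch of how the next run time could be derived from a schedule. The dictionary keys (`frequency`, `every_hours`, `at`, `weekdays`) are assumptions made for this example, and the monthly case is omitted for brevity; the scheduler Prodgy uses internally may behave differently.

```python
from datetime import datetime, time, timedelta


def next_run(schedule, now):
    """Return the next run time after `now` for a hypothetical schedule dict."""
    kind = schedule["frequency"]
    if kind == "hourly":
        return now + timedelta(hours=schedule["every_hours"])
    if kind == "daily":
        candidate = datetime.combine(now.date(), schedule["at"])
        return candidate if candidate > now else candidate + timedelta(days=1)
    if kind == "weekly":
        # Look at today and the following seven days (Monday is weekday 0).
        for offset in range(8):
            day = now.date() + timedelta(days=offset)
            candidate = datetime.combine(day, schedule["at"])
            if day.weekday() in schedule["weekdays"] and candidate > now:
                return candidate
    raise ValueError("unsupported or empty schedule: %r" % (schedule,))


now = datetime(2024, 1, 1, 10, 0)  # a Monday
print(next_run({"frequency": "daily", "at": time(9, 0)}, now))
```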

The Storage tab displays all imported documents in a table, regardless of whether they were uploaded manually or produced automatically by an agent or workflow.

| Column | Description |
| --- | --- |
| Name | Document name prefixed by the tool logo (the icon identifies the import source at a glance). |
| Updated at | Date of the last update. |
| File size | Size of the original file (formatted, e.g. 1.2 MB). |
| Status | Processing state as a progress bar that advances during indexing. |
| Quality | Overall quality score rendered as a 5-star scale (0–10 mapped to half-stars, amber). |
| Actions | Sticky column on the right side; stays visible while the user scrolls the table horizontally. |
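
The Quality column's mapping from a 0–10 score to half-stars can be sketched as follows. Rounding fractional scores to the nearest whole half-star and clamping out-of-range values are assumptions of this example; the source only states that 0–10 maps onto five stars in half-star steps.

```python
def score_to_stars(score):
    """Map a 0-10 quality score to a 0-5 star value in half-star steps."""
    clamped = max(0.0, min(10.0, float(score)))
    # Each whole point is one half-star; halves round up (assumption).
    half_stars = int(clamped + 0.5)
    return half_stars / 2


print(score_to_stars(7))  # 3.5
```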

The table is sorted by Updated at in descending order (most recent first) by default. Click a column header to sort by it; click again to toggle the direction (ascending/descending). An arrow in the header indicates the active column and direction.

Sortable columns: Name, Updated at, File size, Status and Quality. Sorting is applied on the server, so it spans the entire collection — not just the current page — and changing it returns the listing to the first page. Column resizing keeps working as usual, without triggering sorting.
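
The sorting behavior described above can be sketched in a few lines: the whole collection is ordered before a page is sliced out, and any change of sort returns the listing to page 1. The function and class names are hypothetical, and starting a newly selected column in ascending order is an assumption of this example.

```python
DOCS = [
    {"name": "b.pdf", "updated_at": "2024-03-02", "file_size": 300},
    {"name": "a.csv", "updated_at": "2024-03-05", "file_size": 100},
    {"name": "c.md", "updated_at": "2024-03-01", "file_size": 200},
]


def fetch_page(docs, sort_by, descending, page, page_size):
    """Sort the entire collection first, then return the requested page."""
    ordered = sorted(docs, key=lambda doc: doc[sort_by], reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]


class ListingState:
    """Tracks the active sort column, direction and page of the listing."""

    def __init__(self):
        self.sort_by = "updated_at"  # default: most recent first
        self.descending = True
        self.page = 1

    def click_header(self, column):
        if column == self.sort_by:
            self.descending = not self.descending  # second click toggles direction
        else:
            self.sort_by = column
            self.descending = False  # assumption: a new column starts ascending
        self.page = 1  # any sort change returns to the first page
```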

The search bar shares its row with the filter button and a set of quick filter cards that apply client-side:

| Quick filter | Behavior |
| --- | --- |
| All | Default state; no client-side filter applied. |
| Type | Popover with the document types present in the current page (markdown, pdf, csv, etc.). |
| Facets | Popover with the metadata facets indexed on the documents (source, subject, language, tags). |

Quick filters reset automatically when the user changes the text search or the server-side filters, keeping the listing coherent.
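
A minimal sketch of this client-side behavior, assuming the current page is available as a list of rows: the Type options come from the rows on the page, the filter narrows only those rows, and any change to the server-side query clears it. The class, method and field names are illustrative only.

```python
PAGE_ROWS = [
    {"name": "kickoff.md", "type": "markdown", "source": "meeting"},
    {"name": "prices.csv", "type": "csv", "source": "crm"},
    {"name": "contract.pdf", "type": "pdf", "source": "manual"},
]


class QuickFilter:
    """Client-side filter applied on top of the rows already loaded."""

    def __init__(self):
        self.doc_type = None  # None corresponds to the "All" card

    def type_options(self, rows):
        # Only types present in the current page are offered.
        return sorted({row["type"] for row in rows})

    def apply(self, rows):
        if self.doc_type is None:
            return list(rows)
        return [row for row in rows if row["type"] == self.doc_type]

    def on_server_query_changed(self):
        # Text search or server-side filters changed: drop the quick filter.
        self.doc_type = None
```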

Additional search capabilities:

  • Text search: name, description and metadata.
  • Date search: filter by creation or update period.
  • Status filter: filter by processing state.
  • Quality and file-size filters: available through the main filter button (Category Filter).
  • Column visibility: show or hide table columns.

| Status | Description |
| --- | --- |
| Active | Document available for querying |
| Completed | Processing finished |
| Embedded | Embeddings generated successfully |
| Processing | Embedding generation in progress (progress bar advancing) |
| Pending | Awaiting processing |
| Stored | File saved in the system |
| Partial | Partially processed |
| Failed | Processing error |

The row actions menu (right side of each row) contains:

| Action | Icon | Description |
| --- | --- | --- |
| Info | "i" inside a circle | Opens the Document Details modal. |
| Download | Download icon | Available when the document has a file in Storage. |
| Delete | Trash icon | Removes the document with a confirmation dialog. |

Selecting Info opens a fixed-height modal organized into four tabs:

| Tab | Content |
| --- | --- |
| Details | Document name, description (full text), file information (size, type), origin, dates. |
| Metadata | All keys persisted in storage.metadata, including facets and processing data. JSON values (e.g. attendee lists, structured objects) are detected automatically and rendered as formatted blocks instead of raw strings. For spreadsheets, it also shows a readable Schema block (sheets, columns and types). |
| Quality | Quality, completeness and relevance scores; processing block with chunk method, document type, model, provider and timestamp. For spreadsheets, the chunk method is sheet (one unit per sheet). |
| Chunks | Paginated list of chunks generated for the document. Visible only to super-admin users. For spreadsheets, the tab is labeled Sheets, showing the sheet name, row count and columns. |
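
The automatic JSON detection in the Metadata tab could work along these lines. This is only a sketch under the assumption that values arrive as strings and that anything parsing as a JSON object or array is pretty-printed; the function name is hypothetical.

```python
import json


def render_metadata_value(value):
    """Pretty-print values that hold a JSON object or array; pass others through."""
    if isinstance(value, str):
        text = value.strip()
        if text[:1] in ("{", "["):
            try:
                return json.dumps(json.loads(text), indent=2, ensure_ascii=False)
            except json.JSONDecodeError:
                pass  # looks like JSON but is not: fall back to the raw string
    return str(value)


print(render_metadata_value('["ana@example.com", "bob@example.com"]'))
```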

The footer of the modal exposes Download and Delete as the primary actions, alongside Close.


Documents produced automatically (e.g. by agents and workflows) use an upsert flow rather than a blind insert:

  • A set of upsertKeys (typically source plus business identifiers such as meeting_title, meeting_date, organizer_email, document_type) is matched against existing storage entries.
  • Match found → the existing storage file is overwritten in place, old embeddings are removed, and re-indexing produces a fresh set of chunks while preserving the original storage_id and created_at.
  • No match → a new storage entry is created normally.
  • Identical content → the duplicate is detected during indexing, the orphan storage row created in the meantime is removed, and the listing keeps the original entry untouched.

This makes re-runs of the same source safe and prevents the listing from filling up with duplicates.
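
The decision logic of the upsert flow can be sketched as below. It is a simplified in-memory model, not the platform's implementation: the class and method names are hypothetical, content identity is approximated with a SHA-256 digest, and the duplicate check runs before anything is written, whereas the flow described above detects duplicates during indexing and then removes the orphan row.

```python
import hashlib
from itertools import count


class StorageSketch:
    """In-memory stand-in for the storage listing."""

    def __init__(self):
        self.entries = {}  # storage_id -> entry dict
        self._next_id = count(1)

    def save(self, metadata, content, upsert_keys, now):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()

        # Identical content: keep the original entry untouched.
        for entry in self.entries.values():
            if entry["digest"] == digest:
                return entry["storage_id"], "duplicate"

        # Match on upsert keys: overwrite in place, keep id and created_at.
        wanted = tuple(metadata.get(key) for key in upsert_keys)
        for entry in self.entries.values():
            found = tuple(entry["metadata"].get(key) for key in upsert_keys)
            if found == wanted:
                entry.update(
                    metadata=dict(metadata), content=content,
                    digest=digest, updated_at=now,
                )
                return entry["storage_id"], "updated"

        # No match: create a new storage entry.
        storage_id = next(self._next_id)
        self.entries[storage_id] = {
            "storage_id": storage_id, "metadata": dict(metadata),
            "content": content, "digest": digest,
            "created_at": now, "updated_at": now,
        }
        return storage_id, "created"
```

In a real pipeline the "updated" branch would also drop the old embeddings and re-index the new content.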


Before a document is indexed, Prodgy splits it into chunks. By default the platform picks the best strategy automatically by analyzing the content — headings, speaker turns, code blocks, page breaks, length, and so on.

Workflow Knowledge Base nodes (operations Save in storage and Upsert in storage) expose an optional Chunk strategy field that lets you override this automatic choice when a specific behavior is required — for example, keeping a meeting summary as a single block instead of letting it be split by heading.

| Option | Behavior |
| --- | --- |
| Automatic (content-based) | Default. Prodgy analyzes the content and selects the strategy. Identical to the previous behavior. |
| Single block | Stores the whole document as a single chunk. |
| By heading | Splits on Markdown / section headings. |
| By speaker | Splits on speaker turns (meeting transcriptions). |
| By page | Splits on page breaks (PDFs). |
| By code block | Splits on code fences. |
| Semantic | Groups semantically related passages. |
| By sentence | Splits on sentence boundaries. |
| Fixed size | Splits into fixed-size windows. |
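
To show how the choice of strategy changes the resulting chunks, here is a small sketch of three of the options. The strategy identifiers (`single_block`, `by_heading`, `fixed_size`), the heading pattern and the default window size are assumptions for illustration; Prodgy's own splitting rules are not specified here.

```python
import re

# Empty match at the start of any line that begins a Markdown heading.
HEADING_START = re.compile(r"^(?=#{1,6} )", re.MULTILINE)


def chunk(text, strategy="single_block", size=200):
    """Split text into chunks using a simplified version of three strategies."""
    if strategy == "single_block":
        return [text]
    if strategy == "by_heading":
        # Each heading stays together with the section that follows it.
        parts = HEADING_START.split(text)
        return [part.strip() for part in parts if part.strip()]
    if strategy == "fixed_size":
        return [text[i:i + size] for i in range(0, len(text), size)]
    raise ValueError("unknown strategy: " + strategy)


summary = "# Decisions\nShip on Friday.\n## Risks\nNone identified."
print(chunk(summary, "by_heading"))
print(chunk(summary, "single_block"))
```

With `single_block`, the meeting summary stays in one piece, which is the override case mentioned above.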