Ingestion

 

The ingestion service takes in files from various sources and brings them into the system. This article explains which files can be ingested and how ingestion quality can be improved.



Glossary

Content Contributor: A user who has permission to ingest (upload) files in the knowledge center in order to add their content to the knowledge base.

Ingesting: Uploading or connecting a file to the knowledge base of Unique Chat.

OCR: Optical Character Recognition, a technology that recognizes text within a digital image. It is commonly used on scanned documents and images.

Supported Document Types (in general)

The following document types can generally be ingested (see the exceptions below):

  • PDF: .pdf

  • Word: .docx

  • Excel: .xlsx

  • SharePoint Sites: .aspx

  • PowerPoint: .pptx

  • Text file: .txt

  • HTML: .html

  • Markdown: .md
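
As a quick check before uploading, the list above can be expressed as a lookup on the file extension. The sketch below is illustrative only: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names and not part of the ingestion service, and a matching extension does not guarantee that ingestion will succeed.

```python
from pathlib import Path

# Extensions copied from the list above. The mapping is a local helper
# (an assumption for illustration), not an API of the ingestion service.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".xlsx": "Excel",
    ".aspx": "SharePoint Sites",
    ".pptx": "PowerPoint",
    ".txt": "Text file",
    ".html": "HTML",
    ".md": "Markdown",
}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is on the generally supported list."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("annual_report.PDF"))  # extension matching is case-insensitive
print(is_supported("legacy_contract.doc"))
```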

Exceptions

This list serves only as guidance and is not exhaustive, because the success of ingestion depends heavily on the structure of each individual file.

  • Any file larger than 100 MB

  • PDF and .docx files with mostly text and more than 600 pages (e.g., e-books, directives, laws)

  • PDF and .docx files with many pictures/figures and more than 250-300 pages (e.g., annual reports)

  • .doc format (older Word versions)

  • .xls format (older Excel versions)

  • Excel files with VBA (Visual Basic for Applications)

  • Excel files with a combination of more than 10’000 rows and more than 50 columns

  • HTML files larger than 20 MB

Please always check the ingestion state in the knowledge center to see if the file ingestion has been successful or not.
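
The exceptions above can also be turned into a rough pre-upload checklist. The following is a minimal sketch under stated assumptions: `FileInfo` and `ingestion_warnings` are hypothetical names, the contributor fills in the values by hand, and 250 pages (the lower end of the 250-300 range) is used as the threshold for figure-heavy documents. It does not replace checking the ingestion state in the knowledge center.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileInfo:
    """Hypothetical description of a file, filled in by the contributor."""
    extension: str                 # e.g. ".pdf"
    size_mb: float
    pages: Optional[int] = None    # PDF / Word only
    mostly_text: bool = True       # False for figure-heavy documents
    rows: Optional[int] = None     # Excel only
    columns: Optional[int] = None  # Excel only
    has_vba: bool = False          # Excel only


def ingestion_warnings(info: FileInfo) -> List[str]:
    """Return the listed exceptions that apply to this file."""
    ext = info.extension.lower()
    found: List[str] = []
    if info.size_mb > 100:
        found.append("file larger than 100 MB")
    if ext in (".doc", ".xls"):
        found.append(f"legacy {ext} format")
    if ext in (".pdf", ".docx") and info.pages is not None:
        # Assumption: 250 is the lower end of the 250-300 page range.
        limit = 600 if info.mostly_text else 250
        if info.pages > limit:
            found.append(f"more than {limit} pages")
    if info.has_vba:
        found.append("Excel with VBA")
    if ext == ".xlsx" and (info.rows or 0) > 10_000 and (info.columns or 0) > 50:
        found.append("more than 10,000 rows combined with more than 50 columns")
    if ext == ".html" and info.size_mb > 20:
        found.append("HTML larger than 20 MB")
    return found


report = FileInfo(".pdf", size_mb=12, pages=320, mostly_text=False)
print(ingestion_warnings(report))
```

An empty list only means that none of the listed exceptions apply; the file can still fail for structural reasons.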

Improving Document Ingestion Quality

Converting a document from PDF to a format that can be read by large language models (LLMs), such as Markdown, is a critical step for ensuring the high quality of outputs from FinanceGPT. Maintaining the integrity of text structures, especially those within tables and complex structures, is a particularly challenging task that is currently the focus of extensive research.

Layout

  1. Structured Data: Text should have clear headings, subheadings, and bullet points or numbered lists. Paragraphs should be clear and distinct, each focused on a single topic, with consistent spacing between lines and paragraphs.

  2. Footnotes: The link between a footnote and the place where it is used in a paragraph can get lost. Include the information in the paragraph itself if possible.

  3. Links: Avoid references to other paragraphs (e.g., 'as discussed in section 3'), as this connection might not be understood.

  4. Multi Columns: Avoid multi-column layouts if possible.

  5. Remove Noise: Footnotes, page numbers, headers, and footers should be eliminated if they do not contribute to the content’s meaning.

Text content

  1. Language and Style: The document should be written in clear, concise sentences and provide enough context for understanding.

  2. Metadata: The document should include a descriptive title.

  3. Contextual Clues: Context should be provided where necessary, especially when introducing acronyms, technical terms, or jargon.

Tables

  1. Headings: Tables should be simply formatted with clear headings for rows and columns for easy parsing by the model.

  2. Borders: Tables should have full borders.

  3. Multipage tables: Avoid multi-page tables. If a table spans multiple pages, the headings have to be repeated on each page. The content of a single cell must stay within one page (no breaking across pages).
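
To illustrate why simple tables with clear headings are easier to handle, the sketch below renders a table as Markdown, roughly the kind of text a language model might receive after conversion. The function `to_markdown_table` and the sample values are made up for illustration; this is not the converter used by the ingestion service.

```python
from typing import List, Sequence


def to_markdown_table(headers: Sequence[str], rows: List[Sequence[str]]) -> str:
    """Render a table with one header row as a Markdown pipe table."""
    # Every row needs exactly one cell per heading; merged or missing
    # cells break this one-to-one mapping.
    if any(len(row) != len(headers) for row in rows):
        raise ValueError("every row needs exactly one cell per heading")
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Illustrative sample data only.
print(to_markdown_table(
    ["Product", "Currency", "Fee"],
    [["Product A", "CHF", "0.5%"], ["Product B", "EUR", "0.7%"]],
))
```

In this representation each cell is tied to exactly one column heading, which is what repeated headings and unbroken cells help preserve.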

Limitations

  • Pictures/figures cannot be read, even if they are included in a .pptx or .pdf file

  • PowerPoint presentations with complex figures (e.g., boxes with arrows)

  • Excel files with > 40 columns

  • “Unclean/Artistic” and very complex tables (various merged cells, no borders etc.)

The client's Content Contributor should always consider how a document will be chunked before adding it to a knowledge base that is accessible to many end users.

Monitoring and Reporting

Please always report files that failed to ingest or led to bad chunking to Unique. These reports support Unique in continuously improving the service.

  1. Please always review if your uploaded/connected files were ingested properly by checking the ingestion state in the knowledge center.

  2. Please report it if a document failed to ingest or led to bad chunking to enterprise-support@unique.ch with the following details:

    1. Optional: Ideally, provide us with the file. However, if it is confidential, please provide only the information in items 2 to 4.

    2. Document type (e.g., excel, word, html)

    3. Document size (e.g., 300 pages, 5MB)

    4. Document specifics (e.g., scanned document with OCR, Excel with 34’000 rows)
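
The details above can be collected in a consistent form before writing to support. The sketch below is one possible way to do so; `IngestionReport` and `format_report` are hypothetical helpers, and only the e-mail address and the listed fields come from this article.

```python
from dataclasses import dataclass

SUPPORT_ADDRESS = "enterprise-support@unique.ch"  # address given in this article


@dataclass
class IngestionReport:
    """Hypothetical container for the report details listed above."""
    document_type: str           # e.g. "excel", "word", "html"
    document_size: str           # e.g. "300 pages, 5MB"
    specifics: str               # e.g. "scanned document with OCR"
    file_attached: bool = False  # optional; keep False for confidential files


def format_report(report: IngestionReport) -> str:
    """Build a plain-text body that can be pasted into an e-mail."""
    attached = "yes" if report.file_attached else "no (confidential or not provided)"
    return "\n".join([
        f"To: {SUPPORT_ADDRESS}",
        f"File attached: {attached}",
        f"Document type: {report.document_type}",
        f"Document size: {report.document_size}",
        f"Document specifics: {report.specifics}",
    ])


print(format_report(IngestionReport("excel", "5MB", "excel with 34'000 rows")))
```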

 


Author

@Jovana Sanussi

 

© 2024 Unique AG. All rights reserved.