We have developed an MVP product that utilizes Doc...

berk-karahan · 03-22-2024 05:43 AM

We have developed an MVP product that uses Document AI's Custom Extractor to analyze a specific type of document and extract key information. In this MVP, we have established a field structure with a 3-level nested architecture and divided a set of 100 documents into equal parts for training and testing (a split of 50 training and 50 test documents is recommended), allowing us to fine-tune the foundation model. Before proceeding to the next phase of product development, we want to ensure that we are employing the right methods.

Field Structure for Training:

Entity
    Title
    Inner Entity (for each row):
        Data point #1
        Data point #2
        Data point #3
        Data point #4
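
To make the nesting concrete, here is a minimal Python sketch of how we think about this structure once the extraction output is parsed. The class and field names (`Entity`, `InnerEntity`, `data_point_1`, and so on) are placeholders for illustration only; they are not Document AI schema types, and the real field names in our processor differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class InnerEntity:
    """One row of the document; the four data points are placeholder names."""
    data_point_1: Optional[str] = None
    data_point_2: Optional[str] = None
    data_point_3: Optional[str] = None
    data_point_4: Optional[str] = None


@dataclass
class Entity:
    """Top-level entity: a title plus one inner entity per row."""
    title: Optional[str] = None
    rows: List[InnerEntity] = field(default_factory=list)


def to_flat_records(entity: Entity) -> List[Dict[str, Optional[str]]]:
    """Flatten the nested structure so that every row carries its parent title."""
    return [{"title": entity.title, **asdict(row)} for row in entity.rows]
```

Flattening like this is only one option; it is convenient for us because downstream checks operate on one record per row.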

Although the documents we are analyzing contain similar information, the positions of the fields we expect in the output can vary from one document to another. In this case, should we aim to increase our dataset and develop a single, more generic model to accommodate these different document layouts? Or is it preferable to collect enough samples for each unique document format and train a different processor version for each format?
Regarding the decision between fine-tuning options and custom-model development in the Custom Extractor: under what circumstances is it preferable to opt for a custom model? Is there a specific threshold in dataset size beyond which custom model development becomes more viable?
We are working on a project that involves analyzing documents with nested structures and have come across challenges due to varying document layouts. Some are single-column (vertical layout), while others are dual-column (horizontal layout). In dual-column formats, the title, typically positioned at the top or to the left of other data, might be at the bottom of the left column, with the continuation of the structure at the top of the right column. This has led to overly large bounding boxes when labeling entities that span both columns. To address this, we have been labeling the data on the left as one entity and the data on the right (leaving the title entity blank) as another. However, we've observed that the processor sometimes incorrectly associates the top-right entity's title with the title of the entity below it. What is the best practice in such scenarios? Should we develop an algorithm based on bounding box positions to accurately associate titles, or should we adjust our nested structure approach by omitting the title information in these cases?
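
For reference, the bounding-box-based association we have in mind would look roughly like the sketch below. It is a post-processing idea under several assumptions, not something Document AI provides: the `ExtractedEntity` type, the `propagate_titles` helper, and the fixed `column_split` threshold are all hypothetical, and the coordinates are assumed to be normalized (0 to 1) values that we would derive ourselves from each entity's bounding polygon. Entities are sorted into reading order (page, then column, then top edge), and an entity without a title inherits the title of the entity that precedes it in that order.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ExtractedEntity:
    """Hypothetical container for one parent entity and its normalized box."""
    page: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    title: Optional[str] = None


def reading_order_key(e: ExtractedEntity, column_split: Optional[float]) -> Tuple[int, int, float]:
    """Sort key: page first, then column (left before right), then top edge."""
    if column_split is None:
        column = 0  # single-column page: everything is in one column
    else:
        x_center = (e.x_min + e.x_max) / 2
        column = 0 if x_center < column_split else 1
    return (e.page, column, e.y_min)


def propagate_titles(
    entities: List[ExtractedEntity], column_split: Optional[float] = 0.5
) -> List[ExtractedEntity]:
    """Return entities in reading order, filling each missing title with the
    title of the closest preceding entity (assumption: a title-less entity is
    always a continuation of the one before it, including across pages)."""
    ordered = sorted(entities, key=lambda e: reading_order_key(e, column_split))
    result: List[ExtractedEntity] = []
    last_title: Optional[str] = None
    for e in ordered:
        if e.title:
            last_title = e.title
            result.append(e)
        else:
            result.append(replace(e, title=last_title))
    return result
```

With a titled entity at the bottom of the left column, a title-less entity at the top of the right column, and a second titled entity below it, this assigns the left column's title to the top-right entity instead of the title of the entity beneath it. A fixed `column_split` would need to be replaced by real column detection for layouts where the gutter is not near the page center.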