Document Incoming Management

Rationale

Some companies spend a lot of time managing incoming documents (such as supplier invoices).
It is quite common for those documents to be received in electronic format (such as PDF).
It would be great if we could automate a large part of that management using OCR tools (like typless). So the idea is to have a place in Tryton to drop electronic documents, with the possibility of adding some metadata such as a type (invoice, delivery order, etc.) and a source (like the “From” of an email). Tryton then tries to create or update documents (for example, create a supplier invoice or receive a shipment).

Proposal

In a module named document_incoming:

We add a model document.incoming with:

  • name
  • company: optional
  • data: binary field with filestore id
  • type: selection required in state done
  • source: optional char
  • parent: Many2One to document.incoming
  • results: One2Many with just a Reference field
  • state: draft, exception, done, cancelled

The workflow executes the process for the specified type and sets the state to done.
A wizard eases splitting a PDF document by pages into other records linked to the original via the parent field. The simplest way would be a Char field containing a semicolon-separated list of page ranges like 1-2;3;5-7 (to make one record with pages 1 and 2, another with page 3 and a last one with pages 5 to 7; page 4 is skipped, which means that we create a cancelled record for it).
It is not allowed to process a record which is used as a parent (to avoid processing the same part of a document multiple times).
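
To make the page range syntax concrete, here is a minimal sketch of how such a field could be parsed. The function name `parse_page_ranges` and the rejection of overlapping ranges are assumptions for illustration, not part of the proposal.

```python
def parse_page_ranges(spec, page_count):
    """Parse a spec such as '1-2;3;5-7' (hypothetical helper).

    Return (ranges, skipped): ranges is a list of 1-based inclusive
    (first, last) tuples, skipped lists the pages covered by no range.
    """
    ranges = []
    covered = set()
    for part in spec.split(';'):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition('-')
        first = int(first)
        last = int(last) if last else first
        if not 1 <= first <= last <= page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages = set(range(first, last + 1))
        # Assumption: a page may only end up in one record.
        if pages & covered:
            raise ValueError(f"overlapping page range: {part!r}")
        covered |= pages
        ranges.append((first, last))
    skipped = [p for p in range(1, page_count + 1) if p not in covered]
    return ranges, skipped
```

With a 7-page document and the spec 1-2;3;5-7, this yields three ranges plus page 4 as skipped, which is the page that would get a cancelled record.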

A route is defined to ease automating the creation of incoming documents; it optionally allows defining some metadata.
Another route is defined for the same purpose but extracts the attachments from an email (and keeps the email as an attachment).
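
As a rough illustration of what the email route would have to do, the following sketch extracts the sender and the attachments from a raw message using only the Python standard library. The function name `extract_attachments` is hypothetical, and how the result is turned into document.incoming records is left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_attachments(raw_email):
    """Return (source, attachments) from the raw bytes of an email.

    source is the "From" header; attachments is a list of
    (filename, content_type, content) tuples.
    """
    message = message_from_bytes(raw_email, policy=policy.default)
    source = str(message['From'])
    attachments = []
    for part in message.iter_attachments():
        attachments.append(
            (part.get_filename(), part.get_content_type(),
                part.get_content()))
    return source, attachments


# Build a small demo message with one PDF-like attachment.
demo = EmailMessage()
demo['From'] = 'supplier@example.com'
demo['Subject'] = 'Invoice'
demo.set_content('Please find the invoice attached.')
demo.add_attachment(
    b'%PDF-1.4 demo', maintype='application', subtype='pdf',
    filename='invoice.pdf')
source, attachments = extract_attachments(demo.as_bytes())
```

Here the "From" value would fill the source field, and each attachment would become one incoming document.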

In a module named document_incoming_invoice:

We extend document.incoming to add the type supplier_invoice.
On account.invoice we add a Many2One field to document.incoming.
The processing creates a draft supplier invoice for the document using the extracted data (which can come from the OCR dict or from other sources like factur-x) and adds the document as an attachment (reusing the same filestore id).
The creation should never fail, so we need default fallback values for the required fields like party, account, etc.

The module must support creating invoices with and without line details. When there is no line detail, it creates just one line with the total.
For each line, we try to find a line with the same description in order to select the same account; but if we have a purchase order number, we search in the purchase lines instead. If nothing is found, we use the default account.
If the total does not match, we add a dummy line (or tax lines) to get the same totals.
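
The line-building logic above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: taxes and the purchase order lookup are ignored, and `build_lines`, the `extracted` dict layout and the `known_accounts` mapping (description to account) are hypothetical names.

```python
from decimal import Decimal


def build_lines(extracted, known_accounts, default_account):
    """Return line dicts whose amounts sum to extracted['total']."""
    total = Decimal(extracted['total'])
    lines = []
    for detail in extracted.get('lines') or []:
        description = detail['description']
        lines.append({
                'description': description,
                # Reuse the account of a known line, else fall back.
                'account': known_accounts.get(
                    description, default_account),
                'amount': Decimal(detail['amount']),
                })
    if not lines:
        # No line detail: a single line carrying the total.
        return [{
                'description': '',
                'account': default_account,
                'amount': total,
                }]
    difference = total - sum(
        (line['amount'] for line in lines), Decimal(0))
    if difference:
        # Dummy line so that the invoice total matches the document.
        lines.append({
                'description': 'Balancing line',
                'account': default_account,
                'amount': difference,
                })
    return lines
```

For example, a document with a total of 100.00 and two detected lines of 60.00 and 30.00 would get a third line of 10.00 on the default account.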

In a module named document_incoming_ocr:

We extend document.incoming to add a Dict field to store metadata extracted by OCR from the document.
The module can be extended by other modules to plug any kind of OCR service.
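
One way to picture the extension point is a small registry of OCR backends, as sketched below. This is not how Tryton modules actually extend each other (that is normally done by extending the model in the pool); the names `OCR_SERVICES`, `register_ocr_service`, `extract_metadata` and the `dummy` backend with its `size` key are all made up for illustration.

```python
# Hypothetical registry mapping a service name to a callable that
# takes the document bytes and returns a metadata dict.
OCR_SERVICES = {}


def register_ocr_service(name):
    """Decorator registering an OCR backend under the given name."""
    def decorator(func):
        OCR_SERVICES[name] = func
        return func
    return decorator


@register_ocr_service('dummy')
def dummy_ocr(data):
    # Stand-in backend: a real one would call the service API.
    return {'size': len(data)}


def extract_metadata(service, data):
    """Run the selected backend and return the extracted metadata."""
    try:
        backend = OCR_SERVICES[service]
    except KeyError:
        raise ValueError(f"unknown OCR service: {service!r}") from None
    return backend(data)
```

The returned dict is what would be stored in the Dict field, whichever service produced it.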

In a module named document_incoming_ocr_typless:

It implements the requirements to access the API of the service.
Once the created documents are no longer in draft, it uploads the metadata, updated with the real data, to the feedback loop.

Implementation

Future

Other OCR services could be implemented.
Support for embedded metadata in PDF like factur-x.

I haven’t tried it yet, but the pricing of Azure Form Recognizer for invoices seems very competitive (10 times cheaper than typless): https://azure.microsoft.com/en-us/pricing/details/form-recognizer/

I appreciate such a module a lot, especially since it is open to any cloud service/API.
Perhaps it would be nice to have a look at the module by Jan G., who created it and has been using getmyinvoices.com as a service with Tryton for some years.
We very recently updated Jan G.'s module to use their cloud API v2. Our latest experience is that recognition is good in general (e.g. supplier, customer, number, tax, invoice amount) but poor for individual lines with product quantities, amounts and prices.
After testing some tools and services, my favorite turned out to be natif.ai, which has 100% of its technology and services in Europe (which might be important to some degree). What is even more important to us: their level of recognition (even line-wise) is just amazing, and it additionally provides a likelihood value for each extracted value.

The design proposal is not tied to any specific provider, but using Microsoft as the first implementation may put some people off. And for small volumes it is only 2 times cheaper, which is not much of a difference in the end.

That does not seem to be exactly the right service.

I could not compare prices because they request my email (that’s already a bad signal for me).
Also, they do not seem to have a feedback loop in the API, which would prevent us from implementing it if we used them as the first implementation.

What is the source used for?

I think we miss a field named reference to store the reference of the document in the external system.

If we introduce the fallback values, we must validate that such values are not used when posting the invoice.

It may be interesting to define a set of rules to pick the right invoice account. We implemented that for a customer and it worked quite well. The rationale is that you may know in advance the account to be used based on the party and the line description. Also, for some parties you may know in advance in which accounts you want to track their expenses (for example, professional services).

The main problem here is that the external system does not know about your accounts, so you need to define a way to decide which account is used.
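
To illustrate the kind of rules I mean, here is a minimal sketch where the first matching rule wins. The `AccountRule` class, its fields and `pick_account` are hypothetical names, not what we implemented for that customer.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccountRule:
    """Hypothetical rule: empty criteria match anything."""
    account: str
    party: Optional[str] = None
    description_pattern: Optional[str] = None

    def matches(self, party, description):
        if self.party is not None and self.party != party:
            return False
        if self.description_pattern is not None:
            return re.search(
                self.description_pattern, description,
                re.IGNORECASE) is not None
        return True


def pick_account(rules, party, description, default_account):
    """Return the account of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(party, description):
            return rule.account
    return default_account


rules = [
    AccountRule('professional services', party='Law Firm'),
    AccountRule('office supplies', description_pattern=r'paper|ink'),
    ]
```

With these rules, any line from the party "Law Firm" goes to professional services, a line described as "A4 Paper" from anyone else goes to office supplies, and everything else falls back to the default account.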

I guess all external modules will add the available fields like we do in the account_statement_xxx modules, right?

Just in case this is of interest to anyone else: we used Adocum as the API for automatic invoice OCR import.

The process should support that the supplier invoice may already exist as a draft one in Tryton.
I think in this case the supplier invoice should not be created but matched with the existing one.

To know who the source of the document is (e.g. which email address).

The document is stored in Tryton.

It is done by searching previous invoice lines.

Users who want to use such a feature will have to set the invoice method (or the shipment method) to manual.

Of course, but such document will have several identifiers. For example, for a supplier invoice:

  1. The supplier invoice number (which is stored as reference on the invoice).
  2. The typless identifier (which should be used to provide feedback).

I’m referring to a field to store the typless identifier, so the user can see it.
Or do you plan to have a dedicated field for each type of source?

That’s not good. Draft invoices are very good for the invoice audit process. With the manual invoice method, the audit process would be much harder.

Yes, because we cannot assume what is needed by each service.

I’m not sure if this is the right place to put this topic, but for imported/scanned documents I would like to have an optional field in the record holding all the recognized text, plain and unstructured. This would make it available for an optional full-text search. An example use is a supplier's price list: it would make it very easy to find all suppliers of a particular product.