Document Incoming Management

Rationale

Some companies spend a lot of time managing incoming documents (like supplier invoices).
It is pretty common for those documents to be received in an electronic format (like PDF).
It would be great if we could automate a large part of this management using OCR tools (like typless). The idea is to have a place in Tryton to drop electronic documents, with the possibility to add some metadata such as a type (invoice, delivery order, etc.) and a source (like the “From” of an email). Tryton then tries to create or update documents (like creating a supplier invoice or receiving a shipment).

Proposal

In a module named document_incoming:

We add a model document.incoming with:

  • name
  • company: optional
  • data: binary field with filestore id
  • type: selection required in state done
  • source: optional char
  • parent: Many2One to document.incoming
  • results: One2Many with just a Reference field
  • state: draft, exception, done, cancelled

The workflow executes the process for the specified type and sets the state to done.
A wizard eases splitting a PDF document by pages into other records linked with the original via the parent field. The simplest way would be a Char field which contains a semicolon-separated list of page ranges like 1-2;3;5-7. This makes one record with pages 1 and 2, another with page 3 and a last one with pages 5 to 7; page 4 is skipped, which means that we create a cancelled record for it.
It is not allowed to process a record which is used as parent (to avoid processing the same part of a document multiple times).
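The page range syntax above can be sketched in plain Python. This is only an illustration, not the wizard's code: the function name parse_page_ranges and the page_count argument are assumptions, and rejecting overlapping ranges is a choice made here that the proposal does not specify.

```python
def parse_page_ranges(spec, page_count):
    """Parse a spec such as "1-2;3;5-7" (hypothetical helper).

    Returns (ranges, skipped) where ranges is a list of 1-based
    inclusive (first, last) tuples and skipped is the sorted list of
    pages not covered by any range.
    """
    ranges = []
    seen = set()
    for part in spec.split(';'):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition('-')
        first = int(first)
        # A single page like "3" is the range 3-3.
        last = int(last) if last else first
        if not 1 <= first <= last <= page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages = set(range(first, last + 1))
        # Assumption: a page may belong to only one child record.
        if pages & seen:
            raise ValueError(f"overlapping page range: {part!r}")
        seen |= pages
        ranges.append((first, last))
    skipped = sorted(set(range(1, page_count + 1)) - seen)
    return ranges, skipped
```

For a 7-page PDF, the spec 1-2;3;5-7 gives three ranges and reports page 4 as skipped, which is the page that would end up in a cancelled record.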

A route is defined to ease automating the creation of incoming documents; it optionally allows defining some metadata.
Another route is defined for the same purpose but extracts the attachments from an email (and keeps the email as attachment).

In a module named document_incoming_invoice:

We extend document.incoming to add the type supplier_invoice.
On account.invoice we add a Many2One field to document.incoming.
The processing creates a draft supplier invoice for the document using the extracted data (which can come from the OCR dict or other sources like factur-x) and adds the document as attachment (reusing the same filestore id).
The creation should never fail, so we need default fallback values for the required fields like party, account, etc.

The module must support creating invoices with and without line details. When there is no line detail, it creates just one line with the total.
For each line we try to find an existing line with the same description to select the same account, but if we have a purchase order number, we search in the purchase lines instead. If nothing is found, we use the default account.
If the total does not match, we add a dummy line (or tax lines) to get the same totals.
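The line handling above can be sketched as follows. This is a simplified illustration, not the module's code: build_lines, find_account and the dict keys are hypothetical names, taxes are ignored, and the purchase order lookup is left out.

```python
from decimal import Decimal


def build_lines(extracted_lines, total, default_account, find_account):
    """Return line dicts whose amounts sum to `total` (hypothetical helper).

    find_account(description) returns an account or None; it stands in
    for the search on existing lines with the same description.
    """
    lines = []
    for line in extracted_lines:
        account = find_account(line['description']) or default_account
        lines.append({
            'description': line['description'],
            'account': account,
            'amount': Decimal(line['amount']),
        })
    if not lines:
        # No line detail: a single line carrying the whole total.
        return [{
            'description': None,
            'account': default_account,
            'amount': total,
        }]
    difference = total - sum(l['amount'] for l in lines)
    if difference:
        # Dummy line so that the invoice total matches the document.
        lines.append({
            'description': 'Balancing line',
            'account': default_account,
            'amount': difference,
        })
    return lines
```

The balancing line makes a mismatch visible to the user reviewing the draft invoice instead of silently changing the extracted amounts.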

In a module named document_incoming_ocr:

We extend document.incoming to add a Dict field to store metadata extracted by OCR from the document.
The module can be extended by other modules to plug any kind of OCR service.

In a module named document_incoming_ocr_typless:

It implements the requirements to access the API of the service.
Once the created documents are no longer in draft, it uploads the metadata, updated with the real data, to the feed-back loop.

Implementation

Future

Other OCR services could be implemented.
Support for embedded metadata in PDF like factur-x.

I haven’t tried it yet, but the pricing of Azure Form Recognizer for invoices seems very competitive (10 times cheaper than typless): https://azure.microsoft.com/en-us/pricing/details/form-recognizer/

I appreciate such a module a lot, especially as it is open to any cloud service/API.
Perhaps it would be nice to have a look at Jan G.'s module; he created it and has been using getmyinvoices.com as a service with Tryton for some years.
We very recently updated Jan G.'s module to use their cloud API v2. Our latest experience is that recognition is good in general (e.g. supplier, customer, number, tax, invoice amount) but poor for single lines with number of products, amount and prices.
After testing some tools and services, my favorite ended up being natif.ai, which has 100% of its technology and services in Europe (which might be important to some degree). What’s even more important to us: their level of recognition (even line-wise) is just amazing, and they additionally attach a likelihood value to each value.

The design proposal is not tied to any specific provider, but using Microsoft as the first implementation may put some people off. And for small volumes it is only 2 times cheaper, which is not much of a difference in the end.

That does not seem to be exactly the right service.

I could not compare prices because they request my email (that’s already a bad sign for me).
Also they do not seem to have a feedback loop in the API, which would prevent us from implementing it if we used them as the first implementation.

What is the source used for?

I think we miss a field named reference to store the reference of the document in the external system.

If we introduce the fallback values, we must validate that such values are not used when posting the invoice.

It may be interesting to define a set of rules to pick the right invoice account. We implemented that for a customer and it worked quite well. The rationale is that you may know in advance the account to be used based on the party and the line description. Also, for some parties you may know in advance in which accounts you want to track their expenses (for example, professional services).

Main problem here is that the external system does not know about your accounts, so you need to define a way to decide which account is used.
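A minimal sketch of what such rules could look like, in plain Python. The AccountRule structure, its fields and the first-match-wins ordering are assumptions made for illustration; they do not describe an existing Tryton model.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccountRule:
    """Hypothetical rule: pick `account` when party and description match."""
    account: str
    party: Optional[str] = None        # None matches any party
    description: Optional[str] = None  # regular expression, None matches any

    def matches(self, party, description):
        if self.party is not None and self.party != party:
            return False
        if self.description is not None and not re.search(
                self.description, description or '', re.IGNORECASE):
            return False
        return True


def pick_account(rules, party, description, default):
    """Return the account of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(party, description):
            return rule.account
    return default
```

With a rule on the party only, all expenses of a given supplier (for example professional services) go to one account whatever the line description says, and the default account is used when no rule matches.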

I guess all external modules will add the available fields like we do in account_statement_xxx modules right?

Just in case this is interesting for anyone else: we used Adocum as the API for automatic invoice OCR import.

The process should support that the supplier invoice may already exist as a draft one in Tryton.
I think in this case the supplier invoice should not be created but matched with the existing one.

To know where the document comes from (e.g. which email address).

The document is stored in Tryton.

It is done by searching previous invoice lines.

Users who use such a feature will have to set the invoice method (or the shipment method) to manual.

Of course, but such document will have several identifiers. For example, for a supplier invoice:

  1. The supplier invoice number (which is stored as reference on the invoice).
  2. The typless identifier (which should be used to provide feedback).

I’m referring to a field to store the typless identifier, so the user can see it.
Or do you plan to have a dedicated field for each type of source?

That’s not good. Draft invoices are very good for the invoice audit process. With the manual invoice method the audit process would be much harder.

Yes because we can not assume what is needed by each service.

I’m not sure if this is the right place for this topic, but for imported/scanned documents I would like to have an (optional) field in the record to hold all the recognized text, plain and unstructured, so that it is optionally available for full-text search. A usage example is a supplier’s price list: it would make it very easy to find all suppliers of a specific product.

It would be nice to have invoice2data (GitHub: invoice-x/invoice2data, “Extract structured data from PDF invoices”) implemented as well. It does not rely on third-party services, but you can use one if you want.

I think that instead of searching for invoice lines created by the purchase order, it is better to always create new lines and try to find the corresponding purchase line to set as origin.
So on purchase processing, the system should try to delete or update draft invoice lines with too much quantity before creating credit note lines.
The other advantage is that this will also improve the workflow even without these modules.

IIUC this will produce a lot of duplicate purchase lines?

Why would it be better? I just see it as a cause of duplicate information, as you will need to remove some of the invoices later.

How do you know that such invoices are correctly invoiced? What if there are partial invoices?

If you do that on the purchase part you will lose the feature that the system creates credit notes when the supplier invoices more than what we ordered on the purchase request.

I do not see how the workflow will be improved. Could you please elaborate?

The merge request is now in a state where it can already be tested.
It is still missing some features like the wizard to split, the import routes and the feed-back loop.
Also I would like to make the processing asynchronous, because the response time from Typless is a little too long for one document, so if users want to process a bunch of documents in a row it will take too long.

There will be no duplicated information as extra lines will be deleted.

The proper quantity will still be invoiced because, if needed, a credit note will be created.
Moreover, users will not be required to follow this workflow.

I did not say that.

It is simpler for the user: they just have to enter the invoice.
We can even imagine that the origin can be set after posting.

The feed-back loop has been implemented. The only “ugly” part is that we need to have code to get the correct value for each field. So this results in methods with big if-elif. But I do not think it is possible to do differently.

I split the processing in two phases (a little bit like sale confirmation) with the actual processing being posted in the queue.

I added a route to import documents using a user application. I made a special treatment if the uploaded file is flagged as an email: in this case I create a main inactive document with the raw email and create a child document for each attachment.
Also, the route accepts either a JSON with all the values or binary data with the values as parameters. This way it can easily be piped to, for example in a procmailrc with a command like | curl -X POST ... --data-binary @-.
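The email treatment described above can be illustrated with the standard library email package. This is a sketch under assumptions, not the code of the route: split_email and the dict keys (name, source, data, active) are hypothetical names that only mirror the fields mentioned in the proposal.

```python
from email import message_from_bytes, policy


def split_email(raw):
    """Split a raw email into a parent and one child per attachment.

    Hypothetical helper: returns plain dicts instead of records.
    """
    message = message_from_bytes(raw, policy=policy.default)
    parent = {
        'name': message['Subject'],
        'source': message['From'],
        'data': raw,        # the raw email is kept on the main document
        'active': False,    # the main document is inactive
    }
    children = []
    for part in message.iter_attachments():
        children.append({
            'name': part.get_filename(),
            'source': message['From'],
            # decode=True undoes the transfer encoding (e.g. base64)
            'data': part.get_payload(decode=True),
        })
    return parent, children
```

Filling the source from the “From” header is one way to keep track of where each attachment came from once it is detached from the email.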

I had an idea that we could have a first pass of OCR to just detect the kind of document (supplier invoice, delivery note etc.) which would create a new document with the detected type.

Just my two cents with our experience with managing incoming documents:

IMHO the module properly covers the case when you receive a PDF document and need to split it into several documents.

However there are use cases in which you need to scan the pages and need to join them. That is the case for supplier shipments, for example.

In this case the situation is different: you’ll get individual pages (JPEG, for example) from the scanner in the warehouse and they will have to be joined according to certain criteria.

One way of achieving that is for the user to put a QR code on the first page of each shipment document so we can later automatically put each page in its own document.

(I can get into the details why using JPEG of individual pages is better than trying to create the PDF directly from the scanner, if needed)

The problem with working with individual pages comes when there is more than one scanner in the company: two scanners working simultaneously will create incoming documents where you cannot know whether two consecutive pages really came from the same scanner or from different ones, which makes the grouping algorithm fail (or makes the life of the user harder if they need to join them manually).

The way we solved this problem is by creating a “Queue” model. All incoming documents have a Many2One field to the Queue where they come from.

So we have customers that have several queues: one for each scanner in the warehouse, another one for an e-mail account where we download e-mail attachments from, etc.

This queue can also be useful for setting a default document type.

For example, you can create a queue that receives all attachments from invoices@mycompany.com so instead of using an OCR for detecting the kind of document, you can set it automatically.
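A rough sketch of how grouping per queue could work with the QR code approach. The function group_pages and the page keys (queue, image, starts_document) are hypothetical, and the QR code detection itself is assumed to have been done beforehand.

```python
from collections import defaultdict


def group_pages(pages):
    """Group scanned pages into documents, separately for each queue.

    `pages` is an iterable of dicts, in the order they were received,
    with the keys 'queue', 'image' and 'starts_document' (True when a
    QR code was detected on the page).
    """
    documents = defaultdict(list)  # queue -> list of documents
    for page in pages:
        queue_documents = documents[page['queue']]
        # A marked page (or the very first page of a queue) opens a
        # new document; other pages extend the last one of the queue.
        if page['starts_document'] or not queue_documents:
            queue_documents.append([])
        queue_documents[-1].append(page['image'])
    return dict(documents)
```

Because each page is appended to the last document of its own queue, pages arriving interleaved from two scanners do not end up mixed in the same document.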

My scanner can scan multiple pages and generate a single PDF (it can even send it to an email address).
It is neither a fancy nor an expensive scanner.