Parse document incoming invoices without OCR services

pokoli · March 4, 2024, 12:13pm

Hi,

We are developing a module to parse invoices in the UBL e-invoice format.
Currently we are using the document_incoming_invoice module to store such invoices and process them as supplier invoices.

Our idea was to reuse as much of the current implementation as possible, but we found that most of the functions that create invoice data are defined in the document_incoming_ocr module. As we do not have any OCR service (we just parse a structured file), we cannot use them.

Why are such functions defined in the OCR module and not in the invoice one? From what I see, they just read a JSON document and return the newly created record. As I understand it, it won’t be hard to convert UBL (or another e-invoice format) to the JSON expected by the current OCR functions and just create the invoices.

We will be happy to contribute some code if a proposal for improvement is raised.

edbo · March 4, 2024, 12:26pm

I’m developing a module based on invoice2data to import PDF invoices using text extraction and templates. I also discovered this, and it is weird. That code should be in the document_incoming_invoice module. Also, datetime conversion must be done in the module which imports the data, and it should return a Python datetime or date object. And when using the OCR services, the results are automatically sent back to the provider when an invoice is posted, without any notice.

BTW, I’m very interested in your import module, as this is becoming more and more relevant. The other way around, sending invoices in UBL format, is also very interesting and can be based on the edocument_ modules.

Completely off-topic:
Another part is Peppol, which is a stricter version of UBL and is also becoming more and more of a standard. The big problem here is how to connect to it.

pokoli · March 4, 2024, 12:30pm

Great to see that I’m not the only one having this issue. I agree that moving the code to the invoice module would be a solution. But we need to define a way to convert the source data, whether it comes from the OCR service or from any invoice parser.

Indeed we have both cases. We are sending customer invoices to a Web Service but also downloading supplier invoices from the same Web Service. I guess this will be something more and more adopted in the near future.

I found the Peppol standard, but to be honest I have not fully dug into its details.

edbo · March 4, 2024, 12:34pm

That’s up to you. Your / my module should be based on the document_incoming_invoice module and extend the different functions to import and parse the data and return the required set of data needed by document_incoming_invoice. That’s how I’m doing it. So I must convert my dates etc to the right structure.
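
A minimal sketch of that kind of date conversion, assuming the parser returns dates as ISO 8601 strings (YYYY-MM-DD). The helper name to_date is hypothetical and not part of any Tryton module:

```python
import datetime as dt


def to_date(value):
    """Return a datetime.date for an ISO 8601 string, datetime or date.

    Hypothetical helper: the import module normalises dates itself so
    that the invoice code only ever receives Python date objects.
    """
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, dt.datetime):
        return value.date()
    if isinstance(value, dt.date):
        return value
    # Raises ValueError on malformed input instead of guessing.
    return dt.date.fromisoformat(value)
```

Failing on a malformed date, rather than silently skipping it, keeps the conversion errors in the importing module where they can be reported.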

pokoli · March 4, 2024, 12:37pm

Yes, I mean that currently there is code to convert the raw data from the OCR service into the structure needed by the function.

If we make that code available when no OCR is used, your module and mine will benefit: we just convert the data to the right structure and reuse all the logic from the current code.

edbo · March 4, 2024, 3:14pm

So in short, most of the code you mention should be moved from the document_incoming_ocr to document_incoming_invoice so all invoice import services can use that code to create invoices.

I totally agree with this.

ced · March 4, 2024, 4:10pm

I disagree: the methods implemented in document_incoming_ocr are for fuzzy parsing. They always try to succeed, even with sparse data.
Documents with structured data should use stricter techniques to import the document.
Also, it would be wrong to reduce structured data like UBL to the simplified dictionary used for OCR.

pokoli · March 4, 2024, 6:09pm

What techniques are you thinking about?

In the end, the structured format just has the following fields for lines:

Description
Quantity
Unit code
Unit price
Total amount

To me these seem to be the same values one can get from an OCR service.
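
To illustrate, here is a rough sketch of reading those five fields from a UBL 2.1 invoice with the Python standard library. The element names and namespaces are the UBL ones; the dictionary keys and the function name parse_ubl_lines are assumptions made for this example and are not the keys expected by document_incoming_ocr:

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# UBL 2.1 common namespaces.
NS = {
    'cac': 'urn:oasis:names:specification:ubl:schema:xsd:'
           'CommonAggregateComponents-2',
    'cbc': 'urn:oasis:names:specification:ubl:schema:xsd:'
           'CommonBasicComponents-2',
}

# Made-up sample data with a single invoice line.
SAMPLE = f"""<Invoice
    xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
    xmlns:cac="{NS['cac']}" xmlns:cbc="{NS['cbc']}">
  <cac:InvoiceLine>
    <cbc:ID>1</cbc:ID>
    <cbc:InvoicedQuantity unitCode="C62">2</cbc:InvoicedQuantity>
    <cbc:LineExtensionAmount currencyID="EUR">20.00</cbc:LineExtensionAmount>
    <cac:Item><cbc:Name>Widget</cbc:Name></cac:Item>
    <cac:Price>
      <cbc:PriceAmount currencyID="EUR">10.00</cbc:PriceAmount>
    </cac:Price>
  </cac:InvoiceLine>
</Invoice>"""


def parse_ubl_lines(xml_text):
    """Return one dict per cac:InvoiceLine (hypothetical key names)."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.findall('cac:InvoiceLine', NS):
        quantity = line.find('cbc:InvoicedQuantity', NS)
        lines.append({
            'description': line.findtext(
                'cac:Item/cbc:Name', namespaces=NS),
            'quantity': Decimal(quantity.text),
            'unit_code': quantity.get('unitCode'),
            'unit_price': Decimal(line.findtext(
                'cac:Price/cbc:PriceAmount', namespaces=NS)),
            'amount': Decimal(line.findtext(
                'cbc:LineExtensionAmount', namespaces=NS)),
        })
    return lines
```

Unlike fuzzy OCR parsing, this sketch fails loudly if a quantity or amount is missing, since a valid structured file is expected to contain them.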

What am I missing?

edbo · March 4, 2024, 6:25pm

In all cases, OCR, UBL or something else, nothing will perfectly match the data you have in the database. So in every case you also have to search on several parameters to find the matching records. Those functions should be in the document_incoming_invoice module.

For my invoice2data module (which is a one-off project) I had to define a few functions, which are called:

def _get_document_incoming_invoice2data(self, document):
def _get_supplier_invoice_invoice2data(self, document):
def _get_supplier_invoice_invoice2data_line(self, parsed_line, invoice_data, tax_included):
def _get_supplier_invoice_invoice2data_tax(self, parsed_tax, invoice_data):

You can replace invoice2data with any name you want (typless, for example). In those functions you build the data structure needed by the underlying system to search for the party, products, accounts, taxes, etc. Even UBL has a tax identifier, and maybe you haven’t added it to the party yet. So in that sense you can make the import stricter: raise an error and tell the user that a tax identifier is missing.
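
A rough sketch of such a strict lookup. Plain dictionaries stand in for party records here, and find_party and MissingTaxIdentifier are hypothetical names for illustration; none of this is the Tryton API:

```python
class MissingTaxIdentifier(Exception):
    """Raised when no single party matches the tax identifier."""


def find_party(parties, tax_identifier):
    """Return the only party carrying tax_identifier, or raise.

    `parties` is a list of dicts with 'name' and 'tax_identifiers'
    keys (an assumption for this sketch).
    """
    if not tax_identifier:
        raise MissingTaxIdentifier(
            "The document does not contain a supplier tax identifier.")
    matches = [
        p for p in parties if tax_identifier in p['tax_identifiers']]
    if not matches:
        raise MissingTaxIdentifier(
            f"No party has the tax identifier {tax_identifier}; "
            "add it to the supplier first.")
    if len(matches) > 1:
        raise MissingTaxIdentifier(
            f"Several parties share the tax identifier {tax_identifier}.")
    return matches[0]


# Usage with made-up data:
parties = [
    {'name': 'Supplier A', 'tax_identifiers': ['BE0123456789']},
    {'name': 'Supplier B', 'tax_identifiers': []},
]
supplier = find_party(parties, 'BE0123456789')
```

A fuzzy importer would fall back to matching on the name here; the strict version stops and lets the user fix the party instead.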