One of our customers is having performance issues when generating an accounting extract that relies on the account.account.party Model, which uses a table_query over hundreds of millions of records.
To cut a long story short, in the end one of the origins of the slowness is the fact that we repeat the query for each batch of 1000 records. So I started to think again about using materialized views in Tryton (a subject we have already thought about, but without finding a good enough solution).
But this time I think we’ve finally reached a point where the idea and the implementation look good enough.
The idea
It’s really three different things coming together:
A Mixin that people can add to an existing Model; it overrides the __register__ method in order to create the view from the table_query, and it also adds a method to refresh the view.
A wizard on each materialized view that will allow the user to trigger a refresh of the materialized view.
A cron job that would loop over each of the materialized views and call the refresh method when the time has come. The refresh interval would be specified in a section of the Tryton configuration file.
The issues
One of the issues we had with the materialized views is that they would need to be somehow invalidated when the underlying data generating them changes.
Our position on this has changed a little: first, we (at least I, but @ced seems to go along) consider that it might not be a big deal as long as the user knows that the data can be outdated. I mean, when your company is dealing with millions of invoices each year, you don’t really care if the statistics are off by 10 or 20 invoices because the invoices of today are not taken into account yet. Moreover there’s a way for the users to refresh the data (and thus hit your database hard with constant requests to compute a huge view).
Moreover, with the refresh method it will be possible for developers to trigger the refresh themselves if really necessary, by plugging it into the CUD calls (although I wonder why one would use a materialized view in that case, but you might have a good reason).
One pending issue with this design concerns the cron job: while a view is not populated with any record, we cannot know whether it’s time to refresh it or not. For now I’ve decided to skip updating such views.
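PostgreSQL does expose whether a materialized view has ever been populated (the `ispopulated` column of the `pg_matviews` catalog view), so the decision described above could be sketched like this. The catalog query is real SQL; the function name and the interval handling are illustrative assumptions:

```python
import datetime as dt

# pg_matviews.ispopulated is a real PostgreSQL catalog column telling
# whether the view has ever been populated.
IS_POPULATED_SQL = "SELECT ispopulated FROM pg_matviews WHERE matviewname = %s"


def should_refresh(ispopulated, last_refresh, interval, now=None):
    """Decide whether the cron job should refresh a materialized view."""
    if not ispopulated:
        # Current design: skip views that were never populated.
        return False
    if last_refresh is None:
        return True
    now = now or dt.datetime.now()
    return now - last_refresh >= interval
```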
Another pending issue is updating the view when the table_query is modified (not only because a new module extends it, but also because a new version might change it). We could DROP and recreate the view on each __register__ call, but that’s not satisfactory from my point of view.
The code
So the code for this is available in this merge request:
I would appreciate it if anybody could share their ideas about this (especially @albert, as I know you’re quite keen on everything Postgres and I guess it’s an issue you’ve encountered quite a few times with babi).
For me it must be updated; otherwise it will never be bootstrapped, as new materialized views are always empty.
One solution would be to compare the queries between the model and the view.
If it is not possible to get the query from the view, I think we could use the hash of the SQL query in the view name (a little bit like for the indexes), so if the SQL query changes, the name changes too.
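As a sketch of the hash-in-the-name idea (the hash length, the suffix scheme and the way PostgreSQL’s 63-byte identifier limit is handled are all assumptions for illustration):

```python
import hashlib


def materialized_view_name(base_name, query_sql, hash_length=8):
    """Derive a view name that changes whenever the SQL query changes."""
    digest = hashlib.sha256(query_sql.encode('utf-8')).hexdigest()
    name = '%s_%s' % (base_name, digest[:hash_length])
    # PostgreSQL silently truncates identifiers longer than 63 bytes, so
    # keep the hash suffix intact by truncating the base name instead.
    if len(name) > 63:
        name = '%s_%s' % (
            base_name[:63 - hash_length - 1], digest[:hash_length])
    return name
```

An unchanged query always maps to the same name, so `__register__` can cheaply detect whether the current view matches the current table_query.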
Yes, it’s the only correct thing to do, but I was wondering if there wasn’t another way.
We can retrieve the query, but it will be the verbatim of what was executed by the DB, so we would have to interpolate the parameters ourselves, and I don’t think psycopg can do that.
Given that your proposal makes it part of core, maybe that information could be obtained dynamically in some way.
It seems a minor detail, but it is not.
Also note that if the refresh is relatively frequent, it is also much more probable that search() returns a set of records that no longer exist when the client does the read() operation, and similar situations. Handling that gracefully would be a nice addition too.
This is a special case because we’re not showing the user the latest information. Everywhere else in Tryton, when the user opens a list of records, that information is the latest there is. After 5 seconds they can reload and they’ll see exactly what’s there at that moment.
That is no longer true as soon as there’s a materialized view so we must be clear about that to the user.
You are thinking about existing users who already know the system and have assumptions about its behavior.
This does not apply to new users.
Old users must learn the new system after an upgrade.
For example, Discourse has some dashboards and I do not expect them to show me the very latest information, and they do not need to display anything special for me to expect that. They just have a “Refresh” button on the detailed version.
In fact psycopg2 does the merge between the query and the parameters client-side (and I haven’t found out yet how to access this), but psycopg3 uses server-side binding. So sooner or later it won’t be possible anymore.
I tend to think that we should use the hash solution (with an additional switch on trytond-admin to drop the outdated views).
Indeed, I do not think we should drop old views, because most of them will be intermediary views that will be created again.
So I think we should always create the materialized view with no data so it does not consume any resources.
For the name of the view, I think the hash should be just appended to the standard name to make it unique.
This reasoning holds for the views that are extended by another module, but it no longer holds when we add a new field to the same model in the same module.
But then we need to refresh them once without the CONCURRENTLY keyword (it’s a PostgreSQL requirement).