One of our customers is having performance issues when generating an accounting extract that relies on the account.account.party Model, which uses a table_query over hundreds of millions of records.
To cut a long story short, in the end one of the origins of the slowness is the fact that we repeat the query for each batch of 1000 records. So I started to think again about using materialized views in Tryton (a subject we have already thought about, but without finding a good enough solution).
But this time I think we’ve finally reached a point where the idea and the implementation look good enough.
The idea
It’s really three different things coming together:
A Mixin that people can add to an existing Model; it overrides the __register__ method in order to create the view from the table_query, and it also adds a method to refresh the view.
A wizard on each materialized view that will allow the user to trigger a refresh of the materialized view.
A cron job that would loop over each of the materialized views and call the refresh method when the time has come. The refresh interval would be specified in a section of the Tryton configuration file.
The issues
One of the issues we had with the materialized views is that they would need to be somehow invalidated when the underlying data generating them changes.
Our position on this has changed a little: first, we (at least I, but @ced seems to go along) consider that it might not be a big deal as long as the user knows that the data can be outdated. I mean, when your company is dealing with millions of invoices each year, you don’t really care if the statistics are off by 10 or 20 invoices because the invoices of today are not taken into account yet. Moreover there’s a way for the users to refresh the data (and thus hit your database hard with constant requests to compute a huge view).
Moreover, with the refresh method it will be possible for developers to trigger the refresh themselves if really necessary, by plugging it into the CUD calls (although I wonder why one would use a materialized view in that case, but you might have a good reason).
One pending issue with this design concerns the cron job: while a view is not populated with any record, we cannot know whether it’s time to refresh it or not. For now I’ve decided to skip updating such views.
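PostgreSQL does expose whether a materialized view has ever been populated (the `ispopulated` column of the `pg_matviews` catalog view), so the decision described above could be sketched like this. The catalog query is real SQL; the function name and the interval handling are illustrative assumptions:

```python
import datetime as dt

# pg_matviews.ispopulated is a real PostgreSQL catalog column telling
# whether the view has ever been populated.
IS_POPULATED_SQL = "SELECT ispopulated FROM pg_matviews WHERE matviewname = %s"


def should_refresh(ispopulated, last_refresh, interval, now=None):
    """Decide whether the cron job should refresh a materialized view."""
    if not ispopulated:
        # Current design: skip views that were never populated.
        return False
    if last_refresh is None:
        return True
    now = now or dt.datetime.now()
    return now - last_refresh >= interval
```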
Another pending issue is updating the view when the table_query is modified (not only because a new module extends it, but also because a new version might change it). We could DROP and recreate the view on each __register__ call, but that’s not satisfactory from my point of view.
The code
So the code for this is available in this merge request:
I would appreciate it if anybody could share their ideas about this (especially @albert, as I know you’re quite keen on everything Postgres and I guess it’s an issue you’ve encountered quite a few times with babi).
For me it must be updated; otherwise it will never be bootstrapped, as new materialized views are always empty.
One solution would be to compare the queries between the model and the view.
If it is not possible to get the query from the view, I think we could use the hash of the SQL query in the view name (a little bit like for the indexes), so if the SQL query changes, the name changes too.
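As a sketch of the hash-in-the-name idea (the hash length, the suffix scheme and the way PostgreSQL’s 63-byte identifier limit is handled are all assumptions for illustration):

```python
import hashlib


def materialized_view_name(base_name, query_sql, hash_length=8):
    """Derive a view name that changes whenever the SQL query changes."""
    digest = hashlib.sha256(query_sql.encode('utf-8')).hexdigest()
    name = '%s_%s' % (base_name, digest[:hash_length])
    # PostgreSQL silently truncates identifiers longer than 63 bytes, so
    # keep the hash suffix intact by truncating the base name instead.
    if len(name) > 63:
        name = '%s_%s' % (
            base_name[:63 - hash_length - 1], digest[:hash_length])
    return name
```

An unchanged query always maps to the same name, so `__register__` can cheaply detect whether the current view matches the current table_query.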
Yes, it’s the only correct thing to do, but I was wondering if there wasn’t another way.
We can retrieve the query, but it will be the verbatim of what was executed by the DB, so we would have to interpolate the parameters ourselves, and I don’t think psycopg can do that.
Given that your proposal makes it part of core, maybe that information could be obtained dynamically in some way.
It seems a minor detail, but it is not.
Also note that if the refresh is relatively frequent, it is also much more probable that search() returns a set of records that no longer exist when the client does the read() operation, and similar situations. Handling that gracefully would be a nice addition too.
This is a special case because we’re not showing the user the latest information. Everywhere else in Tryton, when the user opens a list of records, that information is the latest there is. After 5 seconds they can reload and they’ll see exactly what’s there at that moment.
That is no longer true as soon as there’s a materialized view so we must be clear about that to the user.
You are thinking about existing users who already know the system and have assumptions about its behavior.
This does not apply to new users.
Old users must learn the new system after an upgrade.
For example, Discourse has some dashboards and I do not expect them to show me the very latest information, and they do not need to display anything special for me to expect that. They just have a “Refresh” button on the detailed version.
In fact psycopg2 does the merge between the query and the parameters client-side (and I haven’t found out yet how to access this), but psycopg3 uses server-side binding. So sooner or later it won’t be possible anymore.
I tend to think that we should use the hash solution (with an additional switch on trytond-admin to drop the outdated views).
Indeed, I do not think we should drop old views, because most of them will be intermediary views that will be created again.
So I think we should always create the materialized view with no data so it does not consume any resources.
For the name of the view, I think the hash should be just appended to the standard name to make it unique.
This reasoning holds for the views that are extended by another module, but it no longer holds when we add a new field to the same model in the same module.
But then we need to refresh them once without the CONCURRENTLY keyword (it’s a PostgreSQL requirement).