Modules for US geo-referential data and tax accounting

Just wondering how much interest there might be in a couple of US-specific modules (sorry they are managed using git right now, but I’m told they can be converted :upside_down_face:). These in-process modules are fruits of an earlier discussion.

Apologies for introducing both at the same time, but they are related:

country_uscensus
A module that augments the country.subdivision model with FIPS and GNIS codes and imports county and local subdivisions of the United States. Data come from the US Census Bureau.

Design questions:

  1. Is country.subdivision the best place to add this kind of data? Or would creating a new model be better? I tried both approaches, but this seemed the simplest and, at least theoretically, the most compatible with other modules.
  2. It appears the client loads all its subdivisions when viewing a country (country_view_form). If subdivisions for all US states are imported, this adds up to about 36k subdivision records, so it might be good to have something, perhaps a paging mechanism or limit for this view (though this doesn’t appear to be possible?). I tried to add a childs (One2Many) field to country.subdivision and override the subdivision_form view with a field_childs to turn it into a nesting tree, which fits the data well, but is there a way to do that without manual intervention? (I see that the product_classification and product_classification_taxonomic modules do something like this, so I can take a closer look unless there is a better way to approach this point generally.)
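
To illustrate the nesting idea in question 2, here is a minimal plain-Python sketch (not Tryton code) that groups flat subdivision records into a state → county tree. The record layout and the `build_tree` helper are hypothetical. The only assumption about the data is that a county FIPS code is the 2-digit state code followed by a 3-digit county code.

```python
from collections import defaultdict

# Hypothetical flat records as (FIPS code, name); real data would come
# from the US Census Bureau files that the module imports.
RECORDS = [
    ("06", "California"),
    ("06037", "Los Angeles County"),
    ("06075", "San Francisco County"),
    ("36", "New York"),
    ("36061", "New York County"),
]


def build_tree(records):
    """Group county records (5-digit FIPS) under their state (2-digit FIPS)."""
    states = {}
    children = defaultdict(list)
    for code, name in records:
        if len(code) == 2:
            states[code] = name
        elif len(code) == 5:
            # The first two digits of a county code identify its state.
            children[code[:2]].append((code, name))
    return {
        code: {"name": name, "childs": sorted(children[code])}
        for code, name in states.items()
    }


tree = build_tree(RECORDS)
for code, node in tree.items():
    print(code, node["name"], [name for _, name in node["childs"]])
```

With a parent/childs relation like this, a view would only need to load the children of the node being expanded rather than all ~36k records at once.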

account_us_sstp
A module that manages sales and use tax for streamlined states, i.e. those that are participating in the Streamlined Sales Tax Project. Tax data are imported as tax, tax rule, tax code, and tax boundary records. The customer tax rule is dynamically resolved using the sale date along with the warehouse’s origin and customer’s destination addresses. By design, it will only attempt to resolve a customer tax rule if the customer tax rule and the default customer tax rule (in Account Configuration) are not already set.
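
The resolution order described above can be sketched roughly as follows. This is plain Python, not the module's code: `Sale`, `customer_tax_rule`, and the `resolve` callback are hypothetical names, and checking the party's rule before the configuration default is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass
class Sale:
    sale_date: date
    origin_zip: str       # from the warehouse address
    destination_zip: str  # from the customer address
    party_tax_rule: Optional[str] = None


def customer_tax_rule(
    sale: Sale,
    default_rule: Optional[str],
    resolve: Callable[[date, str, str], str],
) -> str:
    """Return an explicit rule if one is set; otherwise resolve dynamically."""
    if sale.party_tax_rule is not None:
        return sale.party_tax_rule
    if default_rule is not None:
        return default_rule
    # Neither rule is set, so fall back to the date- and address-based lookup.
    return resolve(sale.sale_date, sale.origin_zip, sale.destination_zip)


def fake_resolve(on: date, origin: str, destination: str) -> str:
    # Stand-in for the real boundary lookup; it only builds a label.
    return f"resolved:{destination}:{on.isoformat()}"


sale = Sale(date(2024, 5, 1), origin_zip="53703", destination_zip="53202")
print(customer_tax_rule(sale, None, fake_resolve))
```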

Design questions and concessions:

  1. Dynamic resolution of customer tax rule is not yet implemented for the web_shop and web_shop_product_data_feed modules.
  2. Tax._amount_where did not give the option to include a join on the account_tax table, so I decided to override the Tax.get_amount query method and introduce an additional method Tax._amount_where_tax that included the table. It would be nice if the Tax._amount_where method exposed the account_tax table so I would just have to override the one method.
  3. On InvoiceTax there is no analog of InvoiceLine._compute_taxes() so I had to override its get_move_lines() method to intercept the computed TaxLines. Would be nicer to override a more fine-grained method, but not a big deal.

Overall this was a very fun project to work on to get my feet wet with Tryton. Any comments welcome! If there is enough interest I can post more explicit rationale for each separately on the Feature board (if that is the right process?).

P.S. The US address verification module mentioned earlier will be a separate (potential) module. While it will be useful for US addresses, it didn’t turn out to be required for these two modules’ functionality.

It seems they are only used on the tax. Does it really need to be a subdivision? Could it not just be a Char field on the tax?

I do not really like the idea of temporarily changing the tax rule of the party.
I’m wondering: since the tax rule comes from the boundary (which is deduced from the full address), would it not be possible to have a single tax rule with the boundary as a pattern criterion?
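
A rough sketch of what that could look like, in plain Python rather than Tryton's actual tax rule classes. `RuleLine`, `apply_rule`, and the `boundary` criterion are hypothetical names. The sketch assumes that an unset criterion acts as a wildcard and that the first matching line wins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RuleLine:
    taxes: List[str]
    boundary: Optional[str] = None  # None matches any boundary

    def matches(self, pattern: Dict[str, str]) -> bool:
        """A line matches when every criterion it sets equals the pattern."""
        for field, value in pattern.items():
            own = getattr(self, field)
            if own is not None and own != value:
                return False
        return True


def apply_rule(lines: List[RuleLine], pattern: Dict[str, str]) -> List[str]:
    """Return the taxes of the first matching line, or an empty list."""
    for line in lines:
        if line.matches(pattern):
            return line.taxes
    return []


# One rule for a whole state: specific lines first, catch-all line last.
# The boundary codes are illustrative values only.
lines = [
    RuleLine(["state tax", "county tax"], boundary="55025"),
    RuleLine(["state tax"]),
]
print(apply_rule(lines, {"boundary": "55025"}))
print(apply_rule(lines, {"boundary": "55079"}))
```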

I see that you are using usaddress, which does not seem to be well maintained: "Is this project dead? · Issue #365 · datamade/usaddress · GitHub"
I think it would be better to use one of the proposals from "SEPA will use in the future a structured address instead of their current schema (#13190) · Issues · Tryton / Tryton · GitLab". Personally, I think deepparse is the best choice.

Yes, my motivation for using usaddress was that it was small with minimal dependencies but got (this small) job done. If other modules will be using libpostal or deepparse, then it would make sense to try to consolidate dependencies, especially since they are active projects. It looks like deepparse is written in Python and can be installed completely via pip.

Good to know about SEPA’s change in address structure.

Though if Tryton’s party.address becomes more structured in general, there may be less need for a separate parsing library. Still, there may be enough differences that one will be needed anyway. In this case, even if I have the street number from the model, I will still need directionals and suffixes.

Good question! Initially, this attribute was part of the account_us_sstp module, but I separated it to its own because I thought it could be useful for other (future) modules. The FIPS codes are commonly used in business and government sectors in the US. GNIS Feature IDs are used in scientific applications, though usually in a more comprehensive list of geographical features, but I thought a mapping between the two might be useful.

I had wondered whether smaller subdivisions could be used in sales reporting. States are large in the US, and sometimes a company wants to see how sales compare across counties and cities.

From a more practical point of view, a foreign key in the database uses less storage than the Char field.

So yes, it could be a Char field on the tax, but I have been exploring other potential uses for the codes. I hope that helps at least to know where I’m coming from, even if the current technical approach is not the best fit.

This is an interesting idea. Theoretically yes, though I wonder whether the effective dates of the boundaries may make a single rule impractical. I will take a look at the data this weekend and try it out. Thank you!
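
For reference, the effective-date concern amounts to a filter like the one below. This is a plain-Python sketch with hypothetical names (`Boundary`, `effective_boundaries`) and made-up records; it assumes each boundary record carries a start date and an optional end date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Boundary:
    zip_code: str
    code: str
    start: date
    end: Optional[date] = None  # None means still in effect


def effective_boundaries(
    boundaries: List[Boundary], zip_code: str, on: date
) -> List[Boundary]:
    """Keep the records for a ZIP code that are in effect on the given date."""
    return [
        b for b in boundaries
        if b.zip_code == zip_code
        and b.start <= on
        and (b.end is None or on <= b.end)
    ]


# Made-up records: the same ZIP code changes jurisdiction at year end.
boundaries = [
    Boundary("53703", "A", date(2020, 1, 1), date(2023, 12, 31)),
    Boundary("53703", "B", date(2024, 1, 1)),
]
print([b.code for b in effective_boundaries(boundaries, "53703", date(2024, 5, 1))])
```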

Okay, this commit has restructured things to create one tax rule per state (subdivision). The change was helpful and simplified the design of the module in a few ways, one of which is that customer tax rules work like normal and can now be set for a party without worrying about side effects.

Instead of using the boundary as pattern criteria, I created a new field for account.tax.boundary called tax_key, though I am not completely settled on the name. The concept is simply a concatenation of all the applicable tax codes for a boundary record; there are very few distinct combinations compared with the number of boundary records. Applicable tax codes for a boundary record are expressed fundamentally as FIPS codes, but could include up to 20 additional state- or vendor-defined codes (so a simple One2Many Function relation of country.subdivision won’t work). In order to store the value of this field more efficiently over potentially millions of boundary records, I split tax_key from tax boundary to its own model/table: account.tax.boundary.tax_key.
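
The storage idea can be sketched in plain Python as follows. `make_tax_key` and `TaxKeyTable` are hypothetical names, and the separator and the sorting of codes are assumptions made for illustration, not the module's actual format. The point is only that many boundary records share a handful of distinct keys, so each record stores a small id instead of the full string.

```python
from typing import Dict, Iterable


def make_tax_key(codes: Iterable[str]) -> str:
    """Concatenate the applicable tax codes into a single key.

    Sorting and the "|" separator are assumptions for this sketch.
    """
    return "|".join(sorted(codes))


class TaxKeyTable:
    """Stand-in for a separate table that stores each distinct key once."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def intern(self, key: str) -> int:
        # Reuse the existing id, or assign the next one to a new key.
        return self._ids.setdefault(key, len(self._ids) + 1)

    def __len__(self) -> int:
        return len(self._ids)


# Made-up boundary records; the codes are illustrative FIPS-style values.
boundaries = [
    {"zip": "53703", "codes": ["55", "55025"]},
    {"zip": "53704", "codes": ["55025", "55"]},
    {"zip": "53202", "codes": ["55", "55079"]},
]

table = TaxKeyTable()
for boundary in boundaries:
    boundary["tax_key_id"] = table.intern(make_tax_key(boundary["codes"]))

print(len(boundaries), "boundaries share", len(table), "tax keys")
```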

As a further argument for including smaller subdivisions as relations rather than Char fields, it seems to me that US subdivisions are underrepresented in ISO 3166, with only one level of subdivisions (at the state level, 56 in total for the country). In contrast, Spain has 69 subdivisions, for example; Italy, 126, complete with regions, provinces, and metropolitan cities; France, 127; UK, 216. I see the country_uscensus module as taking a step towards the wider use of equivalent smaller US subdivisions in Tryton.

On the other hand, with the move to one tax rule per state, the only model in this module that has a need for the smaller subdivisions is account.tax, for the descriptions. However, a human-targeted description on the tax line of an invoice is not necessary for the proper functioning of the module. If place doesn’t exist for a tax or the smaller subdivisions of the state have not been imported, it will default to a reasonable, though not as informative, alternative description.
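
The fallback behaviour is roughly the following. This is a plain-Python sketch; `tax_description` and the exact wording of the descriptions are hypothetical, not what the module produces.

```python
from typing import Optional


def tax_description(state: str, kind: str, place: Optional[str] = None) -> str:
    """Prefer a place-specific description; fall back to the state alone."""
    if place:
        return f"{place} ({state}) {kind}"
    # The place is missing or the smaller subdivisions were never imported.
    return f"{state} {kind}"


print(tax_description("WI", "sales tax", place="Dane County"))
print(tax_description("WI", "sales tax"))
```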

After looking at both deepparse and libpostal, I realize that they each could be useful in theory for various use cases in Tryton because of their support for parsing addresses from many countries. But by default, they support all the tags that I need for the SSTP address records except for directionals and suffixes (see above quote). Two options that I see are (1) to keep using usaddress or (2) (re)train a model from one of these other projects to predict these US-specific tags.

The first option seems much simpler to me (assuming the project continues to be maintained), especially as this module will only ever deal with normalized US addresses.


Below are some details about how each of the aforementioned parsing projects compares, as well as how it might impact current and future situations. I am capturing my own learning here, but the details may also be useful to those that are considering including any of these libraries as dependencies to Tryton modules. Feel free to add your own experiences or thoughts.

usaddress uses a small probabilistic model (125KiB) based on terms from the US Postal Service. It has a few small dependencies (including probableparsing and python-crfsuite that can be installed by pip, totaling < 5MiB). However, it only works for US addresses (though its support for US address tags is better than the others’ or at least more aligned with the address components in the SSTP datasets).

deepparse is packaged training-ready as a research toolkit with dependencies that total > 5GiB, including NumPy, PyTorch, SciPy, pandas, and NVIDIA GPU computing packages. Pretrained models, weights, and embeddings are downloaded separately (though automatically on first use of the model) and, according to the project, can take an additional 160MB to 6.8GB. While its single RNN model is more sophisticated and flexible than libpostal’s, and with enough extra disk space it can be installed via pip, its packaging approach makes this option feel awkward (to me) as a dependency of a Tryton module.

libpostal has a much smaller application footprint at around 25MiB, but for some reason doesn’t have much package support (Alpine Linux and FreeBSD according to pkgs.org; in the AUR for Arch Linux). Its latest pre-trained model (Senzing) weighs in at 2.2GiB, which must also be downloaded before the first use. To enable dynamic linking, it would need to be compiled and installed as a separate step before installing Tryton, or at least before any modules that had bindings to it.

As far as ISO 20022 structured addresses go, deepparse only has eight default prediction tags based on the most common components of the addresses of the countries they were testing: StreetNumber, StreetName, Unit, Municipality, Province, PostalCode, Orientation (of the street, e.g. west, east), and GeneralDelivery (for other delivery information :thinking:). libpostal has 20 default tags and better tag coverage with PostalAddress24.