add country classifications #158

Open
mabudz opened this issue Nov 7, 2024 · 4 comments
Comments

@mabudz
Contributor

mabudz commented Nov 7, 2024

Hi @konstantinstadler ,

we already use coco in our bonsai project. :-)
Since we use data from several data providers, each with its own country codes, we would like to extend country_data.tsv with the following columns (classifications):

The mappings to ISO2 are already in another format in one of our repos.

If you agree, we would open a branch to implement these classifications.

@konstantinstadler
Member

Hi,
Great, please just go ahead.
For a new classification, you really just need to:

  • add a new column in the data file (.tsv)
  • add the classification with some source link to the readme at https://github.com/IndEcol/country_converter/tree/master?tab=readme-ov-file#classification-schemes
  • add a test case that checks the new classification is present and returns countries in the format you expect. Put these tests in
    /test_functionality.py (see test_IOC for a minimal example). These might seem trivial, but we have had cases where a column shifted by one or disappeared, and these tests would catch that kind of issue.
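
A test along these lines could look roughly as follows. This is a minimal sketch, not the project's actual test code: it parses a tiny inline stand-in for the .tsv file with the standard library instead of loading coco, and the column name `my_classification` and its values are hypothetical placeholders.

```python
import csv
import io

# Tiny stand-in for the data file. "my_classification" and its codes are
# hypothetical placeholders, not real coco columns or values.
SAMPLE_TSV = (
    "name_short\tISO2\tmy_classification\n"
    "Armenia\tAM\tARM01\n"
    "Belgium\tBE\tBEL01\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dicts, one per country row."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def check_classification(rows, column, expected):
    """Check that `column` exists and maps ISO2 codes to the expected values.

    A missing column fails the header check; a column shifted by one
    position fails the value check, because the wrong values end up under
    the header.
    """
    if not rows or column not in rows[0]:
        raise AssertionError(f"column {column!r} is missing")
    by_iso2 = {row["ISO2"]: row[column] for row in rows}
    for iso2, code in expected.items():
        if by_iso2.get(iso2) != code:
            raise AssertionError(
                f"{iso2}: expected {code!r}, got {by_iso2.get(iso2)!r}"
            )
    return True


def test_my_classification():
    rows = load_rows(SAMPLE_TSV)
    assert check_classification(
        rows, "my_classification", {"AM": "ARM01", "BE": "BEL01"}
    )
```

In the real test file the rows would come from the converter's own data rather than from an inline string; the point is only that the test pins a few known country/code pairs for the new column.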

Just one question: a lot of this data seems to be based on UN numeric and/or ISO2, which are already included. Please make sure not to accidentally add entries that are already in there under another name. If you only need synonyms, you can define them in "_validate_input_para" in the main file. Again, please add tests if you do.

@mabudz
Contributor Author

mabudz commented Nov 8, 2024

Thanks a lot.

Indeed, many of the codes are based on existing classifications. E.g. "unido_indstat" and "baci" use ISOnumeric, such as "051" for Armenia.
In this case we should not add these codes as an additional column, but just add them to "_validate_input_para" in the following manner? Note, though, that the code is "051" and not "51".

        alt_valid_names = {
            "ISOnumeric": ["isocode", "unido_indstat", "baci"],
            # ... other entries unchanged
        }
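
To illustrate what such an alias entry would buy, here is a minimal sketch of an alias lookup. It is not coco's actual implementation of "_validate_input_para": the function `resolve_classification` and the constant `ALT_VALID_NAMES` are hypothetical, and only the "ISOnumeric" entry is taken from this thread.

```python
# Hypothetical alias table; only the "ISOnumeric" entry comes from this thread.
ALT_VALID_NAMES = {
    "ISOnumeric": ["isocode", "unido_indstat", "baci"],
}


def resolve_classification(name, valid_columns, aliases=ALT_VALID_NAMES):
    """Return the canonical column for `name`, matching case-insensitively.

    Direct column names win over aliases; unknown names raise ValueError.
    """
    wanted = name.lower()
    for column in valid_columns:
        if column.lower() == wanted:
            return column
    for column, alt_names in aliases.items():
        if wanted in (alt.lower() for alt in alt_names):
            return column
    raise ValueError(f"unknown classification: {name!r}")


# Usage: a provider name resolves to the existing column, so no new column
# is needed in the data file.
columns = ["name_short", "ISO2", "ISOnumeric"]
assert resolve_classification("baci", columns) == "ISOnumeric"
```

An alias only renames the classification; it does not change how the code values themselves ("051" versus "51") are compared.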

Another example is "eurostat", which is based on several classifications, most of which already exist in coco:

  • "ISO2",
  • "EU27_2007"
  • something like "EU27_2020" (which is not yet in coco; and could be added)
  • subregion codes built on ISO2, e.g. "BE234" for Gent in Belgium, which are not countries (probably not to be included in coco?)

Other classifications would be new for coco, such as "prodcom", which uses the codes from Geonomenclature (GEONOM) (so I guess a name other than "prodcom" would be more appropriate), or "hybridexiobase4".

@konstantinstadler
Member

Regarding the EU, there is EU27, which is the "official" name for the new one. We have a section about that in the readme:

The situation for the EU got complicated due to the Brexit process. For the naming, coco follows the Eurostat glossary, thus EU27 refers to the EU without UK, whereas EU27_2007 refers to the EU without Croatia (the status after the 2007 enlargement). The shortcut EU always links to the most recent classification. The EEA agreements for the UK ended by 2021-01-01 (which also affects Guernsey, Isle of Man, Jersey and Gibraltar). Switzerland is not part of the EEA but member of the single market.

Generally, I would like to avoid adding columns per data provider if the provider explicitly says it uses one of the existing classifications. UN will probably use UN numeric in most cases (there is just the question of comparing as int or str).
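
The int-versus-str question can be sidestepped by normalizing both sides before comparing. The following is a minimal sketch under that assumption; the helper names are hypothetical and not part of coco.

```python
def normalize_numeric_code(code, width=3):
    """Return a numeric country code as a zero-padded string.

    Accepts ints (51) and digit strings ("51", "051", " 051 ").
    """
    if isinstance(code, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("bool is not a valid numeric code")
    if isinstance(code, int):
        value = code
    else:
        text = str(code).strip()
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"not a numeric code: {code!r}")
        value = int(text)
    if value < 0:
        raise ValueError(f"negative code: {code!r}")
    return str(value).zfill(width)


def same_numeric_code(a, b):
    """Compare two codes regardless of int/str type or zero padding."""
    return normalize_numeric_code(a) == normalize_numeric_code(b)


# "051" is the Armenia code quoted earlier in this thread.
assert same_numeric_code("051", 51)
```

Normalizing to a zero-padded string rather than to int keeps the provider's original formatting available for output.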

Hybridexio4 definitely makes sense.

Subregions are tricky. It is a bit more complicated than just adding a row for the subregion. The regular expressions would probably stop working or get exponentially more complicated. Also, linking subregions to countries is not trivial (disputed areas, different classifications across countries, etc.). I think we would be pushing the limit of what is possible with a simple table. This region/subregion relationship seems best handled in some kind of graph database? I would guess something like this must exist already.

@mabudz
Contributor Author

mabudz commented Nov 13, 2024

Alright, I created PR #159.

Regarding the subregions issue, from our point of view, it can be postponed. So the PR does not address it.
