feature: check and remove unwanted whitespaces and linebreaks in metadata #3654

Closed
lucasrodes opened this issue Nov 28, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@lucasrodes
Member

lucasrodes commented Nov 28, 2024

We often introduce unwanted linebreaks or whitespace into the metadata. These sometimes show up in the rendered FASTT or in the metadata of an indicator.

Current workflow

At the moment, to make sure that I do not introduce unwanted characters, I do the following:

  • Work on metadata, data steps.
  • After importing to the database, I go to a sample indicator's data page preview.
  • For instance, I might see the following chart, which has an unwanted linebreak before the period of the subtitle sentence:
    [Image: screenshot of the chart]
  • Then, I open the browser inspector and, in the Network tab, look for the request that returns the indicator metadata. I open it and check whether there are any other unwanted characters.
     {
         "id": 971388,
         "name": "Deaths from leukaemia among females aged 0-4 year olds\n",
         "unit": "deaths",
         "createdAt": "2024-08-07T11:04:57.000Z",
         "updatedAt": "2024-11-28T10:02:03.000Z",
         "coverage": "",
         "timespan": "2000-2021",
         "datasetId": 6662,
         "shortUnit": "",
         "columnOrder": 0,
         "shortName": "death_count__age_group_years0_4__sex_female__cause_leukaemia",
         "catalogPath": "grapher/who/2024-07-30/ghe/ghe#death_count__age_group_years0_4__sex_female__cause_leukaemia",
         "dimensions": {...},
         "descriptionShort": "Estimated number of deaths from leukaemia among females  aged 0-4 year olds\n.\n",
     ...
     }
  • We observe that not only is there a line break (\n) before the period in descriptionShort, but also:
    • Linebreaks: one at the end of both name and descriptionShort.
    • Whitespace: a double space in descriptionShort, before "aged 0-4..."
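
The three problems listed above (trailing linebreaks, double spaces, whitespace before punctuation) can be detected automatically instead of by eye. The following is a minimal sketch, not existing ETL code: the names `CHECKS`, `find_whitespace_issues` and `check_metadata` are hypothetical, and the patterns are assumptions based only on the issues seen in this example.

```python
import re

# Hypothetical checks, derived from the problems observed in the example above.
CHECKS = {
    "leading or trailing whitespace": re.compile(r"^\s|\s$"),
    "repeated spaces": re.compile(r" {2,}"),
    "whitespace before punctuation": re.compile(r"\s+[.,;:!?]"),
}


def find_whitespace_issues(value: str) -> list[str]:
    """Return the labels of all checks that match the given string."""
    return [label for label, pattern in CHECKS.items() if pattern.search(value)]


def check_metadata(metadata: dict) -> dict[str, list[str]]:
    """Map each top-level string field with problems to the issues found in it."""
    report = {}
    for field, value in metadata.items():
        if isinstance(value, str):
            issues = find_whitespace_issues(value)
            if issues:
                report[field] = issues
    return report


# Values copied from the indicator metadata shown above.
sample = {
    "name": "Deaths from leukaemia among females aged 0-4 year olds\n",
    "unit": "deaths",
    "descriptionShort": "Estimated number of deaths from leukaemia among females  aged 0-4 year olds\n.\n",
}

for field, issues in check_metadata(sample).items():
    print(f"{field}: {', '.join(issues)}")
```

Something like this could run on the rendered metadata before importing to the database, which would avoid the round trip through the data page preview. The punctuation check is deliberately naive and may flag legitimate text.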

I could also check the metadata JSON file in the data/ folder, but it contains all the Jinja syntax and is hard to read:

"death_count": {
      "title": "<% if age_group == \"ALLAges\" %>\nTotal deaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>\n<% elif age_group == \"Age-standardized\" %>\nAge-standardized deaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>\n<% else %>\nDeaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %> aged <% if age_group == \"ALLAges\" %>\nall ages\n<% elif age_group == \"age-standardized\" %>\nan age-standardized population\n<% elif age_group == \"YEARS0-14\" %>\n0-14 year olds\n<% elif age_group == \"YEARS0-4\" %>\n0-4 year olds\n<% elif age_group == \"YEARS5-14\" %>\n5-14 year olds\n<% elif age_group == \"YEARS15-19\" %>\n15-19 year olds\n<% elif age_group == \"YEARS15-49\" %>\n15-49 year olds\n<% elif age_group == \"YEARS20-24\" %>\n20-24 year olds\n<% elif age_group == \"YEARS25-34\" %>\n25-34 year olds\n<% elif age_group == \"YEARS35-44\" %>\n35-44 year olds\n<% elif age_group == \"YEARS45-54\" %>\n45-54 year olds\n<% elif age_group == \"YEARS50-69\" %>\n50-69 year olds\n<% elif age_group == \"YEARS55-64\" %>\n55-64 year olds\n<% elif age_group == \"YEARS65-74\" %>\n65-74 year olds\n<% elif age_group == \"YEARS70+\" %>\n70+ year olds\n<% elif age_group == \"YEARS75-84\" %>\n75-84 year olds\n<% elif age_group == \"YEARS85PLUS\" %>\n85+ year olds\n<% endif %>\n<% endif %>",
      "description_short": "<% if age_group == \"ALLAges\" %>\nEstimated number of deaths from << cause.lower() >> in <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>.\n<% elif age_group == \"Age-standardized\" %>\nEstimated number of age-standardized deaths from << cause.lower() >> in <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>.\n<% else %>\nEstimated number of deaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>  aged <% if age_group == \"ALLAges\" %>\nall ages\n<% elif age_group == \"age-standardized\" %>\nan age-standardized population\n<% elif age_group == \"YEARS0-14\" %>\n0-14 year olds\n<% elif age_group == \"YEARS0-4\" %>\n0-4 year olds\n<% elif age_group == \"YEARS5-14\" %>\n5-14 year olds\n<% elif age_group == \"YEARS15-19\" %>\n15-19 year olds\n<% elif age_group == \"YEARS15-49\" %>\n15-49 year olds\n<% elif age_group == \"YEARS20-24\" %>\n20-24 year olds\n<% elif age_group == \"YEARS25-34\" %>\n25-34 year olds\n<% elif age_group == \"YEARS35-44\" %>\n35-44 year olds\n<% elif age_group == \"YEARS45-54\" %>\n45-54 year olds\n<% elif age_group == \"YEARS50-69\" %>\n50-69 year olds\n<% elif` `age_group == \"YEARS55-64\" %>\n55-64 year olds\n<% elif age_group == \"YEARS65-74\" %>\n65-74 year olds\n<% elif age_group == \"YEARS70+\" %>\n70+ year olds\n<% elif age_group == \"YEARS75-84\" %>\n75-84 year olds\n<% elif age_group == \"YEARS85PLUS\" %>\n85+ year olds\n<% endif %>.\n<% endif %>",

Comments

My current workaround is a bit complex, and I can't expect everyone to do this. It can also be very time-consuming if it takes a while to bake the dataset and import it to the database.

These unwanted characters come from our "misuse" of Jinja. So, ideally, we wouldn't insert them into the ETL metadata YAML files in the first place.

But this is a bit tricky because Jinja can be confusing at times. We may need something that helps us here.

A temporary workaround could be to clean this up (remove unwanted whitespace and linebreaks) when rendering the JSON metadata files.
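
As a rough illustration of that workaround, here is a minimal sketch of a cleaning step applied to rendered strings. It is not existing ETL code: `clean_text`, `clean_fields` and `SINGLE_LINE_FIELDS` are hypothetical names, and the sketch assumes it is only applied to fields meant to be a single line (such as titles and short descriptions), since it would also collapse intentional paragraph breaks.

```python
import re

# Hypothetical set of fields that are expected to be a single line of text.
SINGLE_LINE_FIELDS = {"name", "descriptionShort"}


def clean_text(value: str) -> str:
    """Normalize whitespace in a rendered single-line string."""
    # Collapse any run of whitespace (spaces, tabs, linebreaks) into one space.
    value = re.sub(r"\s+", " ", value)
    # Drop whitespace that ended up right before punctuation.
    value = re.sub(r"\s+([.,;:!?])", r"\1", value)
    # Remove leading and trailing whitespace.
    return value.strip()


def clean_fields(metadata: dict, fields=SINGLE_LINE_FIELDS) -> dict:
    """Return a copy of the metadata with the selected string fields cleaned."""
    return {
        key: clean_text(value) if key in fields and isinstance(value, str) else value
        for key, value in metadata.items()
    }


# Values copied from the indicator metadata shown earlier in this issue.
rendered = {
    "name": "Deaths from leukaemia among females aged 0-4 year olds\n",
    "descriptionShort": "Estimated number of deaths from leukaemia among females  aged 0-4 year olds\n.\n",
    "datasetId": 6662,
}

print(clean_fields(rendered))
```

Restricting the cleaning to an explicit list of fields is a design choice to avoid touching longer text where linebreaks may be intentional.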

At the moment, I think we only do some formatting at the very last moment (when the indicator is shown on the site). Still, some unwanted characters make it through (e.g. the linebreak before the period).

@lucasrodes lucasrodes added the enhancement New feature or request label Nov 28, 2024
@lucasrodes
Member Author

Related issue: #3199

@lucasrodes
Member Author

Brought up by Fiona

@larsyencken
Collaborator

We discussed it in triage. @Marigold already shipped some improvements here, so we might close this and keep the related issues for other kinds of follow-up.
