feature: check and remove unwanted whitespaces and linebreaks in metadata #3654

lucasrodes · 2024-11-28T12:06:37Z

It's often the case that we introduce unwanted linebreaks or whitespace in the metadata. These are sometimes rendered in the FASTT or the metadata of an indicator.

Current workflow

At the moment, to make sure that I do not introduce unwanted characters, I do the following:

Work on the metadata and data steps.
After importing to the database, I go to a sample indicator's data page preview.
For instance, I might see the following chart, which has an unwanted linebreak before the period of the subtitle sentence:

Then, I'll open the browser inspector, and in the Network tab, look for the URL with the indicator metadata shown. I'll open it and explore to see if there are any other unwanted characters.

 {
     "id": 971388,
     "name": "Deaths from leukaemia among females aged 0-4 year olds\n",
     "unit": "deaths",
     "createdAt": "2024-08-07T11:04:57.000Z",
     "updatedAt": "2024-11-28T10:02:03.000Z",
     "coverage": "",
     "timespan": "2000-2021",
     "datasetId": 6662,
     "shortUnit": "",
     "columnOrder": 0,
     "shortName": "death_count__age_group_years0_4__sex_female__cause_leukaemia",
     "catalogPath": "grapher/who/2024-07-30/ghe/ghe#death_count__age_group_years0_4__sex_female__cause_leukaemia",
     "dimensions": {...},
     "descriptionShort": "Estimated number of deaths from leukaemia among females  aged 0-4 year olds\n.\n",
 ...
 }

We observe that not only is there a line break (\n) before the period in descriptionShort, but also:
- Linebreaks: one at the end of both name and descriptionShort.
- Whitespaces: a double space in descriptionShort, before "aged 0-4..."
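
The manual inspection above could in principle be automated. Below is a minimal sketch, not an existing ETL utility: the names `CHECKS`, `find_whitespace_issues` and `check_metadata` are hypothetical, and the regex patterns are an assumption about what counts as "unwanted". It flags the three problems seen in the JSON above, using a trimmed copy of that metadata.

```python
import re

# Hypothetical checks; labels and patterns are illustrative assumptions.
CHECKS = {
    "leading or trailing whitespace": re.compile(r"^\s+|\s+$"),
    "repeated spaces": re.compile(r"[ \t]{2,}"),
    "whitespace before punctuation": re.compile(r"\s+[.,;:!?]"),
}


def find_whitespace_issues(text: str) -> list[str]:
    """Return the labels of all checks that match somewhere in `text`."""
    return [label for label, pattern in CHECKS.items() if pattern.search(text)]


def check_metadata(metadata: dict) -> dict[str, list[str]]:
    """Map each top-level string field with issues to the list of issues found."""
    report = {}
    for field, value in metadata.items():
        if isinstance(value, str):
            issues = find_whitespace_issues(value)
            if issues:
                report[field] = issues
    return report


# Trimmed copy of the indicator metadata shown above.
sample = {
    "name": "Deaths from leukaemia among females aged 0-4 year olds\n",
    "unit": "deaths",
    "descriptionShort": "Estimated number of deaths from leukaemia among females  aged 0-4 year olds\n.\n",
}

report = check_metadata(sample)
for field, issues in report.items():
    print(f"{field}: {', '.join(issues)}")
```

Here `name` is flagged for its trailing linebreak and `descriptionShort` for all three checks, while `unit` passes. The patterns are deliberately strict, so fields with intentional formatting may produce false positives.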

I could also check the metadata JSON file in the data/ folder, but it contains all the Jinja syntax and is hard to read:

"death_count": {
      "title": "<% if age_group == \"ALLAges\" %>\nTotal deaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>\n<% elif age_group == \"Age-standardized\" %>\nAge-standardized deaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>\n<% else %>\nDeaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %> aged <% if age_group == \"ALLAges\" %>\nall ages\n<% elif age_group == \"age-standardized\" %>\nan age-standardized population\n<% elif age_group == \"YEARS0-14\" %>\n0-14 year olds\n<% elif age_group == \"YEARS0-4\" %>\n0-4 year olds\n<% elif age_group == \"YEARS5-14\" %>\n5-14 year olds\n<% elif age_group == \"YEARS15-19\" %>\n15-19 year olds\n<% elif age_group == \"YEARS15-49\" %>\n15-49 year olds\n<% elif age_group == \"YEARS20-24\" %>\n20-24 year olds\n<% elif age_group == \"YEARS25-34\" %>\n25-34 year olds\n<% elif age_group == \"YEARS35-44\" %>\n35-44 year olds\n<% elif age_group == \"YEARS45-54\" %>\n45-54 year olds\n<% elif age_group == \"YEARS50-69\" %>\n50-69 year olds\n<% elif age_group == \"YEARS55-64\" %>\n55-64 year olds\n<% elif age_group == \"YEARS65-74\" %>\n65-74 year olds\n<% elif age_group == \"YEARS70+\" %>\n70+ year olds\n<% elif age_group == \"YEARS75-84\" %>\n75-84 year olds\n<% elif age_group == \"YEARS85PLUS\" %>\n85+ year olds\n<% endif %>\n<% endif %>",
      "description_short": "<% if age_group == \"ALLAges\" %>\nEstimated number of deaths from << cause.lower() >> in <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>.\n<% elif age_group == \"Age-standardized\" %>\nEstimated number of age-standardized deaths from << cause.lower() >> in <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>.\n<% else %>\nEstimated number of deaths from << cause.lower() >> among <% if sex == \"Both sexes\" %>both sexes<% elif sex == \"Male\" %>males<% elif sex == \"Female\" %>females<% endif %>  aged <% if age_group == \"ALLAges\" %>\nall ages\n<% elif age_group == \"age-standardized\" %>\nan age-standardized population\n<% elif age_group == \"YEARS0-14\" %>\n0-14 year olds\n<% elif age_group == \"YEARS0-4\" %>\n0-4 year olds\n<% elif age_group == \"YEARS5-14\" %>\n5-14 year olds\n<% elif age_group == \"YEARS15-19\" %>\n15-19 year olds\n<% elif age_group == \"YEARS15-49\" %>\n15-49 year olds\n<% elif age_group == \"YEARS20-24\" %>\n20-24 year olds\n<% elif age_group == \"YEARS25-34\" %>\n25-34 year olds\n<% elif age_group == \"YEARS35-44\" %>\n35-44 year olds\n<% elif age_group == \"YEARS45-54\" %>\n45-54 year olds\n<% elif age_group == \"YEARS50-69\" %>\n50-69 year olds\n<% elif age_group == \"YEARS55-64\" %>\n55-64 year olds\n<% elif age_group == \"YEARS65-74\" %>\n65-74 year olds\n<% elif age_group == \"YEARS70+\" %>\n70+ year olds\n<% elif age_group == \"YEARS75-84\" %>\n75-84 year olds\n<% elif age_group == \"YEARS85PLUS\" %>\n85+ year olds\n<% endif %>.\n<% endif %>",

Comments

My current workaround is a bit complex, and I can't expect everyone to follow it. It can also be very time-consuming if it takes a while to bake the dataset and import it to the database.

The origin of these unwanted characters is in Jinja's "misuse". So, ideally, we wouldn't insert these into the ETL metadata YAML files.

But this is a bit tricky because Jinja can be confusing at times. We may need something that helps us here.

A temporary workaround could be to format this (remove unwanted spacings) when rendering the JSON metadata files.
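
A rough sketch of what such a formatting step might look like, assuming it is applied to each rendered string before the JSON is written. The function name `clean_text` is hypothetical, and this is not how the ETL currently does it. Linebreaks that are not followed by punctuation are kept, on the assumption that some fields use them on purpose for paragraphs.

```python
import re


def clean_text(text: str) -> str:
    """Normalize whitespace in a rendered metadata string (illustrative only)."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop any whitespace (including linebreaks) directly before punctuation.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Remove leading and trailing whitespace, including linebreaks.
    return text.strip()


raw = "Estimated number of deaths from leukaemia among females  aged 0-4 year olds\n.\n"
cleaned = clean_text(raw)
print(repr(cleaned))
# → 'Estimated number of deaths from leukaemia among females aged 0-4 year olds.'
```

The punctuation rule is aggressive and would also alter text where a space before punctuation is intentional, so it would probably need to be limited to short fields such as titles and short descriptions.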

At the moment, I think we only apply some formatting at the very last moment (when the indicator is shown on the site). Still, some unwanted characters make it through (e.g., the linebreak before the period).

lucasrodes · 2024-11-28T12:06:57Z

Related issue: #3199

lucasrodes · 2024-11-28T13:12:47Z

Brought up by Fiona

larsyencken · 2024-12-05T10:19:20Z

We discussed it in triage. @Marigold already shipped some improvements here, we might close this and keep the related issues for other kinds of follow-up.

lucasrodes added the enhancement label Nov 28, 2024

github-actions bot added the needs triage label Nov 28, 2024

Marigold mentioned this issue Nov 29, 2024

✨ Jinja whitespaces and newlines #3657

Merged

larsyencken assigned Marigold Dec 5, 2024

larsyencken removed the needs triage label Dec 5, 2024

larsyencken closed this as completed Dec 5, 2024
