Reformat semicolons in names #4755

Open
1ec5 opened this issue Dec 17, 2022 · 9 comments
@1ec5

1ec5 commented Dec 17, 2022

Expected behavior

Some features have name tags that contain multiple values separated by a semicolon. Some examples from the United States:

  • This place=town node’s name tag contains an English name and a Yiddish name separated by a semicolon. It is also tagged with name:en and name:yi, but the dual name is appropriate because of the widespread use of a language within the town that would be considered a minority language elsewhere.
  • This amenity=place_of_worship area’s name tag contains an Amharic name and an English name separated by a semicolon. Both names are signposted equally prominently and used interchangeably. No default_language tag applies in this case, because that key is intended for administrative boundaries, whereas this is a one-off feature.
  • This road’s name tag contains two English names separated by a semicolon. The road is maintained jointly by two highway departments that disagree on the name for political reasons, going as far as to post competing street name signs up and down the road. As a result, local residents also disagree on the name.

Unlike in some countries, there hasn’t historically been a consensus to separate dual names with an ad hoc delimiter such as a hyphen or slash. Instead, it’s not uncommon for mappers to use a standard semicolon value separator as they would with any other key. Apart from consistency with other keys, a semicolon is much less likely to occur within a name in reality.

A mapper who uses the semicolon delimiter would expect a renderer to reformat the semicolon in some fashion. For example, Mapbox-based maps replace each semicolon with a fancy em dash. But perhaps a more language-agnostic treatment would be to replace each semicolon with a newline, just as with refs in #750. A newline would be less ambiguous because it isn’t possible for a raw tag value to contain a newline.

Actual behavior

Unfortunately, openstreetmap-carto renders the raw name tag verbatim, including the semicolon:

[Screenshots of the rendered labels: Kaser · Debre Yibabe Kulbi Kidus Gabriel Ethiopian Orthodox Tewahedo Church · Cincinnati Columbus Cincinnati]

Without support for a semicolon delimiter, openstreetmap-carto encourages mappers to choose unpredictable delimiters instead. A previous version of the Kaser node used a slash, indistinguishable from an individual place name or POI name that contains a slash in reality. This is problematic for other data consumers, such as the router GraphHopper, that reasonably expect a semicolon delimiter.

Implementation notes

#750 splits ref on ; and recombines it with \n, primarily to choose a shield image based on the length of the longest name. However, a simple replace() could suffice for name on a point-placed label such as a place or POI.

  array_to_string(refs, E'\n') AS refs
FROM (
  SELECT
    way,
    osm_id,
    highway,
    string_to_array(ref, ';') AS refs

There’s also a very rare ;; escape sequence for cases where a single name legitimately contains a semicolon. To handle this case, the replace() call can be nested inside another replace() call that replaces \n\n with ;, or regexp_replace() can be called instead.

A newline may not be suitable within line-placed labels (roads, rivers, etc.). In these cases, perhaps an em dash could be used. Though slightly less language-agnostic than a newline, it’s still independent of the writing direction and no less ambiguous to the viewer than a hyphen or slash that’s hardcoded in the database.
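The replacement logic described above can be sketched outside of SQL to make the edge cases concrete. This is only an illustration in Python, not the style's implementation: the function names `split_names` and `format_label`, the whitespace trimming, and the sample values are all assumptions made for the example.

```python
def split_names(raw: str) -> list[str]:
    """Split a raw name value on single semicolons.

    A doubled semicolon (";;") is treated as an escaped literal ";".
    Trimming whitespace around each part is an assumption of this sketch.
    """
    parts: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(raw):
        if raw.startswith(";;", i):
            # Escape sequence: keep one literal semicolon in the current name.
            current.append(";")
            i += 2
        elif raw[i] == ";":
            # Unescaped delimiter: close the current name.
            parts.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(raw[i])
            i += 1
    parts.append("".join(current).strip())
    # Drop empty parts produced by leading or trailing delimiters.
    return [p for p in parts if p]


def format_label(raw: str, joiner: str = "\n") -> str:
    """Join the individual names with a newline (point-placed labels) or
    with another joiner such as " \u2014 " (line-placed labels)."""
    return joiner.join(split_names(raw))


if __name__ == "__main__":
    print(format_label("Foo;Bar"))                                # two lines
    print(format_label("Foo;;Bar"))                               # one name with a literal semicolon
    print(format_label("Main Street;High Street", " \u2014 "))    # em dash for line placement
```

Scanning the string once avoids the intermediate `\n\n` state that the nested `replace()` approach relies on, but either approach should give the same result for well-formed values.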

/ref #1086 #4404

@imagico
Collaborator

imagico commented Dec 17, 2022

First of all i am glad there is a renewed discussion about how OSM should treat multilingual names. Back in 2017/2018 when several of the maintainers of this style tried to suggest changes in mapping practice to address some of the fundamental difficulties in proper labeling of names (see here, here and here) the overall consensus among mappers seemed to be that people were fairly content with the status quo.

The relevant issue regarding multilingual name rendering in general by the way is #4404 - where i already commented (including contemplating the idea to explicitly remove support for the free form compound label painting in the name tag - like with / or -).

This issue is about the specific suggestion to interpret name tags as containing a semicolon-separated list of names. This would depend on how widespread this tagging is in the database. I have never seen this being used in the wild so far. If anyone could run through a planet file to see how many cases of this we have so far that would be helpful. For comparison: Having anything other than a single name in a name tag is exotic in general but the free form compound labels mentioned are moderately widespread in some cases:

To be clear: Since we never explicitly supported that kind of compound labeling string in name tag mapping, support for the semicolon separated list would IMO not compete with this to be supported by OSM-Carto. It would still need to have sufficient (and sufficiently widespread) use to be considered a tagging method that has consensus support by the global mapper community. Or to say it with different words: Back in 2014 @matkoniecz on #1086 concluded that storing anything other than a single name in the name tag is not correct tagging. This might have changed since then - but it still would require evidence that it actually has IMO. Also consensus support for the semicolon separated list would IMO mandate explicitly removing support for the free form compound labeling strings as discussed in #4404.

Independent of the question if the semicolon separated lists in the name tag have wide support from the world wide mapper community - there is also the other big problem of multilingual names that unfortunately would not be solved by adopting semicolon separated lists in the name tag, that is the Han unification problem - see #2208. Contrary to the name this is not specific to CJK but also occurs with Arabic and Cyrillic scripts - see the issues referencing #2208.

This is largely why - as mentioned in the beginning - several maintainers of this style suggested different approaches to the problem of multilingual names that would address both the problem of specifying more than one name as the locally used name (and potentially their order) and to provide information on what languages these names are actually in to allow using the correct typefaces to render them (without resorting to error prone double tagging and name matching heuristics across multiple tags)

As said - this matter is independent of the specific suggestion of this issue but i would find it unfortunate if mappers form an opinion on if and how to store multiple names in the name tag without being aware of and having had a chance to consider this other big problem of multilingual names. I would even go as far as saying that the Han unification problem is the larger problem of the two because it affects a much larger number of potential map users (the number of people living in countries where CJK/Arabic/Cyrillic are used is probably much larger than the number of people living in multilingual areas)

@imagico imagico added the text label Dec 17, 2022
@ZeLonewolf
Contributor

Regarding the specific question of the prevalence of semicolon separators, I ran the following Overpass query today:

[out:json][timeout:36000];

nwr[name~";"];
out count;

This returned the following output:

{
  "version": 0.6,
  "generator": "Overpass API 0.7.59 e21c39fe",
  "osm3s": {
    "timestamp_osm_base": "2022-12-17T14:03:07Z",
    "copyright": "The data included in this document is from www.openstreetmap.org. The data is made available under ODbL."
  },
  "elements": [

{
  "type": "count",
  "id": 0,
  "tags": {
    "nodes": "8872",
    "ways": "22077",
    "relations": "1138",
    "total": "32087"
  }
}

  ]
}

I have not done any further analysis to characterize precisely how those 32,000 objects are distributed in the database.
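As a first small step toward that characterization, the count element above can be parsed and broken down by element type. The following is a minimal Python sketch; the function name `summarize_counts` is illustrative, and the embedded JSON is trimmed to the count element from the output above. Note that Overpass reports the counts as strings, so they are converted before any arithmetic.

```python
import json

# The "count" element from the Overpass response above (other keys omitted).
OVERPASS_OUTPUT = """
{
  "elements": [
    {
      "type": "count",
      "id": 0,
      "tags": {
        "nodes": "8872",
        "ways": "22077",
        "relations": "1138",
        "total": "32087"
      }
    }
  ]
}
"""


def summarize_counts(document: dict) -> tuple[dict[str, int], dict[str, float]]:
    """Return the integer counts and each element type's share of the total."""
    tags = next(e["tags"] for e in document["elements"] if e["type"] == "count")
    counts = {key: int(value) for key, value in tags.items()}
    shares = {
        key: counts[key] / counts["total"]
        for key in ("nodes", "ways", "relations")
    }
    return counts, shares


if __name__ == "__main__":
    counts, shares = summarize_counts(json.loads(OVERPASS_OUTPUT))
    for key, share in shares.items():
        print(f"{key}: {counts[key]} ({share:.1%})")
```

In this result, ways make up the majority of the matches, which says nothing yet about what kind of features they are.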

@matkoniecz
Contributor

> Back in 2014 @matkoniecz on #1086 concluded that storing anything other than a single name in the name tag is not correct tagging. This might have changed since then - but it still would require evidence that it actually has IMO.

It is also possible that I was wrong in 2014. At the very least, it is much more complex in areas where multiple languages are in active use, sometimes without any single language dominating over the others.

(Add to that the political complexity of declaring one language as dominating over another.)

@ZeLonewolf
Contributor

Additional discussion on this topic in the community forum can be found here:
https://community.openstreetmap.org/t/multiple-delimited-names-in-the-name-tag

@imagico
Collaborator

imagico commented Dec 17, 2022

32k occurrences of ; in name tags is a starting point but i think a somewhat deeper look is in order. Quite a few tools used for imports and automated edits automatically conflate tags with semicolons so looking over these in a bit more detail (in terms of spatial distribution, feature types and origin in particular) would be good.

Based on a few quick looks around central Europe (including some actual multilingual regions) - my impression is that the main cases of semicolon in name tag in those areas are:

  • power lines where the name tag seems to be used in a similar fashion as ref on roads, i.e. the same physical line is part of several virtual lines/routes and is named based on the names of those routes - using semicolons as necessary. I am not sure if this is consensus tagging.
  • office sharing where the operator/company name is tagged on the name tag - like doctors' offices shared by several doctors or business offices shared by several companies. I am not sure if this is consensus tagging either.
  • a few cases of roads - with either different names on both sides (for which we have name:right/name:left i think), conflations of roads with different names or cases like @1ec5 mentioned where multiple names compete (though usually this is a case of name and alt_name because typically one name dominates in local use).
  • quite a large number of obvious incorrect uses of the name tag that are not names at all - lists or codes that belong in ref, address components (like street name and number) etc.

In the US there also seem to be cases of this coming from TIGER imports:

https://www.openstreetmap.org/way/5353554
https://www.openstreetmap.org/way/126300618

At a quick look i could not find any occurrence of multilingual names tagged this way. Can anyone point to an area where this is common locally?

@aighes

aighes commented Dec 18, 2022

> The relevant issue regarding multilingual name rendering in general by the way is #4404 - where i already commented on (including contemplating the idea to explicitly remove support for the free form compound label painting in the name tag - like with / or -).

I think there is no need to remove any support. It would be totally fine just to add support for ; so that mappers who actually want to use the defined delimiter get an acceptable result displayed. By acceptable I mean a delimiter between those names that humans commonly expect on maps.

Of course you won't find many ; in the name at the moment, as most mappers verify their results with OSM Carto, and in the end the beautiful map is a higher motivation than following the rule of which delimiter to use.

@matkoniecz
Contributor

matkoniecz commented Dec 20, 2022

> Back in 2014 @matkoniecz on #1086 concluded that storing anything other than a single name in the name tag is not correct tagging. This might have changed since then - but it still would require evidence that it actually has IMO.

https://community.openstreetmap.org/t/multiple-delimited-names-in-the-name-tag/6803 has some examples that convinced me otherwise

Yes, we have almost no ; in names right now - but that is a result of OSM Carto showing values using this delimiter in an extremely ugly way.

Given that ; is used remarkably consistently across OSM tags as a separator, I think in this case introducing support and then having people retag name tags would be fine - it would not be inventing/pushing a new tagging scheme but breaking a bad cycle (mappers keep knowingly mapping invalidly as part of tagging for the renderer, OSM Carto is not introducing support for rarely used tagging in this specific context, repeat).

@imagico
Collaborator

imagico commented Dec 20, 2022

Because i am not sure if i have made that clear enough: The question that we need to discuss here is not primarily if the semicolon separated list is a more suitable form to record multiple names in a single name tag than the various free form compound labeling strings entered into name tags of certain features (see #4755 (comment)). That is (a) a discussion for a different venue and (b) something that is not really relevant here at all because we never decided to support the free form compound labeling strings and would probably never have done so if we had had the opportunity to choose.

What i would like to know is if there is adoption of the semicolon separated list of names idea in tagging multiple names in cases where there is consensus among mappers that having multiple names is a suitable use of the name tag and what these names then actually mean. I listed a few cases where that might be the case in #4755 (comment) but also mentioned that it is not clear if these represent consensus tagging.

I in particular so far have not seen any evidence that there is consensus among mappers in multilingual regions, especially those subject to the Han unification problem, that the semicolon separated list is a suitable, let alone the desirable form to record multilingual names. I consider this a relevant question, because - as explained above - we know that there are more elegant (in the sense of less error prone and more flexible) ways to handle these cases. If mappers in regions where this is an important matter decide to go with the semicolon separated lists despite the known disadvantages i would consider it our obligation to support that. But so far i see the discussion dominated by people exclusively using Latin script and predominantly not from multilingual regions and i am hesitant to consider their view as representative on this matter.

The other, more formal technical question is if a semicolon separated list in the name tag is an ordered or an unordered list. If it is ordered it would compete with the name/alt_name tagging scheme - or in other words: we should consider rendering name=foo1 + alt_name=foo2 (2M uses) in the same way as this issue suggests rendering name=foo1;foo2. If it is unordered and the semicolon separated list in the name tag is only to be used when the different names have exactly the same weight as the locally used name of the feature then maybe we should consider using a different order than the one tagged (like sorting by script type and secondarily length of name).
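To make the two readings concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names, the use of the first word of a character's Unicode name as a stand-in for its script, and the alphabetical ordering of scripts. It also ignores the ;; escape sequence.

```python
import unicodedata


def script_of(text: str) -> str:
    """Approximate the script of a name from its first alphabetic character.

    The first word of the Unicode character name (e.g. "LATIN", "HEBREW",
    "CJK") is used as a crude stand-in for the script property.
    """
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return name.split(" ")[0] if name else "UNKNOWN"
    return "UNKNOWN"


def split_list(value: str) -> list[str]:
    """Split a semicolon separated value into trimmed, non-empty parts."""
    return [part.strip() for part in value.split(";") if part.strip()]


def ordered_components(tags: dict[str, str]) -> list[str]:
    """Ordered reading: keep the tagged order, with alt_name values after name,
    so name=foo1 + alt_name=foo2 yields the same list as name=foo1;foo2."""
    result: list[str] = []
    for part in split_list(tags.get("name", "")) + split_list(tags.get("alt_name", "")):
        if part not in result:
            result.append(part)
    return result


def unordered_components(tags: dict[str, str]) -> list[str]:
    """Unordered reading: ignore the tagged order and sort by script,
    then by length of the name."""
    return sorted(split_list(tags.get("name", "")), key=lambda n: (script_of(n), len(n)))


if __name__ == "__main__":
    print(ordered_components({"name": "foo1", "alt_name": "foo2"}))
    print(ordered_components({"name": "foo1;foo2"}))
    print(unordered_components({"name": "Bravo;Al"}))
```

A real implementation would need a deliberate script order rather than an alphabetical one; the point here is only that the two interpretations lead to different label layouts from the same tags.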

In any case - if we should decide to support this it would be essential that QA tools support checking if this tagging is applied in a consistent manner. That means - for multilingual names - if all the components of the semicolon separated list in the name tag are also found in individual name: tags. Because only if there are practically usable QA tools that support checking this can mappers successfully implement and maintain such a tagging scheme in their area. Does anyone know if any of the commonly used QA tools and the verifiers of editors include this kind of consistency check on name tagging?
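The consistency check described here could look roughly like the following Python sketch. This is not taken from any existing QA tool; the function name `missing_localized_names` and the key pattern used to recognize language-specific name keys are assumptions, and the ;; escape sequence is ignored for brevity.

```python
import re

# Heuristic for language-specific keys such as name:en or name:zh-Hant.
# It deliberately does not match name:left or name:right.
LANG_KEY = re.compile(r"^name:[a-z]{2,3}(-[A-Za-z0-9]+)*$")


def missing_localized_names(tags: dict[str, str]) -> list[str]:
    """Return the components of a semicolon separated name tag that do not
    appear as the value of any language-specific name:<lang> tag."""
    components = [p.strip() for p in tags.get("name", "").split(";") if p.strip()]
    localized = {value for key, value in tags.items() if LANG_KEY.match(key)}
    return [c for c in components if c not in localized]


if __name__ == "__main__":
    consistent = {"name": "Foo;Bar", "name:en": "Foo", "name:yi": "Bar"}
    incomplete = {"name": "Foo;Bar", "name:en": "Foo"}
    print(missing_localized_names(consistent))  # nothing to report
    print(missing_localized_names(incomplete))  # reports the unmatched component
```

An empty result means every component of the list is backed by a name:<lang> tag; anything returned would be a candidate for a QA warning.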

To summarize my current understanding of the use/potential use of semicolon separated lists in the name tag, there seem to be the following subtypes of this:

  • cases where multiple real world concepts with different names are conflated into one OSM feature - like power lines and shared offices.
  • roads with different names on both sides - here the tagging competes with the more commonly used name:right/name:left (28k uses), which we could think about supporting.
  • features with two names from the same language in use locally without there being a verifiable difference in prevalence in use. This is probably extremely rare.
  • features with two names from the same language in use locally but with a clear difference in prevalence. Here the tagging competes with the more commonly used alt_name (2M uses) - which we could think about supporting.
  • features in multilingual regions with several languages in use without there being a verifiable difference in prevalence in use. I suppose in those cases consensus among those who support the semicolon separated lists is that the individual names are in addition to be tagged in name:<lang>.

TL;DR: Main questions that i would appreciate input on are:

  • Are there any multilingual regions where the semicolon separated list in name tag has some adoption in use (or at least is discussed in the local multilingual community as a solution)?
  • Are there any QA tools/editor verifiers that check correct use of semicolon separated list in name tag for multilingual names?

@aighes

aighes commented Dec 21, 2022

In multilingual areas, it seems to me to be the consensus that name should contain a list of values. Consensus is listed on https://wiki.openstreetmap.org/wiki/Multilingual_names for e.g. Belgium, Germany, Hong Kong, Italy, Macau, Morocco, New Zealand, Portugal, Slovenia, Spain, Switzerland (seems only for country name).
But as mentioned above, ; is not used as a delimiter because of how name is handled by the map makers. ; is a technical delimiter, but not something you expect to see shown on the map. So the mappers have chosen the simple way of just writing the "beautiful" delimiter of their choice in name.

Based on my knowledge of German areas, multiple values in name are not on the same level as alt_name. The main reason for having two names is that they are (e.g. for political reasons) on the same level of importance. So the second value in name is not considered an alternative name.
