UTF-8 support #3

Open
programandala-net opened this issue Nov 10, 2019 · 6 comments


@programandala-net

Why is only ASCII supported? It's a surprising limitation. At first I thought it was a mistake in the manual: I thought it referred only to identifiers. But in fact, UTF-8 or even Latin-1 strings are not accepted in BASIC sources (all non-ASCII characters are removed), and only ASCII characters are accepted by the command line interpreter.

Is nuBASIC ASCII-only by design? Or is UTF-8 going to be supported (at least just to print strings, not to manipulate them) in a future version?

@eantcal
Owner

eantcal commented Nov 10, 2019 via email

@martindecker

martindecker commented Nov 14, 2019

I would prefer Latin-1 ( = ISO 8859-1 ), or a switch between ISO-8859-1 and UTF-8.
The reason is: My editors assume that .bas-Files are encoded in ISO 8859-1.

@programandala-net
Author

> I would prefer Latin-1 ( = ISO 8859-1 ), or a switch between ISO-8859-1 and UTF-8.

I'm not sure what you mean by "switch".

> The reason is: My editors assume that .bas-Files are encoded in ISO 8859-1.

That is a problem of your editors ;)

Unicode is the way to go, and UTF-8 is its most practical encoding at the moment. Of course, it raises the question of how the BASIC string functions should behave, but they could keep working on bytes as usual. The point is to accept and print UTF-8 strings.
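
A minimal sketch of what "string functions working on bytes" would mean for UTF-8 text. The helpers `byte_len` and `byte_left` are hypothetical stand-ins for BASIC's `LEN` and `LEFT$`, not nuBASIC's actual implementation; the sketch only illustrates the trade-off.

```python
# Hypothetical stand-ins for BASIC's LEN and LEFT$, operating on raw bytes.
def byte_len(s: bytes) -> int:
    """Length in bytes, not in characters."""
    return len(s)

def byte_left(s: bytes, n: int) -> bytes:
    """First n bytes; may cut a multi-byte UTF-8 character in half."""
    return s[:n]

word = "año".encode("utf-8")          # 3 characters, 4 bytes: ñ takes two
print(byte_len(word))                 # 4
print(byte_left(word, 1))             # b'a'
print(word.decode("utf-8") == "año")  # True: the bytes round-trip untouched
```

Storing and printing works as long as the bytes are passed through unchanged; only slicing at an arbitrary byte offset can split a character.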

But anyway ISO 8859-1 is better than nothing: it would make nuBASIC useful to write programs in a few European languages other than English.

@programandala-net
Author

> Such limitation was simplifying the implementation, but maybe now I can improve it.

Thanks. I understand ASCII was enough for your initial scope, but it makes the language pretty useless for more general use.

@martindecker

martindecker commented Nov 14, 2019

Hello, here is my answer regarding .bas source files.

> I'm not sure what you mean by "switch".

In Python, the source code encoding is declared in a comment on line 1 or 2, like this:
#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
or
#!/usr/bin/env python
# -*- coding: utf-8 -*-
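
How an interpreter could read such a declaration can be sketched as follows. The pattern is modelled on Python's coding comment; the function name `declared_encoding`, the ASCII default, and the `#` comment marker are assumptions for illustration (a BASIC dialect would use its own comment syntax, or a dedicated command).

```python
import re

# Pattern modelled on Python's "coding" comment; matches e.g.
# "# -*- coding: utf-8 -*-" and captures the encoding name.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

def declared_encoding(source: bytes, default: str = "ascii") -> str:
    """Return the encoding named in the first two lines, or `default`."""
    for line in source.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default

latin1_src = b"#!/usr/bin/python\n# -*- coding: iso-8859-1 -*-\n"
print(declared_encoding(latin1_src))  # iso-8859-1
```

A file without any declaration falls back to the default, so existing ASCII-only sources would keep working unchanged.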

It could also be a new BASIC command; why put important information into comments?

Another possibility for a switch is the byte order mark (BOM): present = UTF-8, absent = Latin-1 (or see my following post).
But does the BOM conflict with the shebang mechanism on Linux?
In a VB.NET source file, I found the byte order mark (EF BB BF) at the beginning.
But VB.NET has no shebang.
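
The BOM rule and the shebang concern can be sketched as below. The function names and the interpreter path in the usage note are hypothetical. The shebang check assumes the usual behavior that a script is recognized only when its very first two bytes are `#!`, in which case a leading BOM would hide the shebang.

```python
import codecs

def pick_encoding(data: bytes) -> str:
    """BOM present -> UTF-8, absent -> Latin-1 (the rule proposed above).

    "utf-8-sig" is Python's UTF-8 codec that drops a leading BOM on decode.
    """
    return "utf-8-sig" if data.startswith(codecs.BOM_UTF8) else "latin-1"

def starts_with_shebang(data: bytes) -> bool:
    """True only if the very first two bytes are '#!'."""
    return data[:2] == b"#!"
```

For example, with the hypothetical script `codecs.BOM_UTF8 + b"#!/usr/bin/nubasic\n"`, `pick_encoding` returns `"utf-8-sig"` while `starts_with_shebang` returns `False`, which is exactly the conflict asked about.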

Strings

In bigger projects, the language-specific string constants are kept in "resource" or external files.
Thanks, Antonio, for writing the software. One thing at a time: perhaps first do something about the source-code question.
There is a "Unicode for C++23" proposal:
https://www.youtube.com/watch?v=3utLG0Qm1Ek
Currently, strings are encoding-agnostic: "what comes in, goes out" is my experience with the nuBASIC command Input#. For the term, see
https://stackoverflow.com/questions/30277095/whats-the-definition-of-encoding-agnostic
This behavior was useful for me.

Currently, nuBASIC strings are 8-bit sequences! Only the source code is treated as 7-bit.
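
The difference between the two behaviors reported in this thread can be illustrated with a small sketch. It is not nuBASIC's code: `strip_non_ascii` and `pass_through` are hypothetical helpers that only mimic the observed effects (non-ASCII bytes removed from source text, runtime strings left untouched).

```python
def strip_non_ascii(source: bytes) -> bytes:
    """Drop every byte >= 0x80, as reported for BASIC source text."""
    return bytes(b for b in source if b < 0x80)

def pass_through(data: bytes) -> bytes:
    """Encoding-agnostic handling: bytes come out exactly as they went in."""
    return data

line = 'PRINT "España"'.encode("utf-8")
print(strip_non_ascii(line))       # b'PRINT "Espaa"'
print(pass_through(line) == line)  # True
```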

@martindecker

martindecker commented Nov 15, 2019

I assume that in and around Russia there is a huge number of cp1251-encoded files, and so on.
On a more abstract level, only two kinds of encoding are relevant in today's world:

  1. Some 256-character encoding ("extended ASCII", 8 bits per character), determined by environment variables or locale settings for editors and the terminal.
  2. UTF-8.

So the switch could also be between those two possibilities.
7-bit ASCII is the common subset of both.
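
A two-way switch of this kind can be sketched as a single decoding function. The name `decode_source`, the `use_utf8` flag, and the `legacy` parameter are assumptions for illustration; in practice the legacy encoding would come from the locale or an explicit setting.

```python
def decode_source(data: bytes, use_utf8: bool, legacy: str = "latin-1") -> str:
    """Decode source bytes as UTF-8 or as one fixed 8-bit encoding."""
    return data.decode("utf-8" if use_utf8 else legacy)

ascii_src = b'PRINT "HELLO"'
# Pure ASCII decodes to the same text under either setting.
print(decode_source(ascii_src, True) == decode_source(ascii_src, False))  # True
```

Because ASCII is a subset of both, existing ASCII-only programs would be unaffected by the position of the switch.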
