UTF-8 support #3

programandala-net · 2019-11-10T13:19:40Z

Why is only ASCII supported? It's a surprising limitation. At first I thought it was a mistake in the manual: I thought it referred only to identifiers. But in fact UTF-8 or even Latin-1 strings are not accepted in BASIC sources (all non-ASCII characters are removed). And only ASCII characters are accepted by the command line interpreter.

Is nuBASIC ASCII-only by design? Or is UTF-8 going to be supported (at least just to print strings, not to manipulate them) in a future version?

eantcal · 2019-11-10T15:45:54Z

Hi Marcos. You are right. This is a limitation I can remove. It is on my to-do list. Originally, when I started nuBASIC, the purpose was to provide an example for my programming courses. I had in mind a kind of 80s interpreter, so supporting just ASCII was enough for that purpose. I will need to improve the tokenizer, which is responsible for reading the input and transforming it into tokens. This limitation simplified the implementation, but maybe now I can improve it. Thank you for your suggestion. Kind regards, Antonino

martindecker · 2019-11-14T11:30:54Z

I would prefer Latin-1 (= ISO 8859-1), or a switch between ISO 8859-1 and UTF-8.
The reason is that my editors assume .bas files are encoded in ISO 8859-1.

programandala-net · 2019-11-14T19:41:42Z

> I would prefer Latin-1 (= ISO 8859-1), or a switch between ISO 8859-1 and UTF-8.

I'm not sure what you mean by "switch".

> The reason is that my editors assume .bas files are encoded in ISO 8859-1.

That is a problem with your editors ;)

Unicode is the way to go, and UTF-8 is its most practical encoding at the moment. Of course, it raises the question of how the BASIC string functions should behave, but they could keep working on bytes as usual. The point is to accept and print UTF-8 strings.
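To illustrate the byte-oriented idea, here is a minimal Python sketch. It is not nuBASIC code: `byte_len` and `byte_left` are hypothetical stand-ins for BASIC functions such as LEN and LEFT$. It shows that UTF-8 text can be stored and passed through untouched while the string functions count bytes, with the known trade-off that a slice may cut a multi-byte character.

```python
# Sketch only: models byte-oriented BASIC string functions in Python.
# byte_len and byte_left are hypothetical names, not nuBASIC functions.

def byte_len(s: bytes) -> int:
    """LEN-like: counts bytes, not characters."""
    return len(s)

def byte_left(s: bytes, n: int) -> bytes:
    """LEFT$-like: first n bytes; may split a multi-byte character."""
    return s[:n]

word = "año".encode("utf-8")   # 'ñ' occupies two bytes in UTF-8

print(byte_len(word))          # 4 bytes for 3 characters
print(word)                    # b'a\xc3\xb1o' (stored bytes are unchanged)
print(byte_left(word, 2))      # b'a\xc3' (the 'ñ' has been cut in half)
```

Output stays correct as long as whole strings are printed; only byte-level slicing can produce invalid UTF-8.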

But anyway, ISO 8859-1 is better than nothing: it would make nuBASIC useful for writing programs in a few European languages other than English.

programandala-net · 2019-11-14T19:50:18Z

> This limitation simplified the implementation, but maybe now I can improve it.

Thanks. I understand ASCII was enough for your initial scope, but it makes the language pretty useless for more general use.

martindecker · 2019-11-14T23:31:52Z

Hello, here is my answer regarding .bas source files.

> I'm not sure what you mean by "switch".

In Python, the source code encoding is declared in a comment on the first or second line (line 2 here, after the shebang), in the following way:
#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
or
#!/usr/bin/env python
# -*- coding: utf-8 -*-

It could also be a new BASIC command: why put important information into comments?
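A rough sketch of how such a declaration could be detected, written in Python. The regular expression is the one given in PEP 263. The function name `declared_encoding` and the Latin-1 default are assumptions made for this illustration; a BASIC interpreter would look for its own comment marker or a dedicated command instead of `#`.

```python
import re

# Pattern from PEP 263; matches e.g. "# -*- coding: utf-8 -*-".
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(source: bytes, default: str = "iso-8859-1") -> str:
    """Return the encoding named in the first two lines, else the default.

    Hypothetical helper: the default is an assumption of this sketch,
    not documented nuBASIC behaviour.
    """
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default

example = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nprint(1)\n"
print(declared_encoding(example))        # utf-8
print(declared_encoding(b"print(1)\n"))  # iso-8859-1 (falls back to the default)
```

Because the declaration itself is pure ASCII, it can be read before the encoding of the rest of the file is known.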

Another possibility for a switch is the byte order mark (BOM): present = UTF-8, absent = Latin-1 (or see my following post).
But does the BOM conflict with the shebang mechanism on Linux?
In a VB.NET source file, I found the byte order mark (EF BB BF) at the beginning.
But VB.NET has no shebang.
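The BOM switch could look roughly like this in Python. The function names are hypothetical, and the Latin-1 fallback simply follows the rule proposed above. On the shebang question: as far as I know, Linux treats a file as a script only when its very first two bytes are `#!`, so a BOM placed in front of the shebang would hide it; the second helper just checks that byte condition.

```python
import codecs
from typing import Tuple

def split_bom(data: bytes) -> Tuple[str, bytes]:
    """Return (encoding, payload): UTF-8 if a BOM is present, else Latin-1."""
    if data.startswith(codecs.BOM_UTF8):          # the bytes EF BB BF
        return "utf-8", data[len(codecs.BOM_UTF8):]
    return "iso-8859-1", data

def starts_with_shebang(data: bytes) -> bool:
    """True only if the very first two bytes are '#!'."""
    return data.startswith(b"#!")

print(split_bom(b"\xef\xbb\xbfPRINT 1"))   # ('utf-8', b'PRINT 1')
print(split_bom(b"PRINT 1"))               # ('iso-8859-1', b'PRINT 1')
print(starts_with_shebang(codecs.BOM_UTF8 + b"#!/bin/sh\n"))  # False
```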

Strings

In bigger projects, the language-specific string constants are in "resource" or external files.
Thanks, Antonino, for writing the software. One thing at a time: perhaps first do something about the source code question.
There is a "Unicode for C++23" proposal:
https://www.youtube.com/watch?v=3utLG0Qm1Ek
Currently strings are encoding-agnostic ("what comes in, goes out"); that is my experience with the nuBASIC command Input#, in the sense described in
https://stackoverflow.com/questions/30277095/whats-the-definition-of-encoding-agnostic
And this was useful for me.

Currently, nuBASIC strings are 8-bit sequences! Only the source code is treated as 7-bit.
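A small Python illustration of the 7-bit behaviour reported in this thread for source files (all non-ASCII characters removed). This only models the observed effect; it is not taken from the nuBASIC tokenizer, and `strip_to_7bit` is a hypothetical name.

```python
def strip_to_7bit(source: bytes) -> bytes:
    """Drop every byte with the high bit set (values 0x80 and above)."""
    return bytes(b for b in source if b < 0x80)

latin1_line = 'PRINT "Señor"'.encode("iso-8859-1")  # 'ñ' is one byte, F1
utf8_line = 'PRINT "Señor"'.encode("utf-8")         # 'ñ' is two bytes, C3 B1

print(strip_to_7bit(latin1_line))  # b'PRINT "Seor"'
print(strip_to_7bit(utf8_line))    # b'PRINT "Seor"'
```

Either encoding loses the accented letter in a source literal, whereas data read at run time keeps all 8 bits.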

martindecker · 2019-11-15T11:21:47Z

I assume that in and around Russia there is a huge number of cp1251-encoded files, etc.
On a more abstract level, only two kinds of encoding are relevant in the world today:

Some 256-character encoding ("extended ASCII", 8 bits per character), determined by environment variables or locale settings for editors and the terminal.
UTF-8.

So a switch could also be between those two possibilities.
7-bit ASCII is the common subset of both.
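One possible way to sketch such a two-way switch in Python: try strict UTF-8 first and fall back to an 8-bit code page. The function name `decode_source` and the cp1251 default are assumptions chosen for illustration; in practice the fallback would come from the locale or an explicit setting. Note that this is a heuristic, since some 8-bit byte sequences also happen to be valid UTF-8.

```python
def decode_source(data: bytes, fallback: str = "cp1251") -> str:
    """Try strict UTF-8 first; otherwise use the 8-bit fallback code page."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)

# Pure ASCII decodes identically under all of these encodings.
ascii_only = b'PRINT "Hello"'
assert (ascii_only.decode("utf-8")
        == ascii_only.decode("cp1251")
        == ascii_only.decode("iso-8859-1"))

# The same text in either encoding yields the same string.
greeting = "Привет"
assert decode_source(greeting.encode("utf-8")) == greeting
assert decode_source(greeting.encode("cp1251")) == greeting
```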
