Self hosting tokenizer #9

Open
wants to merge 20 commits into
base: master
nickd4
@nickd4 nickd4 commented Apr 24, 2022

More hacking... What I set out to do was make the seedForth tokenizer self-hosting, so that after bootstrap you would not need gForth to develop applications. My idea was to make the tokenizer work in gForth as it does now (for bootstrapping) and also work in the seedForth interactive version (for application development). It turned out to be quite difficult, but ultimately it works.

The actual changes to seedForth-tokenizer.fs to make it run under seedForth were not that large: mainly a matter of accounting for seedForth's case sensitivity and its restricted syntax for hex and character literals, as well as minor differences in the words available (parse-name instead of <name>, etc.). The larger difficulty was in making a seedForth or seedForthInteractive program run cleanly as a filter. I had to modify the runtime library and I/O system a lot.

There was also another issue to deal with, which concerns the wrapping of the *.seed and *.seedsource files. Originally the input was wrapped in PROGRAM / END, and the output was wrapped with an automatic bye token added at the end. I have removed the need for all of this wrapping, at the cost of making the gForth version of the tokenizer slightly more awkward to invoke. Since this is only done from the Makefile during bootstrap, that's not a big deal. It has only just occurred to me that the unusual extension *.seedsource was probably due to the wrapping, so maybe we can rename them to *.forth now?

Here is a detailed summary of all the changes I have made to support the self-hosting tokenizer:

  • Create the ./seedForth-tokenizer script, which operates as a filter and takes a *.seedsource file on stdin and outputs the corresponding *.seed file on stdout. It works similarly to ./seed by concatenating the various input files into ./seedForth.
  • Build the functionality of cat into every compiled preForth/seedForth application, so it will either process stdin if there are no command line arguments, or else open and read each file specified on the command line in sequence, where - is stdin. The effect of this change is that when running ./seed you no longer need to press Enter after typing bye to make seedForth quit. The extra keystroke was needed to force the front-end cat invocation to try to send something to seedForth and then it would realize the pipe was broken and quit. With seedForth managing its own input, you can quit cleanly.
  • Make the key? word use the poll() system call rather than ioctl(), since we now expect that stdin might come from a file.
  • Implement an eemit word throughout the system which is the same as emit but writes to stderr. I use this for debugging.
  • Remap the tokens key and emit to higher token numbers, to make it easier to detect the EOT character which used to correspond to the key token. Implement a new eot token at 4 which is similar to the bye token. The reason for this change is that the bye token was overloaded for use as [, i.e. it would restart the interpreter after compiling the ; token and during certain control flow constructs. This meant you couldn't compile a bye token into a program. Moving the original usage of bye onto the new eot token means bye is no longer special and can be compiled normally, while the changes to the existing system stay minimal. As a bonus, if input runs out during a :-definition, the resulting EOT will be interpreted as [ and send us back to interpretive state, where a further EOT is considered invalid and quits the interpreter too.
  • Split out the essential definitions from seedForthInteractive.seedsource into a new seedForthRuntime.seedsource and from hi.forth into a new runtime.forth. The original files seedForthInteractive.seedsource and hi.forth still exist and contain all the tests as well as less essential words like sqr, which you can grab if you actually need them. To use the system as it was previously, you have to tokenize seedForthRuntime.seedsource + seedForthInteractive.seedsource into seedForthInteractive.seed and then pass it runtime.forth + hi.forth; the Makefile and ./seed script have been updated accordingly. When running the self-hosted tokenizer, however, it uses a different *.seed file, which is basically generated by tokenizing seedForthRuntime.seedsource + a call to boot, and it uses runtime.forth without the hi.forth part.
  • Increase tib from 80 to 255 characters, and fix a bug in accept which allowed it to write one character beyond the tib. Note that some source lines in the original system were longer than 80 characters, and I think they may have been silently truncated without the incomplete code being noticed. The extra character seemed to cause a crash on my Z80 port, which alerted me to the issue. There isn't really a good way to flag too-long lines to the user, but I have at least made it not echo any extra characters.
  • Hack on the accept, refill and restart words to detect EOT and return something or quit. This is needed to prevent the self-hosted tokenizer from hanging after it tokenizes all the input. I'm not entirely happy with the solutions I came up with here, and I think possibly the entire concept of using EOT as a marker for the end of input might be flawed. Could we make key throw an exception instead? I'm not a very experienced Forth programmer so I don't really know how this would be done conventionally. But at any rate, you can now exit from ./seed by typing Ctrl-D (no Enter) or bye, and I find the first more comfortable. Just be aware that a partial last line is not supported, either in a *.seedsource file or a ./seed session; it will say "not found".
  • Make the initial state of echo and input-echo be off. That's because after loading a seedForth runtime from *.seed file, you will always want to load further runtime as textual Forth source. So it's cleaner to let the second runtime enable echo. This makes the output of ./seed cleaner as well. But primarily it's needed to avoid junk getting into the tokenized *.seed files.
  • Implement DO/?DO/LOOP, as the experimental ?DO that was commented out didn't have a correct companion LOOP.
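
To illustrate the built-in cat behaviour described in the list above, here is a minimal Python sketch of the input-selection rule only: no arguments means stdin, otherwise each named file is read in sequence, and - stands for stdin. The function name read_inputs and its stdin parameter are hypothetical; the real change lives in the preForth/seedForth runtime, not in Python.

```python
import sys

def read_inputs(args, stdin=None):
    """Yield the input one byte at a time, in cat order.

    args  -- command line arguments (file names); empty means stdin only
    stdin -- binary stream used for "-" (assumption: defaults to sys.stdin.buffer)
    """
    stdin = stdin if stdin is not None else sys.stdin.buffer
    names = args if args else ["-"]
    for name in names:
        if name == "-":
            # "-" means: consume stdin at this position in the sequence
            yield from iter(lambda: stdin.read(1), b"")
        else:
            with open(name, "rb") as f:
                yield from iter(lambda: f.read(1), b"")
```

Because the program opens its own inputs, there is no front-end cat process left waiting on a broken pipe, which is the effect the list item describes for ./seed.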
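
The key? change can be pictured with a zero-timeout poll(). The sketch below uses Python's select.poll (available on Unix-like systems, not on Windows) as a stand-in for the system call; key_available is a hypothetical name, and the actual seedForth word is implemented in the generated assembly, not like this.

```python
import select

def key_available(fd, timeout_ms=0):
    """Return True if reading from fd would not block.

    With timeout_ms=0 the call returns immediately, which is what a
    key?-style query needs. POLLHUP is treated as "ready" as well, so
    that a closed pipe is noticed by the next read.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(ev & (select.POLLIN | select.POLLHUP) for _, ev in events)
```

poll() accepts pipes, terminals and regular files alike, and a regular file is always reported as readable, so the same check works whether stdin is interactive or redirected.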
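
The tib bounds fix and the EOT detection in accept can be sketched together. This is an illustration under assumptions, not the code in this PR: EOT is taken to be character 4 (ASCII End of Transmission, what Ctrl-D sends), read_char is a hypothetical callable that returns EOT once input is exhausted, and what the caller does with a partial last line is left open.

```python
EOT = 4          # ASCII End of Transmission (assumed end-of-input marker)
NEWLINE = 10
TIB_SIZE = 255   # terminal input buffer size after this change

def accept(read_char, buf, limit=TIB_SIZE):
    """Read one line into buf and return (count, saw_eot).

    Never stores more than `limit` characters: the check is made before
    each store, so index `limit` itself is never written.
    """
    count = 0
    while True:
        c = read_char()
        if c == EOT:
            return count, True     # input ran out; let the caller decide
        if c == NEWLINE:
            return count, False    # normal end of line
        if count < limit:
            buf[count] = c
            count += 1
        # characters past the limit are dropped (and would not be echoed)
```

Returning a flag instead of storing EOT in the buffer lets words like refill decide whether to quit, which is one way to avoid the hang described above.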

Some of the more detailed changes might not be well explained in the above summary, or might be objectionable for whatever reason, so please feel free to check with me. Also, keep in mind that this changeset is "on top of" the previous changeset that I PR'ed the other day, so github will show both changesets. It's annoying the way github does this, and it does not recalculate the changeset after you merge the first PR. But you can force it to, by changing the base branch name and then changing it back.

I had a really good time doing this, even though it involved a lot of head-scratching and dealing with strange crashes and errors and unexpected behaviour. As I mentioned I'm not an experienced Forth programmer, but I've become more conversant with it.

Note: There is a minor bug in this PR: I directly invoked gforth in the Makefile instead of $(HOSTFORTH). It is fixed in #12 so I have not fixed it here. If you do want the fixed version of this PR, see the branch self_hosting_tokenizer1 in my github account. I wouldn't recommend using that branch though, because it will cause conflicts later when merging #12 and others.

…to bss rather than text section which avoids the need to call mprotect(), rename things
… be wrapped with PROGRAM / END, also removes automatic bye token that was generated by END
…time.seedsource, so that we can run textual forth code without the tests or the banner
… writes to stderr, fix self-hosted tokenizer termination issue (was debugged with eemit)
@nickd4 nickd4 force-pushed the self_hosting_tokenizer branch from e3b1a9b to 7115f49 Compare May 1, 2022 00:23
2 participants