Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getty/serial test report #515

Closed
Mellvik opened this issue Apr 3, 2020 · 73 comments
Closed

getty/serial test report #515

Mellvik opened this issue Apr 3, 2020 · 73 comments
Labels
bug Defect in the product

Comments

@Mellvik
Copy link
Contributor

Mellvik commented Apr 3, 2020

Summary: Close, but no cigar
Testing on physical HW & virtualbox, the latter turns out to lend itself well to serial testing: Serial line set up to listen to a TCP port, which can be connected to using netcat.

  1. There is a problem with the serial driver (or maybe somewhere in the call chain from open() and down to physical) as we've seen before: sash < /dev/ttyS0 > /dev/ttyS0 & should work. When it eventually does, we're almost there. The error message is Cannot create /dev/ttyS0: error 16. Replacing sash with getty or login root doesn't make any difference. Neither does changing the shell we're running from. Replacing ttyS0 with tty2 or tty3 (virtual consoles) works fine. Further - sash < /dev/ttyS0 & works, takes input from serial, output to the console. And vice versa if using ttyS0 for output instead. Seemingly, when the device can accept several opens, this is going to work. (?)
  2. There seem to be a stdio-problem and possibly a file descriptor inheritance problem in getty. The device in arg[1] is never opened by getty. Thus starting getty writes to and reads from the console regardless. More about that below.
  3. Starting getty from inittab 'hangs' (loops) the system. All testing is done from the command line.
  4. Opening the device explicitly in getty and then replacing stdio 0,1,2 before the execv using dup2() works. There are some minor problems with the terms settings - to be sorted out. I cannot figure out why dup2 is accepted but not the shell redirects.

--Mellvik
Skjermbilde 2020-04-03 kl  11 04 32

@Mellvik Mellvik added the bug Defect in the product label Apr 3, 2020
@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 3, 2020

UPDATE:
Some adjustments to getty, and serial login is working via inittab.

Still some minor issues (commands echoed twice), but seemingly - when started from inittab, the device gets opened only once, and we're OK.

Great work, @ghaerr !!

@ghaerr
Copy link
Owner

ghaerr commented Apr 3, 2020

Some adjustments to getty, and serial login is working via inittab.

Good news! What kind of changes were made to getty?

A few explanations on your statements above:

The error message is Cannot create /dev/ttyS0: error 16.

Errno 16 is device busy, /dev/ttyS0 is protected against being opened twice.
Trying to run a shell with both 0 and 1 redirected isn't really the way to do it.

There seem to be a stdio-problem and possibly a file descriptor inheritance problem in getty. The device in arg[1] is never opened by getty.

Long story short on this is that /bin/init opens the argument to getty. There is some special processing of /etc/inittab by 'init' and things aren't as simple as they seem. See 'sys_utils/init.c' for details. We probably shouldn't change init processing at this point.

After the device is opened by init, it execs the inittab-specified program, usually getty, and now getty sets the termio struct as well as the baud rate from the command line (but the device is already opened).

Starting getty from inittab 'hangs' (loops) the system.

Is this fixed?

Opening the device explicitly in getty and then replacing stdio 0,1,2 before the execv using dup2() works.

/bin/init opens the device initially as stated above. I would like to see specifically what was required to get this working. 'init' does the same thing replacing 0,1,2 using dup2.

I cannot figure out why dup2 is accepted but not the shell redirects.

That's the protection in the serial driver that prohibits multiple opens. 'dup2' just duplicates a file descriptor in the kernel without re-opening the device.

To summarize - the changes in the device open and dup2 calls you've made to get this to work may need to be changed in 'init', not 'getty'.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 4, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Apr 4, 2020

When everything works, your fixed getty works perfectly.
unless everything on the hardware side is perfect.

What exactly is not working with my last PR? Is it just that it can't be debugged easily when the hardware side isn't perfect?

My adjustments are proposals, because as is, the serial/getty/inittab combination is impossible to debug

Agreed. I added a DEBUG option to both init and getty for this reason. Try turning them on (start with one at a time), it will help a lot. That was the only way I was able to see what was actually going on, finally.

The unix way is allowing multiple opens and then requesting exclusivity when required (and if possible).

Ok. The tty code could be changed on default to allow duplicate opens (at least for root), and the ktcp daemon could use O_EXCL on open to protect itself. I can add both those things.

while Linux seem to have deviated with mingetty.

I'm not that familiar with mingetty, and ELKS has one too. I'd say we should probably stick with getty for now and just get the few things needed to get this working.

Since I now know getty, that way seems simpler ...

Is there something that needs to be added to getty? What exactly?

Thanks for your testing, we're almost there!

@ghaerr
Copy link
Owner

ghaerr commented Apr 4, 2020

elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available.

What do you mean 'terminal window'? You mean from a /dev/ttyS0 line? Or from ELKS while connected to another system using 'miniterm'?

I suspect the reason may have to do with an ioctl to get the window size not being implemented on a non-/dev/ttyX line. Let me know more details and we'll figure it out.

@ghaerr
Copy link
Owner

ghaerr commented Apr 5, 2020

Finally - sort of unrelated, but related anyway: elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available. I have that on my list, but if you immediately have a hunch about the problem, let me know.

I spent some time tracking this down... /bin/init calls getty which calls login. Login sets the USER=, SHELL=, HOME= and TERM=ansi environment variables, then execs the shell specified in /etc/passwd. There is no other way that elvis gets the LINES/COLS. None of SIGWINCH nor the TIOCGWINSZ ioctl are implemented in ELKS. If login doesn't run, elvis won't work unless you set TERM=ansi explicitly beforehand.

Thus, I think your issue is that elvis is being run when the entire init->getty->login->shell sequence didn't happen. When it happens again, run printenv and take a look.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 5, 2020 via email

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 5, 2020

BTW - this is from a Virtualbox connection via netcat. In my tribe language, this is a 'terminal window'. I haven't checked why we have the extra 'init's - coming back to that.
Skjermbilde 2020-04-05 kl  10 51 05

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 5, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Apr 5, 2020

As suggested in the previous message, adding the ability to run getty from the command line turns it into a useful tool for debugging serial connections.

Agreed. Please post your modified getty.c here, along with exactly how you want to run it for debugging serial connections and I'll look at it.
We need to be careful as there are two other serial line issues not debugged: the init recycling an elvis not running.

@ghaerr
Copy link
Owner

ghaerr commented Apr 5, 2020

For the elvis issue and your screenshot: Thanks, very strange.

Lets debug the multiple init processes first - this is definitely because an /etc/inittab process died or failed to start and init is re-running it. Please send over a screenshot of your /etc/inittab or post it. You might try running init in DEBUG mode first, to get a better idea of what is failing.

@ghaerr ghaerr mentioned this issue Apr 5, 2020
@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 5, 2020

[moved from #523]
@Mellvik:

I'm not sure this is a problem, but it makes me suspicious.
Serial login testing means among other things running ps(1) frequently. I'm noticing process numbers increasing fast while I'm not running many processes. Also, most of the time I see extra init processes in the ps list (see screen dump).

@ghaerr:

This is most definitely a problem. The reason you don't see anything is that I fixed the process IDs from going negative so the system continues without you noticing. The issue is exactly the same as before - init is waiting on an inittab child to exit, and restarting it... very quickly. I suspect this is your getty line, but don't know what your /etc/inittab looks like.

I've added debugging to init so this can be traced - set DEBUG 1 in init.c and recompile, that will show the offending line and we can debug from there. There are lots of reasons that a getty line could be failing, I'm wondering whether /dev/ttyS0 is already in use, as that would cause the open to fail.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 5, 2020

That's the strange thing: I'm diagnosing this from the serial line login window. If you look closely at the screen clip you can see that there are logins (shells) @ two different terminals. Also, I'm using the distribution inittab unmodified with the exception of the comment removed for ttyS0.

I'll test more with init debug on.
-Mellvik

@ghaerr
Copy link
Owner

ghaerr commented Apr 5, 2020

Beware, 'init' may need to be debugged from the console (/dev/tty1). There is some code in main() that opens DEVTTY (tty1). I don't recommend changing that to run off /dev/ttyS0, as that will cause the open to fail due to multiple opens.

If you think that perhaps /dev/ttyS0 multiple opens being prohibited might be the cause of all this mess, comment out lines 258-259 in elks/arch/i86/drivers/char/serial.c - that will end the problem of serial line multiple opens.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 6, 2020 via email

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 7, 2020

@ghaerr:

Please post your modified getty.c here, along with exactly how you want to run it for debugging serial connections and I'll look at it.

Here we go:

diff --git a/elkscmd/sys_utils/getty.c b/elkscmd/sys_utils/getty.c
index 4403e7e5..9060b7d5 100644
--- a/elkscmd/sys_utils/getty.c
+++ b/elkscmd/sys_utils/getty.c
@@ -1,5 +1,5 @@
 /*
- * elkscmd/sysutils/getty.c
+ * elkscmd/sys_utils/getty.c
  *
  * Copyright (C) 1998 Alistair Riddoch <[email protected]>
  *
@@ -67,7 +67,7 @@
 
 char *	progname;
 char	Buffer[64];
-int	ch, col = 0, fd;
+int	ch, col = 0, fd, tty;
 
 void consolemsg(const char *str, ...)
 {
@@ -214,21 +214,48 @@ int main(int argc, char **argv)
     int n;
     speed_t baud = 0;
     struct termios termios;
+    pid_t ppid;
 
     progname = argv[0];
     signal(SIGTSTP, SIG_IGN);		/* ignore ^Z stop signal*/
 
-    if (argc < 2 || argc > 3) {
+    ppid = getppid();			/* if parent is init, usage() makes no sense */
+    if (ppid > 1 && (argc < 2 || argc > 3)) {
 	consolemsg("Usage: %s device [baudrate]\n", argv[0]);
 	exit(3);
     }
 
-    if (argc == 2) debug("'%s'\n", argv[1]);
-    else if (argc == 3) {
+    if (argc == 2) {
+	debug("info -- arg 2 is '%s'\n", argv[1]);
+    } else if (argc == 3) {
 	baud = atol(argv[2]);
 	debug("'%s' %ld\n", argv[1], baud);
     }
 
+    if (ppid > 1) {
+#if DEBUG
+    	tty = open(argv[1], O_RDWR);
+    	if (tty < 0) {
+		consolemsg("cannot open terminal %s\n", argv[1]);
+		exit(3);
+    	}
+
+    	debug("reopen %s\n", argv[1]);
+    	close(0); close(1); close(2); /* close inherited stdio */
+    	if (dup2(tty, STDIN_FILENO) != STDIN_FILENO ||  \
+		dup2(tty, STDOUT_FILENO) != STDOUT_FILENO ||  \
+		dup2(tty, STDERR_FILENO) != STDERR_FILENO) {
+		consolemsg("Cannot reopen stdio - error %d\n", errno);
+		exit(3);
+	}
+#else
+	fputs("Enable DEBUG to start getty from the command line\n", stderr);
+    	/*close(tty);		crashes getty */
+	exit(1);	
+
+#endif /* DEBUG */
+    }
+
     fd = open(ISSUE, O_RDONLY);
     if (fd >= 0) {
 	put('\n');
@@ -303,7 +330,10 @@ int main(int argc, char **argv)
 			    state(Host);
 			    break;
 			case 'L':			/* Line used */
-			    if (argc > 1) {
+			    if (argc > 1) {	/* this test is bull since 
+						 * we won't get here unless
+						 * argc > 1 
+						 */
 				ptr = rindex(argv[1],'/');
 				if (ptr == NULL)
 				    ptr = argv[1];
@@ -347,6 +377,7 @@ int main(int argc, char **argv)
 
     /* setup tty termios state*/
     baud = convert_baudrate(baud);
+    debug("setting baudrate %d\n", baud);
     if (tcgetattr(STDIN_FILENO, &termios) >= 0) {
         termios.c_lflag |= ISIG | ICANON | ECHO | ECHOE | ECHONL;
         termios.c_lflag &= ~(IEXTEN | ECHOK | NOFLSH);
@@ -369,7 +400,7 @@ int main(int argc, char **argv)
 	state("login: ");
 	n=read(STDIN_FILENO,Buffer,sizeof(Buffer)-1);
 	if (n < 1) {
-	    debug("read fail errno %d\n", errno);
+	    debug("read fail on stdin, errno %d\n", errno);
 	    if (errno != -EINTR)
 		exit(1);
 	    continue;
@@ -378,6 +409,7 @@ int main(int argc, char **argv)
 	while (n > 0)
 	    if (Buffer[--n] < ' ')
 		Buffer[n] = '\0';
+	debug("Calling login: %s\n", Buffer);
 	if (*Buffer) {
 	    char *nargv[3];
 

--Mellvik

@ghaerr
Copy link
Owner

ghaerr commented Apr 7, 2020

Thanks @Mellvik.

Other than the debug statements to see what is going on, these mods are only to get getty to run from the shell, correct?

No modifications are needed to run on ttyS0 from init?

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 7, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Apr 7, 2020

You want the additional capability to run from the shell permanent, right?

I'll test it a bit, and create a PR for it. There's a couple small issues, and DEBUG shouldn't be required to run it from the shell, if it issues appropriate error messages, agree?

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 7, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Apr 7, 2020

I'll put up a PR. There's still some bugs in your version, for instance, even though 'usage' is pointless when called from init, 'exit' still has be called. I want to review all the debug messages also, so that when we have to debug this again, it can be used!

I'm not completely happy with the debug messages from 'init' yet, but we're making progress. A system or console log would be nice, possibly combined with a kernel printk log using 'mesg', but we'll leave that for later.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 7, 2020 via email

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 7, 2020

New serial related issue, @ghaerr:
ls -l /bin|more works fine on the console, not via the serial line.
After the first page (first --More--), more(1) fails:
more: : Bad file number

Repeatable on physical and virtual (Virtuabox).

Buffering issue?

--Mellvik

@ghaerr
Copy link
Owner

ghaerr commented Apr 7, 2020

After the first page (first --More--), more(1) fails:

IIRC, you added an explicit open of /dev/tty to more, and my implementation of that could be buggy. Try reading from stdin or 0 instead and see whether that helps. I'm not sure the open of /dev/tty should occur in more anyways, that's reserved normally for input that absolutely should not be redirected, like typing in passwords.

@ghaerr
Copy link
Owner

ghaerr commented Apr 7, 2020

you added an explicit open of /dev/tty to more

Opening /dev/tty is definitely the problem. It is failing and there is no check for -1. Try removing that extra open and using fd '1' instead and test it on serial and console. We can't use 0 as I suggested in my last comment, as more may be taking input from a pipe. But if the output is redirected, it shouldn't paginate anyways. You may need to remove the error printf in that case. After testing that case and the normal use case on console and serial, submit a PR, thanks!

There is also my bug that /dev/tty doesn't work on serial, but I can't fix that until I can figure out how to get serial working on QEMU. I also can't figure out how to attach an ELKS image to VirtualBox nor serial on it either.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 7, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Apr 8, 2020

I added #526 that should allows getty to run from the shell.

Are there any other serial login problems other than elvis not running at this point? I'd like to close this and start new issues so things don't get too complicated. I have elvis on my list for when I can duplicate it.

@ghaerr
Copy link
Owner

ghaerr commented Apr 16, 2020

@Mellvik - Thanks for the interesting explanation. Lets plan on using your modified patch for the chip detection algorithm and when to enable the FIFO. I say 'modified' because I notice that the 'B0' entry was deleted in divisors[] and that will definitely break the baud rate setting code.

The serial card I am using currently has two chips marked 16550, the identify (per the above) as 16550A. I have yet to test the operation of the FIFO, it's on the agenda.

After fully testing the chip detection and FIFO, please submit a PR, thanks!

@pawosm-arm
Copy link
Contributor

Is there a need to reallocate the other IRQs in @pawosm-arm's list? Allowing a general re-allocation of any IRQ could be a lot of work, it seems only IRQ 5 is used for multiple devices generally.

I guess IRQ 2 has similar story of being used for multiple devices.

@ghaerr
Copy link
Owner

ghaerr commented Apr 16, 2020

I guess IRQ 2 has similar story of being used for multiple devices.

I'll come up with something for IRQ 5 as well as IRQ 2. Lets move this IRQ discussion and comments over to #372 as there's lots going on around serial ports here.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 17, 2020

@ghaerr -
Sorry about the B0, an experiment got interrupted (pun intended).

-M

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 21, 2020

@ghaerr,
more serial testing today. IMHO the chip detection code is solid and FIFO can be enabled. I've tested threshold levels 8 and 14, both work fine.

+    if ((port->flags & SERF_TYPE) > ST_16550)
+       outb_p(UART_FCR_ENABLE_FIFO14, port->io + UART_FCR);

I've made two more changes along the way, neither have any effect on the character dropping, but fixes nevertheless:

@@ -473,7 +481,7 @@ int rs_init(void)
     do {
        if (sp->tty != NULL) {
            printk("ttyS%d at 0x%x, irq %d is a%s\n", ttyno,
-                      sp->io, sp->irq, serial_type[sp->flags & 0x3]);
+                      sp->io, sp->irq, serial_type[sp->flags & SERF_TYPE]);
        }
 /* Flush input fifo */
+// resetting FIFO mode clears all FIFOs
+#define flush_input_fifo(pt) outb_p(0, pt->io + UART_FCR)
+#if 0
 static void flush_input_fifo(register struct serial_info *sp)
 {
     int i = MAX_RX_BUFFER_SIZE;
@@ -96,8 +100,9 @@ static void flush_input_fifo(register struct serial_info *sp)
        inb_p(sp->io + UART_RX);
     } while (--i && (inb_p(sp->io + UART_LSR) & UART_LSR_DR));
 }
+#endif

The docs clearly states that setting FCR bit 0, clears the FIFOs.

I'm still working on the character drop problem - we're getting there. A few of interesting tidbits from the trenches:

  • Other system activities (such as disk/floppy I/O) are likely part og the problem, but input (paste) into cat > /dev/null and cat > /tmp/xx give similar results (significant character loss).
  • There are no more data overrun messages to the console, even when most of the input data get lost.
  • On the other hand, pre-running (buffering) drastically improve serial performance (using ash, see below): Pasting into stty -echo; cat > xx; stty echo gives 50% character loss @ first run, almost 0 @ second run (12 bytes lost out of 12k). That's on a 16450, no fifo, @ 19200bps, probably as good as it gets w/o handshake.
  • FIFO works, but there is no improvement (I changed the local buffer to 16).
  • sash screwed me for a while (embarrassing) with it's handling of redirections and pipes: cat /etc/inittab > /dev/null takes about 19 seconds to execute (the timecommand reports 0,8 secs btw). This is on a 386/20.
  • sash seems to take significantly longer to exec commands than ash. This is on my list for further verification.
  • In fact dropping sash for ash decreased serial line character losses significantly (10% in general), which doesn't make much sense. This is the most recent version of ash btw, w/linenoise.
  • Like before, the line speed (19200 and below) does affect character loss, but not much.
  • There is no significant difference between the NEW and OLD irq code.

Conclusion: IMHO the serial code is as good as it gets until we do a rewrite using a fully interrupt driven model, which is required to implement HW flow control anyway.

-M

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 21, 2020

@ghaerr,
I just discovered (from the NS application note for the 16550a): the command to enable FIFO must have bit 0 set. Missed that one, need to do more testing in the morning, but it doesn't change the conclusion.
M

@ghaerr
Copy link
Owner

ghaerr commented Apr 21, 2020

I just discovered (from the NS application note for the 16550a): the command to enable FIFO must have bit 0 set.

It does set bit 0: #define UART_FCR_ENABLE_FIFO8 (UART_FCR_ENABLE_FIFO | 0x80).

IMHO the chip detection code is solid and FIFO can be enabled. I've tested threshold levels 8 and 14, both work fine.

Can you upload a patch of serial.c with the changes? Thanks! I will sort through it and add the FIFO detection code, the changed FIFO enable code, and look at your other changes. I'm not yet sure about replacing the flush input fifo code, since that code is run with chips without FIFOs as well.

There are no more data overrun messages to the console, even when most of the input data get lost.

This is very big news! That means that the kernel interrupt routine has the capability to keep up with input data when FIFOs are enabled. Very good news. I need to explain the problem of overall throughput, and why we are still a ways away from improving that, which I will do below.

There is no significant difference between the NEW and OLD irq code.

Good: it was the intent to change nothing when moving the IRQs to be configurable. There is still more work to be done for NE2K driver, but otherwise it is now lots easier to configure different hardware. (I assume you've noticed we had to turn off serial COM3 IRQ5 by default, it also was default OFF beforehand as well).

FIFO works, but there is no improvement (I changed the local buffer to 16).

There is improvement by not seeing data overruns, correct? You won't be seeing any full hardware-to-cat-file improvements, those are a different issue entirely.

Like before, the line speed (19200 and below) does affect character loss, but not much.

When we say "character loss" here, I assume you mean that "cat /dev/ttyS0 > file" loses characters. More important is whether there are any data overruns at 19200. I will explain why there is non-overrun "character loss" shortly. So we need to keep the two problems separate.

sash screwed me for a while (embarrassing) with it's handling of redirections and pipes: cat /etc/inittab > /dev/null takes about 19 seconds to execute (the timecommand reports 0,8 secs btw). This is on a 386/20.

I'll bet you don't realize that sash doesn't support file redirection? I made it work behind the scenes and sash execs sh whenever one types a command that uses '>', '<', '|', '&' etc. So heck yeah it's going to be very slow starting up. sash is supposed to be small, and does not support much. I felt it better to achieve seamless compatibility between sash and sh than to rewrite sash to become sh, since we don't really need two shells!

sash seems to take significantly longer to exec commands than ash. This is on my list for further verification.

sash will run its long list of internal commands without hitting the disk at all (e.g. ls, cp, chown, etc, see elkscmd/APPS for details). You can run the /bin version of the command by typing '/bin/ls', for instance. The actual exec of an external command should be quite quick, except for when seamless integration kicks in and sh needs to load. In that case, ELKS still has to read in the sh data segment, the code segment is shared, so its a little quicker than starting sh all on its own. I hope to have a buffer info program sometime in the future that will show buffer usage, and that may allow sh to be loaded from RAM instead of disk.

(buffering) drastically improve serial performance (using ash, see below): Pasting into stty -echo; cat > xx; stty echo gives 50% character loss @ first run, almost 0 @ second run (12 bytes lost out of 12k). That's on a 16450, no fifo, @ 19200bps, probably as good as it gets w/o handshake.

Sounds like things are improving, and my last PR allowed for reading more characters from the 512-byte TTY buffer into cats 4096 buffer with each read, it sounds like that is working.

Here is a quick overview of how serial data gets from hardware to disk. Keep in mind that each of these, except the interrupt processing, is subject to being rescheduled from the CPU and some other task running. And the faster the baud rate, the more time is spent in the interrupt routine rather than running programs:

  • serial interrupt occurs when FIFO trigger level of serial characters received. This can be set to 1, 4, 8 or 14. If the FIFO is set to > 1, then FIFO trigger, or 2 characters of serial framing timeout must occur before an interrupt is generated. ELKS interrupt routine reads up to 10 characters from FIFO into the 512-byte TTY character queue. If for any reason a program is not reading the TTY queue fast enough, then it will drop characters when the interrupt routine force-adds more.
  • A program issues a read call on the serial port, with a buffer length (4096 for cat). What happens next depends on the many variables settable using termios on the serial tty.
  • The TTY driver will suspend a read and reschedule the processor until various conditions are met, OR, if some met, will return less than the number of characters asked for. In the version before my last PR, at most ONE character was EVER returned from the TTY driver. In the new version after the PR, it uses VMIN and VTIME to determine (in somewhat complex fashion), how many characters to return. The old default for VMIN was 0, which meant that it basically polled the serial driver, and possibly returned 1 character, no matter how many were ready.
  • The new default for VMIN is 1 (instead of 0). It will now wait for at least one character, not poll, and return the number of characters ready, which could be from 0 through the number in the interrupt routine TTY queue buffer. It is likely way less than 512.
  • The kernel currently copies characters from the TTY queue ONE at a time, back to user space. This needs a rewrite, and will be next, as its quite slow.
  • The read returns (to cat) with some number of characters, and the characters are copied for a third time to a user buffer. Even if the user buffer only returned 1 character, that character is then passed to a write system call, to be processed by the filesystem code to be copied back from user space into another kernel buffer, and then that buffer may be queued to be written to floppy, even with only a few characters received.

This is an overview, can you see the potential for data loss in the TTY queue by it discarding characters while read, cat, filesystem, buffer, disk and other code run? !!! Various things can be updated, but cat is not the way to test for lost data on slow systems, once we have the serial driver working properly.

Options to play with would be setting VMIN to 256 or more and VTIME to 1, but would be best done in a special cat program that also did the stty calls directly.

Another thought is creating a mechanism to allocate the TTY queue buffer dynamically, and set it much higher, if data transfers need to happen at hight rates and the kernel throughput can't be achieved. But the kernel throughput needs to be improved as well.

The ktcp and SLIP processing routines have max packet sizes, and I believe these are all smaller than 512 bytes.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 22, 2020 via email

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 22, 2020

Diff from the test version of serial.c:

diff --git a/elks/arch/i86/drivers/char/serial.c b/elks/arch/i86/drivers/char/serial.c
index 054c7775..7797c13e 100644
--- a/elks/arch/i86/drivers/char/serial.c
+++ b/elks/arch/i86/drivers/char/serial.c
@@ -12,7 +12,8 @@
 #include <arch/io.h>
 
 #define NEW            1       /* set =0 for old driver code if needed*/
-#define FIFO           0       /* set =1 to enable new FIFO code for testing*/
+#define FIFO           1       /* set =1 to enable new FIFO code for testing*/
+#define NEWPROBE       1       /* Use new probe code */
 
 #if defined (CONFIG_CHAR_DEV_RS) || defined (CONFIG_CONSOLE_SERIAL)
 
@@ -62,7 +63,7 @@ static struct serial_info ports[NR_SERIAL] = {
 static char irq_port[NR_SERIAL] = { 3, 1, 0, 2 };
 
 static unsigned int divisors[] = {
-    0,                         /*  0 = B0      */
+    0,                         /*  0 = B0      */
     2304,                      /*  1 = B50     */
     1536,                      /*  2 = B75     */
     1047,                      /*  3 = B110    */
@@ -88,6 +89,9 @@ static unsigned int divisors[] = {
 #define        RS_IALLMOSTEMPTY        (    INQ_SIZE / 4)
 
 /* Flush input fifo */
+// resetting FIFO mode clears all FIFOs
+#define flush_input_fifo(pt) outb_p(0, pt->io + UART_FCR)
+#if 0
 static void flush_input_fifo(register struct serial_info *sp)
 {
     int i = MAX_RX_BUFFER_SIZE;
@@ -96,8 +100,9 @@ static void flush_input_fifo(register struct serial_info *sp)
        inb_p(sp->io + UART_RX);
     } while (--i && (inb_p(sp->io + UART_LSR) & UART_LSR_DR));
 }
+#endif
 
-#if NEW
+#if NEWPROBE
 static int rs_probe(register struct serial_info *sp)
 {
     int status, type;
@@ -115,12 +120,13 @@ static int rs_probe(register struct serial_info *sp)
 
     /* then read FIFO status*/
     status = inb_p(sp->io + UART_IIR);
-    if (status & 0x40) {
-       if (status & 0x80) {            /* FIFO enabled*/
-           if (status & 0x20)          /* 64 byte FIFO enabled*/
-               type = ST_16750;
-           else type = ST_16550A;
-       } else type = ST_16550;
+    if (status & 0x80) {               /* FIFO available*/
+       if (status & 0x20)              /* 64 byte FIFO enabled*/
+           type = ST_16750;
+       else if (status & 0x40)         /* 16 byte FIFO OK */
+           type = ST_16550A;
+        else 
+           type = ST_16550;            /* Non-functional FIFO */
     } else {
        /* no FIFO, try writing arbitrary value to scratch reg*/
        outb_p(0x2A, sp->io + UART_SCR);
@@ -227,8 +233,6 @@ static void update_port(register struct serial_info *port)
 
     //FIXME: update lcr from parity and word termios values
 
-    clr_irq();
-
     /* Set the divisor latch bit */
     outb_p(port->lcr | UART_LCR_DLAB, port->io + UART_LCR);
 
@@ -238,8 +242,6 @@ static void update_port(register struct serial_info *port)
 
     /* Clear the divisor latch bit */
     outb_p(port->lcr, port->io + UART_LCR);
-
-    set_irq();
 }
 
 /* WARNING: Polling write function */
@@ -265,7 +267,7 @@ void rs_irq(int irq, struct pt_regs *regs, void *dev_id)
     struct ch_queue *q;
     int i, j, status;
     char *io;
-    unsigned char buf[10];
+    unsigned char buf[16];
 
     i = 0;
     sp = &ports[(int)irq_port[irq - 2]];
@@ -278,7 +280,7 @@ void rs_irq(int irq, struct pt_regs *regs, void *dev_id)
        //if (status & UART_LSR_DR)                     /* Receiver buffer full? */
            do {
                buf[i++] = inb_p(io + UART_RX);         /* Read received data */
-           } while ((inb_p(io + UART_LSR) & UART_LSR_DR) && i < 10);
+           } while ((inb_p(io + UART_LSR) & UART_LSR_DR) && i < 16);
        //}
     //} while (!(inb_p(io + UART_IIR) & UART_IIR_NO_INT) && i < 10);
 
@@ -367,7 +369,7 @@ static int rs_open(struct tty *tty)
 #define UART_FCR_ENABLE_FIFO4  (UART_FCR_ENABLE_FIFO | 0x40)
     /* enable FIFO with 8 byte trigger */
 #if FIFO
-    if ((port->flags & SERF_TYPE) > ST_16450)
+    if ((port->flags & SERF_TYPE) > ST_16550)
        outb_p(UART_FCR_ENABLE_FIFO8, port->io + UART_FCR);
 #endif
 
@@ -473,7 +475,7 @@ int rs_init(void)
     do {
        if (sp->tty != NULL) {
            printk("ttyS%d at 0x%x, irq %d is a%s\n", ttyno,
-                      sp->io, sp->irq, serial_type[sp->flags & 0x3]);
+                      sp->io, sp->irq, serial_type[sp->flags & SERF_TYPE]);
        }
        sp++;
     } while (++ttyno < NR_SERIAL);

@ghaerr
Copy link
Owner

ghaerr commented Apr 22, 2020

Thanks for the diff. I'll update serial.c and submit a PR to synchronize where we're both at.

It's may be good idea to make the number of serial lines a config option. More than 2 ports is rare, and the freed buffers and data structures may be better used somewhere else.

Ok, good idea. This may be best done for the time being in ports.h, since that file would also need modification to properly allocate COM3/COM4. I'll look at it.

From my perspective, if a system is getting more than it can handle, the only place data loss is acceptable, is at the ingress point, where it can be observed and handled. The rest of the path must be 100% reliable all the time.

All UNIX and Linux systems drop data in this way, when/if the TTY queue buffer overflows. There isn't anything the system can do about it reasonably, since this is at interrupt time.

With header compression (CSLIP), the packet size slightly less than 1K, I'm unsure whether it can be 'tuned' much lower …

I had thought that SLIP packets were 256 bytes, but looking at ktcp, I see they that slip MTU is 1064 bytes. Given this, and your reasonable comment that the system should be 100% reliable, I propose adding variable size TTY queues. That would allow a slow system with a FIFO chip to be 'tuned' so that the TTY queue takes up the slack between the interrupts and complete data throughput. This will allow you to fix the problem, providing the data rate doesn't exceed the hardware interrupt processing overhead.

A low level driver that doesn't support flow control and doesn't report (and record) errors (such as overruns, framing errors, …)

I'm not sure what the point would be of recording overruns, we report them now. Once a single report is seen, the baud rate must be dropped since there's nothing else ELKS/we can do to help other than using a larger FIFO. I plan on setting the largest FIFO trigger in the next PR, BTW.

But I can rewrite parts of the driver, making it fully interrupt driven w/ hard and soft flow control. And I have the setup to test it.

Great. I wouldn't suggest "fully interrupt driven" at this point, since that's just going to add to the interrupt overhead. I'll submit a PR shortly, then go ahead and add hardware flow control. Enabling interrupts for the MCR register (modem status interrupt) only would be a good start, and the interrupt routine can manipulate a flag in the serial_info struct that rs_write can use to continue transmitting or sleep using wake_up etc. To keep things simple, just use the 'CRTSCTS' bit in termios.c_cflag to enable HW flow control processing, which can be set by miniterm etc after you get it working.

Actually, some times it feels like the system just stops dead in the tracks for - say - 5+ seconds, then continues. And this happens only when using sash. Doesn't make sense.

I have found the memory corruption error I've been working on for two days. Turns out any program that calls grgetnam to get a group name badly corrupted memory. This includes the chgrp command, and ls -l. Since sash has an internal ls, running ls -l corrupted sash memory. That could be the problem, we'll have to see.

Also it seems like the time command is broken, reporting 0.8s real time when real time is more like 20s.

Keep testing. Lets keep time or sash bugs in another issue, please open if you find more information, after my PR memory fix.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 24, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Apr 24, 2020

Printk is OK, but not very useful on a PC screen w/o scrollback.

I'm running qemu.sh with the "CONSOLE=-serial stdio" option uncommented. This allows by default logins on both console and serial (which is macOS Terminal). I can then login on serial, which has scrollback from macOS terminal, while the printks remain on the console. I'm hoping there is a similar setup on VirtualBox. I plan on a subsequent enhancement to allow redirecting printk's to another TTY device, which would allow srollback of messages on serial.

Another thought is to implement passing a command line to ELKS at boot - one could specify serial console without recompilation, and display startup kernel messages on it. Another option could specify the startup init level. Both of these would help debugging.

sercat was written so that it can be run on either the console, or from the serial port after logging in, to enable quick serial throughput testing. Check it out, it will help us get through this final stage of eliminating any lost serial data.

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 26, 2020

Printk is OK, but not very useful on a PC screen w/o scrollback.

I'm running qemu.sh with the "CONSOLE=-serial stdio" option uncommented. This allows by default logins on both console and serial (which is macOS Terminal). I can then login on serial, which has scrollback from macOS terminal, while the printks remain on the console. I'm hoping there is a similar setup on VirtualBox. I plan on a subsequent enhancement to allow redirecting printk's to another TTY device, which would allow srollback of messages on serial.

Another thought is to implement passing a command line to ELKS at boot - one could specify serial console without recompilation, and display startup kernel messages on it. Another option could specify the startup init level. Both of these would help debugging.

sercat was written so that it can be run on either the console, or from the serial port after logging in, to enable quick serial throughput testing. Check it out, it will help us get through this final stage of eliminating any lost serial data.

Yes, I'm sure VB can do that too. Real HW cannot - unless serial console is used, which is sort of self prohibiting when debugging serial.
Anyway, my point was not related to the initial debugging, but adding a general enancement. At any time - in regular use - it is very valuable to have such error stats available and the cost is minimal. I'm going to include this as a proposal when adding hw flow control.
I'm also proposing an additional set of tty-devices (minor numbers) to accomodate hw flow control (such as if (minor&0xF0), then use hw). Most unix/linux systems support this, many use three different devices: no flow control, dialout (being DTE), dialin (being DCE). I think that's overkill, but having two is necessary.

@ghaerr
Copy link
Owner

ghaerr commented Apr 26, 2020

I'm also proposing an additional set of tty-devices (minor numbers) to accomodate hw flow control

I can think of several reasons this may not be a great idea, or be overly complicated for ELKS. Currently, each TTY device minor number has to be separately allocated in kernel data space within an already-shared array with console, pty, and serial ports, thus allocating 4 more slots for each of the variations on each of ttyS0/1/2/3. The exclusive-use mechanism recently added won't work with different minor numbers. Programs like miniterm, which default to /dev/ttyS0 will have to specify a new device on startup, rather than a dash option to turn on hardware or software flow control. /etc/inittab would have to be modified for each of the variations. But most importantly, any FAT-boot image won't be able to use them, as it is maxed-out on the 16 /dev/ fake-devices available, and current images only have ttyS0, with little that can be changed without affecting other use cases. [EDIT: Just looked at elks/fs/msdos/inode.c which allocates FAT fake devices, and I misspoke: there is additional space, so the last reason I gave isn't a good reason.]

But that can all get sorted out after hw flow control is working. I think it necessary to be able to turn on and off hw flow control via a programmatic ioctl on a regular ttyS0 (possibly also via stty) as well.

Real HW cannot - unless serial console is used, which is sort of self prohibiting when debugging serial.

Having a mechanism to redirect kernel printk's to the PC screen would help when running serial console.

it is very valuable to have such error stats available and the cost is minimal. I'm going to include this as a proposal when adding hw flow control.

I would prefer that each of the three issues under discussion be developed and submitted separately for stability and clarity moving forward (hw flow control, error logging, and additional minor devices).

I'm interested to see if there's any improvement to non-flow-controlled reliability using sercat instead of cat with the VMIN/VTIME throughput changes. Thank you for all your work, testing and comments!

@Mellvik
Copy link
Contributor Author

Mellvik commented Apr 26, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Apr 26, 2020

As to the FAT fs issue, my opinion is still that it was a bad idea to begin with.

but also an inhibitor to the development of ELKS as it blocks sound choices (like in this case).

I respectfully disagree with both statements. I pointed out in an edit that I was incorrect about running out of FAT /dev/entries, there are a couple that could be replaced. But we will be rapidly approaching a hard limit with how much can be added to the ELKS kernel, in attempts to make ELKS very much like desktop Linux. We've got less than 10k code space available in the kernel, tradeoffs will have to be made. I'm still chuckling about @pawosm-arm's comment that Linux's /dev got so big it was converted to a binary blob.

@pawosm-arm
Copy link
Contributor

Hey @ghaerr, we were talking about different things having similar names! What I mentioned on some occasion recently was DeviceTree [1], a binary blob provided by the bootloader or the firmware (OpenFirmware in most cases) to the kernel, descibing in a tree-like structure the system setup. It was supposed to be portable across architectures standardizing how information normally obtained by kernel from BIOS or bootloader (using non-portable means of communication) should be propagated.

As I understand, what you discuss here is the way of holding stuff in the /dev tree. Today's desktop Linuxes use one of dynamic /dev tree solutions (systemd, udev, formerly devfs) that hold this stuff in memory and make things appear/disappear in the '/dev' tree as needed following the plug-and-play spirit. But even today's Linux can live without it, you can still build your own Linux distro which does not have this dynamic stuff and holds device nodes as the real things on a disk filesystem in the real '/dev' directory on disk. Assuming pre-386 machines were born years before plug-and-play, I can assume, device nodes on disk is the right choice for ELKS.

[1] https://elinux.org/Device_Tree_Reference

@ghaerr
Copy link
Owner

ghaerr commented Apr 26, 2020

we were talking about different things having similar names!

Thanks for straightening me out! :)

I'm looking at telnetd, and I notice that it doesn't run /bin/sh in quite the way it should for a login shell. Also, it doesn't change directory or ask for credentials.... just brings you the root prompt, I'm presuming. Do you think telnetd should exec /bin/login instead?

@pawosm-arm
Copy link
Contributor

Do you think telnetd should exec /bin/login instead?

No opinion on that. For now, network security isn't my concern, whatever makes it easier to implement will do.

@ghaerr
Copy link
Owner

ghaerr commented Jun 30, 2020

Hello @Mellvik,

As you've probably seen, the serial driver problems with ELKS have been identified and hopefully completely solved in #664. I would like to have you test serial connectivity on a few real systems, and then perhaps we can close this issue finally!

My testing has shown ELKS at 19200 baud received characters keeping up (tested by ^S/^Q on "ls -lR /") and no character loss on both the regular and "fast" driver, and short-burst (less than 1024 characters, tested by "cat file" or ^S/^Q quickly) using the "fast" driver at speeds up to 57600 on a Compaq 386 portable. My other fast 386 desktop system will run at 115200 baud continuously with no data loss.

@Mellvik
Copy link
Contributor Author

Mellvik commented Jul 1, 2020 via email

@ghaerr
Copy link
Owner

ghaerr commented Jul 18, 2020

Hello @Mellvik,

A series of PRs several weeks ago (#592 and #664) should have fixed all the issues brought up here. Please test, and close this issue if you are satisfied. Since this issue has gotten rather long, subsequent serial port requests/enhancements can be brought up in new issues.

Thanks!

@Mellvik
Copy link
Contributor Author

Mellvik commented Jul 19, 2020 via email

@Mellvik Mellvik closed this as completed Jul 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Defect in the product
Projects
None yet
Development

No branches or pull requests

3 participants