getty/serial test report #515
UPDATE: Still some minor issues (commands echoed twice), but seemingly - when started from inittab, the device gets opened only once, and we're OK. Great work, @ghaerr !! |
Good news! What kind of changes were made to getty? A few explanations on your statements above:
Errno 16 is device busy, /dev/ttyS0 is protected against being opened twice.
Long story short on this is that /bin/init opens the argument to getty. After the device is opened by init, it execs the inittab-specified program, usually getty, and now getty sets the termio struct as well as the baud rate from the command line (but the device is already opened).
Is this fixed?
That's the protection in the serial driver that prohibits multiple opens. 'dup2' just duplicates a file descriptor in the kernel without re-opening the device. To summarize - the changes in the device open and dup2 calls you've made to get this to work may need to be changed in 'init', not 'getty'. |
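The difference between dup2 and a second open can be illustrated with a minimal sketch. It assumes nothing about the actual code in init or getty; `open_once_and_dup` is a hypothetical helper. The device is opened exactly once, so a driver that refuses a second open never sees one, and the three standard descriptors then share that single open file.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` once, then make descriptors
 * first, first+1 and first+2 refer to that same open file.
 * Only one open() reaches the driver; dup2() just copies the
 * descriptor inside the kernel.
 * Returns 0 on success, -1 on failure (descriptors already
 * duplicated before a failure are left in place).
 */
int open_once_and_dup(const char *path, int first)
{
    int fd = open(path, O_RDWR);
    int i;

    if (fd < 0)
        return -1;              /* e.g. EBUSY (errno 16) if already open */
    for (i = 0; i < 3; i++) {
        if (fd != first + i && dup2(fd, first + i) < 0) {
            close(fd);
            return -1;
        }
    }
    if (fd < first || fd > first + 2)
        close(fd);              /* the original descriptor is no longer needed */
    return 0;
}
```

Called as `open_once_and_dup("/dev/ttyS0", 0)` it would replace stdin, stdout and stderr. A shell redirect such as `sh < /dev/ttyS0 > /dev/ttyS0` instead issues two separate opens, which is what the single-open protection rejects.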
@ghaerr,
it’s really encouraging being so close to having serial login - and thus full serial support - working.
Final physical machine testing is pending - surely just adjustments.
The testing / debugging has left some - from my perspective - interesting experiences, summarized below.
On 3 Apr 2020 at 17:19, Gregory Haerr ***@***.***> wrote:
Some adjustments to getty, and serial login is working via inittab.
Good news! What kind of changes were made to getty?
When everything works, your fixed getty works perfectly. My adjustments are proposals, because as is, the serial/getty/inittab combination is impossible to debug: getty cannot be run from the command line (because of the single open policy, more on that below), and when run from init, it loops - unless everything on the hardware side is perfect. More on that too below.
A few explanations on your statements above:
The error message is Cannot create /dev/ttyS0: error 16.
Errno 16 is device busy, /dev/ttyS0 is protected against being opened twice.
Trying to run a shell with both 0 and 1 redirected isn't really the way to do it.
I don’t agree. The unix way is allowing multiple opens and then requesting exclusivity when required (and if possible). Making serial tty different from this breaks reasonable expectations, breaks with history and complicates things (like this example). Further, for serial lines, multiple opens have been a requirement: using the same line for inbound and outbound connections, for example. Having SLIP ‘block’ getty while active would be such a situation. Also, being able to redirect from the shell as shown has saved my day too many times to remember, in particular when working on serial lines. In other words, my suggestion is to change this.
There seems to be a stdio problem and possibly a file descriptor inheritance problem in getty. The device in argv[1] is never opened by getty.
Long story short on this is that /bin/init opens the argument to getty. There is some special processing of /etc/inittab by 'init' and things aren't as simple as they seem. See 'sys_utils/init.c' for details. We probably shouldn't change init processing at this point.
After the device is opened by init, it execs the inittab-specified program, usually getty, and now getty sets the termio struct as well as the baud rate from the command line (but the device is already opened).
I got this eventually, and indeed this is the traditional unix way, while Linux seems to have deviated with mingetty. Anyway, my proposal for getty is to have it check its origin, so to speak: either use the parent process id or compare the name of stdin to argv[1]. That way getty (and the environment around it) becomes debuggable. And it is a minor change.
Starting getty from inittab 'hangs' (loops) the system.
Is this fixed?
Yes and no. For obvious reasons, getty is very sensitive to errors. Any open or read/write error (almost) will cause it to exit, making init loop and the system appears dead. There are probably more ways to fix this than I can think of, but I guess the ’normal’ way is to have init detect the looping and throttle. Or somehow make getty more robust...
Opening the device explicitly in getty and then replacing stdio 0,1,2 before the execv using dup2() works.
/bin/init opens the device initially as stated above. I would like to see specifically what was required to get this working. 'init' does the same thing replacing 0,1,2 using dup2.
I cannot figure out why dup2 is accepted but not the shell redirects.
That's the protection in the serial driver that prohibits multiple opens. 'dup2' just duplicates a file descriptor in the kernel without re-opening the device.
To summarize - the changes in the device open and dup2 calls you've made to get this to work may need to be changed in 'init', not 'getty'.
Maybe - I don’t know ELKS init, so I don’t have an opinion on that. Since I now know getty, that way seems simpler ...
Finally - sort of unrelated, but related anyway: elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available. I have that on my list, but if you immediately have a hunch about the problem, let me know.
—Mellvik
|
What exactly is not working with my last PR? Is it just that it can't be debugged easily when the hardware side isn't perfect?
Agreed. I added a DEBUG option to both init and getty for this reason. Try turning them on (start with one at a time), it will help a lot. That was the only way I was able to see what was actually going on, finally.
Ok. The tty code could be changed on default to allow duplicate opens (at least for root), and the ktcp daemon could use O_EXCL on open to protect itself. I can add both those things.
I'm not that familiar with mingetty, and ELKS has one too. I'd say we should probably stick with getty for now and just get the few things needed to get this working.
Is there something that needs to be added to getty? What exactly? Thanks for your testing, we're almost there! |
What do you mean 'terminal window'? You mean from a /dev/ttyS0 line? Or from ELKS while connected to another system using 'miniterm'? I suspect the reason may have to do with an |
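I spent some time tracking this down... /bin/init calls getty which calls login. Login sets the USER=, SHELL=, HOME= and TERM=ansi environment variables, then execs the shell specified in /etc/passwd. There is no other way that elvis gets the LINES/COLS. None of SIGWINCH nor the TIOCGWINSZ ioctl are implemented in ELKS. If login doesn't run, elvis won't work unless you set TERM=ansi explicitly beforehand. Thus, I think your issue is that elvis is being run when the entire init->getty->login->shell sequence didn't happen. When it happens again, run printenv and take a look. |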
@ghaerr,
This is a normal (full) login. The environ is set normally.
Let me know if there is anything specific to test to narrow it down.
Screenshot to be uploaded separately.
I don't think they're related, but FWIW: a stty problem report is coming separately.
--Mellvik
On 5 Apr 2020 at 04:25, Gregory Haerr ***@***.***> wrote:
Finally - sort of unrelated, but related anyway: elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available. I have that on my list, but if you immediately have a hunch about the problem, let me know.
I spent some time tracking this down... /bin/init calls getty which calls login. Login sets the USER=, SHELL=, HOME= and TERM=ansi environment variables, then execs the shell specified in /etc/passwd. There is no other way that elvis gets the LINES/COLS. None of SIGWINCH nor the TIOCGWINSZ ioctl are implemented in ELKS. If login doesn't run, elvis won't work unless you set TERM=ansi explicitly beforehand.
Thus, I think your issue is that elvis is being run when the entire init->getty->login->shell sequence didn't happen. When it happens again, run printenv and take a look.
|
@ghaerr, just some clarifications:
When everything works, your fixed getty works perfectly.
unless everything on the hardware side is perfect.
What exactly is not working with my last PR? Is it just that it can't be debugged easily when the hardware side isn't perfect?
Getty works fine, there is nothing wrong with the PR. It’s just not helpful.
My adjustments are proposals, because as is, the serial/getty/inittab combination is impossible to debug
Agreed. I added a DEBUG option to both init and getty for this reason. Try turning them on (start with one at a time), it will help a lot. That was the only way I was able to see what was actually going on, finally.
The unix way is allowing multiple opens and then requesting exclusivity when required (and if possible).
Ok. The tty code could be changed on default to allow duplicate opens (at least for root), and the ktcp daemon could use O_EXCL on open to protect itself. I can add both those things.
Thanks - appreciated.
while Linux seems to have deviated with mingetty.
I'm not that familiar with mingetty, and ELKS has one too. I'd say we should probably stick with getty for now and just get the few things needed to get this working.
I agree.
Since I now know getty, that way seems simpler ...
Is there something that needs to be added to getty? What exactly?
As suggested in the previous message, adding the ability to run getty from the command line turns it into a useful tool for debugging serial connections. A simple test of ppid != 1, or maybe just filename(stdin) != argv[1] and then open argv[1] does it. My version of getty does that - wrapped by the DEBUG #if (using the ppid!=0 method).
Thanks for your testing, we're almost there!
Yes, this is very encouraging.
--Mellvik
|
Agreed. Please post your modified getty.c here, along with exactly how you want to run it for debugging serial connections and I'll look at it. |
For the elvis issue and your screenshot: Thanks, very strange. Let's debug the multiple init processes first - this is definitely because an /etc/inittab process died or failed to start and init is re-running it. Please send over a screenshot of your /etc/inittab or post it. You might try running init in DEBUG mode first, to get a better idea of what is failing. |
|
That's the strange thing: I'm diagnosing this from the serial line login window. If you look closely at the screen clip you can see that there are logins (shells) @ two different terminals. Also, I'm using the distribution inittab unmodified with the exception of the comment removed for ttyS0. I'll test more with init debug on. |
Beware, 'init' may need to be debugged from the console (/dev/tty1). There is some code in main() that opens DEVTTY (tty1). I don't recommend changing that to run off /dev/ttyS0, as that will cause the open to fail due to multiple opens. If you think that perhaps /dev/ttyS0 multiple opens being prohibited might be the cause of all this mess, comment out lines 258-259 in elks/arch/i86/drivers/char/serial.c - that will end the problem of serial line multiple opens. |
Thanks for the hint @ghaerr. Actually a useful combination - work on the terminal, messages on the console :-)
I'll make the serial.c change to eliminate that possibility.
Right now it looks like the runaway init is on physical HW only, not on VirtualBox (& serial via netcat).
—Mellvik
On 5 Apr 2020 at 18:44, Gregory Haerr ***@***.***> wrote:
Beware, 'init' may need to be debugged from the console (/dev/tty1). There is some code in main() that opens DEVTTY (tty1). I don't recommend changing that to run off /dev/ttyS0, as that will cause the open to fail due to multiple opens.
If you think that perhaps /dev/ttyS0 multiple opens being prohibited might be the cause of all this mess, comment out lines 258-259 in elks/arch/i86/drivers/char/serial.c - that will end the problem of serial line multiple opens.
|
Here we go:
--Mellvik |
Thanks @Mellvik. Other than the debug statements to see what is going on, these mods are only to get getty to run from the shell, correct? No modifications are needed to run on ttyS0 from init? |
Yes, that's right @ghaerr.
There is also a message if started from the command line with DEBUG off, informing that DEBUG needs to be set in order to run like that.
—Mellvik
On 7 Apr 2020 at 16:32, Gregory Haerr ***@***.***> wrote:
Thanks @Mellvik <https://github.com/Mellvik>.
Other than the debug statements to see what is going on, these mods are only to get getty to run from the shell, correct?
No modifications are needed to run on ttyS0 from init?
|
You want the additional capability to run from the shell permanent, right? I'll test it a bit, and create a PR for it. There's a couple small issues, and DEBUG shouldn't be required to run it from the shell, if it issues appropriate error messages, agree? |
Yes, I agree.
I misread your previous mail.
I can make that change if you'd like.
—M
On 7 Apr 2020 at 16:54, Gregory Haerr ***@***.***> wrote:
You want the additional capability to run from the shell permanent, right?
I'll test it a bit, and create a PR for it. There's a couple small issues, and DEBUG shouldn't be required to run it from the shell, if it issues appropriate error messages, agree?
|
I'll put up a PR. There are still some bugs in your version; for instance, even though 'usage' is pointless when called from init, 'exit' still has to be called. I want to review all the debug messages also, so that when we have to debug this again, it can be used! I'm not completely happy with the debug messages from 'init' yet, but we're making progress. A system or console log would be nice, possibly combined with a kernel printk log using 'mesg', but we'll leave that for later. |
OK. And yes, a printk would be nice at some point.
—M
On 7 Apr 2020 at 17:17, Gregory Haerr ***@***.***> wrote:
I'll put up a PR. There are still some bugs in your version; for instance, even though 'usage' is pointless when called from init, 'exit' still has to be called. I want to review all the debug messages also, so that when we have to debug this again, it can be used!
I'm not completely happy with the debug messages from 'init' yet, but we're making progress. A system or console log would be nice, possibly combined with a kernel printk log using 'mesg', but we'll leave that for later.
|
New serial related issue, @ghaerr: Repeatable on physical and virtual (VirtualBox). Buffering issue? --Mellvik |
IIRC, you added an explicit open of /dev/tty to more |
Opening /dev/tty is definitely the problem. It is failing and there is no check for -1. Try removing that extra open and using fd '1' instead and test it on serial and console. We can't use 0 as I suggested in my last comment, as more may be taking input from a pipe. But if the output is redirected, it shouldn't paginate anyways. You may need to remove the error printf in that case. After testing that case and the normal use case on console and serial, submit a PR, thanks! There is also my bug that /dev/tty doesn't work on serial, but I can't fix that until I can figure out how to get serial working on QEMU. I also can't figure out how to attach an ELKS image to VirtualBox nor serial on it either. |
@ghaerr,
Yes, you're right, open /dev/tty does fail - but works in other cases. I'll take a look at the logic as well.
—M
On 7 Apr 2020 at 21:36, Gregory Haerr ***@***.***> wrote:
you added an explicit open of /dev/tty to more
Opening /dev/tty is definitely the problem. It is failing and there is no check for -1. Try removing that extra open and using fd '1' instead and test it on serial and console. We can't use 0 as I suggested in my last comment, as more may be taking input from a pipe. But if the output is redirected, it shouldn't paginate anyways. You may need to remove the error printf in that case. After testing that case and the normal use case on console and serial, submit a PR, thanks!
There is also my bug that /dev/tty doesn't work on serial, but I can't fix that until I can figure out how to get serial working on QEMU. I also can't figure out how to attach an ELKS image to VirtualBox nor serial on it either.
|
I added #526 that should allow getty to run from the shell. Are there any other serial login problems other than elvis not running at this point? I'd like to close this and start new issues so things don't get too complicated. I have elvis on my list for when I can duplicate it. |
@Mellvik - Thanks for the interesting explanation. Let's plan on using your modified patch for the chip detection algorithm and when to enable the FIFO. I say 'modified' because I notice that the 'B0' entry was deleted in divisors[] and that will definitely break the baud rate setting code.
After fully testing the chip detection and FIFO, please submit a PR, thanks! |
I guess IRQ 2 has similar story of being used for multiple devices. |
I'll come up with something for IRQ 5 as well as IRQ 2. Let's move this IRQ discussion and comments over to #372 as there's lots going on around serial ports here. |
@ghaerr - -M |
@ghaerr,
I've made two more changes along the way, neither have any effect on the character dropping, but fixes nevertheless:
The docs clearly state that setting FCR bit 0 clears the FIFOs. I'm still working on the character drop problem - we're getting there. A few interesting tidbits from the trenches:
Conclusion: IMHO the serial code is as good as it gets until we do a rewrite using a fully interrupt driven model, which is required to implement HW flow control anyway. -M |
@ghaerr, |
It does set bit 0: #define UART_FCR_ENABLE_FIFO8 (UART_FCR_ENABLE_FIFO | 0x80).
Can you upload a patch of serial.c with the changes? Thanks! I will sort through it and add the FIFO detection code, the changed FIFO enable code, and look at your other changes. I'm not yet sure about replacing the flush input fifo code, since that code is run with chips without FIFOs as well.
This is very big news! That means that the kernel interrupt routine has the capability to keep up with input data when FIFOs are enabled. Very good news. I need to explain the problem of overall throughput, and why we are still a ways away from improving that, which I will do below.
Good: it was the intent to change nothing when moving the IRQs to be configurable. There is still more work to be done for NE2K driver, but otherwise it is now lots easier to configure different hardware. (I assume you've noticed we had to turn off serial COM3 IRQ5 by default, it also was default OFF beforehand as well).
There is improvement by not seeing data overruns, correct? You won't be seeing any full hardware-to-cat-file improvements, those are a different issue entirely.
When we say "character loss" here, I assume you mean that "cat /dev/ttyS0 > file" loses characters. More important is whether there are any data overruns at 19200. I will explain why there is non-overrun "character loss" shortly. So we need to keep the two problems separate.
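I'll bet you don't realize that sash doesn't support file redirection? I made it work behind the scenes and sash execs sh whenever one types a command that uses '>', '<', '|', '&' etc. So heck yeah it's going to be very slow starting up. sash is supposed to be small, and does not support much. I felt it better to achieve seamless compatibility between sash and sh than to rewrite sash to become sh, since we don't really need two shells!
Sounds like things are improving, and my last PR allowed for reading more characters from the 512-byte TTY buffer into cat's 4096-byte buffer with each read; it sounds like that is working.
Here is a quick overview of how serial data gets from hardware to disk. Keep in mind that each of these, except the interrupt processing, is subject to being rescheduled from the CPU and some other task running. And the faster the baud rate, the more time is spent in the interrupt routine rather than running programs:
1) A serial interrupt occurs when the FIFO trigger level of serial characters has been received. This can be set to 1, 4, 8 or 14. If the FIFO is set to > 1, then either the FIFO trigger or a timeout of 2 characters of serial framing must occur before an interrupt is generated. The ELKS interrupt routine reads up to 10 characters from the FIFO into the 512-byte TTY character queue. If for any reason a program is not reading the TTY queue fast enough, then it will drop characters when the interrupt routine force-adds more.
2) A program issues a read call on the serial port, with a buffer length (4096 for cat). What happens next depends on the many variables settable using termios on the serial tty.
3) The TTY driver will suspend a read and reschedule the processor until various conditions are met, OR, if some are met, will return fewer than the number of characters asked for. In the version before my last PR, at most ONE character was EVER returned from the TTY driver. In the new version after the PR, it uses VMIN and VTIME to determine (in somewhat complex fashion) how many characters to return. The old default for VMIN was 0, which meant that it basically polled the serial driver, and possibly returned 1 character, no matter how many were ready.
4) The new default for VMIN is 1 (instead of 0). It will now wait for at least one character, not poll, and return the number of characters ready, which could be from 0 through the number in the interrupt routine TTY queue buffer. It is likely way less than 512.
5) The kernel currently copies characters from the TTY queue ONE at a time, back to user space. This needs a rewrite, and will be next, as it's quite slow.
6) The read returns (to cat) with some number of characters, and the characters are copied for a third time to a user buffer. Even if the user buffer only returned 1 character, that character is then passed to a write system call, to be processed by the filesystem code to be copied back from user space into another kernel buffer, and then that buffer may be queued to be written to floppy, even with only a few characters received.
This is an overview; can you see the potential for data loss in the TTY queue by it discarding characters while read, cat, filesystem, buffer, disk and other code run? !!! Various things can be updated, but cat is not the way to test for lost data on slow systems, once we have the serial driver working properly.
Options to play with would be setting VMIN to 256 or more and VTIME to 1, but would be best done in a special cat program that also did the stty calls directly.
Another thought is creating a mechanism to allocate the TTY queue buffer dynamically, and set it much higher, if data transfers need to happen at high rates and the kernel throughput can't be achieved. But the kernel throughput needs to be improved as well.
The ktcp and SLIP processing routines have max packet sizes, and I believe these are all smaller than 512 bytes. |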
Thanks for this thorough walkthru, @ghaerr. It has been a while since I did low level serial communication software and device drivers. And yes, we're getting there.
Some clarifications below.
I just discovered (from the NS application note for the 16550a): the command to enable FIFO must have bit 0 set.
It does set bit 0: #define UART_FCR_ENABLE_FIFO8 (UART_FCR_ENABLE_FIFO | 0x80).
Sorry, I misread that one!
IMHO the chip detection code is solid and FIFO can be enabled. I've tested threshold levels 8 and 14, both work fine.
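For reference, the FCR values behind the trigger levels mentioned here can be sketched like this. The bit layout is the standard 16550A one (bit 0 enables the FIFOs, bits 6-7 select the receive trigger level); the macro and function names are made up for this illustration and are not the ones in the ELKS serial.c.

```c
/* 16550A FIFO Control Register bits (names are hypothetical) */
#define FCR_ENABLE_FIFO  0x01   /* bit 0: enable FIFOs */
#define FCR_CLEAR_RCVR   0x02   /* bit 1: clear receive FIFO */
#define FCR_CLEAR_XMIT   0x04   /* bit 2: clear transmit FIFO */
#define FCR_TRIGGER_1    0x00   /* bits 6-7: receive trigger level */
#define FCR_TRIGGER_4    0x40
#define FCR_TRIGGER_8    0x80
#define FCR_TRIGGER_14   0xC0

/*
 * Return the FCR byte that enables the FIFO with the given receive
 * trigger level, or 0 (FIFO disabled) for an unsupported level.
 */
unsigned char fcr_value(int trigger)
{
    switch (trigger) {
    case 1:  return FCR_ENABLE_FIFO | FCR_TRIGGER_1;
    case 4:  return FCR_ENABLE_FIFO | FCR_TRIGGER_4;
    case 8:  return FCR_ENABLE_FIFO | FCR_TRIGGER_8;
    case 14: return FCR_ENABLE_FIFO | FCR_TRIGGER_14;
    default: return 0;
    }
}
```

With these definitions, `fcr_value(8)` corresponds to the `UART_FCR_ENABLE_FIFO | 0x80` value quoted above, assuming UART_FCR_ENABLE_FIFO is 0x01.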
Can you upload a patch of serial.c with the changes? Thanks! I will sort through it and add the FIFO detection code, the changed FIFO enable code, and look at your other changes.
Will do. I suspect my version is quite a bit behind yours.
I'm not yet sure about replacing the flush input fifo code, since that code is run with chips without FIFOs as well.
I don't think this is an issue. When FIFO is not enabled the 'buffer' is emptied by a single inb().
There are no more data overrun messages to the console, even when most of the input data get lost.
This is very big news! That means that the kernel interrupt routine has the capability to keep up with input data when FIFOs are enabled. Very good news. I need to explain the problem of overall throughput, and why we are still a ways away from improving that, which I will do below.
To me this is bad news, although I suspect when it comes down to it, we're in agreement.
There is no significant difference between the NEW and OLD irq code.
Good: it was the intent to change nothing when moving the IRQs to be configurable. There is still more work to be done for NE2K driver, but otherwise it is now lots easier to configure different hardware. (I assume you've noticed we had to turn off serial COM3 IRQ5 by default, it also was default OFF beforehand as well).
I have, this is good. I believe - in order to save memory - it may be a good idea to make the number of serial lines a config option. More than 2 ports is rare, and the freed buffers and data structures may be better used somewhere else.
FIFO works, but there is no improvement (I changed the local buffer to 16).
There is improvement by not seeing data overruns, correct? You won't be seeing any full hardware-to-cat-file improvements, those are a different issue entirely.
Like before, the line speed (19200 and below) does affect character loss, but not much.
When we say "character loss" here, I assume you mean that "cat /dev/ttyS0 > file" loses characters. More important is whether there are any data overruns at 19200. I will explain why there is non-overrun "character loss" shortly. So we need to keep the two problems separate.
Yes, I'm talking about lost data...
sash screwed me for a while (embarrassing) with its handling of redirections and pipes: cat /etc/inittab > /dev/null takes about 19 seconds to execute (the time command reports 0.8 secs btw). This is on a 386/20.
I'll bet you don't realize that sash doesn't support file redirection? I made it work behind the scenes and sash execs sh whenever one types a command that uses '>', '<', '|', '&' etc. So heck yeah it's going to be very slow starting up. sash is supposed to be small, and does not support much. I felt it better to achieve seamless compatibility between sash and sh than to rewrite sash to become sh, since we don't really need two shells!
sash seems to take significantly longer to exec commands than ash. This is on my list for further verification.
sash will run its long list of internal commands without hitting the disk at all (e.g. ls, cp, chown, etc, see elkscmd/APPS for details). You can run the /bin version of the command by typing '/bin/ls', for instance. The actual exec of an external command should be quite quick, except for when seamless integration kicks in and sh needs to load. In that case, ELKS still has to read in the sh data segment; the code segment is shared, so it's a little quicker than starting sh all on its own. I hope to have a buffer info program sometime in the future that will show buffer usage, and that may allow sh to be loaded from RAM instead of disk.
I am well aware of the characteristics of sash and their motivation, which I completely support. The reason I'm reporting this is that the effects seem out of scale, and (to me) some of them are hard to explain. Also it seems like the time command is broken, reporting 0.8s real time when real time is more like 20s.
Actually, sometimes it feels like the system just stops dead in its tracks for - say - 5+ seconds, then continues. And this happens only when using sash. Doesn't make sense. Also, it doesn't make sense to me that sash takes longer to exec a random (external) command than ash. Sidenote: I'm doing most of the testing from a different room than the physical machine - continuously banging the serial line, so I don't hear floppy accesses. Thus the need for more verification.
(buffering) drastically improve serial performance (using ash, see below): Pasting into stty -echo; cat > xx; stty echo gives 50% character loss @ first run, almost 0 @ second run (12 bytes lost out of 12k). That's on a 16450, no fifo, @ 19200bps, probably as good as it gets w/o handshake.
Sounds like things are improving, and my last PR allowed for reading more characters from the 512-byte TTY buffer into cat's 4096-byte buffer with each read; it sounds like that is working.
Here is a quick overview of how serial data gets from hardware to disk. Keep in mind that each of these, except the interrupt processing, is subject to being rescheduled from the CPU and some other task running. And the faster the baud rate, the more time is spent in the interrupt routine rather than running programs:
1) A serial interrupt occurs when the FIFO trigger level of serial characters has been received. This can be set to 1, 4, 8 or 14. If the FIFO is set to > 1, then either the FIFO trigger or a timeout of 2 characters of serial framing must occur before an interrupt is generated. The ELKS interrupt routine reads up to 10 characters from the FIFO into the 512-byte TTY character queue. If for any reason a program is not reading the TTY queue fast enough, then it will drop characters when the interrupt routine force-adds more.
2) A program issues a read call on the serial port, with a buffer length (4096 for cat). What happens next depends on the many variables settable using termios on the serial tty.
3) The TTY driver will suspend a read and reschedule the processor until various conditions are met, OR, if some are met, will return fewer than the number of characters asked for. In the version before my last PR, at most ONE character was EVER returned from the TTY driver. In the new version after the PR, it uses VMIN and VTIME to determine (in somewhat complex fashion) how many characters to return. The old default for VMIN was 0, which meant that it basically polled the serial driver, and possibly returned 1 character, no matter how many were ready.
4) The new default for VMIN is 1 (instead of 0). It will now wait for at least one character, not poll, and return the number of characters ready, which could be from 0 through the number in the interrupt routine TTY queue buffer. It is likely way less than 512.
5) The kernel currently copies characters from the TTY queue ONE at a time, back to user space. This needs a rewrite, and will be next, as it's quite slow.
6) The read returns (to cat) with some number of characters, and the characters are copied for a third time to a user buffer. Even if the user buffer only returned 1 character, that character is then passed to a write system call, to be processed by the filesystem code to be copied back from user space into another kernel buffer, and then that buffer may be queued to be written to floppy, even with only a few characters received.
This is an overview, can you see the potential for data loss in the TTY queue by it discarding characters while read, cat, filesystem, buffer, disk and other code run? !!! Various things can be updated, but cat is not the way to test for lost data on slow systems, once we have the serial driver working properly.
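To make the loss point in the overview concrete, here is a minimal sketch of a fixed-size character queue filled from the interrupt side and drained by read. It is not the ELKS queue code: the `ttyq` names are hypothetical, and the sketch assumes the incoming character is the one discarded when the queue is full. As a rough scale, at 19200 bps with 8N1 framing (10 bits per character) about 1920 characters arrive per second, so a 512-byte queue that is not drained fills in roughly 0.27 seconds.

```c
#define TTYQ_SIZE 512

/* Hypothetical TTY input queue with a drop counter */
struct ttyq {
    unsigned char buf[TTYQ_SIZE];
    unsigned int head;      /* next slot to write */
    unsigned int tail;      /* next slot to read */
    unsigned int len;       /* characters currently queued */
    unsigned long dropped;  /* characters discarded because the queue was full */
};

/* Interrupt side: add one character, discarding it if the queue is full */
void ttyq_put(struct ttyq *q, unsigned char c)
{
    if (q->len == TTYQ_SIZE) {
        q->dropped++;       /* silent loss unless somebody reads this counter */
        return;
    }
    q->buf[q->head] = c;
    q->head = (q->head + 1) % TTYQ_SIZE;
    q->len++;
}

/* Reader side: copy up to max queued characters, return how many were copied */
int ttyq_read(struct ttyq *q, unsigned char *dst, int max)
{
    int n = 0;

    while (n < max && q->len > 0) {
        dst[n++] = q->buf[q->tail];
        q->tail = (q->tail + 1) % TTYQ_SIZE;
        q->len--;
    }
    return n;
}
```

If 600 characters arrive before the reader runs, the reader gets the first 512 and the remaining 88 are only visible in `dropped`; a 4096-byte read buffer in cat does not help once the queue itself has overflowed.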
Thanks again for the detailed walkthrough! If I read you correctly, I disagree with your conclusion. From my perspective, cat is a perfect test because it tests the entire system. Further, a system that loses data w/o flagging it, is fundamentally broken. The way I see it, we have two problems that need to be fixed:
1) A broken (unreliable, possibly too complicated?) data path between the input (character) buffers and the client program, maybe even the other way (output) too.
2) A low level driver that doesn't support flow control and doesn't report (and record) errors (such as overruns, framing errors, …)
From my perspective, if a system is getting more than it can handle, the only place data loss is acceptable, is at the ingress point, where it can be observed and handled. The rest of the path must be 100% reliable all the time.
Options to play with would be setting VMIN to 256 or more and VTIME to 1, but would be best done in a special cat program that also did the stty calls directly.
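A sketch of what such a special cat program might do before reading, using the standard termios interface. `set_block_read` is a hypothetical helper. One caveat, stated as an assumption about the platform: where cc_t is an unsigned char, VMIN cannot hold 256, so the sketch clamps to 255.

```c
#include <termios.h>

/*
 * Hypothetical helper: switch a termios setting to raw block reads.
 * With VMIN > 0 and VTIME > 0, read() returns once VMIN characters
 * have arrived, or when more than VTIME tenths of a second pass
 * between characters after the first one.
 */
void set_block_read(struct termios *t, unsigned int vmin, unsigned int vtime_tenths)
{
    t->c_lflag &= ~(ICANON | ECHO);     /* no line editing, no echo */
    t->c_cc[VMIN] = (cc_t)(vmin > 255 ? 255 : vmin);
    t->c_cc[VTIME] = (cc_t)(vtime_tenths > 255 ? 255 : vtime_tenths);
}
```

Typical use would be `tcgetattr(fd, &t); set_block_read(&t, 255, 1); tcsetattr(fd, TCSANOW, &t);` followed by reads into a large buffer. Whether the ELKS tty driver honors these exact semantics is not confirmed here.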
Another thought is creating a mechanism to allocate the TTY queue buffer dynamically, and set it much higher, if data transfers need to happen at high rates and the kernel throughput can't be achieved. But the kernel throughput needs to be improved as well.
The ktcp and SLIP processing routines have max packet sizes, and I believe these are all smaller than 512 bytes.
With header compression (CSLIP), the packet size is slightly less than 1K, I'm unsure whether it can be 'tuned' much lower …
Back to the two main problems above:
I suggest we share the 'burden'. Among a million other things it seems you are in the process of fixing the path inside the system already. I couldn't do that in any reasonable time. But I can rewrite parts of the driver, making it fully interrupt driven w/ hard and soft flow control. And I have the setup to test it.
—Mellvik
|
Diff from the test version of serial.c:
|
Thanks for the diff. I'll update serial.c and submit a PR to synchronize where we're both at.
Ok, good idea. This may be best done for the time being in ports.h, since that file would also need modification to properly allocate COM3/COM4. I'll look at it.
All UNIX and Linux systems drop data in this way, when/if the TTY queue buffer overflows. There isn't anything the system can do about it reasonably, since this is at interrupt time.
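I had thought that SLIP packets were 256 bytes, but looking at ktcp, I see that the SLIP MTU is 1064 bytes. Given this, and your reasonable comment that the system should be 100% reliable, I propose adding variable size TTY queues.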
I'm not sure what the point would be of recording overruns, we report them now. Once a single report is seen, the baud rate must be dropped since there's nothing else ELKS/we can do to help other than using a larger FIFO. I plan on setting the largest FIFO trigger in the next PR, BTW.
Great. I wouldn't suggest "fully interrupt driven" at this point, since that's just going to add to the interrupt overhead. I'll submit a PR shortly, then go ahead and add hardware flow control. Enabling interrupts for the MCR register (modem status interrupt) only would be a good start, and the interrupt routine can manipulate a flag in the serial_info struct that
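I have found the memory corruption error I've been working on for two days. Turns out any program that calls grgetnam to get a group name badly corrupted memory. This includes the chgrp command, and ls -l. Since sash has an internal ls, running ls -l corrupted sash memory. That could be the problem, we'll have to see.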
Keep testing. Let's keep time or sash bugs in another issue, please open one if you find more information, after my PR memory fix. |
@ghaerr,
I'm falling behind on some issues here.
From my perspective, if a system is getting more than it can handle, the only place data loss is acceptable, is at the ingress point, where it can be observed and handled. The rest of the path must be 100% reliable all the time.
All UNIX and Linux systems drop data in this way, when/if the TTY queue buffer overflows. There isn't anything the system can do about it reasonably, since this is at interrupt time.
Yes, that's my point.
With header compression (CSLIP), the packet size is slightly less than 1K, I'm unsure whether it can be 'tuned' much lower …
I had thought that SLIP packets were 256 bytes, but looking at ktcp, I see that the SLIP MTU is 1064 bytes. Given this, and your reasonable comment that the system should be 100% reliable, I propose adding variable size TTY queues.
That's a great plan (and I see your path from idea to action is as short as we've become accustomed to!). BTW - if someone is actually planning to use SLIP, implementing CSLIP is a quick and very beneficial improvement.
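To make the variable-size queue idea concrete, here is a minimal sketch in C of a ring buffer whose capacity is chosen when it is allocated, and which counts bytes discarded at the ingress point when it is full. All names (struct ttyq, ttyq_init, ttyq_put, ttyq_get, ttyq_free) are hypothetical and are not the ELKS kernel's queue code; malloc stands in for whatever allocator the kernel would actually use.

```c
#include <stdlib.h>

/* Hypothetical ring buffer whose size is chosen at allocation time. */
struct ttyq {
    unsigned char *buf;   /* storage, allocated in ttyq_init() */
    unsigned int size;    /* capacity in bytes */
    unsigned int head;    /* next slot to write */
    unsigned int tail;    /* next slot to read */
    unsigned int len;     /* bytes currently queued */
    unsigned int dropped; /* bytes discarded because the queue was full */
};

/* Allocate a queue of the requested size; returns 0 on success, -1 on failure. */
int ttyq_init(struct ttyq *q, unsigned int size)
{
    q->buf = malloc(size);
    if (q->buf == NULL)
        return -1;
    q->size = size;
    q->head = q->tail = q->len = q->dropped = 0;
    return 0;
}

/* Receive path: store one byte, or count it as dropped when the queue is full. */
int ttyq_put(struct ttyq *q, unsigned char c)
{
    if (q->len == q->size) {
        q->dropped++;
        return -1;
    }
    q->buf[q->head] = c;
    q->head = (q->head + 1) % q->size;
    q->len++;
    return 0;
}

/* Reader side: fetch one byte; returns -1 when the queue is empty. */
int ttyq_get(struct ttyq *q)
{
    int c;

    if (q->len == 0)
        return -1;
    c = q->buf[q->tail];
    q->tail = (q->tail + 1) % q->size;
    q->len--;
    return c;
}

/* Release the storage. */
void ttyq_free(struct ttyq *q)
{
    free(q->buf);
    q->buf = NULL;
    q->size = 0;
}
```

With a layout like this, a port carrying SLIP could be given a queue larger than the 1064-byte MTU while the console keeps a small one, and the dropped count makes any loss visible.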
A low level driver that doesn't support flow control and doesn't report (and record) errors (such as overruns, framing errors, …)
I'm not sure what the point would be of recording overruns, we report them now. Once a single report is seen, the baud rate must be dropped since there's nothing else ELKS/we can do to help other than using a larger FIFO.
Again the point is to know that something gets lost and how. Printk is OK, but not very useful on a PC screen w/o scrollback. The idea is to add counters to the tty structure that get incremented when errors occur. Say, one for framing errors, one for overruns and one for discarded bytes at the buffer level. If necessary, have an IOCTL to get to/reset the data and something like stty to report. Preferably a printk when the first error of each type occurs or at set intervals. I've used this in the past and found it very useful.
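A rough sketch of what such counters could look like, in plain C so it can be tried outside the kernel. The struct, enum and function names are made up for illustration and do not exist in ELKS; printf stands in for printk, and the "get and reset" function only shows what an ioctl handler might do with the data.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-port error statistics; field names are illustrative. */
struct serial_errstats {
    unsigned long framing;   /* framing errors reported by the UART */
    unsigned long overrun;   /* receiver overruns reported by the UART */
    unsigned long discarded; /* bytes dropped because the TTY queue was full */
};

enum serial_err { ERR_FRAMING, ERR_OVERRUN, ERR_DISCARDED };

/* Count one error; log only the first occurrence of each type. */
void serial_count_error(struct serial_errstats *st, enum serial_err e)
{
    unsigned long *ctr;
    const char *name;

    switch (e) {
    case ERR_FRAMING: ctr = &st->framing;   name = "framing error";  break;
    case ERR_OVERRUN: ctr = &st->overrun;   name = "overrun";        break;
    default:          ctr = &st->discarded; name = "discarded byte"; break;
    }
    if ((*ctr)++ == 0)
        printf("serial: first %s\n", name);   /* stand-in for printk */
}

/* What a "get and reset" ioctl might do: copy the counters out, then clear them. */
void serial_get_reset_stats(struct serial_errstats *st, struct serial_errstats *out)
{
    *out = *st;
    memset(st, 0, sizeof(*st));
}
```

The cost per error is one increment and one comparison, which fits the point that the overhead is minimal.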
I plan on setting the largest FIFO trigger in the next PR, BTW.
I noticed, that's good. It has been tested and works fine.
But I can rewrite parts of the driver, making it fully interrupt driven w/ hard and soft flow control. And I have the setup to test it.
Great. I wouldn't suggest "fully interrupt driven" at this point, since that's just going to add to the interrupt overhead.
I'll do this in steps, and am awaiting your PR(s) to get committed.
I'll submit a PR shortly, then go ahead and add hardware flow control. Enabling interrupts for the MCR register (modem status interrupt) only would be a good start, and the interrupt routine can manipulate a flag in the serial_info struct that rs_write can use to continue transmitting or sleep using wake_up etc. To keep things simple, just use the 'CRTSCTS' bit in termios.c_cflag to enable HW flow control processing, which can be set by miniterm etc after you get it working.
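For reference, this is roughly how a user program such as miniterm could toggle the 'CRTSCTS' bit once the driver honours it. It is a sketch against the common termios interface (tcgetattr/tcsetattr, with CRTSCTS as the usual BSD/Linux extension), not code from ELKS; the helper names set_hw_flow and port_set_hw_flow are invented here, and whether the ELKS libc offers exactly these calls is an assumption.

```c
#define _DEFAULT_SOURCE   /* exposes CRTSCTS on glibc */
#include <termios.h>

/* Set or clear the CRTSCTS bit in an already-filled termios struct. */
void set_hw_flow(struct termios *t, int enable)
{
    if (enable)
        t->c_cflag |= CRTSCTS;
    else
        t->c_cflag &= ~CRTSCTS;
}

/* Apply the setting to an open serial port; returns 0 on success, -1 on error. */
int port_set_hw_flow(int fd, int enable)
{
    struct termios t;

    if (tcgetattr(fd, &t) < 0)
        return -1;
    set_hw_flow(&t, enable);
    return tcsetattr(fd, TCSANOW, &t);
}
```

Keeping the switch in c_cflag means the same /dev/ttyS0 device can be used with or without hardware flow control.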
I'll pull out some old tricks and methods from the bottom of my bag of experience. Rather looking forward to it.
Actually, sometimes it feels like the system just stops dead in its tracks for - say - 5+ seconds, then continues. And this happens only when using sash. Doesn't make sense.
I have found the memory corruption error I've been working on for two days. Turns out any program that calls grgetnam to get a group name badly corrupted memory. This includes the chgrp command, and ls -l. Since sash has an internal ls, running ls -l corrupted sash memory. That could be the problem, we'll have to see.
Wow, that would be a game changer!
Also it seems like the time command is broken, reporting 0.8s real time when real time is more like 20s.
Keep testing. Let's keep time or sash bugs in another issue, please open one if you find more information, after my PR memory fix.
—Mellvik
|
I'm running qemu.sh with the "CONSOLE=-serial stdio" option uncommented. This allows by default logins on both console and serial (which is macOS Another thought is to implement passing a command line to ELKS at boot - one could specify serial console without recompilation, and display startup kernel messages on it. Another option could specify the startup
|
Yes, I'm sure VB can do that too. Real HW cannot - unless serial console is used, which is sort of self prohibiting when debugging serial. |
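I can think of several reasons this may not be a great idea, or be overly complicated for ELKS. Currently, each TTY device minor number has to be separately allocated in kernel data space within an already-shared array with console, pty, and serial ports, thus allocating 4 more slots for each of the variations on each of ttyS0/1/2/3. The exclusive-use mechanism recently added won't work with different minor numbers. Programs like miniterm, which default to /dev/ttyS0, will have to specify a new device on startup, rather than a dash option to turn on hardware or software flow control. /etc/inittab would have to be modified for each of the variations. But most importantly, any FAT-boot image won't be able to use them, as it is maxed-out on the 16 /dev/ fake-devices available, and current images only have ttyS0, with little that can be changed without affecting other use cases.
But that can all get sorted out after hw flow control is working. I think it necessary to be able to turn on and off hw flow control via a programmatic ioctl on a regular ttyS0 (possibly also via stty) as well.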
Having a mechanism to redirect kernel printk's to the PC screen would help when running serial console.
I would prefer that each of the three issues under discussion be developed and submitted separately for stability and clarity moving forward (hw flow control, error logging, and additional minor devices). I'm interested to see if there's any improvement to non-flow-controlled reliability using sercat instead of cat with the VMIN/VTIME throughput changes. Thank you for all your work, testing and comments! |
@ghaerr,
There's a lot to digest here, and I agree - it's a step by step process. Some of the challenges will (hopefully/probably) be easy when we get there. Others not, and we'll have to prioritize.
As to the FAT fs issue, my opinion is still that it was a bad idea to begin with. An impressive feat indeed, but also an inhibitor to the development of ELKS as it blocks sound choices (like in this case). With ELKS' now excellent support for mounting FAT filesystems, we have - again from my perspective - adequate FAT support, and no need to reduce root fs functionality to an inferior common denominator.
A 'not supported on FAT' is of course also possible, but I have a hard time seeing the benefit or the scenario where a FAT root makes sense.
…-- Mellvik
On 26 Apr 2020, at 17:20, Gregory Haerr ***@***.***> wrote:
I'm also proposing an additional set of tty-devices (minor numbers) to accommodate hw flow control
I can think of several reasons this may not be a great idea, or be overly complicated for ELKS. Currently, each TTY device minor number has to be separately allocated in kernel data space within an already-shared array with console, pty, and serial ports, thus allocating 4 more slots for each of the variations on each of ttyS0/1/2/3. The exclusive-use mechanism recently added won't work with different minor numbers. Programs like miniterm, which default to /dev/ttyS0 will have to specify a new device on startup, rather than a dash option to turn on hardware or software flow control. /etc/inittab would have to be modified for each of the variations. But most importantly, any FAT-boot image won't be able to use them, as it is maxed-out on the 16 /dev/ fake-devices available, and current images only have ttyS0, with little that can be changed without affecting other use cases.
But that can all get sorted out after hw flow control is working. I think it necessary to be able to turn on and off hw flow control via a programmatic ioctl on a regular ttyS0 (possibly also via stty) as well.
Real HW cannot - unless serial console is used, which is sort of self prohibiting when debugging serial.
Having a mechanism to redirect kernel printk's to the PC screen would help when running serial console.
it is very valuable to have such error stats available and the cost is minimal. I'm going to include this as a proposal when adding hw flow control.
I would prefer that each of the three issues under discussion be developed and submitted separately for stability and clarity moving forward (hw flow control, error logging, and additional minor devices).
I'm interested to see if there's any improvement to non-flow-controlled reliability using sercat instead of cat with the VMIN/VTIME throughput changes. Thank you for all your work, testing and comments!
|
I respectfully disagree with both statements. I pointed out in an edit that I was incorrect about running out of FAT /dev/entries, there are a couple that could be replaced. But we will be rapidly approaching a hard limit with how much can be added to the ELKS kernel, in attempts to make ELKS very much like desktop Linux. We've got less than 10k code space available in the kernel, tradeoffs will have to be made. I'm still chuckling about @pawosm-arm's comment that Linux's /dev got so big it was converted to a binary blob. |
Hey @ghaerr, we were talking about different things having similar names! What I mentioned on some occasion recently was DeviceTree [1], a binary blob provided by the bootloader or the firmware (OpenFirmware in most cases) to the kernel, describing in a tree-like structure the system setup. It was supposed to be portable across architectures, standardizing how information normally obtained by the kernel from BIOS or bootloader (using non-portable means of communication) should be propagated. As I understand, what you discuss here is the way of holding stuff in the |
Thanks for straightening me out! :) I'm looking at |
No opinion on that. For now, network security isn't my concern, whatever makes it easier to implement will do. |
Hello @Mellvik, As you've probably seen, the serial driver problems with ELKS have been identified and hopefully completely solved in #664. I would like to have you test serial connectivity on a few real systems, and then perhaps we can close this issue finally! My testing has shown ELKS at 19200 baud received characters keeping up (tested by ^S/^Q on "ls -lR /") and no character loss on both the regular and "fast" driver, and short-burst (less than 1024 characters, tested by "cat file" or ^S/^Q quickly) using the "fast" driver at speeds up to 57600 on a Compaq 386 portable. My other fast 386 desktop system will run at 115200 baud continuously with no data loss. |
I'm rather looking forward to it. Maybe even test running 2-3 serial lines in parallel. :-) Some with, some without FIFO, that would be an interesting experiment.
…--M
|
Thanks @ghaerr,
Seems I have some catching up to do.
I'll close up these and the others pending today or tomorrow.
…--Mellvik
On 19 Jul 2020, at 01:29, Gregory Haerr ***@***.***> wrote:
Hello @Mellvik,
A series of PRs several weeks ago (#592 and #664) should have fixed all the issues brought up here. Please test, and close this issue if you are satisfied. Since this issue has gotten rather long, subsequent serial port requests/enhancements can be brought up in new issues.
Thanks!
|
Summary: Close, but no cigar
Testing on physical HW & virtualbox, the latter turns out to lend itself well to serial testing: Serial line set up to listen to a TCP port, which can be connected to using netcat.
sash < /dev/ttyS0 > /dev/ttyS0 &
should work. When it eventually does, we're almost there. The error message is "Cannot create /dev/ttyS0: error 16". Replacing sash with getty or login root doesn't make any difference. Neither does changing the shell we're running from. Replacing ttyS0 with tty2 or tty3 (virtual consoles) works fine. Further, "sash < /dev/ttyS0 &" works: it takes input from serial and sends output to the console. And vice versa if using ttyS0 for output instead. Seemingly, when the device can accept several opens, this is going to work. (?)
--Mellvik