[kernel, directfd] Redesigned sector buffer for the directfd driver #88
Conversation
Nice work getting the track cache rewritten so quickly! Has this new code been run very much?

I am a bit confused about your recent testing results showing that a 6-7k track cache gives best results: is this for 360k floppies, or a variety of formats? If 360k only, then are we really talking about a cylinder cache, with best results at ~1.5 cylinders, since each track on the 360k is 4.5k? Obviously, if the track cache is 6-7k and running a 360k floppy, auto-seek would have to be on. I know it's always on for DF, but I'm trying to understand which results, if any, might be applicable to BIOS track caching, where auto-seek may or may not be available. I had also believed, probably incorrectly, that a less-than-full-track cache was giving best results, rather than a cylinder (more-than-one-track) cache doing so.

With regards to BIOS auto-seek, I haven't yet tested whether it works on QEMU (but agreed, it likely will). I am concerned about having it enabled by default for ELKS (the BIOS driver is still the default), since if enabled and unavailable, the system won't boot. We also have the compounded problem that if the hardware BIOS and the kernel DDPT get out of sync, or the DDPT is ignored, then auto-seek won't work properly. The BIOS floppy driver creates the replacement DDPT at vector 1Eh and always sets the max sectors field, but I have seen much BIOS ASM source that does not read the 1E vector for max sectors and always uses its own ROM table instead. If/when this happens, auto-seek may fail when the BIOS and kernel disagree on floppy format/type. Kind of a mess, and it likely means any BIOS auto-seek option needs to be enabled after boot? A bit off topic, but...

Having said all this, perhaps the most compatible approach for ELKS might be to default to the BIOS driver for distribution floppies (with auto-seek off or not implemented), and then direct users to run the DF driver with full auto-seek and optimizable cylinder caching, since that driver maintains full control of the FDC, once it has been determined that the PC hardware is compatible with the DF driver. However, that approach trades off speed, async I/O, and now optimized track/cylinder caching for bootability on all systems, which perhaps isn't really the best for most users. The problem is, anybody that gets distribution disks and can't boot results in a support situation, or an unreported evaluation with no further use. |
When you speak of cylinder cache, you mean that when a block is requested that involves a split block and head 0, the first track is read up to the first half split block, then the next track is read in full, head 1. Right? That means that even if the request is just for a single split block, the whole next track is read. Have you tried/tested the idea of only reading to the second half split block, that is, only reading a single extra sector on the same track, head 1? It occurs to me that might also be an interesting case, especially for those systems like 1.44M where full track caching appears to slow things down. The idea would be that some number of multiples of full blocks are always read, including split blocks, but not always reading a full track on auto-seek. What have you found? |
The code has been really beaten up over several days' heavy testing (probably close to a thousand system loads) on 3 different machines and qemu. No xms testing yet, and no occurrences of raw IO hitting the 64k threshold.
(from #81 (comment) )
Since this testing, there is the 386 test run which also
I believed the same thing, and nothing in the reported results indicates otherwise. That said, upcoming results from 360k tests may change our position on that. Testing on 386 hw indicates the same, but it needs to be verified on slow hw. #81 (comment)
You cannot do that as it's a bit set in every read/write operation, not a configuration bit. OTOH, it could be an idea to test whether autoseek is on (a 2-sector read starting at the last sector of head 0) in the BIOS driver probe and set the strategy from the result.
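(For illustration, a minimal sketch of what such a probe could look like, assuming a hypothetical bios_read_sectors() wrapper around INT 13h AH=02h that returns the number of sectors actually transferred - not actual driver code:)

```c
/* Hypothetical wrapper around INT 13h AH=02h: reads 'count' sectors
 * starting at (cyl, head, sec) into buf and returns the number of
 * sectors actually transferred (0 on hard error). */
extern int bios_read_sectors(int drive, int cyl, int head, int sec,
                             int count, void *buf);

/* Probe whether the BIOS honors multitrack (auto-seek) reads:
 * start a 2-sector read at the last sector of head 0 so that the
 * second sector can only come from head 1 of the same cylinder. */
int bios_has_autoseek(int drive, int sectors_per_track)
{
    static unsigned char buf[2 * 512];      /* two 512-byte sectors */
    int got = bios_read_sectors(drive, 0, 0, sectors_per_track, 2, buf);
    return got == 2;                        /* both sectors => MT worked */
}
```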
That's - if nothing else - a very safe route. That said, excluding non-IBM compatibles and barring bugs of course, the direct driver is guaranteed to be compatible, and in that regard safe.
I'm losing you here. I may of course be missing something but how can the PC hardware not be compatible with the driver? If there is something here, I'd be very interested in addressing it.
So it's really a question of trust - and AFAIK not about compatibility but about reliability. And the only way to get that is to really beat it up and get as many as possible to use it. IOW - possibly deliver both variants (double the set of boot images) for a while? BTW, in a way, the direct driver (the TLVC version of it for now) has more hw 'compatibility' than the BIOS driver in that you can configure drive types for XT class systems. |
I have studied your 1.44M floppy track cache testing results #81 (comment) and now see plainly that a 7k cache gives superior boot performance. I still want to make sure I fully understand what is happening. I've studied most of the new driver code in this PR, but want to make sure I'm following: a 1.44M floppy has 18 sectors/track = 9k. However, auto-seek is always being used, right? So when we set the "track" cache to 7k, this means 14 sectors, and the new driver code will always read 14 sectors from the current cylinder (not track), if possible, right? [There would be three cases in the head == 0 case: 1) start sector is < 4, so 14 sectors read on the current track, but not to end of track, 2) start sector == 4, read 14 sectors to end of track, and 3) start sector > 4, read '18-sectors' on the current track, and the remaining on the next head==1 track. For I/O requests where head==1, only read sectors until end of current track.]

In summary, we're really talking about a cylinder cache, where exactly 14 sectors are attempted to be read in each I/O request. In the head==0 case auto-seek is performed, and when head==1, the sector count may be truncated. It seems to me that determining the exact number of sectors to cache could be heavily linked to the contiguous sectors within specific programs being exec'd, rather than native cache throughput. (I suppose that's obvious). It might be interesting to see whether removing the

My first thoughts on all this were "why not always read until end of track/cylinder", but I don't know the speed tradeoff in reading extra data that might be unneeded. And since one of the main points of this exercise was to determine a reasonable cache size so that extra memory can be returned to user programs, overall, it makes a lot of sense to just set a max track/cylinder cache and let the system run with auto-seek head==0 expansion and head==1 truncation. It would be interesting to see continuing "normal" system operation when running programs larger than the cache size.

At the end of the day, it seems this approach is trading off making more memory available to user programs via less fixed cache by using an enhanced auto-seek cache fill previously used only with a single sector on 9-sector floppies. |
What the quote above says is that IF the cache size is 9k AND the track size is 9 sectors AND the read starts at sec1hd0, the entire cylinder will be cached, just like the previously deleted 'full cylinder buffer' driver. It's an illustration of what may happen, not an endorsement (it's rarely good).
Yes - if the cache size allows it. BTW - really, there is no such thing as a split block in this setting :-)
This was how it worked with the old caching mechanism: If head was 0 and sector count odd, one sector from the next track was read in order to always get full blocks. That mechanism always read from sec1 though, so it's not the same thing. OTOH, with the new sector cache, this happens all the time - the smaller the cache size, the more likely it happens. No measurements of this in particular though.
Incidentally, 1.44M (and 2.88M) are the only drive types where 'split block' never happens - they have an even # of sectors per track. That said, this is what my last two days of testing have been all about. Start at cache size 1k, boot and record jiffies, increase by 1k, repeat. It's a slow process, but the numbers are finally in for 286/12.5MHz/1.44M, 386SX40MHz/1.44M, 1.2M, 360K (in 1.2M drive). The latter machine, as we have discussed, delivers best with 1k (no cache), but I did complete testing anyway looking for patterns. Including - as referred to in the PR - 360k w/9k cache. It's worth keeping in mind that what I've measured is system startup time only, not usage (coming later using the same method as before).

Given what I reported for the 286, the numbers are unsurprising - with a couple of exceptions. 5-7k cache size is always best, which one depends on the system. Ignoring the fast 386 not needing the cache at all, it is interesting that the 1.2M drive clocks in with the same speed for 6k and 9k cache sizes. The 360K drive clocks in at about the same speed at all sizes above and including 5k. The numbers need more validation, so the next target is 360K and 720K on the V20, then 360K, 1.2M and 1.44M on the 386DX/20MHz. For now it seems 5K cache size is a great compromise between RAM and speed.

Also, what we're seeing is that the numbers are as dependent on the speed of the system as on the type of drive, which is interesting because the data transfer speed is constant (limited by ISA bus speed and DMA) - slow (360k, 720k) or a little less slow (1.2M, 1.44M). Depending on the next rounds of testing, we may end up with a recommended cache size per drive type/system speed.

Finally, while this is interesting - we're really going overboard with this. Really, it's not all that big of a deal if the system load time is 14 or 20 seconds. OTOH going from 20 to 10s is significant. ... and I think reorganizing the boot image is likely to have much more effect than the cache tuning. BTW - not mentioned before - system startup in my case includes loading fsck (for root, reads superblock, then exits) and loading and starting the network - ktcp, ftpd, telnetd. Needless to say, the 360k image is filled to the brim. |
That's a great idea; the system can determine for itself whether the BIOS has auto-seek or not for the current floppy format!
Agreed. Quite cool actually and a very good reason for ELKS to move to the direct driver on 360k floppies, or at least to probe auto-seek and use it when the BIOS implements it. This gets rid of a lot of problems that would otherwise have to be hacked into
At ELKS, we have to support PC-98 floppies, for instance. They're not "compatible" with the DF driver, and at least new table entries need to be made. So PC-98 needs the BIOS driver. Other examples are users who might boot with 2.88M (emulated or real) - the TLVC driver doesn't support that hardware.
Agreed. And the benefits of the async driver are great (PC-98 notwithstanding) and should become part of the ELKS standard, as it has with TLVC.
That's another good idea. There could be a "basic boot" disk for those that need it, or just refer users to a previous version where BIOS was fully supported at boot time. |
@ghaerr, in order to avoid misunderstandings (I believe I do understand your scenario), look at it this way, this is how the sector cache works (it's really simple):
It is (obvious) - it will be interesting to see how my 'general usage' test works out with the various block sizes. Suggestions about what would make a reasonable general use test are welcome (we may not agree on that though :-) )
This is what the tests with varying cache sizes are supposed to reveal. And one of the contributing factors to the somewhat surprising (well, not really) numbers. More isn't necessarily better. |
Got it. We were kind of posting at the same time, so this matches with my understanding from my last post, thank you.
Fantastic testing! BTW on the business of whether a "fixed" cache size or not makes sense for all systems (given the ability to give back unused RAM to main memory), I am thinking of a way to get the cache DMASEGSZ out of being a fixed constant in config.h and having the /bootopts equivalent be used by setup.S for relocating the kernel - and the latest modifications I've made to init/main.c and setup.S should allow this somewhat easily. More on that later, but the idea would be that the DMASEG would be fixed as now, and a variable DMASEGSZ sectors would then be allocated in low memory, then the kernel code and data segments after that. The biggest issue is that the /bootopts pre-pass for trackcache= would have to be written in ASM and performed by setup.S, but with some restrictions, like requiring trackcache= to be on its own line, I think not too much work.
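(Purely to illustrate the restricted pre-pass described above - the real thing would live in setup.S as assembly; this is a hedged sketch in C, assuming the option starts its own line and a hypothetical prescan_trackcache() helper name:)

```c
#include <string.h>
#include <stdlib.h>

/* Sketch of the restricted /bootopts pre-pass described above:
 * find "trackcache=" only at the start of a line and return its
 * value in KB, or the supplied default if absent.  The real pass
 * would be assembly in setup.S; C is used here only to show the logic. */
int prescan_trackcache(const char *bootopts, int default_kb)
{
    const char *p = bootopts;
    while (p && *p) {
        if (strncmp(p, "trackcache=", 11) == 0)
            return atoi(p + 11);
        p = strchr(p, '\n');        /* advance to the next line */
        if (p)
            p++;
    }
    return default_kb;
}
```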
In the interim, after your results are in then, it might be best to configure a default, say 5K cache, in config.h and then have a normal /bootopts option that would allow setting a lower or no cache value for 1.44M floppies or fast systems. This would have a big benefit of releasing 13k more RAM to user programs like Doom that really need it. (Yes, games still rule with users, it seems).
Agreed. And I am definitely planning on some improvements to the image layout, inode numbering and |
It is unlikely that the BIOS would add or remove the MT (Multitrack) bit depending on drive type. In fact there is little reason not to have that bit set always and just ignore it, like the DF driver does. IOW - if a probe responds 'positively' to a read spanning tracks, it will 99.99% certainly apply to any drive (type) on that system.
Yes, there is PC98, which TLVC does not support. If it's using the 765 FDC, 8237 DMA and the same IO addresses & IRQ, it should work fine - with some density updates in the type table. If you can get the users to test it, it should be low threshold.
It's just a table entry, no code change. I'm saving RAM :-) |
Thanks, I get it now. :) One of the reasons I previously kept bringing up "auto-seek" (MT) was that the BIOS driver doesn't know about it (yet). I am thinking a good tradeoff might be to add auto-seek test capability to the BIOS probe routine and use it when possible, and additionally have config.h-configurable track caching, although possibly with different cache values for BIOS vs DF drivers. Yes, kind of a pain, but possibly necessary for ELKS. A real problem is that I personally don't have access to PC-98 nor real floppy hardware for BIOS, so I'm thinking perhaps just leaving the BIOS track cache as is, and going with your findings for the DF driver, and making that the new default. (And thank you! :)
Good point. I'm going to check my various BIOS sources just for the heck of it, now that I understand how it actually works.
I'm almost sure the PC98 uses the 765 FDC. I'll ask and see whether I can get more information on the BIOS for that system (It's mostly all in Japanese).
Haha, maybe on just the table entry... Perpendicular mode also has to be implemented, which I went ahead with in my driver version. It uses bit 6 of the bit rate field to indicate some additional rate bytes sent to the 82077. |
More on the business of how some BIOSes (from ASM sources) determine sectors per track vs our DF driver. This is from Sergey Kiselev's 8088 BIOS project, which is fairly new and used in Book 8088 and other computers. I will continue analysis with other older BIOSes.

I see now that both BIOS and DF have to send the EOT (end of track, sector number) to the FDC on reads/writes, which determines when the FDC should switch heads if the MT bit is set. The FDC is set up to start the read/write at a start sector, and the DMA controller determines when the end of transfer should occur, from the byte count sent to it. Both BIOS and DF set MT all the time, as you suspected, and the Sergey BIOS uses the DDPT 1E vector to get the all-important max sector (EOT) value. So this would work with ELKS BIOS, even though it never attempts it.

What differs from our DF driver is that the Sergey BIOS doesn't use a table of "probe" entries quite like the DF driver, but instead, like the DF driver, gets the drive type from the NVRAM/config, which sets the drive type, but then unlike DF issues a READ ID command with varying data transfer rates to determine the floppy type within the drive. This opens up the possibility (not sure how much) of a difference between what the BIOS thinks the floppy is and what ELKS would. The bigger question is how this might affect what matters, which is the sectors-per-track value, which could fail an MT (auto-seek) transfer? So the "risk" in using MT reads in the BIOS driver is still just from a BIOS ignoring the DDPT (not the case here), where it would default to its own idea of sectors-per-track from another method (READ ID in this case, which never matters, since the DDPT override works).

IIRC the original IBM BIOS didn't support the DDPT 1E vector. I'll check for sure, but this would make sense, as the whole point of a DDPT was to allow user (DOS) programs to override the BIOS for floppy type. It would make sense that later BIOSes would support DDPT, while early versions could not support it prior to its invention. So it might be the issue is supporting very old systems with a default boot. BTW, ELKS already has workaround code for the lack of INT 13h fn 8 (Get Drive Parameters) for the original IBM PC and some XTs. This had to be added because of course some users wanted to see ELKS boot and run on their original 8088 IBM system.

Given all that, some crazy ideas like "hold the space bar down during boot" to force the BIOS to not use MT/auto-seek come to mind, but one has to wonder whether users would know about that or not. They could be given this information without having to recompile though, which is nice. After boot they could be instructed to edit /bootopts with a workaround option. Is this complete overkill or necessary?!?! |
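(For reference, this is the standard layout of the 11-byte diskette parameter table that the INT 1Eh vector points at, with the max-sector/EOT field the discussion revolves around. A plain C sketch; the field names are descriptive, not taken from any particular source:)

```c
#include <stdint.h>

/* Standard 11-byte Diskette Drive Parameter Table (DDPT) pointed to by
 * interrupt vector 1Eh.  Offset 4 holds the sectors-per-track value
 * (EOT) that MT/auto-seek transfers rely on. */
struct ddpt {
    uint8_t step_rate_head_unload;  /* offset 0: SRT (high) / HUT (low)      */
    uint8_t head_load_dma;          /* offset 1: HLT, bit 0 = non-DMA mode   */
    uint8_t motor_off_ticks;        /* offset 2: motor off delay, timer ticks */
    uint8_t bytes_per_sector;       /* offset 3: 2 = 512 bytes               */
    uint8_t sectors_per_track;      /* offset 4: EOT, the max sector field   */
    uint8_t gap_length;             /* offset 5: read/write gap              */
    uint8_t data_length;            /* offset 6: 0xFF when sector size given */
    uint8_t format_gap_length;      /* offset 7                              */
    uint8_t format_fill_byte;       /* offset 8: usually 0xF6                */
    uint8_t head_settle_ms;         /* offset 9: head settle time, ms        */
    uint8_t motor_start_eighths;    /* offset 10: motor start, 1/8 s units   */
};
```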
I think the important thing about MT is that it's just there, always on, whether reading (or writing) one sector or 36, whatever.
Perpendicular mode is implemented and tested in QEMU, works fine. |
Update on IBM PC BIOS (v1 4/24/81, v2 10/19/81, and v3 10/27/82): I was incorrect above: all three early IBM PC BIOSes support DDPT, which is good news. However, the FDC read command did not have MT added until v3, so earlier versions of the PC BIOS do not support multitrack reads, although there should not be an error if it is tried; the sector count returned would just be lower. I'm not sure if this needs to be handled specially, as any track cache would require separate reads for those BIOSes. For IBM XT BIOS v1 11/8/82, both DDPT and multitrack reads are supported.

On another note, does the DF driver support multitrack writes? This would matter for the TLVC-only raw char driver, as well as single block writes on "split" blocks for odd-sectored floppies, which could be frequent. |
It's a good one. My V20 system runs that BIOS.
That's right - and it's still completely transparent in the sense that it's something I never paid attention to. The sector count is from the drive parameter table and all is automatic.
Well, that's good news, isn't it?
This sounds like a variant of (effectively the same as) what the DF driver is doing. And this is why I had to introduce the
I don't see how this would be a problem. If the drive is configured (or found) to be 360k (or system is <AT), assume 9 sectors and read 9+10 to see if it fails - should work regardless.
Probably just setting the drive type to 360K? Would make perfect sense.
I think it will be a lot easier than that. If the boot process (actually the BIOS driver) gets a floppy read error (we'll figure out the type, but probably sector not found) it would be easy enough to backtrack and redo that particular IO request, and if necessary change a parameter setting, don't you think? |
So you could peek the BIOS for version #, but the question is still open as to other (compatible) BIOSes - how compatible and since when? It sounds like a probe is worth it regardless.
Like I said, the MT bit is simply always there, so yes. Always. The thing is (again), and I think this is important - there is never a reason to ever remove it. |
It occurs to me that the prospect of getting rid of the entire DDPT mess is a strong case for the DF driver. :-) BTW I seriously doubt that any IBM PC or compatible ever had a 765 without MT bit support. Chips shipped as early as 1979 had the MT in place, 2+ years before the PC. That doesn't help much though if the BIOS doesn't use it, so I think your probe strategy is both necessary and safe. |
What I was getting at was not the FDC having the MT capability or not, but rather the driver, since it has to be coded to start another I/O request should the BIOS not honor the requested sector count assuming MT (when MT reads aren't performed in BIOS).

Which brings me to my next point: I now realize there were some design mistakes made in the (my) track cache code for BIOS. It was designed as a full-track (not cylinder) sector cache, performed under the untested assumption that reading full tracks would be faster than reading multiple sectors. You're finding the real results of such an assumption now. Reading full tracks from sector 1 was removed via an option ifdef quite a while ago though. Also, I was unaware of, or at least did not take into account, the ability of most BIOSes to perform MT reads, particularly across split blocks. So the track cache code is written to only consider a single track number and associated sector count, forcing somewhat worst-case behavior on split block reads for odd-sectored floppies (after all these years!!). Because of this, there is an upper level loop which continues to read an additional sector (or any required) to complete the I/O (read or write) block request.

Now I realize that the BIOS driver can be easily improved without the need for probing MT capabilities at all: simply request the I/O assuming MT (or a longer fixed-cache request) using INT 13h, and then accept the returned sector count, for either the I/O request or the cache fill request; by not treating a less-than-requested sector count as an error, the upper loop will set up another I/O request for the remaining data. The routine will need a bit of recoding to calculate the ending sector and DMA etc, but it'll work for non-MT and MT capable BIOSes because the upper loop will reschedule remaining I/O in the non-MT case. In the track/cylinder cache fill request, the cache fill will exit the upper loop rather than continue, which prohibits non-MT BIOS track caches from issuing two I/O requests. Sounds complicated, but actually pretty straightforward. Just like the new DF cache fill code: a look at your rewrite seems to show more simplicity, and the upper level will also request additional I/O if the sector count (or DMA wrap) requires an additional request to be performed.

For BIOS, the DDPT being handled in all BIOS versions means we can pretty much guarantee the I/O to match the kernel's and BIOS's idea of the floppy. I suppose that's why the existing BIOS driver works on all known systems, albeit slowly until now.
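(A hedged sketch of the reworked upper loop described above: issue the full MT-sized request, accept whatever sector count the BIOS returns, and re-issue for the remainder instead of treating a short transfer as an error. Helper and field names are illustrative, not the actual ELKS routines:)

```c
/* Same hypothetical INT 13h wrapper as earlier: returns the number of
 * sectors actually transferred (0 on hard error). */
extern int bios_read_sectors(int drive, int cyl, int head, int sec,
                             int count, void *buf);

/* Read 'count' sectors starting at a linear sector number, tolerating
 * BIOSes that stop at the end of the current track (no MT support):
 * a short return is not an error, the loop simply re-submits the rest. */
int read_blocks(int drive, int spt, int heads, long lba,
                int count, unsigned char *buf)
{
    while (count > 0) {
        int cyl  = lba / (spt * heads);
        int head = (lba / spt) % heads;
        int sec  = (int)(lba % spt) + 1;        /* sectors are 1-based */
        int got  = bios_read_sectors(drive, cyl, head, sec, count, buf);
        if (got <= 0)
            return -1;                          /* hard error: give up */
        lba   += got;
        buf   += (long)got * 512;
        count -= got;
    }
    return 0;
}
```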
Yes, but I'm pretty sure that for non-MT BIOSes, the BIOS limits the I/O to the last sector on the current track/head, and returns the actual sector count for the performed I/O, rather than an error. We'll have to see for sure, I'll check BIOS ASM to confirm. And as you point out, leaving the actual boot process the same (which doesn't rely on MT) gets us past the hard part of the boot, and the BIOS driver can handle irregularities without giving up like the space-limited boot sectors have to. |
Amazing conclusions on the TLVC load times! Am I reading it correctly that for all but the slowest systems, having no cache is best, regardless of drive type? I don't have a theory on why that is, other than perhaps the applications themselves are very slow and can't keep up with I/O?! It will be interesting to see what you come up with on the V20 (slow machine, right?). Is that your IBM 5150 with a different CPU, or is the 5150 a separate machine?

If the testing concludes with no track/cylinder cache except for the slowest machines, perhaps the CPU test in setup.S could be used to either allocate a 7K (i.e. a default in config.h) track cache or none at all, automatically. In this way, the max memory is given to user applications and it automatically works on everything. There could be an additional config option (or just use CONFIG_TRACK_CACHE=N) to force track cache allocation, for which setup.S ignores the CPU type and always allocates a cache. A permanent /bootopts trackcache= might also make sense for testing/fine tuning. It also seems we might want to keep the timing information in the driver(s) so that a user can see for themselves the results of their own cache settings. I'm finding this research extremely interesting! |
I think we need to view it the other way around, sector cache is good except for the fastest systems. Of course that depends on how we define 'fastest'. The 12.5MHz 286 machine really needs the cache, the 20MHz 386 is yet to be tested (the blessing of having a hw museum in the garage).
Again, maybe the other way around, as alluded to before. The 40MHz system is so fast it can field the next request before the next sector passes. The V20 is 'dream machine ii' as described in the wiki. The 5155 (portable!!) is a different system, original 4.77MHz. To be tested.
I agree with part of this regardless of the discussion above, more about that later.
touché.
Me too! |
I understand that I still haven't been able to convey how this works. I was NOT talking about the FDC but the driver - although the statement applies to both. There is NO WAY to turn on or off multitrack, it's not a mode or setting, it's the read/write command. Think of it this way: The FDC has two sets of read/write sector commands, one set implies MT, the other does not. There is no reason to ever use the latter, and the DF driver has no idea that it exists. It is really of no use to anyone in any context AFAIK. The question is whether (and which) BIOSes use one or the other, which is what the probe we have talked about will reveal. My guess is that most if not all use the MT commands. |
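(In command-byte terms, what is being described looks roughly like this - standard NEC µPD765 values, shown for illustration only: MT is simply the top bit of the READ/WRITE DATA command byte, so "MT on" is a matter of which opcode value gets sent, not a separate mode:)

```c
/* NEC uPD765 command-byte flags and opcodes (standard values).
 * "Multitrack" is simply bit 7 of the READ/WRITE DATA command byte;
 * there is no separate mode register to enable or disable it. */
#define FDC_MT   0x80   /* multitrack: continue onto head 1 after EOT */
#define FDC_MFM  0x40   /* double density (MFM) recording */
#define FDC_SK   0x20   /* skip deleted data address marks (reads only) */

#define FDC_READ_DATA   0x06
#define FDC_WRITE_DATA  0x05

/* What drivers typically send for a normal MFM transfer with MT always on: */
#define FDC_READ_CMD   (FDC_READ_DATA  | FDC_MT | FDC_MFM)   /* 0xC6 */
#define FDC_WRITE_CMD  (FDC_WRITE_DATA | FDC_MT | FDC_MFM)   /* 0xC5 */
```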
Yes, I agree - this sounds like a workable and quite simple solution. The ability to return partly completed requests like that is all you need to add raw access too, BTW.
Actually, (surprise! :-) ) this brings up a different issue on my list, to increase the read retries during boot. I'm experiencing (on old 360k drives) that TLVC fails to boot (repeated read errors) on floppies that DOS reads just fine. I don't know if the difference is the # of repeats, but it may be and I cannot afford not to try ... |
Thanks, yes I get it now - MT is bit 0x80 in the R/W command. Having now read several BIOS sources, only the two earliest versions of the IBM PC BIOS don't set it; after that, everyone figured out the advantage of always setting it. That said, my concerns were (and now still are - see below) about drivers that may not use it - which include the ELKS BIOS driver. That driver always properly sets a DDPT, but then never attempts a read/write across a track boundary (for the cache or single block). So odd-sector split blocks are always split into two I/O requests. [EDIT: |
This is pretty much exactly what I've been thinking - it seems (pending more testing, which is ongoing) we need to know the CPU - and probably the speed. The latter may be tricky, but let's get back to that if needed. The
I'll say 'yes to everything' - autotuning within the limits of the assigned RAM, RAM assignment via CONFIG and the ability to force the sector cache size via
Can you elaborate on this one? |
You may be able to handle such error returns from the BIOS by catching them in the driver and ignoring them, returning the reduced read as you suggested above. The result should be the same - assuming the first part of the read got into the buffer. If not, the driver could resubmit a partial request ... |
I'm changing the bootopts setting for the sector cache giving it its own 'name' as you suggested. |
Fdcache= sounds good! |
Thanks, |
They are indeed, a reminder as to why the DMA is so important, for floppies and MFM drives: The need to get the CPU out of the way (I've been reading up on XTIDE quite a bit lately, it's interesting how they think to squeeze the last little drop of performance out of the 'channel').
Good point, and I suspect our venerable

It would be interesting to compare the output from your recent QEMU IODELAY code to the physical numbers I get. BTW I wanted to get a readymade image of ELKS (360k) 0.8 yesterday, but could not find it. The 'downloads' link sent me to 'releases' instead... I must have missed something.
That's a good idea, I'll add that.
Great, it will be really interesting to see what this does in 'real life', on 'Dream System II' (V20).
It'll be interesting to see how reliable it is possible to get - you may have to add something to account for seeks. My take is that if you're within +-25% of the real thing it's good enough for the purpose.
Doesn't make sense, does it? Did you figure out an explanation for this? My old thing about moving startup programs to /etc keeps popping up in the back of my head...
Yes, this is a good idea! |
After starting to read this article about estimating file access times for floppy disks I realized you'll definitely want this on your bedtime reading list. Everything we wanted to know about floppy timings. Some good stuff! |
Holy Mac, and 1983! It ended up being morning reading instead :-) and yes, very thorough indeed (and very academic towards the end). We now end up with an entirely new challenge: How to apply this knowledge to the challenge at hand without going overboard with details that have minimal effect on the result. The challenge at hand has (more or less a summary of our discussions so far) two parts:
The former benefits most from a) fast sequential reads, b) optimal placement of metadata and files on the floppy, c) minimizing directory and file accesses. I don't have the conclusions - but I do enjoy the discussion! Thanks for the entertaining link! |
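(To put rough numbers on the kind of tradeoff being weighed here, a small back-of-envelope helper using the usual datasheet-style figures - 300 RPM, i.e. 200 ms per revolution, for most drive types (360 RPM, about 167 ms, for 1.2M), plus per-track seek and head-settle times. These constants are generic assumptions, not measurements from this thread:)

```c
#include <stdio.h>

/* Rough floppy timing estimate: one full-track read costs at least one
 * revolution, plus average rotational latency to reach the first sector,
 * plus seek and head-settle time if the head had to move. */
static double track_read_ms(double ms_per_rev, int tracks_to_seek,
                            double seek_ms_per_track, double settle_ms)
{
    double seek = tracks_to_seek ? tracks_to_seek * seek_ms_per_track + settle_ms
                                 : 0.0;
    double latency = ms_per_rev / 2.0;      /* average rotational latency */
    return seek + latency + ms_per_rev;     /* plus one revolution of data */
}

int main(void)
{
    /* Example: 300 RPM drive, 3 ms/track seek, 15 ms settle,
     * one-cylinder seek before a full-track read. */
    printf("~%.0f ms per cached track\n", track_read_ms(200.0, 1, 3.0, 15.0));
    return 0;
}
```

Even this crude model shows why avoiding seeks and unneeded track reads dominates everything else: a single cached track costs a third of a second before any data reaches the caller.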
My first pass at lowering system startup time is in ghaerr/elks#2071 and ghaerr/elks#2073. Prelim results on emulated floppy delay times show halving boot time from 10 to 5 seconds using BIOS driver ( This is also using an unmodified 9k (18 sector) track cache (for the moment) for 1440k floppies. It will be interesting to see what the results may be on using the enhanced TLVC DF driver with configurable block (non-track) caching. I am not yet able to test non-track caching as I have not added your DF driver cache code to ELKS yet. Also, for compatibility, I'm thinking of leaving the ELKS BIOS driver as it is, although the above PRs also added TRACK_SPLIT_BLK and SPLIT_BLK processing to the BIOS driver. These options have not (yet) been shown to measurably increase speed though. With the new |
Thank you @ghaerr - this is an important step, the ability to play around with these layouts and see how they work in practice. I'm rather looking forward to play with the new I've continued to enhance the |
I've been looking a bit harder at both the original and enhanced floppy cache code in the DF driver, since adding the IODELAY to ELKS for delay emulation purposes. As I mentioned in my PR, the ELKS DF driver is currently running slower than the BIOS driver since it always uses full-track reads (e.g. starting from sector 1). I was thinking of moving towards your new implementation but taking it in multiple steps: no caching, old caching (full track), semi-track caching (start from requested sector, same as BIOS), and the new "fixed size" caching. The ELKS BIOS driver has defines that allow each of these (except the last, until I add it) to be turned on or off. I think I also mentioned I'm having trouble seeing much improvement in the BIOS driver when turning on MT/split block reads - funny, as I had thought there'd be a significant difference.

When writing, the BIOS driver never stores the written block in the cache, but instead always invalidates the cache, then writes just the block. I just noticed though that the DF driver does not: it seems to store the requested block into the track cache, then goes ahead with the write. However, it seems to me that each block write will always result in a full track cache write, correct? If so, doesn't this produce worst-case behavior for any program that writes files sequentially? That is, each block will be written as a full track, over and over again. IMO this could contribute greatly to longer timings, but I'm not sure whether you're using writes, or only reads, for your time tests above. Can you clarify whether only a single block is written on writes, if I am misunderstanding? Thank you! |
While studying the DF driver and running some copy testing between drives, I noticed that the enhanced

The enhanced

Also, I happened to notice in the TLVC directhd.c driver that when allocating the

I have written a disk-to-disk copy script from which I hope to further analyze the problem of the track cache being invalidated, then fully filled, successively, possibly for every few blocks transferred when the source and destination drives are different. This seems to be an inherent problem in both the BIOS and DF cache code. This is similar to the potential problem mentioned above, but not exactly.

It seems we are going to need a better cache strategy for READ vs WRITE as well as when the drives are not the same. All of this is basically the result of having a single cache for multiple drives, as well as not really handling write-throughs in an efficient manner; even though ELKS BIOS and TLVC are coded a bit differently, I fear the throughput results will be worse than they could be. More results coming, and I hope to publish some test scripts we can both use. |
I agree, this is indeed surprising. I'm rewriting
No, the write will always be of the requested block, the cache is never flushed to disk. That said there are things to be discussed about this as you allude to in the next post, I'll get back to that.
There may be code here that I have not touched yet, I'll take a look. |
Thanks for the heads up. I'll hold off importing the code until you send the 'clear' signal :-)
Thanks, you really have a keen eye!! It's bad practice I know, but I left it out because this alloc only happens during probe, which is before most of the stuff using the heap gets started.
This is a good point and it's easy to fix, probably a couple of lines of code only. I vaguely remember having thought about this way back and decided to ignore it because floppy-to-floppy copies happen so rarely (my take). Most likely on floppy only systems - and in testing like the modus operandi we're in now. I have a number of changes in the |
mfs is all fixed and working well in ghaerr/elks#2080. I'm now going to take a deeper dive into the DF driver cache routines with attention towards READ vs WRITE and multi-floppy performance. My ultimate goal is to write an upper level caching layer that works identically for both BIOS and DF drivers, with options for FULL_TRACK and SEMI_TRACK along with a FIXED_CACHE option with a runtime configurable cache size. Within the driver, an option to handle SPLIT_BLK (single sector at end of track when no cache) and TRACK_SPLIT_BLK. The options sound complicated, but aren't really, and would allow for continued or delayed analysis of floppy performance. I'm thinking of always limiting block writes to single blocks (never tracks; yet to be confirmed in the DF driver) and also making a single block cache available at DMASEG (before any track cache), which would allow multiple drives or floppy disks to perform single-block write I/O without invalidating any track cache. As we are seeing, the whole business of understanding performance is seemingly pretty complicated, so having more options is likely better, certainly for analysis.

On another subject, similar to lifting the cache code above the BIOS driver (for use with both BIOS and DF), the same needs to be done for the old-fashioned BIOS "probe" code, where both ELKS EPB and DOS BPBs are examined by reading the floppy boot sector when first opened. This code is badly missing from the DF driver; it avoids problems that otherwise have to be guessed around, especially for DOS disks, and helps firmly identify the media type for bootable ELKS/TLVC floppies as well. |
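(One way the option set named above might be organized - purely illustrative, using the names from the comment rather than existing config symbols:)

```c
/* Illustrative only: a possible shape for the caching options discussed,
 * with names taken from the discussion, not from existing headers. */
enum fd_cache_mode {
    FD_CACHE_NONE,        /* no caching, read exactly what was asked for     */
    FD_CACHE_FULL_TRACK,  /* always read the whole track from sector 1       */
    FD_CACHE_SEMI_TRACK,  /* read from the requested sector to end of track  */
    FD_CACHE_FIXED        /* read a fixed, runtime-configurable amount       */
};

struct fd_cache_config {
    enum fd_cache_mode mode;
    int  size_kb;              /* only meaningful for FD_CACHE_FIXED         */
    int  split_blk;            /* handle the single split block at EOT       */
    int  track_split_blk;      /* let cache fills cross the track boundary   */
};
```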
thanks.
Agree on this, and like I said before, this is a matter of changing a few lines in the DF driver. The writing is where it should be: there has never been a cache or full-track write, just a block write-through if hitting the cache, otherwise invalidating the cache and writing the block in question. If raw, the first block of DMASEG may be used as a bounce buffer. Avoiding the cache invalidation on writes is on my list, and unlike a previous comment, it may occasionally be useful even when operating on a single floppy.
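(In code terms, the write path just described - write only the requested block through, patch the cached copy only if that block already sits in the cache, never flush the whole cache to disk - looks roughly like this. Names and the 1K/2-sector block assumption are illustrative, not the driver's own:)

```c
#include <string.h>

/* Illustrative write-through based on the behavior described above. */
struct fd_cache {
    int   drive;          /* drive currently cached, -1 if invalid */
    long  first_lba;      /* first sector held in the cache        */
    int   nsectors;       /* number of valid sectors               */
    unsigned char *data;  /* cache memory (DMASEG area)            */
};

extern int fd_write_sectors(int drive, long lba, int count, const void *buf);

int fd_write_block(struct fd_cache *c, int drive, long lba,
                   const unsigned char *block /* 2 sectors = one 1K block */)
{
    if (c->drive == drive &&
        lba >= c->first_lba && lba + 2 <= c->first_lba + c->nsectors) {
        /* keep the cache coherent with what is about to hit the disk */
        memcpy(c->data + (lba - c->first_lba) * 512, block, 1024);
    }
    return fd_write_sectors(drive, lba, 2, block);   /* write-through */
}
```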
I've always considered getting rid of the (D)DPT a blessing, or are you talking about something else? What would be the benefit of adding this to the DF driver? |
Cool. Although costing 1K bytes, I think having an always-available DMASEG for single block operations opens up the ability to improve the track cache handling. I'm wondering whether it might be a good idea to support caching only on the boot floppy drive, rather than flip-flopping between drives (when copying), as even the inode lookup for a file-create copy from drive 0 to drive 1 would currently end up invalidating the drive 0 cache and track-caching drive 1, only to reverse that on the first copied block, etc. Since you're thinking that drive 1 is much less used, perhaps a boot-drive-only cache makes more sense. More details will be found on actual (worst case) behavior after adding more trace code so we can see what's happening during file copies, not just read-only boot actions.

Our discussions are proving quite useful, as the mfs image packing mod is showing a 50% boot speed increase with the BIOS driver, which uses semi-track caching, and a 40% speed increase using (your original, on my ELKS) DF driver full-track caching. This is a huge deal, given what we've been seeing previously. These results are using QEMU IODELAY delay emulation, but I think they'll be fairly close to real hardware. If this turns out to be the case, then it appears the biggest speed problem might be floppy seek operations, along with track-caching only needed data. The jury's still out on 386+ systems with image packing and whether a cache helps at boot for that case. Yes, your testing as always exposes deep rabbit holes!!
Given my possible conclusions above, that's probably a very good idea. I'm going to try to add some code that shows seek movements between track reads so we get a better idea about what might be the slowest aspect of floppy performance.
Something else entirely - the BPB and EPB provide different information, required for fast floppy results. Here's my definitions of each:
TL;DR In summary, four years ago a lot of time was spent creating a reliable, fast probe routine that allowed for very fast floppy identification without the need for any special handling from the BIOS (except, of course, providing a DDPT when calling the BIOS to override its settings). My comment was about the fact that all of this is entirely missing from the DF driver.

Properly engineered, the ELKS kernel and its use of BPB and EPB can/should be decoupled from the TLVC/ELKS DF driver re-implementation of an FDC controller driver, just like is being done with the BIOS driver. The way I'm starting to look at this is that the BIOS or DF driver should only concern itself with very low level read/write FDC handling, not track caching or EPB/BPB identification. (My first pass at this was a year ago when I separated bios.c from bioshd.c). That said, of course the FDC itself needs to know data rates, HUTs, etc and may still have to perform the current "probe" (just like any real BIOS does). Complicated? We wouldn't be interested if it weren't :) |
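(For readers following along, these are the classic DOS BPB fields at offset 0x0B of a FAT boot sector that such a probe reads to identify the media. This is the standard layout, not the author's definitions; the ELKS EPB is the ELKS-specific equivalent and is not reproduced here:)

```c
#include <stdint.h>

/* Classic DOS BIOS Parameter Block as found at offset 0x0B of a FAT
 * boot sector (standard layout).  The geometry fields near the end are
 * what a media-type probe cares about. */
#pragma pack(push, 1)
struct dos_bpb {
    uint16_t bytes_per_sector;    /* 0x0B: usually 512                      */
    uint8_t  sectors_per_cluster; /* 0x0D                                   */
    uint16_t reserved_sectors;    /* 0x0E                                   */
    uint8_t  num_fats;            /* 0x10                                   */
    uint16_t root_entries;        /* 0x11                                   */
    uint16_t total_sectors;       /* 0x13: 720/1440/2400/2880/5760 ...      */
    uint8_t  media_descriptor;    /* 0x15: 0xFD=360k, 0xF9=720k/1.2M, 0xF0  */
    uint16_t sectors_per_fat;     /* 0x16                                   */
    uint16_t sectors_per_track;   /* 0x18: the value that matters here      */
    uint16_t num_heads;           /* 0x1A                                   */
};
#pragma pack(pop)
```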
I added all the same track cache debug display code in the BIOS driver to the DF driver for comparison in ghaerr/elks#2081. I've left a detailed description there, but the quick summary is that drive-to-drive copies are working quite well in both drivers, and I was not correct about track cache invalidation being a problem (explained there). Also, my num_sectors calculation was found to be incorrect for IODELAY emulation, and that correction resulted in the boot times between drivers being extremely close, at 4.5s for BIOS and 5.0s for DF, using the optimized mfs image allocation. (The current Image.all doesn't include files necessary for high speed net start operation; feel free to add the files. When using DEBUG_CACHE and debug=1, the system will display the files opened or execed in the order processed, and you can use that information to quickly add more files to Image.all in that same order). |
Yes, this line of reasoning does make sense. As you found out, the DF driver already does the right thing on writes and doesn't touch the cache unless it has to (needs the cache for bouncing). Reserving the cache for the root drive (or drive 0 if not booting from fd) is likely to be good in all but some very rare situations. (I've been using df1 as root drive quite a bit during the recent benchmarking; I never thought about that having practical application before.) |
I've had limited screen time these last few days, but will check in the current
OK, I get it. Still, it seems to me that the setup we have (in the DF driver) is more than adequate for the purpose - or rather, I'm missing the point on what can be done at a higher level that improves the situation. The thing is, we need access to FDC-level errors in order to probe correctly anyway, and it seems to me the driver is the perfect place to do this.
Completely agree. Where digital meets physical - that's where real problems are faced - and solved, right? Who would have thought we'd sit here counting milliseconds and rotational delay in 2024? BTW, the |
Oh geez, not good! I just checked here and the required binutils-ia16 and gcc-ia16 are still there... what error were you getting, could you download/build the cross compiler using
My issue is code portability - why duplicate all the BIOS probe code and BIOS cache code when, if designed better, both could be reused? It's the old problem of programmer reinvention desires vs reuse.
Not really. The old-fashioned (but much needed) "probe" code just needs an error return (0 or not 0, meaning success or not) from the driver. This is easy to do, as the buffer system uses the DF and BIOS driver return code every time in order to mark a buffer valid or not. Why rewrite (and test!?) all the old sector probe code when it's already written and working well?

It's all good. As a guy who ends up maintaining all this stuff, I find that duplicated code heavily adds to the maintenance burden. Sometimes it's not all about experimentation but rearranging spaghetti threads in order for software to talk to each other. A similar issue is being discussed with regards to network driver API compatibility. I use these sorts of discussions to learn about how software needs to be architected for reuse. In the longer term, I have seen the mess created with arbitrary contributions by programmers long departed. That resembled the insane mess of crap I found when I first got involved with ELKS 4+ years ago. It's been a long haul!!!
My current take on floppy, hard disk, network and ramdisk drivers is that they should all do as little as absolutely possible, and let libraries or upper level routines share any common code. The packet driver spec is a good example of this (or at least was, until the kitchen sink got added and now many only run on DOS). There was a day when I wanted to just load packet driver binaries for ELKS... wouldn't that be useful!? Thank you for your testing and conversation! |
The wget in the build failed, so I went to the repository - which was empty except for the message about moving to GitLab and Codeberg. I retried now and there are 3 repositories visible. I haven't retried the build script.
We're coming from different sides on this. I have very limited interest in the BIOS driver and believe its role is for emergencies only. The direct driver is AFAIK desirable in every way except size. As long as we're on PCs, there are no compatibility issues that would benefit from using BIOS code instead. As to error codes, I disagree. Reprobing on errors is in general not desirable, and very few hard errors may actually indicate wrong format.
Appreciated, and a goal we share. Of course, ELKS and TLVC have different goals so it is not unnatural that some divergent choices be made.
Indeed, that would be incredible. I don't know how efficient the packet drivers are - what they deliver on MTCP (DOS) is really good, but then, they own the machine while running. Again we share the goal, and again I disagree on the floppy driver: to me, probing is a driver issue, but you do have a point since you have two drivers to worry about. I don't...
|
I wonder what happened - whether Github was down at the moment (I've had wget fail a number of times previously) or whether the repos actually went away for a while. I have saved download .gz files for everything if we need it.
IIRC you participated heavily in the demand for, requirements and testing of the BIOS probing driver code years back. You don't think all that probe code/work is useful to have/use anymore? It isn't involved in any read/write sector retry or error handling at all. |
Interesting - I retried just now (morning) and it works fine, so all good. Somewhat worrisome though that @tkchia seems to be moving away from github.
I don't recall, but it sounds about right. I haven't looked it up, but I suspect it's all good and useful. At that time it was the only alternative, so it was well spent effort. Today we do have an alternative, and what I'm saying is that I consider the DF driver to be the better choice and IMHO keeping the rather simple, table driven format probing in the driver is good. We don't have to agree on this and like I said it's only natural that ELKS and TLVC have different priorities. |
@Mellvik: I'm finally at the point of adding the "fixed size cache" used in your testing here for ELKS. I was wondering about the exact algorithm used, say for a 6K cache (as opposed to "full track caching, or full track plus 1 MT caching", etc). When setting up the multisector read for the cache fill, I'm sure the simple case of just reading from the requested starting sector for 6K bytes is used when there are enough sectors left to perform an MT read, automatically switching the head from 0 to 1. But what happens when there are not enough sectors left in a single FDC-programmed MT read to fulfill the cache fill? Is the I/O split into multiple requests to get exactly 6K each time, or is the cache actually truncated in these cases?

For instance, what happens on a read request to initially fill the cache on a 1.44M floppy, starting on head == 1 and sector 15? Does the driver just read sectors 15-18, and set the cache to 2K in this case, or is the read continued with another I/O request to read the remaining 4K of the "fixed cache"? I am assuming that another I/O would be necessary since MT doesn't work to advance the cylinder, right?

The simple solution would be that the I/O request is truncated to what can be performed on a single MT I/O request. That is, the cache is filled fully only when the requested sector and head allow for a single FDC MT read request to do so - in the above example, 2K valid cache. Thanks! |
Glad to hear that @ghaerr, glad to contribute.
No, I use a test for how much is left of the cylinder, and use that in the dma request.
That's right. And if I remember correctly, that's what the
You got it already! Thank you. |
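(A sketch of the clamp described in the replies above: the fill size is whichever is smaller, the configured cache or what remains of the current cylinder from the requested sector onward. Names are illustrative, not the driver's own:)

```c
/* Cache-fill sizing as described above: never program a transfer past
 * the end of the current cylinder, even if the configured cache could
 * hold more.  MT handles the head 0 -> head 1 switch, but a cylinder
 * change needs a new request, so the fill is simply truncated. */
static int cache_fill_sectors(int start_sector /* 1-based */, int head,
                              int sectors_per_track, int cache_sectors)
{
    int left_on_track    = sectors_per_track - start_sector + 1;
    int left_on_cylinder = left_on_track +
                           (head == 0 ? sectors_per_track : 0);
    return cache_sectors < left_on_cylinder ? cache_sectors
                                            : left_on_cylinder;
}
```

With the numbers from the question above (1.44M, head 1, sector 15, 18 sectors/track), this returns 4 sectors, i.e. the 2K truncated cache.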
This PR introduces a new sector cache implementation replacing the old track buffer in the directfd driver. The 'TRACK_CACHE' configuration option is also back, for reasons detailed below. This is part of a floppy 'boot and use' performance enhancement project discussed in #81.

Some specifics about the changes: the cache size is set via the xtflpy= setting in bootopts. This is a temporary solution until the sysctl facility from ELKS has been implemented. The metric is kilobytes; if missing, the value is 0, which is read as 1, the minimum.

Also on this PR: changes to the menuconfig page for block devices.

Performance

Surprisingly, it turns out that fast machines (like a 40MHz 386) do not benefit from floppy caching at all. Thus, turning off the cache using menuconfig and releasing most of the memory is the right thing to do. Also somewhat surprising: a smaller (than max) cache, like 6 or 7k, will in most cases give better performance than either a smaller or a larger cache. Check #81 for more details about this.