
[kernel, directfd] Redesigned sector buffer for the directfd driver #88

Merged
merged 3 commits
Oct 1, 2024

Conversation

Mellvik
Owner

@Mellvik Mellvik commented Sep 29, 2024

This PR introduces a new sector cache implementation replacing the old track buffer in the directfd driver. The 'TRACK_CACHE' configuration option is also back, for reasons detailed below. This is part of a floppy 'boot and use' performance enhancement project discussed on #81. Some specifics about the changes:

  • The old track cache would always cache at least a full track, from sector 1 regardless of which sector was actually requested (not very useful, and time consuming). Further, the buffer size was always 9k (18 sectors) - if compiled in.
  • The new sector cache has a very different approach:
    • Fills the cache from the requested sector to the end of the cylinder or end of cache, whichever comes first.
    • The cache can be any (even) number of sectors, from 2 (1k) to 18 (9k), or even larger, which doesn't make much sense.
    • This means:
      • Cache size 1k behaves exactly like having no cache at all
      • Cache size 9k combined with 9 sectors per track floppies (360k, 720k) means full cylinder cache if the requested sector is the first on track (head) 0 of that cyl.
      • The cache size can be tuned and set to whatever works best for the actual environment.
    • Cache hits are counted and reported on device close (release), which unfortunately will never show for a root file system (it is never released).
  • The cache size can be set using a third parameter to the xtflpy= setting in bootopts. This is a temporary solution until the sysctl facility from ELKS has been implemented. The unit is kilobytes; if missing, the value is 0, which is read as 1, the minimum.
  • For now, the actual cache size and the available cache size are reported at boot time.
  • The variable size and potential absence of the sector cache forced a rewrite of the raw driver's mechanism for handling 64k physical memory boundaries. A raw request whose buffer spans such a boundary will be split into 3 (occasionally just 2) physical requests, isolating the boundary block to 1k using the bounce buffer (which is the first k of the sector cache if present) - see the sketch below. This part of the driver has had limited testing.
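
To make the boundary handling concrete, here is a minimal sketch (not the actual TLVC code) of how a raw request crossing a 64k physical line can be split; raw_chunk() is a stub standing in for the real per-chunk submission, and the 1k BLOCK_SIZE matches the bounce buffer:

```c
/* Sketch only: split a raw request that crosses a 64k physical (DMA)
 * boundary into 2-3 chunks, routing the straddling 1k block through a
 * bounce buffer. */
#include <stdio.h>

#define BLOCK_SIZE 1024UL

static void raw_chunk(unsigned long phys, unsigned long len, int bounce)
{
    printf("chunk: phys=%#lx len=%lu %s\n", phys, len,
           bounce ? "(via bounce buffer)" : "(direct DMA)");
}

static void split_raw_request(unsigned long phys, unsigned long len)
{
    unsigned long next64k = (phys | 0xFFFFUL) + 1;   /* next 64k line above phys */
    unsigned long before  = next64k - phys;          /* bytes below that line    */

    if (len <= before) {                             /* boundary not crossed     */
        raw_chunk(phys, len, 0);
    } else if ((before & (BLOCK_SIZE - 1)) == 0) {   /* line on a block edge: 2 chunks */
        raw_chunk(phys, before, 0);
        raw_chunk(next64k, len - before, 0);
    } else {                                         /* 3 chunks, isolate the 1k block */
        unsigned long head = before & ~(BLOCK_SIZE - 1);
        if (head)
            raw_chunk(phys, head, 0);
        raw_chunk(phys + head, BLOCK_SIZE, 1);       /* block straddling the 64k line  */
        if (len > head + BLOCK_SIZE)
            raw_chunk(phys + head + BLOCK_SIZE, len - head - BLOCK_SIZE, 0);
    }
}

int main(void)
{
    split_raw_request(0x2EA00UL, 8 * BLOCK_SIZE);    /* crosses 0x30000 mid-block */
    return 0;
}
```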

Also on this PR:

  • Simplified menuconfig page for block devices.

Performance
Surprisingly, it turns out that fast machines (like a 40MHz 386) do not benefit from floppy caching at all. Thus turning off the cache using menuconfig and releasing most of the memory is the right thing to do. Also somewhat surprising is that a mid-size cache, like 6 or 7k, in most cases will give better performance than either a smaller or a larger cache. Check #81 for more details about this.

@ghaerr

ghaerr commented Sep 29, 2024

Nice work getting the track cache rewritten so quickly! Has this new code been run very much?

I am a bit confused about your recent testing results showing that a 6-7k track cache gives best results: is this for 360k floppies, or a variety of formats? If 360k only, then are we really talking about a cylinder cache, with best results at ~1.5 tracks, since each track on the 360k is 4.5k? Obviously if the track cache is 6-7k and running a 360k floppy, it'd be required that auto-seek is on. I know it's always on for DF, but I'm trying to understand which results, if any, might be applicable to BIOS track caching, where auto-seek may or may not be available. I had also believed, probably incorrectly, that a less-than-full-track cache was giving best results, rather than a cylinder (more-than-one-track) cache doing so.

With regards to BIOS auto-seek, I haven't yet tested whether it works on QEMU (but agreed it likely will), and I am concerned about having it enabled by default for ELKS (BIOS driver is still the default), since if enabled and unavailable, the system won't boot. We also have the compounded problem that if the hardware BIOS and the kernel DDPT get out of sync or the DDPT is ignored, then auto-seek won't work properly. The BIOS floppy driver creates the replacement DDPT at vector 1Eh and always sets the max sectors field, but I have seen many BIOS ASM sources that do not read the 1E vector for max sectors and always use their own ROM table instead. If/when this happens, auto-seek may fail when the BIOS and kernel disagree on floppy format/type. Kind of a mess, and likely means any BIOS auto-seek option needs to be enabled after boot?

A bit off topic, but...

Having said all this, perhaps the most compatible approach for ELKS might be to default to BIOS driver for distribution floppies (with auto-seek off or not implemented), and then direct users to run the DF driver with full auto-seek with optimizable cylinder caching, since the driver maintains full control of the FDC, after it is determined that the PC hardware is compatible with the DF driver. However, that approach trades off speed, async I/O, and now optimized track/cylinder caching for bootability on all systems, which perhaps isn't really the best for most users. The problem is, anybody that gets distribution disks and can't boot results in a support situation, or an unreported evaluation with no further use.

@ghaerr

ghaerr commented Sep 29, 2024

Cache size 9k combined with 9 sectors per track floppies (360k, 720k) means full cylinder cache if the requested sector is the first on track (head) 0 of that cyl.

When you speak of cylinder cache, you mean that when a block is requested that involves a split block and head 0, the first track is read up to the first half split block, then the next track is read in full, head 1. Right? That means that even if the request is just for a single split block, the whole next track is read. Have you tried/tested the idea of only reading to the second half split block, that is, only reading a single extra sector on the same track, head 1? It occurs to me that might also be an interesting case, especially for those systems like 1.44M where full track caching appears to slow things down. The idea would be that some number of multiples of full blocks are always read, including split blocks, but not always reading a full track on auto-seek. What have you found?

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

Nice work getting the track cache rewritten so quickly! Has this new code been run very much?

The code has been really beaten up over several days' heavy testing (probably close to a thousand system loads) on 3 different machines and QEMU. No XMS testing yet, and no occurrences of raw IO hitting the 64k threshold.

I am a bit confused about your recent testing results showing that a 6-7k track cache gives best results: is this for 360k floppies, or a variety of formats?

(from #81 (comment) )

Anyway, more testing is needed - this is from the 286/12.5MHz compaq, booting off of a 1.44M floppy. The buffer available (DMASEGSZ) is 9k, I'm changing the size of the sector cache via bootopts not having implemented sysctl yet, always using full blocks - 1k thru 9k.

Since this testing, there is also the 386 test run.

... I'm trying to understand which results, if any, might be applicable to BIOS track caching, where auto-seek may or may not be available. I had also believed, probably incorrectly, that a less-than-full-track cache was giving best results, rather than a cylinder (more-than-one-track) cache doing so.

I believed the same thing, and nothing in the reported results indicates otherwise. That said, upcoming results from 360k tests may change our position on that. Testing on 386 hw indicates that, but it needs to be verified on slow hw. #81 (comment)

With regards to BIOS auto-seek, I haven't yet tested whether it works on QEMU (but agreed it likely will), I am concerned about having it enabled by default for ELKS (BIOS driver is still the default), since if enabled and unavailable, the system won't boot. We also have the compounded problem that if the hardware BIOS and the kernel DDPT get out of sync or the DDPT is ignored, then auto-seek won't work properly. The BIOS floppy driver creates the replacement DDPT at vector 1Eh and always sets the max sectors field, but I have seen many BIOS ASM source that does not read the 1E vector for max sectors and always use its own ROM table instead. If/when this happens, auto-seek may fail if when the BIOS and kernel disagree on floppy format/type. Kind of a mess, and likely means any BIOS auto-seek option needs to be enabled after boot?

You cannot do that, as it's a bit set in every read/write operation, not a configuration bit. OTOH, it could be an idea to test whether autoseek is on (a 2-sector read starting at the last sector of head 0) in the BIOS driver probe and set the strategy from the result.
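
For illustration, such a probe could be as small as the sketch below; bios_read_sectors() is a hypothetical wrapper around whatever INT 13h AH=02h (read sectors) call mechanism the BIOS driver already uses, assumed to return the number of sectors actually transferred:

```c
/* Hedged sketch of an auto-seek probe: ask for 2 sectors starting at the
 * last sector of head 0. A BIOS that honors MT/auto-seek continues onto
 * head 1 and transfers both; one that doesn't stops at the track end or
 * returns an error. 'spt' is sectors per track for the current format. */
static int probe_bios_autoseek(int drive, int spt, unsigned char *buf)
{
    int done = bios_read_sectors(drive, /*cyl*/0, /*head*/0,
                                 /*first sector*/spt, /*count*/2, buf);
    return done == 2;   /* 1: auto-seek works, 0: fall back to per-track I/O */
}
```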

A bit off topic, but...

Having said all this, perhaps the most compatible approach for ELKS might be to default to BIOS driver for distribution floppies (with auto-seek off or not implemented), and then direct users to run the DF driver with full auto-seek with optimizable cylinder caching,

That's - if nothing else - a very safe route. That said, excluding non-IBM compatibles and barring bugs of course, the direct driver is guaranteed to be compatible, and in that regard safe.

..since the driver maintains full control of the FDC, after it is determined that the PC hardware is compatible with the DF driver.

I'm losing you here. I may of course be missing something but how can the PC hardware not be compatible with the driver? If there is something here, I'd be very interested in addressing it.

However, that approach trades off speed, async I/O, and now optimized track/cylinder caching for bootability on all systems, which perhaps isn't really the best for most users. The problem is, anybody that gets distribution disks and can't boot results in a support situation, or an unreported evaluation with no further use.

So it's really a question of trust - and AFAIK not about compatibility but about reliability. And the only way to get that is to really beat it up and get as many as possible to use it. IOW - possibly deliver both variants (double the set of boot images) for a while? BTW, in a way, the direct driver (the TLVC version of it for now) has more hw 'compatibility' than the BIOS driver in that you can configure drive types for XT class systems.

@ghaerr

ghaerr commented Sep 30, 2024

I have studied your 1.44M floppy track cache testing results #81 (comment) and now see plainly that a 7k cache gives superior boot performance.

I still want to make sure I fully understand what is happening. I've studied most of the new driver code in this PR, but want to make sure I'm following: a 1.44M floppy has 18 sectors/track = 9k. However, auto-seek is always being used, right? So when we set the "track" cache to 7k, this means 14 sectors, and the new driver code will always read 14 sectors from the current cylinder (not track), if possible, right?

[There would be three cases in the head == 0 case: 1) start sector is < 4, so 14 sectors are read on the current track, but not to end of track, 2) start sector == 4, read 14 sectors to end of track, and 3) start sector > 4, read to the end of the current track, and the remaining sectors on the next head==1 track. For I/O requests where head==1, only read sectors until end of current track.]

In summary we're really talking about a cylinder cache, where exactly 14 sectors are attempted to be read in each I/O request. In the head==0 case auto-seek is performed, and when head==1, the sector count may be truncated.

It seems to me that determining the exact number of sectors to cache could be heavily linked to the contiguous sectors within specific programs being exec'd, rather than native cache throughput. (I suppose that's obvious). It might be interesting to see whether removing the fsck exec in the boot process changes the results much.

My first thoughts on all this were "why not always read until end of track/cylinder", but I don't know the speed tradeoff in reading extra data that might be unneeded. And since one of the main points of this exercise was to determine a reasonable cache size so that extra memory can be returned to user programs, overall, it makes a lot of sense to just set a max track/cylinder cache and let the system run with auto-seek head==0 expansion and head==1 truncation. It would be interesting to see continuing "normal" system operation when running programs larger than the cache size.

At the end of the day, it seems this approach is trading off making more memory available to user programs via less fixed cache by using an enhanced auto-seek cache fill previously used only with a single sector on 9-sector floppies.

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

Cache size 9k combined with 9 sectors per track floppies (360k, 720k) means full cylinder cache if the requested sector is the first on track (head) 0 of that cyl.

When you speak of cylinder cache, you mean that when a block is requested that involves a split block and head 0, the first track is read up to the first half split block, then the next track is read in full, head 1. Right?

What the quote above says is that IF the cache size is 9k AND the track size is 9 sectors AND the read starts at sec1hd0, the entire cylinder will be cached, just like the previously deleted 'full cylinder buffer' driver. It's an illustration of what may happen, not an endorsement (it's rarely good).

That means that even if the request is just for a single split block, the whole next track is read.

Yes - if the cache size allows it. BTW - really, there is no such thing as a split block in this setting :-)

Have you tried/tested the idea of only reading to the second half split block, that is, only reading a single extra sector on the same track, head 1?

This was how it worked with the old caching mechanism: If head was 0 and sector count odd, one sector from the next track was read in order to always get full blocks. That mechanism always read from sec1 though, so it's not the same thing. OTOH, with the new sector cache, this happens all the time - the smaller the cache size, the more likely it happens. No measurements of this in particular though.

It occurs to me that might also be an interesting case, especially for those systems like 1.44M where full track caching appears to slow things down.

Incidentally, 1.44M (and 2.88M) are the only drive types where a 'split block' never happens - they have an even # of sectors per track.

That said, this is what my last two days of testing have been all about. Start at cache size 1k, boot and record jiffies, increase by 1k, repeat. It's a slow process, but the numbers are finally in for 286/12.5MHz/1.44M, 386SX40MHz/1.44M, 1.2M, 360K (in 1.2M drive). The latter machine, as we have discussed, delivers best with 1k (no cache), but I did complete testing anyway looking for patterns. Including - as referred to in the PR - 360k w/9k cache.

It's worth keeping in mind that what I've measured is system startup time only, not usage (coming later using the same method as before). Given what I reported for the 286, the numbers are unsurprising - with a couple of exceptions. 5-7k cache size is always best; which one depends on the system. Ignoring the fast 386 not needing the cache at all, it is interesting that the 1.2M drive clocks in with the same speed for 6k and 9k cache sizes. The 360K drive clocks in with about the same speed at all sizes above and including 5k.

The numbers need more validation, so the next target is 360K and 720K on the V20, then 360K, 1.2M and 1.44M on the 386DX/20MHz.

For now it seems 5K cache size is a great compromise between RAM and speed. Also, what we're seeing is that the numbers are as dependent on the speed of the system as on the type of drive, which is interesting because the data transfer speed is constant (limited by ISA bus speed and DMA) - slow (360k, 720k) or a little less slow (1.2M, 1.44M).

Depending on the next rounds of testing, we may end up with a recommended cache size per drive type/system speed.

Finally, while this is interesting - we're really going overboard with this. Really, it's not all that big of a deal if the system load time is 14 or 20 seconds. OTOH going from 20 to 10s is significant. ... and I think reorganizing the boot image is likely to have much more effect than the cache tuning.

BTW - not mentioned before - system startup in my case includes loading fsck (for root, reads superblock, then exits) and loading and starting the network - ktcp, ftpd, telnetd. Needless to say, the 360k image is filled to the brim.

@ghaerr

ghaerr commented Sep 30, 2024

it could be an idea to test whether autoseek is on (a 2 sector read starting at the last sector/head 0) in the BIOS driver probe and set the strategy from the result.

That's a great idea; the system can determine for itself whether the BIOS has auto-seek or not for the current floppy format!
I'm going to add that to my list.

BTW - really, there is no such thing as a split block in this setting :-)

Agreed. Quite cool actually and a very good reason for ELKS to move to the direct driver on 360k floppies, or at least to probe auto-seek and use it when the BIOS implements it. This gets rid of a lot of problems that would otherwise have to be hacked into mfs.

since the driver maintains full control of the FDC, after it is determined that the PC hardware is compatible with the DF driver.
I'm losing you here. I may of course be missing something but how can the PC hardware not be compatible with the driver?

At ELKS, we have to support PC-98 floppies, for instance. They're not "compatible" with the DF driver, and at least new table entries need to be made. So PC-98 needs the BIOS driver. Other examples are users who might boot with 2.88M (emulated or real) - the TLVC driver doesn't support that hardware.

And the only way to get that is to really beat it up and get as many as possible to use it.

Agreed. And the benefits of the async driver are great (PC-98 notwithstanding) and should become part of the ELKS standard, as it has with TLVC.

possibly deliver both variants (double the set of boot images) for a while?

That's another good idea. There could be a "basic boot" disk for those that need it, or just refer users to a previous version where BIOS was fully supported at boot time.

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

I have studied your 1.44M floppy track cache testing results #81 (comment) and now see plainly that a 7k cache gives superior boot performance.

I still want to make sure I fully understand what is happening. I've studied most of the new driver code in this PR, but want to make sure I'm following: a 1.44M floppy has 18 sectors/track = 9k. However, auto-seek is always being used, right? So when we set the "track" cache to 7k, this means 14 sectors, and the new driver code will always read 14 sectors from the current cylinder (not track), if possible, right?

@ghaerr, in order to avoid misunderstandings (I believe I do understand your scenario), look at it this way, this is how the sector cache works (it's really simple):

  • There is no track, just a cylinder, SPT*2 sectors
  • A read starts at the requested sector and reads to EITHER fill the buffer OR to the end of the cylinder (see the sketch below).
  • I think it helps to forget about the MT (aka autoseek), it's not something ever changed, set or reset. It's just there always.
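
A minimal sketch of that fill rule, with invented names (sectors numbered 0..SPT*2-1 within the cylinder, cache size given in sectors):

```c
/* How many sectors a cache fill reads: from the requested sector to either
 * the end of the cylinder or the end of the cache, whichever comes first. */
static unsigned int cache_fill_count(unsigned int start_in_cyl,  /* 0..2*spt-1        */
                                     unsigned int spt,           /* sectors per track */
                                     unsigned int cache_sectors) /* cache size        */
{
    unsigned int to_cyl_end = 2 * spt - start_in_cyl;
    return to_cyl_end < cache_sectors ? to_cyl_end : cache_sectors;
}
```

For example, on a 1.44M floppy (18 SPT) with a 7k (14-sector) cache, a request starting 9 sectors into head 1 (sector 27 of 36 in the cylinder) fills only the 9 remaining sectors, while one starting early on head 0 fills the full 14.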

It seems to me that determining the exact number of sectors to cache could be heavily linked to the contiguous sectors within specifiec programs being exec'd, rather than native cache throughput. (I suppose that's obvious). It might be interesting to see whether removing the fsck exec in the boot process changes the results much.

It is (obvious) - it will be interesting to see how my 'general usage' test works out with the various block sizes. Suggestions about what would make a reasonable general use test are welcome (we may not agree on that though :-) )

My first thoughts on all this were "why not always read until end of track/cylinder", but I don't know the speed tradeoff in reading extra data that might be unneeded.

This is what the tests with varying cache sizes are supposed to reveal. And one of the contributing factors to the somewhat surprising (well, not really) numbers. More isn't necessarily better.

@ghaerr

ghaerr commented Sep 30, 2024

This was how it worked with the old caching mechanism: If head was 0 and sector count odd, one sector from the next track was read in order to always get full blocks. That mechanism always read from sec1 though, so it's not the same thing.

Got it. We were kind of posting at the same time, so this matches with my understanding from my last post, thank you.

That said, this is what my last two days of testing have been all about. Start at cache size 1k, boot and record jiffies, increase by 1k, repeat. It's a slow process, but the numbers are finally in for 286/12.5MHz/1.44M, 386SX40MHz/1.44M, 1.2M, 360K (in 1.2M drive).

Fantastic testing!

BTW on the business of whether a "fixed" cache size or not makes sense for all systems (given the ability to give back unused RAM to main memory), I am thinking of a way to get the cache DMASEGSZ out of being a fixed constant in config.h and having the /bootopts equivalent be used by setup.S for relocating the kernel - and the latest modifications I've made to init/main.c and setup.S should allow this somewhat easily. More on that later, but the idea would be that the DMASEG would be fixed as now, and a variable DMASEGSZ sectors would then be allocated in low memory, then the kernel code and data segments after that. The biggest issue is that the /bootopts pre-pass for trackcache= would have to be written in ASM and performed by setup.S, but with some restrictions, like requiring trackcache= to be on its own line, I think not too much work.

For now it seems 5K cache size is a great compromise between RAM and speed.

In the interim, after your results are in, it might be best to configure a default, say a 5K cache, in config.h and then have a normal /bootopts option that would allow setting a lower or no cache value for 1.44M floppies or fast systems. This would have a big benefit of releasing 13k more RAM to user programs like Doom that really need it. (Yes, games still rule with users, it seems).

Really, it's not all that big of a deal if the system load time is 14 or 20 seconds. OTOH going from 20 to 10s is significant. ... and I think reorganizing the boot image is likely to have much more effect than the cache tuning.

Agreed. And I am definitely planning on some improvements to the image layout, inode numbering and mfs mods to achieve better boot results. Stay tuned, I still want to complete my boot block analysis before starting on those.

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

BTW - really, there is no such thing as a split block in this setting :-)

Agreed. Quite cool actually and a very good reason for ELKS to move to the direct driver on 360k floppies, ...

It is unlikely that the BIOS would add or remove the MT (Multitrack) bit depending on drive type. In fact there is little reason not to have that bit set always and just ignore it, like the DF driver does. IOW - if a probe responds 'positively' to a read spanning tracks, it will 99.99% certainly apply to any drive (type) on that system.

I'm losing you here. I may of course be missing something but how can the PC hardware not be compatible with the driver?

At ELKS, we have to support PC-98 floppies, for instance. They're not "compatible" with the DF driver, and at least new table entries need to be made. So PC-98 needs the BIOS driver.

Yes, there is PC98, which TLVC does not support. If it's using the 765 FDC, 8237 DMA and the same IO addresses & IRQ it should work fine - with some density updates in the type table. If you can get the users to test it, it should be low threshold.

Other examples are users who might boot with 2.88M (emulated or real) - the TLVC driver doesn't support that hardware.

It's just a table entry, no code change. I'm saving RAM :-)

@ghaerr

ghaerr commented Sep 30, 2024

look at it this way, this is how the sector cache works (it's really simple):

Thanks, I get it now. :)

One of the reasons I previously kept bringing up "auto-seek" (MT) was that the BIOS driver doesn't know about it (yet). I am thinking a good tradeoff might be to add auto-seek test capability to the BIOS probe routine and use it when possible, and additionally have config.h-configurable track caching, although possibly with different cache values for BIOS vs DF drivers. Yes, kind of a pain, but possibly necessary for ELKS. A real problem is that I personally don't have access to PC-98 nor real floppy hardware for BIOS, so I'm thinking perhaps just leaving the BIOS track cache as is, and going with your findings for the DF driver, and making that the new default. (And thank you! :)

It is unlikely that the BIOS would add or remove the MT (Multitrack) bit depending on drive type.

Good point. I'm going to check my various BIOS sources just for the heck of it, now that I understand how it actually works.

there is PC98, which TLVC does not support. If it's using the 765 FDC, 8237 DMA and the same IO addresses & IRQ it should work fine - with some density updates in the type table. If you can get the users to test it, it should be low threshold.

I'm almost sure the PC98 uses the 765 FDC. I'll ask and see whether I can get more information on the BIOS for that system (It's mostly all in Japanese).

It's just a table entry, no code change. I'm saving RAM :-)

Haha, maybe on just the table entry... Perpendicular mode also has to be implemented, which I went ahead with in my driver version. It uses bit 6 of the bit rate field to indicate some additional rate bytes sent to the 82077.

@ghaerr

ghaerr commented Sep 30, 2024

More on the business of how some BIOS (from ASM sources) determine sectors per track vs our DF driver. This is from Sergey Kiselev's 8088 BIOS project, which is fairly new and used in Book 8088 and other computers. I will continue analysis with other older BIOSes.

I see now that both BIOS and DF have to send the EOT (End of track, sector number) to the FDC on read/writes, which determines when the FDC should switch heads if the MT bit is set. The FDC is set up to start the read/write at a start sector, and the DMA controller determines when the end of transfer should occur, from the byte count sent to it.

Both BIOS and DF set MT all the time as you suspected, and the Sergey BIOS uses the DDPT 1E vector to get the all-important max sector (EOT) value. So this would work with ELKS BIOS, even though it never attempts it.
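
For reference, the DDPT in question is the standard 11-byte diskette parameter table pointed to by the INT 1Eh vector; a sketch of its conventional layout (field names are mine), with the sectors-per-track/EOT byte at offset 4 being the one that matters here:

```c
/* Standard Diskette Parameter Table (INT 1Eh vector target). */
struct ddpt {
    unsigned char step_unload;       /*  0: step rate / head unload time */
    unsigned char load_dma;          /*  1: head load time / DMA flag    */
    unsigned char motor_off;         /*  2: motor off delay, ticks       */
    unsigned char sector_size;       /*  3: 2 = 512 bytes/sector         */
    unsigned char sectors_per_track; /*  4: max sector number (EOT)      */
    unsigned char gap;               /*  5: read/write gap length        */
    unsigned char data_len;          /*  6: data length (0xFF)           */
    unsigned char fmt_gap;           /*  7: format gap length            */
    unsigned char fmt_fill;          /*  8: format fill byte (0xF6)      */
    unsigned char head_settle;       /*  9: head settle time, ms         */
    unsigned char motor_start;       /* 10: motor start time, 1/8 s      */
};
```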

What differs from our DF driver is that the Sergey BIOS doesn't use a table of "probe" entries quite like the DF driver, but instead, like the DF driver, gets the drive type from the NVRAM/config, which sets the drive type, but then unlike DF issues a READ ID command with varying data transfer rates to determine the floppy type within the drive. This opens up the possibility (not sure how much) of a difference in what the BIOS thinks the floppy is, versus what ELKS would. The bigger question is how might this affect what matters, which is the sectors-per-track value, which could fail an MT (auto-seek) transfer?

So the "risk" in using MT reads in the BIOS driver is still just from the BIOS ignoring DDPT (but not in this case), when it would default to its idea of sectors-per-track from another method (READ ID in this case, but never matters, since DDPT overrides work).

IIRC the original IBM BIOS didn't support DDPT 1E vector. I'll check for sure, but this would make sense, as the whole point of a DDPT was to allow user (DOS) programs to override the BIOS for floppy type. It would make sense that later BIOSes would support DDPT, while early versions could not support it prior to its invention. So it might be the issue is supporting very old systems with a default boot. BTW, ELKS already has workaround code for the lack of INT 13h fn 8 (Get Drive Parameters) for the original IBM PC and some XTs. This had to be added because of course some users wanted to see ELKS boot and run on their original 8088 IBM system.

Given all that, some crazy ideas like "hold the space bar down during boot" to force the BIOS to not use MT/auto-seek come to mind, but one has to wonder whether users would know about that or not. They could be given this information without having to recompile though, which is nice. After boot they could be instructed to edit /bootopts with a workaround option. Is this complete overkill or necessary?!?!

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

look at it this way, this is how the sector cache works (it's really simple):

Thanks, I get it now. :)

I think the important thing about MT is that it's just there, always on, whether reading (or writing) one sector or 36, whatever.

It's just a table entry, no code change. I'm saving RAM :-)

Haha, maybe on just the table entry... Perpendicular mode also has to be implemented, which I went ahead with in my driver version. It uses bit 6 of the bit rate field to indicate some additional rate bytes sent to the 82077.

Perpendicular mode is implemented and tested in QEMU, works fine.

@ghaerr

ghaerr commented Sep 30, 2024

Update on IBM PC BIOS (v1 4/24/81, v2 10/19/81, and v3 10/27/82):

I was incorrect above, all three IBM PC early BIOSes support DDPT, which is good news. However, the FDC read command did not have MT added until v3, so early versions of the PC BIOS do not support multitrack reads, although there should not be an error if tried; the sector count returned would just be lower. I'm not sure if this needs to be handled specially, as any track cache would require separate reads for those BIOSes.

For IBM XT BIOS v1 11/8/82, both DDPT and multitrack reads are supported.

On another note, does the DF driver support multitrack writes? This would matter for the TLVC-only raw char driver, as well as single block writes on "split" blocks for odd-sectored floppies, which could be frequent.

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

More on the business of how some BIOS (from ASM sources) determine sectors per track vs our DF driver. This is from Sergey Kiselev's 8088 BIOS project, which is fairly new and used in Book 8088 and other computers. I will continue analysis with other older BIOSes.

It's a good one. My V20 system runs that BIOS.

I see now that both BIOS and DF have to send the EOT (End of track, sector number) to the FDC on read/writes, which determines when FDC should switch heads if the MT bit is set. The FDC is setup to start the read/write at a start sector and the DMA controller determines when the end of transfer should occur, from the byte count sent to it.

That's right - and it's still completely transparent in the sense that it's something I never paid attention to. The sector count is from the drive parameter table and all is automatic.

Both BIOS and DF set MT all the time as you suspected, and the Sergey BIOS uses the DDPT 1E vector to get the all-important max sector (EOT) value. So this would work with ELKS BIOS, even though it never attempts it.

Well, that's good news, isn't it?

What differs from our DF driver is that the Sergey BIOS doesn't use a table of "probe" entries quite like the DF driver, but instead, like the DF driver, gets the drive type from the NVRAM/config, which sets the drive type, but then unlike DF issues a READ ID command with varying data transfer rates to determine the floppy type within the drive. This opens up the possibility (not sure how much) of a difference in what the BIOS thinks the floppy is, versus what ELKS would. The bigger question is how might this affect what matters, which is the sectors-per-track value, which could fail an MT (auto-seek) transfer?

This sounds like a variant of (effectively the same as) what the DF driver is doing. And this is why I had to introduce the xtflpy= setting in bootopts. 360k and 720K have the same rate and cannot be detected this way. This does not, however, affect your MT test. When the data rate has been established, that test will be reliable.

IIRC the original IBM BIOS didn't support DDPT 1E vector. I'll check for sure, but this would make sense, as the whole point of a DDPT was to allow user (DOS) programs to override the BIOS for floppy type. It would make sense that later BIOSes would support DDPT, while early versions could not support it prior to its invention. So it might be the issue is supporting very old systems with a default boot.

I don't see how this would be a problem. If the drive is configured (or found) to be 360k (or system is <AT), assume 9 sectors and read 9+10 to see if it fails - should work regardless.

BTW, ELKS already has workaround code for the lack of INT 13h fn 8 (Get Drive Parameters) for the original IBM PC and some XTs. This had to be added because of course some users wanted to see ELKS boot and run on their original 8088 IBM system.

Probably just setting the drive type to 360K? Would make perfect sense.

Given all that, some crazy ideas like "hold the space bar down during boot" to force the BIOS to not use MT/auto-seek come to mind, but one has to wonder whether users would know about that or not. They could be given this information without having to recompile though, which is nice. After boot they could be instructed to edit /bootopts with a workaround option. Is this complete overkill or necessary?!?!

I think it will be a lot easier than that. If the boot process (actually the BIOS driver) gets a floppy read error (we'll figure out the type, but probably sector not found) it would be easy enough to backtrack and redo that particular IO request, and if necessary change a parameter setting, don't you think?

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

Update on IBM PC BIOS (v1 4/24/81, v2 10/19/81, and v3 10/27/82):

I was incorrect above, all three IBM PC early BIOSes support DDPT, which is good news. However, the FDC read command did not have MT added until v3, so early versions of the PC BIOS do not support multitrack reads, although there should not be an error if tried; the sector count returned would just be lower. I'm not sure if this needs to be handled specially, as any track cache would require separate reads for those BIOSes.

For IBM XT BIOS v1 11/8/82, both DDPT and multitrack reads are supported.

So you could peek the BIOS for version #, but the question is still open as to other (compatible) BIOSes - how compatible and since when? It sounds like a probe is worth it regardless.

On another note, does the DF driver support multitrack writes? This would matter for the TLVC-only raw char driver, as well as single block writes on "split" blocks for odd-sectored floppies, which could be frequent.

Like I said, the MT bit is simply always there, so yes. Always. The thing is (again), and I think this is important - there is never a reason to ever remove it.

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

It occurs to me that the prospect of getting rid of the entire DDPT mess is a strong case for the DF driver. :-)

BTW I seriously doubt that any IBM PC or compatible ever had a 765 without MT bit support. Chips shipped as early as 1979 had the MT in place, 2+ years before the PC. That doesn't help much though if the BIOS doesn't use it, so I think your probe strategy is both necessary and safe.

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

(Screenshot of load time test results, 2024-09-30 14:28:58)

@ghaerr

ghaerr commented Sep 30, 2024

does the DF driver support multitrack writes?
the MT bit is simply always there, so yes. Always.

What I was getting at was not the FDC having the MT capability or not, but rather the driver, since it has to be coded to start another I/O request should the BIOS not honor the requested sector count assuming MT (when MT reads aren't performed in BIOS). Which brings me to my next point:

I now realize there were some design mistakes made in the (my) track cache code for BIOS. It was designed as a full-track (not cylinder) sector cache, performed under the untested assumption that reading full tracks would be faster than reading multiple sectors. You're finding the real results of such an assumption now. Reading full tracks from sector 1 was removed via an option ifdef quite a while ago though. Also, I was unaware of, or at least did not take into account the ability for most BIOSes to perform MT reads, particularly across split blocks. So the track cache code is written to only consider a single track number and associated sector count, forcing somewhat worst case behavior on split block reads for odd sectored floppies (after all these years!!).

Because of this, there is an upper level loop which continues to read an additional sector (or any required) to complete the I/O (read or write) block request. Now I realize that the BIOS driver can be easily improved without the need for probing MT capabilities at all: simply request the I/O assuming MT (or longer fixed-cache request) using INT 13h, and then accept the returned sector count, for either the I/O request or cache fill request; by not treating a less-than-requested sector count as an error, the upper loop will setup another I/O request for the remaining data. The routine will need a bit of recoding to calculate the ending sector and DMA etc, but it'll work for non-MT and MT capable BIOSes because the upper loop will reschedule remaining I/O in the non-MT case. In the track/cylinder cache fill request, the cache fill will exit the upper loop rather than continue, which prohibits non-MT BIOS track caches from issuing two I/O requests.

Sounds complicated, but actually pretty straightforward. Just like the new DF cache fill code: a look at your rewrite seems to show more simplicity, and the upper level will also request additional I/O if the sector count (or DMA wrap) requires an additional request to be performed. For BIOS, the DDPT being handled in all BIOS versions means we can pretty much guarantee the I/O to match the kernel and BIOS's idea of the floppy. I suppose that's why the existing BIOS driver works on all known systems, albeit slowly until now.
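
In rough C, the loop described above might look like the sketch below; bios_read_sectors(), chs_advance() and struct chs are illustrative stand-ins, not existing ELKS functions:

```c
struct chs { unsigned int cyl, head, sector; };

/* Issue the whole request assuming MT; accept whatever sector count the
 * BIOS returns and let the loop reschedule the remainder on a non-MT BIOS.
 * A cache fill simply stops early instead of re-issuing. Sketch only. */
static int do_floppy_read(int drive, struct chs pos, unsigned int nsectors,
                          unsigned char *buf, int cache_fill)
{
    while (nsectors) {
        int done = bios_read_sectors(drive, pos.cyl, pos.head, pos.sector,
                                     nsectors, buf);
        if (done <= 0)
            return -1;              /* genuine error: caller handles retries */
        chs_advance(&pos, done);    /* step CHS forward by 'done' sectors    */
        buf += done * 512;
        nsectors -= done;
        if (cache_fill)
            break;                  /* a short cache fill is acceptable      */
    }
    return 0;
}
```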

If the boot process (actually the BIOS driver) gets a floppy read error (we'll figure out the type, but probably sector not found) it would be easy enough to backtrack and redo that particular IO request

Yes, but I'm pretty sure that for non-MT BIOSes, the BIOS limits the I/O to the last sector on the current track/head, and returns the actual sector count for the performed I/O, rather than an error. We'll have to see for sure, I'll check BIOS ASM to confirm. And as you point out, leaving the actual boot process the same (which doesn't rely on MT) gets us past the hard part of the boot, and the BIOS driver can handle irregularities without giving up like the space-limited boot sectors have to.

@ghaerr

ghaerr commented Sep 30, 2024

Amazing conclusions on the TLVC load times! Am I reading it correctly that for all but the slowest systems, having no cache is best, regardless of drive types? I don't have a theory on why that is, other than perhaps the applications themselves are very slow and can't keep up with I/O?! It will be interesting to see what you come up with on the V20 (slow machine, right?). Is that your IBM 5150 with a different CPU, or is the 5150 a separate machine?

If the testing concludes with no track/cylinder cache except for the slowest machines, perhaps the CPU test in setup.S could be used to either allocate a 7K (i.e. a default in config.h) track cache or none at all, automatically. In this way, the max memory is given to user applications and it automatically works on everything. There could be an additional config option (or just use CONFIG_TRACK_CACHE=N) to force track cache allocation, for which setup.S ignores the CPU type and always allocates a cache. A permanent /bootopts trackcache= might also make sense for testing/fine tuning.

It also seems we might want to keep the timing information in the driver(s) so that a user can see for themselves the results of their own cache settings.

I'm finding this research extremely interesting!

@Mellvik
Owner Author

Mellvik commented Sep 30, 2024

Amazing conclusions on the TLVC load times! Am I reading it correctly that for all but the slowest systems, having no cache is best, regardless of drive types?

I think we need to view it the other way around: the sector cache is good except for the fastest systems. Of course that depends on how we define 'fastest'. The 12.5MHz 286 machine really needs the cache, the 20MHz 386 is yet to be tested (the blessing of having a hw museum in the garage).

I don't have a theory on why that is, other than perhaps the applications themselves are very slow and can't keep up with I/O?! It will be interesting to see what you come up with on the V20 (slow machine, right?). Is that your IBM 5150 with a different CPU, or is the 5150 a separate machine?

Again, maybe the other way around, as alluded to before. The 40MHz system is so fast it can field the next request before the next sector passes.

The V20 is 'dream machine ii' as described in the wiki. The 5155 (portable!!) is a different system, original 4.77MHz. To be tested.

If the testing concludes with no track/cylinder cache except for the slowest machines, perhaps the CPU test in setup.S could be used to either allocate a 7K (i.e. a default in config.h) track cache or none at all, automatically. In this way, the max memory is given to user applications and it automatically works on everything. There could be an additional config option (or just use CONFIG_TRACK_CACHE=N) to force track cache allocation, for which setup.S ignores the CPU type and always allocates a cache. A permanent /bootopts trackcache= might also make sense for testing/fine tuning.

I agree with part of this regardless of the discussion above, more about that later.

It also seems we might want to keep the timing information in the driver(s) so that a user can see for themselves the results of their own cache settings.

touché.

I'm finding this research extremely interesting!

Me too!

@Mellvik
Owner Author

Mellvik commented Oct 1, 2024

does the DF driver support multitrack writes?
the MT bit is simply always there, so yes. Always.

What I was getting at was not the FDC having the MT capability or not, but rather the driver, since it has to be coded to start another I/O request should the BIOS not honor the requested sector count assuming MT (when MT reads aren't performed in BIOS).

I understand that I still haven't been able to convey how this works. I was NOT talking about the FDC but the driver - although the statement applies to both. There is NO WAY to turn on or off multitrack, it's not a mode or setting, it's the read/write command. Think of it this way: The FDC has two sets of read/write sector commands, one set implies MT, the other does not. There is no reason to ever use the latter, and the DF driver has no idea that it exists. It is really of no use to anyone in any context AFAIK. The question is whether (and which) BIOSes use one or the other, which is what the probe we have talked about will reveal. My guess is that most if not all use the MT commands.
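
To illustrate the point (the values are the standard 765/8272 command encoding; output_byte() stands in for the driver's actual FDC output routine):

```c
/* The 765 has no MT "mode": MT is just bit 7 of the READ/WRITE DATA command
 * byte itself, so a driver that always ORs it in does multitrack transfers
 * unconditionally. */
#define FD_READ   0x06                     /* READ DATA                        */
#define FD_WRITE  0x05                     /* WRITE DATA                       */
#define FD_MT     0x80                     /* multitrack: continue onto head 1 */
#define FD_MFM    0x40                     /* double density                   */

static void fdc_read(int drive, int cyl, int head, int sector, int eot, int gap)
{
    output_byte(FD_READ | FD_MT | FD_MFM); /* command byte, MT always set      */
    output_byte((head << 2) | drive);      /* head/drive select                */
    output_byte(cyl);                      /* C                                */
    output_byte(head);                     /* H                                */
    output_byte(sector);                   /* R: first sector to transfer      */
    output_byte(2);                        /* N: 512 bytes/sector              */
    output_byte(eot);                      /* EOT: last sector number on track */
    output_byte(gap);                      /* GPL                              */
    output_byte(0xFF);                     /* DTL                              */
}
```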

@Mellvik
Owner Author

Mellvik commented Oct 1, 2024

... there is an upper level loop which continues to read an additional sector (or any required) to complete the I/O (read or write) block request. Now I realize that the BIOS driver can be easily improved without the need for probing MT capabilities at all: simply request the I/O assuming MT (or longer fixed-cache request) using INT 13h, and then accept the returned sector count, for either the I/O request or cache fill request; by not treating a less-than-requested sector count as an error, the upper loop will setup another I/O request for the remaining data. The routine will need a bit of recoding to calculate the ending sector and DMA etc, but it'll work for non-MT and MT capable BIOSes because the upper loop will reschedule remaining I/O in the non-MT case. In the track/cylinder cache fill request, the cache fill will exit the upper loop rather than continue, which prohibits non-MT BIOS track caches from issuing two I/O requests.

Sounds complicated, but actually pretty straightforward.

Yes, I agree - this sounds like a workable and quite simple solution. The ability to return partly completed requests like that is all you need to add raw access too, BTW.

Yes, but I'm pretty sure that for non-MT BIOSes, the BIOS limits the I/O to the last sector on the current track/head, and returns the actual sector count for the performed I/O, rather than an error. We'll have to see for sure, I'll check BIOS ASM to confirm. And as you point out, leaving the actual boot process the same (which doesn't rely on MT) gets us past the hard part of the boot, and the BIOS driver can handle irregularities without giving up like the space-limited boot sectors have to.

Actually, (surprise! :-) ) this brings up a different issue on my list, to increase the read retries during boot. I'm experiencing (on old 360k drives) that TLVC fails to boot (repeated read errors) on floppies that DOS reads just fine. I don't know if the difference is the # of repeats, but it may be and I cannot afford not to try ...

@ghaerr

ghaerr commented Oct 1, 2024

The FDC has two sets of read/write sector commands

Thanks, yes I get it now - MT is bit 0x80 in the R/W command. Having now read several BIOS sources, only the two earliest versions of the IBM PC BIOS don't set it; after that, everyone figured out the advantage of always setting it.

That said, my concerns were (and now still are - see below) about drivers that may not use it - which includes the ELKS BIOS driver. That driver always properly sets a DDPT, but then never attempts a read/write across a track boundary (for the cache or single block). So split blocks on odd-sectored floppies are always handled as two I/O requests. [EDIT: get_chst enforces this.] IIRC very early on when I was just getting involved, there was a problem with some machine that wouldn't boot until this was done. I'll go back and find it, but from our discussions one would think the only way this could occur would be the BIOS returning an error rather than a lower sector count on the read/write operation. Having just now remembered this, the driver may have to perform the probe you suggested, or at least have code to retry split block reads (and remember to do so) as separate sectors. The details never end when older machines are involved!

@Mellvik
Owner Author

Mellvik commented Oct 1, 2024

If the testing concludes with no track/cylinder cache except for the slowest machines, perhaps the CPU test in setup.S could be used to either allocate a 7K (i.e. a default in config.h) track cache or none at all, automatically.

This is pretty much exactly what I've been thinking - it seems (pending more testing, which is ongoing) we need to know the CPU - and probably the speed. The latter may be tricky, but let's get back to that if needed. The calibrate_delay may have elements needed in that regard (like the sys_dly_index).

In this way, the max memory is given to user applications and it automatically works on everything. There could be an additional config option (or just use CONFIG_TRACK_CACHE=N) to force track cache allocation, for which setup.S ignores the CPU type and always allocates a cache. A permanent /bootopts trackcache= might also make sense for testing/fine tuning.

I'll say 'yes to everything' - autotuning within the limits of the assigned RAM, RAM assignment via CONFIG and the ability to force the sector cache size via bootopts. I have two of them implemented now (some adjustments needed), let's look at how to make autotuning work when I finish my benchmarks. BTW I don't think it makes sense to spend time testing the true XT, the numbers will likely tell us the same as those from the V20.

It also seems we might want to keep the timing information in the driver(s) so that a user can see for themselves the results of their own cache settings.

Can you elaborate on this one?

@Mellvik
Owner Author

Mellvik commented Oct 1, 2024

... from our discussions one would think the only way this could occur would be the BIOS returning an error rather than a lower sector count on the read/write operation. Having just now remembered this, the driver may have to perform the probe you suggested, or at least have code to retry split block reads (and remember to do so) as separate sectors. The details never end when older machines are involved!

You may be able to handle such error returns from the BIOS by catching them in the driver and ignoring them, returning the reduced read as you suggested above. The result should be the same - assuming the first part of the read got into the buffer. If not, the driver could resubmit a partial request ...

@Mellvik
Owner Author

Mellvik commented Oct 1, 2024

I'm changing the bootopts setting for the sector cache, giving it its own 'name' as you suggested. fdcache= is my suggestion, saving bytes (!) and reflecting that 'trackcache' is no longer appropriate. Your take?

@ghaerr

ghaerr commented Oct 1, 2024

Fdcache= sounds good!

@Mellvik
Owner Author

Mellvik commented Oct 1, 2024

Thanks,
While at it, I'm also considering a rename of CONFIG_TRACK_CACHE to CONFIG_FLOPPY_CACHE or possibly (shorter) CONFIG_FLPY_CACHE. I think by connecting the CONFIG name to floppy, misunderstandings could be avoided.

@Mellvik
Owner Author

Mellvik commented Oct 13, 2024

... it appears that perhaps the 8088s are more CPU bound than we would initially believe - they struggle to keep up.

They are indeed, a reminder as to why the DMA is so important, for floppies and MFM drives: the need to get the CPU out of the way (I've been reading up on XTIDE quite a bit lately; it's interesting how much thought goes into squeezing the last little drop of performance out of the 'channel').

We need to think of some good tests to show we do understand what is going on.

Good point, and I suspect our venerable fdtest may be a good starting point, also when it comes to testing the validity of your qemu delays. (It was using direct BIOS calls, so I just rewrote it for the directfd driver, using the raw devices instead - which makes it marginally more useful for the purpose than dd). Anyway, creating a few simple algorithms to figure out what's going on should be right up your alley? Its ability to start anywhere on a track and read any length of any block size opens up some interesting possibilities. Possibly add an interleave test just for fun, and also a gap size test - how much to skip between reads. BTW, fdtest needs to be updated to have higher resolution than one second, maybe even using the precision timer stuff.

It would be interesting to compare output from your recent QEMU IODELAY code to the physical numbers I get. BTW I wanted to get a ready-made image of ELKS 0.8 (360k) yesterday, but could not find it. The 'downloads' link sent me to 'releases' instead... I must have missed something.

I had an interesting printk keeping tabs on the req-queue. I'm going to reactivate that for good measure.

I was thinking perhaps the request queue should report itself each queue size increase above 1 (not zero), which would filter out all but when the system is actually queuing multiple I/O requests.

That's a good idea, I'll add that.

I have got an mfs mod working along with some other mods in image/Make.image that now allows one to specify a list of directory or filenames that should be created/allocated before the rest of the image is generated. Just changing /dev/console to the first /dev directory entry initially knocked off 1/2sec, and with other files added > 1 sec shaved off, and I've been playing around with it a bit.

Great, it will be really interesting to see what this does in 'real life', on 'Dream System II' (V20).

This is not using real hardware, but using the simulated IODELAY recently added to the BIOS driver. I'm not sure how accurate it is, we should discuss that (see https://github.com/ghaerr/elks/pull/2063/files for code). It always assumes half-rotation (100ms) delay, adds more per extra sector read, but doesn't add seek time delays. What do you think? It will be hard to come up with a good set of directories and files if I can't emulate the floppy speed reliably.

It'll be interesting to see how reliable it is possible to get - you may have to add something to account for seeks. My take is that if you're within +-25% of the real thing it's good enough for the purpose.
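
For a rough sanity check of such a model (back-of-the-envelope figures, not the actual IODELAY code): a floppy spins at 300 RPM, i.e. 200 ms per revolution, so average rotational latency is about 100 ms and each sector takes 200/SPT ms to pass the head, while seek and head settle add a handful of ms per cylinder moved.

```c
/* Back-of-the-envelope floppy delay model; seek/settle figures are typical
 * datasheet values, not measurements. */
static unsigned int fd_delay_ms(unsigned int sectors,     /* sectors transferred */
                                unsigned int spt,         /* sectors per track   */
                                unsigned int cyls_moved)  /* cylinders seeked    */
{
    unsigned int ms = 100;                /* average rotational latency */
    ms += (sectors * 200U) / spt;         /* data transfer time         */
    if (cyls_moved)
        ms += 6 * cyls_moved + 15;        /* seek + head settle         */
    return ms;
}
```

On an 18 SPT 1.44M floppy that works out to roughly 11 ms per sector on top of the latency, so a half-rotation-plus-per-sector model should land within the +-25% target as long as seeks stay short.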

Also seeing some complex behavior where adding /bin early results in poorer performance, so it's turning out a bit tricky.

Doesn't make sense, does it? Did you figure out an explanation for this? My old thing about moving startup programs to /etc keeps popping up in the back of my head...

I'm thinking of adding a more complex dprintk mechanism that will allow for various debug printk's to be turned on or off using /bootopts debug=7 (e.g. 4,2,1) etc in order to quickly be able change what is seen.

Yes, this is a good idea!
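
As a sketch of what that could look like (names invented; printk is the existing kernel routine):

```c
/* Hypothetical debug bitmask, set from a "debug=" value in /bootopts. */
#define DEBUG_BLK    0x01   /* block I/O requests      */
#define DEBUG_CACHE  0x02   /* floppy cache fills/hits */
#define DEBUG_ASYNC  0x04   /* async request queue     */

extern unsigned int dbg_mask;   /* parsed from /bootopts at startup */

#define dprintk(m, ...) do { if (dbg_mask & (m)) printk(__VA_ARGS__); } while (0)

/* usage: dprintk(DEBUG_CACHE, "df: cache fill %d sectors at %d\n", n, start); */
```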

@ghaerr
Copy link

ghaerr commented Oct 13, 2024

using the simulated IODELAY recently added to the BIOS driver. I'm not sure how accurate it is,

After starting to read this article about estimating file access times for floppy disks I realized you'll definitely want this on your bedtime reading list. Everything we wanted to know about floppy timings. Some good stuff!

@Mellvik
Owner Author

Mellvik commented Oct 14, 2024

Holy Mac, and 1983! It ended up being morning reading instead :-) and yes, very thorough indeed (and very academic towards the end). We now end up with an entirely new challenge: How to apply this knowledge to the challenge at hand without going overboard with details that have minimal effect on the result.

The challenge at hand has (more or less a summary of our discussions so far) two parts:

  • system startup time, and
  • general performance

The former benefits most from a) fast sequential reads, b) optimal placement of metadata and files on the floppy, c) minimizing directory and file accesses.
The latter is strongly usage dependent - as in a) copying a bunch of relatively small files is very different from b) loading and running a (relatively) large program.
And finally, given the limited use of floppies today, even on the oldest of machines, the balance between effort and code on the one hand, and actual benefit on the other, must be weighed carefully, as we've discussed before.
Then of course there is the fun factor, aka curiosity, like why the relocation of the /bin directory has the effect you're observing.

I don't have the conclusions - but I do enjoy the discussion! Thanks for the entertaining link!

@ghaerr

ghaerr commented Oct 15, 2024

My first pass at lowering system startup time is in ghaerr/elks#2071 and ghaerr/elks#2073.

Prelim results on emulated floppy delay times show boot time halved from 10 to 5 seconds using the BIOS driver (net start not yet invoked, keeping things simple). Although many options were added, the massive improvement turns out to come from just writing all the boot-required files close to each other using the new Image.all mechanism (see image/Image.all for details). It might be interesting to play around with Image.all files to see where the gains are really coming from. Nonetheless, the new option, which is a three-liner in image/Make.images and an enhanced mfs, is pretty spectacular, much better than I had imagined.

This is also using an unmodified 9k (18 sector) track cache (for the moment) for 1440k floppies. It will be interesting to see what the results may be on using the enhanced TLVC DF driver with configurable block (non-track) caching. I am not yet able to test non-track caching as I have not added your DF driver cache code to ELKS yet. Also, for compatibility, I'm thinking of leaving the ELKS BIOS driver as it is, although the above PRs also added TRACK_SPLIT_BLK and SPLIT_BLK processing to the BIOS driver. These options have not (yet) been shown to measurably increase speed though.

With the new Image.all file/directory specification compacting boot-required files all together on the image, it may be that a larger cache (or track cache), in particular on 386+ systems, will help - I think it will, as Image.all purposely packs everything together and results in huge boot time savings. I am guessing this is a result of far fewer I/O requests being needed. If that turns out to be the case, then we may have to come up with an option that enables a track or larger cache at boot, but turns it off after booting is done on 386+ systems. Not sure yet!

@Mellvik
Copy link
Owner Author

Mellvik commented Oct 15, 2024

Thank you @ghaerr - this is an important step, the ability to play around with these layouts and see how they work in practice. I'm rather looking forward to playing with the new mfs and Image.all. As always, the proof of the pudding etc. ....

I've continued to enhance the fdtest utility - it's delivering some really interesting timing results. Weird actually. More on that later.

@ghaerr
Copy link

ghaerr commented Oct 15, 2024

@Mellvik,

I've been looking a bit harder at both the original and enhanced floppy cache code in the DF driver, since adding the IODELAY to ELKS for it for delay emulation purposes. As I mentioned in my PR, the ELKS DF driver is currently running slower than the BIOS driver since it always uses full-track reads (e.g. starting from sector 1). I was thinking of moving towards your new implementation but taking it in multiple steps: no caching, old-caching (full track), semi-track caching (start from requested sector, same as BIOS), and the new "fixed size" caching. The ELKS BIOS driver has defines that allow each of these (except the last, until I add it) to be turned on or off. I think I also mentioned I'm having trouble seeing much improvement in the BIOS driver when turning on MT/split block reads - funny, as I had thought there'd be a significant difference.

When writing, the BIOS driver never stores the write block in the cache, but instead always invalidates the cache, then writes just the block. I just noticed though that the DF driver does not: it seems to store the requested block into the track cache, then goes ahead with the write. However, it seems to me that each block write will always result in a full track cache write, correct?

If so, doesn't this produce worst-case behavior for any program that writes files sequentially? That is, each block will be written as a full track, over and over again. IMO this could contribute greatly to longer timings, but I'm not sure whether you're using writes, or only reads, for your time tests above. Can you clarify where only a single block is written on writes, if I am misunderstanding?

Thank you!

@ghaerr
Copy link

ghaerr commented Oct 16, 2024

While studying the DF driver and running some copy testing between drives, I noticed that the enhanced mfs is not working exactly the way it should when using the mfs addfs followed by mfs genfs -a options. What is happening is that the directories (not files) that are specified in Image.all are being silently created twice. While this seems to have no effect on the image as far as working OK, it's obviously an error. FYI.

The enhanced mfs is still useful for testing boot times, but should not be used for production use. I'll have a fix out shortly.

Also, I happened to notice in the TLVC directhd.c driver that when allocating the ide_buffer using heap_alloc, a NULL return is not checked for - FYI. This could cause a kernel crash if running low on memory.
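
For illustration, the kind of guard I mean - a fragment sketch only, with the tag name assumed rather than checked against the actual heap_alloc call in directhd.c:

```c
/* Illustrative fragment - keep whatever arguments directhd.c already passes;
 * the point is simply to fail gracefully instead of dereferencing NULL. */
ide_buffer = heap_alloc(1024, HEAP_TAG_DRVR);   /* tag name assumed, not checked */
if (!ide_buffer) {
    printk("athd: no memory for sector buffer, drive not enabled\n");
    return;     /* or flag the drive as unusable and continue the probe */
}
```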

I have written a disk-to-disk copy script from which I hope to further analyze the problem of the track cache being invalidated, then fully filled, successively, possibly for every few blocks transferred when the source and destination drives are different. This seems to be an inherent problem in both the BIOS and DF cache code. This is similar to the potential problem mentioned above, but not exactly. It seems that we are going to need a better cache strategy for READs vs WRITEs as well as for when the drives are not the same. All of this is basically the result of having a single cache for multiple drives, as well as not really handling WRITE-throughs in an efficient manner. Even though ELKS BIOS and TLVC are coded a bit differently, I fear the throughput results will be worse than they could be.

More results coming, and I hope to publish some test scripts we can both use.

@Mellvik
Copy link
Owner Author

Mellvik commented Oct 16, 2024

... I'm having trouble seeing much improvement in the BIOS driver when turning on MT/split block reads - funny, as I had thought there'd be a significant difference.

I agree, this is indeed surprising. I'm rewriting dftest (again, this is the third time) and it will lend itself well to testing things like this in isolation, so to speak. I'll possibly add a seek time measurement while at it.

When writing, the BIOS driver never stores the write block in the cache, but instead always invalidates the cache, then writes just the block. I just noticed though that the DF driver does not: it seems to store the requested block into the track cache, then goes ahead with the write. However, it seems to me that each block write will always result in a full track cache write, correct?

No, the write will always be just the requested block; the cache is never flushed to disk. That said, there are things to be discussed about this, as you allude to in the next post - I'll get back to that.

If so, doesn't this produce worst-case behavior for any program that writes files sequentially? That is, each block will be written as a full track, over and over again. IMO this could contribute greatly to longer timings, but I'm not sure whether you're using writes, or only reads, for your time tests above. Can you clarify where only a single block is written on writes, if I am misunderstanding?

There may be code here that I have not touched yet, I'll take a look.

@Mellvik
Copy link
Owner Author

Mellvik commented Oct 16, 2024

The enhanced mfs is still useful for testing boot times, but should not be used for production use. I'll have a fix out shortly.

Thanks for the heads up. I'll hold off importing the code until you send the 'clear' signal :-)

Also, I happened to notice in the TLVC directhd.c driver that when allocating the ide_buffer using heap_alloc, a NULL return is not checked for - FYI. This could cause a kernel crash if running low on memory.

Thanks, you really have a keen eye!! It's bad practice, I know, but I left it out because this alloc only happens during probe, which is before most of the stuff using the heap gets started.

I have written a disk-to-disk copy script from which I hope to further analyze the problem of the track cache being invalidated, then fully filled, successively, possibly for every few blocks transferred when the source and destination drives are different. This seems to be an inherent problem in both the BIOS and DF cache code. This is similar to the potential problem mentioned above, but not exactly. It seems that we are going to need a better cache strategy for READs vs WRITEs as well as for when the drives are not the same. All of this is basically the result of having a single cache for multiple drives, as well as not really handling WRITE-throughs in an efficient manner. Even though ELKS BIOS and TLVC are coded a bit differently, I fear the throughput results will be worse than they could be.

This is a good point and it's easy to fix, probably only a couple of lines of code. I vaguely remember having thought about this way back and deciding to ignore it because floppy-to-floppy copies happen so rarely (my take) - most likely on floppy-only systems, and in testing like the modus operandi we're in now. I have a number of changes in the directfd driver pending (my work on fdtest delivered many odd results that sent me back to the driver to check wtf is going on - no bugs found, but some adjustments, mostly to raw IO, since that's what fdtest uses); I'll add this one too, along with some probe changes we've talked about.

@ghaerr
Copy link

ghaerr commented Oct 16, 2024

I'll hold off importing the code until you send the 'clear' signal :-)

mfs is all fixed and working well in ghaerr/elks#2080.

I'm now going to take a deeper dive into the DF driver cache routines with attention towards READ vs WRITE and multi-floppy performance. My ultimate goal is to write an upper level caching layer that works identically for both BIOS and DF drivers, with options for FULL_TRACK and SEMI_TRACK along with a FIXED_CACHE option with a runtime configurable cache size. Within the driver, there would be options to handle SPLIT_BLK (single sector at end of track when no cache) and TRACK_SPLIT_BLK. The options sound complicated, but aren't really, and would allow for continued or delayed analysis of floppy performance.

I'm thinking of always limiting block writes to single blocks (never tracks, yet to be confirmed in the DF driver) and also making available a single-block cache at DMASEG (before any track cache), which would allow multiple drives or floppy disks to perform single-block write I/O without invalidating any track cache. As we are seeing, the whole business of understanding performance is seemingly pretty complicated, so having more options is likely better, certainly for analysis.

On another subject, similar to lifting the cache code above the BIOS driver (for use with both BIOS and DF), the same needs to be done for the old-fashioned BIOS "probe" code, where both ELKS EPB and DOS BPBs are examined by reading the floppy boot sector when first opened. This code is badly missing from the DF driver; it prevents problems that otherwise have to be solved by guessing, especially for DOS disks, and helps firmly identify the media type for bootable ELKS/TLVC floppies as well.

@Mellvik
Copy link
Owner Author

Mellvik commented Oct 16, 2024

I'll hold off importing the code until you send the 'clear' signal :-)

mfs is all fixed and working well in ghaerr/elks#2080.

Thanks.

I'm thinking of always limiting block writes to single blocks (never tracks, yet to be confirmed in the DF driver) and also making available a single-block cache at DMASEG (before any track cache), which would allow multiple drives or floppy disks to perform single-block write I/O without invalidating any track cache.

Agree on this, and like I said before, this is a matter of changing a few lines in the DF driver. The writing is where it should be: there has never been a cache or full track write, just a block write-thru if hitting the cache, otherwise invalidating the cache and writing the block in question. If raw, the first block of DMASEG may be used as a bounce buffer. Avoiding the cache invalidation on writes is on my list, and unlike a previous comment, it may occasionally be useful even when operating on a single floppy.
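
To make that concrete, roughly this (pseudo-C with hypothetical names, not the actual directfd.c code):

```c
/* Sketch of the write path as described, all names hypothetical:
 * only the requested 1K block (2 sectors) ever goes to the media. */
static void df_write_block(int drive, unsigned int sector, char *buf)
{
    if (drive == cache_drive &&
        sector >= cache_start && sector + 2 <= cache_start + cache_len) {
        /* hit: refresh the cached copy (write-through), cache stays valid */
        memcpy(cache_base + (sector - cache_start) * 512, buf, 1024);
    } else if (drive == cache_drive) {
        cache_len = 0;      /* miss: current behavior invalidates the cache */
    }
    fdc_write(drive, sector, buf, 2);   /* write just this block, never the cache */
}
```

The item on my list is essentially to drop that cache_len = 0 on a miss (except when the first block of the cache has been used as a raw bounce buffer), since a write outside the cached range doesn't make the cached sectors stale.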

On another subject similar to lifting the cache code above the BIOS driver (for use with both BIOS and DF), the same needs to be done for the old-fashioned BIOS "probe" code where both ELKS EPB and DOS BPBs are examined by reading the floppy boot sector when first opened. This code is badly missing from the DF driver and prevents problems otherwise having to be guessed, especially for DOS disks, but helps firmly identify the media type for bootable ELKS/TLVC floppies as well.

I've always considered getting rid of the (D)DPT a blessing, or are you talking about something else? What would be the benefit of adding this to the DF driver?

@ghaerr
Copy link

ghaerr commented Oct 16, 2024

there has never been a cache or full track write, just a block write-thru if hitting the cache

Can you point to the source lines in tlvc/directfd.c that do that? I wasn't able to find it. [EDIT: never mind, I figured it out].

Avoiding the cache invalidation on writes is on my list

Cool. Although costing 1K bytes, I think having an always-available DMASEG for single block operations opens up the ability to improve the track cache handling. I'm wondering whether it might be a good idea to support caching only on the boot floppy drive, rather than flip/flopping between drives (when copying), as even the inode lookup for a file create when copying from drive 0 to drive 1 would currently end up invalidating the drive 0 cache and track-caching drive 1, only to reverse that on the first copied block, etc. Since you're thinking that drive 1 is much less used, perhaps a boot-drive-only cache makes more sense. More details will be found on actual (worst case) behavior after adding more trace code, so we can see what's happening during file copies, not just read-only boot actions.

Our discussions are proving quite useful, as the mfs image packing mod is showing a 50% boot speed increase with the BIOS driver, which uses semi-track caching, and a 40% speed increase using (your original on my ELKS) DF driver full-track caching. This is a huge deal, given what we've been seeing previously. These results are using QEMU IODELAY delay emulation, but I think they'll be fairly close to real hardware. If this turns out to be the case, then it appears the biggest speed problem might be floppy seek operations, along with track-caching only the data actually needed. The jury's still out on 386+ systems with image packing and whether a cache helps at boot for that case. Yes, your testing as always exposes deep rabbit holes!!

I'll possibly add a seek time measurement while at it.

Given my possible conclusions above, that's probably a very good idea. I'm going to try to add some code that shows seek movements between track reads so we get a better idea about what might be the slowest aspect of floppy performance.
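
Something as simple as this would probably do it (hypothetical names, sketch only):

```c
/* Sketch (hypothetical names): log head movement between successive requests
 * so the seeks - likely the slow part - show up in the trace. */
static int last_cyl = -1;

static void trace_seek(int drive, int cyl)
{
    if (last_cyl >= 0 && cyl != last_cyl)
        printk("df%d: seek %d -> %d (%d cyls)\n", drive, last_cyl, cyl,
               cyl > last_cyl ? cyl - last_cyl : last_cyl - cyl);
    last_cyl = cyl;
}
```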

I've always considered getting rid of the (D)DPT a blessing, or are you talking about something else?

Something else entirely - the BPB and EPB provide different information, required for fast floppy results.

Here are my definitions of each:

  • DDPT - Disk Driver Parameter Table - used to inform BIOS of programmer override of floppy settings, but solely for the use of programming the FDC, not identifying the floppy format. Contains max sector info, but no cylinder or head info.
  • BPB - BIOS Parameter Block - (CHS 0,0,1 of FAT floppy) used to inform MSDOS of floppy format (and ELKS of the same when seeing FAT floppies). Contains max sector and head info along with filesystem size, which can be used to know max cylinder.
  • EPB - ELKS Parameter Block - (CHS 0,0,1 of a MINIX-formatted floppy) used to inform ELKS of floppy max track, head and sector info.

TL;DR
The current BIOS driver "probe" routine tries to identify the floppy image media type by reading a single (boot) sector, which, for all DOS and most ELKS-formatted disks, immediately allows the CHS to be known; otherwise a sector read/track seek probe series is performed (not the same as what the BIOS itself does) to come up with a tested CHS, rather than having to be dependent on the BIOS's idea. The biggest advantage is very fast identification of floppy format, but the kernel and DOS want to be very sure they know the exact format rather than believing a BIOS or driver. Of course, the BIOS itself performs its own identification, similar to the way the DF driver "probes", but that only works to a point - similar to your stated problem of 360k vs 720k and the xtflpy= need (or is xtflpy needed for drive type identification?) Both the BIOS and DF driver use CMOS as a starting point, and then try to determine a bit more than drive type if possible. Neither DOS nor ELKS is willing to believe the BIOS's idea of what the floppy format is, and the BIOS doesn't actually keep track of CHS, only drive type and DDPT (and there's no way to get that information passed back reliably anyways).
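
For reference, the geometry fields sit at fixed offsets in the FAT BPB, so a single boot-sector read gives the CHS essentially for free; a rough stand-alone sketch (not the actual ELKS probe code):

```c
#include <stdint.h>

/* Rough sketch: pull the geometry out of a FAT BPB read from CHS 0,0,1.
 * Offsets are the standard BPB layout (little-endian words). */
struct floppy_geom { unsigned cyls, heads, spt; };

static int probe_bpb(const uint8_t *boot, struct floppy_geom *g)
{
    unsigned total = boot[19] | (boot[20] << 8);   /* total sectors (small FAT volumes) */
    unsigned spt   = boot[24] | (boot[25] << 8);   /* sectors per track */
    unsigned heads = boot[26] | (boot[27] << 8);   /* number of heads */

    if (boot[510] != 0x55 || boot[511] != 0xAA)    /* boot sector signature */
        return -1;
    if (!spt || !heads || !total)
        return -1;                                 /* not a usable BPB */
    g->spt   = spt;
    g->heads = heads;
    g->cyls  = total / (spt * heads);              /* filesystem size gives max cylinder */
    return 0;
}
```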

In summary, four years ago a lot of time was spent creating a reliable, fast probe routine that allowed for very fast floppy identification without the need for any special handling from the BIOS (except of course for providing a DDPT when calling the BIOS to override its settings).

My comment was about the fact that all of this is entirely missing from the DF driver. Properly engineered, the ELKS kernel and its use of BPB and EPB can/should be decoupled from the TLVC/ELKS DF driver re-implementation of an FDC controller driver, just like is being done with the BIOS driver. The way I'm starting to look at this is that the BIOS or DF driver should only concern itself with very low level read/write FDC handling, not track caching or EPB/BPB identification. (My first pass at this was a year ago when I separated bios.c from bioshd.c.) That said, of course the FDC itself needs to know data rates, HUTs, etc. and may still have to perform the current "probe" (just like any real BIOS does).

Complicated? We wouldn't be interested if it weren't :)

@ghaerr
Copy link

ghaerr commented Oct 17, 2024

I'm now going to take a deeper dive into the DF driver cache routines with attention towards READ vs WRITE and multi-floppy performance.

I added all the same track cache debug display code in the BIOS driver to the DF driver for comparison in ghaerr/elks#2081. I've left a detailed description there, but the quick summary is that drive-to-drive copies are working quite well in both drivers, and I was not correct about track cache invalidation being a problem (explained there). Also, my num_sectors calculation was found to be incorrect for IODELAY emulation, and that correction resulted in the boot times between drivers being extremely close at 4.5s for BIOS and 5.0s for DF using the optimized mfs image allocation. (The current Image.all doesn't include files necessary for high speed net start operation; feel free to add the files. When using DEBUG_CACHE and debug=1, the system will display the files opened or execed in the order processed, and you can use that information to quickly add more files to Image.all in that same order.)

@Mellvik
Copy link
Owner Author

Mellvik commented Oct 17, 2024

Avoiding the cache invalidation on writes is on my list

Cool. Although costing 1K bytes, I think having an always-available DMASEG for single block operations opens up the ability to improve the track cache handling. I'm wondering whether it might be a good idea to support caching only on the boot floppy drive, rather than flip/flopping between drives (when copying), as even the inode lookup for a file create when copying from drive 0 to drive 1 would currently end up invalidating the drive 0 cache and track-caching drive 1, only to reverse that on the first copied block, etc. Since you're thinking that drive 1 is much less used, perhaps a boot-drive-only cache makes more sense. More details will be found on actual (worst case) behavior after adding more trace code, so we can see what's happening during file copies, not just read-only boot actions.

Yes, this line of reasoning does make sense. As you found out, the DF driver already does the right thing on writes and doesn't touch the cache unless it has to (needs the cache for bouncing). Reserving the cache for the root drive (or drive 0 if not booting from fd) is likely to be good in all but some very rare situations. (I've been using df1 as the root drive quite a bit during the recent benchmarking; I never thought about that having practical application before.)
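
A minimal sketch of that policy, with made-up names - the real check would sit wherever the driver decides to fill the cache:

```c
/* Sketch, hypothetical names: only the root/boot floppy uses the sector cache;
 * other drives do plain per-block I/O via DMASEG, so a copy to df1 never
 * evicts df0's cached data. */
static int may_use_cache(int drive)
{
    return use_cache && drive == root_floppy_drive;
}
```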

Our discussions are proving quite useful, as the mfs image packing mod is showing a 50% boot speed increase with the BIOS driver, which uses semi-track caching, and a 40% speed increase using (your original on my ELKS) DF driver full-track caching. This is a huge deal, given what we've been seeing previously. These results are using QEMU IODELAY delay emulation, but I think they'll be fairly close to real hardware. If this turns out to be the case, then it appears the biggest speed problem might be floppy seek operations, along with track-caching only the data actually needed. The jury's still out on 386+ systems with image packing and whether a cache helps at boot for that case. Yes, your testing as always exposes deep rabbit holes!!

I'll possibly add a seek time measurement while at it.

Given my possible conclusions above, that's probably a very good idea. I'm going to try to add some code that shows seek movements between track reads so we get a better idea about what might be the slowest aspect of floppy performance.

I've had limited screen time these last few days, but will check in the current fdtest version, which uses times() instead of time() for timing for now. I want to switch to the precision timer though, and while porting the user space version of it and some other stuff over from ELKS I ran into some snags - which in turn made me realize that the ELKS/TLVC cross tools are gone from github (along with tkchia). Seems we need some new links to get our build from scratch setup working again...
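
For reference, the times()-based timing is essentially the following portable sketch (on ELKS/TLVC the tick rate may have to come from a constant rather than sysconf()); the tick granularity is exactly why the precision timer is more attractive:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

int main(void)
{
    struct tms t;
    long hz = sysconf(_SC_CLK_TCK);     /* ticks per second; may need a constant on ELKS */
    clock_t start = times(&t);

    /* ... the floppy I/O being measured goes here ... */

    clock_t stop = times(&t);
    printf("elapsed: %ld ms (tick granularity %ld ms)\n",
           (long)(stop - start) * 1000L / hz, 1000L / hz);
    return 0;
}
```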

In summary, four years ago a lot of time was spent creating a reliable, fast probe routine that allowed for very fast floppy identification without the need for any special handling from the BIOS (except for of course providing a DDPT when calling the BIOS to override it's settings).

My comment was about the fact that all of this is entirely missing from the DF driver. Properly engineered, the ELKS kernel and its use of BPB and EBP can/should be decoupled from the TLVC/ELKS DF driver re-implementation of an FDC controller driver, just like is being done with the BIOS driver. The way I'm starting to look at this is that the BIOS or DF driver should only concern itself with very low level read/write FDC handling, not track caching or EBP/BPB identification. (My first pass at this was a year ago when I seperated bios.c from bioshd.c). That said, of course the FDC itself needs to know data rates, HUTs, etc and may still have to perform the current "probe" (just like any real BIOS does).

OK, I get it. Still, it seems to me that the setup we have (in the DF driver) is more than adequate for the purpose - or rather, I'm missing the point on what can be done at a higher level that improves the situation. The thing is, we need access to FDC level errors in order to probe correctly anyway, and it seems to me the driver is the perfect place to do this.

Complicated? We wouldn't be interested if it weren't :)

Completely agree. Where digital meets physical - that's where real problems are faced - and solved, right? Who would have thought we'd sit here counting milliseconds and rotational delay in 2024? BTW, the fdtest program sometimes reports 190ms for a full track (physical) read @ 300rpm - which shouldn't even be possible, since at 300rpm a single revolution takes 200ms. That's what sent me to the precision timer routines. Eliminate variables...

@ghaerr
Copy link

ghaerr commented Oct 17, 2024

which in turn made me realize that the ELKS/TLVC cross tools are gone from github (along with tkchia). Seems we need some new links to get our build from scratch setup working again...

Oh geez, not good! I just checked here and the required binutils-ia16 and gcc-ia16 are still there... what error were you getting, could you download/build the cross compiler using tools/build.sh or not?

it seems to me that the setup we have (in the DF driver) is more than adequate for the purpose

My issue is code portability - why duplicate all the BIOS probe code and BIOS cache code when, if designed better, both could be reused? It's the old problem of programmer reinvention desires vs reuse.

The thing is, we need access to FDC level errors in order to probe correctly anyway

Not really. The old-fashioned (but much needed) "probe" code just needs an error return (0 or not 0 meaning success or not) from the driver. This is easy to do, as the buffer system uses the DF and BIOS driver return code every time in order to mark a buffer valid or not. Why rewrite (and test!?) all the old sector probe code when it's already written and working well?
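
To illustrate the point: a probe along these lines needs nothing but a success/failure return from the driver (illustrative only, not the actual ELKS probe code):

```c
/* Illustrative only: find sectors-per-track by trying to read the last sector
 * of each candidate format on cylinder 0, head 0.  All the probe needs from
 * the driver is success (0) or failure - no FDC error details. */
static const unsigned char try_spt[] = { 36, 18, 15, 9, 8 };

static int probe_spt(int drive)
{
    int i;

    for (i = 0; i < (int)(sizeof try_spt); i++)
        if (read_sector(drive, 0, 0, try_spt[i]) == 0)  /* hypothetical helper */
            return try_spt[i];
    return -1;          /* unknown format */
}
```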

It's all good. As a guy who ends up maintaining all this stuff, I find that duplicated code heavily adds to the maintenance burden. Sometimes it's not all about experimentation but rearranging spaghetti threads in order for software to talk to each other. A similar issue is being discussed with regards to network driver API compatibility. I use these sorts of discussions to learn about how software needs to be architected for reuse. In the longer term, I have seen the mess created with arbitrary contributions by programmers long departed. That resembled the insane mess of crap I found when I first got involved with ELKS 4+ years ago. It's been a long haul!!!

it seems to me the driver is the perfect place to do this.

My current take on floppy, hard disk, network and ramdisk drivers is that they should all do as absolutely little as possible, and let libraries or upper level routines share any common code. The packet driver spec is a good example of this (or at least was, until the kitchen sink got added and now many only run on DOS). There was a day when I wanted to just load packet driver binaries for ELKS... wouldn't that be useful!?

Thank you for your testing and conversation!

@Mellvik
Copy link
Owner Author

Mellvik commented Oct 17, 2024

which in turn made me realize that the ELKS/TLVC cross tools are gone from github (along with tkchia). Seems we need some new links to get our build from scratch setup working again...

Oh geez, not good! I just checked here and the required binutils-ia16 and gcc-ia16 are still there... what error were you getting, could you download/build the cross compiler using tools/build.sh or not?

The wget in the build failed, so I went to the repository - which was empty except for the message about having moved to gitlab and codeberg. I retried now and there are 3 repositories visible. I haven't retried the build script.

it seems to me that the setup we have (in the DF driver) is more than adequate for the purpose

My issue is code portability - why duplicate all the BIOS probe code and BIOS cache code when, if designed better, both could be reused? It's the old problem of programmer reinvention desires vs reuse.

The thing is, we need access to FDC level errors in order to probe correctly anyway

Not really. The old-fashioned (but much needed) "probe" code just needs an error return (0 or not 0 meaning success or not) from the driver. This is easy to do, as the buffer system uses the DF and BIOS driver return code every time in order to mark a buffer valid or not. Why rewrite (and test!?) all the old sector probe code when it's already written and working well?

We're coming from different sides on this. I have very limited interest in the BIOS driver and believe its role is for emergencies only. The direct driver is AFAIK desirable in every way except size. As long as we're on PCs, there are no compatibility issues that would benefit from using BIOS code instead.

As to error codes, I disagree. Reprobing on errors in general is not desirable and very few hard errors may actually indicate wrong format.

It's all good. As a guy who ends up maintaining all this stuff, I find that duplicated code heavily adds to the maintenance burden. Sometimes it's not all about experimentation but rearranging spaghetti threads in order for software to talk to each other. A similar issue is being discussed with regards to network driver API compatibility. I use these sorts of discussions to learn about how software needs to be architected for reuse. In the longer term, I have seen the mess created with arbitrary contributions by programmers long departed. That resembled the insane mess of crap I found when I first got involved with ELKS 4+ years ago. It's been a long haul!!!

Appreciated, and a goal we share. Of course, ELKS and TLVC have different goals so it is not unnatural that some divergent choices be made.

it seems to me the driver is the perfect place to do this.

My current take on floppy, hard disk, network and ramdisk drivers is that they should all do as absolutely little as possible, and let libraries or upper level routines share any common code. The packet driver spec is a good example of this (or at least was, until the kitchen sink got added and now many only run on DOS). There was a day when I wanted to just load packet driver binaries for ELKS... wouldn't that be useful!?

Indeed, that would be incredible. I don't know how efficient the packet drivers are - what they deliver on MTCP (DOS) is really good, but then, they own the machine while running. Again we share the goal, and again I disagree on the floppy driver: to me probing is a driver issue, but you do have a point since you have two drivers to worry about. I don't...

Thank you for your testing and conversation!
Likewise.

@ghaerr
Copy link

ghaerr commented Oct 17, 2024

I retried now and there are 3 repositories visible.

I wonder what happened - whether Github was down at the moment (I've had wget fail a number of times previously) or whether the repos actually went away for a while. I have saved download .gz files for everything if we need it.

Reprobing on errors in general is not desirable and very few hard errors may actually indicate wrong format.

IIRC you participated heavily in the demand for, requirements and testing of the BIOS probing driver code years back. You don't think all that probe code/work is useful to have/use anymore? It isn't involved in any read/write sector retry or error handling at all.

@Mellvik
Copy link
Owner Author

Mellvik commented Oct 18, 2024

I retried now and there are 3 repositories visible.

I wonder what happened - whether Github was down at the moment (I've had wget fail a number of times previously) or whether the repos actually went away for a while. I have saved download .gz files for everything if we need it.

Interesting - I retried just now (morning) and it works fine, so all good. Somewhat worrisome though that @tkchia seems to be moving away from github.

Reprobing on errors in general is not desirable and very few hard errors may actually indicate wrong format.

IIRC you participated heavily in the demand for, requirements and testing of the BIOS probing driver code years back. You don't think all that probe code/work is useful to have/use anymore? It isn't involved in any read/write sector retry or error handling at all.

I don't recall, but it sounds about right. I haven't looked it up, but I suspect it's all good and useful. At that time it was the only alternative, so it was well-spent effort. Today we do have an alternative, and what I'm saying is that I consider the DF driver to be the better choice, and IMHO keeping the rather simple, table-driven format probing in the driver is good. We don't have to agree on this, and like I said, it's only natural that ELKS and TLVC have different priorities.

@ghaerr
Copy link

ghaerr commented Nov 7, 2024

@Mellvik: I'm finally at the point of adding the "fixed size cache" used in your testing here for ELKS. I was wondering about the exact algorithm used, say for a 6K cache (as opposed to "full track caching, or full track plus 1 MT caching", etc).

When setting up the multisector read for the cache fill, I'm sure the simple case of just reading from the requested starting sector for 6K bytes is used when there are enough sectors left to perform an MT read, automatically switching the head from 0 to 1. But what happens when there are not enough sectors left in a single FDC-programmed MT read to fulfill the cache fill? Is the I/O split into multiple requests to get exactly 6K each time, or is the cache actually truncated in these cases?

For instance, what happens on a read request to initially fill the cache on a 1.44M floppy, starting on head == 1 and sector 15? Does the driver just read sectors 15-18, and set the cache to 2K in this case, or is the read continued with another I/O request to read the remaining 4K of the "fixed cache"? I am assuming that another I/O would be necessary since MT doesn't work to advance the cylinder, right?

The simple solution would be that the I/O request is truncated to what can be performed on a single MT I/O request. That is, the cache is filled fully only when the requested sector and head allows for a single FDC MT read request to do so - in the above example 2K valid cache.

Thanks!

@Mellvik
Copy link
Owner Author

Mellvik commented Nov 7, 2024

@Mellvik: I'm finally at the point of adding the "fixed size cache" used in your testing here for ELKS. I was wondering about the exact algorithm used, say for a 6K cache (as opposed to "full track caching, or full track plus 1 MT caching", etc).

Glad to hear that @ghaerr, glad to contribute.

When setting up the multisector read for the cache fill, I'm sure the simple case of just reading from the requested starting sector for 6K bytes is used when there are enough sectors left to perform an MT read, automatically switching the head from 0 to 1. But what happens when there are not enough sectors left in a single FDC-programmed MT read to fulfill the cache fill? Is the I/O split into multiple requests to get exactly 6K each time, or is the cache actually truncated in these cases?

No, I use a test for how much is left of the cylinder, and use that in the DMA request.
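
In sketch form (hypothetical names), the fill size works out to roughly:

```c
/* Sketch with hypothetical names: how many sectors to read into the cache.
 * 'start' is the requested sector's 0-based index within the cylinder
 * (head * spt + sector - 1), 'cache_size' is the cache size in sectors. */
static unsigned int cache_fill_sectors(unsigned int start, unsigned int spt,
                                       unsigned int cache_size)
{
    unsigned int left_in_cyl = 2 * spt - start;     /* both heads, one MT read */

    return left_in_cyl < cache_size ? left_in_cyl : cache_size;  /* -> cache_len */
}
```

With spt = 18 and a request starting at head 1, sector 15 (index 32 within the cylinder), that yields 4 sectors, i.e. the 2K of valid cache in your example.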

For instance, what happens on a read request to initially fill the cache on a 1.44M floppy, starting on head == 1 and sector 15? Does the driver just read sectors 15-18, and set the cache to 2K in this case, or is the read continued with another I/O request to read the remaining 4K of the "fixed cache"? I am assuming that another I/O would be necessary since MT doesn't work to advance the cylinder, right?

That's right. And if I remember correctly, that's what the cache_len variable is for, keeping track of how many sectors of the cache are currently valid.

The simple solution would be that the I/O request is truncated to what can be performed on a single MT I/O request. That is, the cache is filled fully only when the requested sector and head allows for a single FDC MT read request to do so - in the above example 2K valid cache.

You got it already!

Thank you.
