[comp.sys.cbm] SEQ file access speedup

leblanc@eecg.toronto.edu (Marcel LeBlanc) (02/10/89)

In article <826@csd4.milw.wisc.edu> jgreco@csd4.milw.wisc.edu (Joe Greco) writes:
>]drives.  When I decided (2 yrs ago) that 1571 drives didn't give me enough
>]storage, I considered getting an IEEE drive, but then I learned about the
>]new 1581 drives that had been announced.  I managed to get two from
>]Commodore, and I'm glad I did.  It loads much faster than IEEE drives (my
>]assembler LOADs include files, taking maximum advantage of load speed), and
>]has good storage capacity (800K vs 1M/floppy for 8250).
>
>IF you're on a 128, or IF you're on a 64 with some sort of fastloader.
>Which is still only a marginal speedup, considering that file based
>operations are not enhanced by such devices.  (Since that's the major
>type of operation I need around here, that's what I look at.)  BLITZ!
>isn't speeded up at all by FastLoad... hehehe

I can appreciate that file operations are very important to you, and a
relatively small number of others (maybe 10's of thousands?).  But you
aren't exactly a typical C64 user, and neither am I.  For the millions that
use their C64 to play the latest games, all they are concerned about is how
fast they can start up the game, or how quickly the next level can be loaded
in.  Speeding up file operations is a more difficult issue.

Let's keep things in perspective here.  Although it's possible to speed up
sequential file access with 'transparent' speedup software, you will never
get as much of a speed increase as is possible on LOADs.  This has less to
do with the transfer protocol than with the LOW PERFORMANCE limitations of
the C64 kernal.  To remain compatible with existing software, speedup
software must intercept OPEN, CLOSE, CHKIN, CHKOUT, CHRIN, GETIN, & CHROUT.
You can't expect that much speed if you call a subroutine for every byte of
a transfer.  You CAN expect much more speed if you call a subroutine to
transfer large blocks of memory (LOAD & SAVE).  All of this assumes at least
minimal optimization in the LOAD and SAVE routines.  If these are just
implemented as loops that repeatedly call the single byte transfer routines,
then the performance won't be any better.
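A back-of-envelope sketch of that per-byte vs. block-transfer argument (in Python; the cycle counts are purely illustrative assumptions, not measured Kernal timings):

```python
# Ceiling on transfer speed for a 1 MHz 6502, under assumed per-byte costs.
# NOTE: 150 and 15 cycles/byte below are illustrative round numbers only.
CLOCK_HZ = 1_000_000

def max_bytes_per_sec(cycles_per_byte):
    """Upper bound on throughput if every byte costs this many CPU cycles."""
    return CLOCK_HZ / cycles_per_byte

# Going through CHRIN/GETIN: JSR/RTS (12 cycles) plus device dispatch,
# status handling, etc. -- call it 150 cycles per byte:
per_byte_calls = max_bytes_per_sec(150)
# A tight block-move loop (LDA/STA/INY/BNE): roughly 15 cycles per byte:
block_move = max_bytes_per_sec(15)

print(f"per-byte Kernal calls: ~{per_byte_calls:,.0f} bytes/s ceiling")
print(f"tight block move:      ~{block_move:,.0f} bytes/s ceiling")
```

Whatever the exact cycle counts, the point stands: the per-call ceiling sits an order of magnitude below the block-move ceiling before the serial bus even enters the picture.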

For example, consider LOAD vs. sequential read on IEEE-488 drives, or C128
burst vs. fast serial.  The C128 gives you great burst serial speed (LOAD &
SAVE), but using fast serial instead of slow serial doesn't give you that
much of a speed increase (maybe 2x).  Since software overhead is the
dominant factor here, I'll guess that seq read on IEEE drives also gives
about a 2x speedup.  If somebody has concrete numbers, please post!

Before Eric Green decides to flame me :-), I should point out that I haven't
forgotten that block transfers can be hidden from applications software.
For devices like the 1541/71/81, it's quite reasonable to expect speedup
software to transfer pieces of a file in blocks, then read from this buffer
when a call is made to CHRIN or GETIN (same for CHROUT on write).  This
would have been great in the early days of the C64 & 1541 (my GUESS, 3-3.5x
speedup)!  But today, too much software bypasses CHRIN/CHROUT to use
ACPTR/CIOUT directly.  It's also a great idea for the C128 if you can spare
enough memory to burst load the whole file (or use an REU).
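The buffering scheme described above can be sketched like this (Python; `BufferedChannel` and `read_block` are hypothetical names standing in for the intercepted Kernal routine and the fast block-transfer primitive):

```python
# Sketch of 'transparent' buffering: an intercepted GETIN serves bytes
# from a cached block, refilling with one fast block transfer instead
# of paying one slow handshake per byte.  All names are illustrative.

class BufferedChannel:
    BLOCK = 256                       # bytes per fast block transfer

    def __init__(self, read_block):
        self.read_block = read_block  # fast block-transfer primitive
        self.buf = b""
        self.pos = 0

    def getin(self):
        """Single-byte read, as a program calling CHRIN/GETIN would see it."""
        if self.pos >= len(self.buf):
            self.buf = self.read_block(self.BLOCK)  # refill in one burst
            self.pos = 0
            if not self.buf:
                return None           # end of file
        byte = self.buf[self.pos]
        self.pos += 1
        return byte

# Demo against an in-memory 'file':
data = bytes(range(256)) * 4
offset = 0
def read_block(n):
    global offset
    chunk = data[offset:offset + n]
    offset += len(chunk)
    return chunk

ch = BufferedChannel(read_block)
out = bytearray()
while (b := ch.getin()) is not None:
    out.append(b)
assert bytes(out) == data             # every byte arrives, in order
```

Software that calls ACPTR/CIOUT directly never enters `getin`, which is exactly why it can't benefit from this kind of interception.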

So what was the point of this posting?  Just that, if the software you use
has to do sequential file reads or writes, you are limited in how much it
can be speeded up without re-writing it.  The main reason I use Buddy 128
(an assembler) is that it LOADs include files (which defaults to burst
serial), giving me great speed on a 1581.  The SAME speedup factor (10-12x)
is possible on the C64 using software only (with 1541/71/81)!  This is at
least twice as fast as IEEE can load.  A complete assembly, which requires
2 passes through 600K of tokenized source, takes about 12 mins.  Using seq
reads on a C64 would probably take about 1.5 hours, or 50 mins using IEEE
drives (I haven't timed these, so they are just guesses).

If somebody can suggest a faster method of speeding up seq file accesses,
please let us know!  What we really need is a new OS for the C64...

Marcel A. LeBlanc	  | University of Toronto -- Toronto, Canada
leblanc@eecg.toronto.edu  | also: LMS Technologies Ltd, Fredericton, NB, Canada
-------------------------------------------------------------------------------
UUCP:	uunet!utai!eecg!leblanc    BITNET: leblanc@eecg.utoronto (may work)
ARPA:	leblanc%eecg.toronto.edu@relay.cs.net  CDNNET: <...>.toronto.cdn

elg@killer.DALLAS.TX.US (Eric Green) (02/13/89)

in article <89Feb10.182100est.2732@godzilla.eecg.toronto.edu>, leblanc@eecg.toronto.edu (Marcel LeBlanc) says:
>>Which is still only a marginal speedup, considering that file based
>>operations are not enhanced by such devices.  (Since that's the major
>>type of operation I need around here, that's what I look at.)  BLITZ!
>>isn't speeded up at all by FastLoad... hehehe
> Let's keep things in perspective here.  Although it's possible to speed up
> sequential file access with 'transparent' speedup software, you will never
> get as much of a speed increase as is possible on LOADs.  This has less to
> do with the transfer protocol, than with the LOW PERFORMANCE limitations of
> the C64 kernal.  To remain compatible with existing software, speedup
> software must intercept OPEN, CLOSE, CHKIN, CHKOUT, CHRIN, GETIN, & CHROUT.
> You can't expect that much speed if you call a subroutine for every byte of
> a transfer.  

     Take a look at the C-128's Kernal, and, specifically, the
burst-mode load routines (esp. the subroutine at $f4C5). It gets that
speedup despite jsr'ing STASH for each byte to store the data into the
proper bank of RAM.
     For an early version of some hardware later discarded due to
various problems, I didn't have burst-loads implemented yet, but had
standard fast serial working. I called LOAD. It had some speedup,
about equivalent to what you get from an old Epyx fastload cartridge,
but wasn't a real speed demon. Later I re-wrote LOAD to use
burst-load. Loading a large file (I forget the size, somewhere around
100 blocks) dropped from 18 seconds to 5 seconds. The main
difference was the protocol.
     It seems obvious to me that the main reason fast serial isn't as
fast as burst mode is the character-by-character handshaking taking
place, not subroutine overhead. So, if you had a software product that
read 256 bytes worth of SEQ file data into a buffer, it would probably
speed up SEQ file access considerably -- although not as fast as
LOAD'ing, since there IS overhead involved in reading from a buffer
(e.g. see the 1750 REU example below).
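Eric's per-byte vs. per-block handshaking point can be modeled roughly like this (Python; every timing constant here is a made-up round number for illustration, not a 1571 fast-serial or burst-mode measurement):

```python
# Toy model: per-byte handshaking pays a negotiation for every byte;
# per-block handshaking pays one setup per block, then streams bytes.

def per_byte_protocol(n, handshake_us=180, xfer_us=40):
    """Every byte pays its own handshake (assumed microsecond costs)."""
    return n * (handshake_us + xfer_us) / 1e6        # seconds

def per_block_protocol(n, block=256, setup_us=2000, xfer_us=40):
    """One negotiation per block; bytes then go back-to-back."""
    blocks = -(-n // block)                          # ceiling division
    return (blocks * setup_us + n * xfer_us) / 1e6

n = 98 * 254                     # roughly a 98-block file's worth of data
print(f"per-byte handshake:  {per_byte_protocol(n):4.1f} s")
print(f"per-block handshake: {per_block_protocol(n):4.1f} s")
```

The exact constants don't matter much; as long as the handshake dominates the raw transfer time, amortizing it over a block wins by a large factor.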

> For example, consider LOAD vs. sequential read on IEEE-488 drives, or C128
> burst vs. fast serial. 

Curious, I just wrote a simple benchmark for the C-128. All it does is GETIN
from a file until it gets to the end. It took 17 seconds to read a 98
block file, or about the same as it took to LOAD the exact same file
off of one of my SFD-1001 IEEE drives (using the Skyles IEEE Flash,
admittedly not the fastest IEEE interface). This is about
1.3kbytes/second -- not slow at all, considering that the 1541 is
straining to do 400 cps. The IEEE interface doesn't do anything
special for LOAD'ing -- it just JSR's ACPTR repeatedly and stashes it,
just like the ordinary LOAD routine (in fact, it IS the ordinary LOAD
routine -- totally unmodified).
     Then I tried the same thing using RAMDOS and the 1750. 6 seconds
burstloading off of 1571. 7 seconds reading using RAMDOS (loading
using RAMDOS is just about instantaneous, using the DMA chip).

> Before Eric Green decides to flame me :-), I should point out that I haven't
> forgotten that block transfers can be hidden from applications
> software.

Who me? Flame? ;-). Hiding block transfers would be especially useful
for the 1750, because for byte-at-a-time transfers it's reading single
bytes instead of doing a DMA transfer of 256 bytes & going from there.
Unfortunately, there's not 256 bytes free anywhere to use for such a
buffer :-(.

> For devices like the 1541/71/81, it's quite reasonable to expect speedup
> software to transfer pieces of a file in blocks, then read from this buffer
> when a call is made to CHRIN or GETIN (same for CHROUT on write).  This
> would have been great in the early days of the C64 & 1541 (my GUESS, 3-3.5x
> speedup)!  But today, too much software bypasses CHRIN/CHROUT to use
> ACPTR/CIOUT directly.  

It could be done. But you'd have to do one of two things: illegally
copy CBM's ROM & modify it, or put the Kernal in RAM and "patch" it.
The latter would be expensive, at least at current prices (32Kx8
static RAM is around $14 right now). After you have a patched ROM
image, it's fairly
easy to do hardware tricks to swap it into place of the ordinary ROM
(but it DOES require at least one jumper into the inside of the
computer). 

In any event, there is STILL a lot of software that uses
CHRIN/GETIN... if I had the fastloader expertise to do it, I'd give it
a try just to see how much of a speedup it was.

> So what was the point of this posting?  Just that, if the software you use
> has to do sequential file reads or writes, you are limited in how much it
> can be speeded up without re-writing it.  

True. I suspect that the REU timing is about the maximum speed you can
get using byte-at-a-time. That time truly reflects JSR overhead. 

> least twice as fast as IEEE can load.  A complete assembly, which requires
> 2 passes through 600K of tokenized source, takes about 12 mins.  Using seq
> reads on a C64 would probably take about 1.5 hours, or 50 mins using IEEE
> drives (I haven't timed these, so they are just guesses).

600K?! (wow... boggle-mode activated!). Using HCD128 and the 1750 REU,
using SEQ files, it'd take about 20 minutes (if you could fit it on
the REU!). Now, on the Amiga, using DASM (a real speed-demon).... I'd
be surprised if it took longer than 2 minutes out of RAM.

In any event... nobody's denying that LOADing will generally be faster
than READing. Just that SEQ file reading can be speeded up much more
than you imply.

> What we really need is a new OS for the C64...

Amen! But who's going to bother, when they can just go out and buy a
"real" computer? Today was the first time I'd touched my 128 in over a
week.... 

--
|    // Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
| \X/              >> In Hell you need 4Mb to Multitask <<                  |

jgreco@csd4.milw.wisc.edu (Joe Greco) (02/14/89)

]>IF you're on a 128, or IF you're on a 64 with some sort of fastloader.
]>Which is still only a marginal speedup, considering that file based
]>operations are not enhanced by such devices.  (Since that's the major
]>type of operation I need around here, that's what I look at.)  BLITZ!
]>isn't speeded up at all by FastLoad... hehehe
]
]I can appreciate that file operations are very important to you, and a
]relatively small number of others (maybe 10's of thousands?).  But you
]aren't exactly a typical C64 user, and neither am I.  For the millions that
]use their C64 to play the latest games, all they are concerned about is how
]fast they can start up the game, or how quickly the next level can be loaded
]in.  Speeding up file operations is a more difficult issue.

I agree that FastLoad has its place....  but it is about as useful to
me as a doorstop is or as a 1670 is.  :-)  My 1541 is too slow to be
useful in any real way.

]Let's keep things in perspective here.  Although it's possible to speed up
]sequential file access with 'transparent' speedup software, you will never
]get as much of a speed increase as is possible on LOADs.  This has less to
]do with the transfer protocol, than with the LOW PERFORMANCE limitations of
]the C64 kernal.  To remain compatible with existing software, speedup
]software must intercept OPEN, CLOSE, CHKIN, CHKOUT, CHRIN, GETIN, & CHROUT.
]You can't expect that much speed if you call a subroutine for every byte of
]a transfer.  You CAN expect much more speed if you call a subroutine to
]transfer large blocks of memory (LOAD & SAVE).  All of this assumes at least
]minimal optimization in the LOAD and SAVE routines.  If these are just
]implemented as loops that repeatedly call the single byte transfer routines,
]then the performance won't be any better.
]
]For example, consider LOAD vs. sequential read on IEEE-488 drives, or C128
]burst vs. fast serial.  The C128 gives you great burst serial speed (LOAD &
]SAVE), but using fast serial instead of slow serial doesn't give you that
]much of a speed increase (maybe 2x).  Since software overhead is the
]dominant factor here, I'll guess that seq read on IEEE drives also gives
]about a 2x speedup.  If somebody has concrete numbers, please post!

I HAD concrete numbers, but rn barfed and then csd4 went down on
Sunday.  I don't have the exact figures with me, but here are
approximations:  A c64 with 1541 took about 1:30 to read 30,000 bytes.
A c64 with BusCard II and 8050 took more like 0:30.  The BusCard II,
by the way, is considered a "slower" interface.  I will try to make
some more tests at home tonight with the 1750 and the fast MSD IEEE
interface.

I used the following routine to do the reading and a stopwatch to time:

., 033c a2 02       ldx #$02      ; logical file #2
., 033e 20 c6 ff    jsr $ffc6     ; CHKIN: file 2 becomes the input channel
., 0341 a9 00       lda #$00
., 0343 8d 00 04    sta $0400     ; clear 16-bit byte counter ($0400/$0401)
., 0346 8d 01 04    sta $0401
., 0349 20 e4 ff    jsr $ffe4     ; GETIN: fetch one byte
., 034c ee 00 04    inc $0400     ; bump the counter
., 034f d0 03       bne $0354
., 0351 ee 01 04    inc $0401
., 0354 20 b7 ff    jsr $ffb7     ; READST: read the I/O status byte
., 0357 c9 00       cmp #$00
., 0359 f0 ee       beq $0349     ; status 0 = no EOF/error yet, keep reading
., 035b 4c cc ff    jmp $ffcc     ; CLRCHN: restore default channels

]Before Eric Green decides to flame me :-), I should point out that I haven't
]forgotten that block transfers can be hidden from applications software.
]For devices like the 1541/71/81, it's quite reasonable to expect speedup
]software to transfer pieces of a file in blocks, then read from this buffer
]when a call is made to CHRIN or GETIN (same for CHROUT on write).  This
]would have been great in the early days of the C64 & 1541 (my GUESS, 3-3.5x
]speedup)!  But today, too much software bypasses CHRIN/CHROUT to use
]ACPTR/CIOUT directly.  It's also a great idea for the C128 if you can spare
]enough memory to burst load the whole file (or use an REU).

Bad programming form to use calls that one cannot intercept with the
vector table.  :-)

]So what was the point of this posting?  Just that, if the software you use
]has to do sequential file reads or writes, you are limited in how much it
]can be speeded up without re-writing it.  The main reason I use Buddy 128
](an assembler) is that it LOADs include files (which defaults to burst
]serial), giving me great speed on a 1581.  The SAME speedup factor (10-12x)
]is possible on the C64 using software only (with 1541/71/81)!  This is at
]least twice as fast as IEEE can load.  A complete assembly, which requires
]2 passes through 600K of tokenized source, takes about 12 mins.  Using seq
]reads on a C64 would probably take about 1.5 hours, or 50 mins using IEEE
]drives (I haven't timed these, so they are just guesses).

That's why I refuse to assemble/compile on or work with 1541's.  The
IEEE drives are "about" five times faster.

]If somebody can suggest a faster method of speeding up seq file accesses,
]please let us know!  What we really need is a new OS for the C64...

How about UNIX on an Amiga 2500?  hehehe
--
jgreco@csd4.milw.wisc.edu		Joe Greco at FidoNet 1:154/200
USnail: 9905 W Montana Ave			     PunterNet Node 30 or 31
	West Allis, WI  53227-3329	"These aren't anybody's opinions."
Voice:	414/321-6184			Data: 414/321-9287 (Happy Hacker's BBS)

jgreco@csd4.milw.wisc.edu (Joe Greco) (02/14/89)

As promised, here are some access times for various sequential mode
disk accesses.  The files were not all identical, but I am including a
CPS rating to account for that.

Device(s)		Time	Filesize CPS	%speed
----------------------- ------- -------- ------ ------
Regular 1541/C64        01:35.2    33453    351    100
C64/BusCard II/8050     00:31.8    30234    952    271
C64/Custom MSD/8050     00:17.3    30234   1745    497
C64/RAMDOS 3.2/1750     00:12.2    38760   3188    908

The "Custom MSD" interface pushes the IEEE bus much closer to
Commodore's specifications than the BusCard II.  Actually, I'm
surprised at the huge difference there.  The RAMDISK is nearly ten
times as fast as the standard serial bus access.  It seems to me
that a more efficient design could well be implemented.
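The CPS and %speed columns of the table above can be reproduced in a few lines (Python; the small differences from the posted figures come from rounding in the timings):

```python
# Recomputing characters-per-second and relative speed from the
# time/filesize pairs posted in the table.

def cps(time_str, filesize):
    minutes, seconds = time_str.split(":")
    return filesize / (int(minutes) * 60 + float(seconds))

rows = [
    ("Regular 1541/C64",    "01:35.2", 33453),
    ("C64/BusCard II/8050", "00:31.8", 30234),
    ("C64/Custom MSD/8050", "00:17.3", 30234),
    ("C64/RAMDOS 3.2/1750", "00:12.2", 38760),
]
baseline = cps(*rows[0][1:])
for name, t, size in rows:
    rate = cps(t, size)
    print(f"{name:22s} {rate:6.0f} cps  {100 * rate / baseline:4.0f}%")
```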
 
As a side note:  If memory serves, the IEEE bus is actually
capable of megabyte-per-second rates.  Of course, my magnetic media is
probably flaking again....


elg@killer.DALLAS.TX.US (Eric Green) (02/14/89)

This is the results of benchmarking

a) loading, and
b) doing GETIN until EOF, from ML, doing nothing in between.

All tests were done with a 98 block file consisting of the main
body of a BBS program. It was just the handiest program that I had
available on both SFD and 1541 formats. I put it onto a blank 1541
disk, to prevent fragmentation. It was already first on the SFD disk
(the boot disk for the BBS, which I only recently made up).

My basic thought was that sequential file access can take place just
as fast as LOAD'ing. The benchmark confirms that for IEEE drives and
the standard 1541. There are a couple of constraints here. First of all,
doing TALK and UNTALK (chkin/clrchn) for each sequentially-read byte
is extremely slow. When you do chkin or clrchn, each call sends a
command byte out to the drive. So doing fast sequential access means
buffering your sequential file data anyhow (e.g. a simple filter
program would be best off reading in 256 bytes, filtering them, then
writing them to the output file, instead of doing
clrchn/chkin/clrchn/chkout for each individual byte -- a 4-to-1
overhead). When you do that, SEQ access isn't slow at all... just look
at these timings:

C-64, IEEE Flash, SFD-1001:    LOAD ('load"bbs",8'):  18 seconds
                               READ (ML seq loop):    18 seconds
C-64, C-LINK II, SFD-1001:     LOAD:                  14 seconds
                               READ:                  14 seconds
C-128, RAMDOS:                 READ:                   9 seconds
C-64, RAMDOS:                  READ:                   8 seconds
C-128, 1571:                   LOAD:                   8 seconds
                               READ:                  26 seconds
64 mode, 1571:                 LOAD:                  60 seconds
                               READ:                  62 seconds
64 mode, Epyx fastload cart.:  LOAD:                  26 seconds
64 mode, Mike J. Henry's
  "fastboot v2":               LOAD:                  26 seconds

Unfortunately I couldn't see if the Super Snapshot was faster than the
Epyx or fastboot product. My brother sold ours because it was
incompatible with his C-64 (a very early production model), and
because "it wasn't any faster than the fastload cartridge" (his words,
not mine -- I never even used the darn thing).

Some trivia: the main difference between LOAD'ing (burst mode) and
READ'ing (fastmode) on the 1571 is that fast mode negotiates a
transaction for each byte, while burst mode negotiates on a per-block
basis. Burst mode is unique in that manner -- even the IEEE drives
negotiate on a per-byte basis (probably why they're slower than burst
mode, despite fairly equivalent hardware). 

Some other trivia: Using ACPTR should be faster than using GETIN, if
subroutine overhead is as big a problem as some hint. GETIN has to
do all sorts of testing to see where to dispatch to -- is it keyboard,
or is it disk? This overhead should be noticeable when compared to
LOAD, which calls ACPTR directly. But for both the IEEE drives and the
1541, there was no significant difference between LOAD and GETIN
times, implying that transfer speed, and not internal Kernal overhead,
was the limitation. 
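A quick bit of arithmetic shows why that dispatch overhead vanishes at slow-serial speeds (Python; the 400 bytes/s figure comes from the 1541 rate quoted earlier in this thread, while the 60-cycle dispatch cost is an illustrative assumption):

```python
# How much of the per-byte cycle budget could GETIN's dispatch logic
# possibly consume at stock 1541 transfer rates?
CLOCK_HZ = 1_000_000                 # 1 MHz 6510
TRANSFER_RATE = 400                  # bytes/s, stock 1541 (from the thread)

budget = CLOCK_HZ / TRANSFER_RATE    # CPU cycles available per byte
dispatch = 60                        # assumed extra cycles GETIN spends
                                     # deciding keyboard vs. disk
print(f"cycle budget per byte: {budget:.0f}")
print(f"dispatch share: {100 * dispatch / budget:.1f}%")
```

With thousands of cycles to burn per byte, a few dozen cycles of dispatch are lost in the noise, which is consistent with LOAD and GETIN timing the same on slow devices.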


izot@f171.n221.z1.FIDONET.ORG (Geoffrey Welsh) (02/14/89)

 > From: jgreco@csd4.milw.wisc.edu (Joe Greco)
 > Message-ID: <955@csd4.milw.wisc.edu>
 > Device(s)               Time    Filesize CPS    %speed
 > ----------------------- ------- -------- ------ ------
 > Regular 1541/C64        01:35.2    33453    351    100
 > C64/BusCard II/8050     00:31.8    30234    952    271
 > C64/Custom MSD/8050     00:17.3    30234   1745    497
 > C64/RAMDOS 3.2/1750     00:12.2    38760   3188    908
 
   Add to the list (result from my memory):
 
   C128/C64-Link II/SFD                       1900
   HyperPET/D9060                             Bloody fast - I'll get specs!
 
   The C128 was running at 2 MHz.
 
   The "HyperPET" is a 4 MHz 4032.
 
 > As a side note:  The way my memory recalls, the IEEE bus is actually
 > capable of megabyte/second rates.  Of course, my magnetic media is
 > probably flaking again....
 
   The IEEE-488-1979 spec says that the data transfer rate shall not exceed 1 
megabyte per second, but the handshake is designed to slow the transfers down 
to the slowest selected device on the bus. Since it takes several 1 MHz clock 
cycles to program the I/O chips to send the handshake signals, megabyte per 
second speeds are way out of the question.
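Geoffrey's point can be put in numbers (Python; the 50-cycles-per-byte software handshake cost is an illustrative assumption, not a measurement of any particular interface):

```python
# Why a software handshake caps the IEEE bus far below its 1 MB/s spec:
# each byte needs several I/O-chip accesses, each costing CPU cycles
# on a 1 MHz processor.
CLOCK_HZ = 1_000_000

def handshake_limit(cycles_per_byte):
    """Throughput ceiling imposed by CPU-driven handshaking."""
    return CLOCK_HZ / cycles_per_byte

software = handshake_limit(50)    # assume ~8 chip accesses + loop overhead
print(f"software handshake ceiling: ~{software:,.0f} bytes/s")
print("hardware handshake + DMA:    up to 1,000,000 bytes/s (bus spec)")
```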
 
   There is also the question of how quickly the data can be "lifted" from the 
disk. Even with most IEEE drives' 2-processor design, there is a severe limit 
to the speed with which the data can be put on the bus.
 
   Nevertheless, some sort of automated hardware handshake and more tightly 
coded ROMs in the drive would lead to vastly improved performance.
 
===========================================================================
Internet:  Geoffrey.Welsh@f171.n221.z1.fidonet.org | 66 Mooregate Crescent
Usenet:    watmath!isishq!izot                     | Suite 602
FidoNet:   Geoffrey Welsh on 1:221/171             | Kitchener, Ontario
PunterNet: 7/Geoffrey Welsh                        | N2M 5E6 CANADA
BBS:       (519) 742-8939 24h 7d 300/1200/2400bps  | (519) 741-9553
===========================================================================
|  "I don't need a disclaimer. No one pays any attention to what I say."  |
===========================================================================
 



leblanc@eecg.toronto.edu (Marcel LeBlanc) (02/15/89)

In article <7143@killer.DALLAS.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>This is the results of benchmarking
>
>a) loading, and
>b) doing GETIN until EOF, from ML, doing nothing inbetween.
> ...
>My basic thought was that sequential file access can take place just
>as fast as LOAD'ing. The benchmark confirms that for IEEE drives and
>the standard 1541. There's a couple of constraints here. First of all,
 ...
>clrchn/chkin/clrchn/chkout for each individual byte -- a 4-to-1
>overhead). When you do that, SEQ access isn't slow at all.. just look
>at these timings:
>
>C-64, IEEE Flash, SFD-1001, 'load"bbs",8' :   18 seconds
> """"""""""""""""""""""""  ML loop, seq read: 18 seconds
					      ^^^^
>C-64, C-LINK II, SFD-1001    LOAD: 14 seconds
>                             READ: 14 seconds
				   ^^^^
	Yes, times are identical.  Please read on.

>in 64 mode, with 1571: load -- 60 seconds
>                       read -- 62 seconds

	Almost identical.  This supports what I said in my original posting
on this subject.  Here's an excerpt:

	... as much of a speed increase as is possible on LOADs.  This has
	less to do with the transfer protocol, than with the LOW PERFORMANCE
	limitations of the C64 kernal.  To remain compatible with ...

As you pointed out in an earlier posting, the standard C64 load routine does
nothing but repeatedly call ACPTR!  This is a LOW PERFORMANCE limitation
when you have a decent transfer protocol, but it's of no importance when you
have to use the standard serial protocol of the C64!  But then you say
SFD-1001 (IEEE interface) isn't low performance?  It is reasonably fast, but
since they just speed up ACPTR/CIOUT without changing the LOAD routine, the
ML loop that you have written should give the same results as LOAD (since
it's basically the same loop), and it does.  This DOESN'T mean that SEQ read
is as fast as block transfers (LOAD), it just means that you have to
optimize ("speed up") the block transfer software as well as the transfer
protocol.  This is even better demonstrated by the following numbers:

>128 Ramdos: READ: 9 seconds
>64 Ramdos: READ: 8 seconds

	AND, as you stated before, LOAD is virtually instantaneous!

>128 -- 1571 -- load -- 8 seconds
>               read -- 26 seconds
>64 mode, with Epyx fastload cart. --
>                       LOAD 26 seconds
>         with Mike J. Henry's "fastboot v2": 26 seconds

Doesn't it seem unusual that C128 fast serial, Epyx FastLoad, and Mike
Henry's fastboot all take the same amount of time (26 secs)?  [This really
isn't intended to sound like a flame.]  Here's what you wrote earlier in the
article:
>All tests were done with a 98 block file consisting of the main
>body of a BBS program. It was just the handiest program that I had
>available on both SFD and 1541 formats. I put it onto a blank 1541
>disk, to prevent fragmentation. It was already first on the SFD disk

From the numbers listed above, I would guess that you copied from the
SFD-1001 to a _1571_.  Unless you set the interleave yourself (using
"U0>"+chr$(interleave#)), the 1571 saves using a 6 sector interleave, even
when it's in 1541 mode.  The C128 burst mode can easily keep up with a 6
sector interleave, but Mike Henry's fastboot needs at least 8 sectors to
decode and transfer, and Epyx FastLoad needs at least 10 (the 1541
standard).  On a fresh disk, the program would be saved near the directory
track.  On this part of the disk, I think the sectors/track is 18.  Since
FastLoad, fastboot V2, and C128 fast serial can't keep up with an interleave
of 6 sectors, they are forced to wait a full revolution or 18+6 = 24
sectors!  The interleave forces a speed difference of 24/6 = 4 times
slowdown!  Of course, in a 98 block file, not all sectors can be stored
exactly 6 apart, so this is just a GOOD approximation.  This is very
close to the 26 sec/ 8 sec ratio (3.25) given by the above numbers.
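A quick model of that interleave penalty (Python; assumes a 300 rpm drive and tries both 18 and 19 sectors/track, since the zone counts near the directory track differ):

```python
# Rotational cost per sector in the file chain, with interleave 6:
# if the loader keeps up, the next sector is 6 positions away; if it
# misses the window, it waits a full revolution plus the interleave.
REV_MS = 60_000 / 300            # one disk revolution at 300 rpm = 200 ms
INTERLEAVE = 6

for spt in (18, 19):
    keeps_up = REV_MS * INTERLEAVE / spt           # next sector: 6 away
    misses = REV_MS * (spt + INTERLEAVE) / spt     # full rev + 6 sectors
    print(f"{spt} sectors/track: {keeps_up:4.1f} ms vs {misses:5.1f} ms "
          f"per sector -> {misses / keeps_up:.2f}x slowdown")
```

Either sector count puts the slowdown near 4x, bracketing the 3.25x ratio observed in the timings.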

Weren't we talking about SEQ file speed up? :-) What I'm getting at is that
1541/71 and SFD-1001 aren't good drives to use when studying byte-at-a-time
transfer overhead.  That's because the DOS in those drives only buffers a
sector at a time, which forces it to use an interleave scheme.  It's
possible to get around this (and Super Snapshot V4 does, for LOAD only), but I
haven't seen an implementation yet that attempts to do this for SEQ file
accesses.  And since the transfer times involved in this performance range
(about 4-5 secs for 100 blocks, far beyond standard CBM IEEE) are less than the
overhead for byte-at-a-time transfers, you wouldn't be able to get close to
LOAD speedup for SEQ accesses.  Here's my speedup summary:
(by "State of the Art" I mean 1541 interleave INDEPENDENT serial fast
loaders and C128 burst mode with optimal interleave, NOT IEEE.)

				std blocks	non-std blocks
A. "State of the Art"  LOAD	 12-15x		 20-25x
B. not yet attempted,
   "State of the Art"  READ	 6-7x (guess)	 8-9x (guess)
C. Classical Fast I/O  LOAD	 5-6x		 n.a.
   e.g. Epyx FastLoad
   (interleave = 10)
D. Classical fast I/O  READ	 3-4x (guess)	 n.a.
E. Standard	LOAD & READ	 1x		 n.a.

The IEEE interfaces that various people have discussed so far probably fit
in with "C".  This is only because they are using the standard load routine
with faster ACPTR (to get any speedup they would have to SAVE with a tighter
interleave or execute custom LOAD routines within the IEEE device, but I
doubt that any of the IEEE owners on the net would want to have anything to
do with this :-) ).

A good way to see byte-at-a-time overhead is to use RAMDOS or a 1581, which
buffers half a track (one physical cylinder, not a full logical track).

>Unfortunately I couldn't see if the Super Snapshot was faster than the
>Epyx or fastboot product. My brother sold ours because it was ...
>... "it wasn't any faster than the fastload cartridge" (his words,

SS V1 and SS V2 were "classical" fast loader implementations, so the speed
was only marginally faster than Epyx FastLoad (5.5x vs. 5x).  The actual
transfer routines were significantly faster, but the 10 sector interleave of
the 1541 limited all these products to the same speed range.  The marginal
speedup came from significantly faster head stepping routines.  You could
SAVE at a different interleave to get some extra speedup, but it wasn't
usually worth the trouble.

SS V3 and SS V4 use a MUCH faster interleave independent technique.  The
speedup over Epyx FastLoad and similar products is very noticeable.

>Some trivia: the main difference between LOAD'ing (burst mode) and
>READ'ing (fastmode) on the 1571 is that fast mode negotiates a
>transaction for each byte, while burst mode negotiates on a per-block
>basis. Burst mode is unique in that manner -- even the IEEE drives
>negotiate on a per-byte basis (probably why they're slower than burst
>mode, despite fairly equivalent hardware). 

I agree, per-block is the only way to get great speed.  You have probably
noticed that the Burst mode examples in the 1571 user's manual avoid using
subroutine calls to get each byte as it arrives.  With transfer rates in the
range used by Burst mode, this could slow you down.  However, it turns out
that there's a fair bit of time to waste at the bit rate that CBM decided to
use for Burst mode.

>Some other trivia: Using ACPTR should be faster than using GETIN, if
>subroutine overhead is as big a problem as some hint. GETIN has to
>do all sorts of testing to see where to dispatch to -- is it keyboard,
>or is it disk? This overhead should be noticible when compared to
>LOAD, which calls ACPTR directly. But for both the IEEE drives and the
>1541, there was no significant difference between LOAD and GETIN
>times, implying that transfer speed, and not internal Kernal overhead,
>was the limitation. 

Again, this just shows how slow the standard ACPTR routine is, and how
important interleave limitations are no matter how fast the transfer
protocol is.  Once you have overcome the limitations of interleave, either
by buffering whole tracks or by doing other nasty manipulations :-), the
real speed of the transfer protocol can really shine.  After all, IEEE
interfaces should be capable of much faster transfers.

For those who don't believe that interleave is as important as I've said,
try the following:  Create a file on a 1541, then compare the time required
to LOAD it using a classical fastloader like Epyx FastLoad with the time
required to SCRATCH the file.  You should get the same results from a C128
with a 1571 (using burst mode vs. SCRATCH).  SCRATCH has to follow the chain
of sectors that are used in the file.  Since the only transfers involved in
SCRATCH are internal, all the time is spent following the 10 sector
interleaved chain (6 if you're using a 1571).

I think this posting was already too long about half way through! :-)


izot@f171.n221.z1.FIDONET.ORG (Geoffrey Welsh) (02/15/89)

 > From: elg@killer.DALLAS.TX.US (Eric Green)
 > Message-ID: <7143@killer.DALLAS.TX.US>
 
Eric:
 
   Your benchmarks are interesting and informative, but I'd like to point out
something you said which is mildly misleading:
 
 > Some trivia: the main difference between LOAD'ing (burst mode) and
 > READ'ing (fastmode) on the 1571 is that fast mode negotiates a
 > transaction for each byte, while burst mode negotiates on a per-block
 > basis. Burst mode is unique in that manner -- even the IEEE drives
 > negotiate on a per-byte basis (probably why they're slower than burst
 > mode, despite fairly equivalent hardware).
 
   On true (parallel) IEEE drives, there is nothing to negotiate. While burst 
mode transactions on a 1571 or 1581 do have to be set up in slow (i.e. 1541 
speed) mode, there is no need for such arrangements on the parallel drive... 
in fact, the parallel drives have lower overheads because the handshake 
suffices for inter-byte holdoffs.
 
   Furthermore, the IEEE drives do NOT have "fairly equivalent hardware"... 
burst mode handshaking is done in hardware (at a rate dependent on gate 
timings, but the gates operate at less than 50 nanoseconds each), while IEEE 
handshaking is done in software, at a rate limited by instructions taking two 
to six clock cycles each, at 1,000 nanoseconds per cycle.
 
   Given hardware handshaking and DMA, some IEEE-488 bus instruments achieve 
data transfer rates in the hundreds of thousands of bytes per second.
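To put rough numbers on that scale difference (the per-byte costs below are
hypothetical figures of my own, chosen only to illustrate the orders of
magnitude Geoffrey describes):

```python
# Illustrative throughput bounds from per-byte handshake cost.  The
# figures are assumptions, not measurements: a software handshake built
# from 2-6 cycle instructions at 1 MHz plausibly costs ~20 us/byte,
# while a hardware handshake with DMA might cost ~2 us/byte.

def throughput(seconds_per_byte):
    """Bytes per second given the per-byte handshake cost."""
    return 1.0 / seconds_per_byte

software_bps = throughput(20e-6)  # ~50,000 bytes/sec
hardware_bps = throughput(2e-6)   # ~500,000 bytes/sec
```

which lands the hardware-handshake case squarely in the "hundreds of
thousands of bytes per second" range.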
 
 > Some other trivia: Using ACPTR should be faster than using GETIN, if
 > subroutine overhead is as big a problem as some hint. GETIN has to
 > do all sorts of testing to see where to dispatch to -- is it keyboard,
 > or is it disk? This overhead should be noticeable when compared to
 > LOAD, which calls ACPTR directly. But for both the IEEE drives and the
 > 1541, there was no significant difference between LOAD and GETIN
 > times, implying that transfer speed, and not internal Kernal overhead,
 > was the limitation.
 
   The subroutine overhead, as they say, is a drop in the bucket. Using GETIN 
will be slower than using ACPTR, but only marginally.
 
===========================================================================
Internet:  Geoffrey.Welsh@f171.n221.z1.fidonet.org | 66 Mooregate Crescent
Usenet:    watmath!isishq!izot                     | Suite 602
FidoNet:   Geoffrey Welsh on 1:221/171             | Kitchener, Ontario
PunterNet: 7/Geoffrey Welsh                        | N2M 5E6 CANADA
BBS:       (519) 742-8939 24h 7d 300/1200/2400bps  | (519) 741-9553
===========================================================================
|  "I don't need a disclaimer. No one pays any attention to what I say."  |
===========================================================================
 


--  
 Geoffrey Welsh - via FidoNet node 1:221/162
     UUCP: ...!watmath!isishq!171!izot
 Internet: izot@f171.n221.z1.FIDONET.ORG

seeley@dalcsug.UUCP (Geoff Seeley) (02/15/89)

In article <7143@killer.DALLAS.TX.US>, elg@killer.DALLAS.TX.US (Eric Green) writes:
< 64 mode, with Epyx fastload cart. --
<                        LOAD 26 seconds
<          with Mike J. Henry's "fastboot v2": 26 seconds
< Unfortunately I couldn't see if the Super Snapshot was faster than the
< Epyx or fastboot product. My brother sold ours because it was
< incompatible with his C-64 (a very early production model), and
< because "it wasn't any faster than the fastload cartridge" (his words,
< not mine -- I never even used the darn thing).

I have just recently bought Super Snapshot v4, and after using the Epyx
fastload for 2 or 3 years, I can safely say that the Super Snapshot loads
much faster than the fastload. It will even save faster, something which the
fastload didn't do. I don't know what SS version you had, but the latest
version (v4) is one of the best Commodore utilities I have seen.
-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Geoff Seeley          UUCP: dalcsug!seeley  |   Why the hell didn't they have
Dalhousie University  BITN: csay0026@dalac  | ``Teenage Mutant Ninja Turtles''
Halifax, Nova Scotia  BEST: The local bar.  |         when I was a kid?
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

jgreco@csd4.milw.wisc.edu (Joe Greco) (02/15/89)

In comp.sys.cbm article <7143@killer.DALLAS.TX.US>, elg@killer.Dallas.TX.US (Eric Green) wrote:
]My basic thought was that sequential file access can take place just
]as fast as LOAD'ing. The benchmark confirms that for IEEE drives and

Why shouldn't it?  Same routines.  LOAD simply calls ACPTR in a loop.

]the standard 1541. There's a couple of constraints here. First of all,
]doing TALK and UNTALK (chkin/clrchn) for each sequentially-read byte
]is extremely slow. When you do chkin or clrchn, each call sends a
]command byte out to the drive. So doing fast sequential access means
]buffering your sequential file data anyhow (e.g. a simple filter

Or, if you are not performing any output operations on the serial bus,
simply do not chkin/clrchn.  I use that technique fairly often.

]program would be best off reading in 256 bytes, filtering them, then
]writing them to the output file, instead of doing
]clrchn/chkin/clrchn/chkout for each individual byte -- a 4-to-1

I once devised a very short ML copy routine for an ancient (1985?)
release of my BBS program.  It worked on a byte-by-byte basis, and was
nearly as slow as the BASIC loop it had replaced.  I use buffering
techniques constantly now (and not just for disk I/O).

]Some other trivia: Using ACPTR should be faster than using GETIN, if
]subroutine overhead is as big a problem as some hint. GETIN has to
]do all sorts of testing to see where to dispatch to -- is it keyboard,
]or is it disk? This overhead should be noticeable when compared to
]LOAD, which calls ACPTR directly. But for both the IEEE drives and the
]1541, there was no significant difference between LOAD and GETIN
]times, implying that transfer speed, and not internal Kernal overhead,
]was the limitation. 

Using non-vectored Kernal calls is bad form.  The slight, SLIGHT
increase in overhead is not usually noticeable, especially when
working with disk.
--
jgreco@csd4.milw.wisc.edu		Joe Greco at FidoNet 1:154/200
USnail: 9905 W Montana Ave			     PunterNet Node 30 or 31
	West Allis, WI  53227-3329	"These aren't anybody's opinions."
Voice:	414/321-6184			Data: 414/321-9287 (Happy Hacker's BBS)

jgreco@csd4.milw.wisc.edu (Joe Greco) (02/15/89)

In comp.sys.cbm article <1606.23F8D15A@isishq.FIDONET.ORG>, izot@f171.n221.z1.FIDONET.ORG (Geoffrey Welsh) wrote:
]
] > From: jgreco@csd4.milw.wisc.edu (Joe Greco)
] > Message-ID: <955@csd4.milw.wisc.edu>
] > Device(s)               Time    Filesize CPS    %speed
] > ----------------------- ------- -------- ------ ------
] > Regular 1541/C64        01:35.2    33453    351    100
] > C64/BusCard II/8050     00:31.8    30234    952    271
] > C64/Custom MSD/8050     00:17.3    30234   1745    497
] > C64/RAMDOS 3.2/1750     00:12.2    38760   3188    908
] 
]   Add to the list (result from my memory):
] 
]   C128/C64-Link II/SFD                       1900

Relative comparisons between my MSD interface and a Link I showed the
MSD to be faster.  And as I recall, the Link I was a bit faster than
the Link II.  Ahhhh.... well....

]   HyperPET/D9060                             Bloody fast - I'll get specs!
] 
]   The C128 was running at 2 MHz.
] 
]   The "HyperPET" is a 4 MHz 4032.

Wish I had a HyperPET.  Of course, I also wish I had a D9090.  Then
again, I wish... oops better not start on that line again.

]   The IEEE-488-1979 spec says that the data transfer rate shall not exceed 1 
]megabyte per second, but the handshake is designed to slow the transfers down 
]to the slowest selected device on the bus. Since it takes several 1 MHz clock 
]cycles to program the I/O chips to send the handshake signals, megabyte per 
]second speeds are way out of the question.

That's what I meant (the spec itself)....  I realize, of course, that
such speeds are NOT possible at these clock rates.

]   There is also the question of how quickly the data can be "lifted" from the 
]disk. Even with most IEEE drives' 2-processor design, there is a severe limit 
]to the speed with which the data can be put on the bus.

It would be nice for a hard disk!  grin grin grin

]   Nevertheless, some sort of automated hardware handshake and more tightly 
]coded ROMs in the drive would lead to vastly improved performance.

And more tightly coded software on the computer.
--
jgreco@csd4.milw.wisc.edu		Joe Greco at FidoNet 1:154/200
USnail: 9905 W Montana Ave			     PunterNet Node 30 or 31
	West Allis, WI  53227-3329	"These aren't anybody's opinions."
Voice:	414/321-6184			Data: 414/321-9287 (Happy Hacker's BBS)

janhen@wn2.sci.kun.nl (Jan Hendrikx) (02/16/89)

In article <89Feb14.171816est.2394@godzilla.eecg.toronto.edu>, leblanc@eecg.toronto.edu (Marcel LeBlanc) writes:
>                     That's because the dos in those drives only buffers a
> sector at a time, which forces it to use an interleave scheme.

That is not true. 1541 DOS does do read-ahead when there are enough
free buffers. When a new buffer is needed, and all are occupied,
one of the read-ahead buffers is discarded.

Source: Inside Commodore DOS, and a ROM disassembly.

> Marcel A. LeBlanc	  | University of Toronto -- Toronto, Canada

-Olaf Seibert

leblanc@eecg.toronto.edu (Marcel LeBlanc) (02/18/89)

In article <335@wn2.sci.kun.nl> janhen@wn2.sci.kun.nl (Jan Hendrikx) writes:
>In article <89Feb14.171816est.2394@godzilla.eecg.toronto.edu>, leblanc@eecg.toronto.edu (Marcel LeBlanc) writes:
>>                     That's because the dos in those drives only buffers a
>> sector at a time, which forces it to use an interleave scheme.
>
>That is not true. 1541 DOS does do read-ahead when there are enough
>free buffers. When a new buffer is needed, and all are occupied,
>one of the read-ahead buffers is discarded.
>
>Source: Inside Commodore DOS, and a ROM disassembly.

I don't remember under what situations the DOS will do read-ahead, but the
point was that since the DOS follows the interleave chain, it won't send the
file any faster than the interleave allows, no matter what sort of interface
you are using.  In the case of a 1541, following the interleave chain will
only get you about a 5-6x speedup with the standard 10 sector interleave.
If this problem isn't addressed, using faster hardware buys you nothing.

Since you brought it up, under what situations will the DOS do read-ahead?

Marcel A. LeBlanc	  | University of Toronto -- Toronto, Canada
leblanc@eecg.toronto.edu  | also: LMS Technologies Ltd, Fredericton, NB, Canada
-------------------------------------------------------------------------------
UUCP:	uunet!utai!eecg!leblanc    BITNET: leblanc@eecg.utoronto (may work)
ARPA:	leblanc%eecg.toronto.edu@relay.cs.net  CDNNET: <...>.toronto.cdn

janhen@wn2.sci.kun.nl (Jan Hendrikx) (02/19/89)

In article <89Feb18.010253est.2384@godzilla.eecg.toronto.edu>, leblanc@eecg.toronto.edu (Marcel LeBlanc) writes:
> I don't remember under what situations the DOS will do read-ahead, but the
> point was that since the DOS follows the interleave chain, it won't send the
> file any faster than the interleave allows, no matter what sort of interface
> you are using. 

That was not the point that I was trying to make. What you say is of
course true. I just wanted to inform the person who thought the DOS
does no read-ahead at all.

> Since you brought it up, under what situations will the DOS do read-ahead?

As far as I remember without my references around, (I don't have a 64
anymore for over two years now), the algorithm is about the following:

If a file is opened for sequential reading, an 'active' buffer is
allocated and filled from the disk.  Active means that the next byte
the computer requests must come from that buffer.  Also, an 'inactive'
buffer is allocated, if possible.  That buffer is filled from disk
asynchronously.  When the active buffer is emptied by read requests
from the computer, the buffers switch roles, active becoming inactive
and vice versa.  Of course, if the inactive buffer has not yet finished
reading from disk, the computer must wait.

There is a routine which takes a buffer number and sets a pointer to
that buffer somewhere. It maintains a Least Recently Used stack, based
on the requests it gets.

Whenever a buffer is needed but no unused buffer is available, an
inactive buffer is found via the LRU stack.  The idea is that the
buffer that has gone unused the longest is unlikely to be needed again
soon, so it is 'safe' to use it for something else.

The routine that switches buffers knows that inactive buffers may
disappear. If they do, it does something reasonable. I am not sure if
it just reads the next block into the one buffer that is left, or if it
first tries to get a new second buffer. In any case, the buffer that
was read-ahead must be re-read from the disk.

Relative files have two active buffers: one for a side-sector, and one
for actual data. I am not sure if it also reads ahead.

Some other things I would need to look up are whether files open for
writing also have an inactive (write-behind) buffer, and whether
(trying to) allocate an inactive buffer may cause another inactive
buffer to be discarded.

As you can see, the operating system in the drive is considerably more
complex than that in the computer...
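The scheme Olaf describes is classic double buffering.  A toy model (my
reconstruction of the prose above, not a transcription of the 1541 ROM):

```python
# Toy model of double-buffered read-ahead as described above: the
# 'active' buffer feeds the computer while the 'inactive' buffer holds
# the prefetched next block.  Reconstructed from the description, not
# from the actual DOS code.

class ReadAheadFile:
    def __init__(self, blocks):
        self._chain = list(blocks)       # stand-in for the sector chain
        self.active = self._fetch()      # block currently being consumed
        self.inactive = self._fetch()    # read-ahead block (may be None)
        self.pos = 0

    def _fetch(self):
        # Stand-in for an asynchronous disk read of the next block.
        return self._chain.pop(0) if self._chain else None

    def read_byte(self):
        if self.active is None:
            return None                   # end of file
        b = self.active[self.pos]
        self.pos += 1
        if self.pos == len(self.active):  # active buffer emptied:
            self.active = self.inactive   # swap in the read-ahead block
            self.inactive = self._fetch() # and start the next prefetch
            self.pos = 0
        return b
```

In the real drive the `_fetch` overlaps with the computer draining the active
buffer, which is where the speedup comes from; the model only shows the
buffer-swapping logic.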

> Marcel A. LeBlanc	  | University of Toronto -- Toronto, Canada
> leblanc@eecg.toronto.edu  | also: LMS Technologies Ltd, Fredericton, NB, Canada
-Olaf Seibert

janhen@wn2.sci.kun.nl (Jan Hendrikx) (02/20/89)

In article <338@wn2.sci.kun.nl>, I wrote:
> As far as I remember without my references around, (I don't have a 64
> anymore for over two years now), the algorithm is about the following:

To be complete, I now have an Amiga and a 64 emulator :-)

> There is a routine which takes a buffer number and sets a pointer to
> that buffer somewhere. It maintains a Least Recently Used stack, based
> on the requests it gets.

After looking at the disassembly for some time, I found the following:
The LRU table is not a table of disk buffers, but of Logical INDeXes.
That is something like an internal file number for the disk. Every time
a block is read or written through a LINDX, the LRU is updated.

> Some other things I would need to look up are whether files open for
> writing also have an inactive (write-behind) buffer, and whether
> (trying to) allocate an inactive buffer may cause another inactive
> buffer to be discarded.

Both of these are true. It even appears that when switching to the
other (inactive) buffer, there is _always_ a need for a (temporary)
second buffer. If no such buffer can be found (or stolen), you get an
error #70, no channel. I have not found, however, any guarantee that
such a temporary buffer "almost always" will be available. So maybe one
of you netters can go out and write a program that produces error #70
when you are just writing to (or reading from) an ordinary file...
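The stealing behaviour is easy to model.  A toy sketch (hypothetical,
following the description in these posts rather than the ROM disassembly) of
LRU-based buffer stealing and the resulting error #70:

```python
# Toy model of LRU buffer stealing, reconstructed from the description
# above (not the ROM code): requests touch a LINDX (internal file
# number); when the pool is full, the least-recently-used stealable
# channel loses its read-ahead buffer, and "70, NO CHANNEL" results
# when nothing can be stolen.

class BufferPool:
    def __init__(self, nbuffers):
        self.free = nbuffers
        self.lru = []                     # LINDXes, most recent last

    def touch(self, lindx):
        if lindx in self.lru:
            self.lru.remove(lindx)
        self.lru.append(lindx)            # mark as most recently used

    def allocate(self, lindx, stealable):
        """Get a buffer for lindx, stealing the LRU read-ahead buffer
        from the `stealable` set if the pool is exhausted."""
        self.touch(lindx)
        if self.free > 0:
            self.free -= 1
            return True
        for victim in self.lru:           # oldest first
            if victim != lindx and victim in stealable:
                stealable.remove(victim)  # its read-ahead must be redone
                return True
        raise RuntimeError("70, NO CHANNEL")
```

With two buffers, a third channel can still allocate by stealing a read-ahead
buffer, but once nothing is stealable the "70, NO CHANNEL" case falls out,
just as suggested above.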

-Olaf Seibert