torkil@Pacesetter.COM (Torkil Hammer) (09/21/90)
Everything can be timed, including CPU speed, memory speed and prefetch queue. The trick is to do incremental measurements.

Suppose you have written the routine that reads the system timer's low bits (in Turbo C it is outportb(0x43,0x00); low byte = inportb(0x40); high byte = inportb(0x40); but you have to write it in assembler to really know what you get). Each bit increment corresponds to about 1 microsecond (actually 1/1.19318 usec), but don't be surprised if you never see an odd count. It counts down, by the way.

Now you put 2 of these reads back to back. The difference in readings is the time it took to read the clock, roughly inversely proportional to the CPU speed. On a 16 MHz machine, that time should be about 16 clicks or 13 usec, depending on the actual assembler code. Of course, you should first wait until clock() has just incremented, then do the 2 readings, so you don't get interrupted in the middle. Actually, do several measurements and screen out the bad ones.

In between the two reads you now sandwich something like [mov cx,100; rep nop], and the difference from before is what these instructions took. Now replace 100 with 200, and the incremental difference is the exact time it takes to do 100 prefetched nops on location, a number that depends only on the CPU speed (such as 5 CPU clock cycles per rep nop). Replace 100 with 10100 for greater accuracy. Also, replace rep nop (5 cycles) with rep mov bx,cx (4 cycles) for an independent check. Rotates of varying count are another way to get an incremental time that depends only on CPU speed (1 incremental CPU clock per incremental rotate per repeat; that only works on the 80188 and up, though on the 8088 you can write a loop that switches cx between loop count and rotate count).

Once you have the exact CPU speed, you get the memory wait cycles by switching between a register operation (mov bx,ax) and a memory operation (mov [si],bx) inside the rep and measuring the incremental time. You have to look up the cycles required for each operation.

You should get the prefetch queue size from doing a backwards loop and expanding the scope of it with nop's. Once you fall over the edge, the time should go up by more than the CPU cycles in the nop's. (I haven't tried this.) After this you can play with 8, 16 and 32 bit instructions to tell the bus time and then the size of the bus.

Enjoy
Torkil Hammer
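The arithmetic behind the incremental method can be sketched in C roughly as follows. This is only an illustration, not code from the post: the function names (elapsed_ticks, ticks_to_usec, usec_per_iteration, cpu_mhz) are hypothetical, the tick counts are passed in rather than read from the hardware, and it assumes a 16-bit down counter clocked at 1.19318 MHz that wraps at most once between the two readings.

```c
#include <stdint.h>

/* Timer input clock in Hz, from the 1.19318 MHz figure quoted above. */
#define PIT_HZ 1193180.0

/* Ticks elapsed between two readings of a 16-bit counter that counts
 * DOWN.  Truncating to 16 bits handles a single wrap past zero. */
uint16_t elapsed_ticks(uint16_t first, uint16_t second)
{
    return (uint16_t)(first - second);
}

/* Convert a tick count to microseconds. */
double ticks_to_usec(double ticks)
{
    return ticks * 1.0e6 / PIT_HZ;
}

/* Incremental cost of one repetition.  Two runs are timed that differ
 * only in repeat count, so the fixed overhead (reading the clock,
 * loading cx) cancels out in the subtraction. */
double usec_per_iteration(uint16_t ticks_short, unsigned reps_short,
                          uint16_t ticks_long, unsigned reps_long)
{
    double extra_ticks = (double)ticks_long - (double)ticks_short;
    double extra_reps = (double)reps_long - (double)reps_short;
    return ticks_to_usec(extra_ticks) / extra_reps;
}

/* CPU clock in MHz, given the cycle count looked up for one repetition
 * (an assumption the caller has to supply). */
double cpu_mhz(double usec_per_iter, double cycles_per_iter)
{
    return cycles_per_iter / usec_per_iter;
}
```

For instance, if the 100-repetition run takes 50 ticks and the 200-repetition run takes 87, the extra 100 repetitions cost 37 ticks, or about 0.31 usec each; at 5 cycles per repetition that works out to roughly 16 MHz.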
james@bigtex.cactus.org (James Van Artsdalen) (09/25/90)
In <1990Sep20.231706.27009@Pacesetter.COM>, torkil@Pacesetter.COM wrote:

> Everything can be timed, including CPU speed, memory speed and prefetch
> queue.

Well, as a matter of theory, yes, but I haven't seen it done in practice yet. The Dell 325D, for example, has a cache with a write buffer, and page mode DRAM that is interleaved and refreshed. All of these are hard to account for. A 486 is a bit harder yet because you have several write buffers. Other effects: cache line size, cache line fill order, interleave bit, page size...

The measurable parameters are: cache hit, cache miss page hit, cache miss page miss, page size, interleave bit, RAS precharge on the "other" interleaved SIMM on a page miss, refresh pulse width, refresh frequency, and a bunch of other things I'm sure - and we haven't covered any cycle that misses the system board RAM.

> The trick is to do incremental measurements.  Suppose you have written
> the routine that measures the system timer's low bits (in Turbo C it is
> outportb(0x43,0x00); low byte = inportb(0x40); high byte = inportb(0x40); -
> but you have to write it in assembler to really know what you get).
> Each bit increment corresponds to about 1 microsecond (actually 1/1.19318)
> but don't be surprised if you never see any odd bit count.  It counts
> down, by the way.

If you don't see an odd bit count, there is a bug - you're probably seeing two microseconds per tick or worse.

It is very hard to write a microsecond timer routine that in fact works. Even when you have one, you have to account for external variables such as refresh. It is very important that such a routine be validated, or else you'll get another magazine-quality benchmark that can't return the same answer twice.

Below is a routine I've derived experimentally. It returns a 32 bit value. It does not work near midnight. There's also the test program. Don't even think of modifying this without running the test program for a few hours.
There are several different tests: the desired one is selected by choosing which #elif to enable (yes, I know it's gross but the file grew that way). All of the assembly routines after gettime() are just test helpers.

On my Dell 386/16, this routine is consistent within 1 microtick after allowing for IRQ 0. You probably won't get quite this good a result, since the 386/16 is unique in having true zero wait state RAM (no DRAM in system). If you use it, you have to allow for IRQ 0 and midnight, and make sure that the results are repeatable (nontrivial with CGA).

        page    59,130
        .386p

DEBUG           equ     1

PIC0            equ     20h             ;8259 Interrupt controller 0
PIC0MASK        equ     21h             ;8259 Interrupt controller 0 mask
TIMER           equ     40h             ;8254 timer counter 0 address
MAGIC           equ     80h             ;Magic stone - used for 1us delays
PIC1            equ     0a0h            ;8259 Interrupt controller 1
PIC1MASK        equ     0a1h            ;8259 Interrupt controller 1 mask

WAFORIO macro
        out     MAGIC,al
        endm

A_BLOCK segment use16 at 0a000h
        public  _vga
_vga            equ     $
A_BLOCK ends

B_BLOCK segment use16 at 0b000h
        public  _monochrome, _cga
_monochrome     db      80 * 25 * 2 dup (?)
        org     8000h
_cga            db      80 * 25 * 2 dup (?)
B_BLOCK ends

ROMDAT  segment use16 at 40h
        org     06ch
        public  _bios_timer, _bios_timer_low, _bios_timer_high
        public  _bios_timer_overflow
_bios_timer             equ     $
_bios_timer_low         dw      ?
_bios_timer_high        dw      ?
_bios_timer_overflow    db      ?
ROMDAT  ends

_TEXT   SEGMENT use16 WORD PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT use16 WORD PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT use16 WORD PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT use16 WORD PUBLIC 'BSS'
        extrn   _pending_int:word
        extrn   _first_read:word
        extrn   _second_read:word
        extrn   _first_timer:byte
        extrn   _second_timer:byte
_BSS    ENDS
DGROUP  GROUP   CONST, _BSS, _DATA
        ASSUME  CS: _TEXT, DS: DGROUP, SS: DGROUP

_TEXT   segment

;------------------------------------------------------------------------------
; Strategy:
;
; Read the 8254 timer/counter.  Read the 17th bit also.  Concatenate
; with the BIOS timer tick count.  Since the 8254 counts down, but the
; BIOS counts up, invert all 17 bits from the 8254 so that it counts
; up too.
;
; There are two special cases.  The first is when the timer ticked over
; "recently".  The interrupt to update the BIOS count may or may not
; have occurred.  So if the timer wrapped recently, check to see if
; there is a pending interrupt.  If so, the BIOS count was not updated,
; so update it "manually".  If the 8254 didn't tick "recently", don't
; update the BIOS counter.
;
; The other special case is when the 8254 returns exactly 0.  It is
; apparently not possible to determine if the interrupt has or hasn't
; occurred, or if the 17th bit is correct.  So, the 8254 must be read
; a second time (we know that this second time will not return 0
; obviously, and so will not be the same special case again).  The
; second 8254 read returns a "wrapped" or "correct" value.  If the
; "wrapped" value has the same BIOS count as the "0" read did, then
; the "0" read didn't get the right 17th bit.
;
; A couple of special attributes to this code.  The routine always
; takes exactly the same amount of time to run.  There is no
; variability in calling gettime(): it is guaranteed to be constant
; time.  This is important if a benchmark calls it often.  Second, the
; same read from the 8254 is returned each time.  The second read
; decides how to update the first read, but the 8254 value read the
; second time does not replace the first value (the second will be
; several counts later than the first).  This limits or eliminates any
; variability in terms of when within gettime() the 8254 is read: it
; is guaranteed that if gettime() is called N times, all N calls will
; measure the same duration, within two microseconds (subject to
; external interference).
;
; When using or modifying this, remember that accuracy here isn't the
; last word.  Memory refresh, DMA cycles & BIOS timer INT overhead are
; substantial compared to the accuracy of gettime().  gettime() does
; not enable interrupts if they were previously disabled, so an
; application may disable interrupts (and take care of the BIOS timer
; itself).
;
; Handle the BIOS midnight timer count reset!!!

        public  _gettime
_gettime proc   near
        pushf
        push    bx
        mov     al,0c2h                 ; latch counter value
        cli
        out     43h,al
        WAFORIO
        in      al,TIMER                ; OUT pin status
ifdef DEBUG
        mov     _first_timer,al
endif
        add     al,al                   ; Save OUT in carry flag
        WAFORIO
        in      al,TIMER                ; low order timer count
        mov     ah,al
        WAFORIO
        in      al,TIMER                ; high order timer count
        xchg    ah,al
        not     ax
ifdef DEBUG
        mov     _first_read,ax
endif
        mov     dx,ax                   ; Save "raw" count read (inverted)
        cmc                             ; Put !OUT in AX:15
        rcr     ax,1
        mov     bx,ax                   ; Save the "official" count
        cmp     dx,-1                   ; CL == 1 iff the timer is ticking
        sete    cl                      ; over right now
        rol     bx,cl                   ; If ticking over, will need BX
                                        ; rotated later
        WAFORIO                         ; These seem important
        WAFORIO
        WAFORIO
        WAFORIO
        mov     al,0c2h                 ; latch counter value
        out     43h,al
        WAFORIO
        in      al,TIMER                ; OUT pin status
ifdef DEBUG
        mov     _second_timer,al
endif
        add     al,al                   ; Save OUT in carry flag
        WAFORIO
        in      al,TIMER                ; low order timer count
ifdef DEBUG
        mov     ah,al
endif
        WAFORIO
        in      al,TIMER                ; high order timer count
ifdef DEBUG
        xchg    ah,al
        not     ax
        mov     _second_read,ax
endif
        cmc
        rcr     bx,cl                   ; Put carry flag (second !OUT) into
                                        ; BX:15.  BX already shifted in
                                        ; this case.
        shl     cl,4                    ; 16 (if timer ticked over) or 0
ifdef DEBUG
        mov     byte ptr _pending_int+1,cl
endif
        shld    dx,bx,1                 ; If the "raw" value read was exactly
        add     bx,bx
        shl     bx,cl                   ; ffffh, then clear BX:[0-14]
        shrd    bx,dx,1                 ; OK after here
        mov     ax,bx                   ; If BX == ffffh or BX < 3fffh
        inc     ax                      ; then if there is a pending interrupt
        and     ax,7fffh                ; add it to what the BIOS count is
        cmp     ax,4000h                ; since BIOS will update its count
        setae   cl                      ; as soon as we do the POPF
        shl     cl,4                    ; 16 (BX in range) or 0
        mov     al,0ah
        out     PIC0,al
        in      al,PIC0
        and     ax,1                    ; See if IRQ 0 is requesting
        mov     byte ptr _pending_int,al
        shl     ax,cl                   ; 1 (if BX in range) or 0
        mov     dx,ax
        mov     cx,es
        mov     ax,ROMDAT
        mov     es,ax
        assume  es:ROMDAT
        add     dx,_bios_timer_low
        mov     es,cx
        assume  es:nothing
        mov     ax,bx
        pop     bx
        popf                            ;Potentially enable interrupts
        ret
_gettime endp

;------------------------------------------------------------------------------

        public  _poke_bytes
_poke_bytes proc near
        push    bp
        mov     bp,sp
        push    di
        push    es
        mov     dx,10[bp]               ;Count
        mov     al,12[bp]               ;Byte to write
poke_bytes_loop:
        les     di,4[bp]
        mov     cx,8[bp]
        rep     stosb
        dec     dx
        jnz     poke_bytes_loop
        pop     es
        pop     di
        pop     bp
        ret
_poke_bytes endp

        public  _poke_words
_poke_words proc near
        push    bp
        mov     bp,sp
        push    di
        push    es
        mov     dx,10[bp]               ;Count
        mov     ax,12[bp]               ;Word to write
poke_words_loop:
        les     di,4[bp]
        mov     cx,8[bp]
        rep     stosw
        dec     dx
        jnz     poke_words_loop
        pop     es
        pop     di
        pop     bp
        ret
_poke_words endp

        public  _peek_bytes
_peek_bytes proc near
        push    bp
        mov     bp,sp
        push    di
        push    es
        mov     dx,10[bp]               ;Count
        mov     al,12[bp]               ;Byte to read
peek_bytes_loop:
        les     di,4[bp]
        mov     cx,8[bp]
        repne   scasb
        je      peek_bytes_failed
        dec     dx
        jnz     peek_bytes_loop
        mov     ax,0                    ;No error
peek_bytes_exit:
        pop     es
        pop     di
        pop     bp
        ret
peek_bytes_failed:
        mov     ax,1
        jmp     peek_bytes_exit
_peek_bytes endp

        public  _peek_words
_peek_words proc near
        push    bp
        mov     bp,sp
        push    di
        push    es
        mov     dx,10[bp]               ;Count
        mov     ax,12[bp]               ;Word to read
peek_words_loop:
        les     di,4[bp]
        mov     cx,8[bp]
        repne   scasw
        je      peek_words_failed
        dec     dx
        jnz     peek_words_loop
        mov     ax,0                    ;No error
peek_words_exit:
        pop     es
        pop     di
        pop     bp
        ret
peek_words_failed:
        mov     ax,1
        jmp     peek_words_exit
_peek_words endp

        public  _blit_bytes
_blit_bytes proc near
        push    bp
        mov     bp,sp
        push    si
        push    di
        push    ds
        push    es
        mov     dx,10[bp]               ;Count
        mov     al,12[bp]               ;Byte to write
blit_bytes_loop:
        les     di,4[bp]
        lds     si,4[bp]
        add     si,160
        mov     cx,8[bp]
        rep     movsb
        dec     dx
        jnz     blit_bytes_loop
        pop     es
        pop     ds
        pop     di
        pop     si
        pop     bp
        ret
_blit_bytes endp

        public  _blit_words
_blit_words proc near
        push    bp
        mov     bp,sp
        push    si
        push    di
        push    ds
        push    es
        mov     dx,10[bp]               ;Count
        mov     al,12[bp]               ;Byte to write
blit_words_loop:
        les     di,4[bp]
        lds     si,4[bp]
        add     si,160
        mov     cx,8[bp]
        rep     movsw
        dec     dx
        jnz     blit_words_loop
        pop     es
        pop     ds
        pop     di
        pop     si
        pop     bp
        ret
_blit_words endp

        public  _get_irq_mask
_get_irq_mask proc near
        in      al,PIC1MASK
        mov     ah,al
        WAFORIO
        in      al,PIC0MASK
        ret
_get_irq_mask endp

        public  _set_irq_mask
_set_irq_mask proc near
        push    bp
        mov     bp,sp
        mov     ax,4[bp]
        out     PIC0MASK,al
        mov     al,ah
        WAFORIO
        out     PIC1MASK,al
        pop     bp
        ret
_set_irq_mask endp

        public  _get_vga_mode
_get_vga_mode proc near
        push    bp
        mov     ah,0fh
        int     10h
        pop     bp
        mov     ah,0
        ret
_get_vga_mode endp

        public  _set_vga_mode
_set_vga_mode proc near
        push    bp
        mov     bp,sp
        mov     ax,4[bp]                ;Get new mode
        pusha
        int     10h
        popa
        pop     bp
        ret
_set_vga_mode endp

        public  _get_cursor
_get_cursor proc near
        push    bp
        mov     ah,3
        int     10h
        pop     bp
        mov     ax,cx
        ret
_get_cursor endp

        public  _set_cursor
_set_cursor proc near
        push    bp
        mov     bp,sp
        mov     cx,4[bp]
        mov     ah,1
        pusha
        int     10h
        popa
        pop     bp
        ret
_set_cursor endp

_TEXT   ends
        end

========================================

#include <stdio.h>
#include <stdlib.h>     /* exit() */

extern volatile unsigned long far bios_timer;
extern volatile unsigned int far bios_timer_low;
extern volatile unsigned int far bios_timer_high;
extern unsigned char far cga[2][80][25];
extern unsigned char far monochrome[2][80][25];
extern unsigned char far vga[256][256];

extern unsigned long gettime(void);
extern void poke_bytes(void far *, int, int);
extern unsigned int get_irq_mask(void);
extern void set_irq_mask(unsigned int);

int pending_int = 0;
char first_timer, second_timer;
int first_read, second_read;

main(argc, argv, envp)
int argc;
char **argv;
char **envp;
{
    char buf[256];
    int n;
    unsigned int e, f, g, h, i, j;
    unsigned long start, end;
    unsigned long a, b, c, d;
    unsigned int irq_mask;

#if 1
    /* Make sure that the timer never returns an unreasonable value.
     * Do this by sampling it twice in quick succession, and making sure
     * that the returned value is not too much larger than the previous
     * value.
     */
    int a1read, a2read, b1read, b2read;
    int a1timer, a2timer, b1timer, b2timer;
    int aint, bint;

    b = gettime();
    bint = pending_int;
    b1timer = (int) first_timer & 0xff;
    b1read = first_read;
    b2timer = (int) second_timer & 0xff;
    b2read = second_read;
    while (1) {
        a = gettime();
        aint = pending_int;
        a1timer = (int) first_timer & 0xff;
        a1read = first_read;
        a2timer = (int) second_timer & 0xff;
        a2read = second_read;
        if (a - b > 128) {
            printf("N %8lx, diff %ld\n", gettime(), a - b);
            printf("b %8lx int %4x, 1read %4x, 1t %2x, 2read %4x, 2t %2x\n",
                   b, bint, b1read, b1timer, b2read, b2timer);
            printf("a %8lx int %4x, 1read %4x, 1t %2x, 2read %4x, 2t %2x\n\n",
                   a, aint, a1read, a1timer, a2read, a2timer);
            b = gettime();
            bint = pending_int;
            b1timer = (int) first_timer & 0xff;
            b1read = first_read;
            b2timer = (int) second_timer & 0xff;
            b2read = second_read;
        } else {
            b = a;
            bint = aint;
            b1timer = a1timer;
            b1read = a1read;
            b2timer = a2timer;
            b2read = a2read;
        }
    }
#elif 0
    /* Make sure that gettime() is returning the low order bit clear
     * some of the time.  Might not happen if the 8254 is read wrong.
     */
    while (1) {
        a = gettime();
        if (!(a & 1)) {
            printf("a %lx\n", a);
            exit(1);
        } /* endif */
    } /* endwhile */
#elif 0
    /* Test CGA screen access time.  This won't give reproducible results
     * until the test is sync'd with the vertical refresh, and even then
     * the timer tick needs to be sync'd too.
     */
    irq_mask = get_irq_mask();
    set_irq_mask(0xfffe);
    /* sync with timer tick */
    for (e = bios_timer_low; e == bios_timer_low; )
        ;
    start = gettime();
    poke_bytes(cga, 256, 1000);
    end = gettime();
    printf("Start %lx\n", start);
    printf("Stop %lx\n", end);
    printf("Took %lu microticks.\n", end - start);
    set_irq_mask(irq_mask);
    exit(0);
#elif 0
    /* Another test routine to make sure that gettime() never returns an
     * unreasonable value.  Print what gettime() got if there is a
     * problem.
     */
    while (1) {
        int a1read, a2read, b1read, b2read;
        int a1timer, a2timer, b1timer, b2timer;
        int aint, bint;

        a = gettime();
        aint = pending_int;
        a1timer = (int) first_timer & 0xff;
        a1read = first_read;
        a2timer = (int) second_timer & 0xff;
        a2read = second_read;
        b = gettime();
        bint = pending_int;
        b1timer = (int) first_timer & 0xff;
        b1read = first_read;
        b2timer = (int) second_timer & 0xff;
        b2read = second_read;
        if (b - a > 600
#if 0
            || ((a1read == 0xffff && a1timer != a2timer)
                || (b1read == 0xffff && b1timer != b2timer))
#endif
            ) {
            printf("N %8lx, diff %ld\n", gettime(), b - a);
            printf("a %8lx int %4x, 1read %4x, 1t %2x, 2read %4x, 2t %2x\n",
                   a, aint, a1read, a1timer, a2read, a2timer);
            printf("b %8lx int %4x, 1read %4x, 1t %2x, 2read %4x, 2t %2x\n\n",
                   b, bint, b1read, b1timer, b2read, b2timer);
            exit(1);
        }
    }
#elif 0
    /* Get a bunch of results and then print them.  This is a good way
     * of seeing that the results are reproducible.  It also shows why
     * you have to account for IRQ 0 timer ticks.
     */
    while (1) {
        unsigned long a[100];
        unsigned long diff;

        for (n = 0; n < 100; n++)
            a[n] = gettime();
        diff = a[1] - a[0];
        for (n = 1; n < 99; n++) {
            if (a[n+1] - a[n] > diff + 1)
                printf("a %lx, b %lx, diff %ld\n",
                       a[n], a[n+1], a[n+1] - a[n]);
        }
    }
#else
    /* Get some samples.  Only keep a sample if it contains a "tick over"
     * event, where the 8254 rolled over.  First wait for the high order
     * part of the 8254 to be FF: that means that a tickover will happen
     * "soon", and we're likely to capture it.  This loop is used so you
     * can visually see that the gettime() results are monotonic and
     * regular across tickovers.
     */
    while (1) {
        unsigned long a[100];

        while ((gettime() & 0xff00) != 0xff00)
            ;
        for (n = 99; n >= 0; n--)
            a[n] = gettime();
        if ((a[0] & 0xffff0000) == (a[99] & 0xffff0000))
            continue;
        for (n = 99; n >= 0; n--)
            printf("%lx\n", a[n]);
        printf("\n");
    }
#endif
    return 0;
}

--
James R. Van Artsdalen    james@bigtex.cactus.org    "Live Free or Die"
Dell Computer Co    9505 Arboretum Blvd Austin TX 78759    512-338-8789
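
The first two checks in the test program above (no unreasonable jump between successive readings, and a low-order bit that is not stuck) can also be applied to a saved array of samples, away from the hardware. What follows is a rough sketch and not part of the posted program: samples_look_valid and max_gap are hypothetical names, the samples are assumed to be successive gettime() results stored oldest first, and the caller picks the threshold (the first test above uses 128).

```c
#include <stddef.h>

/* Returns 1 if the samples look sane, 0 otherwise.
 *
 * A gap larger than max_gap between neighbours is rejected.  Because
 * the subtraction is unsigned, a reading that goes backwards wraps to
 * a huge value and is rejected by the same comparison.
 *
 * Both odd and even readings must appear at least once; a timer that
 * only ever returns one parity is probably being read wrong. */
int samples_look_valid(const unsigned long *t, size_t n,
                       unsigned long max_gap)
{
    int saw_odd = 0;
    int saw_even = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (t[i] & 1UL)
            saw_odd = 1;
        else
            saw_even = 1;
        if (i > 0 && t[i] - t[i - 1] > max_gap)
            return 0;
    }
    return saw_odd && saw_even;
}
```

Collecting the samples first and checking them afterwards keeps printf() out of the timed path, in the same spirit as the array-based tests above.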