adc #0 is the obvious choice not the weird choice IMO.
Except then you shouldn't call clc before the adc #0, and the code is no longer a block you can simply repeat for self-modifying code or a subroutine: you would need to call clc conditionally, since you have to clear the carry at the start of each new loop but leave it set when it has to propagate into the next byte.
This is why I tend to use the clc and adc #1 combination; it is also more explicit that one is added and the carry is taken into account.
But I added both versions of the code thanks to your reply.
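For anyone following along without a 6502 at hand, here is a rough Python model of the clc / adc #1 / bcc pattern discussed above. It is only a sketch: the function name and the big-endian list layout are my own choices for illustration, not something from the repo.

```python
def increment_counter(counter):
    """Add 1 to a big-endian list of bytes, in place.

    Mirrors the clc / adc #1 / sta / bcc sequence: each byte gets a fresh
    "add 1", and we only move on to the next byte when the add carried.
    Returns True if the carry fell out of the most significant byte.
    """
    for i in range(len(counter) - 1, -1, -1):  # least significant byte is last
        carry = 0                       # clc
        total = counter[i] + 1 + carry  # adc #1
        counter[i] = total & 0xFF       # sta
        if total <= 0xFF:               # bcc loop: no carry, this increment is done
            return False
    return True                         # all bytes wrapped, counter is back at zero


counter = [0x00, 0x00, 0x00, 0xFF]
increment_counter(counter)
print(counter)  # [0, 0, 1, 0]
```

With adc #0 the carry from the previous byte would do the adding instead, which is exactly why the clc must not run before it.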
Another way to perform the calculations more quickly would be to measure the elapsed time for incrementing 16 bits, then multiply that by 65536 to extrapolate the time for the upper 16 bits that were not measured explicitly. Here is an example for the C64:
.C:033c A9 00 LDA #$00
.C:033e 85 FB STA $FB
.C:0340 85 FC STA $FC
.C:0342 E6 FC INC $FC
.C:0344 D0 FC BNE $0342
.C:0346 E6 FB INC $FB
.C:0348 D0 F8 BNE $0342
.C:034a 60 RTS
10 ti$="000000":sys828:printti
ready.
run
34
ready.
? 34*2^16/60/60/60"hours
10.3158519 hours
Edit: Since the Z80 and Arm calculations were performed using the CPU registers instead of memory, a fairer comparison might be doing something similar with the 6502...
.C:033c A2 00 LDX #$00
.C:033e A0 00 LDY #$00
.C:0340 C8 INY
.C:0341 D0 FD BNE $0340
.C:0343 E8 INX
.C:0344 D0 FA BNE $0340
.C:0346 60 RTS
10 ti$="000000":sys828:printti
ready.
run
22
ready.
?22*2^16/60/60/60"hours
6.67496297 hours
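The extrapolation in the two BASIC one-liners above can be written out like this. A small sketch, assuming TI counts jiffies of 1/60 s; the function name is made up for illustration.

```python
JIFFIES_PER_SECOND = 60  # TI ticks 60 times per second


def extrapolate_hours(jiffies_16bit):
    """Scale a measured 16-bit run (in jiffies) up to the full 32-bit count."""
    total_jiffies = jiffies_16bit * 2**16  # the upper 16 bits repeat the run 65536 times
    return total_jiffies / JIFFIES_PER_SECOND / 3600


print(round(extrapolate_hours(34), 2))  # 10.32 (zero-page version)
print(round(extrapolate_hours(22), 2))  # 6.67 (register version)
```

Keep in mind that TI only resolves whole jiffies, and every jiffy of measurement error gets multiplied by 65536 as well, which is about 0.3 hours in the final estimate.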
16:05 I think on M1's sophisticated operating system just running a program adds a considerable overhead related to creating a process etc. In other words the ~1 s result you got includes more operations than just addition to 0xFFFFFFFF. To get a more realistic timing you could, for example, run the whole addition loop 100 times and then divide the time by 100.
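A sketch of what I mean by repeating the loop inside one process, using Python as a stand-in for the assembly (so the absolute numbers mean nothing, only the method). The function names and the loop limit are placeholders.

```python
import time


def count_to(limit):
    """Stand-in for the addition loop: count from 0 up to limit."""
    n = 0
    while n < limit:
        n += 1
    return n


def average_seconds(limit, repeats):
    """Time `repeats` runs inside one process and return the mean per run.

    Process start-up and exit happen outside the timed region, so they
    are not counted at all.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        count_to(limit)
    return (time.perf_counter() - start) / repeats


print(average_seconds(10_000, 100))
```

This is different from launching the program many times and averaging: that still includes the process creation and exit cost in every single run.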
Only slightly more, but yeah. The exit is actually a pretty heavy system call and the start isn’t trivial either.
And I ran it many times and took the average, hence I said in the video it ran 1.39 seconds on average 😉
Yeah it’s incredible the power we have these days.
To visualize it, set the start address of "bytes:" to $0400, you'll see the counter in the upper left corner of the screen :)
But that would add extra overhead if you wanted to make it show something useful. I don’t know the PETSCII character numbers.
@@CallousCoder I mean: if you set the address *=$0400 before the part with the 4 bytes, it would store and work with the 4 character positions. With that, you can actually see how fast the routine is running on a C64 without using a machine code monitor.
BasicUpstart2(main) // 10 sys4096
*=$1000
main:
loop:
clc // clear carry flag
lda bytes+3 // read byte 3
adc #1 // add 1 (if a = ff and 1 is added, a is set to zero, which sets the carry flag and the zero flag)
sta bytes+3 // write to byte 3
bcc loop // if carry flag is not set then go to loop
// else continue
clc // clear carry flag
lda bytes+2 // read byte 2
adc #1 // add 1 (if a = ff and 1 is added, a is set to zero, which sets the carry flag and the zero flag)
sta bytes+2 // write to byte 2
bcc loop // if carry flag is not set then go to loop
// else continue
clc // clear carry flag
lda bytes+1 // read byte 1
adc #1 // add 1 (if a = ff and 1 is added, a is set to zero, which sets the carry flag and the zero flag)
sta bytes+1 // write to byte 1
bcc loop // if carry flag is not set then go to loop
// else continue
clc // clear carry flag
lda bytes+0 // read byte 0
adc #1 // add 1 (if a = ff and 1 is added, a is set to zero, which sets the carry flag and the zero flag)
sta bytes+0 // write to byte 0
bcc loop // if carry flag is not set then go to loop
// else continue
rts // end program
*=$0400
bytes:
.byte $00,$00,$00,$00 // display the bytes on the screen while they are going from 00 to ff
// it's better to use $fb to $fe (zero page) for this
@@CallousCoder Me neither, at least not off the top of my head, but you'd see the counter running on the screen:
*=$0400
bytes:
.byte $00,$00,$00,$00
You set the first 4 bytes of the screen to 0, then use these addresses for the counter, so you see each number increase while it's running :)
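On the character numbers: what lands in screen memory at $0400 is interpreted as screen codes rather than PETSCII, so a raw 0 shows as @, 1 as A, and so on. If someone did want readable hex digits, the mapping is small. A sketch, assuming the default uppercase character set (digits 0-9 are screen codes $30-$39, letters A-F are $01-$06); the function names are made up.

```python
def hex_digit_screen_code(nibble):
    """Return the C64 screen code that displays one hex digit (0-15)."""
    if not 0 <= nibble <= 15:
        raise ValueError("nibble must be in the range 0..15")
    # '0'..'9' sit at $30..$39, 'A'..'F' at $01..$06
    return 0x30 + nibble if nibble < 10 else nibble - 9


def counter_to_screen_codes(counter_bytes):
    """Turn counter bytes into two screen codes per byte (high nibble first)."""
    codes = []
    for value in counter_bytes:
        codes.append(hex_digit_screen_code(value >> 4))
        codes.append(hex_digit_screen_code(value & 0x0F))
    return codes


print(counter_to_screen_codes([0x00, 0x1A, 0xFF, 0x09]))
# [48, 48, 49, 1, 6, 6, 48, 57]
```

Doing this conversion on the 6502 inside the loop is the extra overhead mentioned above; the raw-bytes trick costs nothing because the counter simply lives in screen memory.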
From where are you bro ? I am from india 🇮🇳♥️
I’m from hell mwahhahahahahaaaahaaa 😜I’m Dutch
@@CallousCoder Brother, you are very good, keep it up, your channel will definitely grow.
Out of curiosity, you could try this one; it would be cool to see if the C128 could cut the time in half:
.const ZP = $fb
:BasicUpstart2(Main)
Main:
sei // Disable interrupts (CIA IRQ steals R-Time)
lda #$0b
sta $d011 // Turn off screen to remove badlines
// inc $d030 // Make use of that untapped C128 power (Disabled for now)
lda #$00
sta ZP
sta ZP + 1
sta ZP + 2
sta ZP + 3 // Reset counter/number to zero
Loop:
inc ZP + 3
bne Loop
inc ZP + 2
bne Loop
inc ZP + 1
bne Loop
inc ZP + 0
bne Loop
lda #$1b
sta $d011 // Turn the screen back on
// dec $d030 // Switch back to 1 MHz Mode
cli // Reenable interrupts
rts
There might be some gain by relocating the code to ZP, but it is not immediately obvious.
EDIT: I also guess the CIA timers can be chained to make a 32-bit counter, so then each "inc" would take 1 cycle :)
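A back-of-the-envelope cycle count for the inc/bne chain, as a Python sketch. The assumptions are mine: INC zero page takes 5 cycles, BNE takes 3 when taken and 2 when not, no branch crosses a page boundary, the clock is a PAL C64 at 985248 Hz, and nothing steals cycles (screen off, interrupts disabled).

```python
PAL_CLOCK_HZ = 985_248  # assumed PAL C64 CPU clock
INC_ZP = 5              # cycles for INC zero page
BNE_TAKEN = 3           # cycles for a taken branch, no page crossing
BNE_NOT_TAKEN = 2       # cycles when the branch falls through


def chain_cycles(num_bytes=4):
    """Total cycles for the inc/bne chain to wrap a counter of num_bytes."""
    total = 0
    for level in range(num_bytes):                 # level 0 is the lowest byte
        executions = 256 ** (num_bytes - level)    # how often this inc runs
        fall_throughs = executions // 256          # times the byte wraps to zero
        taken = executions - fall_throughs
        total += (executions * INC_ZP
                  + taken * BNE_TAKEN
                  + fall_throughs * BNE_NOT_TAKEN)
    return total


hours = chain_cycles(4) / PAL_CLOCK_HZ / 3600
print(round(hours, 1))  # 9.7
```

As a sanity check, chain_cycles(2) comes out at roughly 32 jiffies, against the 34 measured earlier in the thread for the 16-bit zero-page loop with the screen and interrupts still on.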
Great suggestions! It’s a shame I sold my C128 recently 🙄I had no nostalgia for it but now it would’ve been nice to have it. But luckily there are emulators 😉
This should be faster for sure as we just do an inc
Oh wow, I thought fd and fe were the only two available ZP addresses. I saw your code and looked up the zero page memory map, and there are indeed 4! I totally did not recall that :D
I added your inc code to the repo too (without disabling the CIA) and am running it currently. Should be faster for sure, not only because of the ZP but also from using inc alone.
@@CallousCoder For science! 😁