Z80 Optimisation Tricks

Ready? Z80

5,003 views

Published on 13 Jul 2024
Do you want to make your Z80 code smaller and more efficient? Do you have limited memory space to code your next Z80 game? This video goes through some optimisation tricks you can use to make your code smaller and faster.
00:00 Intro
01:03 Use LDIR to initialise memory
03:02 Call fall through
05:17 Use JP instead of CALL/RET
06:29 Some Bytes have nice Bits
09:01 Hex to ASCII conversion trick
11:56 Quickies
16:00 Putting it all together
For the example code shown in this video, have a look here:
github.com/bchiha/Ready-Z80/t...
Some other cool Z80 optimisation tricks:
wikiti.brandonw.net/index.php...
z80-heaven.wikidot.com/optimiz...
And my all-time favourite Z80 reference site:
clrhome.org/table/
Check out this video on how to use Retro Virtual Machine for your Z80 coding...
• Z80 Coding with Retro ...
Science & Technology

Comments • 58

@jtsiomb 2 years ago ⁺⁶
Every time you call an assembly routine "method", god kills a kitten.
@PeterLawton 1 year ago
LOL!
@PHILG2864 1 year ago ⁺²
Similarly when you call 'assembling' 'compiling', even though you know you're using an assembler. 🙂
@PebblesChan 2 years ago ⁺³
For bit 7 checks of the A register, just use JP M,nnnn or JP P,nnnn after an appropriate arithmetic function on A.
@thediaclub4781 1 year ago
I didn't know that there are so many instructions that the gameboy doesn't have. Impressive.
@1971merlin 1 year ago ⁺¹
In the LDIR example, you load BC with 000F, which is 15. You have to load n-1 into BC for the fill because you pre-seed the first location manually; otherwise you'd overwrite one byte past the end of the target block.
@PebblesChan 1 year ago
It's always good to precheck the B/BC value particularly for calculated values prior to executing LDIR and DJNZ instructions because zero values can have unintended effects.
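A minimal Python sketch of the point made in this thread, not code from the video: `ldir` is a hypothetical model of the instruction's copy loop, and `fill` is an illustrative helper that seeds the first byte and then copies it forward n-1 times. The `overrun` buffer shows what happens if BC is loaded with n instead.

```python
def ldir(mem, hl, de, bc):
    """Model of Z80 LDIR: copy (HL) to (DE), step both up, stop when BC reaches 0."""
    while True:
        mem[de] = mem[hl]
        hl = (hl + 1) & 0xFFFF
        de = (de + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF   # BC = 0 on entry wraps to FFFFh: 65536 copies
        if bc == 0:
            return hl, de, bc


def fill(mem, start, length, value):
    """Seed the first byte, then let LDIR copy it forward length - 1 times."""
    mem[start] = value                       # LD (HL),value
    if length > 1:                           # precheck: never enter LDIR with BC = 0
        ldir(mem, start, start + 1, length - 1)


# Correct count: clear 16 bytes at offsets 4..19 of a buffer full of AAh.
mem = bytearray([0xAA] * 32)
fill(mem, 4, 16, 0x00)

# Off-by-one count: BC = n instead of n - 1 also clears offset 20.
overrun = bytearray([0xAA] * 32)
overrun[4] = 0x00
ldir(overrun, 4, 5, 16)
```

In the first case only the 16 intended bytes change; in the second, the byte just past the block is overwritten as well.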
@PebblesChan 2 years ago ⁺²
Rather than using PUSH AF, RLCA, RLCA, RLCA, RLCA, CALL , POP AF, have your param in (HL) and use the RLD instruction. The RLD and RRD instructions are specifically for working with nybbles and complement DAA, which is very useful for Binary/BCD to ASCII and vice versa. E.g. if you have (HL) = 0x12, then after LD A,#'0', an RLD will set A='1' and a subsequent RLD will set A='2'; optionally do another RLD if you want to restore (HL)'s original value.
@ReadyZ80 2 years ago
Okay, this is a nicer way to do the nibble convert. When using RLD, A is set directly, so there is no need to mask out the upper nibble. Thanks for sharing! I'll use this method in an upcoming project.
@MarkOfBitcoin 2 years ago
Nice one!
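The RLD walk-through above can be traced with a small Python model. This is only a sketch: `rld` is a hypothetical function that mimics the documented nibble rotation between A and (HL), and it ignores flags. As in the comment, it yields ASCII directly only for nibbles 0 to 9 (BCD digits).

```python
def rld(a, m):
    """Model of Z80 RLD: rotate nibbles left through A's low nibble and m = (HL)."""
    new_a = (a & 0xF0) | (m >> 4)            # high nibble of (HL) -> low nibble of A
    new_m = ((m << 4) & 0xF0) | (a & 0x0F)   # low nibble of (HL) moves up, A's low nibble fills in
    return new_a, new_m


a, m = ord("0"), 0x12        # LD A,'0' with (HL) = 12h

a, m = rld(a, m)             # first RLD
first_digit = chr(a)

a, m = rld(a, m)             # second RLD
second_digit = chr(a)

a, m = rld(a, m)             # optional third RLD restores (HL)
restored = m
```

Because the upper nibble of A is untouched by RLD, the '0' (30h) loaded beforehand acts as the ASCII offset for each digit.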
@PebblesChan 2 years ago ⁺³
AND A is also often used instead of CP #0 or OR A.
@1971merlin 1 year ago
The H flag behaves differently. This may or may not matter depending on what your code is doing.
@PebblesChan 1 year ago
Agreed, only the CP instruction has a variable outcome on the Z80's H flag (the Auxiliary Carry bit on the 8080), whereas AND always sets it and OR always clears it.
@markcummings150 2 years ago
Great tips. I’m lucky if I can get my Z80 code to run without using any optimization. Maybe in the future I will try some of these tricks.
@jrkorman 1 year ago
Number one rule when I was working as a programmer - Make it work first - Testable! - Then once you have the tests in place and working code you may optimize if required! If you break something, the tests will tell you!
@HellCRICKET 2 years ago
I have used these in an 8085 microprocessor simulator. It is a really cool thing.
@gregorymccoy6797 1 year ago
Some of these are pretty awesome. The others I knew about and so less awesome 😀
@balorprice a month ago
This is severely my jam! Can I add more stuff?
Firstly, the way the fall-through example is written means it will execute 10 times, because after do_it and proc have been processed, it'll return back to the instruction after CALL do_it and do the whole thing again.
OR A is a quick way to replace CP A,0
Using the alternate register set is really quick when you run out of registers in a subroutine. The trick is to make sure all the stuff that needs to work together are in the same set, and only use A, the index registers and stack to transfer between them, because EXX does not swap A for A' register. You can preserve the flags with EX AF,AF' for use later this way too.
LDIR has a great 'bucket chain' algorithm that again isn't a trick: Doing something like LD HL,0; LD DE,1; LD BC,80 - 1; LD (HL),n; LDIR will fill 80 bytes with value n. That's because DE runs one byte ahead of HL, so each byte HL reads has just been written by the previous copy, and the same value gets passed all the way through. You can use LDDR for this too, backwards, or avoid the effect by using the opposite instruction. If you need speed, a string of LDIs is a lot faster than the LDIR.
Aligning tables to a 256-byte boundary has more benefits: Say you want to have a buffer that loops (for example, scanning keys that might not all be processed immediately). You can use LD HL,table; ...[do stuff] ; INC L to loop back to the start without doing an out of bounds check.
Storing word-length lookup tables 'vertically', aligned to a 256 byte boundary helps with speed, as you don't have to double values to get to the correct offset. In this case you do something like LD E,(HL); INC H; LD D,(HL); DEC L
You can change the code directly by using LD (replace + 1),a; ...[more code] ;replace: LD A,0. The 0 is changed to your value. This is really useful because the biggest optimisations come from identifying the most intensive loops and removing as many bytes/t-states from these as possible. In this case, A is freed up for more use between the instructions. However, I concede it's faster to load from a free register if you have one spare. Using the index registers as 8-bit registers works in most cases - LD A,IXL is valid for most new assemblers.
Extending this idea: You can use the stack as a standard register by doing LD (replace + ),sp; LD SP,nnnn, which means you can PUSH data directly into a buffer backwards. PUSH is only 11 t-states to store two bytes, a miracle. You'll need to make sure interrupts don't corrupt this, so it's better to PUSH information which can be corrected, rather than POPping from a table which may get corrupted. You can skip bytes with LD SP,nn directly, or something like LD HL,offset; ADD HL,SP; LD SP,HL
Using jump tables is great to embed logic, so instead of the example where you test A for specific values and jump to routines, you have a list of the options as a table of routine addresses. After you've got your jump address, use LD (jump + 1),HL; jump: CALL 0 to automatically return from your routine. This stops you having to worry about putting JP nnnn after each of these routines.
For speed, it always makes sense to repeat blocks of code rather than doing lots of loops. Even repeating something 4 or 8 times will speed things up a lot, if you have the memory to spare. If you can 'unroll' a loop entirely, you also free up a register to use.
Lastly, I alluded to this before, but: Optimisation for speed/memory is really addictive and you'll probably end up wasting lots of time on your new obsession. But it's way more important to keep your code readable for checking later, so always concentrate on optimising the intensive loops and not worrying about saving a few T-states every now and then.
Okay, nerd rant over! Happy optimising.
@ReadyZ80 a month ago ⁺¹
Wow. Thanks for your contribution. I can make a whole video on this.
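The 'vertical' word table from the comment above (low bytes in one 256-byte page, high bytes in the next) might be modelled as follows. This is a rough Python sketch under stated assumptions: the table base `0x0300`, the sample values and the helper names are all hypothetical, and the register moves are shown as comments.

```python
PAGE = 256


def build_vertical_table(words, base):
    """Lay out 16-bit values as a page of low bytes followed by a page of high bytes."""
    if base % PAGE != 0 or len(words) > PAGE:
        raise ValueError("table must be page-aligned and hold at most 256 entries")
    mem = bytearray(base + 2 * PAGE)
    for i, word in enumerate(words):
        mem[base + i] = word & 0xFF          # low byte in the first page
        mem[base + PAGE + i] = word >> 8     # high byte in the following page
    return mem


def lookup(mem, base, index):
    """Fetch entry `index` without doubling it to form an offset."""
    h, l = base >> 8, index & 0xFF           # H = table page, L = index
    e = mem[(h << 8) | l]                    # LD E,(HL)
    h = (h + 1) & 0xFF                       # INC H
    d = mem[(h << 8) | l]                    # LD D,(HL)
    return (d << 8) | e                      # DE now holds the 16-bit entry


words = [0x1234, 0xBEEF, 0x0042]             # illustrative values only
table = build_vertical_table(words, base=0x0300)
fetched = [lookup(table, 0x0300, i) for i in range(len(words))]
```

The index goes straight into L, which is the saving the comment describes: an interleaved table would need the index doubled first.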
@johankoelman2996 1 year ago
Know your register values, so when you point HL to a new address within the same block, you only need to do LD L,n.
@PebblesChan 2 years ago ⁺¹
When combining the loading of 2 registers such as B & C as a pair, in assembly language use a calculation such as the following to clearly indicate your intention.
e.g. LD BC,#(LOOPB_COUNT
@ReadyZ80 2 years ago
For the example here, I just 'hard coded' the values but I would if possible use LD BC,((B_VALUE) * 256) + C_VALUE, using
@1971merlin 1 year ago
Use of
@PebblesChan 1 year ago
It's a feature of the current Zilog Z80 macro assembler and manual assembly. When a handy function is available make full use of it!
@kevincozens6837 2 years ago
The HEX to ASCII methods look very interesting. I will be trying that out with some software I'm currently writing. I once rewrote a monitor program for a slightly different processor and needed to save bytes. The original program fit in to 1K and my modified and enhanced version was just a hair over that. There are times being able to save a byte or two becomes a fun challenge.
@johncochran8497 2 years ago
Might want to look at:
CP 0Ah
SBC A,69h
DAA
Say in something like:
BYTE:
PUSH AF
RRA
RRA
RRA
RRA
CALL NYBBLE
POP AF
NYBBLE:
AND 0Fh
CP 0Ah
SBC A,69h
DAA
; Output A, preserve registers
CALL PRINT
RET
and of course, the CALL PRINT;RET could be JP PRINT
@kevincozens6837 2 years ago
@@johncochran8497 That would be a shorter way of doing the conversion if it worked but it doesn't.
@ReadyZ80 2 years ago
I haven't seen that method before. Thanks for sharing. Also the nice hex to ASCII routine with call fall through.
@johncochran8497 2 years ago
@@kevincozens6837 Really? Perhaps you're using a buggy emulator since DAA is not used too often in subtract mode. So let's look at the code.
; Entry, A is 00h to 0Fh. Don't care about the flags.
CP 0Ah
; A unchanged. C if A is 00h to 09h
SBC A,69h
; If A was 00h to 09h, it's now 96h to 9Fh. C is set, H is set, N is set
; If A was 0Ah to 0Fh, it's now 0A1h to 0A6h. C is set, H is clear, N is set
DAA
; Since N is set, DAA will subtract one of {00h,06h,60h,66h} depending upon the value in A and the C and H flags.
; If A has 96h to 9Fh and both C and H set, 66h will be subtracted, giving 30h to 39h (ASCII '0' to '9')
; If A is 0A1h to 0A6h and H is clear, the 60h will be subtracted, giving 41h to 46h (ASCII 'A' to 'F')
Overall, no problem. But if you have a buggy emulator that properly handles DAA in addition mode only, look at:
OR 0F0h
DAA
ADD A,0A0h
ADC A,040h
7 bytes total which may seem less efficient. But notice in the hex routine the AND 0Fh at the beginning of NYBBLE. That's there to allow any value instead of restricting it to 00h to 0Fh. The 4 opcodes can be directly replaced with the sequence above for the same total byte count.
Code description:
OR 0F0h
; Set upper nybble to all ones. Clear N,C,and H flags
DAA
; Will convert A to 50h to 59h, 60h to 65h
ADD A,0A0h
; Want to force a carry if upper nybble is 6. So A is now 0F0h to 0F9h, 00h to 05h. Carry set if A is 0 to 5.
ADC A,40h
; And convert to ASCII RANGE.
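Both sequences discussed in this thread can be checked against a small Python model of the flag behaviour. This is a sketch under assumptions: `sub8`, `add8` and `daa` are hypothetical helpers that model only the C, H and N effects described in the comments above (following the commonly documented DAA rules), not a full emulator.

```python
def sub8(a, operand, carry_in):
    """8-bit subtract with borrow; returns (result, C, H). N is set by any subtract."""
    result = a - operand - carry_in
    half = (a & 0x0F) - (operand & 0x0F) - carry_in
    return result & 0xFF, int(result < 0), int(half < 0)


def add8(a, operand, carry_in):
    """8-bit add with carry; returns (result, C)."""
    result = a + operand + carry_in
    return result & 0xFF, int(result > 0xFF)


def daa(a, c, h, n):
    """Decimal adjust: add (N=0) or subtract (N=1) 00h/06h/60h/66h; returns (A, C)."""
    adjust, carry = 0, c
    if h or (a & 0x0F) > 9:
        adjust |= 0x06
    if c or a > 0x99:
        adjust |= 0x60
        carry = 1
    a = (a - adjust if n else a + adjust) & 0xFF
    return a, carry


def nibble_to_ascii_sbc(value):
    a = value & 0x0F                  # AND 0Fh
    _, c, _ = sub8(a, 0x0A, 0)        # CP 0Ah: flags only, A unchanged
    a, c, h = sub8(a, 0x69, c)        # SBC A,69h
    a, _ = daa(a, c, h, n=1)          # DAA in subtract mode
    return a


def nibble_to_ascii_add(value):
    a = (value | 0xF0) & 0xFF         # OR 0F0h: clears C, H and N
    a, _ = daa(a, c=0, h=0, n=0)      # DAA in addition mode
    a, c = add8(a, 0xA0, 0)           # ADD A,0A0h
    a, _ = add8(a, 0x40, c)           # ADC A,40h
    return a


digits_sbc = "".join(chr(nibble_to_ascii_sbc(v)) for v in range(16))
digits_add = "".join(chr(nibble_to_ascii_add(v)) for v in range(16))
```

Under this model both routines map 00h to 0Fh onto '0' to '9' and 'A' to 'F', which matches the step-by-step comments in the listing.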
@johankoelman2996 1 year ago
When you have a lot of values to test with CP and JR, use a table that points to the jump addresses.
@ReadyZ80 1 year ago
Thanks for your input!
@TheTurnipKing 2 years ago ⁺²
the hatred of goto stems from the push for higher level languages. Which makes sense, you have better options there.
but on an old 8-bit processor, that's how the hardware really worked. And all those higher level features HAVE to be, on some level, ENABLED by the use of jumps.
@PeterLawton 1 year ago
My hatred of goto stems from some developers creating spaghetti code with them, chock full of disorganized ideas. But well thought out code can use them respectably, especially in ASM.
@dmitrykrapivin2808 2 years ago ⁺¹
change xor a:ld (hl),a to ld (hl),0
@TanjoGalbi 11 months ago
The LDIR example you gave is incorrect, well, the comment is! Where you are setting BC to 000FH your comment says you are setting BC to 16, F = 15! 😛
@ChrisSavageEngineer 1 year ago
What editor software are you using? What compiler?
@ReadyZ80 1 year ago ⁺¹
VS Code for the editor and z80asm compiler on a Mac osx
@ChrisSavageEngineer 1 year ago
@Ready? Z80 thanks for the reply. I keep finding Mac and Linux users with good tools. I'm having trouble finding the same for Windows.
@youreale 2 years ago
LDIR is a very nice trick.
@n8wrl 1 year ago
I think TRS-BASIC used it to clear the screen
@andrewdunbar828 1 year ago
@@n8wrl LDIR is the slowest way to clear the screen. It's only 1024 bytes though so not as critical as other Z80-based machines.
@DavidHembrow 2 years ago ⁺³
It's not really worth doing the two back-to-back push/pop operations. Two 8-bit loads are a quicker way of achieving a 16-bit LD RR,RR, and they also use two bytes. Sometimes we can use ex de,hl instead of that, though.
@slartibartfastBB 2 years ago
Good point. If BC is already available, just doing a LD D,B and LD E,C is good enough. I'm just showing here that the stack can be used to set registers. Stack modification can also be used to change the return address after a CALL
@DavidHembrow 2 years ago
@@slartibartfastBB Indeed. I used to often put null terminated text inline which would be output by the called routine. No need to load the address of the text before calling an output routine so it saved bytes.
@johncochran8497 2 years ago ⁺³
The LDIR at 2:40, 13 bytes.
Other way.
XOR A
LD HL,DATA
LD B,16
LOOP:
LD (HL),A
INC HL
DJNZ LOOP
10 bytes, but slightly slower.
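The size and speed trade-off can be tallied from documented instruction sizes and T-states. A rough Python sketch follows: the DJNZ loop is the one listed above, while the LDIR sequence is an assumed 13-byte reconstruction (XOR A; LD HL,nn; LD (HL),A; LD DE,nn; LD BC,nn; LDIR), not necessarily the code shown in the video.

```python
# Size in bytes and T-states for each instruction that runs a fixed number of times.
COST = {
    "XOR A":     (1, 4),
    "LD HL,nn":  (3, 10),
    "LD DE,nn":  (3, 10),
    "LD BC,nn":  (3, 10),
    "LD B,n":    (2, 7),
    "LD (HL),A": (1, 7),
    "INC HL":    (1, 6),
}
DJNZ = (2, 13, 8)     # bytes, T-states when the jump is taken, T-states on exit
LDIR = (2, 21, 16)    # bytes, T-states per repeated byte, T-states for the last byte


def cost(names):
    """Total (bytes, T-states) for a straight-line list of instructions."""
    return sum(COST[n][0] for n in names), sum(COST[n][1] for n in names)


def djnz_fill(count):
    """Cost of the DJNZ loop from the comment for `count` bytes (1..256)."""
    setup_size, setup_t = cost(["XOR A", "LD HL,nn", "LD B,n"])
    body_size, body_t = cost(["LD (HL),A", "INC HL"])
    size = setup_size + body_size + DJNZ[0]
    t_states = setup_t + count * body_t + (count - 1) * DJNZ[1] + DJNZ[2]
    return size, t_states


def ldir_fill(count):
    """Cost of the assumed LDIR sequence for `count` bytes (count >= 2)."""
    size, t_states = cost(["XOR A", "LD HL,nn", "LD (HL),A", "LD DE,nn", "LD BC,nn"])
    copies = count - 1                      # first byte is seeded by LD (HL),A
    size += LDIR[0]
    t_states += (copies - 1) * LDIR[1] + LDIR[2]
    return size, t_states


djnz_cost = djnz_fill(16)
ldir_cost = ldir_fill(16)
```

For a 16-byte fill this gives 10 bytes and 432 T-states for the DJNZ loop against 13 bytes and 351 T-states for the assumed LDIR version, so the loop saves 3 bytes at the price of roughly a quarter more time.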
@1971merlin 1 year ago
This goes to show that the purpose of optimisation is important. Smaller code is not always faster. Are you trying to save clock cycles or bytes of memory, or both?
Push and pop are slow at 11 and 10 clock cycles respectively, but each is only a one-byte opcode. LD reg,reg is also only one byte but takes only 4 clock cycles.
So push bc / pop de = 2 bytes, 21 clocks
ld d,b / ld e,c = 2 bytes, 8 clock cycles.
The fact that push and pop use memory can further impact performance via wait states in DRAM access, as well as needing available stack space.
@PebblesChan 1 year ago
Did you know that EX DE,HL is just an instruction that toggles the CPU's DE/HL hardware addressing switch? No actual data is moved.
@SpeccyMan 1 year ago
Perhaps you can now discuss the convention of not using decimals > 9 to describe hexadecimal numbers. 10H is NOT ten hex, it is one zero hex!
@PHILG2864 1 year ago
Strictly yes Nick, but everyone I have ever known in computing (50 years for me) would call it ten hex
@johankoelman2996 1 year ago ⁺¹
LDIR is not a win to clear a block. LD HL,address LD B,number LD (HL),value DJNZ loop
@SpeccyMan 1 year ago
LDIR can clear blocks > 256 unlike your routine!
@PHILG2864 1 year ago ⁺¹
...that would write 'value' to the very same address B times. You need to INC HL 🙂
@PebblesChan 1 year ago
The clear winner is using a Z80 DMA; however, most Z80 systems don't have one! If you do have a Z80 DMA with DRAM, you'll have to ensure that it is properly refreshed.
@andrewdunbar828 1 year ago
Game code on Z80 systems back in the day used an actual trick to clear larger amounts of memory with the CPU, such as clearing the screen. You save the stack pointer and all of the registers, set the stack pointer to the high end of the memory you want to clear and set all the registers to zero. You then push all the registers onto the stack in an unrolled loop. Finally, you restore your saved registers and stack pointer. There is a tradeoff in how much you unroll your loop. More unrolling means larger code size but faster memory clearing. Some techniques used all the alt. registers too. There are similar tricks for copying large amounts of memory, used for instance in double buffering on systems that only had a single range of screen memory. LDIR is the obvious way and is not a "trick". When learning Z80 we all fell in love with LDIR and DJNZ of course, but when we got good at Z80 we realized they were slow and avoided them. On those old systems speed optimizations = cycle count. On modern systems compilers know the efficiency of all the instructions, but in addition to cycle count they now have to take into account more complex factors such as cache and pipeline behaviour.
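The stack-based clear described above could be modelled like this. It is a minimal Python sketch with hypothetical names: `push` mimics how PUSH writes the high byte at SP-1 and the low byte at SP-2, and `stack_clear` stands in for the save-SP, point-SP-at-the-end, unrolled-PUSH, restore-SP sequence. Interrupt handling, which matters on real hardware while SP is repurposed, is left out.

```python
def push(mem, sp, value16):
    """Model of PUSH rr: high byte to SP-1, low byte to SP-2, SP decreases by 2."""
    sp = (sp - 1) & 0xFFFF
    mem[sp] = value16 >> 8
    sp = (sp - 1) & 0xFFFF
    mem[sp] = value16 & 0xFF
    return sp


def stack_clear(mem, caller_sp, start, length, unroll=4):
    """Clear `length` bytes from `start` by pushing a zeroed register pair."""
    if length % (2 * unroll) != 0:
        raise ValueError("sketch assumes the block is a whole number of passes")
    saved_sp = caller_sp                     # save SP somewhere safe first
    sp = start + length                      # LD SP,end of block (one past the last byte)
    hl = 0x0000                              # LD HL,0
    for _ in range(length // (2 * unroll)):  # outer loop counter
        for _ in range(unroll):              # unrolled: PUSH HL repeated `unroll` times
            sp = push(mem, sp, hl)
    assert sp == start                       # the stack has walked down to the block start
    return saved_sp                          # restore the caller's SP


screen = bytearray(b"\xFF" * 64)
restored_sp = stack_clear(screen, caller_sp=0x1000, start=16, length=32)
```

Raising `unroll` trades code size for fewer loop iterations, which is the tradeoff the comment mentions.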
@andrewdunbar828 1 year ago ⁺²
LDIR is neither a trick nor an optimization. It's a pessimization. It's extremely slow. Optimized Z80 code back in the day avoided it. It is nice and readable and takes fewer bytes so if you're optimizing for code readability or binary size in code that doesn't need to be fast then it might be worth using.
@theALFEST 10 months ago
LDIR is slow for filling memory but ok for copying data from one place to another.
