You got the assembler almost perfectly right!
The jump actually takes 2 CPU clock cycles. The loop still takes 5 cycles per iteration because the number "1" stays in a register throughout the loop.
---
ldi r25, 0x01 ; load value 1 into register 25
loopstart:
in r24, 0x0b ; [1 cycle] load port D (I/O address 0x0b) into register 24
eor r24, r25 ; [1 cycle] xor with 1 to flip bit 0
out 0x0b, r24 ; [1 cycle] store result back into port D
rjmp .-8 ; [2 cycles] jump to loopstart
---
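The cycle count above can be turned into an output frequency with a little arithmetic. This is only a minimal sketch: the 16 MHz clock in the usage note is an assumption (the usual Arduino Uno clock, not stated in the comment), and the helper names are made up for illustration. Each pass through the loop flips the pin once, so one full square-wave period takes two passes.

```cpp
#include <cstdint>

// Cycle costs as quoted in the listing above (ATmega328P).
constexpr uint32_t kInCycles   = 1;  // in
constexpr uint32_t kEorCycles  = 1;  // eor
constexpr uint32_t kOutCycles  = 1;  // out
constexpr uint32_t kRjmpCycles = 2;  // rjmp

// Cycles spent in one pass through the toggle loop (one pin flip).
constexpr uint32_t loopCycles() {
    return kInCycles + kEorCycles + kOutCycles + kRjmpCycles;
}

// Square-wave frequency in Hz: a full period needs two flips.
// clockHz is whatever the board runs at (hypothetical parameter).
constexpr double squareWaveHz(double clockHz, uint32_t cyclesPerToggle) {
    return clockHz / (2.0 * cyclesPerToggle);
}
```

With an assumed 16 MHz clock, `squareWaveHz(16e6, loopCycles())` works out to 1.6 MHz, and a 4-cycle loop would give 2 MHz.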
And you can get it down to 4 clock cycles by doing this:
byte tmp = PORTD;
while (1) {
  tmp ^= 1;
  PORTD = tmp;
}
I'll let you figure out why that is, maybe you explain it in a future video :wink:
---
For other viewers
* Here's how to get the assembler listing: [arduino-install-path]/hardware/tools/avr/avr/bin/objdump -S [build-folder]/[sketch-name].ino.elf > assembler.asm
* Here's a table with atmega328p instructions and cpu cycles per instruction: docs.google.com/spreadsheets/d/1EzwMkWOIMNDqnjpbzuchsLx5Zq_j927tvAPgvmSuP6M/edit#gid=1419860012
---
btw, you should like hearing your own voice, you have interesting things to say!
Another great video, thanks. I discovered you yesterday and I like it. Thanks for teaching and bringing us along on your journey.
I tried to read about assembly a few years back... I didn't understand a thing back then, except that it's low level and difficult. But now, even if your "commands" are not exact, it seems we're actually getting somewhere. Thank you.
Simply put: You sir, are a scream of an educator!
You sure do like teasing us! :D Can't wait for the next video!
I used to write mainframe Assembler. With 22 instructions, 15 registers, and 2GB of memory, you could write code to do anything you could dream of... in green and black. When I switched to COBOL, I used to decompile the COBOL into Assembler to optimize my COBOL programs... a fun game for a nerd. Debugging Assembler was extremely tedious, and writing it was 4 times slower than doing the same in COBOL, but boy was it fun. I mess with Arduino programming now, so you have my attention.
I also like hearing your voice.
Excellent tutorial..!! Earned my subscription..!! Keep them coming..!!;)
The ringing on the square wave is most likely due to that long ground lead on the end of your oscilloscope test probe.. If you shorten that thing up as much as possible I think you will actually see a very clean square wave.. Some probes come with a little stubby ground probe you can replace the long wire and gator clip with..
Scope probes typically come in a pouch with a few accessories.. The little stubby ground probe most likely remains one of those things you glance at maybe once and go straight for the long gator clip attachment..
The high frequency components in the leading and trailing edges of that square wave hit that long ground lead of the probe with a similar effect to plunking one of those old door stop springs..;)
I did bench work for a few years..;)
I was sorry to hear that someone told you that you talk too much. They say a picture is worth a thousand words. I am totally blind, so not much of a visual learner. You explain things in a way that I can follow. Not just the concepts, but the circuit layout, etc. I wanted to learn how to make an R2R DAC. I had to listen to about 8 videos just to learn how the darn thing was connected. Grrr, but who cares, I got there in the end. I especially like your videos because it's one and done. Let me tell you, making disjointed notes from multiple videos flow in a way that makes sense, then finding a couple of other videos to verify what you think you figured out, is challenging and tedious. I listened to a vid of yours yesterday, and subscribed within 30 seconds. Yes, I "liked" it too. Now I've been binge-watching your channel and learning in 10 minutes what could take 2 or 3 hours to scrape together from other channels. You just keep right on talking. For the curious, I hear voices. My computer, phone, multimeter, etc., they all talk to me. I've also made some tactile gauges with servos. They hold their position while being touched. For example, I made a speedometer for my exercise bike with an ATtiny85, a servo, and the input connection the original, inaccessible computer used. Works great, and easy to check without interrupting my YouTube video. Oh, I put it in an Altoids box, because no self-respecting hobbyist doesn't have at least one project in an Altoids box!
Using PINB = (1
I even used this high-frequency flipping for some RF radio communication: at 20 MHz I can flip a port at 10 MHz (without a loop), and its 9th and 11th harmonics can be picked up by an AM receiver at 90 and 110 MHz. To improve range, a BJT-based common-base amplifier, a simple LC filter, and a small wire antenna can be added.
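The harmonic figures above follow from the shape of the signal: an ideal 50% duty-cycle square wave contains only odd harmonics, and the n-th one has 1/n the amplitude of the fundamental. A rough sketch of that arithmetic follows. It assumes an ideal square wave (a real pin output will differ), and the struct and function names are illustrative only.

```cpp
#include <vector>

// One spectral component of an ideal 50% duty-cycle square wave.
struct Harmonic {
    int order;                 // 1 = fundamental, 3 = third harmonic, ...
    double frequencyHz;        // order * fundamental
    double relativeAmplitude;  // amplitude relative to the fundamental (1/order)
};

// List the odd harmonics up to and including maxOrder.
std::vector<Harmonic> oddHarmonics(double fundamentalHz, int maxOrder) {
    std::vector<Harmonic> result;
    for (int n = 1; n <= maxOrder; n += 2) {
        result.push_back({n, fundamentalHz * n, 1.0 / n});
    }
    return result;
}
```

For a 10 MHz fundamental, `oddHarmonics(10e6, 11)` includes entries at 90 MHz and 110 MHz, each at roughly a tenth of the fundamental's amplitude, which is why the amplifier and filter mentioned above help with range.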
PORTB = (1 I achieved 2.66MHz square wave on 8Mhz ATmega328p
Should be 4MHz if the pin toggles every cycle.
@@Henry-sv3wv the looping JMP instruction costs 1 cycle.
@@sullivanzheng9586
Ah, OK. I found out that rjmp actually takes 2 cycles.
So the fastest you can go with a 328P Arduino at 16 MHz, if you want a 50% duty cycle (after the first loop run), is 2 MHz with this "asm loop":
sbi, nop, nop, cbi, rjmp (each takes two clock cycles except nop, which takes one)
Unless you modify all 8 port bits at the same time, in which case it can go faster; faster still is a raw unrolled sequence of out instructions.
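The 2 MHz and 50% figures can be checked by adding up the cycles. This is a small sketch under two assumptions: the pin changes state when sbi or cbi finishes executing, and the clock is the 16 MHz mentioned above. The type and function names are hypothetical.

```cpp
#include <cstdint>

// High and low time of one waveform period, in CPU cycles.
struct Waveform {
    uint32_t highCycles;
    uint32_t lowCycles;
};

// Loop body: sbi (2), nop (1), nop (1), cbi (2), rjmp (2).
// Assumption: the pin goes high at the end of sbi and low at the end of cbi,
// so the high phase spans nop + nop + cbi and the low phase spans rjmp + sbi.
constexpr Waveform sbiCbiLoop() {
    return Waveform{1 + 1 + 2, 2 + 2};
}

// Output frequency in Hz for a given CPU clock.
constexpr double frequencyHz(double clockHz, Waveform w) {
    return clockHz / static_cast<double>(w.highCycles + w.lowCycles);
}

// Fraction of the period the pin spends high.
constexpr double dutyCycle(Waveform w) {
    return static_cast<double>(w.highCycles) /
           static_cast<double>(w.highCycles + w.lowCycles);
}
```

Both phases come out at 4 cycles, so the period is 8 cycles: 2 MHz at 16 MHz with a duty cycle of 0.5. The two nop instructions are what balance the phases.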
Dear Simply Put, let me start with saying that I'm glad I found your channel, so much knowledge and a pleasant way of hosting,
I wasn't sure on which video to ask this question but I was sure it was you I would be asking, so may as well do it in this video as it is somewhat related (ahum)
For me electronics is like: I get it, but I don't. Now that I'm programming an ATtiny85 through my parallel port, I can't figure out for the life of me whether I need to use resistors between the pins of the parallel port and the pins of the ATtiny85. In one of your videos you mention that resistors will slow down response time, but I also now have four damaged parallel port cards. My clock pulse resembles your square wave, with noise when changing state, which I think is causing false reads when programming at higher speeds. So my second question is: how can I clean up the signal?
I hope you are alright and continue posting soon
Dude, that was Rad. Kinda like doing a burnout in an ATmega328p.... Kinda... ;-p
Great video as always. You certainly put a lot of work into these, but they're some of my favorites I've seen on YouTube (and I've been here a long time). I wonder what you think the Arduino IDE is doing to the instructions when you change to the "fastest" compilation. Or better yet, how many cycles does that save? Behind-the-curtain stuff like that is really interesting. I also wonder about the comparative stabilities of each, but that's a whole different can of worms.
As far as C/C++ optimization levels go, you can for example favor speed over size (code/program size) by "unrolling" loops or "inlining" function calls. What that means is fairly straightforward: instead of the program counter (the "where am I now?") doing a jump instruction to some other part of the program, it is faster (or can be faster, especially for small functions/loops) to just copy/paste the function into where you are. In other words, if you were reading a book, instead of looking up where chapter or paragraph X is, you have just copy-pasted it right into where you are. This means your book now contains the exact same paragraph in several places, but for a small paragraph that is faster than "jumping".
Unrolling a loop works likewise: imagine a for loop iterating 3 times. You can either jump 3 times to the start or just write the exact same thing 3 times in a row.
Another type of optimization is to keep a variable in a register instead of memory, because accessing memory is slow compared to accessing a register. But you only have X amount of registers. And so on.
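To make the unrolling idea concrete, here is a hand-written illustration of what "write the same thing 3 times" looks like. It is only a sketch with made-up function names; in practice the compiler may or may not do this transformation on its own, depending on the optimization level.

```cpp
#include <cstdint>

// Rolled version: the loop counter is tested and the code jumps back
// to the top of the loop on every iteration.
uint32_t sumRolled(const uint8_t (&values)[3]) {
    uint32_t total = 0;
    for (int i = 0; i < 3; ++i) {
        total += values[i];
    }
    return total;
}

// Unrolled version: the loop body is written out three times,
// so there is no counter and no jump back, at the cost of more code.
uint32_t sumUnrolled(const uint8_t (&values)[3]) {
    uint32_t total = 0;
    total += values[0];
    total += values[1];
    total += values[2];
    return total;
}
```

Both functions return the same result; the difference is only in how many jump and compare instructions the generated code may need, which is the speed-versus-size trade-off described above.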
What do you need the speed for that makes it worth spending a lot of time coding in assembly language?
Very interesting, Arduino and assembly language. Perhaps a video with a multiplexing chip in the future. Thanks.
Assembly is something I know.. or I have a fairly decent knowledge of pre-pentium x86 assembly anyway. You can't use two memory operands at the same time. I also don't think you can use a memory operand with an immediate, but I'm not sure about that. I love assembly, I recently converted a pi decimal place calculator (written in pascal) to assembly to see what kind of speed improvement I could get. I forget the actual numbers but it was a fairly large difference. Pascal has a lot of overhead (as do many high-level languages.)
There's no reason you couldn't have those things in assembly, of course, it's just that I expect the added complexity of the chip would be stupid and impractical. But from a conceptual standpoint you could make a chip do anything you want as a "single opcode" if you're willing to spend the silicon.
@@simplyput2796 I think there must be a reason, otherwise it would've been implemented. There have been times where I have wanted to move data between memory locations (sort algorithms come to mind). I'm certain it could be done, I wonder if it's just either too complex or too time consuming (either in implementation or in cpu cycles).
Yes, the reason is cost v. benefit: Too much extra complexity for too little gain. It's more engineering effort and it's more die space they could spend on something else more complex, and I seriously doubt there would be significant time gains since interacting with RAM involves the RAM controller and other timing considerations, and so it makes good sense to do it more simply to allow the CPU to more easily do stuff like multithreading and hyperthreading based on what's waiting for what and when.
Also keep in mind this is AVR assembly, not x86. Being a microcontroller, this may well be something they did implement as it IS of significant benefit to the typical application of an MCU. It's not a general purpose computing engine, so something specific like this may well have been done. I don't know any facts about this, I'm just saying maybe they did, you can't assume x86 knowledge translates 1:1 to an AVR processor
excellent !
I now know I have so much more to learn...
Thanks for the video =)
Nice! Enjoyed it a lot
You can look at the compiler output of an Arduino sketch and maybe figure out what exactly the MCU is doing when it executes this code and how many clock cycles it takes by going to Compiler Explorer
godbolt.org/z/yCSps9
You can also convert your .elf files into plain .txt by using avr-objdump.exe which is included with the Arduino IDE.
I took the basic blink sketch and removed the delay commands so it blinks as fast as possible then clicked on Verify. This created an .elf file in this folder C:\Users\dentaku\AppData\Local\Temp\arduino_build_604254\ (you can find this temp folder by reading the verbose output of the IDE after the sketch has compiled)
I have a copy of avr-objdump in the root of C:\ so it's always in my Windows %path% so all I need to do is this...
avr-objdump -Sz -l Blink.ino.elf > Blink.ino.elf.txt
At the bottom of this ridiculously long text file you will find the code that belongs to setup(); and loop();
I don't really know what any of it means but it's there to help you see what's going on after your code is compiled.
For some reason AVR code is very hard to read, but the version of objdump.exe that comes with the STM32 core Arduino files gives me nice, easy-to-find chunks of code. I don't really understand those either, but at least you can then find out how many clock cycles each command uses.
void setup() {
  pinMode(PA4, OUTPUT);
}
void loop() {
  GPIOA->ODR = 0B10000; //PORTA Output Data Register
  GPIOA->ODR = 0B00000;
}
BECOMES
>>>>>>>>>
08000234 :
8000234: b510 push {r4, lr}
8000236: 2101 movs r1, #1
8000238: 2014 movs r0, #20
800023a: f001 fb57 bl 80018ec
800023e: bd10 pop {r4, pc}
08000240 :
8000240: 2390 movs r3, #144 ; 0x90
8000242: 2210 movs r2, #16
8000244: 05db lsls r3, r3, #23
8000246: 615a str r2, [r3, #20]
8000248: 2200 movs r2, #0
800024a: 615a str r2, [r3, #20]
800024c: 4770 bx lr
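One part of the listing above can be decoded by hand: the `movs r3, #144` / `lsls r3, r3, #23` pair builds a base address in r3, and each `str r2, [r3, #20]` writes to that base plus 20. The sketch below just redoes that arithmetic with hypothetical function names. The result, 0x48000000 plus offset 0x14, matches the GPIOA base and ODR offset on some STM32 families (for example the F0 series), but the comment does not say which board was used, so treat that mapping as an assumption.

```cpp
#include <cstdint>

// movs r3, #144 ; lsls r3, r3, #23  ->  r3 = 144 << 23
constexpr uint32_t baseAddress() {
    return uint32_t{144} << 23;
}

// str r2, [r3, #20]  ->  store to r3 + 20
constexpr uint32_t storeAddress() {
    return baseAddress() + 20;
}
```

So the two `str` instructions write 16 (binary 10000, bit 4 set) and then 0 to address 0x48000014, which lines up with the two `GPIOA->ODR` assignments in loop().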
Thanks for this. It saves me a lot of future research effort.
You didn't say how fast the Arduino could switch a pin high to low versus how fast a digital write could do it. I mean, you have an oscilloscope hooked up and everything; it shouldn't have been that hard. Smh