Thanks for this complementary video. It was very helpful. How to decode the instructions from a byte stream is now very clear to me. But in the last video you mentioned that there are two pipelines (U and V) with different execution units. However, as far as I understood, the U and V decoders work together here to generate a single instruction stream. How is it then possible to feed both execution pipelines in parallel?
I'm glad it helped; your earlier comment is what prompted me to make these supplementary simulation videos.

The two pipelines can only execute sequential instructions, which also need to follow specific pairing rules to ensure that there are no dependency hazards or resource contention. If you look at the PMMX videos, that should make a bit more sense. There's also a section on micro-ops in the PMMX video that describes how more complex instructions are issued. The only difference between the P5 and P55C is how the length decoding is done and the inclusion of a transparent instruction FIFO - the pipeline issue / schedule logic is identical. So to make the comparison, think of the P5 as having the pairing logic and first-stage decoder squished into the alignment stage.
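To make the pairing rules concrete, here's a minimal sketch (in C) of the kind of check the issue logic performs. This is a simplified model under my own assumptions, not the actual hardware logic; the real P5 has more exceptions, e.g. around flag writes and paired PUSH/POP updates of ESP:

```c
/* Simplified U/V pairing check. The pairing classes and register
 * dependency rules follow the documented Pentium behaviour, but this
 * is an illustrative sketch, not the real issue logic. */
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    PAIR_NP,  /* not pairable            */
    PAIR_PU,  /* pairable in U pipe only */
    PAIR_PV,  /* pairable in V pipe only */
    PAIR_UV   /* pairable in either pipe */
} pairing_t;

typedef struct {
    pairing_t pairing;  /* pairing class from the opcode table */
    uint32_t  reads;    /* bitmask of registers read           */
    uint32_t  writes;   /* bitmask of registers written        */
} insn_t;

/* Returns true if 'v' may issue in the V pipe alongside 'u' in the U pipe. */
static bool can_pair(const insn_t *u, const insn_t *v)
{
    /* Both instructions must belong to a pairable class. */
    if (u->pairing == PAIR_NP || u->pairing == PAIR_PV) return false;
    if (v->pairing == PAIR_NP || v->pairing == PAIR_PU) return false;

    /* Read-after-write hazard: V may not read a register U writes. */
    if (u->writes & v->reads)  return false;

    /* Write-after-write hazard: V may not write a register U writes. */
    if (u->writes & v->writes) return false;

    return true;  /* both issue this cycle */
}
```

If the check fails, the V-slot instruction simply waits and goes down the U pipe on the next cycle, which is why the pairing rules matter so much for sustaining two instructions per clock.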
Does this mean you're writing a Pentium FPGA core next? ;) Enjoying this video series already, btw. I never knew the decoder stage was quite this complex.
This is meant to be just a supplementary video, not a series. If I didn't need to explain what was shown, I would have done it as a short, or even a GIF. The actual "series" is the one that I recently posted on x86 front-end complexity.

As for the decoder, this is actually just the feeder for the decoder in the Pentium P5. The PMMX (P55C), K6, P II/III (P6), and K7 are far more involved. I have simulation versions of all of them, which is how I got the IPC numbers in the final chart.
Another question: How does the length prediction work for the V decoder? What information does it take into account to vary the estimated length from 1 byte?
Length prediction isn't required for the V decoder, only the U decoder. The true length of both the U and V instructions is determined in the alignment stage, but only after the alignment itself is done. If the predicted U length is incorrect, the V instruction is squashed and only one instruction is issued that cycle. If the U length is correct, the V length is used to update the decode program counter.

The 1-byte estimate only applies until the instructions have been decoded once. There's a patent that describes the mechanism, where end bits are determined during decode (32 bits for 32 bytes, where a 1 marks the last byte of an instruction). Once the 32-byte buffer is decoded for the first time, the end bits are stored in a prediction cache to be used the next time that buffer page is accessed. The cache is likely quite small, though, so this will only matter for small loops (it's even possible that the cache has just one entry).
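Here's a minimal sketch of that end-bit mechanism, again in C. I'm assuming a single-entry prediction cache and inventing the names (endbit_cache_t, predict_length, record_length); the patent leaves the organisation open, so treat this as an illustration rather than the actual design:

```c
/* Illustrative end-bit length predictor for a 32-byte fetch buffer.
 * One bit per byte; a 1 marks the last byte of an instruction. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;       /* address of the 32-byte buffer page */
    uint32_t end_bits;  /* 32 bits for 32 bytes               */
    bool     valid;
} endbit_cache_t;

static endbit_cache_t cache;  /* assumed: a single entry */

/* Predict the U instruction length at 'offset' within the buffer.
 * With no cached end bits, the default estimate is 1 byte. */
static unsigned predict_length(uint32_t page, unsigned offset)
{
    if (!cache.valid || cache.tag != page)
        return 1;  /* cold buffer: fall back to the 1-byte estimate */

    for (unsigned i = offset; i < 32; i++)
        if ((cache.end_bits >> i) & 1u)
            return i - offset + 1;  /* distance to the next end bit */

    return 1;  /* instruction spills past the buffer: guess 1 byte */
}

/* After alignment has determined the true length, record the end bit so
 * the next pass over this buffer (e.g. a tight loop) predicts correctly. */
static void record_length(uint32_t page, unsigned offset, unsigned true_len)
{
    unsigned end = offset + true_len - 1;

    if (!cache.valid || cache.tag != page) {
        cache.tag      = page;   /* evict whatever was cached before */
        cache.end_bits = 0;
        cache.valid    = true;
    }
    if (end < 32)
        cache.end_bits |= 1u << end;
}
```

The squash behaviour then falls out naturally: compare predict_length() against the true length from alignment, and if they differ, issue only the U instruction that cycle and discard the speculatively aligned V instruction.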