Drawing a circle is an interesting challenge. I think I came up with an elegant solution. It isn’t particularly difficult, but it does require some aha-moments. My key insights were:
1) If you know one point on the circle, you can only go to one of three pixels from there. For each of these three pixels, calculate the distance to the center of the circle and pick the one that’s closest.
For example, let’s say we’re drawing the upper-right quadrant. We know that the pixel to the right of the center – at (radius, 0) – is on the circle. Since this is the upper-right quadrant, there are only three possible pixels that connect to it, to the left at (radius - 1, 0), to the top at (radius, -1), or to the topleft at (radius - 1, -1). Pick the pixel that’s closest to the center and repeat until you reach to point above the center.
2) You don’t need to use the square root.
Normally, calculating the distance to the center requires a call to sqrt(), but we’re never interested in the distance itself. All we need is to compare distances and squared distances work for that just as well.
3) If you know the distance to the center for the current pixel, it’s easy to calculate the distance when you move one pixel, because it changes by a predictable amount.
Suppose you know that the distance to the center is 0 for pixel (6, 8). Since this is in the bottom-right quadrant, moving one pixel up will get you closer to the center, but by how much? Well, the distance changes from 6^2 + 8^2 to 6 ^2 + 7^2 and the difference between 8^2 and 7^2 is 2 * 8 - 1. In general, if you move from x^2 to (x-1)^2, the difference is 2 * x - 1. You can also go in the other direction: if you move from x^2 to (x+1)^2, the difference is 2 * x + 1.
Coming up with those insights and verifying their correctness was the tricky part. After that, implementing the algorithm was pretty straightforward.
The third question is really interesting. I have this idea: Put the color you are looking for, e.g. binary 01, into a byte byte four times, e.g. 01010101. Then xor this with the byte you are looking at. Now the result will have two zeros where the color was found. E.g.:
01 00 11 01 the four pixels in our buffer
01 01 01 01 the color we look for
00 01 10 00 pixels XOR color
The first and last pixel are the color we are looking for.
I thus need to check whether the first two bits are zero, the second two bits are zero, the third two bits are zero or the last two bits are zero. Also multiple of these can be true, if the color exists more than once in these four pixels.
This could be done with four if-statements but we want to be fast. I just listed all bytes that meet any of these conditions, i.e. that have at least one 00 pair in the four positions. I tried to find a common pattern in these, e.g. every byte <= 84 means we found the color. But I do not really find a nice way to put a fence around these numbers using bit manipulation.
My answer would thus be to have a look-up table of 256 entries, i.e. an array of 256 bytes, and put 1 where there is a match.
The first step, making 01 into 01010101, can also be done with a four-element lookup-table.
I started learning programming with resources from around 1995 and I remember lookup-tables to be used a lot for graphics, so maybe this was the way they beat Borland :-D
The answer for me would then be:
int ContainsColor(char unsigned Pixel, char unsigned Color) {
One way you could check if any pair had 00 after the xor without a lookup table would be to extract the left and right bits of each pair
right_bits = xored & 0b01010101;
left_bits = (xored >> 1) & 0b01010101;
Then you can return ((left_bits | right_bits) != 0b01010101), because left_bits | right_bits would only not match the test value if at least one of the xored pairs was 00.
I like how the mask can be used to make the two bit color into four occurrences of the color, but also for masking the XOR result and for comparison at the end. Also the != operator can be done with XOR as well. This looks like we might get this smaller still...
Woah that multiplication trick is really cool, but totally makes sense when you think about it! I was curious if it would be faster to duplicate the color by multiplying or by shifting a few times and bitwise oring, since the multiplication would use less instructions but the instruction itself could take more cycles than a shift. I compared those two implementations, your lut implementation, Paul's version, and a basic version that just compares each pair one at a time.
You can see my tests and results here if you're curious, though it won't compile without some modifications cause it depends on some of my own code:
The multiply version was faster than the shift version by a little bit, so in this case its better to do the one with less instructions. Of all the simd ones Paul's one with multiply was consistently the fastest, because it got compiled down to having just barely the fewest instructions. It might be possible to hand code something in asm with less instructions than that but i didn't try.
But surprisingly it turns out the dumb basic non-simd, non-lut version is by far the fastest, even though it has the most instructions! I think the reason its the fastest is because its able to do an early return as soon as it finds a matching pair, while all the other versions have to always go through all of their instructions. Although if the pixel and color are random then I think the expected number of instructions for it to execute is 14.67, which is roughly where all the other implementations are at, so that doesn't really explain to me why its so much faster (although my rng for the pixels and colors is terrible so maybe that's why).
Even though the lut version has by far the fewest instructions it takes a similar amount of time as the other simd ones so I guess the memory latency cancels out the number of instructions advantage pretty closely in this case.
I didn't test this in the situation where you would already have the color duplicated in a preprocessing step before comparing against a bunch of pixels so idk how that would change the results, though I assume it would make the simd versions come closer to the basic version.
Anyway it was pretty fun to compare the different versions and try to understand the results. I'd be curious to see if anyone can come up with a faster implementation.
I think I did something similar to what Ivy suggests. Here's my code:
int ContainsColor(char unsigned Pixel, char unsigned Color)
{
unsigned char x = Pixel ^ ~Color;
unsigned char y = x & (x >> 1);
unsigned char z = y & 0x55;
return z;
}
C doesn't have XNOR so I bit-wise invert Color so that "these bits match" is indicated by a 1 instead of 0 (this could be done in the precomputation, though).
The color is present where ever there are pairs of 1s (aligned to 2 bit boundaries, of course), so I right-shift x one bit and bit-wise AND with x. Then I just mask off the odd bits of the result using "& 0x55". If the result is non-zero, the color occurs at least once in Pixel.
What's neat is that if the Pixel was instead 32 bits, the only change to the code would be to make Pixel and Color u32s and it would work exactly the same, no additional instructions would be needed.
And if the number of bits in each pixel was increased to 3, you would just need one more shift and AND. (The mask at the end would change to 0111 -- octal comes in handy here!)
At first I though that the fourth questions code would be pretty complicated, but a basic implementation of it without any optimizations is pretty simple, a single loop with one if inside.
The three main ideas is:
1. A circle contains 8 identical quadrants
2. If you follow the line in a single quadrant the line continues straight or goes straight and one down. That is during a single step, if x changes by 1, y can only change by 0 or 1.
3. During each step you only need to check if you are still inside the circle and if not, just move a tiny bit on one of the axis to stay inside.
So it comes out to be that you always increment x by one and occasionally decrement y by one. Then when you have one quadrant, you mirror it 7 times more times.
About question 4, it seems pretty straight-forward to me so I feel like I must be missing something. Can't you just do this? (My code assumes the availability of an integer sqrt function, but if that isn't available you can just use a LUT).
As you might imagine, during the 8086-80386 days (which is when this took place), integer multiplication was 100+ cycles on an 8086, and between 9 and 22 cycles on a 80386. So even just your i*i term is probably already, by itself, slower than the actual algorithm the interviewer was expecting - even before we get to the fact that you would definitely need the integer square root table, since there is nothing even remotely similar to an integer square root on these processors.
Drawing a circle is an interesting challenge. I think I came up with an elegant solution. It isn’t particularly difficult, but it does require some aha-moments. My key insights were:
1) If you know one point on the circle, you can only go to one of three pixels from there. For each of these three pixels, calculate the distance to the center of the circle and pick the one that’s closest.
For example, let’s say we’re drawing the upper-right quadrant. We know that the pixel to the right of the center – at (radius, 0) – is on the circle. Since this is the upper-right quadrant, there are only three possible pixels that connect to it, to the left at (radius - 1, 0), to the top at (radius, -1), or to the topleft at (radius - 1, -1). Pick the pixel that’s closest to the center and repeat until you reach to point above the center.
2) You don’t need to use the square root.
Normally, calculating the distance to the center requires a call to sqrt(), but we’re never interested in the distance itself. All we need is to compare distances and squared distances work for that just as well.
3) If you know the distance to the center for the current pixel, it’s easy to calculate the distance when you move one pixel, because it changes by a predictable amount.
Suppose you know that the distance to the center is 0 for pixel (6, 8). Since this is in the bottom-right quadrant, moving one pixel up will get you closer to the center, but by how much? Well, the distance changes from 6^2 + 8^2 to 6 ^2 + 7^2 and the difference between 8^2 and 7^2 is 2 * 8 - 1. In general, if you move from x^2 to (x-1)^2, the difference is 2 * x - 1. You can also go in the other direction: if you move from x^2 to (x+1)^2, the difference is 2 * x + 1.
Coming up with those insights and verifying their correctness was the tricky part. After that, implementing the algorithm was pretty straightforward.
https://gist.github.com/joost-basic-bits/853215920d84fa1796ee5e500c90eda4
The third question is really interesting. I have this idea: Put the color you are looking for, e.g. binary 01, into a byte byte four times, e.g. 01010101. Then xor this with the byte you are looking at. Now the result will have two zeros where the color was found. E.g.:
01 00 11 01 the four pixels in our buffer
01 01 01 01 the color we look for
00 01 10 00 pixels XOR color
The first and last pixel are the color we are looking for.
I thus need to check whether the first two bits are zero, the second two bits are zero, the third two bits are zero or the last two bits are zero. Also multiple of these can be true, if the color exists more than once in these four pixels.
This could be done with four if-statements but we want to be fast. I just listed all bytes that meet any of these conditions, i.e. that have at least one 00 pair in the four positions. I tried to find a common pattern in these, e.g. every byte <= 84 means we found the color. But I do not really find a nice way to put a fence around these numbers using bit manipulation.
My answer would thus be to have a look-up table of 256 entries, i.e. an array of 256 bytes, and put 1 where there is a match.
The first step, making 01 into 01010101, can also be done with a four-element lookup-table.
I started learning programming with resources from around 1995 and I remember lookup-tables to be used a lot for graphics, so maybe this was the way they beat Borland :-D
The answer for me would then be:
int ContainsColor(char unsigned Pixel, char unsigned Color) {
return Matches[FourTimes[Color] ^ Pixel];
}
One way you could check if any pair had 00 after the xor without a lookup table would be to extract the left and right bits of each pair
right_bits = xored & 0b01010101;
left_bits = (xored >> 1) & 0b01010101;
Then you can return ((left_bits | right_bits) != 0b01010101), because left_bits | right_bits would only not match the test value if at least one of the xored pairs was 00.
This is awesome, great idea! I think we have our winner then :-)
int ContainsColor(char unsigned Pixel, char unsigned Color) {
const char unsigned Mask = 0b01010101;
char unsigned WantZeros = (Mask * Color) ^ Pixel;
return (WantZeros & Mask) | ((WantZeros>>1) & Mask) != Mask;
}
This uses zero look-up tables instead of two :-)
I like how the mask can be used to make the two bit color into four occurrences of the color, but also for masking the XOR result and for comparison at the end. Also the != operator can be done with XOR as well. This looks like we might get this smaller still...
Woah that multiplication trick is really cool, but totally makes sense when you think about it! I was curious if it would be faster to duplicate the color by multiplying or by shifting a few times and bitwise oring, since the multiplication would use less instructions but the instruction itself could take more cycles than a shift. I compared those two implementations, your lut implementation, Paul's version, and a basic version that just compares each pair one at a time.
You can see my tests and results here if you're curious, though it won't compile without some modifications cause it depends on some of my own code:
https://pastebin.com/gqMcMyYW
The multiply version was faster than the shift version by a little bit, so in this case its better to do the one with less instructions. Of all the simd ones Paul's one with multiply was consistently the fastest, because it got compiled down to having just barely the fewest instructions. It might be possible to hand code something in asm with less instructions than that but i didn't try.
But surprisingly it turns out the dumb basic non-simd, non-lut version is by far the fastest, even though it has the most instructions! I think the reason its the fastest is because its able to do an early return as soon as it finds a matching pair, while all the other versions have to always go through all of their instructions. Although if the pixel and color are random then I think the expected number of instructions for it to execute is 14.67, which is roughly where all the other implementations are at, so that doesn't really explain to me why its so much faster (although my rng for the pixels and colors is terrible so maybe that's why).
Even though the lut version has by far the fewest instructions it takes a similar amount of time as the other simd ones so I guess the memory latency cancels out the number of instructions advantage pretty closely in this case.
I didn't test this in the situation where you would already have the color duplicated in a preprocessing step before comparing against a bunch of pixels so idk how that would change the results, though I assume it would make the simd versions come closer to the basic version.
Anyway it was pretty fun to compare the different versions and try to understand the results. I'd be curious to see if anyone can come up with a faster implementation.
I think I did something similar to what Ivy suggests. Here's my code:
int ContainsColor(char unsigned Pixel, char unsigned Color)
{
unsigned char x = Pixel ^ ~Color;
unsigned char y = x & (x >> 1);
unsigned char z = y & 0x55;
return z;
}
C doesn't have XNOR so I bit-wise invert Color so that "these bits match" is indicated by a 1 instead of 0 (this could be done in the precomputation, though).
The color is present where ever there are pairs of 1s (aligned to 2 bit boundaries, of course), so I right-shift x one bit and bit-wise AND with x. Then I just mask off the odd bits of the result using "& 0x55". If the result is non-zero, the color occurs at least once in Pixel.
What's neat is that if the Pixel was instead 32 bits, the only change to the code would be to make Pixel and Color u32s and it would work exactly the same, no additional instructions would be needed.
And if the number of bits in each pixel was increased to 3, you would just need one more shift and AND. (The mask at the end would change to 0111 -- octal comes in handy here!)
At first I though that the fourth questions code would be pretty complicated, but a basic implementation of it without any optimizations is pretty simple, a single loop with one if inside.
The three main ideas is:
1. A circle contains 8 identical quadrants
2. If you follow the line in a single quadrant the line continues straight or goes straight and one down. That is during a single step, if x changes by 1, y can only change by 0 or 1.
3. During each step you only need to check if you are still inside the circle and if not, just move a tiny bit on one of the axis to stay inside.
So it comes out to be that you always increment x by one and occasionally decrement y by one. Then when you have one quadrant, you mirror it 7 times more times.
Here is my simple implementation circle outline: https://git.rpuzonas.com/rpuzonas/computer-enhance-intern-questions/src/branch/main/main.c#L68
About question 4, it seems pretty straight-forward to me so I feel like I must be missing something. Can't you just do this? (My code assumes the availability of an integer sqrt function, but if that isn't available you can just use a LUT).
void OutlineCircle(int Cx, int Cy, int R)
{
int R_2 = R*R;
int final = R*7071/10000;
for (int i = 0; i <= final; i++) {
int j = sqrt(R_2 - i*i);
Plot(Cx+i, Cy+j);
Plot(Cx+j, Cy+i);
Plot(Cx-i, Cy+j);
Plot(Cx-j, Cy+i);
Plot(Cx+i, Cy-j);
Plot(Cx+j, Cy-i);
Plot(Cx-i, Cy-j);
Plot(Cx-j, Cy-i);
}
}
As you might imagine, during the 8086-80386 days (which is when this took place), integer multiplication was 100+ cycles on an 8086, and between 9 and 22 cycles on a 80386. So even just your i*i term is probably already, by itself, slower than the actual algorithm the interviewer was expecting - even before we get to the fact that you would definitely need the integer square root table, since there is nothing even remotely similar to an integer square root on these processors.
- Casey
Awesome! Love this type of problem solving :D