Microsoft Intern Interview Question #2: String Copy
Question #2 should have been trivial, but because of some weird suggestions by the interviewer, it requires us to do a bit of compiler archaeology!
There are no Performance-Aware Programming lessons this week because it is Summer Internship 1994 Week here on Computer, Enhance! This week will feature five posts about the programming questions I was asked when I interviewed for a Microsoft summer internship way, way back in 1994. Normally scheduled lessons will resume next week.
The second question I was asked in my 1994 Microsoft internship interview was to implement a version of strcpy. I don't actually remember whether the parameter order was strcpy order, or a more Microsoft order, which would have had the parameters reversed. But since it wasn’t an API design question, that didn’t matter — all that mattered was if I could write the C code for something like this:
void CopyString(char *From, char *To)
{
/* Fill this in */
}
This is a very simple thing to write in C. You aren’t even asked to perform a bounds check (I should certainly hope one would be included in a modern version of the question!), so you literally just have to copy characters from the From buffer to the To buffer until you hit a 0 (the null terminator).
That’s it.
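As an aside, a modern, bounds-checked version of the question might look something like the following sketch. The function name, the MaxCount parameter, and the truncation behavior here are my own inventions, not part of the original question:

```c
#include <stddef.h>

/* Hypothetical bounds-checked variant: copies at most MaxCount - 1
   characters, always null-terminates when MaxCount > 0, and returns
   the number of characters actually copied. */
size_t CopyStringBounded(char *From, char *To, size_t MaxCount)
{
    size_t i = 0;
    if(MaxCount > 0)
    {
        while((i < MaxCount - 1) && From[i])
        {
            To[i] = From[i];
            ++i;
        }
        To[i] = 0; /* Always terminate, even when truncating */
    }
    return i;
}
```

Returning the copied count also lets the caller detect truncation by comparing against MaxCount - 1, which is one reasonable design choice among several.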
I’m not sure what I wrote on the board, but the interviewer acknowledged that I wrote working code right off the bat, so I’m fairly sure I didn’t mess anything up. But I do believe I used an index — something like this:
int i = 0;
while(To[i] = From[i])
{
i++;
}
I’m pretty sure I didn’t know the for loop version of this at the time, but an alternate version of the same loop is just:
for(int i = 0; To[i] = From[i]; i++);
However you want to write it, it’s very simple.
Unfortunately, although they acknowledged that it worked, the interviewer didn’t seem to like this. They poked and prodded at me for a while, and made me make changes to this code, until eventually I ended up with another standard C form of the loop:
while(*To++ = *From++);
You’ll notice that this version increments the pointers themselves instead of the index. This is really the only difference between the two.
This version was what the interviewer wanted. Once I wrote that, they were like, “great, OK.”
Now, this always struck me as a bit strange. At the time, I assumed that the interviewer knew something I didn’t — that perhaps it was a bad idea to use the index, because the code would be slower, or something like this. It certainly couldn’t be for readability, because if anything, the *Ptr++ form is much less readable to the average programmer than an indexed form! But since I wasn’t a very good programmer at the time, I didn’t really know how to poke back at the interviewer and find out what they were thinking.
Times have changed, of course.
Nowadays, I am all too familiar with the idea that high-level language expressions can result in shockingly different low-level output from the compiler. Even changes that appear to make code more efficient, judging by their notation in the high-level language, may in fact make the code worse, because the code the compiler generates for them is worse for some reason.
So I have often wondered, did the interviewer actually know that compilers of that era would generally produce better code for the version he wanted me to write? Or was he actually ignorant of the details here, and merely thought that it would be more efficient the way he had me write it?
Stated more pithily, who was the real intern in this interview — me, or the full-time employee?
While it’s hard to know for sure, thanks to emulation, we can hazard a guess. Here is a program we can actually compile on old x86 compilers from the 8086-80386 era:
#include <stdio.h>
void CopyString_Microsoft(char *From, char *To)
{
while(*To++ = *From++);
}
void CopyString_Casey(char *From, char *To)
{
for(int i = 0; To[i] = From[i]; ++i);
}
int main(void)
{
char *Source = "1994";
char MSFTDest[32];
char CaseyDest[32];
CopyString_Microsoft(Source, MSFTDest);
CopyString_Casey(Source, CaseyDest);
printf(" MSFT: %s\n", MSFTDest);
printf("Casey: %s\n", CaseyDest);
return 0;
}
I was able to get a version of Borland C++ 3.0 from 1991 running in DOSBox. Unlike Turbo C++, it has a full code optimizer (which I assume was part of the upsell). If we compile this code with “fastest code” options, we get the following ASM for my index-based notation, which will be very familiar to all of you who have completed Part 1 of the Performance-Aware Programming series:
loop:
inc si ; Advance the source address
inc di ; Advance the destination address
mov al,[si] ; Read a byte from the source into the AL register
mov [di],al ; Write the byte from AL out to the destination
or al,al ; Test to see if the byte copied was 0 (null)
jnz loop ; Repeat the loop until the null terminator
This actually seems like quite good code generation! Bizarrely, the optimizer does not seem to have fused the INC/MOV pairs into LODSB and STOSB, which I believe would be significantly faster. I say “bizarrely” because it already is using AL as the intermediate register, which is what LODSB and STOSB require. So that’s a bit puzzling, but other than that, the code looks great. If you were actually trying to make this run faster, you would probably have to unroll it several times (jumps were very expensive in those days), but otherwise, not a lot to complain about. [If you are curious about why the increment happens first in this code, check the addendum I’ve added to the end of the article, where I go through the setup code that executes before the loop begins.]
On the other hand, here is the codegen for the Microsoft employee’s preferred notation:
loop:
mov bx, dx ; Move source address into bx so it can be used with mov
inc dx ; Advance the source address
mov al,[bx] ; Read a byte from the source into the AL register
mov bx, cx ; Move destination address into bx so it can be used with mov
inc cx ; Advance the destination address
mov [bx],al ; Write the byte from AL out to the destination
or al,al ; Test to see if the byte copied was 0 (null)
jnz loop ; Repeat the loop until the null terminator
This is essentially the same code, but the compiler has chosen to put the source and destination addresses in the CX and DX registers instead of SI and DI. This, of course, is a big problem for code generation, because the effective address expressions allowed by MOV on the 8086 do not permit addressing with CX or DX. So the compiler has to copy the pointers into BX, which can be used for effective addresses, in order to execute the MOVs.
Because MOVs between registers usually take 2 cycles, this means that the employee’s preferred notation actually slows down the loop by 4 cycles each iteration. Perhaps it was not a very good idea to suggest that an intern rewrite his faster notation into your slower notation, Mr. Employee!
Looking at the ASM, we can imagine what is probably going on with the compiler here. Because my notation makes it clear to the compiler that you are iterating over two buffers with one index, it was probably able to use built-in knowledge of how to do that quickly. There is probably special code in the optimizer’s 8086 codegen that knew it could use SI/DI that way, since that is a common pattern on the notoriously hard-to-generate-code-for 8086.
By contrast, the Microsoft employee’s notation just looks to the compiler like some pointer accesses. As a result, it probably goes through a path that does not special case this kind of copy, so it treats the accesses like generic pointer accesses. That generic case probably assumes there will be several pointers, so it should just move them into BX when it goes to use them — it probably doesn’t have sufficiently advanced pattern matching to realize that this is a buffer-to-buffer copy using the same index.
The takeaway from all this is something that is still applicable today: never assume that the compiler will generate something more efficient just because it appears less complicated in high-level language notation. Compilers have to do a lot of complex pattern-matching and heuristic work to generate good code. Oftentimes, the best high-level code will be code that lets the compiler see the pattern clearly, rather than the code that has been “optimized” to use the fewest high-level expressions.
This is still true today, although perhaps not in as extreme a form as it would have been in the 8086 days. It pays to learn which high-level patterns translate into efficient low-level code, and which ones don’t.
The good news is, you don't really have to become an assembly language expert to do that. You just have to learn enough about assembly language to inspect the output of your tools, and know when they’re doing something egregiously inefficient. Today’s CPUs are much more forgiving than those in the early x86 era, and can tolerate worse codegen. You just have to make sure the codegen isn’t horrible, which it unfortunately sometimes still is!
And of course, learning to read assembly language and make estimates about performance is precisely what we do here in the Performance-Aware Programming Course. So if you’re not already taking the course, and you’re interested in learning, you can check out our subscription options right here:
Addendum
Some folks who know 8086 were wondering why the increments in the codegen from my version happened before the byte copy. While all the code is visible in the video, if you only read the article, you don’t have the necessary context to see what’s happening — so I’ve included a longer version here:
push si ; Save SI register, since we need it
push di ; Save DI register, since we need it
mov si,[bp+04] ; Load SI with the From parameter
mov di,[bp+06] ; Load DI with the To parameter
jmp skipinc ; Skip over the increments the first time
loop:
inc si ; Advance the source address
inc di ; Advance the destination address
skipinc:
mov al,[si] ; Read a byte from the source into the AL register
mov [di],al ; Write the byte from AL out to the destination
or al,al ; Test to see if the byte copied was 0 (null)
jnz loop ; Repeat the loop until the null terminator
Again, I’m not sure why the compiler didn’t go with the more concise construction of just incrementing afterward… which, again, could have been done with LODSB and STOSB for even more efficiency. But presumably something about the way the compiler usually outputs for loops led to this being the construction it chose. Since I’m not very familiar with 8086 idioms, I couldn’t say whether there was some common reason why you would have wanted to structure for loops like this most of the time.
Memory and CPU were very constrained in those days, so it is impressive that the compilers were even able to do this good of a job in DOS. This compiler doesn’t even run in protected mode — so it only had 640k to work with! The confusing part, perhaps, is not so much that it missed the optimization, but that it was able to produce code this good for 8086 at all.
Also, since I’ve added this addendum, and some 8086 people might be reading it, I might as well include what I assume to be the “correct” way to write this loop in 8086:
loop:
lodsb ; Read from source into AL and advance the source address
stosb ; Write from AL into dest and advance the dest address
or al,al ; Test to see if the byte copied was 0 (null)
jnz loop ; Repeat the loop until the null terminator
My reasoning here is that the clock counts on an 8086 would be as follows:
Borland C++ version of the loop (estimated at 50 cycles):
inc si ; +2 = 2
inc di ; +2 = 4
mov al,[si] ; +13 = 17 (8 + 5 effective address calc)
mov [di],al ; +14 = 31 (9 + 5 effective address calc)
or al,al ; +3 = 34
jnz loop ; +16 = 50
My version of the loop (estimated at 42 cycles):
lodsb ; +12 = 12
stosb ; +11 = 23
or al,al ; +3 = 26
jnz loop ; +16 = 42
So it seems like it would be an 8-cycle win to use the string instructions instead of manually INC’ing and MOV’ing. But this is based solely on the reference manual. I don’t even have an 8086, so I can’t test any of these assumptions! It may be that the reference manual is wrong about the cycle counts for these instructions, and people who actually programmed the 8086 knew better (such as the people who made Borland C++).
Also, according to the manual, the clocks on an 80386, while lower across the board, don’t seem to suggest that the string instructions would be slower even on that chip…
I would love to hear from an 8086, 80286, or 80386 expert if one happens to read this!
I believe a likely explanation for why the interviewer wanted you to write it that way is because that is considered the "idiomatic" way.
If you have a copy of "The C Programming Language" 2nd edition handy, flip to page 106 (or look up strcpy in the index). It says "Although this may seem cryptic at first, the notational convenience is considerable, and the idiom should be mastered, because you will see it frequently in C programs."
Whether it is the most readable is, of course, debatable, but hopes of better compiler output may not have been the reason the interviewer wanted you to write it this way.
Maybe the interviewer wanted to check whether the potential intern knew about pointer arithmetic? That you can, in fact, do something like *From++.
That, and of course terseness for terseness's sake.