
1) It would be much more expensive to invoke a function call there, unless you are expecting a compiler to generate it inline for you, but the compilers that do that will generally turn memory copies into a memcpy automatically (unfortunately sometimes, since in the case of any small rectangle, it is a net loss to do so).

2) Usually not, although if you were trying to be performant, you would not use memcpy anyway; you would write the proper AVX code to do full 32- or 64-byte copies. There is no need to add the overhead of memcpy, because you usually know things about the buffer that memcpy does not.

3) You can tell they did not expect you to do that because you are not passed the bounds of the buffers. If you were, though, extending this code to handle this is trivial: you just min/max the bounds before copying. The loops stay exactly the same.

- Casey
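
As a rough illustration of the min/max clipping described in point 3, here is a minimal sketch in C. It is not the code from the original post: the `Buffer` struct, the `copy_rect_clipped` function, and the one-byte-per-pixel layout are all assumptions made for the example. The visible range is computed once up front, and the copy loops themselves keep the same shape they would have without clipping.

```c
#include <stdint.h>

/* Hypothetical buffer description (an assumption for this sketch):
   one byte per pixel, `pitch` bytes from one row to the next. */
typedef struct {
    uint8_t *pixels;
    int width, height, pitch;
} Buffer;

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Copy a w-by-h rectangle from (sx, sy) in src to (dx, dy) in dst,
   clipped against the bounds of both buffers.
   Returns the number of pixels actually copied. */
static int copy_rect_clipped(Buffer dst, int dx, int dy,
                             Buffer src, int sx, int sy,
                             int w, int h)
{
    /* Clip in rectangle-local coordinates: [x0, x1) by [y0, y1) is the
       part of the rectangle that lies inside both buffers. */
    int x0 = max_int(0, max_int(-sx, -dx));
    int y0 = max_int(0, max_int(-sy, -dy));
    int x1 = min_int(w, min_int(src.width - sx, dst.width - dx));
    int y1 = min_int(h, min_int(src.height - sy, dst.height - dy));

    if (x0 >= x1 || y0 >= y1) {
        return 0; /* nothing visible after clipping */
    }

    int count = x1 - x0;
    for (int y = y0; y < y1; ++y) {
        /* Row pointers are formed only after clipping, so they stay in bounds. */
        const uint8_t *s = src.pixels + (sy + y) * src.pitch + (sx + x0);
        uint8_t *d = dst.pixels + (dy + y) * dst.pitch + (dx + x0);
        for (int x = 0; x < count; ++x) {
            d[x] = s[x];
        }
    }
    return count * (y1 - y0);
}
```

With two 4x4 buffers, for instance, asking for a 3x3 copy from source (2, 2) to destination (-1, -1) would clip down to a single pixel, while a rectangle that lies entirely outside either buffer copies nothing.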


I guess I would hope that memcpy isn't copying byte by byte, since before you even get to AVX you could do at least 8 bytes at a time these days. I guess this is an area where you should just have total faith in the compiler to do the right thing? If so, do you think the compiler has an easier time reasoning about the code as you wrote it or about memcpy?


There are a lot of misconceptions about memcpy that people have, I think because of bad YouTube videos in the past :) Compilers automatically generate larger copies for you:

https://godbolt.org/z/oYYrv9vfY

You do not need to call memcpy to have this happen. Mostly what you get if you actually call memcpy (as opposed to the compiler recognizing it and generating the code inline) is a bunch of setup overhead that is not specialized to the task at hand. For example, you might end up with something bad like this:

https://godbolt.org/z/nrz9YzTnv

which you DEFINITELY do not want.

- Casey
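
For readers who want to try this themselves, a plain copy loop of the kind discussed above might look like the sketch below. The function name `copy_bytes` and the use of `restrict` are assumptions for illustration and are not taken from the linked examples. What the optimizer does with such a loop (wide inline moves, a vectorized loop, or a call to memcpy) depends on the compiler, its version, and the flags used, so it is worth inspecting the generated assembly, for example with `clang -O2 -S`, rather than assuming.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: copy `count` bytes with an ordinary loop.
   `restrict` (C99) promises the compiler that the two buffers do not
   overlap, which removes one obstacle to vectorizing the loop. */
static void copy_bytes(uint8_t *restrict dst,
                       const uint8_t *restrict src,
                       size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}
```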


Oh wow. I actually do find that pretty shocking. Any idea why your for loop and memcpy would be interpreted differently by the compiler? Is the idea not conceptually the same thing? It feels like memcpy could not be more explicit about the author's intentions.


Well, loops in general will be unrolled and vectorized by Clang, so that's just what you would expect to happen. To get it to do that with memcpy, you would first have to make sure it was going to use a builtin version, so it knew what it was - which honestly I thought it did by default, but maybe I am wrong about that? There may be some additional compiler switches you could add to make it do that... since I never rely on builtin memcpy/memset, I am not sure what they would be.

- Casey
