memcpy optimization - kazuho's memo pad

Apparently a dumb byte-by-byte copy loop can be up to 3x faster than gcc's built-in one, simply because the compiler auto-vectorizes it, lol

memcpy() compiled with vectorizing compilers

All current compilers for linux should support SSE2 auto-vectorization with
#include <string.h>
void *(memcpy)(void *restrict b, const void *restrict a, size_t n){
    char *s1 = b;
    const char *s2 = a;
    for(; 0<n; --n)*s1++ = *s2++;
    return b;
}
(snip)

x86-64 gcc memcpy()

(snip)

Linking in a user-compiled memcpy(), using the source code presented above, nearly always improves performance. In the cases where the glibc fails to find needed wide moves, performance increases by a factor of 3.
http://softwarecommunity.intel.com/Wiki/Linux/719.htm

Addendum, June 17:

I think the article's code puts restrict in the wrong place. With gcc 4.1.2, I confirmed that the following code gets auto-vectorized (-m64 -ftree-vectorize -msse2):

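#include <stddef.h>  /* size_t */

void *memcpy(void *b, const void *a, size_t n)
{
  /* restrict goes on the local pointers that the loop actually dereferences */
  char *__restrict__ s1 = b;
  const char *__restrict__ s2 = a;
  for(; 0<n; --n)*s1++ = *s2++;
  return b;
}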
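
Before linking a loop like this in place of the real thing, it seems worth checking that it copies correctly. Below is a minimal sketch of such a check. The names byte_copy and byte_copy_self_check are made up for illustration and are not from the article. The copy is deliberately not named memcpy here so that it can be compared against the libc version. As far as I know, newer gcc releases may also recognize this kind of loop and turn it back into a memcpy call, which would be a problem for a function that is itself called memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name: the same byte loop as above, but not called memcpy,
 * so the libc memcpy stays available as a reference. */
void *byte_copy(void *b, const void *a, size_t n)
{
    char *__restrict__ s1 = b;        /* destination bytes */
    const char *__restrict__ s2 = a;  /* source bytes, assumed not to overlap s1 */
    for (; 0 < n; --n)
        *s1++ = *s2++;
    return b;
}

/* Hypothetical helper: compares byte_copy with the libc memcpy for every
 * length from 0 to 64, so lengths around a 16-byte SSE2 vector are covered. */
void byte_copy_self_check(void)
{
    unsigned char src[64], got[64], want[64];
    size_t i, n;

    for (i = 0; i < sizeof src; ++i)
        src[i] = (unsigned char)(i * 7 + 1);   /* arbitrary non-zero pattern */

    for (n = 0; n <= sizeof src; ++n) {
        memset(got, 0, sizeof got);
        memset(want, 0, sizeof want);
        assert(byte_copy(got, src, n) == got); /* returns the destination */
        memcpy(want, src, n);
        /* compare the whole buffers so a write past n bytes is also caught */
        assert(memcmp(got, want, sizeof got) == 0);
    }
}
```

Calling byte_copy_self_check() from a small main() is enough to run it. To see whether the loop was actually vectorized, one option is to compile with the flags above plus an optimization level and look at the assembly from gcc -S for SSE2 moves in the loop body.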