[ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

Michael Steil mist at c64.org
Wed Aug 5 06:50:19 CEST 2009


On 4 Aug 2009, at 17:37, Jose Catena wrote:
>> but how would you want to optimize "rep stosd" anyway?
>
> No way. That's what I said, possibly with the exception of using a 64 bit
> equivalent if we could assume that the CPU is 64 bit capable.
> But Alex knows better, he is calling me ignorant. He says that
>
> L1:	Mov [edi], eax
> 	Add edi, 4
> 	Dec ecx
> 	Jnz L1
>
> Is faster than
>
> 	rep stosd
>
> Both things do exactly the same thing, the latter much smaller AND FASTER in
> any CPU from the 386 to the i7.

I have done some tests on all generations of Intel CPUs since Yonah,
and in all cases, rep stosd was faster than any scalar loop I could
craft or GCC would generate from my C code.

But this does *not* mean that
* rep stosd is by definition faster than a scalar loop
* rep stosd is by definition faster than any kind of loop.

Look at the test program at the end of this email. It compares rep  
stosd with a hand-crafted loop written with SSE instructions and SSE  
registers (parts borrowed from XNU).

On all tested machines, the SSE version is significantly faster (for  
big loops):

Yonah: Genuine Intel(R) CPU           T2500  @ 2.00GHz
SSE is 3.34x faster than stosl

Merom: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
SSE is 4.86x faster than stosl

Penryn: Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
SSE is 4.94x faster than stosl

Nehalem: Intel(R) Xeon(R) CPU           E5462  @ 2.80GHz
SSE is 4.62x faster than stosl

So one should not assume that always using rep stosd is a good idea.
Use memset(), and keep an optimized implementation of memset() in one
place: one that can be inlined, checks the size, and branches to the
optimal implementation for that size, as XNU does, for example:

http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
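
To make the "check the size and branch" idea concrete, here is a minimal
sketch in portable C. It is not the XNU code: the names fill32 and
fill32_bulk and the cutoff FILL32_SMALL_MAX are made up for illustration,
and the bulk path is only a plain loop standing in for whichever variant
(SSE, rep stos) measures fastest on the target CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff in 32-bit words; the right value has to be measured
 * on the target CPU. */
#define FILL32_SMALL_MAX 64

/* Out-of-line bulk path for large fills. A plain loop here; this is where
 * an SSE or rep stos variant would live in a tuned implementation. */
static void
fill32_bulk(uint32_t *dst, uint32_t value, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++)
		dst[i] = value;
}

/* Inlinable front end: small fills are done on the spot, everything else
 * branches to the bulk routine. */
static inline void
fill32(uint32_t *dst, uint32_t value, size_t count)
{
	if (count <= FILL32_SMALL_MAX) {
		while (count--)
			*dst++ = value;
	} else {
		fill32_bulk(dst, value, count);
	}
}
```

Callers would only ever see fill32(), so the choice of bulk variant can
change per CPU without touching them.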

   Michael


#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define MIN(a,b) ((a)<(b)? (a):(b))

#define DATASIZE (1024*1024)
#define TIMES 10000

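/* Read the time stamp counter. The "=A" constraint means the edx:eax pair
 * only on 32-bit x86, so this program has to be built with -m32. */
static inline long long
rdtsc64(void)
{
	long long ret;
	__asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
	return ret;
}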

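/* Fill DATASIZE bytes at p with the contents of xmm0, 64 bytes per
 * iteration. p must be 16-byte aligned (movdqa). %ecx is a byte offset
 * that walks down from the last 64-byte block to 0, so it has to start
 * at DATASIZE-64, not at the number of ints. */
static inline void
sse(int *p) {
	int c_new;
	char *p_new;
	asm volatile (
		"1:				\n"
		"movdqa  %%xmm0,(%%edi,%%ecx)	\n"
		"movdqa  %%xmm0,16(%%edi,%%ecx)	\n"
		"movdqa  %%xmm0,32(%%edi,%%ecx)	\n"
		"movdqa  %%xmm0,48(%%edi,%%ecx)	\n"
		"subl    $64,%%ecx		\n"
		"jns     1b			\n"
		: "=D"(p_new), "=c"(c_new)
		: "D"(p), "c"(DATASIZE - 64)
		: "memory"
	);
}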

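/* Fill DATASIZE bytes at p with the 32-bit value 1, using rep stosl.
 * %ecx counts 32-bit words here, not bytes. */
static inline void
stos(int *p) {
	int c_new;
	char *p_new;
	asm volatile (
		"rep stosl"
		: "=D"(p_new), "=c"(c_new)
		: "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
		: "memory"
	);
}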

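int
main(void) {
	/* movdqa faults on addresses that are not 16-byte aligned, and
	 * malloc() does not guarantee that everywhere, so over-allocate
	 * and round up. */
	char *raw = malloc(DATASIZE + 15);
	int *data;
	long long t1, t2, t3, m1, m2;
	int i;

	if (raw == NULL)
		return 1;
	data = (int *)(((unsigned long)raw + 15) & ~15UL);

	t1 = rdtsc64();

	for (i = 0; i < TIMES; i++)
		sse(data);

	t2 = rdtsc64();

	for (i = 0; i < TIMES; i++)
		stos(data);

	t3 = rdtsc64();

	/* Cycle counts for TIMES passes of each variant. */
	m1 = t2 - t1;
	m2 = t3 - t2;

	if (m1>m2)
		printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
	else
		printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);

	free(raw);
	return 0;
}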


