[ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

Wed Aug 5 06:53:14 CEST 2009

Also, rep movsd will be slower on small counts. On most processors,
fewer than 8 iterations will be faster with plain mov instructions than
with a rep prefix.

This has changed lately: http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/51402/reply/34703/

With blocks larger than 512 bytes, SSE/FPU code will always be faster.
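
To make those thresholds concrete, here is a rough sketch of a fill routine that picks a strategy by size. The name fill32 and both cutoff constants are made up for illustration (the numbers are just the ones mentioned above), and the middle range uses a plain loop where real code might use rep stosd. Treat it as something to measure, not as a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* SSE2 intrinsics */
#endif

/* Hypothetical cutoffs, taken from the figures in this mail; measure on
   the target CPU before relying on them. */
#define SMALL_FILL_DWORDS 8	/* below this: a few plain stores */
#define LARGE_FILL_BYTES  512	/* at or above this: SSE stores */

/* Fill count dwords at dst with value. */
static void
fill32(uint32_t *dst, uint32_t value, size_t count)
{
	size_t i = 0;

	assert(dst != NULL || count == 0);

	if (count < SMALL_FILL_DWORDS) {
		/* Small fill: plain stores, no setup cost. */
		for (; i < count; i++)
			dst[i] = value;
		return;
	}
#if defined(__SSE2__)
	if (count * sizeof(uint32_t) >= LARGE_FILL_BYTES) {
		const __m128i v = _mm_set1_epi32((int)value);

		/* Plain stores until dst + i is 16-byte aligned. */
		while (i < count && ((uintptr_t)(dst + i) & 15) != 0)
			dst[i++] = value;
		/* 64 bytes per iteration with aligned 16-byte stores. */
		for (; i + 16 <= count; i += 16) {
			_mm_store_si128((__m128i *)(dst + i), v);
			_mm_store_si128((__m128i *)(dst + i + 4), v);
			_mm_store_si128((__m128i *)(dst + i + 8), v);
			_mm_store_si128((__m128i *)(dst + i + 12), v);
		}
	}
#endif
	/* Mid-sized fills and any tail: a plain loop stands in here for
	   what could be rep stosd in real code. */
	for (; i < count; i++)
		dst[i] = value;
}
```

A caller would use it like memset() with a dword count, e.g. fill32(row, color, width) for one scanline, where row, color and width are placeholder names.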

On 4-Aug-09, at 9:50 PM, Michael Steil wrote:

> On 4 Aug 2009, at 17:37, Jose Catena wrote:
>>> but how would you want to optimize "rep stosd" anyway?
>>
>> No way. That's what I said, possibly with the exception of using a
>> 64-bit equivalent if we could assume that the CPU is 64-bit capable.
>> But Alex knows better; he is calling me ignorant. He says that
>>
>> L1:	mov [edi], eax
>> 	add edi, 4
>> 	dec ecx
>> 	jnz L1
>>
>> is faster than
>>
>> 	rep stosd
>>
>> Both do exactly the same thing, the latter much smaller AND FASTER on
>> any CPU from the 386 to the i7.
>
> I have done some tests on all generations of Intel CPUs since Yonah,
> and in all cases, rep stosd was faster than any loop I could craft or
> GCC would generate from my C code.
>
> But this does *not* mean that
> * rep stosd is by definition faster than a scalar loop
> * rep stosd is by definition faster than any kind of loop.
>
> Look at the test program at the end of this email. It compares rep
> stosd with a hand-crafted loop written with SSE instructions and SSE
> registers (parts borrowed from XNU).
>
> On all tested machines, the SSE version is significantly faster (for
> big loops):
>
> Yonah: Genuine Intel(R) CPU           T2500  @ 2.00GHz
> SSE is 3.34x faster than stosl
>
> Merom: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
> SSE is 4.86x faster than stosl
>
> Penryn: Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> SSE is 4.94x faster than stosl
>
> Nehalem: Intel(R) Xeon(R) CPU           E5462  @ 2.80GHz
> SSE is 4.62x faster than stosl
>
> So one should not assume that it's a good idea to always just use rep
> stosd. Use memset(), and have an optimized implementation of memset()
> somewhere else: one that can be inlined, checks the size, and branches
> to the optimal implementation, as XNU does, for example:
>
> http://fxr.watson.org/fxr/source/osfmk/i386/commpage/?v=xnu-1228
>
>   Michael
>
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
>
> #define MIN(a,b) ((a)<(b)? (a):(b))
>
> #define DATASIZE (1024*1024)
> #define TIMES 10000
>
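> /* Read the time-stamp counter, with lfence on both sides to keep it from
>    being reordered with the surrounding code. The "=A" constraint yields
>    EDX:EAX as one 64-bit value only on 32-bit x86, so build as 32-bit. */
> static inline long long
> rdtsc64(void)
> {
> 	long long ret;
> 	__asm__ volatile("lfence; rdtsc; lfence" : "=A" (ret));
> 	return ret;
> }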
>
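> /* Fill with aligned 16-byte SSE stores, 64 bytes per iteration, walking
>    downwards. ecx is used as a byte offset from p and starts at
>    DATASIZE/sizeof(int), so the stores cover byte offsets 0 to
>    DATASIZE/sizeof(int) + 63. xmm0 is stored as-is, whatever it holds. */
> static inline void
> sse(int *p) {
> 	int c_new;
> 	char *p_new;
> 	asm volatile (
> 		"1:				\n"
> 		"movdqa  %%xmm0,(%%edi,%%ecx)	\n"
> 		"movdqa  %%xmm0,16(%%edi,%%ecx)	\n"
> 		"movdqa  %%xmm0,32(%%edi,%%ecx)	\n"
> 		"movdqa  %%xmm0,48(%%edi,%%ecx)	\n"
> 		"subl    $64,%%ecx		\n"
> 		"jns     1b			\n"
> 		: "=D"(p_new), "=c"(c_new)
> 		: "D"(p), "c"(DATASIZE/sizeof(int))
> 		: "memory"
> 	);
> }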
>
> /* Fill DATASIZE/sizeof(int) dwords, i.e. DATASIZE bytes, with the value 1. */
> static inline void
> stos(int *p) {
> 	int c_new;
> 	char *p_new;
> 	asm volatile (
> 		"rep stosl"
> 		: "=D"(p_new), "=c"(c_new)
> 		: "D"(p), "c"(DATASIZE/sizeof(int)), "a"(1)
> 		: "memory"
> 	);
> }
>
> int
> main() {
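> 	void *data = NULL;
> 	long long t1, t2, t3, m1, m2;
> 	int i;
>
> 	/* movdqa faults on unaligned addresses, and malloc() does not
> 	   guarantee 16-byte alignment everywhere */
> 	if (posix_memalign(&data, 16, DATASIZE) != 0)
> 		return 1;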
>
> 	t1 = rdtsc64();
>
> 	for (i = 0; i < TIMES; i++)
> 		sse(data);
>
> 	t2 = rdtsc64();
>
> 	for (i = 0; i < TIMES; i++)
> 		stos(data);
>
> 	t3 = rdtsc64();
>
> 	m1 = t2 - t1;
> 	m2 = t3 - t2;
>
> 	if (m1>m2)
> 		printf("stosl is %.2fx faster than SSE\n", (float)m1/m2);
> 	else
> 		printf("SSE is %.2fx faster than stosl\n", (float)m2/m1);
>
> 	return 0;
> }
>

Best regards,
Alex Ionescu