[ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

Tue Aug 4 21:12:10 CEST 2009

I will provide some code and timings but I love how you ignored my  
main points:

1) The optimizations of the code *around* the function (ie: the  
callers), which Michael also pointed out, cannot be done in ASM.
2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and
Nehalem you will get totally different results with your ASM code,
while the compilers will generate the best possible code.
3) The fact that someone will now have to write optimized versions for
each other architecture.
4) The fact that if the loop is what you're truly worried about, you  
can optimize it by hand with __builtinia32_rep_movsd (and MSVC has a  
similar intrinsic), and still keep the rest of the function portable C.
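
A rough sketch of what point 4 could look like, not the actual ReactOS
code: fill_dwords, rect32 and color_fill_32bpp are hypothetical names,
and the surface is reduced to a base pointer plus a byte stride. It
assumes MSVC's __stosd intrinsic on x86/x64; the GCC side uses a
one-line inline "rep stosl" rather than a builtin, and every other
target falls back to a plain C loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

/* Store `count` copies of `value` starting at `dst`.
 * Only this helper is architecture specific. */
static inline void fill_dwords(uint32_t *dst, uint32_t value, size_t count)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    /* MSVC intrinsic, compiles down to rep stosd */
    __stosd((unsigned long *)dst, value, count);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* EDI/RDI = destination, ECX/RCX = count, EAX = value */
    __asm__ volatile("rep stosl"
                     : "+D"(dst), "+c"(count)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback for all other architectures */
    while (count--)
        *dst++ = value;
#endif
}

/* Hypothetical stand-in for the real rectangle type */
typedef struct {
    int32_t left, top, right, bottom;
} rect32;

/* Fill `rc` with `color` on a 32bpp surface whose rows are `stride`
 * bytes apart. Returns 0 for an empty rectangle, 1 otherwise. */
int color_fill_32bpp(uint8_t *bits, ptrdiff_t stride,
                     const rect32 *rc, uint32_t color)
{
    int32_t width = rc->right - rc->left;
    int32_t height = rc->bottom - rc->top;
    uint8_t *row;
    int32_t y;

    if (width <= 0 || height <= 0)
        return 0;

    row = bits + (ptrdiff_t)rc->top * stride + (ptrdiff_t)rc->left * 4;
    for (y = 0; y < height; y++) {
        fill_dwords((uint32_t *)row, color, (size_t)width);
        row += stride;
    }
    return 1;
}
```

Everything outside fill_dwords stays ordinary C, so the compiler is
still free to inline and optimize the callers, and a new architecture
only needs the fallback loop to work.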

Also, gcc does support profiling, another fact you don't seem to know.  
However, with linker optimizations, you do not need a profiler, the  
linker will do the static analysis.

Also, to everyone saying things like "I was able to save a <operand
name here>", I hope you understand that smaller != faster.

On 4-Aug-09, at 10:13 AM, Timo Kreuzer wrote:

> Michael Steil wrote:
>>
>> I wonder, has either of you, Alex or Timo actually *benchmarked* the
>> code on some sort of native i386 CPU before you argue whether it
>> should be a stosb or a stosd? If not, writing assembly would be a
>> clear case of "premature optimization".
>>
> I did. On Athlon X2 64, I called the function a bunch of times with a
> 100x100 rect, measuring time with rdtsc. The results were quite random,
> but roughly:
> asm: ~580
> gcc 4.2 -march=k8 -fexpensive-optimizations -O3: ~1800
> WDK /GL /Oi /Ot /O2: ~2600
> MSVC 2008 express /GL /Oi /Ot /O2: ~1800
>
> Using a 50x50 rect shifts the advantage slightly in the direction of the
> asm implementations.
>
> I added volatile to the pointer to prevent the loop from being optimized
> away.
> Using memset was a bit slower than a normal loop.
> This is what msvc produced with the above settings:
>
> _DIB_32BPP_ColorFill:
>     push  ebx
>     mov   ebx, [eax+8]
>     sub   ebx, [eax]
>     test  ebx, ebx
>     jg    short label1
>     xor   al, al
>     pop   ebx
>     retn
>
> label1:
>     mov   ecx, [eax+4]
>     push  esi
>     mov   esi, [eax+0Ch]
>     sub   esi, ecx
>     test  esi, esi
>     jg    short label2
>     pop   esi
>     xor   al, al
>     pop   ebx
>     retn
>
> label2:
>     mov   eax, [edx+4]
>     imul  ecx, eax
>     add   ecx, [edx]
>     cdq
>     and   edx, 3
>     add   eax, edx
>     sar   eax, 2
>     add   eax, eax
>     push  edi
>     mov   edi, ecx
>     add   eax, eax
>     jmp   short label3
>
> align 10h
> label3:
>     mov   ecx, edi
>     mov   edx, ebx
>
> label4:
>     mov   dword ptr [ecx], 3039h
>     add   ecx, 4
>     sub   edx, 1
>     jnz   short label4
>
>     dec   esi
>     add   edi, eax
>     test  esi, esi
>     jg    short label3
>
>     pop   edi
>     pop   esi
>     mov   al, 1
>     pop   ebx
>     retn
>
>
>
> I thought myself that I did something wrong. For me, no compiler was able to
> generate code as fast as the asm code.
> I don't know how Alex managed to get better optimizations. Maybe he
> knows a secret ninja /Oxxx switch, or maybe the express and wdk versions
> both suck at optimizing, or maybe I'm just too stupid... ;-)
>
>
>> See above: If all you want to optimize is the loop, then have C code
>> with asm("rep movsd") in it, or fix the static inline memcpy() to be
>> more efficient (if it isn't efficient in the first place).
>>
> I tried __stosd(), which actually resulted in a faster function. With
> ~610, gcc was almost as fast as the asm implementation; msvc actually
> won with 590. But that was not using purely portable code. It seems to be
> the best solution, although it will probably still be slower unless we
> set our optimization to max.
>
> Btw, I already thought about rewriting our dib code some time ago, using
> inline functions instead of a code generator. The idea is to make it
> fully portable, optimizable through inline asm functions where useful, and
> easier to maintain than the current stuff. It's on my list...
>
> Timo
>
>

Best regards,
Alex Ionescu