[ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

Tue Aug 4 05:44:34 CEST 2009

I won.

I actually had to spend the better part of an hour convincing GCC
*not* to optimize, just to make things fair.

You see, one of the #1 reasons why inline ASM loses vs C is that gcc
understands the structure of the code -- it realized that I was
calling the function with static parameters, and integrated them into
the function itself.

I tried calling the function twice -- gcc actually inlined the
function twice, with static parameters twice!

I finally settled on calling it with arguments from the command line,
which are always changing.
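
A minimal sketch of that kind of test harness, assuming a hypothetical
color_fill routine (this is not the ReactOS code, and the names are made up
for illustration). The count and color come from strings such as argv[1] and
argv[2], so the compiler cannot fold them into the function as constants.
The noinline attribute only blocks inlining; gcc may still specialize a
function for constant arguments, which is why the runtime inputs matter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fill routine under test. */
__attribute__((noinline))
void color_fill(uint32_t *dest, size_t count, uint32_t color)
{
    assert(dest != NULL);
    for (size_t i = 0; i < count; i++)
        dest[i] = color;
}

/* Parse count and color from strings (e.g. argv[1] and argv[2]) so their
 * values are unknown at compile time. Returns the number of pixels written,
 * or 0 if the requested count does not fit in the buffer. */
size_t fill_from_args(const char *count_arg, const char *color_arg,
                      uint32_t *dest, size_t capacity)
{
    size_t count = (size_t)strtoul(count_arg, NULL, 0);
    uint32_t color = (uint32_t)strtoul(color_arg, NULL, 0);

    if (count > capacity)
        return 0;
    color_fill(dest, count, color);
    return count;
}
```

From main() this would be called as fill_from_args(argv[1], argv[2], buf,
capacity) after checking argc.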

But this proves one of the first points -- gcc will be able to analyze
the form of your program, and make minute optimizations that are
*impossible* to do in ASM. For example, it could decide "all functions
calling this function will store parameter 3 in ECX" and this will
optimize the overall speed of the entire program, not only the
function, plus save stack space. This is only an example of the many
hidden optimizations it could decide to do.

Depending on how many times/how this function is called, gcc could've
done any number of register allocation and tree/loop optimizations
based on the code.

Once I fooled gcc into generating "stupid" code, the output was very
similar to yours, but more optimized -- partly because ebp was
clobbered. In case you're wondering, yes, gcc emitted an inline rep stos.
However, it chose rep stosb rather than rep stosd, because I did not give
it a guarantee of alignment (that would be a simple __attribute__).
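
One way to hand gcc such a guarantee, as a sketch only: the version below
uses __builtin_assume_aligned, which exists in newer gcc releases and is
swapped in here for illustration; it is not necessarily the __attribute__
meant above. The function name and the 16-byte figure are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill routine that promises the compiler a 16-byte aligned
 * destination. Passing a misaligned pointer is undefined behavior, so the
 * assert checks the promise in debug builds. */
void color_fill_aligned(void *bits, size_t count, uint32_t color)
{
    assert(((uintptr_t)bits & 15u) == 0);

    /* Tell gcc the pointer is 16-byte aligned so it can pick wider stores. */
    uint32_t *dest = __builtin_assume_aligned(bits, 16);

    for (size_t i = 0; i < count; i++)
        dest[i] = color;
}
```

The caller then has to allocate the surface with matching alignment, for
example with __attribute__((aligned(16))) on a static buffer.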

More importantly however, once I selected -mtune=core2, gcc destroyed
you. It made uglier (less compact) code, but didn't use any push/pops
at all, and moved data directly into the stack. It also used some more
exotic checks and operands, because it KNEW that this would be faster
on the Core 2 I was testing on. When I used -mtune=i486, or -mtune=k6,
I once again got very different looking programs. Because gcc knew
what was best for each chip. You don't, and even if you did, you'd
have to write 50 versions of your assembly code.

I also built it for x64 and ARM, and got fast code for those platforms
too -- your assembly code requires someone to port it.

Additionally, gcc also aligned the code, and certain parts of the
loop, to best suit the cache settings of the platform, and where the
code was actually located in the binary.

Finally, on certain platforms, gcc chose to call memset instead, and
provided a highly optimized memset implementation which even used SSE
4.1 if required (if it determined it would be fastest for this set of
inputs). Again, your rep stosd, while fast on 486, is slow as molasses
on newer Core processors (or even the P3), because it gets micro-coded
and has to do a lot of pre-setup work.
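
As an illustration of the memset case, here is a sketch with a hypothetical
function name. A plain loop that stores zero is the sort of pattern gcc's
loop distribution pass (-ftree-loop-distribute-patterns) can recognize and
turn into a memset call at higher optimization levels; whether it does so
depends on the gcc version and flags, so treat this as an assumption to
check in the generated assembly rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clear routine: every stored byte is zero, so the whole loop
 * is equivalent to memset(dest, 0, count * sizeof *dest). */
void clear_pixels(uint32_t *dest, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dest[i] = 0;
}
```

An arbitrary 32-bit color does not have this property unless all four of
its bytes are equal, so the general fill cannot simply become memset.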

I don't know if you were trying to bait me -- I respect you and I'm
pretty sure you knew these facts, so I'm surprised about this
"challenge".

Best regards,
Alex Ionescu

On Mon, Aug 3, 2009 at 7:05 PM, WaxDragon<waxdragon at gmail.com> wrote:
> Your kung-fu is the best, Alex.
>
> On Aug 3, 2009 7:22 PM, "Alex Ionescu" <ionucu at videotron.ca> wrote:
>
> Just got back to San Francisco... I will take you up on the challenge.
> Your ass is grass, and I'm the lawnmower.
> Best regards,
> Alex Ionescu
>
> On Mon, Aug 3, 2009 at 11:15 AM, Timo Kreuzer <timo.kreuzer at web.de> wrote:
>> yeah ;-)
>>
>> Dmitr...
>
> _______________________________________________
> Ros-dev mailing list
> Ros-dev at reactos.org
> http://www.reactos.org/mailman/listinfo/ros-dev