[ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

Tue Aug 4 12:06:53 CEST 2009

Alex Ionescu wrote:
> I won.
>   
Did you?

> I actually had to spend the better part of the hour convincing GCC
> *not* to optimize, just to make things fair.
>
> You see, because one of the #1 reasons why inline ASM loses vs C, is
> that gcc understood the structure of my code -- it realized that I was
> calling the function with static parameters, and integrated them into
> the function itself.
>   
What the compiler does not know is which parameters this function is
generally called with, unless you profile the code and then reuse the
profiling data to recompile, but I don't think that gcc supports that.
So it has to rely on generic optimization. Hand-coded assembly can be
optimized for the specific usage pattern. Anyway, that's theory.

> I tried calling the function twice -- gcc actually inlined the
> function twice, with static parameters twice!
>   
This function is neither static nor called with static parameters.

> I finally settled on calling it with arguments from the command line,
> which are always changing.
>
> But this proves one of the first points -- gcc will be able to analyze
> the form of your program, and make minute optimizations that are
> *impossible* to do in ASM. For example, it could decide "all functions
> calling this function will store parameter 3 in ECX" and this will
> optimize the overall speed of the entire program, not only the
> function, plus save stack space. This is only an example of the many
> hidden optimizations it could decide to do.
>   
All the small optimizations "around" the function don't matter much in
this case. You can assume that the function spends > 90% of its time
inside the loop. So the loop needs to be optimized; everything else is
candy.
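
For reference, a minimal C sketch of the kind of fill loop under
discussion. The name fill_rect32, its parameters, and the
stride-in-bytes convention are assumptions for illustration, not the
actual DIB_32BPP_ColorFill signature; the only property taken from this
thread is that 32bpp surfaces are DWORD aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill the rectangle [left, right) x [top, bottom)
 * of a 32bpp surface with one colour value. `stride` is the distance
 * between rows in bytes. This is not the real ReactOS function. */
void fill_rect32(void *bits, size_t stride,
                 size_t left, size_t top,
                 size_t right, size_t bottom,
                 uint32_t color)
{
    /* Assumption from the thread: 32bpp surfaces are DWORD aligned. */
    assert(((uintptr_t)bits & 3u) == 0 && (stride & 3u) == 0);

    for (size_t y = top; y < bottom; y++) {
        uint32_t *row = (uint32_t *)((unsigned char *)bits + y * stride) + left;

        /* Inner loop: this is where nearly all of the time goes, and
         * what a rep stosd (or a vectorized store loop) would replace. */
        for (size_t x = left; x < right; x++)
            *row++ = color;
    }
}
```

Everything outside the inner loop runs once per row or once per call,
which is why the per-call setup matters so little for large fills.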

> Depending on how many times/how this function is called, gcc could've
> done any number of register allocation and tree/loop optimizations
> based on the code.
>
> Once I fooled gcc into generating "stupid" code, the output was very
> similar, but more optimized than yours -- partly because ebp was
> clobbered.
I fixed the function to *not* clobber ebp. Misusing ebp is lame.

>  In case you're wondering, yes, gcc inlined the rep stosd.
> However, it chose rep stosb instead, because I did not give it a
> guarantee of alignment (that would be a simple __attribute__).
>   
I wonder what compiler you are using then. I tried it with our current
RosBE with maximum optimization and it didn't do that. Same with gcc
4.4.0 and msc (I tested the one that ships with the WDK 2008), also with
maximum optimization for speed.
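
For readers wondering what such an alignment guarantee could look like,
here is a hedged sketch. It uses __builtin_assume_aligned, which exists
in GCC 4.7 and later and in Clang, so it is not available in the gcc
4.4.0 mentioned above. The function name fill_dwords is made up for
illustration. Whether the compiler then emits rep stosd, a vector loop,
or something else depends on its version and flags, so the disassembly
is the only reliable answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `count` copies of `color` at `dst`.
 * The builtin promises the compiler that `dst` is 4-byte aligned;
 * passing a misaligned pointer would be undefined behaviour. */
void fill_dwords(uint32_t *dst, size_t count, uint32_t color)
{
    /* Debug-build check of the promise made to the compiler below. */
    assert(((uintptr_t)dst & 3u) == 0);

    uint32_t *p = __builtin_assume_aligned(dst, 4);

    for (size_t i = 0; i < count; i++)
        p[i] = color;
}
```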

> More importantly however, once I selected -mtune=core2, gcc destroyed
> you. It made uglier (less compact) code, but didn't use any push/pops
> at all, and moved data directly into the stack. It also used some more
>   
I used push/pop instead of movs to improve readability, as it doesn't
really matter. If I had been going for ultra optimization, I could have
squeezed out a few more cycles. Optimizing the loop was sufficient for
me.

> exotic checks and operands, because it KNEW that this would be faster
> on the Core 2 I was testing on. When I used -mtune=486, or -mtune=k6,
> I once again got very different looking programs. Because gcc knew
> what was best for each chip. You don't, and even if you did, you'd
> have to write 50 versions of your assembly code.
>   
Having one version that runs on all x86 machines and is faster than
anything our current gcc can generate should be enough, thanks.

> I also built it for x64 and ARM, and got fast code for those platforms
> too -- your assembly code requires someone to port it.
>   
True, I never said anything else. But this is not the question.

> Additionally, gcc also aligned the code, and certain parts of the
> loop, to best suit the cache settings of the platform, and where the
> code was actually located in the binary.
>
> Finally, on certain platforms, gcc chose to call memset instead, and
> provided a highly optimized memset implementation which even used SSE
> 4.1 if required (if it determined it would be fastest for this set of
> inputs). Again, your rep movsd, while fast on 486, is slow as molasses
> on newer Core processors (or even the P3), because it gets micro-coded
> and has to do a lot of pre-setup work.
>   
As already mentioned, memset doesn't work here. And how does the
compiler know if something is worth the hassle or not? A rep costs about
15 cycles of setup; call a subfunction instead and you quickly get more
than 15 cycles of overhead. How can the compiler possibly "determine it
would be fastest for this set of inputs" without profiling? Again,
theory.

> I don't know if you were trying to bait me -- I respect you and I'm
> pretty sure you knew these facts, so I'm surprised about this
> "challenge".
>   
The challenge was obviously the compiler. Please let us know which
version of gcc you were using and with what options; it seems to be way
more sophisticated than all the compilers/options I know.
I will be the first to replace the asm version with a C implementation,
as soon as we use a proper gcc with decent optimization in ReactOS that
creates faster code. But I currently don't see this.

You talked about compiler optimization, and what it could theoretically
do here and there, but the only thing that is worth optimizing in this
function is the loop and here you managed to get a lousy rep stosb, not
a stosd or even SSE stuff? And what about the rest of the loop? And
where's the code? Where's the disassembly? I don't care if the compiler
"can do" or "could decide to do" something. I only care about what comes
out at the end. Quite disappointing what I've seen so far.

What you are saying is like: no one uses a plane nowadays, because trains
are way faster. That might be true for a Transrapid going at 500 km/h,
while a crop duster might only make 200 km/h. But that doesn't count
when you plan a journey from Boston to San Francisco. :-P

You have not won until there are reproducible and usable results.

Regards,
Timo

> Best regards,
> Alex Ionescu
>
>
>
> On Mon, Aug 3, 2009 at 7:05 PM, WaxDragon<waxdragon at gmail.com> wrote:
>   
>> Your kung-fu is the best, Alex.
>>
>> On Aug 3, 2009 7:22 PM, "Alex Ionescu" <ionucu at videotron.ca> wrote:
>>
>> Just got back to San Francisco... I will take you up on the challenge.
>> Your ass is grass, and I'm the lawnmower.
>> Best regards,
>> Alex Ionescu
>>
>> On Mon, Aug 3, 2009 at 11:15 AM, Timo Kreuzer <timo.kreuzer at web.de> wrote: >
>>     
>>> yeah ;-) > > Dmitr...
>>>       
>> _______________________________________________
>> Ros-dev mailing list
>> Ros-dev at reactos.org
>> http://www.reactos.org/mailman/listinfo/ros-dev
