[ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

Tue Aug 4 14:35:23 CEST 2009

On 4 Aug 2009, at 03:06, Timo Kreuzer wrote:
> Alex Ionescu wrote:
>>
>> You see, because one of the #1 reasons why inline ASM loses vs C, is
>> that gcc understood the structure of my code -- it realized that I  
>> was
>> calling the function with static parameters, and integrated them into
>> the function itself.
>>
> What the compiler does not know is with what parameters this  
> function is generally called, unless you would profile the code and  
> then reuse the profiling data to recompile, but I don't think that  
> gcc supports that. So it has to rely on generic optimization. Hand  
> coded assembly can be optimized for the special usage pattern.  
> Anyway that's theory.

A compiler that does link time optimization like MSVC and LLVM will  
inline a small function like this one every time it is invoked, not  
only saving the call and the stack frame, but also to save copying  
around the arguments. If caller 1 happens to have the argument  
"iColor" in ebx, it will just inline the function in a way that it can  
work directly on ebx. And for caller 2, where iColor is in esi, it  
will inline it in a way so it can work with it in esi.

If you compile the source into an i386 binary .o file and analyze this  
one, the compiler does not know what registers hold the data *in  
general*, because you could still link it against anything. But at  
least it knows it for every single already existing invocation, and  
generate corresponding code.

But as soon as the compiler/linker knows all invocations (because you  
create the final executable, and your function is not an exported  
symbol any more), instead of optimizing by inlining every invocation,  
it can also decide to choose its own calling conventions between the n  
callers and the function and go for a register based calling  
convention, and make sure all callers prepare their arguments so that  
they end up in the right registers without the need to copy them around.

If you don't compile with a link time optimizing compiler like MSVC or  
LLVM (yet), you can choose to have your function as a "static inline"  
in a header file instead of in a C file, and the compiler won't  
require a link time optimizer in order to do the optimized inlining.

>> I tried calling the function twice -- gcc actually inlined the
>> function twice, with static parameters twice!
>>
> This function is neither static nor called with static parameters.

Not in the general case, but for every individual case, yes. And as  
soon as all possible callers are known to the toolchain, it might find  
some static property, like: A size is always divisible by four, a  
pointer is always aligned to a certain boundary, or a 64 bit argument  
will never be >= 2^32. You can't make these assumptions in assembly,  
because adding a single new caller will break your code. But in the C  
case, if you add a caller for which this property is not true any  
more, the compiler won't do this optimization any more.

>> But this proves one of the first points -- gcc will be able to  
>> analyze
>> the form of your program, and make minute optimizations that are
>> *impossible* to do in ASM. For example, it could decide "all  
>> functions
>> calling this function will store parameter 3 in ECX" and this will
>> optimize the overall speed of the entire program, not only the
>> function, plus save stack space. This is only an example of the many
>> hidden optimizations it could decide to do.
>>
> All the small optimizations "around" the function don't matter much  
> in this case. You can assume that the functions spends > 90% of the  
> time inside the loop. So the loop needs to be optimized, everything  
> else is candy.

So you are saying you only care that your rep movs is fast, and you  
don't care about the rest, because it's only 10%? In this case, please  
write your function in C, and replace only the memcpy() with inline  
asm, in order to make the code less error prone, more readable, more  
maintainable and more portable, without losing any of your goals  
(=speed in the tight loop).

But if it's really just the memcpy() that you think is slow, then you  
should fix the memcpy(), and have your compiler inline the version  
that you think is fastest. Or maybe the version it is inlining today  
is faster than what you think is faster.

I wonder, has either of you, Alex or Timo actually *benchmarked* the  
code on some sort of native i386 CPU before you argue whether it  
should be a stosb or a stosd? If not, writing assembly would be a  
clear case of "premature optimization".

>> Depending on how many times/how this function is called, gcc could've
>> done any number of register allocation and tree/loop optimizations
>> based on the code.
>>
>> Once I fooled gcc into generating "stupid" code, the output was very
>> similar, but more optimized than yours -- partly because ebp was
>> clobbered.
> I fixed the function to *not* clobber ebp. Misusing ebp is lame.

If you compile debug, you want ebp to be sane. If you compile release  
(i.e. perfomance), you usually do not care, In any case, with C code,  
you can decide whether you want debuggability or speed with a compile  
time switch. With assembly, you can not. And please don't start with  
#ifdefs.

>>  In case you're wondering, yes, gcc inlined the rep stosd.
>> However, it chose rep stosb instead, because I did not give it a
>> guarantee of alignment (that would be a simple __attribute__).
>>
> I wonder what compiler you are using then. I tried it with our  
> current RosBE with maximum optimization and it didn't do that. Same  
> with gcc 4.4.0 and  msc (I tested the one that ships with the WDK  
> 2008) also with maximum optimization for speed.

If the compiled code looks suboptimal, these should be the steps to  
take:
1. Understand whether the code is actually suboptimal, i.e. is it  
slow? Maybe the compiler uses stosb for a reason - for example because  
stosb is faster on unaligned data, and the compiler could not make any  
assumptions on the alignment of the data.
2. In case the code is bad because the compiler couldn't make an  
assumption, i.e. it did not know enough, then give it hints. GCC takes  
lots of hints.
3. In case the compiler was right, because you were making an  
assumption that cannot me made in general, then you should leave it  
the way it is, because the compiler is probably doing the right thing.
4. If the compiler is really acting stupid, compare it to other  
compilers to be really sure, and then file a bug with the compiler.
5. Refrain from hand-coding something that works around a compiler  
*performance* issue unless it is really really necessary. If you have  
to do so, clearly mark the code with a bug tracker number and state  
that this code should be reverted into C as soon as the compiler bug  
is fixed.

In short: Nowadays, compilers are really really smart. Amazingly  
smart. If you think you're better, then you're either doing it wrong,  
making assumptions that you haven't told the compiler about, or it's a  
compiler bug. In no case, try to do something by hand unless you  
perfectly understand *why* the compiler is doing it some way.

And yes, I understand that current compilers may have some limitations  
concerning SIMD instruction sets. (...although even this isn't  
strictly true any more today, if you know how to give the compiler the  
right hints and use the right intrinsics.) But this case is not one of  
these.

> I used push/pop in favour of movs to improve the readability, as it  
> doesn't really matter. If I had been up to ultra optimization, I  
> could have quenched out a few cycles more. Optimizing the loop was  
> sufficient for me.

See above: If all you want to optimize is the loop, then have C code  
with asm("rep movsd") in it, or fix the static inline memcpy() to be  
more efficient (if it isn't efficient in the first place).

> The challenge was obviously the compiler. Please let us know which  
> version of gcc you were using and with what options, it seems to be  
> way more sophisticated than all the compilers/options I know.

While I would indeed be interested in seeing the compiler output both  
of you get, because otherwise all this is pretty theoretical, I also  
think that it doesn't matter in the big picture. Even if today's  
compiler doesn't generate perfect code for a simple task as this one  
(and ReactOS' toolchain doesn't use the latest GCC today), then  
tomorrow's compiler will, and it's not worth doing any assembly any  
more today.

> You talked about compiler optimization, and what it could  
> theoretically do here and there, but the only thing that is worth  
> optimizing in this function is the loop and here you managed to get  
> a lousy rep stosb, not a stosd or even SSE stuff?

Now if we care a lot about memcpy() performance, then we could add  
many more tricks to the codebase. The memcpy() code in the "commpage"  
of Mac OS X is pretty fascinating, for example. Depending on the CPU  
type, different versions of memcpy() are in place, and some versions  
first look at the size of the region, and branch to one of three  
different implementations. Small regions use rep stosb, bigger ones  
SSE or the like. (I'm sure NT does something similar.)

If you are more into static optimization, or want to give the compiler  
hints about the typical size of the region, you could do some clever  
macro hacks that supply the typical size range as an extra argument to  
memcpy(), so that the best version can be chosen at preprocessing  
time, without the need for the size check in the compiled code - if,  
for example, DIB_32BPP_ColorFill() typically did memcpy()s of smaller  
sizes.

   Michael