[ros-dev] [ros-diffs] [tkreuzer] 42353: asm version of DIB_32BPP_ColorFill: - Add frame pointer - Get rid of algin_draw, 32bpp surfaces must be DWORD aligned - Optimize the loop - Add comments

Alex Ionescu ionucu at videotron.ca
Wed Aug 5 00:25:11 CEST 2009


Note to everyone else: I just spent some time doing the calculations
and have data proving the C code can be faster -- I will post it
tonight from home.

Now to get to your argument, Jose...

Best regards,
Alex Ionescu



On Tue, Aug 4, 2009 at 2:19 PM, Jose Catena <jc1 at diwaves.com> wrote:
> With all respect Alex, although I agree with you on the core point, that this
> does not deserve the disadvantages of asm (portability, readability, etc.) for
> a tiny performance difference if any, I don't agree with many of your arguments.

Also keep in mind Timo admitted "This code is not called often",
making ASM optimization useless.

>
> -->
> 1) The optimizations of the code *around* the function (ie: the
> callers), which Michael also pointed out, cannot be done in ASM.
>
> <--
> Yes, it can. I could always outperform or match a C compiler at that, and
> did many times (I'm the author of an original PC BIOS, performance
> libraries, mission critical systems, etc).
> I very often used regs for calling params, local storage through SP instead
> of BP, good use and reuse of registers, etc.

An optimizing compiler will do this too.

> In fact, the loop the compiler generated was identical to the asm source
> except for the two instructions the compiler added (that serve no
> purpose; it is an MSVC issue).

Really? Here's sample code from my faster C version:

.text:004013E0                 lea     eax, [esi+eax*4]
.text:004013E3                 lea     esi, ds:0[edi*4]
.text:004013EA                 lea     eax, [ebp+eax+0]
.text:004013EE                 db      66h
.text:004013EE                 nop

99% of the people on this list (and you, probably) will tell me
"this is a GCC issue" or that this is "useless code".

Guess what: I compiled with -mtune=core2, and this code sequence is
specifically generated before the loop.

Neither Timo nor, I admit, even I would think of adding this kind of
code. But once I asked some experts what it does, I understood why
it's there.

To quote Michael: "if you think the compiler is generating useless
code, try to find out what the code is doing." In most cases, the
belief that it is "wrong" or "useless" is itself probably wrong.

As a challenge, can you tell me the point of this code? Why is it
written this way? If I build for 486 (which is what ALL OF YOU SEEM TO
BE STUCK ON!!!), I get code that looks like Timo's.
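
For anyone who wants to reproduce this, here is a minimal sketch (a
hypothetical fill loop, NOT the actual ReactOS function) that you can
build with different -mtune values and diff the generated assembly;
padding of this kind typically shows up in front of the inner loop
with the core2 tuning:

/*
 * Hypothetical example: compare
 *     gcc -O2 -mtune=core2 -S fill.c
 * against
 *     gcc -O2 -mtune=i486 -S fill.c
 */
#include <stddef.h>
#include <stdint.h>

void fill32(uint32_t *dst, size_t width, size_t height,
            size_t stride_dwords, uint32_t color)
{
    size_t x, y;

    for (y = 0; y < height; y++)
    {
        uint32_t *row = dst + y * stride_dwords;
        for (x = 0; x < width; x++)
            row[x] = color;   /* the inner loop the compiler tunes per CPU */
    }
}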

> It is actually in the calling overhead and local initialization and storage
> where I could easily beat the compiler, since it complies with rules that I
> can safely break.

That doesn't make any sense. You are AGREEING with me. My point is
that a compiler will break normal calling rules, while the assembly
code has to respect at least some rules, because you won't know all
your callers a priori (you might in a BIOS, but not in a codebase as
large as win32k). The compiler, on the other hand, DOES know all the
callers, and will happily clobber registers, change the calling
convention, etc. Please re-read Michael's email.

> Furthermore, in most cases a compiler won't change calling convention unless
> the source specifies it

Completely not true. Compilers will do this. This is not 1994 anymore.

> ...and in any case the register-based calling used by
> compilers is way restricted compared with what can be done in asm, which can
> always use more efficient methods (more extensive and intelligent register
> allocation).

Again, simply NOT true. Today's compilers can decide things like "all
callers of foo must pass parameter 8 in ECX", generate the code that
way, avoid saving/restoring ECX, and use it as a parameter. You CANNOT
do this in assembly unless you have a very small number of callers
that you know nobody else will touch. As soon as someone else adds a
caller, YOU have to do all the work to keep ECX working that way.
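
A sketch of the kind of situation I mean (hypothetical code; the exact
register choices are entirely up to the compiler, and the cross-module
case needs whole-program optimization, i.e. gcc's -flto or MSVC's
/GL + /LTCG):

/*
 * 'blend' is static, so the compiler sees every caller. It is free to
 * invent a private calling convention for it: pass the arguments in
 * whatever registers it likes, skip the ABI-mandated saves/restores,
 * or simply inline it. Hand-written asm exporting a public symbol
 * cannot do that, because it must honor the ABI for callers it has
 * never seen.
 */
static unsigned blend(unsigned a, unsigned b)
{
    return (a & 0xFEFEFEFEu) / 2 + (b & 0xFEFEFEFEu) / 2;
}

unsigned blend_row(const unsigned *src1, const unsigned *src2, unsigned n)
{
    unsigned acc = 0;
    unsigned i;

    for (i = 0; i < n; i++)
        acc += blend(src1[i], src2[i]);
    return acc;
}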

You seem to have a very 1990s understanding of how compilers work
(respecting calling conventions, saving/restoring registers, not
touching EBP, etc.). Probably because you worked on BIOSes, which,
yes, worked that way back then.

Please read a bit about technologies such as LLVM or Microsoft's
link-time code generator (LTCG).

> In any case, the most important optimizations are equally done in C and
> assembly when the programmer knows how to write optimum code and does not
> have to comply with a prototype.

Again, NO. Unless you control all your call sites and are willing to
update the code every single time a call site gets added, the compiler
WILL beat you. LLVM and LTCG can even look 2-3 call sites away, so
that callers of foo, which call bar, which call baz, end up with a
stack frame or register contents that make bar and baz faster.

> For example, passing arguments as a pointer
> to a struct is always more efficient.
>

It actually depends, and again the compiler can make this choice.
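
To make that concrete, here is a hypothetical pair of functions taking
the same parameters; with a callee the compiler can see (static, or
whole-program with LTO/LTCG), it may break the struct apart and pass
the hot fields in registers anyway, so neither form is automatically
cheaper:

#include <stddef.h>

struct fill_args {
    unsigned *dst;
    size_t    count;
    unsigned  color;
};

/* passing the parameters individually */
void fill_by_value(unsigned *dst, size_t count, unsigned color)
{
    size_t i;

    for (i = 0; i < count; i++)
        dst[i] = color;
}

/* passing a pointer to a struct holding the same parameters */
void fill_by_struct(const struct fill_args *a)
{
    fill_by_value(a->dst, a->count, a->color);
}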

> -->
> 2) The fact that if you try this code on a Core 2, Pentium 4, Pentium 1 and
> Nehalem you will get totally different results with your ASM code,
> while the compilers will generate the best possible code.
>
> <--
> There are very few and specific cases where the optimum code for different
> processors is different, and this is not the case.

False. I got radically different ASM when building for K8, i7, Core 2
and Pentium.

> If gcc generates different code for this function and different CPUs, it is
> not for a good reason.

Excuse me?

> There is only a meaningful exception for this function: if the inner loop
> can use a 64 bit rep stos instead of 32. And in this case it can be done in
> asm, while I don't know any compiler that would use a 64 bit rep stos
> instruction for a 32 bit target regardless of the CPU having 64 bit
> registers.

Again, this is full of assumptions. You seem to be saying "GCC is
stupid, I know better". Yet you don't even understand WHY gcc will
generate different code for different CPUs.

Please read into the topics of "pipelines" and "caches" and
"micro-operations" as a good starting point.

>
> -->
> 4) The fact that if the loop is what you're truly worried about, you
> can optimize it by hand with __builtinia32_rep_movsd (and MSVC has a
> similar intrinsic), and still keep the rest of the function portable C.
>
> <--
> It is not necessary to use a built-in function like the one you mention,
> because any optimizing compiler will use rep movsd anyway, with better
> register allocation, if it differs at all.

Ummm, if you think "rep movsd" is what an optimizing compiler will
use, then I'm sorry, but you don't have the credentials to be in this
argument and I'm wasting my time. rep movsd is the SLOWEST way to
implement this loop on modern CPUs. On my Core 2 build, for example,
gcc used "mov" and a loop instead. Only when building for a Pentium 1
did it use rep movsd.

Please stop assuming that 1 line of ASM is faster than 12 lines just
because 1 < 12. On modern CPUs, a "manual" loop will be faster than
rep movsd nearly ALWAYS.
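
And if someone really insists on rep stosd, it can be requested from C
without an external .asm file; here is a sketch comparing the two
approaches (the plain loop is what I would ship and let -mtune decide;
the forced version uses MSVC's __stosd intrinsic or a gcc extended-asm
block and is only meant for experiments):

#include <stddef.h>
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Plain C: the compiler picks the store strategy (mov loop, unrolled
 * or vectorized stores, etc.) based on the CPU it is tuning for. */
void fill_generic(uint32_t *dst, size_t count, uint32_t color)
{
    size_t i;

    for (i = 0; i < count; i++)
        dst[i] = color;
}

/* Forced rep stosd, for measuring the difference yourself. */
void fill_rep_stos(uint32_t *dst, size_t count, uint32_t color)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    __stosd((unsigned long *)dst, color, count);
#elif defined(__GNUC__) && defined(__i386__)
    __asm__ volatile ("rep stosl"
                      : "+D" (dst), "+c" (count)
                      : "a" (color)
                      : "memory");
#else
    fill_generic(dst, count, color);
#endif
}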

> If inline asm is used instead, optimizations for the whole function are
> disabled, as the compiler does not analyze what's done in inline assembly.

LOL??? Again, maybe true in the 1990s. But first of all:

1) Built-ins are not "inline asm", and will be optimized.
2) GCC and MSVC will both optimize around inline assembler in the
context of the function it appears in. The old mantra that "inline asm
disables optimizations" hasn't been true since about 2001...

In fact, when assembly is *required* (for something like a trap save),
it is ALWAYS better to use an inline __asm__ block within the C
function than to call an external function in a .S or .ASM file,
because compilers like gcc will be able to fine-tune the assembly you
wrote and make it work better with the surrounding C code. LTCG will,
in some cases, even optimize the ASM you wrote by hand in an external
.ASM file.
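
A tiny example of the integration I mean, using rdtsc as a stand-in
for "something you cannot express in C": the constraints tell gcc
exactly which registers the instruction produces, so it can inline the
helper, schedule surrounding code across it, and register-allocate the
rest of the function normally, none of which happens across a call
into an external .S file.

/* Reading the CPU timestamp counter via an extended-asm block.
 * "=a" and "=d" tell the compiler the results arrive in EAX and EDX,
 * so it needs no extra moves, no stack frame and no call overhead. */
static inline unsigned long long read_tsc(void)
{
    unsigned int lo, hi;

    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}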

>
> -->
> Also, gcc does support profiling, another fact you don't seem to know.
> However, with linker optimizations you do not need a profiler; the
> linker will do the static analysis.
>
> <--
> Function-level linking and profile-based optimization are very different
> things; the linker in no way can perform a similar statistical analysis.

But it can perform static analysis.
>
> -->
> Also, to everyone saying things like "I was able to save a <operand
> name here>", I hope you understand that smaller != faster.
>
> <--
> Saving these two instructions improves both speed and size. Note
> that the loop the compiler generated was exactly the same as the original
> assembly, only with those two instructions added. I can discern when I save
> speed, size, both, or neither, in either C or assembly.
>
> I wrote this not to be argumentative or confrontational, but just because I
> don't like to read arguments that are not true, and I hope you all take this
> as constructive knowledge.
> BTW, I hardly support the use of assembly except in very specific cases, and
> this is not one. I disagreed with Alex on the arguments, not on the core point.

Thanks Jose, but unfortunately you are wrong. If we were having this
argument:

1) in 1986,
2) on a 486,
3) about BIOS code (which is small and rarely extended, with all calls
controlled),

I would bow down and tip my hat to you in an instant, but times have changed.

I don't want to waste more time on these arguments, because I know I'm
right and I've asked several people who all agree with me -- people
who work closely with Intel, compiler technology and assembly. I
cannot convince people who don't even have the basic knowledge needed
to UNDERSTAND the arguments. Do some reading, then come back.

I will post numbers and charts when I'm home; at least they will
provide some "visual" confirmation of what I'm saying, but I doubt
that will be enough.
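
In the meantime, anyone can gather their own numbers with a trivial
harness like this one (hypothetical; it uses the fill32 sketch from
earlier in this mail, and you can swap in whichever fill variants you
want to compare):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define W 1920
#define H 1080
#define ITERATIONS 500

/* the hypothetical fill loop from the earlier sketch */
extern void fill32(uint32_t *dst, size_t width, size_t height,
                   size_t stride_dwords, uint32_t color);

int main(void)
{
    uint32_t *buf = malloc((size_t)W * H * sizeof(uint32_t));
    clock_t start, end;
    int i;

    if (!buf)
        return 1;

    start = clock();
    for (i = 0; i < ITERATIONS; i++)
        fill32(buf, W, H, W, 0x00FF00FFu);
    end = clock();

    printf("%d fills: %.3f s\n", ITERATIONS,
           (double)(end - start) / CLOCKS_PER_SEC);
    free(buf);
    return 0;
}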

>
> Jose Catena
> DIGIWAVES S.L.
>
>
>
> _______________________________________________
> Ros-dev mailing list
> Ros-dev at reactos.org
> http://www.reactos.org/mailman/listinfo/ros-dev
>


