[ros-dev] Speed Tests (was: ping Alex regarding log2() forscheduler)

Thu Mar 24 07:00:55 CET 2005

BSR has a latency of 8-12 Cycles on Athlon/P3 but can be pipelined. Worse 
(up to ~80 cycles) on Pentium and other older CPUs.
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_3748,00.html

> Maximum latency on a Pentium 4 is eight clock cycles. But its throughput 
> is one, which means it is fully pipelined. So when you start this 
> instruction eight clock cycles before you need the result, it behaves as 
> if it only takes one clock cycle. You're not going to be able to beat that 
> with any other code. The closest you can get is to convert the integer to 
> float, then extract the exponent. That could be done in less than eight 
> clock cycles but throughput will be lower.
http://www.flipcode.com/cgi-bin/fcmsg.cgi?thread_show=16986&msg=113105

Dont know about A64 - maybe someone can test BSR with A64?
I'm very disappointed by its peformance on my K7 system - i might have 
messed something up thought.
You could save on the Pushing / Popping but that would be kinda like 
cheating and if you do it dirty/lazy I wont get the right returns either.

It doesnt make much sense to put the optimized ASM in there, neither is much 
hope of GCC having a good day and doing a lot of optimisation.
So far the best option would be the macro with a lookup table (only one 
global kernel table tho).

Here are the updated STATS
also available at http://hackersquest.org/kerneltest.html

result orig function            46ffffe9
it took         1526862         18%
result orig function inlined    46ffffe9
it took         1041460         12%
result second proposal inlined  46ffffe9
it took         1248990         15%
result optimized asm            46ffffe9
it took         1321532         16%
result lookup inlined           46ffffe9
it took         682264          8%
result bsr inlined              46ffffe9
it took         1751088         21%
result macro                    46ffffe9
it took         653692          7%

http://wohngebaeudeversicherung.einsurance.de/

----- Original Message ----- 
From: "Alex Ionescu" <ionucu at videotron.ca>
To: "ReactOS Development List" <ros-dev at reactos.com>
Sent: Thursday, March 24, 2005 3:57 AM
Subject: Re: [ros-dev] Speed Tests (was: ping Alex regarding log2() 
forscheduler)

> Ash wrote:
>
>> Hello,
>>
>> I'd like to provide a small VS.NET project file with 5 different tests.
>>
>> - 46ffffe9 tells everything is alright with the function calculation
>> - times are measured with QueryPerformanceCounter
>> - loop run cnt: 0x2ffffff
>>
>> result orig function            46ffffe9
>> it took         1491052
>> result orig function inlined    46ffffe9
>> it took         1035547
>> result second proposal inlined  46ffffe9
>> it took         1244434
>> result optimized asm            46ffffe9
>> it took         1338367
>> result debug asm                46ffffe9
>> it took         8774815
>>
>> The second proposal is the original proposal but more shifts - still 
>> slower tho.
>>
>> Interesting is the inlined version generated by MSVC, shaving off almost 
>> 1/3 of the overall time.
>> Also stared as "optimized asm" - no gurantee on register safety tho ;)
>>
>> For portability and performance sake it should be considered to create a 
>> compiler macro.
>> This function is terribly small, any optimisations inside are outweighted 
>> by the calling overhead in this case.
>> The most impressive one is the original function inlined, althought the 
>> ASM would only work on x86.
>>
>> Please do not think about using 64k tables, thats what, 1/2 of a Sempron 
>> L2 cache?
>> It would really really trash performance.
>
> Hi Ash,
>
> Thanks a lot for your tests. I don't have much time tonight, but if you'd 
> like/can, can you add two more tests?
>
> One using "bsr", the intel opcode. It does all the work for you and 
> returns the index.
>
> i think it's as simple as "bsr ecx, eax" where "ecx" is the mask and eax 
> is the returned index. Or it might be backwards.
>
> The second test is using a 256-byte log2 table:
>
> const char LogTable256[] = {
>  0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
>  4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
>  5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
>  5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
>  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
>  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
>  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
>  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
>  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
> };
>
>
> and a lookup code like this:
>
> Addon = 16;
> if ((IntMask = Mask >> 16)) {
>     Addon = 0;
>     IntMask = Mask;
> }
> if (IntMask && 0xFFFFFF00) {
>     Addon += 8;
> }
> HighBit = LogTable256[(Mask >> Addon)] + Addon;
>
> _______________________________________________
> Ros-dev mailing list
> Ros-dev at reactos.com
> http://reactos.com:8080/mailman/listinfo/ros-dev
>