June 1, 2023, 19:05:02

Is there a faster algorithm for max(ctz(x), ctz(y))?

Question

For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))?

ctz represents "count trailing zeros".
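
For reference, the OR trick for the minimum works because the lowest set bit of x | y sits at whichever of the two lowest set bits comes first. A minimal Rust sketch of that observation (the function names `min_ctz_naive` and `min_ctz_or` are made up for this illustration):

```rust
/// Reference version: two trailing-zero counts followed by a min.
pub fn min_ctz_naive(x: u64, y: u64) -> u32 {
    x.trailing_zeros().min(y.trailing_zeros())
}

/// OR trick: the lowest set bit of `x | y` is the lower of the two
/// lowest set bits, so a single count is enough.
pub fn min_ctz_or(x: u64, y: u64) -> u32 {
    (x | y).trailing_zeros()
}
```

With x = 8 (ctz 3) and y = 12 (ctz 2), both versions return 2. No comparably direct single-operation trick is obvious for the maximum, which is what the question is about.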

C++ version (Compiler Explorer)

#include <algorithm>
#include <bit>
#include <cstdint>
int32_t test2(uint64_t x, uint64_t y) {
    return std::max(std::countr_zero(x), std::countr_zero(y));
}

Rust version (Compiler Explorer)

pub fn test2(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}

Answer 1

Score: 24

I don't think there's anything better than the naive approach for the maximum. One attempt is using the identity

x + y = min(x, y) + max(x, y)

and thus

max(ctz(x), ctz(y)) = ctz(x) + ctz(y) - min(ctz(x), ctz(y))

This way, we can reduce the max function to the min function we already optimized, albeit with a few additional operations.

Here are some Rust implementations of the different approaches:

pub fn naive(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}
pub fn sum_minus_min(x: u64, y: u64) -> u32 {
    x.trailing_zeros() + y.trailing_zeros() - (x | y).trailing_zeros()
}
pub fn nielsen(x: u64, y: u64) -> u32 {
    let x_lsb = x & x.wrapping_neg();
    let y_lsb = y & y.wrapping_neg();
    let xy_lsb = x_lsb | y_lsb;
    let lsb = xy_lsb & xy_lsb.wrapping_neg();
    let xy_max_lsb = if xy_lsb == lsb { lsb } else { xy_lsb ^ lsb };
    xy_max_lsb.trailing_zeros()
}
pub fn timmermans(x: u64, y: u64) -> u32 {
    let loxs = !x & x.wrapping_sub(1);
    let loys = !y & y.wrapping_sub(1);
    (loxs | loys).count_ones()
}
pub fn kealey(x: u64, y: u64) -> u32 {
    ((x | x.wrapping_neg()) & (y | y.wrapping_neg())).trailing_zeros()
}
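
Before comparing timings, it may be worth checking that the variants really compute the same thing. The following is only a sketch of such a check, not part of the benchmark: the helpers `all_agree` and `check_nonzero_grid` are hypothetical names, and the check deliberately covers nonzero inputs only, which is an assumption of this sketch.

```rust
pub fn naive(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}
pub fn sum_minus_min(x: u64, y: u64) -> u32 {
    x.trailing_zeros() + y.trailing_zeros() - (x | y).trailing_zeros()
}
pub fn nielsen(x: u64, y: u64) -> u32 {
    let x_lsb = x & x.wrapping_neg();
    let y_lsb = y & y.wrapping_neg();
    let xy_lsb = x_lsb | y_lsb;
    let lsb = xy_lsb & xy_lsb.wrapping_neg();
    let xy_max_lsb = if xy_lsb == lsb { lsb } else { xy_lsb ^ lsb };
    xy_max_lsb.trailing_zeros()
}
pub fn timmermans(x: u64, y: u64) -> u32 {
    let loxs = !x & x.wrapping_sub(1);
    let loys = !y & y.wrapping_sub(1);
    (loxs | loys).count_ones()
}
pub fn kealey(x: u64, y: u64) -> u32 {
    ((x | x.wrapping_neg()) & (y | y.wrapping_neg())).trailing_zeros()
}

/// Hypothetical helper: true when every variant matches `naive` for (x, y).
pub fn all_agree(x: u64, y: u64) -> bool {
    let expected = naive(x, y);
    [sum_minus_min(x, y), nielsen(x, y), timmermans(x, y), kealey(x, y)]
        .iter()
        .all(|&got| got == expected)
}

/// Hypothetical helper: checks a grid of nonzero inputs that covers every
/// combination of trailing-zero counts from 0 to 63.
pub fn check_nonzero_grid() -> bool {
    (0..64u32).all(|i| (0..64u32).all(|j| all_agree(1u64 << i, 3u64 << j)))
}
```

Calling `check_nonzero_grid()` from a unit test should return true; zero inputs would need a separate look for each variant.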

Results on my machine:

ctz_max/naive           time:   [279.09 ns 279.55 ns 280.10 ns]
ctz_max/sum_minus_min   time:   [738.91 ns 742.87 ns 748.61 ns]
ctz_max/nielsen         time:   [935.35 ns 937.63 ns 940.40 ns]
ctz_max/timmermans      time:   [803.39 ns 806.98 ns 810.76 ns]
ctz_max/kealey          time:   [295.03 ns 295.93 ns 297.03 ns]

The naive implementation beats all other implementations. The only one that comes close is the approach suggested by Martin Kealey. Note that the actual factors between the implementations may be even higher than the timings indicate, due to some overhead of the test harness.

It's clear that you only have a couple of CPU instructions to spare when optimizing the naive implementation, so I don't think there is anything you can do. For reference, here is the assembly emitted by the Rust compiler when these implementations are compiled as standalone functions on a modern x86_64 processor:

example::naive:
        tzcnt   rcx, rdi
        tzcnt   rax, rsi
        cmp     ecx, eax
        cmova   eax, ecx
        ret
example::sum_minus_min:
        tzcnt   rcx, rdi
        tzcnt   rax, rsi
        add     eax, ecx
        or      rsi, rdi
        tzcnt   rcx, rsi
        sub     eax, ecx
        ret
example::nielsen:
        blsi    rax, rdi
        blsi    rcx, rsi
        or      rcx, rax
        blsi    rax, rcx
        xor     edx, edx
        cmp     rcx, rax
        cmovne  rdx, rcx
        xor     rdx, rax
        tzcnt   rax, rdx
        ret
example::timmermans:
        lea     rax, [rdi - 1]
        andn    rax, rdi, rax
        lea     rcx, [rsi - 1]
        andn    rcx, rsi, rcx
        or      rcx, rax
        xor     eax, eax
        popcnt  rax, rcx
        ret
example::kealey:
        mov     rax, rdi
        neg     rax
        or      rax, rdi
        mov     rcx, rsi
        neg     rcx
        or      rcx, rsi
        and     rcx, rax
        tzcnt   rax, rcx
        ret

In the benchmarks I ran, the functions get inlined, the loops are partially unrolled, and some subexpressions are pulled out of the inner loops, so the assembly looks a lot less clean than the above.

For testing, I used Criterion. Here is the additional code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
const NUMBERS: [u64; 32] = [
    ...
];
fn bench<F>(func: F)
where
    F: Fn(u64, u64) -> u32,
{
    for x in NUMBERS {
        for y in NUMBERS {
            black_box(func(x, y));
        }
    }
}
fn compare(c: &mut Criterion) {
    let mut group = c.benchmark_group("ctz_max");
    group.bench_function("naive", |b| b.iter(|| bench(naive)));
    group.bench_function("sum_minus_min", |b| b.iter(|| bench(sum_minus_min)));
    group.bench_function("nielsen", |b| b.iter(|| bench(nielsen)));
    group.bench_function("timmermans", |b| b.iter(|| bench(timmermans)));
    group.bench_function("kealey", |b| b.iter(|| bench(kealey)));
}
criterion_group!(benches, compare);
criterion_main!(benches);

NUMBERS was generated with this Python code, with the intention of making branch prediction for the min() function as hard as possible:

import random
[
    random.randrange(2 ** 32) * 2 ** random.randrange(32)
    for dummy in range(32)
]

I'm running the benchmark using

RUSTFLAGS='-C target-cpu=native -C opt-level=3' cargo bench

on an 8th generation i7 processor (Whiskey Lake).

Answer 2

Score: 18

These are equivalent:

max(ctz(a),ctz(b))
ctz((a|-a)&(b|-b))
ctz(a)+ctz(b)-ctz(a|b)
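
To see why the second form works: a|-a sets every bit from the lowest set bit of a upward, so ANDing the two masks leaves only the bits at or above the higher of the two lowest set bits. A small Rust sketch of that reasoning, where the names `from_lowest_set_bit_up` and `max_ctz_masks` are invented for this illustration:

```rust
/// `a | -a`: every bit from the lowest set bit of `a` upward is set.
/// For `a == 0` the result stays 0.
pub fn from_lowest_set_bit_up(a: u64) -> u64 {
    a | a.wrapping_neg()
}

/// The AND of the two masks starts at the higher of the two lowest set bits,
/// so its trailing-zero count is max(ctz(a), ctz(b)).
pub fn max_ctz_masks(a: u64, b: u64) -> u32 {
    (from_lowest_set_bit_up(a) & from_lowest_set_bit_up(b)).trailing_zeros()
}
```

For a = 0x68 (ctz 3) and b = 0x14 (ctz 2), the masks are all ones shifted left by 3 and by 2 respectively; their AND is the mask shifted left by 3, whose ctz is 3.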

The math-identity ctz(a)+ctz(b)-ctz(a|b) requires 6 CPU instructions, parallelizable to 3 steps on a 3-way superscalar CPU:

3× ctz
1× bitwise-or
1× addition
1× subtraction

The bit-mashing ctz((a|-a)&(b|-b)) requires 6 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:

2× negation
2× bitwise-or
1× bitwise-and
1× ctz

The naive max(ctz(a),ctz(b)) requires 5 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:

2× ctz
1× comparison
1× conditional branch
1× load/move (so that the "output" is always in the same register)

... but note that branch instructions can be very expensive.

If your CPU has a conditional load/move instruction, this reduces to 4 CPU instructions taking 3 super-scalar steps.

If your CPU has a max instruction (e.g. SSE4), this reduces to 3 CPU instructions taking 2 super-scalar steps.

All that said, the opportunities for super-scalar operation depend on which instructions you're trying to put against each other. Typically you get the most by putting different instructions in parallel, since they use different parts of the CPU (all at once). Typically there will be more "add" and "bitwise or" units than "ctz" units, so doing multiple ctz instructions may actually be the limiting factor, especially for the "math-identity" version.

If "compare and branch" is too expensive, you can make a non-branching "max" in 4 CPU instructions. Assuming A and B are positive integers:

C = A-B
subtract the previous carry, plus D, from D itself (D is now either 0 or -1, regardless of whatever value it previously held)
C &= D (C is now min(0, A-B))
A -= C (A' is now max(A,B))
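
The four steps above can be sketched in Rust roughly as follows. This is only an illustration under assumptions: the name `branchless_max` is made up, the borrow flag from `overflowing_sub` stands in for the CPU carry, and nothing guarantees that the compiler emits exactly these four instructions (it may well choose a conditional move instead).

```rust
/// Branch-free max of two unsigned values, following the four steps above.
pub fn branchless_max(a: u32, b: u32) -> u32 {
    // Step 1: C = A - B; `borrow` is true exactly when A < B.
    let (c, borrow) = a.overflowing_sub(b);
    // Step 2: D = 0 or all ones (-1), which is what subtracting a register
    // from itself with borrow produces.
    let d = (borrow as u32).wrapping_neg();
    // Step 3: C &= D, so C is A - B when A < B and 0 otherwise.
    let c = c & d;
    // Step 4: A -= C, giving B when A < B and A otherwise.
    a.wrapping_sub(c)
}
```

For the problem at hand this would be called as `branchless_max(x.trailing_zeros(), y.trailing_zeros())`.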

Answer 3

Score: 11

You can do it like this:

#include <algorithm>
#include <bit>
#include <cstdint>
int32_t maxr_zero(uint64_t x, uint64_t y) {
    uint64_t loxs = ~x & (x-1); // low zeros of x
    uint64_t loys = ~y & (y-1); // low zeros of y
    return std::countr_zero((loxs|loys)+1);
}
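
For readers following the Rust versions elsewhere in the thread, here is a rough Rust port with the intermediate step split out. The helper name `low_zeros` is made up for this sketch, and `wrapping_add` is used on the assumption that it should mirror the C++ unsigned wraparound when an input is zero.

```rust
/// Mask with a 1 in every position where `x` has a trailing zero
/// (all ones when `x == 0`).
pub fn low_zeros(x: u64) -> u64 {
    !x & x.wrapping_sub(1)
}

/// The OR of the two masks is the longer run of low ones; adding 1 turns
/// that run into a single bit just above it, whose position is the answer.
pub fn maxr_zero(x: u64, y: u64) -> u32 {
    (low_zeros(x) | low_zeros(y)).wrapping_add(1).trailing_zeros()
}
```

With x = 40 (binary 101000) and y = 12 (binary 1100), the masks are 7 and 3, their OR is 7, adding 1 gives 8, and the trailing-zero count of 8 is 3.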

Answer 4

Score: 1

I am not sure whether or not it is faster, but this function will take x and y and calculate the input to ctz for getting the max value:

uint64_t getMaxTzInput(uint64_t x, uint64_t y)
{
   uint64_t x_lsb = x & (~x + 1);  // Least significant 1 of x
   uint64_t y_lsb = y & (~y + 1);  // Least significant 1 of y
   uint64_t xy_lsb = x_lsb | y_lsb;  // Least significant 1s of x and y (could be the same)
   uint64_t lsb = (xy_lsb) & (~(xy_lsb)+1);  // Least significant 1 among x and y
   // If the least significant 1s are different for x and y, remove the least significant 1
   // to get the second least significant 1.
   uint64_t xy_max_lsb = (xy_lsb == lsb) ? lsb : xy_lsb ^ lsb;
   return xy_max_lsb;
}

Thus, ctz(getMaxTzInput(x,y)) should at least give the correct value with only one call of ctz.
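
As a usage sketch, here is a Rust port of the function together with the single ctz call. The names `get_max_tz_input` and `max_ctz_single_call` are chosen for this illustration only.

```rust
/// Port of getMaxTzInput: isolates the higher of the two lowest set bits.
pub fn get_max_tz_input(x: u64, y: u64) -> u64 {
    let x_lsb = x & x.wrapping_neg(); // least significant 1 of x
    let y_lsb = y & y.wrapping_neg(); // least significant 1 of y
    let xy_lsb = x_lsb | y_lsb; // one or two bits set
    let lsb = xy_lsb & xy_lsb.wrapping_neg(); // lower of the two
    // If two distinct bits are present, drop the lower one.
    if xy_lsb == lsb { lsb } else { xy_lsb ^ lsb }
}

/// Single ctz call on the isolated bit.
pub fn max_ctz_single_call(x: u64, y: u64) -> u32 {
    get_max_tz_input(x, y).trailing_zeros()
}
```

For x = 40 and y = 12 the isolated bits are 8 and 4, the lower one (4) is removed, and the ctz of 8 is 3. One thing the sketch surfaces: when exactly one input is zero, the result is the ctz of the other input rather than 64, so this matches max(ctz(x), ctz(y)) only for nonzero inputs (or when both are zero).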
