Why is getting a value from the end of a LinkedList much slower than from the start?

Question

I have a LinkedList of 1,000,000 items. I measured the retrieval of an item first at index 100,000 and then at index 900,000. In both cases, the LinkedList goes through 100,000 operations to get to the desired index. So why is the retrieval from the end so much slower than from the start?
Measurements taken with JMH.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
public class ComparationGet {

    static int val1 = 100_000;
    static int val2 = 500_000;
    static int val3 = 900_000;

    @Benchmark
    public void testGet1LinkedListFromStart(Blackhole blackhole, MyState state) {
        MyDigit res1 = state.linkedList.get(val1);
        blackhole.consume(res1);
    }

    @Benchmark
    public void testGet2LinkedListFromEnd(Blackhole blackhole, MyState state) {
        MyDigit res1 = state.linkedList.get(val3);
        blackhole.consume(res1);
    }
}

Results:

from start:
ComparationGet.testGet1LinkedListFromStart avgt 10 0,457 ± 0,207 ms/op

from end:
ComparationGet.testGet2LinkedListFromEnd avgt 10 5,789 ± 3,094 ms/op

State class:

@State(Scope.Thread)
public class MyState {
    public List<MyDigit> linkedList;

    private int iterations = 1_000_000;

    @Setup(Level.Invocation)
    public void setUp() {
        linkedList = new LinkedList<>();

        for (int i = 0; i < iterations; i++) {
            linkedList.add(new MyDigit(i));
        }
    }
}

MyDigit class:

public class MyDigit {
    private int val;

    public MyDigit(int val) {
        this.val = val;
    }
}

LinkedList get method:

public E get(int index) {
    checkElementIndex(index);
    return node(index).item;
}

Node<E> node(int index) {
    // assert isElementIndex(index);

    if (index < (size >> 1)) {
        Node<E> x = first;
        for (int i = 0; i < index; i++)
            x = x.next;
        return x;
    } else {
        Node<E> x = last;
        for (int i = size - 1; i > index; i--)
            x = x.prev;
        return x;
    }
}
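
Given that node() implementation, both lookups chase a comparable number of pointers: get(100_000) follows 100,000 next links from the head, while get(900_000) follows 99,999 prev links back from the tail. Below is a minimal standalone sketch for reproducing the comparison without JMH (the class name is invented for the example, and a single un-warmed System.nanoTime measurement is noisy, so treat any numbers as rough indications only):

import java.util.LinkedList;
import java.util.List;

public class GetFromBothEnds {
    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>();
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
        }

        // get(100_000) walks 100,000 'next' links from the head;
        // get(900_000) walks 99,999 'prev' links back from the tail.
        long t0 = System.nanoTime();
        list.get(100_000);
        long t1 = System.nanoTime();
        list.get(900_000);
        long t2 = System.nanoTime();

        System.out.printf("from start: %,d ns%n", t1 - t0);
        System.out.printf("from end:   %,d ns%n", t2 - t1);
    }
}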

Answer 1

Score: 6

LinkedList is a fine example of the limitations of fundamental informatics-based reasoning about algorithms. Basic reasoning about the code here, treating the computer as a simple von Neumann machine, would dictate that either benchmark needs 100k steps to get from one 'end' to the desired item, and therefore the benchmark should report equal times, give or take some statistical noise.

In actual fact, one is an order of magnitude slower than the other.

LinkedList is almost always the loser in comparisons like this. In fact, as a rule of thumb, LinkedList should be banned from all codebases. It's almost always vastly slower than basic reasoning would indicate, and in the rare circumstances where LinkedList would (actually, in real benchmarks, not in theory!) outperform an ArrayList, there's almost always a different type that's even more suitable, such as, say, ArrayDeque.
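
To make the rule of thumb concrete, here is a rough sketch (illustrative only, not a calibrated benchmark; the class name is invented for the example) comparing one positional get on the two list types. The ArrayList call is a single array index, while the LinkedList call walks roughly 500,000 links:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListGetComparison {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> linked = new LinkedList<>();
        List<Integer> array = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            linked.add(i);
            array.add(i);
        }

        long t0 = System.nanoTime();
        linked.get(n / 2);  // walks ~500,000 links, node by node
        long t1 = System.nanoTime();
        array.get(n / 2);   // one bounds check plus one array index
        long t2 = System.nanoTime();

        System.out.printf("LinkedList.get(mid): %,d ns%n", t1 - t0);
        System.out.printf("ArrayList.get(mid):  %,d ns%n", t2 - t1);
    }
}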

But, why?

There are many reasons. But usually it has to do with cache paging.

NB: For the CPU design expert: I've oversimplified rather a lot, to try to explain the key aspect (which is that cache misses drown out any algorithmic expectations).

Modern CPUs have hierarchical layers of memory. The slowest, by far, is 'main memory' (that 16GB of RAM or whatnot that you have). The CPU cannot actually read directly from main memory, at all. And yet O(n) analysis assumes that it can.

Then there are layers of caches, generally 3 (L1 to L3), and, even faster than those, registers.

When you read some memory, what actually happens is that the system checks whether the data you want is mapped into one of the caches. Only entire pages' worth of memory can be mapped at a time, so it first works out which page your data is in, and then checks whether that page is in one of the caches. If it is, great, the operation succeeds.

If not, uh-oh. The CPU can't do your job. So instead, it goes and does something else, or just twiddles its thumbs for at least 500 cycles (more on faster CPUs!) whilst it evicts some page from one of the caches and copies the page you wanted from main memory into one of the caches.

Only then can it continue.

Java guarantees that arrays are consecutive. If you declare, say, new int[1000000], Java guarantees that all 1,000,000 4-byte values are right next to each other, so if you iterate through the array, you get the minimum possible number of 'cache miss' events (where you read from some memory that isn't in one of the caches).
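
A sketch of that effect (machine-dependent; the gap tends to grow with array size as fewer pages fit in cache): sum the same int array once in sequential index order and once in a shuffled order. The arithmetic per element is identical, so any difference in timing comes from the access pattern alone:

import java.util.Random;

public class AccessOrderDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        int[] data = new int[n];
        int[] seq = new int[n];      // indices 0, 1, 2, ...: cache-friendly walk
        int[] shuffled = new int[n]; // a random permutation: cache-hostile walk
        for (int i = 0; i < n; i++) {
            data[i] = i;
            seq[i] = i;
            shuffled[i] = i;
        }
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int tmp = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = tmp;
        }

        System.out.println("sequential: " + time(data, seq) + " ns");
        System.out.println("shuffled:   " + time(data, shuffled) + " ns");
    }

    // Sum data[] in the order given by order[]; only the access pattern differs.
    static long time(int[] data, int[] order) {
        long t0 = System.nanoTime();
        long sum = 0;
        for (int idx : order) {
            sum += data[idx];
        }
        long t1 = System.nanoTime();
        if (sum == Long.MIN_VALUE) System.out.println(); // keep 'sum' from being optimized away
        return t1 - t0;
    }
}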

So, if you have an ArrayList, it is backed by an array, and that array is guaranteed to be consecutive. However, the objects inside don't have to be. Unlike with new int[1000000], with new Object[1000000] you only have the pointers all consecutive; the actual objects they point at need not be.

However, for the test you've set up, that is immaterial: nothing in your code actually 'follows the pointer'.

In a LinkedList, you end up with no array at all, and instead with 2*X objects (X being the size of the list): the X objects you are storing, plus X 'trackers'; each tracker contains a pointer (in Java: a reference) to the actual object being stored, as well as 'previous' and 'next' pointers that point at its sibling trackers.
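
For reference, the 'tracker' is the private Node class inside OpenJDK's java.util.LinkedList:

private static class Node<E> {
    E item;       // the stored object (your MyDigit)
    Node<E> next; // the following tracker, or null at the tail
    Node<E> prev; // the preceding tracker, or null at the head

    Node(Node<E> prev, E element, Node<E> next) {
        this.item = element;
        this.next = next;
        this.prev = prev;
    }
}

Each tracker is a separate heap object of its own, allocated whenever you call add.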

None of these objects, neither the trackers nor the stored items, are guaranteed to be consecutive in memory.

They could be smeared all over the place. Even just looping through each element in a list of 1,000,000, not following the item pointers at all, is theoretically a worst case of 1,000,000 cache misses if the trackers are scattered everywhere.

Cache misses are so slow, and CPUs are so fast, that you can safely consider the job of iterating through each tracker (or through each item in a 1,000,000-sized array) as entirely free, zero CPU time required, as long as you don't run into cache misses: the cache misses tend to dominate the time requirements.
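
The following sketch tries to make that visible. It walks the same chain of nodes twice: once linked in allocation order, once linked in a shuffled order. The JVM and the garbage collector ultimately decide where objects live in memory, so take this as an approximation of 'smeared all over', not a guarantee (class and method names are invented for the example):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PointerChaseDemo {
    static final class Cell {
        Cell next;
        int value;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        List<Cell> cells = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            Cell c = new Cell();
            c.value = i;
            cells.add(c);
        }

        // Chain the cells in allocation order and walk the chain.
        Cell head = link(cells);
        System.out.println("allocation order: " + walk(head) + " ns");

        // Re-chain the same cells in a shuffled order and walk again.
        List<Cell> shuffledCells = new ArrayList<>(cells);
        Collections.shuffle(shuffledCells, new Random(42));
        head = link(shuffledCells);
        System.out.println("shuffled order:   " + walk(head) + " ns");
    }

    // Wire each cell's 'next' pointer to the following cell in list order.
    static Cell link(List<Cell> cells) {
        for (int i = 0; i < cells.size() - 1; i++) {
            cells.get(i).next = cells.get(i + 1);
        }
        cells.get(cells.size() - 1).next = null;
        return cells.get(0);
    }

    // Follow 'next' pointers to the end, timing the traversal.
    static long walk(Cell head) {
        long t0 = System.nanoTime();
        long sum = 0;
        for (Cell c = head; c != null; c = c.next) {
            sum += c.value;
        }
        long t1 = System.nanoTime();
        if (sum == Long.MIN_VALUE) System.out.println(); // keep 'sum' live
        return t1 - t0;
    }
}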

You'd have to investigate further, but here is a plausible explanation for what you're witnessing:

Your code runs in isolation (it is not doing much else), so your init runs unimpeded, and whilst Java makes no contiguity guarantees about any of this, your actual memory layout ends up looking like: a MyDigit object, then a LinkedList tracker, then another MyDigit object, then another LinkedList tracker, and so on.

Nevertheless, going from the last node involves a number of cache misses, whereas going from the front (which also has the benefit of starting at 'byte 0' of a page) isn't nearly as badly affected.

For reference, here is a chart of access times for fetching a certain-sized chunk of data, assuming optimal caching: note the big spike when you get to 4M.

huangapple
  • Posted on 2020-08-10 21:02:40
  • Please retain this link when reposting: https://go.coder-hub.com/63340794.html