这个Haskell代码是否重用了先前的计算?

huangapple go评论58阅读模式
英文:

Does this Haskell code reuse previous computation?

问题

  1. 旧的 regularSeq 值在计算 timesTwo/timesThree/timesFive 时会被缓存(memoized),因此不会生成类似于朴素 Fibonacci 实现中的低效度-3计算树。Memoization 确保以前计算的值被重复使用,提高了性能。

  2. regularSeq 的计算过程中,只存在一个整数列表在内存中,timesTwo/timesThree/timesFive 指向同一列表内的不同元素。它们共享相同的计算结果,不会指向独立的列表。

  3. Haskell 垃圾收集器能够识别不再可达的旧值,并将其释放。在你的描述中,旧的值不再需要计算更多的元素时会被释放。

  4. 了解 Haskell 程序的内存行为可以遵循一些一般原则:

    • Haskell 使用惰性(lazy)求值,只有在需要时才会计算值。
    • Memoization 可以用于缓存中间结果,提高性能。
    • 共享数据结构是 Haskell 的一大特点,不同部分可以共享相同的数据,而不需要重复计算。
    • Haskell 的垃圾收集器会自动管理内存释放,但理解何时会发生释放可以有助于编写更高效的代码。

希望这些原则能帮助你更好地理解和推理关于 Haskell 程序的内存行为。

英文:

I have a piece of Haskell code that computes regular numbers, i.e. positive integers whose only prime factors can be 2, 3 or 5. The algorithm is straightforward and follows what's suggested in the same Wikipedia article.

regularSeq :: [Integer]
regularSeq = 1 : union timesTwo (union timesThree timesFive)
  where
    timesTwo   = map (* 2) regularSeq
    timesThree = map (* 3) regularSeq
    timesFive  = map (* 5) regularSeq

union :: (Ord a) => [a] -> [a] -> [a]
union [] ys = ys
union xs [] = xs
union (x : xs) (y : ys)
  | x < y     = x : union      xs (y : ys)
  | x > y     = y : union (x : xs)     ys
  | otherwise = x : union      xs      ys

Example:

ghci> takeWhile (<= 60) regularSeq 
[1,2,3,4,5,6,8,9,10,12,15,16,18,20,24,25,27,30,32,36,40,45,48,50,54,60]

I have a couple questions regarding the performance of this code, specifically the lazy evaluation, "memoization" and memory usage.

  1. Since the computation of a new number in the sequence relies on previous values further back, are older values of regularSeq "cached"/"memoized" and reused in the computation of timesTwo/timesThree/timesFive? Or does the recursive code spawn an inefficient degree-3 tree of computations similar to the naive Fibonacci implementation?

    fib 0 = 1
    fib 1 = 1
    fib n = fib (n-1) + fib (n-2)
    
  2. During the evlauation of regularSeq, is there only a single list of integers present in memory, with timesTwo/timesThree/timesFive acting as pointers to different elements inside this same list? Or do they point to independent lists, thus not sharing computation?

    In my mind, timesTwo/timesThree/timesFive simply "lag behind" and reuse the values already discovered by the evaluation of regularSeq, however I'm not entirely sure this is correct.

  3. If I were to implement the sequence in an imperative language (say C or Rust) as an endless stream, I would keep in memory only the values from the head of timesFive to the current value, as the older ones are no longer needed to compute further elements. Is the Haskell garbage collector able to see that older values are not reachable anymore and does it deallocate them? Or does it deallocate the entire sequence only when it is discarded in its entirety?

  4. I find it quite hard to reason about the memory behavior of Haskell programs, and it is often not apparent to me whether the result of computation is shared or something need to be unnecessarily re-evaluated. What are some general principles and a good framework to reason about this problem?

答案1

得分: 3

Here are the translated parts of the text you provided:

Since the computation of a new number in the sequence relies on previous values further back, are older values of regularSeq "cached"/"memoized" and reused in the computation of timesTwo/timesThree/timesFive? Or does the recursive code spawn an inefficient degree-3 tree of computations similar to the naive Fibonacci implementation?

这些值在产生后会立即存储在列表中,从而实现了高效的计算。这与您的 fib 示例不同。

Memoization 是一个相关但不同的概念,它在相同参数 x 上之前计算过 f x 时可以使计算 f x 变得快速。在您的情况下,regularSeq 不是一个函数,而只是一个列表。不需要记忆化。

在操作上,您可以将其状态视为 regularSeq = x0 : x1 : ... : xN : <<to be computed>>,其中最后的尾部是一个待评估的表达式。

During the evlauation of regularSeq, is there only a single list of integers present in memory, with timesTwo/timesThree/timesFive acting as pointers to different elements inside this same list? Or do they point to independent lists, thus not sharing computation?

timesTwo/... 是不同的列表,但它们确实重用存储在 regularSeq 中的值。

In my mind, timesTwo/timesThree/timesFive simply "lag behind" and reuse the values already discovered by the evaluation of regularSeq, however I'm not entirely sure this is correct.

我认为这是一个很好的直觉。思路是,每当我们从 regularSeq 读取下一个元素时,其代码根据 union 的工作方式要求从 timesTwo/... 获取下一个元素。这将访问已经评估过的 regularSeq 中的数据(您提到的“滞后”部分),然后产生结果。

If I were to implement the sequence in an imperative language (say C or Rust) as an endless stream, I would keep in memory only the values from the head of timesFive to the current value, as the older ones are no longer needed to compute further elements. Is the Haskell garbage collector able to see that older values are not reachable anymore and does it deallocate them? Or does it deallocate the entire sequence only when it is discarded in its entirety?

垃圾回收应该使您的代码在实际存储在 regularSeq 中的数据之外,只需 O(1) 的内存开销即可运行。这是因为为了生成 regularSeq 中的一个更多元素,我们只需要评估 timesTwo/... 中的“下一个”元素。因此,任何时候,我们只有三个“下一个”元素在内存中,再加上很快就会被垃圾回收的旧元素。

优化器甚至可能能够避免生成列表 timesTwo/...,因为它们将被垃圾回收。要查看是否是这种情况,通常需要检查优化器生成的 GHC Core。

I find it quite hard to reason about the memory behavior of Haskell programs, and it is often not apparent to me whether the result of computation is shared or something need to be unnecessarily re-evaluated. What are some general principles and a good framework to reason about this problem?

确实很难。在我看来,理解 Haskell 程序的性能是 Haskell 编程中最困难的任务。因为底层进行了大量优化,很难猜测到底发生了什么。最好的情况是,通过经验,人们可以获得一些直觉,但仍然有点神秘。

为了提高直觉,可以尝试阅读 Simon Peyton-Jones 的关于函数式语言实现的书,即使它现在有点过时。您还可以尝试使用 :sprint 来观察随着计算进行而计算的 thunk。或者如果您愿意冒险的话,可以尝试使用 ghc-vis(如果它仍然可用)。

英文:

> Since the computation of a new number in the sequence relies on previous values further back, are older values of regularSeq "cached"/"memoized" and reused in the computation of timesTwo/timesThree/timesFive? Or does the recursive code spawn an inefficient degree-3 tree of computations similar to the naive Fibonacci implementation?

They are simply stored in the list as soon as they are produced, resulting in an efficient computation. This is unlike your fib example.

Memoization is a related, but different concept which makes computing f x fast when f x was previously computed before for the same argument x. In your case, regularSeq is not a function, but only a list. No memoization is needed.

Operationally, you can think of its state as being regularSeq = x0 : x1 : ... : xN : &lt;&lt;to be computed&gt;&gt; where the last tail is a to-be-evaluated expression.

> During the evlauation of regularSeq, is there only a single list of integers present in memory, with timesTwo/timesThree/timesFive acting as pointers to different elements inside this same list? Or do they point to independent lists, thus not sharing computation?

timesTwo/... are distinct lists, but they indeed reuse the values stored inside regularSeq.

> In my mind, timesTwo/timesThree/timesFive simply "lag behind" and reuse the values already discovered by the evaluation of regularSeq, however I'm not entirely sure this is correct.

I think this is a good intuition. The idea is that whenever we read the next element from regularSeq, its code demands the next element from timesTwo/... according to how union works. That will in turn access the already-evaluated data in regularSeq (the "lag behind" you mentioned) and produce the result.

> If I were to implement the sequence in an imperative language (say C or Rust) as an endless stream, I would keep in memory only the values from the head of timesFive to the current value, as the older ones are no longer needed to compute further elements. Is the Haskell garbage collector able to see that older values are not reachable anymore and does it deallocate them? Or does it deallocate the entire sequence only when it is discarded in its entirety?

Garbage collection should indeed make your code work in O(1) memory overhead beyond the data actually stored in regularSeq. This is because in order to generate one more element in regularSeq we only need to evaluate the "next" elements from timesTwo/.... So, at any time, we only have the three "next" elements in memory, plus old ones that will be GC'd soon.

The optimizer might even be able to avoid generating the lists timesTwo/... since they will be GC'd anyway. To see if this is the case, one must usually check the GHC Core resulting from the optimizer.

> I find it quite hard to reason about the memory behavior of Haskell programs, and it is often not apparent to me whether the result of computation is shared or something need to be unnecessarily re-evaluated. What are some general principles and a good framework to reason about this problem?

It is indeed hard. In my opinion, understanding Haskell performance is the hardest task in Haskell programming. Since so much optimization is going on underneath the hood, it is hard to guess what is really going on. At best, with experience, one can get some intuition, but it is still a bit of a dark art.

To improve the intuition, one can try reading the Simon Peyton-Jones book on the implementation of functional languages, even if it's old nowadays. You can also try experimenting with :sprint to observe thunks as they are computed. Or ghc-vis if that still works, and you feel really adventurous.

huangapple
  • 本文由 发表于 2023年5月22日 23:01:03
  • 转载请务必保留本文链接:https://go.coder-hub.com/76307496.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定