Accessing a Global Hashtable from Concurrent Threads (Common Lisp)

Question

I cannot find the bug related to the global hash-table in the following SBCL code:

(defparameter *lci-hash-table* (make-hash-table :size 10000)
  "Hash-table key = length|char|index concatenated together (eg, 5A0),
   hash-table values = list of matching symbols (eg, (ABATE ABBEY ...)).")

(defun dictionary-compatible ($new-cross-str)  ;eg, "A     " of length 5
  "Tests if a string (with uppercase alphabetic and space characters)
   is compatible with the dictionary words (symbols) in a hash-table."
  (iter (with len = (length $new-cross-str))
        (for char in-sequence $new-cross-str)
        (for index from 0)
        (when (not (eql char #\Space))
          (collect (gethash (ut::intern-symbol len char index)  ;this gethash
                            *lci-hash-table*) into dict-words)) ;always returns nil
        (finally (return (reduce #'intersection dict-words)))))

(defun intern-symbol (&rest args)
  "Interns a symbol created by concatenating args.
   Based on symb in Let Over Lambda."
  (flet ((mkstr (&rest args)
           (with-output-to-string (s)
             (dolist (a args) (princ a s)))))
    (values (intern (apply #'mkstr args)))))

The function dictionary-compatible above runs fine in single-threaded mode in SBCL. But starting over with a new image in multi-threaded mode (using the :lparallel library) the gethash access to the global *lci-hash-table* always returns NIL.

The *lci-hash-table* is only being accessed, never updated, and therefore, according to the SBCL manual, there is no need to make it :synchronized, or defined with sb-ext:defglobal.

I think I'm missing something that's related to multi-processing for hash-tables in SBCL.

Edit: For background info, here is the code to populate the hash-table:

(defun create-lci-ht (dictionary-file)
  "Read dictionary words (symbols) from a file into strings
   and store in a lci hash table."
  (with-open-file (infile dictionary-file :direction :input :if-does-not-exist nil)
    (when (not (streamp infile)) (error "File does not exist!"))
    (with-open-file (stream dictionary-file)
      (let ((word-strings (uiop:read-file-lines stream)))
        (iter (for word-string in word-strings)
              (for word-length = (length word-string))
              (while word-string)
              (iter (for char in-sequence word-string)
                    (for index from 0)
                    (push (intern word-string)
                          (gethash (ut::intern-symbol word-length char index)
                                   *lci-hash-table*))))))))

Answer 1

Score: 3

This code exhibits two classic mistakes in CL: mistakes which everyone, including me, has made at some point!

The first mistake is that you are explicitly creating and looking up symbols in a package without ever specifying the package, so you are simply relying on whatever the ambient value of *package* is when the code is running. That's, at the very least, fragile.

  • (in-package ...) in a source file sets the package during the time the code in the file is being compiled and loaded but not the ambient package outside the dynamic extent of the compilation / loading, since both load and compile-file rebind *package* to its ambient value.
  • Macros like with-standard-io-syntax rebind *package*.
  • The behaviour of *package*, and of special variables in general, under multithreading is implementation-dependent, and shims like Bordeaux Threads do not (and probably cannot!) hide this.

As an example of the last point, consider this code:

;;; Assume Bordeaux Threads is loaded
;;;

(defpackage :one
  (:use :cl :bordeaux-threads))

(defpackage :two
  (:use :cl :bordeaux-threads))

(in-package :one)

(start-multiprocessing)

(defun test (p1 p2)
  (setf p1 (find-package p1)
        p2 (find-package p2))
  (let ((*package* p1))
    (multiple-value-bind (tp1 tp2)
        (join-thread
         (make-thread
          (lambda ()
            (values *package*
                    (setf *package* p2)))))
      (values (package-name tp1) (package-name tp2)
              (package-name *package*)))))

Now in SBCL:

> (test :one :two)
"COMMON-LISP-USER"
"TWO"
"ONE"
> (test :one :two)
"TWO"
"TWO"
"ONE"

Yes: different results!

In LW:

> (test :one :two)
"ONE"
"TWO"
"ONE"
> (test :one :two)
"ONE"
"TWO"
"ONE"

I don't understand SBCL's behaviour but I also have not read a lot of SBCL's manual in detail. I think what is probably happening is that in SBCL threads do not inherit special bindings, and *package* is bound somewhere in the REPL thread: this means that the assignment in the thread is to the global value of *package*. In LW threads do inherit their parent's bindings I think.
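The guess above matches what the SBCL manual says about special variables: global values are visible in all threads, while bindings made with let are local to the thread and are not inherited by child threads. A minimal sketch of that behaviour, assuming an SBCL build with thread support; *setting*, child-sees and child-sees-captured are hypothetical names, and *setting* stands in for *package*:

```lisp
;; Hypothetical special variable standing in for *PACKAGE*.
(defvar *setting* :global)

#+sb-thread
(defun child-sees ()
  "Return the value of *SETTING* as seen by a freshly created thread."
  (sb-thread:join-thread
   (sb-thread:make-thread (lambda () *setting*))))

#+sb-thread
(defun child-sees-captured ()
  "Capture the parent's value lexically and rebind it inside the child."
  (let ((captured *setting*))           ; read in the parent thread
    (sb-thread:join-thread
     (sb-thread:make-thread
      (lambda ()
        (let ((*setting* captured))     ; rebound in the child thread
          *setting*))))))

;; The parent rebinds *SETTING*, but the plain child still sees the
;; global value; only the capturing version sees the parent's binding.
#+sb-thread
(let ((*setting* :rebound))
  (list (child-sees) (child-sees-captured)))
```

In SBCL the final form should evaluate to (:GLOBAL :REBOUND). Other implementations may let children inherit the parent's bindings, which is consistent with the LW output shown above.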

The way to resolve this is to always be specific about which package you are using whenever you call functions like intern. Relying on the ambient value of *package* is what is actually breaking your code.
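The failure can be reproduced without any threads by rebinding *package* by hand. The sketch below is only an illustration: the packages DEMO-LOAD and DEMO-RUN, the table *demo-table* and the helper functions are hypothetical names, not part of the original code.

```lisp
;; Hypothetical packages: one current while the table is populated,
;; one current while it is queried.
(defpackage :demo-load (:use :cl))
(defpackage :demo-run (:use :cl))

(defparameter *demo-table* (make-hash-table)
  "EQL table keyed by symbols, like *lci-hash-table* in the question.")

(defun ambient-key (len char index)
  "Intern a key such as 5A0 in whatever *PACKAGE* is current at run time."
  (values (intern (format nil "~D~C~D" len char index))))

(defun explicit-key (len char index)
  "Intern the same key, but always in the DEMO-LOAD package."
  (values (intern (format nil "~D~C~D" len char index) :demo-load)))

;; Populate the table while *PACKAGE* is DEMO-LOAD.
(let ((*package* (find-package :demo-load)))
  (setf (gethash (ambient-key 5 #\A 0) *demo-table*) '("ABATE" "ABBEY")))

(defun lookup (key-function pkg)
  "Look up the key 5A0 with *PACKAGE* bound to the package named PKG."
  (let ((*package* (find-package pkg)))
    (values (gethash (funcall key-function 5 #\A 0) *demo-table*))))

(lookup #'ambient-key :demo-load)   ; finds the stored list
(lookup #'ambient-key :demo-run)    ; NIL: DEMO-RUN::5A0 is a different symbol
(lookup #'explicit-key :demo-run)   ; finds the stored list again
```

The second lookup fails for the same reason the threaded lookup does: the key is interned in a package other than the one that was current when the table was filled.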

The second mistake is that you are using packages to map strings to symbols and then using a hashtable to map symbols to values. In other words:

  1. you are looking up a string in a package, which is an object whose job is to map strings to symbols, which are objects which, among other things, have values and properties;
  2. you are then looking up the symbol you found in a hashtable, which is an object whose job is to map objects to values.

OK, think about that: you're doing two lookups to get the value you're after. Instead, do one:

  • either look up the string in a package and then simply use the value of the resulting symbol;
  • or, and almost certainly better, look up the string in an equal hashtable directly.

The first approach is essentially just using a package as a hashtable: this is fine so long as you are careful that this is all you use it for (so make sure you don't use it for the symbols which make up your program as well!), but it's ugly for two reasons:

  • packages are not completely first-class in CL, and in particular you can't have anonymous packages, so you end up having to be careful about names.
  • symbols tend to be rather heavyweight objects in CL implementations, with a bunch of slots of which you are using only one -- they don't have to be but they often are.

So generally, if what you want is a map from strings to values, where those strings are not the names of parts of your program, the best approach is just to use an equal hashtable for this.
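As a rough sketch of the equal-hashtable approach applied to the length|char|index scheme from the question: the names *lci-table*, lci-key, add-word and words-matching are hypothetical, and words are stored as strings rather than symbols so that no package is involved at all.

```lisp
(defparameter *lci-table* (make-hash-table :test #'equal :size 10000)
  "Maps key strings such as \"5A0\" to lists of matching words.")

(defun lci-key (len char index)
  "Build the string key for LEN, CHAR and INDEX, for example \"5A0\"."
  (format nil "~D~C~D" len char index))

(defun add-word (word)
  "Register WORD under one key per character position."
  (loop with len = (length word)
        for char across word
        for index from 0
        do (push word (gethash (lci-key len char index) *lci-table*))))

(defun words-matching (len char index)
  "Return the words of length LEN that have CHAR at position INDEX."
  (values (gethash (lci-key len char index) *lci-table*)))

(add-word "ABATE")
(add-word "ABBEY")

(words-matching 5 #\A 0)   ; both words
(words-matching 5 #\T 3)   ; only "ABATE"
(words-matching 5 #\Z 0)   ; NIL
```

Because string keys are compared by content, the lookup gives the same answer regardless of the value of *package* in the calling thread.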


If you really do need to use a package for some reason, and you want intern-symbol to take an unknown number of arguments, then probably you want intern-symbol either to take the package as its first argument:

(defun intern-symbol (package &rest args)
  (intern (format nil "~{~A~}" args) package))

Or you might provide an explicit variable:

(defvar *my-hashtable-package* (make-package "MY-HASHTABLE-PACKAGE" :use '()))

(defun intern-symbol (&rest args)
  (intern (format nil "~{~A~}" args) *my-hashtable-package*))

As a final note: intern-symbol is not reliable in general:

> (eq (intern-symbol 1 23) (intern-symbol 1 2 3))
t

> (eq (intern-symbol 1 23) (intern-symbol 1 "2" 3))
t

Of course it may be fine in constrained cases.
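One way to avoid such collisions, sketched here as an assumption rather than as part of the original answer, is to keep the key components separate, for example as a list used as the key of an equal hashtable. The names concat-key, list-key and *list-keyed-table* are hypothetical.

```lisp
(defun concat-key (&rest args)
  "Concatenate the printed forms of ARGS, as intern-symbol does before interning."
  (format nil "~{~A~}" args))

(defun list-key (&rest args)
  "Keep the components separate so different argument lists stay distinct."
  (copy-list args))

(defparameter *list-keyed-table* (make-hash-table :test #'equal))

(setf (gethash (list-key 1 23) *list-keyed-table*) :one-then-twenty-three)

(string= (concat-key 1 23) (concat-key 1 2 3))   ; true: both are "123"
(equal (list-key 1 23) (list-key 1 2 3))         ; false
(gethash (list-key 1 2 3) *list-keyed-table*)    ; NIL, no accidental hit
(gethash (list-key 1 23) *list-keyed-table*)     ; the stored value
```

For the length|char|index keys in the question the concatenated form is probably unambiguous in practice, since a single letter always separates the two numbers.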

Answer 2

Score: 1

The general problem is localized to the hash-table access point (gethash (ut::intern-symbol len char index) *lci-hash-table*), which always returns NIL, even though the hash-table is visible in threads, and the hash-table keys (symbols--eg, 5A1) are verified correct. Note that the access is probably better rendered as (gethash (intern (format nil "~D~C~D" len char index)) *lci-hash-table*), but this makes no significant difference.

The specific problem seems to be with the hash-table keys. Evidently, the keys installed during hash-table creation are different from the keys (with the same symbol name) used when the table is accessed. My guess is that the difference may be due to different packages. (The project package is called :ww.) Changing the access point to (gethash (intern (format nil "~D~C~D" len char index) :ww) *lci-hash-table*) fixes the problem.

It is still unclear to me how the hash-table key references could be in different packages, since everything is loaded into the one package :ww (all files begin with (in-package :ww)). It is also unclear why the extra package reference is not needed for single-threaded runs, but is needed for multi-threaded runs.

Answer 3

Score: 0

You seem to be using only a global binding for *lci-hash-table*, not local rebindings, so as far as I know the same value should be visible from all threads. If you observe different values of *lci-hash-table* from different threads then it would be a case of special variables having thread-local storage, but here I think this is not your problem.

The variable is bound to nil in your case when you start over with a new image in multi-threaded mode, so I would start trying to debug that instead. I would add (break) in both create-lci-ht and dictionary-compatible near the gethash to see when both instructions are executed. Apart from that I don't really know how to identify your problem.

--- Edit

You might in fact have a problem of having special variables being bound to different values in different threads, but for the *package* variable: you can check that by showing which package is active during the break and/or forcing the package to be a specific value in intern: (intern str package).

That would explain why your hash-table with an equality test of eql doesn't find matches for symbols. (You can try adding :test #'equal to make-hash-table, but note that equal compares symbols with eq, so this only helps if the keys are strings rather than symbols, and it is probably not a change you want to keep after debugging.)
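The package check suggested above could look roughly like this. It is only a debugging sketch: key-packages and *probe-table* are hypothetical names, and the keyword package is used purely to have a known home package in the example.

```lisp
(defun key-packages (table)
  "Return the names of the home packages of all symbol keys in TABLE."
  (let ((names '()))
    (maphash (lambda (key value)
               (declare (ignore value))
               (when (symbolp key)
                 (let ((home (symbol-package key)))
                   (pushnew (if home (package-name home) "(uninterned)")
                            names :test #'string=))))
             table)
    names))

;; Small demonstration table whose only key lives in the KEYWORD package.
(defparameter *probe-table* (make-hash-table))
(setf (gethash (intern "5A0" :keyword) *probe-table*) '("ABATE"))

(key-packages *probe-table*)   ; the packages the stored keys live in
(package-name *package*)       ; the package an unqualified INTERN would use
```

Evaluating the last two forms inside a worker thread, and comparing the results, shows directly whether the lookup keys are being interned in a different package from the stored ones.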

huangapple
  • Posted on March 4, 2023, 06:18:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75632322.html