Loading data using `UnstructuredURLLoader` of LangChain halts with `TP_NUM_C_BUFS too small: 50`

Question

I am trying to reproduce the code from the LangChain documentation (URL - 🦜🔗 LangChain 0.0.167) that loads HTML pages from a list of URLs into documents, which can then be passed to a natural language processing model for downstream tasks. However, the call url_data = url_loader.load() hangs for over half an hour without loading any HTML file.

I also get a stack trace and cannot interpret the error message TP_NUM_C_BUFS too small: 50. The same error was reported in the LangChain repository in an issue that is now marked as resolved (link). The author of that issue wrote that running the script from the Windows command prompt made the error go away. Running my script from the Windows command prompt did not resolve the problem for me.

Would anyone be able to identify the source of this problem and provide a solution?

MWE

from langchain.document_loaders import UnstructuredURLLoader
import session_info

session_info.show()

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

print(urls)

loader = UnstructuredURLLoader(urls=urls)

print(loader)

data = loader.load()

print(data)

Execution result of MWE

D:\path>C:/Python310/python.exe d:/path/src/langchain-url-mwe.py
-----
langchain           0.0.157
session_info        1.0.0
-----
Python 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)]
Windows-10-10.0.19045-SP0
-----
Session information updated at 2023-05-14 21:22
['https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023', 'https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023']
<langchain.document_loaders.url.UnstructuredURLLoader object at 0x0000022A495A7C40>
      0 [main] python (12524) C:\Python310\python.exe: *** fatal error - Internal error: TP_NUM_C_BUFS too small: 50
Stack trace:
Frame        Function    Args
045573E7140  001800629AE (001802AEB6B, 00180274E41, 00180318880, 045573E4A70)
045573E7140  0018004846A (00000000020, 045573E5C94, 00000000001, 00040400000)
045573E7140  001800484A2 (00000000032, 000000010A1, 00180318880, 045573E5B70)
045573E7140  00180168194 (00061948FE1, 00000000000, 045573E5CA0, 045571C3000)
045573E7140  00180106358 (00000000000, 008000010A1, 00180318880, 22A4BD5DBC0)
045573E7140  0018005ADC3 (045573E7140, 00000000000, 0018024EBA0, 008000003E0)
045573E7140  0018014E0B4 (7FFD54886255, 00000000016, 008000003E0, 004B394F3A6)
045573E7140  00180198EEB (7FFD54886255, 00000000016, 008000003E0, 004B394F3A6)
045573E7140  004B393212C (00000000020, 22A487BBAF8, 00000000000, 22A6C7F4120)
045573E7140  004B3936867 (00000000000, 00000000001, 00000000000, 045573E7170)
045573E7140  7FFD1B0F4461 (0000000000A, 045573E73A0, 22A6C7F3D60, D6D941F662B1)
045573E7170  7FFD1B0F418D (00000000000, 045573E7390, 00000000000, 00000000000)
045573E73E0  7FFD1B0F4042 (22A487D3B10, 045573E7278, 7FFD1BD4B4D8, 045573E7380)
045573E73E0  7FFD1B1032B5 (004B39319A0, 045573E7390, 045573E73A0, 22A00000000)
045573E73E0  7FFD1B102EC8 (22A6DC70940, 22A6DB2F030, 00000000000, 04500001101)
00000000000  7FFD1B102A8C (7FFD1B102940, 00000000000, 00000000000, 00000000000)
045573E7610  7FFD1BD33457 (00000000001, 22A487D3B10, 22A6DBB9BA8, 00000000001)
045573E7970  7FFD1BD2F616 (00000000000, 045573E7C40, 00000000000, 00000000000)
22A6DC85D10  7FFD1BD49BAD (22A6DC7FC00, 90DC8CB43D62805D, 00000000000, 00000000000)
22A6DC85D10  7FFD1BDDCDB6 (7FFD1C1017F8, 22A6DC85D10, 22A6DBCF0E0, 00000000000)
7FFD1C1017F8  7FFD1BDDCC3F (22A6DC70918, 7FFD1C0F32A0, 00000000000, 22A6DC70918)
22A6DC7FC00  7FFD1BDDCB3F (00000000002, 22A49634BD0, 00000000000, 22A6DC70900)
00000000002  7FFD1BDA0FDD (22A487D3B10, 00000000002, 172B7BEF138823D3, 00000000001)
22A487D3B10  7FFD1BD593C8 (22A487D3B10, 22A6DC7FE00, 00000000000, 045573E7FE0)
22A6DC7FE00  7FFD1BD5919F (22A496350D0, 22A6DC70900, 22A487D3B10, 22A6DBAD7E0)
22A496350D0  7FFD1BD592E3 (045573E7F50, 00000000001, 22A48828EA0, 22A6DC7FE00)
045573E7F50  7FFD1BD345CE (00000000002, 22A496015B0, 22A6C709708, 00000000003)
045573E8300  7FFD1BD2D787 (22A6C709700, 8000000000000003, 22A6C709700, 00000000002)
045573E8300  7FFD1BD30018 (00000000002, 22A487D3B10, 22A6C70AE30, 00000000002)
045573E8660  7FFD1BD2F616 (00000000001, 22A49602180, 22A6C70A470, 00000000001)
045573E8A10  7FFD1BD2D787 (22A6C70A468, 8000000000000001, 22A6DC83EB0, 22A4962C7F0)
045573E8A10  7FFD1BD2ECE4 (00000000002, 22A49603380, 22A6DBE81D0, 00000000002)
End of stack trace (more stack frames may be present)
  23855 [main] python (12524) C:\Python310\python.exe: *** fatal error - Internal error: TP_NUM_C_BUFS too small: 50

Answer 1

Score: 5

In fact, installing libmagic via Python solved my problem.

pip install python-magic python-magic-bin

To parse HTML (and PDF) with url_loader.load(), you may also need to install the following packages:

pip install tabulate pdf2image pytesseract
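
Before calling loader.load() again, it can help to confirm that the packages above are actually importable from the interpreter that runs the script. The following is only a minimal sketch: the helper name missing_modules and the list OPTIONAL_MODULES are made up for illustration, and the import names (magic for python-magic, plus tabulate, pdf2image, pytesseract) are assumptions about how the pip packages map to modules.

```python
import importlib.util

# Assumed import names for the pip packages mentioned in this answer
# (python-magic is assumed to be imported as "magic").
OPTIONAL_MODULES = ["magic", "tabulate", "pdf2image", "pytesseract"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(OPTIONAL_MODULES)
    if missing:
        print("Not importable from this interpreter:", ", ".join(missing))
    else:
        print("All listed modules can be found.")
```

Run it with the same interpreter as the MWE (C:/Python310/python.exe in the question), since pip may have installed the packages into a different Python environment.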

Posted on 2023-05-14 20:41:32