2023年7月13日 12:38:55go评论109阅读模式

英文:

How can I use `langchain.document_loaders.PyPDFLoader` for pdf documents uploaded on StreamLit?

问题

我试图使用 StreamLit 构建一个 Web 应用程序来阅读文档（主要是 PDF），并使用 langchain.document_loaders.PyPDFLoader 加载数据，但我遇到了以下错误：

TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

随后是：

File &quot;/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py&quot;, line 552, in _run_script
    exec(code, module.__dict__)
File &quot;/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py&quot;, line 133, in &lt;module&gt;
    main()
File &quot;/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py&quot;, line 75, in main
    loader = PyPDFLoader(pdf)
             ^^^^^^^^^^^^^^^^
File &quot;/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py&quot;, line 92, in __init__
    super().__init__(file_path)
File &quot;/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py&quot;, line 42, in __init__
    if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File &quot;&lt;frozen genericpath&gt;&quot;, line 30, in isfile

在我的代码中，我实际上是使用 Streamlit 上传文档：

import streamlit as st
from langchain.document_loaders import PyPDFLoader
uploaded_file = st.file_uploader("上传 PDF", type="pdf")
if uploader_file is not None:
    loader = PyPDFLoader(uploaded_file)

我试图使用 PyPDFLoader 是因为我需要保存文档的来源信息，例如页面编号。

我尝试按页将 PDF 文档的每一页的文本添加如下：

from PyPDF2 import PdfReader
import streamlit as st
uploaded_file = st.file_uploader("上传 PDF", type="pdf")
if uploaded_file is not None:
    texts = ""
    reader = PdfReader(uploaded_file)
    for page in reader.pages:
        texts += page.extract_text()

但在这种情况下，我丢失了我在这种情况下需要的页面编号信息。

英文:

I am trying to build a webapp using StreamLit for reading documents (mainly pdf) and load the data using langchain.document_loaders.PyPDFLoader but I am ending up with an error as follows:

TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

followed by :

File &quot;/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py&quot;, line 552, in _run_script
    exec(code, module.__dict__)
File &quot;/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py&quot;, line 133, in &lt;module&gt;
    main()
File &quot;/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py&quot;, line 75, in main
    loader = PyPDFLoader(pdf)
             ^^^^^^^^^^^^^^^^
File &quot;/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py&quot;, line 92, in __init__
    super().__init__(file_path)
File &quot;/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py&quot;, line 42, in __init__
    if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File &quot;&lt;frozen genericpath&gt;&quot;, line 30, in isfile

In my code, I am actually uploading document (in streamlit) using:

import streamlit as st
from langchain.document_loaders import PyPDFLoader
uploaded_file = st.file_uploader(&quot;Upload PDF&quot;, type=&quot;pdf&quot;)
if uploader_file is not None:
    loader = PyPDFLoader(uploaded_file)

I am trying to use PyPDFLoader because I need the source of the documents such as page numbers to be saved up.

I tried adding the texts of each page in the pdf document page-wise as follows:

from PyPDF2 import PdfReader
import streamlit as st
uploaded_file = st.file_uploader(&quot;Upload PDF&quot;, type=&quot;pdf&quot;)
if uploaded_file is not None:
    texts = &quot;&quot;
    reader = PdfReader(uploaded_file)
    for page in reader.pages:
        texts += page.extract_text()

But in this case, I have lost the information of the page number which I need in my case.

答案1

得分: 1

PyPdfLoader接受一个字符串类型的file_path参数。这意味着你不能直接传递上传的文件。

你可以将文件保存到临时位置，然后将file_path传递给pdf加载器，然后进行清理。

# 临时保存文件
tmp_location = os.path.join('/tmp', file.filename)
loader = PyPDFLoader(tmp_location)
pages = loader.load_and_split()
# 在这里进行你需要的操作
# 清理
if isinstance(file, Path):
   metadata.update({'file_name': file.name})

英文:

PyPdfLoader takes in file_path which is a string. That means you cannot directly pass the uploaded file.

What you can do is save the file to a temporary location and pass the file_path to pdf loader, then clean up afterwards.

# save the file temporarily
tmp_location = os.path.join(&#39;/tmp&#39;, file.filename)
loader = PyPDFLoader(tmp_location)
pages = loader.load_and_split()
# do whatever you need here
# clean up
if isinstance(file, Path):
   metadata.update({&#39;file_name&#39;: file.name})

I hope this helps.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

How can I use `langchain.document_loaders.PyPDFLoader` for pdf documents uploaded on StreamLit?

问题

答案1

Altair 图表在容器之外，而容器宽度设置为 true。

TFIDFVectorizer 制作拼接的单词标记

更改在Jupyter Notebook中使用魔法命令设置的环境变量。

403 Forbidden 453 – You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints only

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。