How can I use `langchain.document_loaders.PyPDFLoader` for PDF documents uploaded in Streamlit?

Question

I am trying to build a web app with Streamlit that reads documents (mainly PDFs) and loads the data using langchain.document_loaders.PyPDFLoader, but I end up with the following error:

TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

followed by:

File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 133, in <module>
    main()
File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 75, in main
    loader = PyPDFLoader(pdf)
             ^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 92, in __init__
    super().__init__(file_path)
File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 42, in __init__
    if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen genericpath>", line 30, in isfile
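
The "not list" at the end of the message suggests that the value passed to PyPDFLoader was a list rather than a path. That would be consistent with st.file_uploader being called with accept_multiple_files=True, although line 75 of app.py is not shown, so this is an assumption. The traceback ends in os.path.isfile, which calls os.stat, and the same TypeError can be reproduced with the standard library alone. The helper name `stat_error` below is hypothetical:

```python
import io
import os


def stat_error(path):
    """Return the TypeError message os.stat raises for an unsupported path type, else None."""
    try:
        os.stat(path)
    except TypeError as exc:
        return str(exc)
    except OSError:
        return None  # accepted path type; the file simply does not exist
    return None


print(stat_error(["a.pdf", "b.pdf"]))  # a list, like the value in the traceback
print(stat_error(io.BytesIO(b"")))     # a file-like object, like a single upload
print(stat_error("missing.pdf"))       # a plain string is an accepted path type
```

Either way, a list of uploads or a single uploaded file object is not a filesystem path, which is what the loader expects.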

In my code, I upload the document in Streamlit using:

import streamlit as st
from langchain.document_loaders import PyPDFLoader

uploaded_file = st.file_uploader("Upload PDF", type="pdf")
if uploaded_file is not None:
    loader = PyPDFLoader(uploaded_file)

I want to use PyPDFLoader because I need to keep source information for the documents, such as page numbers.

I also tried extracting the text of the PDF page by page as follows:

from PyPDF2 import PdfReader
import streamlit as st

uploaded_file = st.file_uploader("Upload PDF", type="pdf")

if uploaded_file is not None:
    texts = ""
    reader = PdfReader(uploaded_file)
    for page in reader.pages:
        texts += page.extract_text()

But in this case I lose the page number information, which I need.
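
The page numbers are lost above only because every page is concatenated into one string. A minimal sketch of keeping them, assuming one record per page is acceptable; the function name `pages_with_numbers` and the record keys are hypothetical, not part of PyPDF2 or LangChain:

```python
from typing import Any, Dict, Iterable, List


def pages_with_numbers(pages: Iterable[Any], source: str) -> List[Dict[str, Any]]:
    """Pair each page's text with its 1-based page number and a source name.

    `pages` is any iterable whose items have extract_text(),
    for example PdfReader(uploaded_file).pages.
    """
    records = []
    for page_number, page in enumerate(pages, start=1):
        records.append({
            "page": page_number,
            "source": source,
            # Fall back to an empty string if a page yields no text.
            "text": page.extract_text() or "",
        })
    return records


# Intended use inside the Streamlit app (not executed here):
#   reader = PdfReader(uploaded_file)
#   records = pages_with_numbers(reader.pages, uploaded_file.name)
```

Each record can then be turned into whatever document object the rest of the pipeline expects, with the page number carried along as metadata.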

Answer 1

Score: 1

PyPDFLoader takes a file_path, which is a string. That means you cannot pass the uploaded file directly.

What you can do is save the file to a temporary location, pass that file_path to the PDF loader, and clean up afterwards.

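import os
from langchain.document_loaders import PyPDFLoader

# save the file temporarily
# (uploaded_file is the object returned by st.file_uploader)
tmp_location = os.path.join('/tmp', uploaded_file.name)
with open(tmp_location, 'wb') as f:
    f.write(uploaded_file.getvalue())

loader = PyPDFLoader(tmp_location)
pages = loader.load_and_split()

# do whatever you need here

# clean up
os.remove(tmp_location)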
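
A hard-coded '/tmp' does not exist on every platform, and two uploads with the same name would overwrite each other. One possible variation, sketched here with the standard tempfile module, writes the bytes to a uniquely named file and removes it when done. The helper name `temporary_pdf` is hypothetical:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_pdf(data: bytes):
    """Write `data` to a uniquely named .pdf file, yield its path, then delete it."""
    # delete=False so the file can be reopened by path after this handle is closed.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        yield path
    finally:
        os.remove(path)


# Intended use in the Streamlit app (not executed here):
#   with temporary_pdf(uploaded_file.getvalue()) as path:
#       pages = PyPDFLoader(path).load_and_split()
```

Loading has to finish inside the `with` block, because the file is gone once the block exits.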

I hope this helps.

huangapple
  • Published on 2023-07-13 12:38:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76675978.html