How can I use `langchain.document_loaders.PyPDFLoader` for pdf documents uploaded on StreamLit?

Question

I am trying to build a web app with Streamlit that reads documents (mainly PDFs) and loads the data with `langchain.document_loaders.PyPDFLoader`, but I end up with the following error:

    TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

followed by:

    File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
      exec(code, module.__dict__)
    File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 133, in <module>
      main()
    File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 75, in main
      loader = PyPDFLoader(pdf)
               ^^^^^^^^^^^^^^^^
    File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 92, in __init__
      super().__init__(file_path)
    File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 42, in __init__
      if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "<frozen genericpath>", line 30, in isfile

In my code, I upload the document in Streamlit like this:

    import streamlit as st
    from langchain.document_loaders import PyPDFLoader

    uploaded_file = st.file_uploader("Upload PDF", type="pdf")
    if uploaded_file is not None:
        # The uploaded file object (not a path) is passed here, which triggers the TypeError
        loader = PyPDFLoader(uploaded_file)

I want to use `PyPDFLoader` because I need to keep source information for the documents, such as page numbers.

I also tried extracting the text of the PDF page by page, as follows:

    from PyPDF2 import PdfReader
    import streamlit as st

    uploaded_file = st.file_uploader("Upload PDF", type="pdf")
    if uploaded_file is not None:
        texts = ""
        reader = PdfReader(uploaded_file)
        for page in reader.pages:
            texts += page.extract_text()

But this way I lose the page numbers, which I need.

Answer 1

Score: 1

`PyPDFLoader` takes a `file_path` argument, which must be a string (a local path or a URL). That means you cannot pass the uploaded file object directly.

What you can do is save the file to a temporary location, pass that path to the PDF loader, and clean up afterwards.

    import os
    from langchain.document_loaders import PyPDFLoader

    # save the file temporarily (uploaded_file comes from st.file_uploader)
    tmp_location = os.path.join('/tmp', uploaded_file.name)
    with open(tmp_location, 'wb') as f:
        f.write(uploaded_file.getvalue())
    try:
        loader = PyPDFLoader(tmp_location)
        pages = loader.load_and_split()
        # do whatever you need here
    finally:
        # clean up
        os.remove(tmp_location)
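
The save, load, and clean-up steps above can also be wrapped in a small context manager so the temporary file is removed even if loading fails. This is only a minimal sketch using the standard library; the helper name `temp_pdf_path` is hypothetical, and it assumes the upload's raw bytes are available (for a Streamlit upload, via `uploaded_file.getvalue()`).

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_pdf_path(data: bytes, suffix: str = ".pdf"):
    """Write `data` to a temporary file, yield its path, then delete the file.

    `temp_pdf_path` is a hypothetical helper, not part of Streamlit or LangChain.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(data)
        tmp.close()  # close first so another reader can open the path
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        if os.path.exists(tmp.name):
            os.remove(tmp.name)


# Assumed usage inside the Streamlit app from the question:
# with temp_pdf_path(uploaded_file.getvalue()) as path:
#     pages = PyPDFLoader(path).load_and_split()
```

Using `tempfile` rather than a hard-coded `/tmp` keeps the sketch portable and avoids name clashes when two users upload files with the same name.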

I hope this helps.
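
As a possible alternative that avoids the temporary file, the `PdfReader` loop from the question can keep page numbers by building one record per page instead of concatenating everything into a single string. The sketch below is an assumption-laden illustration: `pages_to_records` is a hypothetical helper, and the `source`/`page` metadata layout (with a 0-based page index) is chosen to resemble what `PyPDFLoader` attaches to its documents.

```python
from typing import Any, Dict, Iterable, List


def pages_to_records(pages: Iterable[Any], source: str) -> List[Dict[str, Any]]:
    """Return one record per page so the page number survives extraction.

    `pages` is any iterable of objects exposing `extract_text()`,
    for example `PdfReader(uploaded_file).pages`.
    """
    records = []
    for page_number, page in enumerate(pages):
        records.append(
            {
                # Fall back to an empty string if a page yields no text
                "text": page.extract_text() or "",
                # 0-based page index, stored alongside the file name
                "metadata": {"source": source, "page": page_number},
            }
        )
    return records


# Assumed usage with the objects from the question:
# records = pages_to_records(PdfReader(uploaded_file).pages, uploaded_file.name)
```

Each record can then be turned into whatever document type the rest of the pipeline expects.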

huangapple
  • Posted on July 13, 2023 at 12:38:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76675978.html