How can I use `langchain.document_loaders.PyPDFLoader` for PDF documents uploaded in Streamlit?
Question
I am trying to build a web app with Streamlit that reads documents (mainly PDFs) and loads the data using `langchain.document_loaders.PyPDFLoader`, but I end up with the following error:
    TypeError: stat: path should be string, bytes, os.PathLike or integer, not list
followed by:
  File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 133, in <module>
    main()
  File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 75, in main
    loader = PyPDFLoader(pdf)
             ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 92, in __init__
    super().__init__(file_path)
  File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 42, in __init__
    if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen genericpath>", line 30, in isfile
In my code, I upload the document in Streamlit using:
    import streamlit as st
    from langchain.document_loaders import PyPDFLoader

    uploaded_file = st.file_uploader("Upload PDF", type="pdf")
    if uploaded_file is not None:
        loader = PyPDFLoader(uploaded_file)
I want to use `PyPDFLoader` because I need to keep source information for the documents, such as page numbers.
I also tried collecting the text of each page of the PDF as follows:
    from PyPDF2 import PdfReader
    import streamlit as st

    uploaded_file = st.file_uploader("Upload PDF", type="pdf")
    if uploaded_file is not None:
        texts = ""
        reader = PdfReader(uploaded_file)
        for page in reader.pages:
            texts += page.extract_text()
But in this case I lose the page number information, which I need.
Answer 1
Score: 1
`PyPDFLoader` takes a `file_path` argument, which must be a string. That means you cannot pass the uploaded file object directly.
What you can do is save the file to a temporary location, pass that `file_path` to the PDF loader, and clean up afterwards.
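    import os

    # save the file temporarily (Streamlit's UploadedFile exposes .name and .getvalue())
    tmp_location = os.path.join('/tmp', uploaded_file.name)
    with open(tmp_location, 'wb') as f:
        f.write(uploaded_file.getvalue())
    loader = PyPDFLoader(tmp_location)
    pages = loader.load_and_split()
    # do whatever you need here, e.g. record the original file name
    for page in pages:
        page.metadata.update({'file_name': uploaded_file.name})
    # clean up
    os.remove(tmp_location)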
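The save, load, clean-up sequence described above can also be wrapped in a small context manager so that the temporary file is removed even if loading fails. This is only a sketch using the standard library; the helper name `temporary_pdf_path` is hypothetical and not part of Streamlit or LangChain.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_pdf_path(data: bytes, suffix: str = ".pdf") -> Iterator[str]:
    """Write `data` to a temporary file, yield its path, and delete it on exit.

    `temporary_pdf_path` is a hypothetical helper name used for illustration.
    """
    # mkstemp creates the file and returns an open descriptor plus the path.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs whether or not the body of the `with` block raised.
        os.remove(path)
```

In the Streamlit app this would presumably be used as `with temporary_pdf_path(uploaded_file.getvalue()) as path: pages = PyPDFLoader(path).load_and_split()`, so the loader always receives a plain string path.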
I hope this helps.
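If a temporary file is not wanted, the page-by-page loop from the question can keep the page numbers by storing one record per page instead of concatenating everything into one string. The sketch below assumes a record layout of `page_content` plus `metadata` with `source` and a zero-based `page` index, chosen to resemble what `PyPDFLoader` produces; the function name `pages_with_numbers` is hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def pages_with_numbers(page_texts: Iterable[Optional[str]], source: str) -> List[Dict]:
    """Pair each page's text with its zero-based page index and a source label.

    `pages_with_numbers` and the record layout are assumptions for illustration.
    """
    records = []
    for index, text in enumerate(page_texts):
        records.append(
            {
                # Treat a missing extraction result as an empty page.
                "page_content": text or "",
                "metadata": {"source": source, "page": index},
            }
        )
    return records
```

With the question's code, the input could be `(page.extract_text() for page in PdfReader(uploaded_file).pages)` and the source `uploaded_file.name`.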