Langchain pyPDFLoader

Question

I am currently trying to get started working with Langchain. I am working in Anaconda/Spyder IDE:

    # Imports
    import os
    from langchain.llms import OpenAI
    from langchain.document_loaders import TextLoader
    from langchain.document_loaders import PyPDFLoader
    from langchain.indexes import VectorstoreIndexCreator
    import streamlit as st
    from streamlit_chat import message
    # Set API keys and the models to use
    API_KEY = "MY API KEY HERE"
    model_id = "gpt-3.5-turbo"
    os.environ["OPENAI_API_KEY"] = API_KEY
    pdf_path = '.\Paris.pdf'
    loaders = PyPDFLoader(".\Paris.pdf")

I then run it with:

    streamlit run c:\users\myname\.spyder-py3\untitled0.py [ARGUMENTS]

The Streamlit app starts and opens in the browser, but it shows this error:

    ValueError: File path .\Paris.pdf is not a valid file or url

I have checked carefully, and the PDF is in fact located in the correct directory (i.e. the directory where the Python script is located).

As a test I also tried:

    # Imports
    from PyPDF2 import PdfReader
    pdf_path = './Paris.pdf'
    with open(pdf_path, 'rb') as file:
        pdf = PdfReader(file)
        num_pages = len(pdf.pages)
        for page_number in range(num_pages):
            page = pdf.pages[page_number]
            page_text = page.extract_text()
            print(f"Page {page_number + 1}:\n{page_text}")

This worked perfectly.
Note that I used the same path as with the langchain/streamlit version.
I have installed langchain (multiple times), pyPDF and streamlit.

I then tried:

    import os
    from langchain.document_loaders import PyPDFLoader
    loader = PyPDFLoader(".\Paris.pdf")
    pages = loader.load_and_split()
    print(pages)

That works.
What is wrong in the first code snippet that causes the file path to raise an exception?

I investigated further, and it turns out that adding the Streamlit components to the code causes the file path issue to occur.

Answer 1

Score: 1

Since the error is triggered by the Streamlit components, I would suggest using Streamlit's file_uploader method instead:

    import streamlit as st
    uploaded_file = st.file_uploader("Upload your PDF")

In this case, however, you will have to read the PDF file a different way, using PyPDF2.PdfReader:

    import streamlit as st
    from PyPDF2 import PdfReader
    uploaded_file = st.file_uploader("Upload your PDF")
    if uploaded_file is not None:
        reader = PdfReader(uploaded_file)

If you need the uploaded PDF as Document objects (the format you get when loading a file through langchain.document_loaders.PyPDFLoader), you can do the following:

    import streamlit as st
    from PyPDF2 import PdfReader
    from langchain.docstore.document import Document
    uploaded_file = st.file_uploader("Upload your PDF")
    if uploaded_file is not None:
        docs = []
        reader = PdfReader(uploaded_file)
        # Number pages from 1 and wrap each page's text in a Document
        for i, page in enumerate(reader.pages, start=1):
            docs.append(Document(page_content=page.extract_text(), metadata={'page': i}))
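
If you would rather keep PyPDFLoader, which expects a file path, one possible workaround (not part of the answer above) is to write the uploaded bytes to a temporary file first. The sketch below is a minimal illustration under stated assumptions: the helper name save_to_temp_pdf is made up, and the commented usage assumes the object returned by st.file_uploader exposes its contents via getvalue().

```python
import os
import tempfile


def save_to_temp_pdf(data: bytes) -> str:
    """Write raw PDF bytes to a temporary .pdf file and return its path.

    Hypothetical helper for illustration; the caller is responsible for
    deleting the file afterwards.
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so another library can reopen it by path.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        return tmp.name


# Assumed usage inside the Streamlit script (not run here):
#   if uploaded_file is not None:
#       pdf_path = save_to_temp_pdf(uploaded_file.getvalue())
#       pages = PyPDFLoader(pdf_path).load_and_split()
#       os.remove(pdf_path)  # clean up once the pages are loaded
```

The temporary file lives outside the project folder, so this sidesteps the relative-path question entirely, at the cost of one extra copy of the PDF on disk.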

Answer 2

Score: 0

The error ValueError: File path .\Paris.pdf is not a valid file or url is raised by LangChain itself. See the source code:

https://api.python.langchain.com/en/latest/_modules/langchain/document_loaders/pdf.html

    def __init__(self, file_path: str, *, headers: Optional[Dict] = None):
        """Initialize with a file path.
        ...
            self.file_path = str(temp_pdf)
        elif not os.path.isfile(self.file_path):
            raise ValueError("File path %s is not a valid file or url" % self.file_path)

The check os.path.isfile(self.file_path) comes from the os.path module, which handles paths using the platform's separator, os.sep. From https://docs.python.org/3/library/os.html#os.sep:

> The character used by the operating system to separate pathname components. This is '/' for POSIX and '\' for Windows. Note that knowing this is not sufficient to be able to parse or concatenate pathnames — use os.path.split() and os.path.join() — but it is occasionally useful. Also available via os.path.
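
One detail worth checking alongside the separator: os.path.isfile resolves a relative path against the process's current working directory, which is not necessarily the folder containing the script. The following is a minimal diagnostic sketch (the helper name describe_path is invented for illustration) that reports what a relative path actually resolves to.

```python
import os


def describe_path(path: str) -> dict:
    """Report how a (possibly relative) path resolves in this process.

    Hypothetical diagnostic helper, not part of LangChain or Streamlit.
    """
    return {
        "cwd": os.getcwd(),                 # where relative paths start from
        "absolute": os.path.abspath(path),  # what the path resolves to
        "is_file": os.path.isfile(path),    # the check used in the excerpt above
    }


# Assumed usage: call this once from Spyder and once under `streamlit run`,
# then compare the "cwd" values.
#   print(describe_path("Paris.pdf"))
```

If the two runs report different working directories, that would explain why the same relative path is found in one case and rejected in the other.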

Using os.path.join(subdir, fname) to build paths is recommended.
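
As a rough sketch of that recommendation, the path can be built from the script's own location so that it no longer depends on where the command is launched from. The helper name path_beside is hypothetical, and the commented usage assumes the PDF sits next to the script, as described in the question.

```python
import os


def path_beside(script_file: str, filename: str) -> str:
    """Return an absolute path to `filename` in the same folder as `script_file`."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    # os.path.join inserts the correct separator for the current platform.
    return os.path.join(script_dir, filename)


# Assumed usage inside the Streamlit script (not run here):
#   pdf_path = path_beside(__file__, "Paris.pdf")
#   loader = PyPDFLoader(pdf_path)
```

Passing an absolute path also avoids writing a backslash inside a normal string literal, where sequences such as "\n" or "\t" would be interpreted as escape characters.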

Posted by huangapple on June 8, 2023 at 20:01:20
When reposting, please keep the link to this article: https://go.coder-hub.com/76431655.html