2023-06-08 20:01:20

Langchain pyPDFLoader

Question

I am currently trying to get started working with Langchain. I am working in Anaconda/Spyder IDE:

# Imports
import os
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
import streamlit as st
from streamlit_chat import message
# Set API keys and the models to use
API_KEY = "MY API KEY HERE"
model_id = "gpt-3.5-turbo"
os.environ["OPENAI_API_KEY"] = API_KEY
pdf_path = '.\Paris.pdf'
loaders = PyPDFLoader(".\Paris.pdf")

I then run it with:

streamlit run c:\users\myname\.spyder-py3\untitled0.py [ARGUMENTS]

The Streamlit app starts and opens in the browser, but I get this error:

ValueError: File path .\Paris.pdf is not a valid file or url

I have checked carefully and the PDF is in fact located in the correct directory (i.e. the directory where the python script is located).

As a test I also tried:

# Imports
from PyPDF2 import PdfReader
pdf_path = './Paris.pdf'
with open(pdf_path, 'rb') as file:
    pdf = PdfReader(file)
    num_pages = len(pdf.pages)
    for page_number in range(num_pages):
        page = pdf.pages[page_number]
        page_text = page.extract_text()
        print(f"Page {page_number + 1}:\n{page_text}")

This worked perfectly.
Note that I used the same path as with the langchain/streamlit version.
I have installed langchain (multiple times), pyPDF and streamlit.

I then tried:

import os
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader(".\Paris.pdf")
pages = loader.load_and_split()
print(pages)

That works.
What is wrong in the first code snippet that causes the file path to raise an exception?

I investigated further, and it turns out that adding the Streamlit components to the code is what triggers the file path issue.

Answer 1

Score: 1

Since the error only appears once the Streamlit components are added, I would suggest using Streamlit's file_uploader method instead of a hard-coded path:

import streamlit as st
uploaded_file = st.file_uploader("Upload your PDF")

In that case, you have to read the PDF a different way, using PyPDF2.PdfReader:

import streamlit as st
from PyPDF2 import PdfReader
uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    reader = PdfReader(uploaded_file)

If you need the uploaded PDF as LangChain Document objects (the format that langchain.document_loaders.PyPDFLoader produces), you can do the following:

import streamlit as st
from PyPDF2 import PdfReader
from langchain.docstore.document import Document
uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    docs = []
    reader = PdfReader(uploaded_file)
    # Number pages from 1 and wrap each page's text in a Document
    for i, page in enumerate(reader.pages, start=1):
        docs.append(Document(page_content=page.extract_text(), metadata={'page': i}))
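
If a path-based loader is still preferred, one option is to write the uploaded bytes to a temporary file first. The sketch below assumes the uploaded object behaves like io.BytesIO and exposes getvalue(), as Streamlit's UploadedFile does. save_upload_to_temp is a hypothetical helper name, not part of Streamlit or LangChain, and an in-memory buffer stands in for the real upload.

```python
import io
import os
import tempfile


def save_upload_to_temp(uploaded_file, suffix=".pdf"):
    """Write an uploaded file's bytes to a temp file and return its path.

    `uploaded_file` is assumed to be a BytesIO-like object with getvalue().
    The caller is responsible for deleting the file afterwards.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded_file.getvalue())
        return tmp.name


# Stand-in for the object returned by st.file_uploader(...)
fake_upload = io.BytesIO(b"%PDF-1.4 example bytes")
temp_path = save_upload_to_temp(fake_upload)
print(os.path.isfile(temp_path))

# Clean up once the file is no longer needed
os.remove(temp_path)
```

In an app, the returned path could be handed to PyPDFLoader before the temporary file is removed.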

Answer 2

Score: 0

The error ValueError: File path .\Paris.pdf is not a valid file or url is thrown from LangChain. See the source code at: https://api.python.langchain.com/en/latest/_modules/langchain/document_loaders/pdf.html


    def __init__(self, file_path: str, *, headers: Optional[Dict] = None):
        """Initialize with a file path.
...
                self.file_path = str(temp_pdf)
        elif not os.path.isfile(self.file_path):
            raise ValueError("File path %s is not a valid file or url" % self.file_path)
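
The isfile check in the excerpt above resolves a relative path against the process's current working directory, so the same string can pass or fail depending on where the command was launched. Here is a minimal sketch of that behavior, using throwaway temporary directories as stand-ins for the script folder and the launch folder. Whether this is what happened in the question is an assumption, not something the thread confirms.

```python
import os
import shutil
import tempfile

# Throwaway directory containing Paris.pdf, standing in for the script folder
script_dir = tempfile.mkdtemp()
with open(os.path.join(script_dir, "Paris.pdf"), "wb") as f:
    f.write(b"%PDF-1.4")

relative_path = os.path.join(".", "Paris.pdf")
original_cwd = os.getcwd()

# Working directory is some other folder: the relative path is not found
other_dir = tempfile.mkdtemp()
os.chdir(other_dir)
found_elsewhere = os.path.isfile(relative_path)

# Working directory is the folder holding the PDF: the same string resolves
os.chdir(script_dir)
found_in_script_dir = os.path.isfile(relative_path)

# Restore the working directory and remove the temporary folders
os.chdir(original_cwd)
shutil.rmtree(script_dir)
shutil.rmtree(other_dir)

print(found_elsewhere, found_in_script_dir)
```

Printing os.getcwd() at the top of the Streamlit script is a quick way to see which directory relative paths are resolved against.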

os.path.isfile(self.file_path) comes from the os.path module, which handles pathnames using the platform's separator, os.sep. You can find more information about os.sep here: https://docs.python.org/3/library/os.html#os.sep

The character used by the operating system to separate pathname components. This is '/' for POSIX and '\' for Windows. Note that knowing this is not sufficient to be able to parse or concatenate pathnames — use os.path.split() and os.path.join() — but it is occasionally useful.

It's recommended to use os.path.join(subdir, fname) for path operations.
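
One way to apply os.path.join here is to build the PDF path from the script's own location rather than from the current working directory. In this minimal sketch, path_beside is a hypothetical helper name and a temporary directory stands in for the real script folder. In an actual script, the first argument would be __file__.

```python
import os
import tempfile


def path_beside(script_file, filename):
    """Return an absolute path to `filename` in the same folder as `script_file`."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, filename)


# In a real script this would be: pdf_path = path_beside(__file__, "Paris.pdf")
demo_dir = tempfile.mkdtemp()
demo_script = os.path.join(demo_dir, "untitled0.py")
pdf_path = path_beside(demo_script, "Paris.pdf")
print(os.path.isabs(pdf_path), os.path.basename(pdf_path))

# Remove the temporary stand-in folder
os.rmdir(demo_dir)
```

An absolute path built this way does not change with the directory from which streamlit run is invoked.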
