Langchain pyPDFLoader

Question


I am trying to get started with LangChain, working in the Anaconda/Spyder IDE:

# Imports
import os 
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
import streamlit as st
from streamlit_chat import message



# Set API keys and the models to use
API_KEY = "MY API KEY HERE"
model_id = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = API_KEY

pdf_path = '.\Paris.pdf'
loaders = PyPDFLoader(".\Paris.pdf")

I then run it with:

streamlit run c:\users\myname\.spyder-py3\untitled0.py [ARGUMENTS]

The Streamlit app does run and opens in the browser, but I get this error:

ValueError: File path .\Paris.pdf is not a valid file or url


I have checked carefully and the PDF is in fact located in the correct directory (i.e. the directory where the python script is located).

As a test I also tried:

# Imports
from PyPDF2 import PdfReader

pdf_path = './Paris.pdf'

with open(pdf_path, 'rb') as file:
    pdf = PdfReader(file)
    num_pages = len(pdf.pages)

    for page_number in range(num_pages):
        page = pdf.pages[page_number]
        page_text = page.extract_text()
        print(f"Page {page_number + 1}:\n{page_text}")

This worked perfectly.
Note that I used the same path as with the langchain/streamlit version.
I have installed langchain (multiple times), pyPDF and streamlit.

I then tried:

import os

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(".\Paris.pdf")
pages = loader.load_and_split()
print(pages)

That works.
What is wrong in the first code snippet that causes the file path to raise an exception?

I investigated further, and it turns out that adding the Streamlit components to the code causes the file path issue.

Answer 1

Score: 1


Since the error comes from the Streamlit side, I suggest using Streamlit's file_uploader method as follows:

import streamlit as st

uploaded_file = st.file_uploader("Upload your PDF")

In this case, however, you will have to read the PDF file another way, using PyPDF2.PdfReader, as follows:

import streamlit as st
from PyPDF2 import PdfReader

uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    reader = PdfReader(uploaded_file)

If you need the uploaded PDF as a list of Document objects (the format you get when a file is loaded through langchain.document_loaders.PyPDFLoader), you can do the following:

import streamlit as st
from PyPDF2 import PdfReader
from langchain.docstore.document import Document

uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    docs = []
    reader = PdfReader(uploaded_file)
    # Number pages from 1 and wrap each page's text in a Document
    for i, page in enumerate(reader.pages, start=1):
        docs.append(Document(page_content=page.extract_text(), metadata={'page': i}))
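
If you would rather keep using PyPDFLoader, which expects a file path, one option is to write the uploaded bytes to a temporary file first. This is only a minimal sketch: the helper name save_upload_to_temp is hypothetical, and it assumes the upload's contents are available as bytes (Streamlit's uploaded file object provides them through getvalue()).

```python
import os
import tempfile


def save_upload_to_temp(data: bytes, suffix: str = ".pdf") -> str:
    """Write raw bytes to a temporary file and return its absolute path.

    Hypothetical helper, not part of Streamlit or LangChain.
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so a path-based loader can open it afterwards.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
    return os.path.abspath(tmp.name)
```

In the Streamlit script this could be used as path = save_upload_to_temp(uploaded_file.getvalue()), followed by PyPDFLoader(path).load_and_split(). Call os.remove(path) once the documents are loaded, since the temporary file is not deleted automatically.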

Answer 2

Score: 0

The error ValueError: File path .\Paris.pdf is not a valid file or url is thrown from LangChain. See the source code at: https://api.python.langchain.com/en/latest/_modules/langchain/document_loaders/pdf.html

def __init__(self, file_path: str, *, headers: Optional[Dict] = None):
    """Initialize with a file path."""
    ...
        self.file_path = str(temp_pdf)
    elif not os.path.isfile(self.file_path):
        raise ValueError("File path %s is not a valid file or url" % self.file_path)

The os.path.isfile(self.file_path) check comes from the os.path module, which works with the platform's path separator, os.sep. The Python documentation (https://docs.python.org/3/library/os.html#os.sep) describes os.sep as follows:

The character used by the operating system to separate pathname components. This is '/' for POSIX and '\' for Windows. Note that knowing this is not sufficient to be able to parse or concatenate pathnames — use os.path.split() and os.path.join() — but it is occasionally useful. Also available via os.path.
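
One detail worth checking here: os.path.isfile() resolves a relative path against the current working directory of the process, not against the directory that contains the script. The following diagnostic sketch (describe_path is a hypothetical helper name) can be dropped into the script to see where a relative path such as Paris.pdf actually points at run time:

```python
import os


def describe_path(path: str) -> dict:
    """Report how a (possibly relative) path resolves right now."""
    return {
        "cwd": os.getcwd(),                 # directory that relative paths start from
        "absolute": os.path.abspath(path),  # where the path actually points
        "is_file": os.path.isfile(path),    # same kind of check as in the excerpt above
    }


print(describe_path("Paris.pdf"))
```

If "cwd" turns out not to be the script's folder when the script is launched through streamlit run, that would be consistent with the ValueError in the question.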

It's recommended to use os.path.join(subdir, fname) for path operations.
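
As a sketch of that recommendation, the path can be built from the script's own directory so that it no longer depends on where the command was launched. This assumes Paris.pdf sits next to the script; path_next_to is a hypothetical helper name, not a LangChain or Streamlit function.

```python
import os


def path_next_to(script_file: str, fname: str) -> str:
    """Return an absolute path to fname in the same directory as script_file."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, fname)


# Example with an explicit script location; in a real script, pass __file__ instead.
print(path_next_to(os.path.join("some_dir", "app.py"), "Paris.pdf"))
```

Inside the script itself this would typically read loader = PyPDFLoader(path_next_to(__file__, "Paris.pdf")), where __file__ is the path of the running script.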

Posted by huangapple on June 8, 2023 at 20:01:20
Original link: https://go.coder-hub.com/76431655.html