f-string to pass file path issue

Question

I have a function which accepts a file path. It's as below:

def document_loader(doc_path: str) -> Optional[Document]:
    """This function takes in a document in a particular format and
    converts it into a Langchain Document object.

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")

PyPDFLoader is a wrapper around PyPDF2 that reads a PDF from a file path.

Now, when I call the function with a hardcoded file path string, as below:

document_loader('/Users/Documents/hack/data/abc.pdf')

The function works fine and is able to read the PDF file.

But now I want a user to upload their PDF file via Streamlit's file_uploader(), as below:

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
print(st.session_state.uploaded_file)
if uploaded_file is not None:
    filename = st.session_state.uploaded_file.name
    print(os.path.abspath(st.session_state.uploaded_file.name))
    document_loader(f'"{os.path.abspath(filename)}"')

I get the error:

ValueError: File path "/Users/Documents/hack/data/abc.pdf" is not a valid file or url

The statement print(os.path.abspath(st.session_state.uploaded_file.name)) prints the same path as the hardcoded one.

Note: Streamlit is currently running on localhost on my laptop, and I am the "user" trying to upload a PDF via the locally running Streamlit app.

Edit 1:

As @MAtchCatAnd suggested, I added tempfile and it WORKS. But there is an issue:

The function that receives temp_file_path re-runs every time a user interacts with the app. This is because the temp file path changes on every run, which makes the function re-run even though I have decorated it with @st.cache_data.

The uploaded PDF file remains the same, so I don't want the function to re-run, as each run incurs some cost.

How can I fix this? I see that Streamlit has deprecated the allow_mutation=True parameter in st.cache.

Here's the code:

@st.cache_data
def document_loader(doc_path: str) -> Optional[Document]:
    """This function takes in a document in a particular format and
    converts it into a Langchain Document object.

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
if uploaded_file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
        print(temp_file_path)

    custom_qa = document_loader(temp_file_path)

Answer 1

Score: 3

The object returned by st.file_uploader is a "file-like" object inheriting from BytesIO.

From the docs:
> The UploadedFile class is a subclass of BytesIO, and therefore it is "file-like". This means you can pass them anywhere where a file is expected.

While the returned object does have a name attribute, it has no path. It exists in memory and is not associated with a real, saved file. Though Streamlit may be run locally, it actually has a server-client structure in which the Python backend is usually on a different computer than the user's. As such, the file_uploader widget is not designed to provide any real access or pointer to the user's file system.
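
To make the point about name concrete, here is a minimal standard-library sketch. The literal "abc.pdf" is an assumed stand-in for uploaded_file.name. It shows that os.path.abspath only prepends the current working directory to a bare file name, and that wrapping the result in double quotes inside an f-string makes the quote characters part of the path string, which may be why the quotes show up in the error message in the question.

```python
import os

name = "abc.pdf"  # assumed stand-in for uploaded_file.name: a bare name, no directory

# abspath() builds a path from the current working directory and the name.
# It does not check that such a file exists on disk.
guessed_path = os.path.abspath(name)

# The inner double quotes are literal characters in the resulting string,
# so this is a different string from guessed_path.
quoted_path = f'"{guessed_path}"'

print(guessed_path)
print(quoted_path)
print(os.path.exists(quoted_path))
```

Even without the extra quotes, the guessed path only works if a file with that name happens to sit in the server's working directory.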

You should do one of the following:

  1. use a method that allows you to pass a file buffer instead of a path,
  2. save the file to a new, known path,
  3. use tempfiles
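
Options 1 and 2 can be sketched with the standard library alone. This is only an illustration under assumptions: an io.BytesIO object stands in for the uploaded file, and read_rows, save_to_known_path, upload_name and the "uploads" directory are hypothetical names, not part of Streamlit.

```python
import csv
import io
import tempfile
from pathlib import Path

# Stand-in for the object returned by st.file_uploader (a BytesIO subclass).
uploaded = io.BytesIO(b"a,b\n1,2\n")
upload_name = "data.csv"  # stand-in for uploaded_file.name


def read_rows(buffer: io.BytesIO) -> list:
    """Option 1: parse directly from the in-memory buffer, no path involved."""
    text = buffer.getvalue().decode("utf-8")
    return list(csv.reader(io.StringIO(text)))


def save_to_known_path(buffer: io.BytesIO, directory: Path, name: str) -> Path:
    """Option 2: write the bytes to a location the server controls."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / Path(name).name  # drop any directory parts from the name
    target.write_bytes(buffer.getvalue())
    return target


upload_dir = Path(tempfile.mkdtemp()) / "uploads"  # hypothetical location
rows = read_rows(uploaded)
saved = save_to_known_path(uploaded, upload_dir, upload_name)
print(rows)
print(saved)
```

Option 1 only helps when the consuming library accepts a file-like object. A loader that insists on a path needs option 2 or 3.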

Here is a brief example working with temp files; there is also another question about them that may be helpful.

import streamlit as st
import tempfile
import pandas as pd

file = st.file_uploader('Upload a file', type='csv')
tempdir = tempfile.gettempdir()

if file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(file.read())
        tf_path = tf.name
    st.write(tf_path)
    df = pd.read_csv(tf_path)
    st.write(df)
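
One thing to keep in mind with delete=False is that the temp file stays on disk until something removes it. A minimal sketch of one way to clean up, assuming the file is only needed while a single function runs (with_temp_copy and file_size are hypothetical helpers):

```python
import os
import tempfile


def with_temp_copy(data: bytes, suffix: str, use_path):
    """Write data to a temp file, call use_path(path), then delete the file."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tf:
        tf.write(data)
        path = tf.name
    # The file is closed at this point, so other code can reopen it by path.
    try:
        return use_path(path)
    finally:
        os.remove(path)  # runs even if use_path raises


def file_size(path: str) -> int:
    return os.path.getsize(path)


size = with_temp_copy(b"hello", ".txt", file_size)
print(size)
```

The suffix argument is useful when the consumer infers the file type from the extension.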

Response to Edit 1

I would remove the caching and instead rely on st.session_state to store your results.

Create a spot in session state for the object you want at the beginning of your script

if 'qa' not in st.session_state:
    st.session_state.qa = None

Have your function return the object you want

def document_loader(doc_path: str) -> Optional[Document]:
    loader = PyPDFLoader(doc_path)
    return loader # or return loader.load(), whichever is more suitable

Check for results in session state before running the document loader

uploaded_file = st.sidebar.file_uploader("Upload a file", key= "uploaded_file")

if uploaded_file is not None and st.session_state.qa is None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
        print(temp_file_path)

    st.session_state.qa = document_loader(temp_file_path)

custom_qa = st.session_state.qa

# put a check on custom_qa before continuing, either "is None" with  
# stop or "is not None" with the rest of your code nested inside
if custom_qa is None:
    st.stop()

Add a way to reset by passing on_change=clear_qa to the file uploader:

def clear_qa():
    st.session_state.qa = None
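
If keeping @st.cache_data is preferred over session state, another possible approach, offered only as a sketch, is to make the temp path depend on the file's content rather than on a random name. Assuming the cache key is derived from the argument value, the same upload would then produce the same path string and hit the cache. stable_temp_path and the "upload_" prefix are hypothetical names.

```python
import hashlib
import os
import tempfile


def stable_temp_path(data: bytes, suffix: str = ".pdf") -> str:
    """Return a temp-dir path derived from the content hash, writing the file if needed."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(tempfile.gettempdir(), f"upload_{digest}{suffix}")
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return path


first = stable_temp_path(b"%PDF-1.4 example bytes")
second = stable_temp_path(b"%PDF-1.4 example bytes")
other = stable_temp_path(b"different bytes")
print(first == second)
print(first == other)
```

In the app this would be called as document_loader(stable_temp_path(uploaded_file.getvalue())), so identical uploads pass an identical argument to the cached function.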

huangapple
  • Posted on May 29, 2023, 09:44:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76354242.html