May 29, 2023 09:44:47

f string to pass file path issue

Question

I have a function which accepts a file path. It's as below:

def document_loader(doc_path: str) -> Optional[Document]:
    """This function takes in a document in a particular format and
    converts it into a Langchain Document Object.

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")

PyPDFLoader is a wrapper around PyPDF2 that reads a PDF from a file path.

Now, when I call the function with a hardcoded file path string as below:

document_loader('/Users/Documents/hack/data/abc.pdf')

The function works fine and is able to read the PDF from that path.

But now I want a user to upload their PDF file via Streamlit's file_uploader(), as below:

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
print(st.session_state.uploaded_file)
if uploaded_file is not None:
    filename = st.session_state.uploaded_file.name
    print(os.path.abspath(st.session_state.uploaded_file.name))
    document_loader(f'"{os.path.abspath(filename)}"')

I get the error:

ValueError: File path "/Users/Documents/hack/data/abc.pdf" is not a valid file or url

This statement print(os.path.abspath(st.session_state.uploaded_file.name)) prints out the same path as the hardcoded one.

Note: Streamlit is currently running on localhost on my laptop, and I am the "user" who is trying to upload a PDF via the locally running Streamlit app.

Edit 1:

As suggested by @MAtchCatAnd, I added tempfile and it works, but with one issue:

The function that receives the temp file path re-runs every time the user interacts with the app. This is because the temp file path changes on each run, so the function is re-run even though I decorated it with @st.cache_data.

The uploaded PDF file remains the same, so I don't want the function to re-run, as each run incurs some cost.

How can I fix this? I see that Streamlit has deprecated the allow_output_mutation=True parameter of st.cache.

Here's the code:

@st.cache_data
def document_loader(doc_path: str) -> Optional[Document]:
    """This function takes in a document in a particular format and
    converts it into a Langchain Document Object.

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")

if uploaded_file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
        print(temp_file_path)

    custom_qa = document_loader(temp_file_path)

Answer 1

Score: 3

The object returned by st.file_uploader is a "file-like" object inheriting from BytesIO.

From the docs:
> The UploadedFile class is a subclass of BytesIO, and therefore it is "file-like". This means you can pass them anywhere where a file is expected.

While the returned object does have a name attribute, it has no path. It exists in memory and is not associated with a real, saved file. Though Streamlit may be run locally, it actually has a server-client structure in which the Python backend is usually on a different computer than the user's. As such, the file_uploader widget is not designed to provide any real access or pointer to the user's file system.
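
To make this concrete, here is a minimal standard-library sketch of what the code in the question ends up passing to the loader. The name `abc.pdf` stands in for `uploaded_file.name` and is only an illustration. It shows two separate effects: `os.path.abspath` merely joins the bare name onto the current working directory without checking that anything exists there, and the f-string adds literal double quotes to the string (the same quotes that show up in the error message).

```python
import os

# Hypothetical upload name; UploadedFile.name holds only a file name, not a path.
filename = "abc.pdf"

# abspath() joins the name onto the current working directory.
# It does not look at the disk, so the result may or may not exist.
candidate = os.path.abspath(filename)
print(candidate, os.path.isfile(candidate))

# The f-string from the question wraps the path in literal double quotes,
# so the loader receives a different string than the one that was printed.
quoted = f'"{candidate}"'
print(quoted, os.path.isfile(quoted))
```

If the app happens to be launched from the directory that contains the original file, the unquoted path exists only by coincidence; the quoted one does not point at a file in either case.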

You should either:

- use a method that allows you to pass a file buffer instead of a path,
- save the file to a new, known path, or
- use tempfiles.

Below is a brief example of working with temp files; there is also another question about them that may be helpful.

import streamlit as st
import tempfile
import pandas as pd

file = st.file_uploader('Upload a file', type='csv')
tempdir = tempfile.gettempdir()

if file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(file.read())
        tf_path = tf.name
    st.write(tf_path)
    df = pd.read_csv(tf_path)
    st.write(df)
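
The second option above (saving to a new, known path) could be sketched as follows. This is only an illustration under stated assumptions: `save_upload` is a hypothetical helper, not part of Streamlit, and in an app the `data` argument would come from `uploaded_file.getvalue()`. The path is derived from a hash of the bytes, so the same upload always maps to the same path.

```python
import hashlib
import tempfile
from pathlib import Path


def save_upload(data: bytes, suffix: str = ".pdf") -> Path:
    """Write uploaded bytes to a path derived from their SHA-256 digest.

    Hypothetical helper: identical content always yields the same path.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = Path(tempfile.gettempdir()) / f"upload_{digest}{suffix}"
    if not target.exists():  # skip the write if this content was already saved
        target.write_bytes(data)
    return target


# Two "reruns" with the same bytes resolve to the same file.
first = save_upload(b"%PDF-1.4 example bytes")
second = save_upload(b"%PDF-1.4 example bytes")
print(first == second)
```

Because the path no longer changes between reruns for the same file, a function cached on its path argument would receive an identical argument each time.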

Response to Edit 1

I would remove the caching and instead rely on st.session_state to store your results.

Create a spot in session state for the object you want at the beginning of your script

if 'qa' not in st.session_state:
    st.session_state.qa = None

Have your function return the object you want

def document_loader(doc_path: str) -> Optional[Document]:
    loader = PyPDFLoader(doc_path)
    return loader # or return loader.load(), whichever is more suitable

Check for results in session state before running the document loader

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")

if uploaded_file is not None and st.session_state.qa is None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
        print(temp_file_path)

    st.session_state.qa = document_loader(temp_file_path)

custom_qa = st.session_state.qa

# put a check on custom_qa before continuing, either "is None" with
# stop or "is not None" with the rest of your code nested inside
if custom_qa is None:
    st.stop()

Add in a way to reset, by adding `on_change=clear_qa` to the file uploader

def clear_qa():
    st.session_state.qa = None
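
For context on why the cached function in Edit 1 kept re-running, the sketch below uses `functools.lru_cache` from the standard library as a stand-in for `@st.cache_data`. This is an analogy, not Streamlit's implementation, and `load_document` is a hypothetical placeholder for the real loader. Both caches look up results by the function's arguments, and each `NamedTemporaryFile` gets a fresh name, so a path argument never repeats.

```python
import functools
import os
import tempfile

calls = []  # records every real (non-cached) invocation


@functools.lru_cache(maxsize=None)
def load_document(path: str) -> int:
    """Hypothetical stand-in for the expensive loader; cached by its path argument."""
    calls.append(path)
    return os.path.getsize(path)


data = b"identical upload bytes"
paths = []
for _ in range(2):  # simulate two script reruns with the same upload
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(data)
    paths.append(tf.name)
    load_document(tf.name)

print(paths[0] != paths[1])  # the temp file names differ
print(len(calls))            # number of times the loader body actually ran

# Clean up the temp files created with delete=False.
for p in paths:
    os.remove(p)
```

The bytes are the same on both iterations, yet the loader body runs each time because the cache only ever sees a new path string.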
