f-string to pass file path issue
Question
I have a function that accepts a file path:

    def document_loader(doc_path: str) -> Optional[Document]:
        """This function takes in a document in a particular format and
        converts it into a LangChain Document object.
        Args:
            doc_path (str): A string representing the path to the PDF document.
        Returns:
            Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
        """
        # try:
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
        print("Document loader done")

PyPDFLoader is a wrapper around PyPDF2 that reads a PDF from a file path.

When I call the function with a hardcoded file path string:

    document_loader('/Users/Documents/hack/data/abc.pdf')

the function works fine and reads the PDF.

But if I want a user to upload their PDF via Streamlit's file_uploader(), as below:

    uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
    print(st.session_state.uploaded_file)
    if uploaded_file is not None:
        filename = st.session_state.uploaded_file.name
        print(os.path.abspath(st.session_state.uploaded_file.name))
        document_loader(f'"{os.path.abspath(filename)}"')

I get the error:

    ValueError: File path "/Users/Documents/hack/data/abc.pdf" is not a valid file or url

The statement print(os.path.abspath(st.session_state.uploaded_file.name)) prints the same path as the hardcoded one.

Note: Streamlit is currently running on localhost on my laptop, and I am the "user" uploading a PDF via the locally running Streamlit app.

Edit 1:

Following @MAtchCatAnd's suggestion, I added tempfile and it works, but with one issue:

The function that receives the temp file path re-runs every time the user interacts with the app. This is because the temp file path changes on every run, so the function re-runs even though I decorated it with @st.cache_data.

The uploaded PDF stays the same, so I don't want the function to re-run, since each run incurs a cost.

How can I fix this? I see that Streamlit has deprecated the allow_output_mutation=True parameter of st.cache.

Here's the code:

    @st.cache_data
    def document_loader(doc_path: str) -> Optional[Document]:
        """This function takes in a document in a particular format and
        converts it into a LangChain Document object.
        Args:
            doc_path (str): A string representing the path to the PDF document.
        Returns:
            Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
        """
        # try:
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
        print("Document loader done")

    uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
    if uploaded_file is not None:
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(uploaded_file.getvalue())
            temp_file_path = temp_file.name
            print(temp_file_path)
        custom_qa = document_loader(temp_file_path)

Answer 1
Score: 3
The object returned by st.file_uploader is a "file-like" object inheriting from BytesIO.
From the docs:
> The UploadedFile class is a subclass of BytesIO, and therefore it is "file-like". This means you can pass them anywhere where a file is expected.
While the returned object does have a name attribute, it has no path. It exists in memory and is not associated with a real, saved file. Though Streamlit may be run locally, it actually has a server-client structure in which the Python backend is usually on a different computer than the user's. As such, the file_uploader widget is not designed to provide any real access or pointer to the user's file system.
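
To illustrate why the path built in the question does not point at a real file, here is a minimal sketch in plain Python, independent of Streamlit. It relies on two facts: os.path.abspath only joins a name onto the current working directory without checking that anything exists there, and quote characters written inside an f-string become part of the resulting string. The file name is an illustrative placeholder.

```python
import os

name = "abc.pdf"  # placeholder: an upload exposes only a bare name like this

# abspath() resolves the name against the current working directory;
# it does not check that a file actually exists at the result.
resolved = os.path.abspath(name)

# Quote characters inside the f-string become part of the path string itself.
quoted = f'"{resolved}"'

print(resolved)
print(quoted)
print(os.path.exists(quoted))
```

So even when the printed path looks identical to the hardcoded one, the string handed to the loader in the question carries two extra quote characters, and the uploaded bytes were never written to that location in the first place.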
You should do one of the following:
- use a method that allows you to pass a file buffer instead of a path,
- save the file to a new, known path,
- use tempfiles.
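
As a rough sketch of the second option, the uploaded bytes can be written to a directory you control before handing the path to a loader. The helper name save_upload and the "uploads" directory are hypothetical, and plain bytes stand in here for what uploaded_file.getvalue() would return.

```python
import tempfile
from pathlib import Path


def save_upload(data: bytes, filename: str, dest_dir: Path) -> Path:
    """Write uploaded bytes into dest_dir and return the resulting path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a name such as "../abc.pdf"
    # cannot place the file outside dest_dir.
    target = dest_dir / Path(filename).name
    target.write_bytes(data)
    return target


# Demo: plain bytes stand in for uploaded_file.getvalue().
with tempfile.TemporaryDirectory() as tmp:
    saved = save_upload(b"%PDF-1.4 demo bytes", "../abc.pdf", Path(tmp) / "uploads")
    print(saved.name)  # abc.pdf
    print(saved.read_bytes()[:4])
```

In the app this would presumably be called as save_upload(uploaded_file.getvalue(), uploaded_file.name, Path("uploads")), with the returned path converted to a string for the loader.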
Here is a brief example working with temp files (another question about them may also be helpful):

    import streamlit as st
    import tempfile
    import pandas as pd

    file = st.file_uploader('Upload a file', type='csv')
    tempdir = tempfile.gettempdir()

    if file is not None:
        with tempfile.NamedTemporaryFile(delete=False) as tf:
            tf.write(file.read())
            tf_path = tf.name
        # The temp file is closed (and flushed) here, so it can be read by path
        st.write(tf_path)
        df = pd.read_csv(tf_path)
        st.write(df)

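
Because delete=False leaves the temporary file on disk, it may be worth removing it once the loader has finished with it. The following is a minimal sketch of that pattern using only the standard library; with_temp_copy and use_path are hypothetical names, and use_path stands in for whatever consumes the path (for example pd.read_csv or a PDF loader).

```python
import os
import tempfile


def with_temp_copy(data: bytes, suffix: str, use_path):
    """Write data to a temp file, call use_path(path), then delete the file."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tf:
        tf.write(data)
        path = tf.name
    # The file is closed and flushed here, so other code can open it by path.
    try:
        return use_path(path)
    finally:
        # Clean up even if use_path raises.
        os.remove(path)


def count_lines(path: str) -> int:
    """Example consumer: count the lines in the file at path."""
    with open(path, "rb") as fh:
        return len(fh.read().splitlines())


print(with_temp_copy(b"a,b\n1,2\n", ".csv", count_lines))
```

This only suits loaders that read the whole file eagerly; if the consumer keeps the path for later use, the file has to stay around longer.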
Response to Edit 1

I would remove the caching and instead rely on st.session_state to store your results.

Create a spot in session state for the object you want at the beginning of your script:

    if 'qa' not in st.session_state:
        st.session_state.qa = None

Have your function return the object you want:

    def document_loader(doc_path: str) -> Optional[Document]:
        loader = PyPDFLoader(doc_path)
        return loader  # or return loader.load(), whichever is more suitable

Check for results in session state before running the document loader:

    uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
    if uploaded_file is not None and st.session_state.qa is None:
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(uploaded_file.getvalue())
            temp_file_path = temp_file.name
            print(temp_file_path)
        st.session_state.qa = document_loader(temp_file_path)

    custom_qa = st.session_state.qa

    # put a check on custom_qa before continuing, either "is None" with
    # stop or "is not None" with the rest of your code nested inside
    if custom_qa is None:
        st.stop()

Add a way to reset by passing on_change=clear_qa to the file uploader:

    def clear_qa():
        st.session_state.qa = None
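
If you would rather keep @st.cache_data, one alternative (not part of the answer above, so treat it as an assumption to verify) is to make the path a function of the file's content. Since the cache is keyed on the function's arguments, the same upload should then yield the same argument on every rerun. The sketch below uses only the standard library; stable_temp_path is a hypothetical helper name.

```python
import hashlib
import tempfile
from pathlib import Path


def stable_temp_path(data: bytes, suffix: str = ".pdf") -> Path:
    """Return a temp-directory path derived from the content hash of data."""
    digest = hashlib.sha256(data).hexdigest()
    path = Path(tempfile.gettempdir()) / f"upload_{digest}{suffix}"
    # Write the bytes only the first time this content is seen.
    if not path.exists():
        path.write_bytes(data)
    return path


# Demo: identical bytes map to the same path, different bytes do not.
first = stable_temp_path(b"same upload")
second = stable_temp_path(b"same upload")
other = stable_temp_path(b"different upload")
print(first == second)
print(first == other)

# Remove the demo files again.
for p in {first, other}:
    p.unlink()
```

In the app, the call would presumably become document_loader(str(stable_temp_path(uploaded_file.getvalue()))), so a rerun with an unchanged upload passes an unchanged string.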