June 13, 2023, 14:38:56

How to register pandas dataframe as parquet or csv dataset in the container and in the Data at the same time?

Question

I'm using Azure ML. As part of the project, I want to save an output pandas DataFrame as a file and register it as a dataset in the Data section. Ideally, a single function call would both write the file (CSV or Parquet) to the container created for AML and make it available in the Data section.

So far, my function looks like this:

def register_future_predictions(forecast_val, ws):
    global last_date
    last_date = forecast_val['ds'].iloc[-1]
    global future_dates
    future_dates = pd.date_range(start=last_date, periods=182, freq='W')
    global future_df
    future_df = pd.DataFrame({'ds': future_dates})
    for col in X_val.columns:
        if col != 'ds':
            future_df[col] = 0
    global future_predictions
    future_predictions = results.predict(future_df)
    reg_data_future = pd.concat([X_train, X_test, X_val, future_predictions[['ds', 'yhat']].rename({'ds': 'Created on day', 'yhat': 'target_col'})])
    target_datastore = Datastore.get(ws, 'container-where-i-keep-datasets')
    ds = Dataset.Tabular.register_pandas_dataframe(dataframe=reg_data_future, name='full_data-TEST', target=target_datastore)
    return future_predictions, future_df

Obviously this doesn't work. Do you have any suggestions, or a working solution I could refer to?

Thank you in advance.

Answer 1

Score: 0

For this scenario I used sample data: X_train, X_test, X_val and results.

To get a pandas DataFrame into the container and the Data section of Azure ML as a CSV or Parquet dataset, save it to a file, upload the file to the datastore, and then create and register a tabular dataset from it with Dataset.Tabular.from_delimited_files (CSV) or Dataset.Tabular.from_parquet_files (Parquet):

import pandas as pd
from azureml.core import Workspace, Datastore, Dataset
# Sample test data
X_train = pd.DataFrame({'ds': ['2023-01-01', '2023-01-02', '2023-01-03'], 'y': [1, 2, 3]})
X_test = pd.DataFrame({'ds': ['2023-01-04', '2023-01-05'], 'y': [4, 5]})
X_val = pd.DataFrame({'ds': ['2023-01-06', '2021-03-07'], 'y': [6, 7]})
results = pd.DataFrame({'ds': ['2023-01-08', '2023-01-09'], 'yhat': [8, 9]})

The function below writes both a CSV and a Parquet file. You can modify it to suit your needs.

def register_future_predictions(forecast_val, ws):
    # Build weekly future dates from the last known date
    last_date = forecast_val['ds'].iloc[-1]
    future_dates = pd.date_range(start=last_date, periods=182, freq='W')
    future_df = pd.DataFrame({'ds': future_dates})
    # Fill every other feature column with placeholder zeros
    for col in X_val.columns:
        if col != 'ds':
            future_df[col] = 0
    # Sample stand-in for the model output (results.predict(future_df) in the question)
    future_predictions = results
    reg_data_future = pd.concat([
        X_train, X_test, X_val,
        future_predictions[['ds', 'yhat']].rename(columns={'ds': 'Created on day', 'yhat': 'target_col'}),
    ])
    # Save the dataframe as a file, e.g., CSV or Parquet
    reg_data_future.to_csv('reg_data_future.csv', index=False)
    reg_data_future.to_parquet('reg_data_future.parquet', index=False)
    # Get the target datastore
    target_datastore = Datastore.get(ws, 'workspaceblobstore')
    # Upload the files to the datastore's container
    # (overwrite=True so that re-running replaces the previously uploaded files)
    target_datastore.upload_files(files=['reg_data_future.csv', 'reg_data_future.parquet'], target_path='data', overwrite=True)
    # Register the files as datasets
    csv_dataset = Dataset.Tabular.from_delimited_files(path=[(target_datastore, 'data/reg_data_future.csv')])
    csv_dataset.register(workspace=ws, name='full_data_csv', create_new_version=True)
    parquet_dataset = Dataset.Tabular.from_parquet_files(path=[(target_datastore, 'data/reg_data_future.parquet')])
    parquet_dataset.register(workspace=ws, name='full_data_parquet', create_new_version=True)
    return future_predictions, future_df
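
One difference between the question's code and the function above is that rename is called with columns=. The sketch below, which reuses the sample results frame (the variable names preds and mapping are only illustrative), suggests why this matters: a dict passed positionally is applied to the index labels, so the column names stay as they were.

```python
import pandas as pd

# Same shape as the sample `results` frame used in this answer
preds = pd.DataFrame({"ds": ["2023-01-08", "2023-01-09"], "yhat": [8, 9]})
mapping = {"ds": "Created on day", "yhat": "target_col"}

# Positional mapper targets the index (axis=0); no index label matches,
# so nothing is renamed and no error is raised.
unchanged = preds.rename(mapping)

# Keyword form targets the column labels.
renamed = preds.rename(columns=mapping)

print(list(unchanged.columns))
print(list(renamed.columns))
```

Whether this was the only problem in the original function is not something the thread confirms, so treat it as one thing to check rather than the full diagnosis.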

# Create or load Azure ML workspace
ws = Workspace.from_config()
# Test the function
sample_forecast_val = pd.DataFrame({'ds': ['2024-01-01', '2024-01-02', '2024-01-03']})
predictions, df = register_future_predictions(sample_forecast_val, ws)
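
A side note on the date handling in the function: freq='W' is a Sunday-anchored weekly frequency in pandas, so the generated dates snap forward to Sundays rather than continuing from the weekday of last_date. The short sketch below uses the last sample date from the test call, which falls on a Wednesday; the 7D variant is only one possible alternative and is an assumption about what might be wanted, not part of the original answer.

```python
import pandas as pd

last_date = "2024-01-03"  # last 'ds' value in the sample forecast_val (a Wednesday)

# Sunday-anchored weekly range: the first value rolls forward to the next Sunday
weekly = pd.date_range(start=last_date, periods=3, freq="W")
print(list(weekly.strftime("%Y-%m-%d")))

# Hypothetical alternative: keep the weekday of last_date and start one week later
same_weekday = pd.date_range(
    start=pd.Timestamp(last_date) + pd.Timedelta(weeks=1), periods=3, freq="7D"
)
print(list(same_weekday.strftime("%Y-%m-%d")))
```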

With the updated function, the code executes successfully.

The CSV and Parquet files are created in the container.

The datasets are also registered as data assets.
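
To check the registration from code rather than in the studio UI, a registered dataset can be loaded back by name. This is a minimal sketch for the v1 SDK (azureml.core); load_registered is a hypothetical helper name, and the import sits inside the function only so that the snippet can be defined without the SDK installed.

```python
def load_registered(ws, name):
    """Return the latest version of a registered tabular dataset as a pandas DataFrame."""
    # Imported lazily (an assumption of this sketch, not a requirement of the SDK)
    from azureml.core import Dataset

    dataset = Dataset.get_by_name(ws, name=name)
    return dataset.to_pandas_dataframe()


# Usage, assuming `ws` is the Workspace from the answer and the function has been run:
# df_csv = load_registered(ws, "full_data_csv")
# df_parquet = load_registered(ws, "full_data_parquet")
```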

Please refer to this documentation for more details and examples.
