How to register a pandas DataFrame as a Parquet or CSV dataset in the container and in the Data section at the same time?
Question
I'm using Azure ML. As part of the project, I want to save an output pandas DataFrame as a file and register it as a dataset in the Data section. Preferably, the file should exist in the container created for AML (as CSV or Parquet) and be available in the Data section in one go, by executing a single function.
So far, my function looks like this:
    def register_future_predictions(forecast_val, ws):
        global last_date
        last_date = forecast_val['ds'].iloc[-1]
        global future_dates
        future_dates = pd.date_range(start=last_date, periods=182, freq='W')
        global future_df
        future_df = pd.DataFrame({'ds': future_dates})
        for col in X_val.columns:
            if col != 'ds':
                future_df[col] = 0
        global future_predictions
        future_predictions = results.predict(future_df)
        reg_data_future = pd.concat([X_train, X_test, X_val, future_predictions[['ds', 'yhat']].rename({'ds': 'Created on day', 'yhat': 'target_col'})])
        target_datastore = Datastore.get(ws, 'container-where-i-keep-datasets')
        ds = Dataset.Tabular.register_pandas_dataframe(dataframe=reg_data_future, name='full_data-TEST', target=target_datastore)
        return future_predictions, future_df
Obviously, this doesn't work. Do you have any suggestions or a working solution I could refer to?
Thank you in advance.
Answer 1
Score: 0
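To reproduce the scenario, I created sample data for X_train, X_test, X_val, and results.
To register a pandas DataFrame as a Parquet or CSV dataset both in the container and in the Data section of Azure ML, save it to a file, upload the file to the datastore, and then create and register a tabular dataset with Dataset.Tabular.from_delimited_files (for CSV) or Dataset.Tabular.from_parquet_files (for Parquet):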
    import pandas as pd
    from azureml.core import Workspace, Datastore, Dataset

    # Sample test data
    X_train = pd.DataFrame({'ds': ['2023-01-01', '2023-01-02', '2023-01-03'], 'y': [1, 2, 3]})
    X_test = pd.DataFrame({'ds': ['2023-01-04', '2023-01-05'], 'y': [4, 5]})
    X_val = pd.DataFrame({'ds': ['2023-01-06', '2021-03-07'], 'y': [6, 7]})
    results = pd.DataFrame({'ds': ['2023-01-08', '2023-01-09'], 'yhat': [8, 9]})
The function below writes both a CSV and a Parquet file, uploads them to the datastore, and registers each as a dataset. You can modify it according to your needs.
    def register_future_predictions(forecast_val, ws):
        # Build a frame of future weekly dates, starting from the last forecast date
        last_date = forecast_val['ds'].iloc[-1]
        future_dates = pd.date_range(start=last_date, periods=182, freq='W')
        future_df = pd.DataFrame({'ds': future_dates})
        for col in X_val.columns:
            if col != 'ds':
                future_df[col] = 0

        # The sample 'results' frame stands in for the model's predictions here
        future_predictions = results
        reg_data_future = pd.concat([
            X_train, X_test, X_val,
            future_predictions[['ds', 'yhat']].rename(
                columns={'ds': 'Created on day', 'yhat': 'target_col'}),
        ])

        # Save the dataframe as a file, e.g., CSV or Parquet
        reg_data_future.to_csv('reg_data_future.csv', index=False)
        reg_data_future.to_parquet('reg_data_future.parquet', index=False)

        # Get the target datastore
        target_datastore = Datastore.get(ws, 'workspaceblobstore')

        # Upload the files to the datastore's container
        # (overwrite=True so that re-running the function replaces earlier uploads)
        target_datastore.upload_files(
            files=['reg_data_future.csv', 'reg_data_future.parquet'],
            target_path='data',
            overwrite=True,
        )

        # Register the files as datasets
        csv_dataset = Dataset.Tabular.from_delimited_files(path=[(target_datastore, 'data/reg_data_future.csv')])
        csv_dataset.register(workspace=ws, name='full_data_csv', create_new_version=True)
        parquet_dataset = Dataset.Tabular.from_parquet_files(path=[(target_datastore, 'data/reg_data_future.parquet')])
        parquet_dataset.register(workspace=ws, name='full_data_parquet', create_new_version=True)

        return future_predictions, future_df

    # Create or load Azure ML workspace
    ws = Workspace.from_config()

    # Test the function
    sample_forecast_val = pd.DataFrame({'ds': ['2024-01-01', '2024-01-02', '2024-01-03']})
    predictions, df = register_future_predictions(sample_forecast_val, ws)
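One detail worth noting: the function above passes the mapping to rename through the columns= keyword, whereas the question's code passes it positionally. A minimal pandas-only sketch of the difference, using made-up sample values rather than the asker's data:

```python
import pandas as pd

# Hypothetical sample frame standing in for the model's predictions.
preds = pd.DataFrame({'ds': ['2023-01-08', '2023-01-09'], 'yhat': [8, 9]})
mapping = {'ds': 'Created on day', 'yhat': 'target_col'}

# Passed positionally, the mapping is applied to the index labels (axis 0),
# so the column names are left unchanged.
positional = preds.rename(mapping)

# Passed as columns=, the mapping is applied to the column names.
by_columns = preds.rename(columns=mapping)

print(list(positional.columns))
print(list(by_columns.columns))
```

Note also that pd.concat aligns frames by column name, so the renamed columns appear as additional columns that hold NaN in the rows coming from X_train, X_test, and X_val.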
The updated function executes successfully: it creates the CSV and Parquet files in the container and registers the datasets as data assets.
Please refer to this documentation for more details and examples.
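To confirm that the registration worked, the datasets can be read back by name. A minimal sketch, assuming the same azureml-core (SDK v1) package used in the answer and the dataset names registered by the function; the helper name load_registered_dataset is hypothetical:

```python
def load_registered_dataset(ws, name):
    """Fetch the latest version of a registered tabular dataset as a DataFrame.

    Hypothetical helper; `ws` is expected to be an azureml.core.Workspace.
    """
    # Imported inside the function so this module can be loaded
    # even where the Azure ML SDK is not installed.
    from azureml.core import Dataset

    dataset = Dataset.get_by_name(ws, name=name)  # latest version by default
    return dataset.to_pandas_dataframe()
```

For example, load_registered_dataset(ws, 'full_data_csv').head() should show the first rows of the uploaded CSV, assuming ws is the workspace loaded earlier.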