How to register pandas dataframe as parquet or csv dataset in the container and in the Data at the same time?


Question

I'm using Azure ML. As part of a project, I want to save an output pandas DataFrame as a file and register it as a dataset in the Data section. Ideally, a single function call should both write the file (CSV or Parquet) to the container created for AML and make it available in the Data section.

So far, my function looks like this:

```python
def register_future_predictions(forecast_val, ws):

    global last_date
    last_date = forecast_val['ds'].iloc[-1]
    global future_dates
    future_dates = pd.date_range(start=last_date, periods=182, freq='W')
    global future_df
    future_df = pd.DataFrame({'ds': future_dates})

    for col in X_val.columns:
        if col != 'ds':
            future_df[col] = 0

    global future_predictions
    future_predictions = results.predict(future_df)

    reg_data_future = pd.concat([X_train, X_test, X_val, future_predictions[['ds', 'yhat']].rename({'ds': 'Created on day', 'yhat': 'target_col'})])
    target_datastore = Datastore.get(ws, 'container-where-i-keep-datasets')
    ds = Dataset.Tabular.register_pandas_dataframe(dataframe=reg_data_future, name='full_data-TEST', target=target_datastore)

    return future_predictions, future_df
```

This doesn't work. Do you have any suggestions or a working solution I could refer to?

Thank you in advance.

Answer 1

Score: 0


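To reproduce the scenario, I used sample data for X_train, X_test, X_val and results.

To get a pandas DataFrame into the container as a CSV or Parquet file and registered in the Data section of Azure ML, save it locally, upload the file to the datastore, and then create and register a dataset from it with Dataset.Tabular.from_delimited_files (CSV) or Dataset.Tabular.from_parquet_files (Parquet):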

```python
import pandas as pd
from azureml.core import Workspace, Datastore, Dataset

# Sample test data
X_train = pd.DataFrame({'ds': ['2023-01-01', '2023-01-02', '2023-01-03'], 'y': [1, 2, 3]})
X_test = pd.DataFrame({'ds': ['2023-01-04', '2023-01-05'], 'y': [4, 5]})
X_val = pd.DataFrame({'ds': ['2023-01-06', '2021-03-07'], 'y': [6, 7]})
results = pd.DataFrame({'ds': ['2023-01-08', '2023-01-09'], 'yhat': [8, 9]})
```

The function below writes and registers both a CSV and a Parquet file. You can modify it according to your needs.

```python
def register_future_predictions(forecast_val, ws):
    last_date = forecast_val['ds'].iloc[-1]
    future_dates = pd.date_range(start=last_date, periods=182, freq='W')
    future_df = pd.DataFrame({'ds': future_dates})

    for col in X_val.columns:
        if col != 'ds':
            future_df[col] = 0

    # Stand-in for results.predict(future_df) from the question
    future_predictions = results
    reg_data_future = pd.concat([
        X_train, X_test, X_val,
        future_predictions[['ds', 'yhat']].rename(
            columns={'ds': 'Created on day', 'yhat': 'target_col'}),
    ])

    # Save the dataframe as a file, e.g., CSV or Parquet
    # (to_parquet needs a Parquet engine such as pyarrow installed)
    reg_data_future.to_csv('reg_data_future.csv', index=False)
    reg_data_future.to_parquet('reg_data_future.parquet', index=False)

    # Get the target datastore
    target_datastore = Datastore.get(ws, 'workspaceblobstore')

    # Upload the files to the datastore's container
    target_datastore.upload_files(
        files=['reg_data_future.csv', 'reg_data_future.parquet'],
        target_path='data',
        overwrite=True,  # replace files left by an earlier run
    )

    # Register the files as datasets
    csv_dataset = Dataset.Tabular.from_delimited_files(path=[(target_datastore, 'data/reg_data_future.csv')])
    csv_dataset.register(workspace=ws, name='full_data_csv', create_new_version=True)

    parquet_dataset = Dataset.Tabular.from_parquet_files(path=[(target_datastore, 'data/reg_data_future.parquet')])
    parquet_dataset.register(workspace=ws, name='full_data_parquet', create_new_version=True)

    return future_predictions, future_df


# Create or load Azure ML workspace
ws = Workspace.from_config()

# Test the function
sample_forecast_val = pd.DataFrame({'ds': ['2024-01-01', '2024-01-02', '2024-01-03']})

predictions, df = register_future_predictions(sample_forecast_val, ws)
```
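One change in the function above is easy to miss: the question's code calls `rename({...})`, while this version calls `rename(columns={...})`. In pandas, a mapping passed positionally to `DataFrame.rename` is applied to the index labels by default, so the column names stay unchanged. A small pandas-only sketch of the difference, reusing the sample `results` values from above (the variable names `mapping`, `renamed_index` and `renamed_columns` are only for illustration):

```python
import pandas as pd

# Same shape as the sample `results` frame used in this answer
preds = pd.DataFrame({'ds': ['2023-01-08', '2023-01-09'], 'yhat': [8, 9]})
mapping = {'ds': 'Created on day', 'yhat': 'target_col'}

# Positional mapping: applied to the index (0, 1), where no label matches,
# so the columns are left as they were.
renamed_index = preds.rename(mapping)

# Keyword `columns=`: applied to the column labels, as intended.
renamed_columns = preds.rename(columns=mapping)

print(list(renamed_index.columns))    # ['ds', 'yhat']
print(list(renamed_columns.columns))  # ['Created on day', 'target_col']
```

Whether this was the only problem in the original function is not stated in the question, so treat it as one thing worth checking rather than a full diagnosis.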

With the updated function, the code executes successfully. It creates the CSV and Parquet files in the container and registers the datasets as data assets.

Please refer to the Azure ML documentation for more details and examples.
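If only Parquet output is needed, the call used in the question, `Dataset.Tabular.register_pandas_dataframe`, is an existing azureml-core method (marked experimental in the SDK) that uploads the dataframe as Parquet to the target and registers the dataset in a single call. The sketch below is a hedged alternative, not part of the answer above: the helper name `register_dataframe_as_parquet` and its `folder` parameter are hypothetical, and passing the target as a `(datastore, relative_path)` tuple assumes an SDK version that accepts that form.

```python
def register_dataframe_as_parquet(df, ws, datastore_name, dataset_name, folder="data"):
    """Upload `df` as Parquet to a datastore and register it in one call.

    Hypothetical helper; assumes azureml-core is installed and that
    `datastore_name` is a datastore already attached to the workspace.
    """
    # Imported lazily so this module loads even without the Azure ML SDK.
    from azureml.core import Datastore, Dataset

    datastore = Datastore.get(ws, datastore_name)

    # The SDK writes the Parquet data under the given folder in the
    # datastore's container and returns the registered TabularDataset.
    return Dataset.Tabular.register_pandas_dataframe(
        dataframe=df,
        target=(datastore, folder),
        name=dataset_name,
    )
```

With a workspace at hand it could be called as `register_dataframe_as_parquet(reg_data_future, ws, 'workspaceblobstore', 'full_data_parquet')`. A CSV copy would still need the save, upload and register steps shown in the answer.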

huangapple
  • Posted on June 13, 2023 at 14:38:56
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76462238.html