How to register pandas dataframe as parquet or csv dataset in the container and in the Data at the same time?




I'm using Azure ML. As part of my project, I want to save an output pandas DataFrame as a file and register it as a dataset in the Data section. Ideally, the file would exist in the container created for AML (as CSV or Parquet) and also be available in the Data section, all in one go by executing a single function.

So far, my function looks like this:

    def register_future_predictions(forecast_val, ws):
        global last_date
        last_date = forecast_val['ds'].iloc[-1]
        global future_dates
        future_dates = pd.date_range(start=last_date, periods=182, freq='W')
        global future_df
        future_df = pd.DataFrame({'ds': future_dates})
        for col in X_val.columns:
            if col != 'ds':
                future_df[col] = 0
        global future_predictions
        future_predictions = results.predict(future_df)
        reg_data_future = pd.concat([X_train, X_test, X_val, future_predictions[['ds', 'yhat']].rename({'ds': 'Created on day', 'yhat': 'target_col'})])
        target_datastore = Datastore.get(ws, 'container-where-i-keep-datasets')
        ds = Dataset.Tabular.register_pandas_dataframe(dataframe=reg_data_future, name='full_data-TEST', target=target_datastore)
        return future_predictions, future_df

Obviously this doesn't work. Do you have any suggestions or a working solution I could refer to?

Thank you in advance.


Score: 0



For this scenario, I used sample data for X_train, X_test, X_val and results.

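To register a pandas DataFrame as a Parquet or CSV dataset in the container and in the Data section of Azure ML, you can save it to a file, upload the file to the datastore, and then use Dataset.Tabular.from_delimited_files (for CSV) or Dataset.Tabular.from_parquet_files (for Parquet):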

    import pandas as pd
    from azureml.core import Workspace, Datastore, Dataset

    # Sample test data
    X_train = pd.DataFrame({'ds': ['2023-01-01', '2023-01-02', '2023-01-03'], 'y': [1, 2, 3]})
    X_test = pd.DataFrame({'ds': ['2023-01-04', '2023-01-05'], 'y': [4, 5]})
    X_val = pd.DataFrame({'ds': ['2023-01-06', '2021-03-07'], 'y': [6, 7]})
    results = pd.DataFrame({'ds': ['2023-01-08', '2023-01-09'], 'yhat': [8, 9]})

The function below writes both a CSV and a Parquet file. You can modify it according to your needs.

    def register_future_predictions(forecast_val, ws):
        last_date = forecast_val['ds'].iloc[-1]
        future_dates = pd.date_range(start=last_date, periods=182, freq='W')
        future_df = pd.DataFrame({'ds': future_dates})
        for col in X_val.columns:
            if col != 'ds':
                future_df[col] = 0
        future_predictions = results
        reg_data_future = pd.concat([X_train, X_test, X_val,
                                     future_predictions[['ds', 'yhat']].rename(columns={'ds': 'Created on day', 'yhat': 'target_col'})])
        # Save the dataframe as a file, e.g., CSV or Parquet
        reg_data_future.to_csv('reg_data_future.csv', index=False)
        reg_data_future.to_parquet('reg_data_future.parquet', index=False)
        # Get the target datastore
        target_datastore = Datastore.get(ws, 'workspaceblobstore')
        # Upload the files to the datastore's container
        target_datastore.upload_files(files=['reg_data_future.csv', 'reg_data_future.parquet'], target_path='data')
        # Register the files as datasets
        csv_dataset = Dataset.Tabular.from_delimited_files(path=[(target_datastore, 'data/reg_data_future.csv')])
        csv_dataset.register(workspace=ws, name='full_data_csv', create_new_version=True)
        parquet_dataset = Dataset.Tabular.from_parquet_files(path=[(target_datastore, 'data/reg_data_future.parquet')])
        parquet_dataset.register(workspace=ws, name='full_data_parquet', create_new_version=True)
        return future_predictions, future_df

    # Create or load Azure ML workspace
    ws = Workspace.from_config()
    # Test the function
    sample_forecast_val = pd.DataFrame({'ds': ['2024-01-01', '2024-01-02', '2024-01-03']})
    predictions, df = register_future_predictions(sample_forecast_val, ws)

The updated function executes successfully.
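
One change in the function that is easy to miss is the `columns=` keyword passed to `rename`. The sketch below uses only the sample frames from this answer to show the difference. It also shows what `pd.concat` does with the renamed columns. Whether the prediction columns should instead line up with `ds` and `y` depends on the real data, so treat this as a check to run rather than a prescribed fix.

```python
import pandas as pd

X_train = pd.DataFrame({'ds': ['2023-01-01', '2023-01-02', '2023-01-03'], 'y': [1, 2, 3]})
results = pd.DataFrame({'ds': ['2023-01-08', '2023-01-09'], 'yhat': [8, 9]})
mapping = {'ds': 'Created on day', 'yhat': 'target_col'}

# Without `columns=`, the mapping is applied to the index labels (0 and 1),
# so nothing matches and the column names stay as they were.
unchanged = results.rename(mapping)
print(list(unchanged.columns))  # ['ds', 'yhat']

# With `columns=`, the column names are renamed.
renamed = results.rename(columns=mapping)
print(list(renamed.columns))  # ['Created on day', 'target_col']

# pd.concat takes the union of the column names, so frames whose columns
# differ end up side by side with NaN in the cells that have no value.
combined = pd.concat([X_train, renamed])
print(list(combined.columns))  # ['ds', 'y', 'Created on day', 'target_col']
print(combined.shape)  # (5, 4)
```

If a single date column and a single target column are expected in the registered dataset, inspecting `combined.columns` before writing the files is a cheap way to catch a mismatch.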

This creates the CSV and Parquet files in the container.

It also registers each dataset as a data asset.
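
The upload and registration steps can also be folded into one reusable helper, so a single call handles either file type. This is a minimal sketch, not a definitive implementation: `blob_path` and `upload_and_register` are hypothetical names, `overwrite=True` is an added assumption, and the path logic assumes `upload_files` is called without `relative_root`, so that only the file name is kept under `target_path`, as in the function above.

```python
import posixpath
from pathlib import Path


def blob_path(target_path, local_file):
    # Datastore-relative path of an uploaded file, assuming upload_files()
    # is called without relative_root so only the file name is kept.
    return posixpath.join(target_path, Path(local_file).name)


def upload_and_register(ws, datastore_name, local_file, dataset_name, target_path='data'):
    # Imported lazily so blob_path() can be used without azureml-core installed.
    from azureml.core import Datastore, Dataset

    datastore = Datastore.get(ws, datastore_name)
    # overwrite=True is an assumption: it replaces an existing blob of the same name.
    datastore.upload_files(files=[local_file], target_path=target_path, overwrite=True)

    # Pick the tabular factory that matches the file extension.
    path = [(datastore, blob_path(target_path, local_file))]
    if Path(local_file).suffix == '.parquet':
        dataset = Dataset.Tabular.from_parquet_files(path=path)
    else:
        dataset = Dataset.Tabular.from_delimited_files(path=path)
    return dataset.register(workspace=ws, name=dataset_name, create_new_version=True)


print(blob_path('data', 'reg_data_future.csv'))  # data/reg_data_future.csv
```

With a workspace loaded, a call such as `upload_and_register(ws, 'workspaceblobstore', 'reg_data_future.parquet', 'full_data_parquet')` would then cover the upload and the registration in one step.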

Please refer to this documentation for more details and examples.

  • Posted on June 13, 2023 at 14:38:56
