How to create make_batch_reader object of petastorm library in DataBricks?

Question
I have data saved in Parquet format. Petastorm is the library I am using to obtain batches of data for training.
Although this worked in my local system, the same code is not working in Databricks.
Code I used in my local system:

# Create an iterator object train_reader. num_epochs is the number of epochs for which we want to train our model.
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
Code I used in Databricks:

with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
The error I am getting with the Databricks code is:

TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'

I have checked the documentation, but couldn't find any arguments that go by the names instance and token. However, for petastorm's similar method make_reader, I see the following lines of code for Azure Databricks:
# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)
with make_reader('{}/data_directory'.format(remote_url), storage_options={'sas_token': sas_token}) as reader:
    for row in reader:
        print(row)
Here I see a 'sas_token' being passed as an input parameter.
Please suggest how I can resolve this error.
I tried changing the path of the Parquet file, but that did not work for me.
Answer 1
Score: 1
The problem is that on Databricks you have to provide the path in a different format; the following works for me. Add the file scheme and use three forward slashes (///), like this: petastorm_dataset_url = "file://" + get_local_path(parquet_path), which yields:
'file:///dbfs/output/scaled.parquet'
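Applied to the question's path, the rewrite can be sketched as a small helper (to_petastorm_url is a hypothetical name used only for illustration; it relies on DBFS paths being visible through the local /dbfs mount on Databricks):

```python
def to_petastorm_url(dbfs_path):
    """Hypothetical helper: convert a DBFS path such as
    'dbfs:/output/scaled.parquet' into the file:// URL form that
    make_batch_reader accepts on Databricks (via the /dbfs local mount)."""
    local_path = dbfs_path.replace("dbfs:/", "/dbfs/", 1)
    return "file://" + local_path

print(to_petastorm_url("dbfs:/output/scaled.parquet"))
# file:///dbfs/output/scaled.parquet
```

The resulting URL can then be passed to make_batch_reader exactly as in the question's local-system code.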
Answer 2
Score: 0
The SAS token that is used in the code can be generated for your container using the following steps:
- Navigate to your container, select Settings, and click Generate SAS.
- Select all the permissions you need to grant (the operations you need to perform).
- When you click Generate, you will get the token that can be used in your code.
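The generated token is then passed through storage_options exactly as in the question's make_reader snippet. A minimal sketch of assembling the pieces (the container name, account URL, and token below are placeholders, not real values):

```python
# Placeholders -- substitute your own container, storage account, and
# the SAS token generated in the portal.
container_name = "mycontainer"
storage_account_url = "myaccount.dfs.core.windows.net"
sas_token = "<paste the SAS token generated in the portal>"

remote_url = "abfs://{}@{}".format(container_name, storage_account_url)
storage_options = {"sas_token": sas_token}
print(remote_url)  # abfs://mycontainer@myaccount.dfs.core.windows.net

# Then, as in the question:
# with make_reader('{}/data_directory'.format(remote_url),
#                  storage_options=storage_options) as reader:
#     for row in reader:
#         print(row)
```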