How to create a make_batch_reader object of the petastorm library in Databricks?


Question


I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.

Though I was able to do this on my local system, the same code is not working in Databricks.

Code I used on my local system:

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# create an iterator object train_reader. num_epochs is the number of epochs
# for which we want to train our model; model is a compiled Keras model.
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
  train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

  for ele in train_ds:
    tensor = tf.reshape(ele, (2, 1, 15))
    model.fit(tensor, tensor)

Code I used in Databricks:

with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)

The error I am getting with the Databricks code is:

TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'

I have checked the documentation, but couldn't find any arguments that go by the names instance and token. However, for a similar petastorm method, make_reader, I see the following lines of code for Azure Databricks:

# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)

with make_reader('{}/data_directory'.format(remote_url), storage_options={'sas_token': sas_token}) as reader:
    for row in reader:
        print(row)

Here I see a 'sas_token' being passed as an input parameter.

Please suggest how I can resolve this error.

I tried changing the path of the parquet file, but that did not work for me.

Answer 1

Score: 1


The problem is that on Databricks you have to provide the path in a different format; the following works for me. Add the file scheme and use three forward slashes (///), like this: petastorm_dataset_url = "file://" + get_local_path(parquet_path)

'file:///dbfs/output/scaled.parquet'
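
For context, here is a minimal sketch of the corrected call under that scheme, based on the question's setup. get_local_path is a hypothetical helper (it is not part of petastorm) that maps a dbfs:/ URI onto the /dbfs FUSE mount that Databricks exposes on its nodes:

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

def get_local_path(parquet_path):
    # Hypothetical helper: DBFS is mounted at /dbfs on Databricks nodes,
    # so dbfs:/output/scaled.parquet becomes /dbfs/output/scaled.parquet.
    return "/dbfs/" + parquet_path.replace("dbfs:/", "").lstrip("/")

petastorm_dataset_url = "file://" + get_local_path("dbfs:/output/scaled.parquet")
# petastorm_dataset_url is now 'file:///dbfs/output/scaled.parquet'

with make_batch_reader(petastorm_dataset_url, num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

This way petastorm reads through the ordinary local filesystem path rather than the dbfs:// URL from the question.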

Answer 2

Score: 0


The SAS token used in the code can be generated for your container using the following steps:

  • Navigate to where your container exists and open its settings. Click "Generate SAS".


  • Now select all the required permissions that you are going to grant (the operations you need to perform).


  • When you click Generate, you will get the token that can be used in your code, as in the sketch below.

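Applied to the question's make_batch_reader call, a minimal sketch follows. It assumes your petastorm version accepts storage_options in make_batch_reader the same way the make_reader snippet in the question does; the container URL, dataset path, and token values are placeholders:

from petastorm import make_batch_reader

# Placeholders: substitute your own container, storage account, and the SAS
# token generated in the steps above.
remote_url = "abfs://container_name@storage_account_url"
sas_token = "<<generated sas token>>"

with make_batch_reader('{}/output/scaled.parquet'.format(remote_url),
                       num_epochs=4,
                       shuffle_row_groups=False,
                       storage_options={'sas_token': sas_token}) as train_reader:
    for batch in train_reader:
        print(batch)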
