How to create a make_batch_reader object of the petastorm library in Databricks?


Question


I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.

Though I was able to do this on my local system, the same code is not working in Databricks.

Code I used on my local system:

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# create an iterator object train_reader. num_epochs is the number of epochs
# for which we want to train our model; model is a compiled Keras model.
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
  train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

  for ele in train_ds:
    tensor = tf.reshape(ele, (2, 1, 15))
    model.fit(tensor, tensor)

Code I used in Databricks:

with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)

The error I am getting with the Databricks code is:

TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'

I have checked the documentation, but couldn't find any arguments that go by the names instance and token. However, for a similar petastorm method, make_reader, I see the following lines of code for Azure Databricks:

# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)

with make_reader('{}/data_directory'.format(remote_url), storage_options={'sas_token': sas_token}) as reader:
    for row in reader:
        print(row)

Here I see a 'sas_token' being passed as an input parameter.

Please suggest how I can resolve this error.

I tried changing the path of the parquet file, but that did not work for me.

Answer 1

Score: 1


The problem is that on Databricks you have to provide the path in a different format; the following works for me. Add the file scheme and use three forward slashes (///), like this: petastorm_dataset_url = "file://" + get_local_path(parquet_path)

'file:///dbfs/output/scaled.parquet'
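
For context, here is a minimal sketch of the corrected call under that scheme, based on the question's setup. get_local_path is a hypothetical helper (it is not part of petastorm) that maps a dbfs:/ URI onto the /dbfs FUSE mount that Databricks exposes on its nodes:

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

def get_local_path(parquet_path):
    # Hypothetical helper: DBFS is mounted at /dbfs on Databricks nodes,
    # so dbfs:/output/scaled.parquet becomes /dbfs/output/scaled.parquet.
    return "/dbfs/" + parquet_path.replace("dbfs:/", "").lstrip("/")

petastorm_dataset_url = "file://" + get_local_path("dbfs:/output/scaled.parquet")
# petastorm_dataset_url is now 'file:///dbfs/output/scaled.parquet'

with make_batch_reader(petastorm_dataset_url, num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

This way petastorm reads through the ordinary local filesystem path rather than the dbfs:// URL from the question.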

Answer 2

Score: 0


The SAS token used in the code can be generated for your container using the following steps:

  • Navigate to where your container exists and open its settings. Click "Generate SAS".


  • Now select all the required permissions that you are going to grant (the operations you need to perform).


  • When you click Generate, you will get the token that can be used in your code, as in the sketch below.

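Applied to the question's make_batch_reader call, a minimal sketch follows. It assumes your petastorm version accepts storage_options in make_batch_reader the same way the make_reader snippet in the question does; the container URL, dataset path, and token values are placeholders:

from petastorm import make_batch_reader

# Placeholders: substitute your own container, storage account, and the SAS
# token generated in the steps above.
remote_url = "abfs://container_name@storage_account_url"
sas_token = "<<generated sas token>>"

with make_batch_reader('{}/output/scaled.parquet'.format(remote_url),
                       num_epochs=4,
                       shuffle_row_groups=False,
                       storage_options={'sas_token': sas_token}) as train_reader:
    for batch in train_reader:
        print(batch)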
