How to add a new column when writing to a Delta table?
You can add a new column to a Delta table in Delta Lake by specifying the overwrite_schema parameter as True when using the write_deltalake function. Here's the relevant part of your Python code with the necessary adjustment:
write_deltalake(
    "s3a://my-bucket/delta-tables/motor",
    df,
    mode="append",
    schema=schema,
    storage_options=storage_options,
    overwrite_schema=True,  # Set this to True
)
By setting overwrite_schema to True, you are telling Delta Lake to update the schema of the Delta table to match the schema of your new data, which includes the additional "pressure" column. This will allow you to add the new column to the existing Delta table without encountering a schema mismatch error.
Question
I am using delta-rs to write to a Delta table in Delta Lake. Here is my code:
import time

import numpy as np
import pandas as pd
import pyarrow as pa
from deltalake.writer import write_deltalake

num_rows = 10
timestamp = np.array([time.time() + i * 0.01 for i in range(num_rows)])
current = np.random.rand(num_rows) * 10
voltage = np.random.rand(num_rows) * 100
temperature = np.random.rand(num_rows) * 50

data = {
    "timestamp": timestamp,
    "current": current,
    "voltage": voltage,
    "temperature": temperature,
}
df = pd.DataFrame(data)

storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

schema = pa.schema(
    [
        ("timestamp", pa.float64()),
        ("current", pa.float64()),
        ("voltage", pa.float64()),
        ("temperature", pa.float64()),
    ]
)

write_deltalake(
    "s3a://my-bucket/delta-tables/motor",
    df,
    mode="append",
    schema=schema,
    storage_options=storage_options,
)
The above code successfully wrote the data, including 4 columns, to a Delta table. I can confirm this with Spark SQL:
spark-sql> describe table delta.`s3a://my-bucket/delta-tables/motor`;
23/05/22 06:38:51 WARN ObjectStore: Failed to get database delta, returning NoSuchObjectException
timestamp double
current double
voltage double
temperature double
# Partitioning
Not partitioned
Time taken: 0.39 seconds, Fetched 7 row(s)
spark-sql> select * from delta.`s3a://my-bucket/delta-tables/motor` limit 10;
23/05/22 07:01:50 WARN ObjectStore: Failed to get database delta, returning NoSuchObjectException
1.683746477029865E9 7.604250297497938 9.421758439102415 72.1927369069416
1.683746477039865E9 0.09092487512480374 17.989035574705202 35.350210012093214
1.683746477049866E9 7.493128659573002 9.390891728445448 48.541259705334625
1.683746477059866E9 2.717780962917138 0.9268887657049119 59.10566692023579
1.683746477069866E9 2.57300442470119 17.486083607683693 47.23521355609355
1.683746477079866E9 2.09432242350117 14.945888123248054 47.125030870747715
1.683746477089866E9 4.136491853926207 16.52334128991138 27.544656909406505
1.6837464770998669E9 1.1299759566741152 5.539831633892187 52.50892511866684
1.6837464771098669E9 0.9626607062002979 8.400536671329352 72.49131313291358
1.6837464771198668E9 7.6866231204656446 4.033915109232906 48.900631068812075
Time taken: 5.925 seconds, Fetched 10 row(s)
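(Aside: the same check can be done with delta-rs itself rather than Spark SQL. A minimal sketch, assuming the deltalake package's DeltaTable reader and the same table URI and storage_options as above:

from deltalake import DeltaTable

# Load the table metadata and inspect its schema.
dt = DeltaTable(
    "s3a://my-bucket/delta-tables/motor",
    storage_options=storage_options,
)
print(dt.schema())              # should list the four columns written above
print(dt.to_pandas().head(10))  # same rows the Spark SQL query returned
)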
Now I am trying to write to the Delta table with a new column, pressure:
import time

import numpy as np
import pandas as pd
import pyarrow as pa
from deltalake.writer import write_deltalake

num_rows = 10
timestamp = np.array([time.time() + i * 0.01 for i in range(num_rows)])
current = np.random.rand(num_rows) * 10
voltage = np.random.rand(num_rows) * 100
temperature = np.random.rand(num_rows) * 50
pressure = np.random.rand(num_rows) * 1000

data = {
    "timestamp": timestamp,
    "current": current,
    "voltage": voltage,
    "temperature": temperature,
    "pressure": pressure,
}
df = pd.DataFrame(data)

storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

schema = pa.schema(
    [
        ("timestamp", pa.float64()),
        ("current", pa.float64()),
        ("voltage", pa.float64()),
        ("temperature", pa.float64()),
        ("pressure", pa.float64()),  # <- I added this line
    ]
)

write_deltalake(
    "s3a://my-bucket/delta-tables/motor",
    df,
    mode="append",
    schema=schema,
    storage_options=storage_options,
    overwrite_schema=True,  # <- Adding or omitting this returns the same error
)
Note that whether or not I add overwrite_schema=True to the write_deltalake call, the result is the same. It throws this error:
...
Traceback (most recent call last):
File "python3.11/site-packages/deltalake/writer.py", line 180, in write_deltalake
raise ValueError(
ValueError: Schema of data does not match table schema
Table schema:
timestamp: double
current: double
voltage: double
temperature: double
pressure: double
Data Schema:
timestamp: double
current: double
voltage: double
temperature: double
This error confused me: my existing Delta table's schema should have 4 columns, and the new data I want to write has 5 columns, but the error message reports the opposite.

How can I add a new column to a Delta table? Thanks!
Answer 1
Score: 0
It looks like you need mode='overwrite' to use overwrite_schema=True. (See the source code.)

This doesn't seem to be well documented. If you want to add a column when appending, you'd need to first overwrite the existing data with the column added, and then run the append, as in the sketch below.
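A minimal sketch of that overwrite-then-append sequence, assuming the DeltaTable reader API, the question's table URI and storage_options, and a placeholder backfill value for the old rows:

from deltalake import DeltaTable
from deltalake.writer import write_deltalake

# Read the existing four-column table back into pandas.
dt = DeltaTable("s3a://my-bucket/delta-tables/motor", storage_options=storage_options)
existing = dt.to_pandas()

# Add the new column, backfilling old rows with a placeholder of your choice.
existing["pressure"] = 0.0

# Overwrite the table with the widened schema; overwrite_schema=True
# only takes effect together with mode="overwrite".
write_deltalake(
    "s3a://my-bucket/delta-tables/motor",
    existing,
    mode="overwrite",
    overwrite_schema=True,
    storage_options=storage_options,
)

# Subsequent appends of five-column data should now match the table schema.
write_deltalake(
    "s3a://my-bucket/delta-tables/motor",
    df,  # the five-column DataFrame from the question
    mode="append",
    storage_options=storage_options,
)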
Answer 2
Score: 0
"截止到今天,此功能不受支持。\n\n这是功能请求票 https://github.com/delta-io/delta-rs/issues/1386"
英文:
As of today, this feature is not supported.
Here is the feature request ticket: https://github.com/delta-io/delta-rs/issues/1386
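For readers arriving later: newer delta-rs releases added writer-side schema evolution (the feature that issue tracks). A hedged sketch, assuming a recent deltalake version whose write_deltalake accepts a schema_mode parameter (check the docs for your version; some 0.x releases also required engine="rust"):

from deltalake.writer import write_deltalake

write_deltalake(
    "s3a://my-bucket/delta-tables/motor",
    df,  # the five-column DataFrame from the question
    mode="append",
    schema_mode="merge",  # merge the new "pressure" column into the table schema
    storage_options=storage_options,
)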