Fast Fourier Transform (fft) aggregation on Spark Dataframe groupby
Question
I am trying to compute the FFT over a window on a Spark DataFrame, using NumPy's FFT as an aggregation, like this:
import numpy as np
from pyspark.sql import functions as func
from pyspark.sql.functions import (
    percentile_approx, avg, min, max, kurtosis, var_samp, stddev_samp,
)
df_grouped = df.groupBy(
    "id",
    "type",
    "mode",
    func.window("timestamp", "10 seconds", "5 seconds"),
).agg(
    percentile_approx("value", 0.25).alias("quantile_1(value)"),
    percentile_approx("magnitude", 0.25).alias("quantile_1(magnitude)"),
    percentile_approx("value", 0.5).alias("quantile_2(value)"),
    percentile_approx("magnitude", 0.5).alias("quantile_2(magnitude)"),
    percentile_approx("value", 0.75).alias("quantile_3(value)"),
    percentile_approx("magnitude", 0.75).alias("quantile_3(magnitude)"),
    avg("value"),
    avg("magnitude"),
    min("value"),
    min("magnitude"),
    max("value"),
    max("magnitude"),
    kurtosis("value"),
    kurtosis("magnitude"),
    var_samp("value"),
    var_samp("magnitude"),
    stddev_samp("value"),
    stddev_samp("magnitude"),
    np.fft.fft("value"),
    np.fft.fft("magnitude"),
    np.fft.rfft("value"),
    np.fft.rfft("magnitude"),
)
Every aggregation function works fine; however, for the FFT I get:
tuple index out of range
and I don't understand why. Do I need to do anything in particular to the values for NumPy's FFT to work? The values are all floats. When I print the column it looks like this:
[Row(value_0=6.247499942779541), Row(value_0=63.0), Row(value_0=54.54375076293945), Row(value_0=0.7088077664375305), Row(value_0=51.431251525878906), Row(value_0=0.09377499669790268), Row(value_0=0.09707500040531158), Row(value_0=6.308750152587891), Row(value_0=8.503950119018555), Row(value_0=295.8463134765625), Row(value_0=7.938048839569092), Row(value_0=8.503950119018555), Row(value_0=0.7090428471565247), Row(value_0=0.7169944643974304), Row(value_0=0.5659012794494629)]
I am guessing the Spark Row objects might be the issue, but I am unsure how to convert them in this context.
Answer 1
Score: 0
np.fft.fft is a NumPy function, not a PySpark function, so you cannot apply it directly to a DataFrame.
Moreover, it takes an array as input, while "value" is just a string. fft cannot infer that the string should mean "the aggregated list of the values of the column value"; you have to collect that list manually.
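The error message itself is consistent with this: NumPy coerces the string with np.asarray, which produces a 0-dimensional array, and the FFT then fails when it looks up the length of the last axis. A minimal reproduction:

```python
import numpy as np

# np.fft.fft first converts its input with np.asarray. A plain string
# becomes a 0-dimensional array, so looking up the length of the last
# axis fails -- which is exactly the "tuple index out of range" error.
arr = np.asarray("value")
print(arr.shape)   # () -- no axes at all

try:
    np.fft.fft("value")
except IndexError as e:
    print(e)       # tuple index out of range
```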
import numpy as np
from pyspark.sql import functions as F, types as T

df_grouped = df.groupBy(
    "id",
    "type",
    "mode",
    F.window("timestamp", "10 seconds", "5 seconds"),
).agg(
    F.percentile_approx("value", 0.25).alias("quantile_1(value)"),
    ...,
    F.stddev_samp("magnitude"),
    # I replace the np.fft.fft calls with collect_list, which gathers
    # each group's values into an array column.
    F.collect_list("value").alias("values"),
    F.collect_list("magnitude").alias("magnitudes"),
)

# Definition of the UDF fft. The FFT output is complex and Spark has no
# complex type, so keep only the real part here (use np.abs(x) instead
# if you want magnitudes). Do the same for rfft.
@F.udf(T.ArrayType(T.FloatType()))
def fft_udf(array):
    return [float(x.real) for x in np.fft.fft(array)]

# Do that for all your columns. Note that withColumn returns a new
# DataFrame, so the result must be assigned.
df_grouped = df_grouped.withColumn("fft_values", fft_udf(F.col("values")))
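For the rfft variant the output shape differs: a length-n real input yields n // 2 + 1 complex bins. Here is a quick local check of the would-be UDF body — a sketch in plain NumPy, independent of Spark; wrapping it with @F.udf(T.ArrayType(T.FloatType())) works exactly as for fft_udf above:

```python
import numpy as np

# Body of an rfft UDF, checked locally on a small real-valued signal.
# np.fft.rfft returns n // 2 + 1 complex bins for a length-n real input.
def rfft_body(array):
    return [float(x.real) for x in np.fft.rfft(array)]

signal = [1.0, 0.0, 0.0, 0.0]      # unit impulse
print(rfft_body(signal))           # [1.0, 1.0, 1.0] -- flat spectrum
print(len(np.fft.fft(signal)))     # 4 bins from the full fft
print(len(np.fft.rfft(signal)))    # 3 bins (= 4 // 2 + 1) from rfft
```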