Add date column on per-file basis with Polars when aggregating over multiple Parquet files


Question


I have a very large number of Parquet data files that I can nicely join and aggregate with Polars by doing something like this (note the glob in the filename):

(
    pl.scan_parquet('data/data-16828*.parquet')
    .groupby(['type_id', 'location_id'])
    .agg([
        pl.min('n').alias('n_min'),
        pl.max('n').alias('n_max')
    ])
    .collect()
)

Each file is the output of a script that runs every five minutes, and my goal is to make a single time-series DataFrame out of them. There is a date column of type datetime[μs, UTC]. However, I discovered that the values of this column are not equal within a single file; rather, they reflect the exact time during the run at which each row was created. Because of this, the date column, as it is, is useless for grouping.
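
For illustration, here is a minimal sketch of the situation described above (the column names come from this question; the values are made up):

import polars as pl
from datetime import datetime

# Toy stand-in for one five-minute output file. The real `date` column is
# datetime[μs, UTC]; the timezone is omitted here to keep the sketch short.
one_file = pl.DataFrame({
    'type_id':     [1, 1, 2],
    'location_id': [10, 11, 10],
    'n':           [5, 7, 3],
    'date': [
        datetime(2023, 6, 1, 4, 0, 1),
        datetime(2023, 6, 1, 4, 0, 3),   # a couple of seconds later
        datetime(2023, 6, 1, 4, 0, 6),
    ],
})

# Every row carries a slightly different timestamp, so `date` cannot serve as
# a per-file key: the number of distinct values equals the number of rows.
print(one_file['date'].n_unique(), one_file.height)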

The way I see it, I should probably add a new column and populate it with the date value of the first row on a per-file basis. Is it possible to achieve this with Polars' lazy API, or am I going to have to fix the files first before running the aggregation with Polars?

Please note that I need to use the lazy API as the dataset is way larger than memory.

Answer 1

Score: 1


The lazyframe doesn't have any information about the file from whence it came. For that reason you'll need to move the iteration out of polars so that you can feed the file info to the lazyframe yourself.

Something like this:

from pathlib import Path

import polars as pl

lazydf = []
basepath = Path('data/')
for myfile in basepath.iterdir():
    # only pick up the files the original glob would have matched
    if "data-16828" not in myfile.name or myfile.suffix != '.parquet':
        continue
    lazydf.append(
        pl.scan_parquet(myfile)
        .groupby(['type_id', 'location_id'])   # group_by in newer Polars releases
        .agg([
            pl.min('n').alias('n_min'),
            pl.max('n').alias('n_max')
        ])
        # tag each aggregated row with the file it came from
        .with_columns(source_file=pl.lit(myfile.name))
    )
pl.concat(lazydf)

This doesn't capture the first-row aspect. For that, you'd need to step away from the groupby/agg model and use window functions, so that each column gets its own grouping, like this:

from pathlib import Path

import polars as pl

lazydf = []
basepath = Path('data/')
for myfile in basepath.iterdir():
    if "data-16828" not in myfile.name or myfile.suffix != '.parquet':
        continue
    lazydf.append(
        pl.scan_parquet(myfile)
        .select(
            'type_id',
            'location_id',
            # per-group min/max computed as window expressions,
            # so no rows are collapsed at this point
            n_min=pl.col('n').min().over(['type_id', 'location_id']),
            n_max=pl.col('n').max().over(['type_id', 'location_id']),
            # the date of the file's first row, broadcast to every row
            date=pl.col('date').first(),
        )
        # collapse the duplicates back down to one row per group
        .unique(subset=['type_id', 'location_id', 'n_min', 'n_max', 'date'])
    )
pl.concat(lazydf)
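
For completeness, a minimal usage sketch (nothing here is required by the approach above): once the per-file aggregates are concatenated, only those small aggregate rows need to be materialised, so the result should fit in memory even though the raw data does not.

# Combine the per-file aggregates and materialise them as a single
# time-series frame, one row per (date, type_id, location_id).
timeseries = (
    pl.concat(lazydf)
    .sort('date')
    .collect()
)
print(timeseries.head())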
