How to reliably concatenate LazyFrames in Rust Polars

Question

Cargo.toml:
```toml
[dependencies]
polars = { version = "0.27.2", features = ["lazy"] }
```

I would expect that any two LazyFrames could be vertically concatenated as long as the columns they have in common had the same or promotable dtypes, with missing columns added in as nulls (like how pandas does it). But evidently they need to have the same columns:

use polars::lazy::dsl::*;
use polars::prelude::{concat, df, DataType, IntoLazy, NamedFrom, NULL};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "y" intentionally comes before "x" here
    let df1 = df!["y" => &[1, 5, 17], "x" => &[1, 2, 3]].unwrap().lazy();
    let df2 = df!["x" => &[4, 5]].unwrap().lazy();
    println!(
        "{:?}",
        concat(&[df1, df2], true, true).unwrap().collect()?
    );

    Ok(())
}

This errors with Error: ShapeMisMatch(Owned("Could not vertically stack DataFrame. The DataFrames appended width 2 differs from the parent DataFrames width 1")).

I tried adding the missing "y" column to df2:

// everything but this line is the same as above
let df2 = df!["x" => &[4, 5]]
    .unwrap()
    .lazy()
    .with_column(lit(NULL).cast(DataType::Int32).alias("y"));

They have the same columns (albeit in different orders) and dtypes now:

shape: (3, 2)
┌─────┬─────┐
│ y   ┆ x   │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 5   ┆ 2   │
│ 17  ┆ 3   │
└─────┴─────┘

shape: (2, 2)
┌─────┬──────┐
│ x   ┆ y    │
│ --- ┆ ---  │
│ i32 ┆ i32  │
╞═════╪══════╡
│ 4   ┆ null │
│ 5   ┆ null │
└─────┴──────┘

But they still can't be concatenated. Trying to do so gives the error Error: SchemaMisMatch(Owned("cannot vstack: because column names in the two DataFrames do not match for left.name='y' != right.name='x'")). Evidently concat() requires that the columns be in the same order in the underlying DataFrames.

But I don't think it's possible to enforce any particular column order in LazyFrames (and it really shouldn't need to be because column order is supposed to be immaterial). So, what would be the best way to vertically concatenate these two LazyFrames?

If possible, I'd prefer not to .collect() each of them into a DataFrame, vstack the DataFrames, and then call .lazy() on the result; that seems needlessly complicated. And even if I did .collect() them, I still wouldn't want to have to put the columns of the two DataFrames in the same order before stacking.

Edit:
After digging through the source it's pretty clear that this just isn't implemented. This ultimately gets compiled into a call to DataFrame::vstack_mut which does not support missing or differently-ordered columns:

pub fn vstack_mut(&mut self, other: &DataFrame) -> PolarsResult<&mut Self> {
    if self.width() != other.width() {
        if self.width() == 0 {
            self.columns = other.columns.clone();
            return Ok(self);
        }

        return Err(PolarsError::ShapeMisMatch(
            format!("Could not vertically stack DataFrame. The DataFrames appended width {} differs from the parent DataFrames width {}", self.width(), other.width()).into()
        ));
    }

    self.columns
        .iter_mut()
        .zip(other.columns.iter())
        .try_for_each::<_, PolarsResult<_>>(|(left, right)| {
            can_extend(left, right)?;
            left.append(right).expect("should not fail");
            Ok(())
        })?;
    Ok(self)
}
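
To make the failure mode concrete, here is a small plain-Rust model of the positional pairing that vstack_mut performs. It is only a sketch: Column, col, and vstack_positional are hypothetical names invented for illustration and are not part of Polars, and unlike the real function this model validates every pair before mutating anything.

```rust
/// A toy column: a name plus i32 values.
/// Hypothetical type for illustration, not a Polars Series.
#[derive(Debug, Clone)]
struct Column {
    name: String,
    values: Vec<i32>,
}

/// Small constructor to keep call sites short (hypothetical helper).
fn col(name: &str, values: &[i32]) -> Column {
    Column {
        name: name.to_string(),
        values: values.to_vec(),
    }
}

/// Append `other` onto `frame`, pairing columns strictly by position.
/// Columns are never looked up by name, so a different order is an error.
fn vstack_positional(frame: &mut Vec<Column>, other: &[Column]) -> Result<(), String> {
    // Mirrors the width check at the top of vstack_mut.
    if frame.len() != other.len() {
        return Err(format!(
            "width mismatch: {} vs {}",
            frame.len(),
            other.len()
        ));
    }
    // Assumption of this sketch: check all pairs first so that a failure
    // leaves `frame` untouched.
    for (left, right) in frame.iter().zip(other.iter()) {
        if left.name != right.name {
            return Err(format!(
                "column names do not match: left.name='{}' != right.name='{}'",
                left.name, right.name
            ));
        }
    }
    // Only now extend each column with the values at the same position.
    for (left, right) in frame.iter_mut().zip(other.iter()) {
        left.values.extend_from_slice(&right.values);
    }
    Ok(())
}
```

In this model, stacking a frame with columns (x, y) onto one with columns (y, x) fails on the first pair, which is the same shape of error as the SchemaMisMatch reported above.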

Answer 1

Score: 0

Well, it turns out the answer was rather simple once you know where to look. With the feature diagonal_concat, you unlock diag_concat_lf (and the eager diag_concat_df):

pub fn diag_concat_lf<L>(lfs: L, rechunk: bool, parallel: bool) -> PolarsResult<LazyFrame>
where
    L: AsRef<[LazyFrame]>,
{ ... }

pub fn diag_concat_df(dfs: &[DataFrame]) -> PolarsResult<DataFrame> { ... }
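
For reference, enabling the feature might look like the following in Cargo.toml. This is a sketch that assumes the same polars version as in the question (0.27.2); the feature name diagonal_concat is taken from this answer, and other releases may organize it differently.

```toml
[dependencies]
# "lazy" as in the question; "diagonal_concat" is the feature this answer names
polars = { version = "0.27.2", features = ["lazy", "diagonal_concat"] }
```

With that in place, the concat call in the question would presumably become diag_concat_lf(&[df1, df2], true, true), following the signature shown above.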

Answer 2

Score: 0

For anyone interested, it is also possible to create a DataFrame of null values whose dtype matches that of the other columns. I was able to do this by building the df with DataFrame::new:

let mut columns: Vec<Series> = vec![];
// Typing the None as Option<f64> gives the null column the expected dtype (Float64)
let series = Series::new(column_name, &[None::<f64>]);
columns.push(series);
let new_df = DataFrame::new(columns).unwrap();

Naturally, that uses a DataFrame instead of a LazyFrame. For the latter, the implementation above should suffice.
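
The idea behind both answers, padding each frame with typed null columns for the names it lacks and then stacking by name, can be sketched in plain Rust. This is a toy model only: Frame and diagonal_concat are hypothetical names, not Polars API, all columns are assumed to be nullable f64, and the output column order (first appearance across the inputs) is an assumption of the sketch.

```rust
use std::collections::HashMap;

/// A toy frame: ordered column names plus nullable f64 data per column.
/// Hypothetical type for illustration, not a Polars DataFrame.
#[derive(Debug, Clone)]
struct Frame {
    names: Vec<String>,
    data: HashMap<String, Vec<Option<f64>>>,
    height: usize,
}

impl Frame {
    /// Build a frame from (name, values) pairs of equal length.
    fn new(cols: Vec<(&str, Vec<Option<f64>>)>) -> Frame {
        let height = cols.first().map_or(0, |(_, v)| v.len());
        let mut names = Vec::new();
        let mut data = HashMap::new();
        for (name, values) in cols {
            assert_eq!(values.len(), height, "all columns must have equal length");
            names.push(name.to_string());
            data.insert(name.to_string(), values);
        }
        Frame { names, data, height }
    }
}

/// Stack frames by column name; a column missing from a frame is padded
/// with `None` for every row of that frame.
fn diagonal_concat(frames: &[Frame]) -> Frame {
    // Collect the union of column names in order of first appearance.
    let mut names: Vec<String> = Vec::new();
    for f in frames {
        for n in &f.names {
            if !names.contains(n) {
                names.push(n.clone());
            }
        }
    }
    let mut data: HashMap<String, Vec<Option<f64>>> = HashMap::new();
    for n in &names {
        let mut column = Vec::new();
        for f in frames {
            match f.data.get(n) {
                Some(values) => column.extend(values.iter().copied()),
                // Same idea as building an all-null column of the right type.
                None => column.extend(std::iter::repeat(None::<f64>).take(f.height)),
            }
        }
        data.insert(n.clone(), column);
    }
    let height = frames.iter().map(|f| f.height).sum();
    Frame { names, data, height }
}
```

Because columns are matched by name here, neither the order of the columns nor a missing column causes an error, which is the behavior the question was looking for.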

huangapple
  • Published on March 4, 2023, 03:34:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75631210.html