Python - Pandas x Polars - Values mapping (Lookup value)

Question


One way in Pandas to map a Series from one value to another (or to do a 'lookup value') is with the map function.

So if I have two DataFrames (say one dimension table and one fact table):

# imports
import numpy as np
import pandas as pd

# Dimension Table - DataFrame to hold the letter and its ID.
letters_ids = pd.DataFrame({
    'Letters': ['A', 'B', 'C'],
    'Letters_id': [1, 2, 3]     
}).set_index('Letters')

# Fact Table - DataFrame with millions of letters.
many_letters = pd.DataFrame({
   'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)
})

And considering that the business model requires us to add the 'Letters_id' to the fact table, we could do:

many_letters['letters_mapped'] = many_letters.Letters.map(letters_ids.squeeze())

That's clean and straightforward.

Now what about Polars?
I mean, Polars has a map function, but it doesn't seem to work like the Pandas one.

I found two ways with Polars to map a Series from one value to another (or to do a 'lookup value'), but I have the feeling I'm missing the proper way of doing it.

So say we have the same dataset with Polars:

# imports
import numpy as np
import polars as pl

# Dimension Table - DataFrame to hold the letter and its ID.
letters_ids = pl.DataFrame({
    'Letters': ['A', 'B', 'C'],
    'Letters_id': [1, 2, 3]     
})

# Fact Table - DataFrame with millions of letters.
many_letters = pl.DataFrame({
    'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)
})

Now in order to add to the fact table the 'Letters_id':

Approach # 1 - We can use the join method:

many_letters = many_letters.join(
    other=letters_ids,
    on='Letters',
    how='left'
).rename({'Letters_id': 'letters_mapped'})

Approach # 2 - Use map_dict (found the idea in this SO question):

# Convert the two-column DataFrame into a Python dictionary.
letters_dict = dict(letters_ids.iter_rows())

# Map the values through the dictionary.
many_letters = many_letters.with_columns(
    pl.col('Letters').map_dict(letters_dict).alias('letters_mapped')
)

Just as a piece of information when considering performance: both Polars approaches above do the job faster than Pandas:

  • Pandas mapping execution time (for 10 million rows): average 0.35 seconds
  • Polars mapping execution time (for 10 million rows): average 0.21 seconds (Approach # 1)
  • Polars mapping execution time (for 10 million rows): average 0.20 seconds (Approach # 2)
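For reference, timings like those above can be taken with a small harness along these lines (a sketch with a reduced row count; `best_of` is a hypothetical helper, and absolute numbers vary by machine):

```python
import time

import numpy as np
import pandas as pd

letters_ids = pd.DataFrame({
    'Letters': ['A', 'B', 'C'],
    'Letters_id': [1, 2, 3],
}).set_index('Letters')
many_letters = pd.DataFrame({
    'Letters': np.random.choice(['A', 'B', 'C'], 100_000)
})

def best_of(fn, repeats=3):
    # Return the fastest wall-clock time over a few repeats,
    # which damps out one-off interference from the OS.
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

elapsed = best_of(lambda: many_letters.Letters.map(letters_ids.squeeze()))
```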

Which of these is preferred or is there a 3rd approach which may be better still?

Answer 1

Score: 1


Bottom line first: joining the two DataFrames is the best way, BUT map_dict is a join under the hood, so between the two of them just do whichever is easier. Most of the time, stick with these two approaches.

The only conceivable third option, if you're dealing specifically with strings and indices, is to use the Categorical type instead. Categoricals, at their base level, just stash the unique values away and represent them in the data with an integer (technically a u32) index. You can control which integer gets assigned to a value by using the StringCache (see below for links to more details on doing that). You can then extract the underlying index with to_physical:

with pl.StringCache():
    letters_ids = pl.DataFrame({
        # An empty string is added at the beginning to get 1-based ids,
        # since categoricals use 0-based indexing.
        'Letters': ['', 'A', 'B', 'C'],
    }, schema={'Letters': pl.Categorical()})
    many_letters = pl.DataFrame({
        'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)
    }, schema={'Letters': pl.Categorical()})

At this point you could do

many_letters.with_columns(letter_id=pl.col('Letters').to_physical())
shape: (10_000_000, 2)
┌─────────┬───────────┐
│ Letters ┆ letter_id │
│ ---     ┆ ---       │
│ cat     ┆ u32       │
╞═════════╪═══════════╡
│ C       ┆ 3         │
│ C       ┆ 3         │
│ A       ┆ 1         │
│ A       ┆ 1         │
│ …       ┆ …         │
│ B       ┆ 2         │
│ B       ┆ 2         │
│ B       ┆ 2         │
│ C       ┆ 3         │
└─────────┴───────────┘

There are some major caveats here. If the id matches a database id and it is important to carry it over, then using categoricals is a little tougher. If the ids need to be anything other than unsigned integers, then they're out.

This issue report is illustrative of the difficulties of dealing with categoricals.

Here's an SO question with some other tricks for manipulating the underlying index.

huangapple
  • Posted on 2023-07-13 23:20:49
  • Please keep this link when reposting: https://go.coder-hub.com/76681016.html