pandas的数据框合并挑战,具有相同的字符串但不同的Unicode编码

huangapple go评论73阅读模式
英文:

pandas' dataframes merge challenge with identical strings but different unicodes

问题

我在使用pd.merge时遇到了一个问题,当我尝试将两个数据集中的两列合并时,我发现这两列中的某些行的unicode不同,尽管字符串是相同的。这是一个例子:

我有两个数据集,data1和data2,它们都有两列共同的列,'state'和'county',这将是我用来合并这两个数据集的列。我检查了这两个数据集中'state'和'county'的数据类型。它们都是类'str'。

通过使用

data_merge = pd.merge(data1, data2, on=['county','state'],how='right') 

我应该在data1的第308行和data2的第20691行之间有一个匹配,但由于data1第308行的'county'和第20691行的'county'具有不同的unicode,所以它们不匹配:

我查看了这两个单词的unicode(unicode1是data1中'county'的unicode,unicode2是data2中'county'的unicode),它们确实不同:

如何解决这个问题并合并这两个数据集?有没有办法告诉pd.merge忽略这些unicode的差异?谢谢!

英文:

I have a problem using pd.merge when some of the rows in the two columns in the two datasets I use to merge the two datasets have different unicodes even though the strings are identical. Here is one example:

I have two datasets data1 and data2 both of which have 2 columns in common, 'state' and 'county', which will be the columns I use to merge the two datasets. I checked datatype for both 'state' and 'county' in two datasets. They are all of class 'str'.

By using

data_merge = pd.merge(data1, data2, on=['county','state'],how='right') 

I should have a match between data1 row 308 and data2 row 20691 but it's not a match due to the 'county' in data1 row 308 and the 'county' in row 20691 have different unicodes:

pandas的数据框合并挑战,具有相同的字符串但不同的Unicode编码

I looked into the unicode of these two words (unicode1 is the unicode of county in data1 and unicode2 is the unicode of county in data2) and they are indeed different:

pandas的数据框合并挑战,具有相同的字符串但不同的Unicode编码

How do I go about merging these two datasets with this issue? Is there a way I can tell pd.merge to ignore the unicode differences? Thank you!

答案1

得分: 1

在小写k和大写K之间的Unicode差异可以通过print(chr(107),chr(75))来证明。在小写字母上进行合并可能会解决你的问题。

然而,假设你喜欢右侧的大写字母,并且想保留它们。因为你在进行右连接,我们可以知道合并后的索引将与合并前相同。因此,在合并之前复制大写字母的值,然后在合并后进行替换。

import pandas as pd

print(chr(107), chr(75))  # k K

data1 = pd.DataFrame({'county': ['mckinley', 'bell'], 'state': ['NM', 'tx'],'amount': [1, 2]})
data2 = pd.DataFrame({'county': ['mckinley', 'beLL', 'harris'], 'state': ['NM', 'TX','tx']})

#可选:保留大写
orig_caps = data2[['county','state']].copy()

for item in ['county', 'state']:
    data1[item] = data1[item].str.lower()
    data2[item] = data2[item].str.lower()

data_merge = pd.merge(data1, data2, on=['county', 'state'], how='right')
data_merge[['county','state']] = orig_caps
print(data_merge)

结果:

     county state  amount
0  mckinley    NM     1.0
1      beLL    TX     2.0
2    harris    tx     NaN
英文:

Well, the difference in unicode is a lowercase k vs a capital K - which you can prove to yourself with print(chr(107),chr(75)). Merging on lowercase will probably solve your problem.

However, let's say you like the capitalizations you have on the right side, and want to preserve them. Because you're doing a right merge, we can know that the indexes will be the same coming out of the merge as going in. Therefore, copy the capitalized values before the merge and replace them after.

import pandas as pd

print(chr(107), chr(75))  # k K

data1 = pd.DataFrame({'county': ['mckinley', 'bell'], 'state': ['NM', 'tx'],'amount': [1, 2]})
data2 = pd.DataFrame({'county': ['mckinley', 'beLL', 'harris'], 'state': ['NM', 'TX','tx']})

#Optional: To preserve caps
orig_caps = data2[['county','state']].copy()

for item in ['county', 'state']:
    data1[item] = data1[item].str.lower()
    data2[item] = data2[item].str.lower()

data_merge = pd.merge(data1, data2, on=['county', 'state'], how='right')
data_merge[['county','state']] = orig_caps
print(data_merge)

Yields:

     county state  amount
0  mckinley    NM     1.0
1      beLL    TX     2.0
2    harris    tx     NaN

huangapple
  • 本文由 发表于 2023年7月7日 07:18:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/76633054.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定