July 7, 2023, 07:18:28

pandas DataFrame merge fails on identical-looking strings with different Unicode code points

Question

I have a problem with pd.merge: some rows fail to match because the values in the key columns are made of different Unicode code points, even though the strings look identical. Here is one example:

I have two datasets, data1 and data2, which share two columns, 'state' and 'county'. These are the columns I merge on. I checked the data type of 'state' and 'county' in both datasets, and all of them are of class 'str'.

By using

data_merge = pd.merge(data1, data2, on=['county', 'state'], how='right')

I expected data1 row 308 to match data2 row 20691, but it does not, because 'county' in data1 row 308 and 'county' in data2 row 20691 have different Unicode code points.

I looked at the code points of these two words (unicode1 for 'county' in data1 and unicode2 for 'county' in data2), and they are indeed different.

How can I merge these two datasets despite this issue? Is there a way to tell pd.merge to ignore the Unicode differences? Thank you!

Answer 1

Score: 1

The difference in code points is a lowercase k versus a capital K, which you can confirm with print(chr(107), chr(75)). Merging on lowercased keys will probably solve your problem.
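If it is not obvious where two cells differ, a small helper can list the differing positions along with their code points. This is only a sketch: the function name diff_code_points and the two sample strings are hypothetical stand-ins for the actual 'county' values, which are not shown in the question.

```python
from itertools import zip_longest


def diff_code_points(a, b):
    """List (index, char_in_a, char_in_b) for every position where a and b differ."""
    diffs = []
    for i, (ca, cb) in enumerate(zip_longest(a, b)):
        if ca != cb:
            diffs.append((i, ca, cb))
    return diffs


# Hypothetical values standing in for the two 'county' cells.
left = "Mckinley"
right = "McKinley"

for i, ca, cb in diff_code_points(left, right):
    # ord() gives the code point; None marks a position past the end of a string.
    print(i, repr(ca), ord(ca) if ca else None, repr(cb), ord(cb) if cb else None)
# prints: 2 'k' 107 'K' 75
```

A difference such as 107 versus 75 is a plain case mismatch. Other differences (for example a non-breaking space or a decomposed accent) would need a different normalization than lowercasing.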

However, suppose you like the capitalization on the right side and want to preserve it. Because this is a right merge, the result has one row per data2 row, in data2's order, provided each data2 row matches at most one data1 row and data2 has the default integer index. Under those conditions, you can copy the original values before the merge and write them back afterwards.

import pandas as pd

print(chr(107), chr(75))  # k K

data1 = pd.DataFrame({'county': ['mckinley', 'bell'], 'state': ['NM', 'tx'], 'amount': [1, 2]})
data2 = pd.DataFrame({'county': ['mckinley', 'beLL', 'harris'], 'state': ['NM', 'TX', 'tx']})

# Optional: keep a copy of the original capitalization on the right side
orig_caps = data2[['county', 'state']].copy()

# Lowercase the key columns in both frames so the merge ignores case
for item in ['county', 'state']:
    data1[item] = data1[item].str.lower()
    data2[item] = data2[item].str.lower()

data_merge = pd.merge(data1, data2, on=['county', 'state'], how='right')

# Restore the original values (relies on the row order being unchanged)
data_merge[['county', 'state']] = orig_caps
print(data_merge)

Yields:

     county state  amount
0  mckinley    NM     1.0
1      beLL    TX     2.0
2    harris    tx     NaN
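
An alternative sketch, not part of the original answer, is to merge on temporary normalized key columns and leave the original columns untouched, so nothing has to be written back by index. The helper name normalize_key and the "_key" column suffix are made up for illustration. The NFKC step is an assumption that some mismatches might come from visually identical characters with different code points rather than from case alone. The sample data is changed slightly to include 'McKinley'.

```python
import unicodedata

import pandas as pd


def normalize_key(s: pd.Series) -> pd.Series:
    """Normalize a string column for matching: NFKC form, trimmed, case-folded."""
    return s.map(
        lambda v: unicodedata.normalize("NFKC", v).strip().casefold()
        if isinstance(v, str)
        else v
    )


data1 = pd.DataFrame({"county": ["mckinley", "bell"], "state": ["NM", "tx"], "amount": [1, 2]})
data2 = pd.DataFrame({"county": ["McKinley", "beLL", "harris"], "state": ["NM", "TX", "tx"]})

keys = ["county", "state"]
key_cols = [f"{k}_key" for k in keys]

# Left side: keep only the payload plus the normalized keys.
left = data1.drop(columns=keys).assign(**{f"{k}_key": normalize_key(data1[k]) for k in keys})
# Right side: keep the original spelling and add the normalized keys next to it.
right = data2.assign(**{f"{k}_key": normalize_key(data2[k]) for k in keys})

data_merge = pd.merge(left, right, on=key_cols, how="right").drop(columns=key_cols)
data_merge = data_merge[["county", "state", "amount"]]
print(data_merge)
```

With this layout the right-hand spelling survives the merge even if a data2 row matches several data1 rows, because no values are reassigned afterwards.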
