Why is pandas.Series.str.extract not working here but working elsewhere?

Question

Why does pandas.Series.str.extract(regex) print the correct values, yet fail to assign them to an existing column when I use indexing or where?

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']], 
    columns=['id', "Street", 'Postcode', 'FullAddress']
)

m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),'))                        # prints street
print(df["FullAddress"].str.extract(r'\b(\d{5})\b'))                   # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),')  # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))

What I'm trying to do is fill the empty Street and Postcode with the values from FullAddress - without disturbing the existing Street and Postcode values.

There is no problem with the indexing, the regex, or even the extract on their own. I've read the docs and searched for anything similar. What is everyone else getting that I don't understand?

Answer 1

Score: 1

You are missing the expand=False parameter of str.extract:

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')

             0  # <- it's not a Series but a DataFrame with one column
0  1 Banana St

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)

0    1 Banana St
Name: FullAddress, dtype: object  # <- now it's a Series

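In the first version, pandas cannot align the target column label Street with the extracted DataFrame's column label 0, so the assignment produces NaN. In the second version, the extracted Series lines up with the Street column by index, so: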

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
print(df)

# Output
  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St  67898.0     2 Choco Rd, 69412 Eberbach
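
The difference in return types can be checked directly. The following is a minimal sketch that rebuilds the DataFrame from the question; the variable names as_frame, as_series, and filled_street are illustrative only. It also suggests that the where call from the question works once the replacement values are a Series rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', 67898, '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'Postcode', 'FullAddress'],
)

# expand=True is the default: the result is a DataFrame, even for one group.
as_frame = df['FullAddress'].str.extract(r'(.+?),')
# expand=False with a single capture group returns a Series instead.
as_series = df['FullAddress'].str.extract(r'(.+?),', expand=False)

print(type(as_frame).__name__)   # class name of the default result
print(type(as_series).__name__)  # class name of the expand=False result

# Series.where keeps values where the condition is True and takes the
# replacement (aligned by index) where it is False.
filled_street = df['Street'].where(df['Street'].notna(), as_series)
print(filled_street.tolist())
```

Passing as_frame instead of as_series to where is what triggers the "cannot align with a higher dimensional NDFrame" error quoted in the question.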

Update: it's possible to use extract without expand=False by using named groups (?P<xxx>...) to align column labels:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(?P<Postcode>\d{5})\b')

# OR

pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
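
To see why named groups help with alignment, it may be useful to inspect what extract returns for the combined pattern. This is a small sketch on the two address strings from the question; the names addresses and extracted are illustrative only.

```python
import pandas as pd

# The two FullAddress values from the question.
addresses = pd.Series(
    ['1 Banana St, 69126 Heidelberg', '2 Choco Rd, 69412 Eberbach'],
    name='FullAddress',
)

pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
extracted = addresses.str.extract(pattern)

# The group names become the column labels of the resulting DataFrame,
# which is what allows it to line up with the Street and Postcode columns.
print(list(extracted.columns))
print(extracted)
```

Note that the extracted postcodes are strings, while the existing Postcode value in the question is a number, so the column ends up with mixed types unless it is converted afterwards.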

Answer 2

Score: 0

You can use .fillna to fill in the NaN values in your dataframe:

df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'\b(\d{5})\b')[0])

This will fill in all of your null values with the result of the extract while keeping all existing values:

  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St    67898     2 Choco Rd, 69412 Eberbach
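
The [0] in the snippet above selects the single column of the DataFrame returned by extract, which yields a Series, so it plays the same role as expand=False. A minimal sketch of that equivalence, using the data from the question (the names via_column and via_expand are illustrative only):

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', 67898, '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'Postcode', 'FullAddress'],
)

# Column 0 of the default DataFrame result ...
via_column = df['FullAddress'].str.extract(r'(.+?),')[0]
# ... holds the same values as the Series returned with expand=False.
via_expand = df['FullAddress'].str.extract(r'(.+?),', expand=False)
print(via_column.tolist() == via_expand.tolist())

# fillna aligns the replacement Series by index and only touches missing
# entries, so the existing 'Doloros St' is left unchanged.
df['Street'] = df['Street'].fillna(via_expand)
print(df['Street'].tolist())
```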

huangapple
  • Posted on March 31, 2023 at 18:35:49
  • Please keep this link when reposting: https://go.coder-hub.com/75897551.html