March 31, 2023, 18:35:49

Why is pandas.Series.str.extract not working here but working elsewhere?

Question

Why can pandas.Series.str.extract(regex) print the correct values, yet fail to assign them to an existing column using indexing or where?

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']], 
    columns=['id', "Street", 'Postcode', 'FullAddress']
)

m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),'))                        # prints street
print(df["FullAddress"].str.extract(r'\b(\d{5})\b'))                   # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),')  # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))

What I'm trying to do is fill the empty Street and Postcode with the values from FullAddress - without disturbing the existing Street and Postcode values.

There is no problem with the indexing, the regex, or even the extract. I've read the docs and searched for anything similar. What is everyone else getting that I don't understand?

Answer 1

Score: 1

You are missing the expand=False parameter of str.extract:

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')

             0  # <- it's not a Series but a DataFrame with one column
0  1 Banana St

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)

0    1 Banana St
Name: FullAddress, dtype: object  # <- now it's a Series

In the first version, pandas can't align the column labels (Street vs. 0), so the assignment produces NaN. In the second version, the result is a Series that aligns with the Street column on the index, so:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
print(df)

# Output
  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St  67898.0     2 Choco Rd, 69412 Eberbach
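
The where attempt from the question appears to fail for the same reason: the second argument is a one-column DataFrame rather than a Series. Below is a minimal sketch of that variant, assuming the same sample data as the question. The names extracted and street are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', 67898, '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'Postcode', 'FullAddress'],
)

# With a single capture group, expand=False returns a Series instead of a DataFrame.
extracted = df['FullAddress'].str.extract(r'(.+?),', expand=False)

# Keep the existing street where one is present; otherwise take the extracted value.
street = df['Street'].where(df['Street'].notna(), extracted)
print(street.tolist())
```

Assigning street back to df['Street'] should then give the same Street column as the .loc version above.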

Update: it's possible to use extract without expand=False by using named groups (?P<xxx>...) so that the column labels align:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(?P<Postcode>\d{5})\b')

# OR

pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
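
Named groups help here because str.extract uses the group names as the column labels of the DataFrame it returns. The short sketch below checks this on a hypothetical sample Series called addresses, which stands in for the FullAddress column. The name parts is also just illustrative.

```python
import pandas as pd

# Hypothetical sample data mirroring the FullAddress column from the question.
addresses = pd.Series([
    '1 Banana St, 69126 Heidelberg',
    '2 Choco Rd, 69412 Eberbach',
])

pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'

# Each named group becomes a column whose label is the group name.
parts = addresses.str.extract(pattern)
print(parts.columns.tolist())
print(parts.iloc[0].tolist())
```

Note that the extracted postcodes are strings, since they come from a regex match.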

Answer 2

Score: 0

You can use .fillna to fill in the NaN values in your dataframe:

df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'\b(\d{5})\b')[0])

This will fill in all of your null values with the result of the extract while keeping all existing values:

  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St    67898     2 Choco Rd, 69412 Eberbach
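
Selecting column 0 of the extracted DataFrame and passing expand=False should both hand fillna a Series. The following is a small self-contained sketch for the Street column only, assuming the question's sample data. The names via_column, via_series and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', 67898, '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'Postcode', 'FullAddress'],
)

# Two ways to obtain a Series of extracted streets.
via_column = df['FullAddress'].str.extract(r'(.+?),')[0]
via_series = df['FullAddress'].str.extract(r'(.+?),', expand=False)

# fillna only touches missing entries, so 'Doloros St' is left as it is.
filled = df['Street'].fillna(via_series)
print(filled.tolist())
```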
