Why is pandas.Series.str.extract not working here but working elsewhere?
Question
Why can pandas.Series.str.extract(regex) print the correct values, yet fail to assign those values to an existing column using indexing or where?
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']],
columns=['id', "Street", 'Postcode', 'FullAddress']
)
m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),')) # prints street
print(df["FullAddress"].str.extract(r'\b(\d{5})\b')) # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),') # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))
What I'm trying to do is fill the empty Street and Postcode with the values from FullAddress - without disturbing the existing Street and Postcode values.
There is no problem with the indexing, the regex, or even the extract. I've read the docs and searched for anything similar. What does everyone else get that I don't understand?
Answer 1
Score: 1
You missed the expand=False parameter of str.extract:
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')
0 # <- it's not a Series but a DataFrame with one column
0 1 Banana St
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
0 1 Banana St
Name: FullAddress, dtype: object # <- now it's a Series
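In the first version, the result is a one-column DataFrame whose column is labeled 0, and pandas can't align that label with Street, so the assignment produces NaN. In the second version, the result is a Series, which fits into the Street column, so: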
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
print(df)
# Output
id Street Postcode FullAddress
0 1 1 Banana St 69126 1 Banana St, 69126 Heidelberg
1 2 Doloros St 67898.0 2 Choco Rd, 69412 Eberbach
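A quick way to confirm this diagnosis is to inspect what extract returns in each case. The sketch below uses a reduced version of the question's frame (the Postcode column is left out, and the names as_frame and as_series are only illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# Reduced version of the question's data (assumption: Postcode omitted for brevity).
df = pd.DataFrame(
    [
        ["1", np.nan, "1 Banana St, 69126 Heidelberg"],
        ["2", "Doloros St", "2 Choco Rd, 69412 Eberbach"],
    ],
    columns=["id", "Street", "FullAddress"],
)
m = df["Street"].isna()

# Default expand=True: a DataFrame with a single column labeled 0.
as_frame = df.loc[m, "FullAddress"].str.extract(r"(.+?),")
# expand=False with one capture group: a plain Series.
as_series = df.loc[m, "FullAddress"].str.extract(r"(.+?),", expand=False)

is_frame = isinstance(as_frame, pd.DataFrame)
is_series = isinstance(as_series, pd.Series)
frame_columns = list(as_frame.columns)

# Assigning the Series only needs to align on the row index.
df.loc[m, "Street"] = as_series
streets = df["Street"].tolist()
print(is_frame, is_series, frame_columns, streets)
```

Checking the type of the right-hand side before assigning is usually the fastest way to spot this kind of alignment problem.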
Update: it's possible to use extract without expand=False by using named groups (?P<xxx>...) to align column labels:
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(?P<Postcode>\d{5})\b')
# OR
pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
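The named-group variant can be sketched end to end as follows. As an assumption for this example, the existing Postcode is stored as a string rather than a number, so that the extracted text and the existing values share one type; the variable name extracted is illustrative only:

```python
import pandas as pd

# Assumption: Postcode is kept as text here so extracted strings match the column's type.
df = pd.DataFrame(
    [
        ["1", None, None, "1 Banana St, 69126 Heidelberg"],
        ["2", "Doloros St", "67898", "2 Choco Rd, 69412 Eberbach"],
    ],
    columns=["id", "Street", "Postcode", "FullAddress"],
)
m = df["Street"].isna()

# Named groups become the column labels of the extracted DataFrame.
pattern = r"(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b"
extracted = df.loc[m, "FullAddress"].str.extract(pattern)
extracted_columns = list(extracted.columns)

# The labels now match the target columns, so the assignment can align.
df.loc[m, ["Street", "Postcode"]] = extracted
streets = df["Street"].tolist()
postcodes = df["Postcode"].tolist()
print(df)
```

One pattern with two named groups keeps the street and postcode extraction in a single pass over FullAddress.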
Answer 2
Score: 0
You can use .fillna to fill in the NaN values in your dataframe:
df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'\b(\d{5})\b')[0])
This will fill in all of your null values with the result of the extract while keeping all existing values:
id Street Postcode FullAddress
0 1 1 Banana St 69126 1 Banana St, 69126 Heidelberg
1 2 Doloros St 67898 2 Choco Rd, 69412 Eberbach
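As a possible variation, the same fill can be written with expand=False instead of selecting column 0, and the where call from the question also works once its argument is a Series. In this sketch the pd.to_numeric conversion is an added assumption (it keeps Postcode numeric instead of mixing numbers and strings), and the names street, postcode and street_via_where are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["1", np.nan, np.nan, "1 Banana St, 69126 Heidelberg"],
        ["2", "Doloros St", 67898, "2 Choco Rd, 69412 Eberbach"],
    ],
    columns=["id", "Street", "Postcode", "FullAddress"],
)

# expand=False returns a Series, so no column selection is needed.
street = df["FullAddress"].str.extract(r"(.+?),", expand=False)
# Assumption: convert the extracted text to numbers to match the existing Postcode values.
postcode = pd.to_numeric(df["FullAddress"].str.extract(r"\b(\d{5})\b", expand=False))

# Series.where accepts a Series as the replacement, unlike the DataFrame in the question.
street_via_where = df["Street"].where(df["Street"].notna(), street)

# fillna only touches the missing entries and keeps existing values.
df["Street"] = df["Street"].fillna(street)
df["Postcode"] = df["Postcode"].fillna(postcode)

streets = df["Street"].tolist()
postcodes = df["Postcode"].tolist()
print(df)
```

Both fillna and where align on the row index, so the extraction can run on the whole column without a mask.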