Any method of Replacing NaN values with already existing values
Question
I am working on a dataset in a Jupyter notebook. It has two columns, 'State' and 'City', as shown in the image.
The State column has some missing values, so I want to fill the NaN entries based on data that already exists elsewhere in the dataset.
For example, in Image 2 the city is Chennai and the state is missing, so I want to fill in Tamil Nadu in the State column based on Image 1.
Example 2: where the city is Vishakapatnam, I want to fill in Andhra Pradesh as the state, and so on for all the NaN data.
Answer 1
Score: 1
As you have now commented, "I have one Pandas DF where some state values are missing but available for the same city at other rows in the DF", you can use groupby on the city column and pick up the state from other rows for the same city, as in the code below. Within each city group, ffill and bfill replace any NaN that comes after or before a known state value.
import pandas as pd
data = {
'city': ['Delhi', 'Mumbai', 'Bangalore', 'Hyderabad', 'Chennai', 'Delhi', 'Chennai'],
'state': ['Delhi', 'Maharashtra', 'Karnataka', 'Telangana', None, None, 'Tamil Nadu'],
}
df = pd.DataFrame(data)
df['state'] = df.groupby(['city'])['state'].transform(lambda x: x.ffill().bfill())
print(df)
gives:
city state
0 Delhi Delhi
1 Mumbai Maharashtra
2 Bangalore Karnataka
3 Hyderabad Telangana
4 Chennai Tamil Nadu
5 Delhi Delhi
6 Chennai Tamil Nadu
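A possible variant of the same idea, offered as a sketch rather than as part of the original answer: `transform("first")` takes the first non-null state in each city group, and `fillna` then fills only the missing rows. The small data frame and the name `known_state` are made up for illustration, and the approach assumes each city maps to exactly one state.

```python
import pandas as pd

# Illustrative data; Pune has no known state anywhere in the frame.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Chennai", "Delhi", "Chennai", "Pune"],
    "state": ["Delhi", "Maharashtra", None, None, "Tamil Nadu", None],
})

# "first" returns the first non-null state in each city group,
# broadcast back to every row of that group.
known_state = df.groupby("city")["state"].transform("first")

# Fill only the missing entries; existing values are left untouched.
df["state"] = df["state"].fillna(known_state)
print(df)
```

A city whose state never appears in the data, such as Pune here, still ends up as NaN, because there is nothing in its group to copy from.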
Answer 2
Score: 0
I believe you need a dictionary that maps each city to its state; then it becomes a piece of cake.
For India, this link has a JSON file of all cities and states in India.
Here is some sample code:
import numpy as np
import pandas as pd

data = {
'city': ['Delhi', 'Mumbai', 'Bangalore', 'Hyderabad', 'Chennai'],
'state': ['Delhi', 'Maharashtra', 'Karnataka', 'Telangana', 'Tamil Nadu'],
}
city_state_dict = {
'Delhi': 'Delhi',
'Mumbai': 'Maharashtra',
'Bangalore': 'Karnataka',
'Hyderabad': 'Telangana',
'Chennai': 'Tamil Nadu',
'Kolkata': 'West Bengal',
'Pune': 'Maharashtra',
}
# Extra rows to simulate NaN values in the state column
added_row = pd.Series(['Kolkata', np.nan], index=['city', 'state']).to_frame().T
added_row_2 = pd.Series(['Kolkata', 'West Bengal'], index=['city', 'state']).to_frame().T
added_row_3 = pd.Series(['Pune', np.nan], index=['city', 'state']).to_frame().T
final = (
pd.concat(
[
pd.DataFrame(data),
added_row, added_row_2, added_row_3],
axis=0,
ignore_index=True)
.assign(state=lambda df: df.city.map(city_state_dict))
)
final.head(10)
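If no external city-to-state file is at hand, the dictionary could also be derived from the rows that already have a state. The following is only a sketch under that assumption: the data frame and the name `lookup` are hypothetical, and it presumes a city never appears with two different states.

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps in the state column.
df = pd.DataFrame({
    "city": ["Delhi", "Kolkata", "Kolkata", "Chennai", "Chennai"],
    "state": ["Delhi", np.nan, "West Bengal", "Tamil Nadu", np.nan],
})

# Build a city -> state lookup from rows where the state is known.
lookup = (
    df.dropna(subset=["state"])
      .drop_duplicates(subset="city")
      .set_index("city")["state"]
      .to_dict()
)

# fillna only touches the gaps; states that are already present stay as they are.
df["state"] = df["state"].fillna(df["city"].map(lookup))
print(df)
```

Using `fillna` around `map` instead of assigning the mapped column directly means a city missing from the dictionary does not wipe out a state that was already filled in.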
If you want a quick fix that doesn't cover all cases, you can use forward fill and back fill within each city group, for instance:
final = (
pd.concat(
[
pd.DataFrame(data),
added_row,
added_row_2,
added_row_3],
axis=0,
ignore_index=True)
# .assign(state=lambda df: df.city.map(city_state_dict))
.assign(state=lambda df_: df_.groupby('city', group_keys=False).state.apply(lambda x: x.bfill().ffill()))
)
final.head(10)
With this approach, a city that has no corresponding state anywhere in your data frame stays NaN after bfill() and ffill().
Why does this happen? Because we grouped by city, and Pune has only one row in its group, so there is no earlier or later state value to impute from.
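To make the Pune case concrete, here is a minimal sketch on made-up data (the frame and the name `unresolved` are illustrative, not from the original post). It fills within each city group and then lists the cities that are still missing a state, which are the ones that would need a dictionary or another external source.

```python
import numpy as np
import pandas as pd

# Kolkata has one known state; Pune has none at all.
df = pd.DataFrame({
    "city": ["Kolkata", "Kolkata", "Pune"],
    "state": [np.nan, "West Bengal", np.nan],
})

# Fill backward, then forward, inside each city group.
df["state"] = df.groupby("city")["state"].transform(lambda s: s.bfill().ffill())

# Cities that could not be resolved from the data alone.
unresolved = sorted(df.loc[df["state"].isna(), "city"].unique())
print(unresolved)
```

Kolkata's missing row picks up West Bengal from its neighbour in the group, while Pune remains in the unresolved list.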
I hope this helps!
Answer 3
Score: 0
Example
Please provide code rather than images, and include the desired output.
import numpy as np
import pandas as pd
data1 = {'city': ['Delhi', 'Calcutta', 'Delhi', 'Chennai', 'Bokaro', 'Visakhapatnam', 'Chennai', 'Wanparti', 'Delhi', 'Mumbai'],
'state': ['Delhi', 'West Bengal', 'Delhi', 'TamilNadu', 'Jharkhand', 'AndhraPradesh', 'TamilNadu', 'AndhraPradesh', 'Delhi', 'Maharashtra']}
df1 = pd.DataFrame(data1)
data2 = {'city': ['Pune', 'Visakhapatnam', 'Indore', 'Visakhapatnam', 'Chennai', 'Chennai', 'Indore', 'Chennai', 'Delhi', 'Pune']}
df2 = pd.DataFrame(data2)
df2['state'] = np.nan
df1
city state
0 Delhi Delhi
1 Calcutta West Bengal
2 Delhi Delhi
3 Chennai TamilNadu
4 Bokaro Jharkhand
5 Visakhapatnam AndhraPradesh
6 Chennai TamilNadu
7 Wanparti AndhraPradesh
8 Delhi Delhi
9 Mumbai Maharashtra
df2
city state
0 Pune NaN
1 Visakhapatnam NaN
2 Indore NaN
3 Visakhapatnam NaN
4 Chennai NaN
5 Chennai NaN
6 Indore NaN
7 Chennai NaN
8 Delhi NaN
9 Pune NaN
Code
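# Build a city -> state dict from df1's two columns (later duplicates overwrite earlier ones)
m = dict(df1.values)
# Fill missing states by looking up each city; cities absent from m stay NaN
df2.assign(state=df2['state'].fillna(df2['city'].map(m)))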
output:
city state
0 Pune NaN
1 Visakhapatnam AndhraPradesh
2 Indore NaN
3 Visakhapatnam AndhraPradesh
4 Chennai TamilNadu
5 Chennai TamilNadu
6 Indore NaN
7 Chennai TamilNadu
8 Delhi Delhi
9 Pune NaN
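The same lookup could arguably be written as a left join, which may be easier to extend if the reference table has more than two columns. This is a sketch on a shortened version of the data above; `reference` and `result` are names introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["Delhi", "Chennai", "Visakhapatnam", "Chennai"],
    "state": ["Delhi", "TamilNadu", "AndhraPradesh", "TamilNadu"],
})
df2 = pd.DataFrame({"city": ["Pune", "Visakhapatnam", "Chennai", "Delhi"]})

# One row per city, so the join cannot multiply rows in df2.
reference = df1.drop_duplicates(subset="city")

# A left join keeps every row of df2, in its original order;
# cities not found in the reference table get NaN as their state.
result = df2.merge(reference, on="city", how="left")
print(result)
```

Dropping duplicate cities first matters here: joining against a table with repeated keys would return one output row per match.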