Any method of Replacing NaN values with already existing values

Question

I am working on a dataset in a Jupyter notebook. My dataset has two columns, 'State' and 'City', as you can see in the image.

[Image 1]

The State column has some missing values, so I want to fill the NaN entries based on data that already exists in the dataset.

[Image 2]

For example, in Image 2 the city is Chennai and the state is missing, so I want to fill in Tamil Nadu in the State column based on Image 1.

Example 2: where the city is Vishakapatnam, I want to fill in Andhra Pradesh as the state, and so on for all the NaN data.

Answer 1

Score: 1

As you have now commented, "I have one Pandas DF where some state values are missing but available for the same city at other rows in the DF", you can group by city and pick up the state from other rows for the same city, using the code below. Within each city group, ffill and bfill replace any NaN values that come after or before a known state value.

import pandas as pd

data = {
    'city': ['Delhi', 'Mumbai', 'Bangalore', 'Hyderabad', 'Chennai', 'Delhi', 'Chennai'],
    'state': ['Delhi', 'Maharashtra', 'Karnataka', 'Telangana', None, None, 'Tamil Nadu'],
}

df = pd.DataFrame(data)

# Within each city group, forward-fill and then back-fill the missing states
df['state'] = df.groupby(['city'])['state'].transform(lambda x: x.ffill().bfill())

print(df)

gives:

        city        state
0      Delhi        Delhi
1     Mumbai  Maharashtra
2  Bangalore    Karnataka
3  Hyderabad    Telangana
4    Chennai   Tamil Nadu
5      Delhi        Delhi
6    Chennai   Tamil Nadu
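
A variation on the same idea, offered as a sketch rather than as the answerer's method: `GroupBy.first` skips missing values by default, so `transform('first')` broadcasts the first known state of each city to all of that city's rows. The extra `Pune` row is an assumption added here to show that a city with no known state anywhere in the frame stays missing.

```python
import pandas as pd

# Sample data (illustrative only); Pune has no known state anywhere.
df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Chennai', 'Pune'],
    'state': ['Delhi', 'Maharashtra', None, None, 'Tamil Nadu', None],
})

# First non-missing state per city, aligned to the original rows.
known_state = df.groupby('city')['state'].transform('first')

# Only fill the gaps; existing values are left untouched.
df['state'] = df['state'].fillna(known_state)

print(df)
```

Unlike the ffill/bfill version, this does not depend on where in the group the known value sits, but it assumes each city belongs to exactly one state.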

Answer 2

Score: 0

I believe you need a dictionary that maps each city to its state; then it becomes a piece of cake.
For India, this link has a JSON file of all cities and states in India.
Here is some sample code:

import numpy as np
import pandas as pd

data = {
    'city': ['Delhi', 'Mumbai', 'Bangalore', 'Hyderabad', 'Chennai'],
    'state': ['Delhi', 'Maharashtra', 'Karnataka', 'Telangana', 'Tamil Nadu'],
}

city_state_dict = {
    'Delhi': 'Delhi',
    'Mumbai': 'Maharashtra',
    'Bangalore': 'Karnataka',
    'Hyderabad': 'Telangana',
    'Chennai': 'Tamil Nadu',
    'Kolkata': 'West Bengal',
    'Pune': 'Maharashtra',
}

# These rows simulate NaN values in the state column
added_row = pd.Series(['Kolkata', np.nan], index=['city', 'state']).to_frame().T
added_row_2 = pd.Series(['Kolkata', 'West Bengal'], index=['city', 'state']).to_frame().T
added_row_3 = pd.Series(['Pune', np.nan], index=['city', 'state']).to_frame().T
final = (
    pd.concat(
        [pd.DataFrame(data), added_row, added_row_2, added_row_3],
        axis=0,
        ignore_index=True,
    )
    # Look up every city's state in the dictionary
    .assign(state=lambda df: df.city.map(city_state_dict))
)
final.head(10)

If you want a quick fix that doesn't cover all cases, you can use forward-fill and back-fill, for instance:

final = (
    pd.concat(
        [pd.DataFrame(data), added_row, added_row_2, added_row_3],
        axis=0,
        ignore_index=True,
    )
    # .assign(state=lambda df: df.city.map(city_state_dict))
    # Fill each city's missing states from other rows of the same city
    .assign(state=lambda df_: df_.groupby('city', group_keys=False).state.apply(lambda x: x.bfill().ffill()))
)
final.head(10)

In the output, any city that has no corresponding state anywhere in your data frame is still NaN after bfill() and ffill(), even though you want it filled.

Why does this happen? Because we grouped by city, and Pune has only one row in its group, so there is no earlier or later state value to impute it from.
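
To see in advance which cities cannot be filled this way, one option (a sketch, not part of the original answer) is to count the known states per city. `count()` ignores NaN, so a count of zero marks a city whose state appears nowhere in the frame. The sample frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: Kolkata has one known state, Pune has none.
df = pd.DataFrame({
    'city': ['Delhi', 'Kolkata', 'Kolkata', 'Pune'],
    'state': ['Delhi', np.nan, 'West Bengal', np.nan],
})

# Number of non-missing state values per city.
known_counts = df.groupby('city')['state'].count()

# Cities with no known state at all; these need an external lookup.
unresolved = known_counts[known_counts == 0].index.tolist()
print(unresolved)
```

Cities in `unresolved` are the ones for which a dictionary such as `city_state_dict` is still required.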

I hope this helps!

Answer 3

Score: 0

Example

Please provide code rather than images, and include the desired output.

import numpy as np
import pandas as pd

data1 = {'city': ['Delhi', 'Calcutta', 'Delhi', 'Chennai', 'Bokaro', 'Visakhapatnam', 'Chennai', 'Wanparti', 'Delhi', 'Mumbai'], 
         'state': ['Delhi', 'West Bengal', 'Delhi', 'TamilNadu', 'Jharkhand', 'AndhraPradesh', 'TamilNadu', 'AndhraPradesh', 'Delhi', 'Maharashtra']}
df1 = pd.DataFrame(data1)

data2 = {'city': ['Pune', 'Visakhapatnam', 'Indore', 'Visakhapatnam', 'Chennai', 'Chennai', 'Indore', 'Chennai', 'Delhi', 'Pune']}
df2 = pd.DataFrame(data2)
df2['state'] = np.nan

df1

    city	        state
0	Delhi	        Delhi
1	Calcutta	    West Bengal
2	Delhi	        Delhi
3	Chennai	        TamilNadu
4	Bokaro	        Jharkhand
5	Visakhapatnam	AndhraPradesh
6	Chennai	        TamilNadu
7	Wanparti	    AndhraPradesh
8	Delhi	        Delhi
9	Mumbai	        Maharashtra

df2

    city	        state
0	Pune	        NaN
1	Visakhapatnam	NaN
2	Indore	        NaN
3	Visakhapatnam	NaN
4	Chennai	        NaN
5	Chennai	        NaN
6	Indore	        NaN
7	Chennai	        NaN
8	Delhi	        NaN
9	Pune	        NaN

Code

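# Build a city -> state lookup from df1's (city, state) rows
m = dict(df1.values)
# Fill the missing states in df2 by mapping each city through the lookup
df2.assign(state=df2['state'].fillna(df2['city'].map(m)))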

output:

    city	        state
0	Pune	        NaN
1	Visakhapatnam	AndhraPradesh
2	Indore	        NaN
3	Visakhapatnam	AndhraPradesh
4	Chennai	        TamilNadu
5	Chennai	        TamilNadu
6	Indore	        NaN
7	Chennai	        TamilNadu
8	Delhi	        Delhi
9	Pune	        NaN
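
`dict(df1.values)` relies on `df1` having exactly two columns, in the order city then state, and on each city mapping to a single state. A more explicit way to build the lookup, sketched here with made-up sample data, is to index the states by city after first checking for cities recorded under more than one state.

```python
import numpy as np
import pandas as pd

# Illustrative reference data and a frame with gaps to fill.
df1 = pd.DataFrame({
    'city': ['Delhi', 'Chennai', 'Delhi', 'Visakhapatnam'],
    'state': ['Delhi', 'TamilNadu', 'Delhi', 'AndhraPradesh'],
})
df2 = pd.DataFrame({
    'city': ['Pune', 'Chennai', 'Delhi'],
    'state': [np.nan, np.nan, 'Delhi'],
})

# Cities listed under more than one distinct state would make the lookup ambiguous.
states_per_city = df1.groupby('city')['state'].nunique()
ambiguous = states_per_city[states_per_city > 1].index.tolist()

# Keep one row per city; a Series indexed by city then acts as the lookup table.
lookup = df1.drop_duplicates('city').set_index('city')['state']

# Fill only the missing states; cities absent from the lookup stay NaN.
result = df2.assign(state=df2['state'].fillna(df2['city'].map(lookup)))
print(result)
```

If `ambiguous` is not empty, the data itself is inconsistent and should be resolved before filling.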

huangapple
  • Posted on June 22, 2023, 05:52:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76527384.html