Replace values in a Pandas column to be the same for all unique values in another column

Question

What I have is a large Pandas dataframe in Python that looks something like this:

    import pandas as pd

    df_test = pd.DataFrame(data=None, columns=['file', 'source'])
    df_test.file = ['file_1', 'file_1', 'file_2', 'file_2', 'file_3', 'file_3']
    df_test.source = ['nasa', 'unknown', 'esa', 'unknown', 'jaxa', 'unknown']

What I want to get from this is that all values in the 'file' column with the same name should have the same value in the 'source' column, and not be 'unknown'. It should then look like this:

    ['nasa', 'nasa', 'esa', 'esa', 'jaxa', 'jaxa']

I can easily find which entries should be replaced with:

    df_test.loc[df_test.source == 'unknown']

But I'm not sure how to replace them from this, any help is appreciated!

Answer 1

Score: 3

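You can convert the unknown values to missing values with Series.mask, then use GroupBy.transform with 'first' to broadcast each group's first non-missing value into a new column:

Note: this solution also works when unknown is the first value in a group.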

    df_test['new'] = (df_test['source'].mask(df_test.source == 'unknown')
                                       .groupby(df_test['file'])
                                       .transform('first'))
    print(df_test)
         file   source   new
    0  file_1     nasa  nasa
    1  file_1  unknown  nasa
    2  file_2      esa   esa
    3  file_2  unknown   esa
    4  file_3     jaxa  jaxa
    5  file_3  unknown  jaxa
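As a minimal sketch of the edge cases, the example below uses made-up data (not from the question) in which unknown comes first in one file and another file has no known source at all. The fallback with fillna is an assumption about what you might want in that second case, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: 'unknown' is listed first for file_1,
# and file_3 has no known source at all.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2", "file_3"],
    "source": ["unknown", "nasa", "esa", "unknown", "unknown"],
})

# Turn the placeholder into a missing value so that 'first' skips it.
known = df["source"].mask(df["source"] == "unknown")

# 'first' picks the first non-missing value in each group.
filled = known.groupby(df["file"]).transform("first")

# Groups without any known source stay missing; fall back to the original value.
df["new"] = filled.fillna(df["source"])

print(df["new"].tolist())
# ['nasa', 'nasa', 'esa', 'esa', 'unknown']
```

Without the final fillna, the row for file_3 would hold a missing value instead of 'unknown'.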

Alternatively, if each file has only one non-unknown value, build a helper Series with boolean indexing and DataFrame.set_index, then apply Series.map:

    s = df_test[df_test.source != 'unknown'].set_index('file')['source']

    df_test['new'] = df_test['file'].map(s)
    print(df_test)
         file   source   new
    0  file_1     nasa  nasa
    1  file_1  unknown  nasa
    2  file_2      esa   esa
    3  file_2  unknown   esa
    4  file_3     jaxa  jaxa
    5  file_3  unknown  jaxa
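The map approach relies on the helper Series having a unique index, which is why it needs exactly one known value per file. If a file can carry several known rows, one possible workaround (an assumption, not part of the original answer) is to keep only the first known row per file with drop_duplicates before building the lookup. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical data: file_1 has two known rows, so the lookup
# would have a duplicated index label without deduplication.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_1", "file_2", "file_2"],
    "source": ["nasa", "nasa", "unknown", "unknown", "esa"],
})

# Keep one known source per file so the lookup index is unique.
lookup = (df[df["source"] != "unknown"]
          .drop_duplicates("file")
          .set_index("file")["source"])

df["new"] = df["file"].map(lookup)

print(df["new"].tolist())
# ['nasa', 'nasa', 'nasa', 'esa', 'esa']
```

This silently takes the first known source when a file has conflicting ones, so it is worth checking for conflicts separately if that can happen in your data.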
Answer 2

Score: 1

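You can do this by grouping by file and setting every value to the first instance of source in each group:

    df_test['source'] = df_test.groupby('file')['source'].transform('first')

Output:

         file source
    0  file_1   nasa
    1  file_1   nasa
    2  file_2    esa
    3  file_2    esa
    4  file_3   jaxa
    5  file_3   jaxa

Note: this assumes a valid source, not unknown, is always listed first for each file. If that's not the case, the other answer is better, because it removes the unknown values first.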
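To make the ordering assumption concrete, here is a small sketch on made-up data where unknown is listed first for one file. It compares the plain transform('first') with a variant that hides the unknown values first (using Series.where, which is the inverse of Series.mask) and then overwrites source in place.

```python
import pandas as pd

# Hypothetical data: 'unknown' is listed first for file_1.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2"],
    "source": ["unknown", "nasa", "esa", "unknown"],
})

# Plain 'first' just takes whatever value comes first in each group.
plain = df.groupby("file")["source"].transform("first")

# Keeping only the known values makes 'first' skip the placeholder.
masked = (df["source"].where(df["source"] != "unknown")
          .groupby(df["file"])
          .transform("first"))

print(plain.tolist())
# ['unknown', 'unknown', 'esa', 'esa']
print(masked.tolist())
# ['nasa', 'nasa', 'esa', 'esa']

# Overwrite the original column once the result looks right.
df["source"] = masked
```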

huangapple
  • Posted on June 19, 2023, 17:19:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76505257.html