Replace values in a Pandas column to be the same for all unique values in another column
Question
What I have is a large Pandas dataframe in Python that looks something like this:
import pandas as pd
df_test = pd.DataFrame(data=None, columns=['file', 'source'])
df_test.file = ['file_1', 'file_1', 'file_2', 'file_2', 'file_3', 'file_3']
df_test.source = ['nasa', 'unknown', 'esa', 'unknown', 'jaxa', 'unknown']
What I want to get from this is that all values in the 'file' column with the same name should have the same value in the 'source' column, and not be 'unknown'. It should then look like this:
['nasa', 'nasa', 'esa', 'esa', 'jaxa', 'jaxa']
I can easily find which entries should be replaced with:
df_test.loc[df_test.source == 'unknown']
But I'm not sure how to replace them from this, any help is appreciated!
Answer 1
Score: 3
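You can use Series.mask to convert the unknown values to missing values, then use GroupBy.transform with 'first' to fill a new column with the first non-missing value in each group.
Note: this solution also works if unknown is the first value in some group.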
df_test['new'] = (df_test['source'].mask(df_test.source == 'unknown')
                                   .groupby(df_test['file'])
                                   .transform('first'))
print(df_test)
     file   source   new
0  file_1     nasa  nasa
1  file_1  unknown  nasa
2  file_2      esa   esa
3  file_2  unknown   esa
4  file_3     jaxa  jaxa
5  file_3  unknown  jaxa
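To write the result back into source rather than into a new column, the same idea can be sketched as below. The sample data is hypothetical (it is not from the question): unknown comes first in file_2, and file_4 has no known source at all. Falling back to 'unknown' for such groups is an assumption about what is wanted, not something the question specifies.

```python
import pandas as pd

# Hypothetical data: 'unknown' is listed first in file_2,
# and file_4 has no known source at all.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2", "file_4"],
    "source": ["nasa", "unknown", "unknown", "esa", "unknown"],
})

# Turn 'unknown' into a missing value so that 'first' skips it.
known = df["source"].mask(df["source"] == "unknown")

# First non-missing source per file, broadcast back to every row.
filled = known.groupby(df["file"]).transform("first")

# Groups without any known source come back as missing;
# assumption: keep the 'unknown' label for them.
df["source"] = filled.fillna("unknown")

print(df["source"].tolist())
```

The order of rows inside a group does not matter here, because 'first' picks the first non-missing value, not the first row.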
Alternatively, if each file has only one non-unknown value, build a helper Series with boolean indexing and DataFrame.set_index, then apply it with Series.map:
s = df_test[df_test.source != 'unknown'].set_index('file')['source']
df_test['new'] = df_test['file'].map(s)
print(df_test)
     file   source   new
0  file_1     nasa  nasa
1  file_1  unknown  nasa
2  file_2      esa   esa
3  file_2  unknown   esa
4  file_3     jaxa  jaxa
5  file_3  unknown  jaxa
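The map approach relies on the helper Series having a unique index. If a file can list a known source more than once, one possible guard is to drop the duplicates before building the lookup. This is a minimal sketch on hypothetical data; the name lookup is illustrative, and keeping the first known source per file is an assumption.

```python
import pandas as pd

# Hypothetical data: file_1 lists its known source twice.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_1", "file_2", "file_2"],
    "source": ["nasa", "nasa", "unknown", "unknown", "esa"],
})

# Keep only the known rows, then one row per file,
# so that the lookup index is unique.
lookup = (
    df[df["source"] != "unknown"]
    .drop_duplicates("file")
    .set_index("file")["source"]
)

# Look up each row's file in the helper Series.
df["new"] = df["file"].map(lookup)

print(df["new"].tolist())
```

A file that has no known source at all would not appear in lookup, so its rows would come back as missing values.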
Answer 2
Score: 1
You can do this by grouping by file and setting every row to the first source in its group:
df_test['source'] = df_test.groupby('file')['source'].transform('first')
Output:
     file source
0  file_1   nasa
1  file_1   nasa
2  file_2    esa
3  file_2    esa
4  file_3   jaxa
5  file_3   jaxa
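Note: this assumes that a valid source, not unknown, is always listed first for each file. If that is not the case, the other answer is better because it masks the unknown values first.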
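The caveat in the note can be seen on a small made-up frame where unknown is listed first for file_1. The sketch below compares a plain 'first' with the masked variant; the variable names plain and masked are illustrative only.

```python
import pandas as pd

# Hypothetical data: 'unknown' is listed first for file_1.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2"],
    "source": ["unknown", "nasa", "esa", "unknown"],
})

# Plain 'first' takes the first value in each group,
# even when that value is 'unknown'.
plain = df.groupby("file")["source"].transform("first")

# Masking 'unknown' first makes 'first' skip it.
masked = (
    df["source"].mask(df["source"] == "unknown")
    .groupby(df["file"])
    .transform("first")
)

print(plain.tolist())   # file_1 rows end up as 'unknown'
print(masked.tolist())  # file_1 rows end up as 'nasa'
```

If the row order cannot be guaranteed, the masked variant looks like the safer choice.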