2023-06-19 17:19:42

Replace values in a Pandas column to be the same for all unique values in another column

Question

What I have is a large Pandas dataframe in Python that looks something like this:

    import pandas as pd

    df_test = pd.DataFrame(data=None, columns=['file', 'source'])
    df_test.file = ['file_1', 'file_1', 'file_2', 'file_2', 'file_3', 'file_3']
    df_test.source = ['nasa', 'unknown', 'esa', 'unknown', 'jaxa', 'unknown']

I want all rows that share the same value in the 'file' column to have the same value in the 'source' column, and that value should not be 'unknown'. The 'source' column should then look like this:

    ['nasa', 'nasa', 'esa', 'esa', 'jaxa', 'jaxa']

I can easily find which entries should be replaced with:

    df_test.loc[df_test.source == 'unknown']

But I'm not sure how to replace them from here. Any help is appreciated!

Answer 1

Score: 3

You can convert the 'unknown' values to missing values with Series.mask, then use GroupBy.transform with 'first', which returns the first non-missing value of each group, to fill a new column. This also works when 'unknown' comes first in a group:

    df_test['new'] = (df_test['source'].mask(df_test.source == 'unknown')
                                       .groupby(df_test['file'])
                                       .transform('first'))
    print(df_test)
         file   source   new
    0  file_1     nasa  nasa
    1  file_1  unknown  nasa
    2  file_2      esa   esa
    3  file_2  unknown   esa
    4  file_3     jaxa  jaxa
    5  file_3  unknown  jaxa
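
Since the question asks to replace the values rather than add a column, here is a minimal sketch of writing the result back into 'source'. The sample data and the names `known` and `filled` are made up for illustration, and the fallback to the original value for files that contain only 'unknown' is an assumption about the desired behaviour, not something stated in the question.

```python
import pandas as pd

# Illustrative data: file_2 has no known source at all.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2"],
    "source": ["nasa", "unknown", "unknown", "unknown"],
})

# Turn 'unknown' into missing values so that 'first' skips them.
known = df["source"].mask(df["source"] == "unknown")

# Broadcast the first non-missing source of each file to all of its rows.
filled = known.groupby(df["file"]).transform("first")

# Overwrite 'source'; groups with no known source keep their original value.
df["source"] = filled.fillna(df["source"])

print(df["source"].tolist())  # ['nasa', 'nasa', 'unknown', 'unknown']
```

Without the final fillna, the rows of file_2 would end up as missing values instead of 'unknown'.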

Alternatively, if each file has exactly one non-'unknown' value, build a helper Series with boolean indexing and DataFrame.set_index, then apply Series.map:

    s = df_test[df_test.source != 'unknown'].set_index('file')['source']

    df_test['new'] = df_test['file'].map(s)
    print(df_test)
         file   source   new
    0  file_1     nasa  nasa
    1  file_1  unknown  nasa
    2  file_2      esa   esa
    3  file_2  unknown   esa
    4  file_3     jaxa  jaxa
    5  file_3  unknown  jaxa
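
The map approach relies on the helper Series having one entry per file. If a file can list its known source more than once, one possible workaround (a sketch under that assumption, with made-up data and a hypothetical `lookup` name) is to drop the repeated files before calling set_index, so the lookup index stays unique:

```python
import pandas as pd

# Illustrative data: file_1 repeats its known source.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_1", "file_2", "file_2"],
    "source": ["nasa", "nasa", "unknown", "unknown", "esa"],
})

# Keep the first known source per file so the lookup index is unique.
lookup = (
    df.loc[df["source"] != "unknown"]
      .drop_duplicates("file")
      .set_index("file")["source"]
)

# Map every row's file name to its known source.
df["new"] = df["file"].map(lookup)

print(df["new"].tolist())  # ['nasa', 'nasa', 'nasa', 'esa', 'esa']
```

Files that never have a known source are absent from `lookup`, so their rows would map to missing values.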

Answer 2

Score: 1

You can do this by grouping by file, and setting all to the first instance of source:

    df_test['source'] = df_test.groupby('file')['source'].transform('first')

Output:

         file source
    0  file_1   nasa
    1  file_1   nasa
    2  file_2    esa
    3  file_2    esa
    4  file_3   jaxa
    5  file_3   jaxa

Note: this assumes the first row of each file always has a valid source rather than 'unknown'. If that's not the case, the other answer is better, because it masks the 'unknown' values before taking the first one.
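
To make that caveat concrete, here is a small comparison on made-up data where 'unknown' is listed first for file_1. The names `plain` and `masked` are only for illustration.

```python
import pandas as pd

# Illustrative data: 'unknown' comes first in the file_1 group.
df = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2"],
    "source": ["unknown", "nasa", "esa", "unknown"],
})

# Plain 'first' treats 'unknown' as an ordinary value and picks it up.
plain = df.groupby("file")["source"].transform("first").tolist()

# Masking 'unknown' first makes 'first' skip it as a missing value.
masked = (
    df["source"].mask(df["source"] == "unknown")
      .groupby(df["file"])
      .transform("first")
      .tolist()
)

print(plain)   # ['unknown', 'unknown', 'esa', 'esa']
print(masked)  # ['nasa', 'nasa', 'esa', 'esa']
```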
