January 7, 2020, 00:04:31

Flag repeating entries in pandas time series

Question

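The goal is to flag the rows where a participant, in a given year, went to a holiday destination they had not visited the year before. One possible approach to the problem described below:

# Convert the 'year' column to datetime
df['year'] = pd.to_datetime(df['year'], format='%Y')

# Group the data frame by 'id'
grouped = df.groupby('id')

# Add a 'new' column: True where the destination differs from the previous row of the same id
df['new'] = grouped['vacation'].transform(lambda x: x.ne(x.shift()))

# Print the result
print(df)

This compares each row with the previous row of the same id, so it assumes the rows are sorted by year and that each id has a single destination in each consecutive year.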

I have a data frame that takes this form (but is several millions of rows long):

import pandas as pd

# 'data' is used instead of 'dict' to avoid shadowing the built-in
data = {'id': ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
    'vacation': ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
    'new': [1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

A  2000  France
A  2001  Morocco
A  2002  Morocco
B  2000  Germany
B  2001  Germany
B  2003  Germany
C  1999  Japan
C  2000  Australia
C  2001  Japan
D  2000  Canada
D  2000  Mexico
D  2001  China

For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows where a participant goes on holiday to a destination they had not visited the year before (i.e., the destination is new). In the case above, the output would be:

id  year  vacation   new
A   2000  France     1
A   2001  Morocco    1
A   2002  Morocco    0
B   2001  Germany    1
B   2002  Germany    0
B   2003  Germany    0
C   1999  Japan      1
C   1999  Australia  1
C   2000  Japan      1
D   2000  Canada     1
D   2000  Mexico     1
D   2001  China      1

For A, B, C, and D, the first holiday destination in our data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany 3 times in a row, the 2nd and 3rd occurrences are not flagged. When person C goes to Japan twice, all of the occurrences are flagged, because they did not go to Japan two years in a row. D goes to 3 different destinations (albeit to 2 destinations in 2000) and all of them are flagged.

I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.

I'd appreciate any input; thanks.

Answer 1

Score: 4

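IIUC,

What we are doing is grouping by id and vacation and checking that each row differs from the row above it within the group, which in effect selects the first instance of that combination.

Hopefully that's clear. Let me know if you need any more help.

df["new_2"] = (
    # group_keys=False keeps the original row index so the result aligns with df
    df.groupby(["id", "vacation"], group_keys=False)[["id", "year"]]
    # within each group, compare every row with the row above it
    .apply(lambda x: x.ne(x.shift()))
    # id only differs from the (missing) row above on the first row of a group
    .all(axis=1)
    # convert True/False to 1/0
    .astype(int)
)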

print(df)
  id  year   vacation  new_2
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
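
The snippet above flags the first occurrence of each id/vacation pair, so a destination revisited after a gap (such as C going to Japan in 1999 and again in 2001 in the question's data) would be marked 0. Below is a minimal sketch of the stricter "not visited the year before" rule. It assumes `year` is stored as integers (the question stores it as strings) and uses a left merge against a copy of the visits shifted forward by one year. The names `prev` and `merged` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question, with year as integers (an assumption).
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "year": [2000, 2001, 2002, 2000, 2001, 2003, 1999, 2000, 2001, 2000, 2000, 2001],
    "vacation": ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany",
                 "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
})

# Every distinct visit, shifted forward one year: a row (id, year, vacation)
# in `prev` means "this person went to this destination in year - 1".
prev = df[["id", "year", "vacation"]].drop_duplicates()
prev["year"] = prev["year"] + 1

# Left merge with an indicator column; rows without a match in `prev`
# were not visited the year before, so they count as new.
merged = df.merge(prev, on=["id", "year", "vacation"], how="left", indicator=True)
df["new"] = (merged["_merge"] == "left_only").astype(int).to_numpy()

print(df)
```

On the question's data this also flags B's 2003 trip to Germany as new, because the data has no 2002 entry for B; the first-occurrence approach would mark that row 0.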

Answer 2

Score: 2

Here's one solution I came up with, using groupby and transform:

df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
    df.groupby(["id", "vacation"])["year"]
    .transform("first")  # earliest year of each id/vacation group after sorting
    .eq(df["year"])
    .astype(int)
)
df = df.sort_index()  # restore the original row order

You'll get

  id  year   vacation  new
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
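
Since sorting puts the earliest year first in each group, the comparison above amounts to checking whether a row's year equals the earliest year for that id/vacation pair. Here is a sketch of that equivalent check without sorting, assuming integer years and using the sample rows from the printed output; `first_year` is an illustrative name.

```python
import pandas as pd

# Sample rows taken from the printed output above (year as integers, an assumption).
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 1999, 2000, 2001],
    "vacation": ["France", "USA", "France", "Germany", "Germany", "Germany",
                 "Japan", "Australia", "France"],
})

# Earliest year per (id, vacation), broadcast back to every row of the group.
first_year = df.groupby(["id", "vacation"])["year"].transform("min")

# A row is flagged when its year is that earliest year.
df["new"] = df["year"].eq(first_year).astype(int)

print(df)
```

Like the answer it mirrors, this marks only the first year of each id/vacation pair, so it does not treat a return visit after a gap as new.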

Answer 3

Score: 2

Here is a way using groupby+cumcount and series.mask:

df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)

  id  year   vacation  new
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
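
Because cumcount numbers the rows within each id/vacation group, the expression above is 1 only for the first row of each group. One possible shortcut is `DataFrame.duplicated`, which marks every repeat of an id/vacation pair. This is a sketch on the sample rows from the printed output, not part of the original answer.

```python
import pandas as pd

# Sample rows taken from the printed output above.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 1999, 2000, 2001],
    "vacation": ["France", "USA", "France", "Germany", "Germany", "Germany",
                 "Japan", "Australia", "France"],
})

# duplicated() is False for the first row of each (id, vacation) pair and
# True for later repeats; negating it gives the "first occurrence" flag.
df["new"] = (~df.duplicated(subset=["id", "vacation"])).astype(int)

print(df)
```

Both forms depend on row order, so the frame should be sorted by year within each id before flagging.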
