Calculate the rarity of a column in a pandas dataframe

Question

I have a pandas dataframe with a country column and a datetime column. For each row, over a rolling 30-day period, I would like to calculate how rare that row's country is, i.e. what proportion of the rows in the last 30 days show that country.

To make this more complex, I also want to include the rarer countries in the proportion calculated for each row. I.e. for each row, sum the proportions of all countries that are as rare as, or rarer than, the row's country over the past 30 days. So if the country in a row is found 10% of the time in the past 30 days, and two other countries are found 5% and 2% of the time, then the rarity score would be 0.17.

For example:

import pandas as pd

country_data = {
    'country': ['USA', 'USA', 'USA', 'Canada', 'UK', 'Canada', 'USA', 'Canada', 'Canada', 'UK'],
    'datetime': [
        '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
        '2023-01-06', '2023-01-10', '2023-01-11', '2023-01-12', '2023-02-07'
    ],
}

country_df = pd.DataFrame(country_data)

country  datetime
USA      2023-01-01
USA      2023-01-02
USA      2023-01-03
Canada   2023-01-04
UK       2023-01-05
Canada   2023-01-06
USA      2023-01-10
Canada   2023-01-11
Canada   2023-01-12
UK       2023-02-07

The output I would want would then be:

country	datetime    rarity_score
USA	    2023-01-01  1.0 
USA	    2023-01-02  1.0
USA	    2023-01-03  1.0
Canada	2023-01-04  0.25
UK	    2023-01-05  0.4
Canada	2023-01-06  0.5
USA	    2023-01-10  1.0
Canada	2023-01-11  0.5
Canada	2023-01-12  1.0
UK	    2023-02-07  0.5

Explanations of each row's rarity score:

  1. USA is seen once, which is 100% of the rows in the past 30 days
  2. USA is seen twice, still 100% of the rows in the past 30 days
  3. USA is seen three times, still 100% of the rows in the past 30 days
  4. Canada is seen once, which is 25% of the rows in the past 30 days (1 of 4)
  5. UK is seen once, which is 20% of the rows in the past 30 days. However, Canada is also seen once, so we include it in the rarity score to make it 40%
  6. Canada is seen twice, which is about 33% of the rows in the past 30 days (2 of 6). However, UK is seen once, which is rarer, so we include it in the rarity score to make it 50%
  7. USA is seen 4 times, and UK and Canada are both seen less often than this, so we include everything to make it 100%
  8. Canada is seen 3 times and the UK once, which is less than Canada, so we include it, making the rarity score 50% (4 of 8)
  9. Canada is seen 4 times, as is the USA, and the UK once, so we include everything, making the score 100%
  10. Only 4 rows, including this one, fall within the 30-day period. The UK is seen once, as is the USA, so this makes 50%
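
The rules above can be written down directly as a slow, row-by-row reference implementation, which is handy for checking faster solutions. This is only a sketch: the function name `rarity_scores` is made up for illustration, and it assumes the window for a row at time t is (t - 30 days, t], with the current row included, which is the reading that reproduces the expected output.

```python
import pandas as pd


def rarity_scores(df, window="30D"):
    """Brute-force rarity score, one pass per row (hypothetical helper)."""
    times = pd.to_datetime(df["datetime"])
    span = pd.Timedelta(window)
    scores = []
    for t, country in zip(times, df["country"]):
        # rows inside the window (t - 30 days, t], current row included
        in_window = (times > t - span) & (times <= t)
        counts = df.loc[in_window, "country"].value_counts()
        # share of rows whose country is as rare as, or rarer than, this one
        scores.append(counts[counts <= counts[country]].sum() / counts.sum())
    return pd.Series(scores, index=df.index, name="rarity_score")


country_df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Canada", "UK",
                "Canada", "USA", "Canada", "Canada", "UK"],
    "datetime": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
                 "2023-01-06", "2023-01-10", "2023-01-11", "2023-01-12", "2023-02-07"],
})
print(rarity_scores(country_df).round(2).tolist())
# [1.0, 1.0, 1.0, 0.25, 0.4, 0.5, 1.0, 0.5, 1.0, 0.5]
```

Each row triggers its own scan of the frame, so the cost grows quadratically with the number of rows. That is fine for a check on small data, less so for a large frame.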

Answer 1

Score: 1

Assuming unique dates, you could compute a crosstab, then a rolling sum over 30 days, then use an indexing lookup to retrieve the count for each row's own country, and use that to select the countries with rarer (or equal) counts. Finally, compute the score by dividing by the total count in each row's window:

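import numpy as np
import pandas as pd

country_df['datetime'] = pd.to_datetime(country_df['datetime'])
idx, cols = pd.factorize(country_df['country'])
# rolling 30-day count of each country, one row per date
tmp = pd.crosstab(country_df['datetime'], country_df['country']).rolling('30D').sum()
# count of each row's own country within its window
count = tmp.reindex(index=country_df['datetime'], columns=cols).to_numpy()[np.arange(len(tmp)), idx]
# keep the counts <= the row's own count, sum them, and divide by the window total
country_df['rarity_score'] = country_df['datetime'].map(tmp.where(tmp.le(count, axis=0)).sum(axis=1).div(tmp.sum(axis=1)))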

Output:

  country   datetime  rarity_score
0     USA 2023-01-01          1.00
1     USA 2023-01-02          1.00
2     USA 2023-01-03          1.00
3  Canada 2023-01-04          0.25
4      UK 2023-01-05          0.40
5  Canada 2023-01-06          0.50
6     USA 2023-01-10          1.00
7  Canada 2023-01-11          0.50
8  Canada 2023-01-12          1.00
9      UK 2023-02-07          0.50

Intermediates:

# pd.crosstab(country_df['datetime'], country_df['country'])
country     Canada  UK  USA
datetime                   
2023-01-01       0   0    1
2023-01-02       0   0    1
2023-01-03       0   0    1
2023-01-04       1   0    0
2023-01-05       0   1    0
2023-01-06       1   0    0
2023-01-10       0   0    1
2023-01-11       1   0    0
2023-01-12       1   0    0
2023-02-07       0   1    0

# tmp
country     Canada   UK  USA
datetime                    
2023-01-01     0.0  0.0  1.0
2023-01-02     0.0  0.0  2.0
2023-01-03     0.0  0.0  3.0
2023-01-04     1.0  0.0  3.0
2023-01-05     1.0  1.0  3.0
2023-01-06     2.0  1.0  3.0
2023-01-10     2.0  1.0  4.0
2023-01-11     3.0  1.0  4.0
2023-01-12     4.0  1.0  4.0
2023-02-07     2.0  1.0  1.0

# tmp.where(tmp.le(count, axis=0))
country     Canada   UK  USA
datetime                    
2023-01-01     0.0  0.0  1.0
2023-01-02     0.0  0.0  2.0
2023-01-03     0.0  0.0  3.0
2023-01-04     1.0  0.0  NaN
2023-01-05     1.0  1.0  NaN
2023-01-06     2.0  1.0  NaN
2023-01-10     2.0  1.0  4.0
2023-01-11     3.0  1.0  NaN
2023-01-12     4.0  1.0  4.0
2023-02-07     NaN  1.0  1.0
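
The indexing lookup in this answer (`pd.factorize` followed by NumPy integer indexing) may be easier to follow in isolation. The sketch below uses a small made-up `counts` table rather than the real rolling sums; the names `countries`, `counts` and `own_count` are illustrative only.

```python
import numpy as np
import pandas as pd

countries = pd.Series(["USA", "Canada", "UK", "Canada"])
# made-up rolling counts, one row per entry in `countries`
counts = pd.DataFrame({
    "Canada": [0.0, 1.0, 1.0, 2.0],
    "UK": [0.0, 0.0, 1.0, 1.0],
    "USA": [1.0, 1.0, 1.0, 1.0],
})

# idx[i] is the position of countries[i] in cols (order of first appearance)
idx, cols = pd.factorize(countries)
# reorder the columns to match cols, then pick exactly one cell per row
own_count = counts.reindex(columns=cols).to_numpy()[np.arange(len(counts)), idx]
print(own_count)  # [1. 1. 1. 2.]
```

Pairing `np.arange(len(counts))` with `idx` selects the cell (row i, column of row i's country), so the result holds one count per row.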

Answer 2

Score: 0

I think it is better to create a time series from your original dataframe:

data = pd.DataFrame({'country': ['USA', 'USA', 'USA', 'Canada', 'UK', 'Canada', 'USA', 'Canada', 'Canada', 'UK']}, index = country_data["datetime"])
data.index = pd.to_datetime(data.index)
data = data["country"]

Afterwards (this may not be the most efficient way to address your issue, but assuming you don't have a large df it will suffice):

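from pandas.tseries.offsets import Day

vals = {}
# assume the index is already sorted and the dates are unique
for i in data.index:
    last_30 = i - 30 * Day()  # start of the slice covering the last 30 days
    d = data[last_30:i].value_counts()  # count of each country over the last 30 days
    # sum the counts of every country seen as often as, or less often than, this row's country;
    # comparing counts (rather than slicing the sorted index) keeps tied countries in the sum
    vals[i] = d[d <= d[data.loc[i]]].sum() / d.sum()
pd.Series(vals)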
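
One detail to be aware of when comparing this loop with a `rolling('30D')` approach: the two do not treat the window's left edge the same way. The sketch below uses a made-up three-row series `s` to show the difference; it is an illustration, not part of either answer.

```python
import pandas as pd

s = pd.Series(1, index=pd.to_datetime(["2023-01-01", "2023-01-16", "2023-01-31"]))
end = s.index[-1]

# label slicing keeps both endpoints, so the row exactly 30 days back is counted
by_slice = s[end - pd.Timedelta(days=30):end].sum()
# rolling('30D') defaults to closed='right', i.e. the window (end - 30 days, end]
by_rolling = s.rolling("30D").sum().iloc[-1]
# closed='both' makes the rolling window include the left edge as well
by_rolling_both = s.rolling("30D", closed="both").sum().iloc[-1]

print(by_slice, by_rolling, by_rolling_both)  # 3 2.0 3.0
```

For the sample data in the question no row falls exactly 30 days before another, so both conventions give the same scores there.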

huangapple
  • Posted on July 20, 2023, 19:31:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76729407.html