

Calculate the rarity of a column in a pandas dataframe





I have a pandas DataFrame with a country column and a datetime column. For each row, I would like to calculate how rare that row's country is over a rolling 30-day window, i.e. what proportion of the rows in the last 30 days show that country.

To make this more complex, I also want to include the rarer countries in the proportion calculated for each row. That is, for each row, sum the proportions of all countries that are as rare as or rarer than the row's country over the last 30 days. So if the country in a row is found 10% of the time in the past 30 days, and two other countries are found 5% and 2% of the time, the rarity score would be 0.17.
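To make the rule concrete, here is a minimal sketch of the score for a single window, assuming the per-country counts for that window are already known. The function name `rarity_from_counts` and the country labels `A` to `D` are made up for illustration; the counts are chosen so that the shares match the 10% / 5% / 2% example.

```python
import pandas as pd

def rarity_from_counts(counts: pd.Series, country: str) -> float:
    """Summed share of every country seen no more often than `country`."""
    shares = counts / counts.sum()
    # Keep the row's own country plus every country with an equal or lower count.
    return float(shares[counts <= counts[country]].sum())

# Hypothetical window: 100 rows, so the shares are 83%, 10%, 5% and 2%.
window_counts = pd.Series({"A": 83, "B": 10, "C": 5, "D": 2})
score = rarity_from_counts(window_counts, "B")
print(round(score, 2))  # 0.17
```

Ties count as "as rare", which is why the comparison uses `<=` rather than `<`.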


import pandas as pd

country_data = {
    'country': ['USA', 'USA', 'USA', 'Canada', 'UK', 'Canada', 'USA', 'Canada', 'Canada', 'UK'],
    'datetime': [
        '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
        '2023-01-06', '2023-01-10', '2023-01-11', '2023-01-12', '2023-02-07',
    ],
}

country_df = pd.DataFrame(country_data)

country	datetime
USA	    2023-01-01
USA	    2023-01-02
USA	    2023-01-03
Canada	2023-01-04
UK	    2023-01-05
Canada	2023-01-06
USA	    2023-01-10
Canada	2023-01-11
Canada	2023-01-12
UK	    2023-02-07

The output I would want would then be:

country	datetime    rarity_score
USA	    2023-01-01  1.0 
USA	    2023-01-02  1.0
USA	    2023-01-03  1.0
Canada	2023-01-04  0.25
UK	    2023-01-05  0.4
Canada	2023-01-06  0.5
USA	    2023-01-10  1.0
Canada	2023-01-11  0.5
Canada	2023-01-12  1.0
UK	    2023-02-07  0.5

Explanations of each row's rarity score:

  1. USA is seen once, which is 100% of the rows in the past 30 days.
  2. USA is seen twice, which is 100% of the rows in the past 30 days.
  3. USA is seen three times, which is 100% of the rows in the past 30 days.
  4. Canada is seen once, which is 25% of the rows in the past 30 days.
  5. UK is seen once, which is 20% of the rows in the past 30 days. However, Canada is also seen once, so we include it in the rarity score to make it 40%.
  6. Canada is seen twice, which is about 33% of the rows in the past 30 days (2 of 6). However, UK is seen once, so we include it in the rarity score to make it 50%.
  7. USA is seen 4 times, and UK and Canada are both seen fewer times than that, so we include everything to make it 100%.
  8. Canada is seen 3 times and UK once, which is less than Canada, so we include it, making the rarity score 50% (4 of 8).
  9. Canada is seen 4 times, as is USA, and UK once, so we include everything, making the score 100%.
  10. Only the last 4 rows (including this one) fall within the 30-day window. UK is seen once, as is USA, so together they make 50%.
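The walkthrough above can be checked with a brute-force sketch. This is not an efficient solution, only a reference under a few assumptions: the window is taken as the half-open interval (t - 30 days, t], it includes the current row, and the helper name `rarity_scores` is invented here.

```python
import pandas as pd

def rarity_scores(df: pd.DataFrame, days: int = 30) -> pd.Series:
    """Brute-force rarity score per row; assumes `datetime` is already datetime64."""
    scores = []
    for _, row in df.iterrows():
        start = row["datetime"] - pd.Timedelta(days=days)
        # Rows in the window (start, current]; this includes the current row.
        in_window = (df["datetime"] > start) & (df["datetime"] <= row["datetime"])
        counts = df.loc[in_window, "country"].value_counts()
        # Countries seen no more often than this row's country (ties included).
        rare_or_equal = counts[counts <= counts[row["country"]]]
        scores.append(rare_or_equal.sum() / counts.sum())
    return pd.Series(scores, index=df.index, name="rarity_score")

example = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Canada", "UK", "Canada", "USA", "Canada", "Canada", "UK"],
    "datetime": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        "2023-01-06", "2023-01-10", "2023-01-11", "2023-01-12", "2023-02-07",
    ]),
})
result = rarity_scores(example).round(2).tolist()
print(result)  # [1.0, 1.0, 1.0, 0.25, 0.4, 0.5, 1.0, 0.5, 1.0, 0.5]
```

The loop is quadratic in the number of rows, so it is mainly useful for validating a faster approach on a small sample.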


Score: 1


Assuming unique dates, you could compute a crosstab and then a rolling sum over 30 days, use an indexing lookup to retrieve the rolling count of each row's own country, and use that to keep the countries with rarer (or equal) counts. Finally, compute the score by dividing by the total count in each row's window:

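import numpy as np

country_df['datetime'] = pd.to_datetime(country_df['datetime'])
# idx: integer code per row; cols: unique countries in order of appearance
idx, cols = pd.factorize(country_df['country'])
# per-date counts of each country, summed over a rolling 30-day window
tmp = pd.crosstab(country_df['datetime'], country_df['country']).rolling('30D').sum()
# rolling count of each row's own country
count = tmp.reindex(index=country_df['datetime'], columns=cols).to_numpy()[np.arange(len(country_df)), idx]
country_df['rarity_score'] = country_df['datetime'].map(tmp.where(tmp.le(count, axis=0)).sum(axis=1).div(tmp.sum(axis=1)))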


  country   datetime  rarity_score
0     USA 2023-01-01          1.00
1     USA 2023-01-02          1.00
2     USA 2023-01-03          1.00
3  Canada 2023-01-04          0.25
4      UK 2023-01-05          0.40
5  Canada 2023-01-06          0.50
6     USA 2023-01-10          1.00
7  Canada 2023-01-11          0.50
8  Canada 2023-01-12          1.00
9      UK 2023-02-07          0.50


# pd.crosstab(country_df['datetime'], country_df['country'])
country     Canada  UK  USA
2023-01-01       0   0    1
2023-01-02       0   0    1
2023-01-03       0   0    1
2023-01-04       1   0    0
2023-01-05       0   1    0
2023-01-06       1   0    0
2023-01-10       0   0    1
2023-01-11       1   0    0
2023-01-12       1   0    0
2023-02-07       0   1    0

# tmp
country     Canada   UK  USA
2023-01-01     0.0  0.0  1.0
2023-01-02     0.0  0.0  2.0
2023-01-03     0.0  0.0  3.0
2023-01-04     1.0  0.0  3.0
2023-01-05     1.0  1.0  3.0
2023-01-06     2.0  1.0  3.0
2023-01-10     2.0  1.0  4.0
2023-01-11     3.0  1.0  4.0
2023-01-12     4.0  1.0  4.0
2023-02-07     2.0  1.0  1.0

# tmp.where(tmp.le(count, axis=0))
country     Canada   UK  USA
2023-01-01     0.0  0.0  1.0
2023-01-02     0.0  0.0  2.0
2023-01-03     0.0  0.0  3.0
2023-01-04     1.0  0.0  NaN
2023-01-05     1.0  1.0  NaN
2023-01-06     2.0  1.0  NaN
2023-01-10     2.0  1.0  4.0
2023-01-11     3.0  1.0  NaN
2023-01-12     4.0  1.0  4.0
2023-02-07     NaN  1.0  1.0
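The lookup step (`factorize` followed by NumPy integer indexing) may be the least obvious part, so here is a small isolated sketch of it. The toy `table` and its values are made up for illustration and are not taken from the data above.

```python
import numpy as np
import pandas as pd

countries = pd.Series(["USA", "Canada", "USA", "UK"])
# idx holds one integer code per row; cols holds the unique labels
# in order of first appearance.
idx, cols = pd.factorize(countries)

# Toy table of counts: one row per observation, one column per country,
# with the columns ordered like `cols`.
table = pd.DataFrame([[1, 0, 0], [1, 1, 0], [2, 1, 0], [2, 1, 1]], columns=cols)

# Pair row i with column idx[i] to pick exactly one value per row.
picked = table.to_numpy()[np.arange(len(table)), idx]
print(picked.tolist())  # [1, 1, 2, 1]
```

Reindexing the columns with `cols` first, as the answer does, is what guarantees that the integer codes line up with the column positions.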


Score: 0


I think it is better to create a time series from your original dataframe:

data = pd.DataFrame({'country': ['USA', 'USA', 'USA', 'Canada', 'UK', 'Canada', 'USA', 'Canada', 'Canada', 'UK']}, index = country_data["datetime"])
data.index = pd.to_datetime(data.index)
data = data["country"]

Afterwards (this may not be the most efficient way to address your issue, but assuming you don't have a large df it will suffice):

from pandas.tseries.offsets import Day

vals = {}
# assume the index is already sorted and the dates are unique
for i in data.index:
    last_30 = i - 30 * Day()  # start of the lookback window (label slicing includes both ends)
    d = data.loc[last_30:i].value_counts()  # count of each country in the window
    # share of every country seen no more often than this row's country (ties included)
    vals[i] = d[d <= d[data.loc[i]]].sum() / d.sum()
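The loop leaves the scores in a dict keyed by timestamp. One way to attach them to a dataframe is to map the datetime column through that dict, sketched below with a hypothetical two-row stand-in for `vals` and a made-up `frame`. This assumes the timestamps are unique, since duplicate keys would overwrite each other in the dict.

```python
import pandas as pd

# Hypothetical stand-in for the `vals` dict built by the loop above.
vals = {pd.Timestamp("2023-01-04"): 0.25, pd.Timestamp("2023-01-05"): 0.4}

frame = pd.DataFrame({
    "country": ["Canada", "UK"],
    "datetime": pd.to_datetime(["2023-01-04", "2023-01-05"]),
})
# Look up each row's timestamp in the timestamp-indexed Series of scores.
frame["rarity_score"] = frame["datetime"].map(pd.Series(vals))
print(frame)
```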

  • Published on July 20, 2023, 19:31:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76729407.html



