Obtaining timeframe for ID groups based on state change

Question

First off, my apologies, I'm a complete novice when it comes to Python. I use it extremely infrequently but require it for this problem.

I have a set of data which looks like the below:

id state dt
101 0 2022-15
101 1 2022-22
101 0 2022-26
102 0 2022-01
102 1 2022-41
103 1 2022-03
103 0 2022-12

I need to provide an output which displays the amount of time each ID was in state "1", e.g. for ID 101: state1_start_dt = "2022-22", state1_end_dt = "2022-25".

The data is in .CSV format. I've attempted to bring this in via Pandas, utilise groupby on the df and then loop over this - however this seems extremely slow.

I've come across Finite State Machines, which seem relevant to my requirements, however I'm in way over my head attempting to create a Finite State Machine in Python which accepts .CSV input, provides output per ID group, and incorporates logic for scenarios where the last entry for an ID is state = "1" - in that case we'd assume the time frame ran until the end of 2022.

If anyone can provide some sources or sample code which I can break down to get a better understanding - that would be great.

EDIT

Some examples to be clearer on what I'd like to achieve:

- For IDs that have no ending 0 in the state sequence, the state1_end_dt should be entered as '2022-52' (the final week in 2022).

- For IDs which have alternating states, we can incorporate a second, third, fourth etc. set of columns (e.g. state1_start_dt_2, state1_end_dt_2). This allows each window to be accounted for; for entries that only have one window, these extra columns can be NULL (see the illustrative table after this list).

- For IDs which have no "1" present in the state column, these can be skipped.

- For IDs which do not have any 0 states present, the minimum dt value should be taken as the state1_start_dt and '2022-52' can be entered for state1_end_dt.
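
Putting these rules together, for the sample data above the output I'm after would look roughly like this (illustrative only; the *_2 columns stay NULL since each ID here has only one window):

id   state1_start_dt  state1_end_dt  state1_start_dt_2  state1_end_dt_2
101  2022-22          2022-25        NULL               NULL
102  2022-41          2022-52        NULL               NULL
103  2022-03          2022-11        NULL               NULL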

Answer 1

Score: 1

If the csv file called one_zero.csv is this:

id,state,dt
100,0,2022-15
100,1,2022-22
100,0,2022-26
101,0,2022-01
101,1,2022-41
102,1,2022-03
102,0,2022-12
102,1,2022-33

(I've added one additional item to the end.)

Then this code gives you what you want.

import pandas as pd

df = pd.read_csv("one_zero.csv")
result = {}
for id_, sub_df in df.groupby('id'):
    sub_df = sub_df.sort_values("dt")
    intervals = []
    start_dt = None
    for state, dt in zip(sub_df["state"], sub_df["dt"]):
        if state == 1:
            start_dt = dt
        if state == 0 and start_dt is not None:
            week = int(dt.split("-", maxsplit=1)[1])
            intervals.append((start_dt, f"2022-{week-1:02d}"))
            start_dt = None
    if start_dt is not None:
        intervals.append((start_dt, "2022-52"))
    result[id_] = intervals

At the end the result dictionary will contain this:

{
 100: [('2022-22', '2022-25')],
 101: [('2022-41', '2022-52')],
 102: [('2022-03', '2022-11'), ('2022-33', '2022-52')]
}

With this groupby and sort_values, the code works even if the lines in the csv file are shuffled. I've used a formatted string to build the end week: the 02d means the week number is always two digits, zero-padded for the first 9 weeks.
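
For example:

week = 9
print(f"2022-{week-1:02d}")  # prints '2022-08' -- the end week is the week before the 0-state row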

I guess you need less memory if you iterate over the rows like this, but for me the zip version is more familiar:

    for _, row in sub_df.iterrows():
        state = row["state"]
        dt = row["dt"]
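
If you then want the wide column layout from your question (state1_start_dt, state1_end_dt, state1_start_dt_2, ...), the result dict can be flattened afterwards. A minimal sketch, assuming those column names and the result built above:

rows = []
for id_, intervals in result.items():
    row = {"id": id_}
    for i, (start, end) in enumerate(intervals, start=1):
        suffix = "" if i == 1 else f"_{i}"
        row[f"state1_start_dt{suffix}"] = start
        row[f"state1_end_dt{suffix}"] = end
    rows.append(row)
wide = pd.DataFrame(rows)  # IDs with fewer windows get NaN in the extra columns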

Answer 2

Score: 1

IIUC, here are some functions to perform the aggregation you are looking for.

First, we convert the strings '%Y-%W' (e.g. '2022-15') into a DateTime (the Monday of that week), e.g. '2022-04-11', as it is easier to deal with actual dates than these strings. This makes this solution generic in that it can have arbitrary dates in it, not just for a single year.
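
As a quick check of that conversion (the appended '-Mon' pins the parse to the Monday of the given week):

import pandas as pd

pd.to_datetime('2022-15' + '-Mon', format='%Y-%W-%a')
# Timestamp('2022-04-11 00:00:00')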

Second, we augment the df with a "sentinel": a row for each id that is on the first week of the next year (next year being max year of all dates, plus 1) with state = 0. That allows us to not worry whether a sequence ends with 0 or not.

Then, we essentially group by id and apply the following logic: keep only transitions, so, e.g., [1,1,1,0,0,1,0] becomes [1,.,.,0,.,1,0] (where '.' indicates dropped values). That gives us the spans we are looking for (after subtracting one week for the 0 states).

Edit: speedup: instead of applying the masking logic to each group, we detect transitions globally (on the sentinel-augmented df, sorted by ['id', 'dt', 'state']). Since each id sequence in the augmented df ends with the sentinel (0), we are guaranteed to catch the first 1 of the next id.
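
To illustrate that transition mask on a toy state sequence (a standalone example, separate from the functions below):

import pandas as pd

s = pd.Series([1, 1, 1, 0, 0, 1, 0])
mask = s != s.shift(fill_value=0)
print(mask.tolist())
# [True, False, False, True, False, True, True] -> keeps exactly the 0->1 and 1->0 transition rows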

Putting it all together, including a postproc() to convert dates back into strings of year-week:

import pandas as pd

def preproc(df):
    df = df.assign(dt=pd.to_datetime(df['dt'] + '-Mon', format='%Y-%W-%a'))
    max_year = df['dt'].max().year
    # first week next year:
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

# speed up
def proc(df):
    mask = df['state'] != df['state'].shift(fill_value=0)
    df = df[mask]
    z = df.assign(c=df.groupby('id').cumcount()).set_index(['c', 'id'])['dt'].unstack('c')
    z[z.columns[1::2]] -= pd.offsets.Week(1)
    cols = [
        f'{x}_{i}'
        for i in range(len(z.columns) // 2)
        for x in ['start', 'end']
    ]
    return z.set_axis(cols, axis=1)

def asweeks_str(t, nat='--'):
    # t == t is False for NaT, so missing cells fall through to the placeholder
    return f'{t:%Y-%W}' if t and t == t else nat

def postproc(df):
    # convert dates into strings '%Y-%W'
    return df.applymap(asweeks_str)

Examples

First, let's use the example that is in the original question. Note that this doesn't exemplify some of the corner cases we can handle (more on that in a minute).

df = pd.DataFrame({
    'id': [101, 101, 101, 102, 102, 103, 103],
    'state': [0, 1, 0, 0, 1, 1, 0],
    'dt': ['2022-15', '2022-22', '2022-26', '2022-01', '2022-41', '2022-03', '2022-12'],
})

>>> postproc(proc(preproc(df)))
     start_0    end_0
id                   
101  2022-22  2022-25
102  2022-41  2022-52
103  2022-03  2022-11

But let's generate some random data, to observe some corner cases:

import numpy as np

def gen(n, nids=2):
    wk = np.random.randint(1, 53, n*nids)
    st = np.random.choice([0, 1], n*nids)
    ids = np.repeat(np.arange(nids) + 101, n)
    df = pd.DataFrame({
        'id': ids,
        'state': st,
        'dt': [f'2022-{w:02d}' for w in wk],
    })
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

Now:

np.random.seed(0)  # reproducible example
df = gen(6, 3)

>>> df
     id  state       dt
0   101      0  2022-01
1   101      0  2022-04
2   101      1  2022-04
3   101      1  2022-40
4   101      1  2022-45
5   101      1  2022-48
6   102      1  2022-10
7   102      1  2022-20
8   102      0  2022-22
9   102      1  2022-24
10  102      0  2022-37
11  102      1  2022-51
12  103      1  2022-02
13  103      0  2022-07
14  103      0  2022-13
15  103      1  2022-25
16  103      1  2022-25
17  103      1  2022-39

There are several interesting things here. First, 101 starts with a 0 state, whereas 102 and 103 both start with 1. Then, there are runs of repeated 1 states for all ids. There are also repeated weeks: '2022-04' for 101 and '2022-25' for 103.

Nevertheless, the aggregation works just fine and produces:

>>> postproc(proc(preproc(df)))
     start_0    end_0  start_1    end_1  start_2    end_2
id                                                       
101  2022-04  2022-52       --       --       --       --
102  2022-10  2022-21  2022-24  2022-36  2022-51  2022-52
103  2022-02  2022-06  2022-25  2022-52       --       --

Speed

np.random.seed(0)
n = 10
k = 100_000
df = gen(n, k)
%timeit preproc(df)
483 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The processing itself takes less than 200ms for 1 million rows:

a = preproc(df)

%timeit proc(a)
185 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As for the post-processing (converting dates back to year-week strings), it is the slowest thing of all:

b = proc(a)

%timeit postproc(b)
1.63 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For a speed-up of that post-processing, we can rely on the fact that there are only a small number of distinct dates that are week-starts (52 per year, plus NaT for the blank cells):

def postproc2(df, nat='--'):
    dct = {
        t: f'{t:%Y-%W}' if t and t == t else nat
        for t in df.stack().reset_index(drop=True).drop_duplicates()
    }
    return df.applymap(dct.get)

%timeit postproc2(b)
542 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

We could of course do something similar for preproc().
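
A sketch of what that could look like (an assumption on my part, not benchmarked: cache the per-string conversion over the ~52 unique week labels, mirroring postproc2()):

def preproc2(df):
    # convert each distinct '%Y-%W' string once, then map
    dct = {s: pd.to_datetime(s + '-Mon', format='%Y-%W-%a') for s in pd.unique(df['dt'])}
    df = df.assign(dt=df['dt'].map(dct))
    # the rest is unchanged from preproc()
    max_year = df['dt'].max().year
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    return df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)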

Answer 3

Score: 1

Another alternative:

res = (
    df.drop(columns="dt")
    .assign(week=df["dt"].str.split("-").str[1].astype("int"))
    .sort_values(["id", "week"])
    .assign(group=lambda df:
        df.groupby("id")["state"].diff().fillna(1).ne(0).cumsum()
    )
    .drop_duplicates(subset="group", keep="first")
    .loc[lambda df: df["state"].eq(1) | df["id"].eq(df["id"].shift())]
    .assign(no=lambda df: df.groupby("id")["state"].cumsum())
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: "end", 1: "start"}).fillna("52").astype("int")
)[["start", "end"]]
  • First add a new column week and sort along id and week. (The sorting might not be necessary if the data already comes sorted.)
  • Then look id-group-wise for blocks of consecutive 0 or 1 and, based on the result (stored in the new column group), drop all respective duplicates while keeping the firsts (the others aren't relevant according to the logic you've laid out); see the short demo after this list.
  • Afterwards also remove the 0-states at the start of an id-group.
  • On the result identify id-group-wise the connected start-end groups (stored in the new column no).
  • Then .pivot the thing: pull id and no into the index and state into the columns.
  • Afterwards fill the NaN with 52 and do some casting, renaming, and sorting to get the result in better shape.
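
A short demo of the block-detection step on a toy sequence (the real pipeline applies it per id group):

import pandas as pd

s = pd.Series([0, 0, 1, 1, 0, 1])
print(s.diff().fillna(1).ne(0).cumsum().tolist())
# [1, 1, 2, 2, 3, 4] -- consecutive runs of the same state share one group label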

If you really want to move the various start-end combinations into columns, then replace the code from the .pivot line onwards as follows:

res = (
    ...
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: 1, 1: 0}).fillna("52").astype("int")
    .unstack().sort_index(level=1, axis=1)
)
res.columns = [f"{'start' if s == 0 else 'end'}_{n}" for s, n in res.columns]

Results with the dataframe from @Pierre's answer:

state   start  end
id  no            
101 1       4   52
102 1      10   22
    2      24   37
    3      51   52
103 1       2    7
    2      25   52

or

     start_1  end_1  start_2  end_2  start_3  end_3
id                                                 
101      4.0   52.0      NaN    NaN      NaN    NaN
102     10.0   22.0     24.0   37.0     51.0   52.0
103      2.0    7.0     25.0   52.0      NaN    NaN
