Obtaining timeframe for ID groups based on state change

Question
First off, my apologies, I'm a complete novice when it comes to Python. I use it extremely infrequently but require it for this problem.
I have a set of data which looks like the below:
id | state | dt |
---|---|---|
101 | 0 | 2022-15 |
101 | 1 | 2022-22 |
101 | 0 | 2022-26 |
102 | 0 | 2022-01 |
102 | 1 | 2022-41 |
103 | 1 | 2022-03 |
103 | 0 | 2022-12 |
I need to provide an output which displays the amount of time each ID was in state = "1". E.g. for ID 101: state1_start_dt = "2022-22", state1_end_dt = "2022-25".
The data is in .CSV format. I've attempted to bring this in via Pandas, utilise groupby on the df and then loop over this; however, this seems extremely slow.
I've come across Finite State Machines, which seem to match my requirements; however, I'm in way over my head attempting to create a Finite State Machine in Python which accepts .CSV input, provides output per ID group, and incorporates logic to account for the scenario where the last entry for an ID is state = "1" (in which case we'd assume the time frame ran until the end of 2022).
If anyone can provide some sources or sample code which I can break down to get a better understanding - that would be great.
EDIT
Some examples to be clearer on what I'd like to achieve:
- For IDs that have no ending 0 in the state sequence, the state1_end_dt should be entered as '2022-52' (the final week in 2022).
- For IDs which have alternating states, we can incorporate a second, third, fourth, etc. set of columns (e.g. state1_start_dt_2, state1_end_dt_2). This will allow each window to be accounted for. For any entries that only have one window, these extra columns can be NULL.
- For IDs which have no "1" present in the state column, these can be skipped.
- For IDs which do not have any 0 states present, the minimum dt value should be taken as the state1_start_dt and '2022-52' can be entered for state1_end_dt.
Answer 1
Score: 1
If the CSV file called one_zero.csv is this:
id,state,dt
100,0,2022-15
100,1,2022-22
100,0,2022-26
101,0,2022-01
101,1,2022-41
102,1,2022-03
102,0,2022-12
102,1,2022-33
(I've added one additional item to the end.)
Then this code gives you what you want.
import pandas as pd

df = pd.read_csv("one_zero.csv")
result = {}
for id_, sub_df in df.groupby('id'):
    sub_df = sub_df.sort_values("dt")
    intervals = []
    start_dt = None
    for state, dt in zip(sub_df["state"], sub_df["dt"]):
        if state == 1:
            start_dt = dt
        if state == 0 and start_dt is not None:
            week = int(dt.split("-", maxsplit=1)[1])
            intervals.append((start_dt, f"2022-{week-1:02d}"))
            start_dt = None
    if start_dt is not None:
        intervals.append((start_dt, "2022-52"))
    result[id_] = intervals
At the end the result dictionary will contain this:
{
    100: [('2022-22', '2022-25')],
    101: [('2022-41', '2022-52')],
    102: [('2022-03', '2022-11'), ('2022-33', '2022-52')]
}
With this groupby and sort_values, it works even if you shuffle the lines in the CSV file. I've used a formatted string to fix the week number: the 02d means the week is always two digits, padded with a leading zero for the first nine weeks.
I guess you need less memory if you iterate on the rows like this, but for me the zip version is more familiar.
for _, row in sub_df.iterrows():
    state = row["state"]
    dt = row["dt"]
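If the wide layout from the question (state1_start_dt, state1_end_dt, plus _2, _3, ... for additional windows) is what is ultimately needed, the result dictionary can be flattened afterwards. A minimal sketch, assuming the result dict produced above; the column naming is my own, not part of the original answer:

import pandas as pd

# Flatten each id's list of (start, end) intervals into numbered columns.
rows = []
for id_, intervals in result.items():
    if not intervals:  # ids with no state-1 window are skipped, per the question
        continue
    row = {"id": id_}
    for i, (start, end) in enumerate(intervals, start=1):
        suffix = "" if i == 1 else f"_{i}"
        row[f"state1_start_dt{suffix}"] = start
        row[f"state1_end_dt{suffix}"] = end
    rows.append(row)

wide = pd.DataFrame(rows)
print(wide)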
Answer 2
Score: 1
IIUC, here are some functions to perform the aggregation you are looking for.
First, we convert the '%Y-%W' strings (e.g. '2022-15') into a DateTime (the Monday of that week), e.g. '2022-04-11', as it is easier to deal with actual dates than with these strings. This makes the solution generic in that it can handle arbitrary dates, not just a single year.
Second, we augment the df with a "sentinel": a row for each id that falls on the first week of the next year (next year being the max year of all dates, plus 1) with state = 0. That allows us not to worry about whether a sequence ends with 0 or not.
Then, we essentially group by id and apply the following logic: keep only transitions, so, e.g., [1,1,1,0,0,1,0] becomes [1,.,.,0,.,1,0] (where '.' indicates dropped values). That gives us the spans we are looking for (after subtracting one week for the 0 states).
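As a tiny illustration of that transition mask (a toy example added here, not part of the answer's code), comparing a state series to its shifted self keeps only the rows where the value changes:

import pandas as pd

s = pd.Series([1, 1, 1, 0, 0, 1, 0])
mask = s != s.shift(fill_value=0)  # True where the state differs from the previous row
print(s[mask].tolist())            # [1, 0, 1, 0] -- only the transitions survive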
Edit (speedup): instead of applying the masking logic to each group, we detect transitions globally (on the sentinel-augmented df, sorted by ['id', 'dt', 'state']). Since each id sequence in the augmented df ends with the sentinel (0), we are guaranteed to catch the first 1 of the next id.
Putting it all together, including a postproc() to convert dates back into year-week strings:
import pandas as pd

def preproc(df):
    df = df.assign(dt=pd.to_datetime(df['dt'] + '-Mon', format='%Y-%W-%a'))
    max_year = df['dt'].max().year
    # first week next year:
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

# speed up
def proc(df):
    mask = df['state'] != df['state'].shift(fill_value=0)
    df = df[mask]
    z = df.assign(c=df.groupby('id').cumcount()).set_index(['c', 'id'])['dt'].unstack('c')
    z[z.columns[1::2]] -= pd.offsets.Week(1)
    cols = [
        f'{x}_{i}'
        for i in range(len(z.columns) // 2)
        for x in ['start', 'end']
    ]
    return z.set_axis(cols, axis=1)

def asweeks_str(t, nat='--'):
    return f'{t:%Y-%W}' if t and t == t else nat

def postproc(df):
    # convert dates into strings '%Y-%W'
    return df.applymap(asweeks_str)
Examples
First, let's use the example from the original question. Note that this doesn't exemplify some of the corner cases we are able to handle (more on that in a minute).
df = pd.DataFrame({
    'id': [101, 101, 101, 102, 102, 103, 103],
    'state': [0, 1, 0, 0, 1, 1, 0],
    'dt': ['2022-15', '2022-22', '2022-26', '2022-01', '2022-41', '2022-03', '2022-12'],
})
>>> postproc(proc(preproc(df)))
start_0 end_0
id
101 2022-22 2022-25
102 2022-41 2022-52
103 2022-03 2022-11
But let's generate some random data, to observe some corner cases:
import numpy as np

def gen(n, nids=2):
    wk = np.random.randint(1, 53, n*nids)
    st = np.random.choice([0, 1], n*nids)
    ids = np.repeat(np.arange(nids) + 101, n)
    df = pd.DataFrame({
        'id': ids,
        'state': st,
        'dt': [f'2022-{w:02d}' for w in wk],
    })
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df
Now:
np.random.seed(0) # reproducible example
df = gen(6, 3)
>>> df
id state dt
0 101 0 2022-01
1 101 0 2022-04
2 101 1 2022-04
3 101 1 2022-40
4 101 1 2022-45
5 101 1 2022-48
6 102 1 2022-10
7 102 1 2022-20
8 102 0 2022-22
9 102 1 2022-24
10 102 0 2022-37
11 102 1 2022-51
12 103 1 2022-02
13 103 0 2022-07
14 103 0 2022-13
15 103 1 2022-25
16 103 1 2022-25
17 103 1 2022-39
There are several interesting things here. First, 101 starts with a 0 state, whereas 102 and 103 both start with 1. Then, there are repeated 1s for all ids. There are also repeated weeks: '2022-04' for 101 and '2022-25' for 103.
Nevertheless, the aggregation works just fine and produces:
>>> postproc(proc(preproc(df)))
start_0 end_0 start_1 end_1 start_2 end_2
id
101 2022-04 2022-52 -- -- -- --
102 2022-10 2022-21 2022-24 2022-36 2022-51 2022-52
103 2022-02 2022-06 2022-25 2022-52 -- --
Speed
np.random.seed(0)
n = 10
k = 100_000
df = gen(n, k)
%timeit preproc(df)
483 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The processing itself takes less than 200ms for 1 million rows:
a = preproc(df)
%timeit proc(a)
185 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As for the post-processing (converting dates back to year-week strings), it is the slowest thing of all:
b = proc(a)
%timeit postproc(b)
1.63 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For a speed-up of that post-processing, we can rely on the fact that there are only a small number of distinct dates that are week starts (52 per year, plus NaT for the blank cells):
def postproc2(df, nat='--'):
    dct = {
        t: f'{t:%Y-%W}' if t and t == t else nat
        for t in df.stack().reset_index(drop=True).drop_duplicates()
    }
    return df.applymap(dct.get)
%timeit postproc2(b)
542 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
We could of course do something similar for preproc().
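For illustration, a preproc() speed-up in that same spirit might parse each distinct week string only once and map the dt column through the resulting cache. This is only a sketch of what "something similar" could look like, not code from the original answer:

def preproc2(df):
    # Same idea as postproc2: there are at most ~52 distinct '%Y-%W' strings per year,
    # so parse each one once and reuse the parsed Timestamp.
    dct = {
        s: pd.to_datetime(s + '-Mon', format='%Y-%W-%a')
        for s in df['dt'].unique()
    }
    df = df.assign(dt=df['dt'].map(dct))
    max_year = df['dt'].max().year
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(pd.unique(df['id']), columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    return df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)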
Answer 3
Score: 1
Another alternative:
res = (
    df.drop(columns="dt")
    .assign(week=df["dt"].str.split("-").str[1].astype("int"))
    .sort_values(["id", "week"])
    .assign(group=lambda df:
        df.groupby("id")["state"].diff().fillna(1).ne(0).cumsum()
    )
    .drop_duplicates(subset="group", keep="first")
    .loc[lambda df: df["state"].eq(1) | df["id"].eq(df["id"].shift())]
    .assign(no=lambda df: df.groupby("id")["state"].cumsum())
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: "end", 1: "start"}).fillna("52").astype("int")
)[["start", "end"]]
- First add a new column week and sort along id and week. (The sorting might not be necessary if the data already come sorted.)
- Then look id-group-wise for blocks of consecutive 0 or 1 and, based on the result (stored in the new column group), drop all respective duplicates while keeping the firsts (the others aren't relevant according to the logic you've laid out).
- Afterwards also remove the 0-states at the start of an id-group.
- On the result identify id-group-wise the connected start-end groups (stored in the new column no).
- Then .pivot the thing: pull id and no into the index and state into the columns.
- Afterwards fill the NaN with 52 and do some casting, renaming, and sorting to get the result into better shape.
If you really want to move the various start-end combinations into columns, then replace the part from the pivot line downwards as follows:
res = (
    ...
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: 1, 1: 0}).fillna("52").astype("int")
    .unstack().sort_index(level=1, axis=1)
)
res.columns = [f"{'start' if s == 0 else 'end'}_{n}" for s, n in res.columns]
Results with the dataframe from @Pierre's answer:
state   start  end
id  no
101 1       4   52
102 1      10   22
    2      24   37
    3      51   52
103 1       2    7
    2      25   52
or
start_1 end_1 start_2 end_2 start_3 end_3
id
101 4.0 52.0 NaN NaN NaN NaN
102 10.0 22.0 24.0 37.0 51.0 52.0
103 2.0 7.0 25.0 52.0 NaN NaN
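If the question's '2022-52'-style strings are preferred over bare week numbers, the integer weeks in res could be converted back at the end. A small sketch, assuming pandas is imported as pd, res is the result from above, and every week belongs to 2022 (as in the question's data):

res_str = res.applymap(lambda w: f"2022-{int(w):02d}" if pd.notna(w) else None)
print(res_str)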