Truncate time series files and extract some descriptive variables

Question


I have two major problems and can't work out a solution in Python. First, the context.
On the one hand, I have a dataset containing date points with an ID (1 ID = 1 patient), like this:

ID Date point
0001 25/12/2022 09:00
0002 29/12/2022 16:00
0003 30/12/2022 18:00
... ....
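
A minimal sketch of loading such a dataset with pandas. The storage format is not given in the question, so the semicolon-separated layout and the in-memory sample below are assumptions for illustration only. The point it shows is that the ID column should be read as text; otherwise `0001` becomes the integer `1` and no longer matches the file name `0001.txt`.

```python
import io
import pandas as pd

# Assumed layout: a semicolon-separated table (the real file format is unknown).
dataset_text = io.StringIO(
    "ID;Date point\n"
    "0001;25/12/2022 09:00\n"
    "0002;29/12/2022 16:00\n"
    "0003;30/12/2022 18:00\n"
)

# dtype=str keeps the leading zeros of the IDs.
df = pd.read_csv(dataset_text, sep=";", dtype={"ID": str})

# The dates are day-first, so the format is given explicitly.
df["Date point"] = pd.to_datetime(df["Date point"], format="%d/%m/%Y %H:%M")
```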

On the other hand, I have a folder with many text files containing the time series, like this:
0001.txt
0002.txt
0003.txt
...

The files all have the same structure: the ID (the same as in the dataset) is in the file name, and the content of each file is laid out like this (the first column contains the date and time, the second the value):

25/12/2022 09:00 155
25/12/2022 09:01 156
25/12/2022 09:02 157
25/12/2022 09:03 158
...
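
A minimal sketch of how one file with this layout could be read into a time-indexed pandas Series. The sample lines are copied from above and held in memory; with a real file, the path would be passed to `read_csv` instead. The column names `date`, `time` and `value` are made up for the example.

```python
import io
import pandas as pd

# Sample content copied from the question; a real file would be opened by path.
sample = io.StringIO(
    "25/12/2022 09:00 155\n"
    "25/12/2022 09:01 156\n"
    "25/12/2022 09:02 157\n"
    "25/12/2022 09:03 158\n"
)

# Whitespace splits each line into three fields: date, time and value.
raw = pd.read_csv(
    sample,
    sep=r"\s+",
    header=None,
    names=["date", "time", "value"],  # hypothetical column names
    dtype={"date": str, "time": str},
)

# Join date and time, then parse with an explicit day-first format.
timestamps = pd.to_datetime(raw["date"] + " " + raw["time"], format="%d/%m/%Y %H:%M")
series = pd.Series(raw["value"].to_numpy(), index=pd.DatetimeIndex(timestamps), name="value")
```

Giving the format explicitly avoids ambiguous dates such as 01/02/2022 being read month-first.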

1/ I would like to truncate the text files and retrieve only the values prior to the dataset's 48 h date point.

2/ For statistical analysis, I want to compute values such as the mean or the maximum of these values and add them to a dataframe like this:

ID Mean Maximum
0001
0002
0003
... .... ...

I know this will be a trivial problem for you, but for me (a beginner in Python) it is a challenge!

Thank you, everybody.

In short: manage time series using a dataframe containing date points, and compute some statistical values.

Answer 1

Score: 0


You could do something along these lines using pandas (I've not been able to test this fully):

import pandas as pd
from pathlib import Path


# I'll create a limited version of your initial table
data = {
    "ID": ["0001", "0002", "0003"],
    "Date point": ["25/12/2022 09:00", "29/12/2022 16:00", "30/12/2022 18:00"]
}

# put in a Pandas DataFrame
df = pd.DataFrame(data)

# the dates are day-first, so state the format explicitly
date_format = "%d/%m/%Y %H:%M"

# convert the "Date point" column to datetime objects
df["Date point"] = pd.to_datetime(df["Date point"], format=date_format)

# provide the path to the folder containing the files
folder = Path("/path_to_files")

newdata = {"ID": [], "Mean": [], "Maximum": []}  # an empty dictionary that you'll fill with the required statistical info

# loop through the IDs and read in the files
for i, date in zip(df["ID"], df["Date point"]):
    inputfile = folder / f"{i}.txt"  # construct file name
    if inputfile.exists():
        # read in the file
        subdata = pd.read_csv(
            inputfile,
            sep=r"\s+",  # columns are separated by whitespace (raw string avoids an invalid escape)
            header=None,  # there's no header information
            names=["date", "time", "value"],  # the date and time end up in separate columns
            dtype={"date": str, "time": str},
        )

        # combine the date and time columns and convert them to datetime objects
        subdata["datetime"] = pd.to_datetime(
            subdata["date"] + " " + subdata["time"], format=date_format
        )

        # get the values 48 hours after the current date point
        td = pd.Timedelta(value=48, unit="hours")
        mask = (subdata["datetime"] > date) & (subdata["datetime"] <= date + td)

        # add in the required info
        newdata["ID"].append(i)
        newdata["Mean"].append(subdata.loc[mask, "value"].mean())
        newdata["Maximum"].append(subdata.loc[mask, "value"].max())

# put newdata into a DataFrame
dfnew = pd.DataFrame(newdata)
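
The code above selects the 48 hours after the date point. If the question instead means the 48 hours before it, the selection could be sketched as follows. The helper name `window_stats` and the daily sample values are made up for illustration, and the series is assumed to be indexed by timestamps.

```python
import pandas as pd


def window_stats(series, date_point, hours=48):
    """Mean and maximum of the values in the `hours` before `date_point`.

    Hypothetical helper: `series` must have a DatetimeIndex.
    """
    start = date_point - pd.Timedelta(hours=hours)
    # Keep start <= timestamp < date_point (the date point itself is excluded).
    window = series[(series.index >= start) & (series.index < date_point)]
    return {"Mean": window.mean(), "Maximum": window.max()}


# Made-up daily values from 22 to 26 December 2022, each at 09:00.
index = pd.date_range("2022-12-22 09:00", periods=5, freq="D")
series = pd.Series([100.0, 150.0, 160.0, 170.0, 180.0], index=index)

stats = window_stats(series, pd.Timestamp("2022-12-25 09:00"))
```

Only the values from 23 and 24 December fall inside the window here. Whether the boundaries should be inclusive is a choice the question does not settle.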

huangapple
  • Posted on 8 February 2023, 16:43:19
  • Please keep the link to this post when reposting: https://go.coder-hub.com/75383186.html