Question

I have two major problems that I can't work out how to solve in Python. First, some context.
On the one hand, I have a dataset containing a date point for each ID (1 ID = 1 patient), like this:

ID	Date point
0001	25/12/2022 09:00
0002	29/12/2022 16:00
0003	30/12/2022 18:00
...	....

On the other hand, I have a folder with many text files containing the time series, like this:
0001.txt
0002.txt
0003.txt
...

The files all have the same structure: the ID (the same as in the dataset) is the file name, and the contents are laid out like this (the first column contains the date and time, the second the value):

25/12/2022 09:00 155
25/12/2022 09:01 156
25/12/2022 09:02 157
25/12/2022 09:03 158
...

1/ I would like to truncate the text files and keep only the values from the 48 hours prior to the dataset date point.

2/ For some statistical analysis, I want to compute values such as the mean or the maximum of these values and add them to a dataframe like this:

ID	Mean	Maximum
0001
0002
0003
...	....	...

I know this will be a trivial problem for you, but for me (a beginner in Python) it is a challenge!

Thank you, everybody.

Manage time series with a dataframe containing date points and extract some statistical values.

Answer 1

Score: 0

You could do something along these lines using pandas (I've not been able to test this fully):

import pandas as pd
from pathlib import Path


# I'll create a limited version of your initial table
data = {
    "ID": ["0001", "0002", "0003"],
    "Date point": ["25/12/2022 09:00", "29/12/2022 16:00", "30/12/2022 18:00"]
}

# put in a Pandas DataFrame
df = pd.DataFrame(data)

# convert the "Date point" column to datetime objects (the dates are day-first)
df["Date point"] = pd.to_datetime(df["Date point"], format="%d/%m/%Y %H:%M")

# provide the path to the folder containing the files
folder = Path("/path_to_files")

newdata = {"ID": [], "Mean": [], "Maximum": []}  # an empty dictionary that you'll fill with the required statistical info

# loop through the IDs and read in the files
for i, date in zip(df["ID"], df["Date point"]):
    inputfile = folder / f"{i}.txt"  # construct file name
    if inputfile.exists():
        # read in the file
        subdata = pd.read_csv(
            inputfile,
            sep=r"\s+",  # columns are separated by spaces
            header=None,  # there's no header information
            names=["date", "time", "value"],  # the date and the time are read as separate columns
        )

        # combine the date and time columns and convert them to datetime objects
        subdata["datetime"] = pd.to_datetime(
            subdata["date"] + " " + subdata["time"], format="%d/%m/%Y %H:%M"
        )

        # get the values in the 48 hours before the current date point
        # (the question asks for the window prior to the date point)
        td = pd.Timedelta(value=48, unit="hours")
        mask = (subdata["datetime"] >= date - td) & (subdata["datetime"] < date)

        # add in the required info
        newdata["ID"].append(i)
        newdata["Mean"].append(subdata["value"].loc[mask].mean())
        newdata["Maximum"].append(subdata["value"].loc[mask].max())

# put newdata into a DataFrame
dfnew = pd.DataFrame(newdata)
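
If it helps, here is the same idea wrapped in two small functions, plus a tiny demo that writes a throwaway file so you can run it without your real data. This is only a sketch under some assumptions: the names `read_series`, `window_stats` and `demo` are made up for illustration, the timestamps are assumed to be day-first (`DD/MM/YYYY HH:MM`) as in your example, and "prior to the 48H date point" is read as the 48 hours before the date point, with the date point itself excluded.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Assumed layout of the timestamps (day-first), as in the question's example.
DATE_FORMAT = "%d/%m/%Y %H:%M"


def read_series(path):
    """Read one 'DD/MM/YYYY HH:MM value' file into a time-indexed Series."""
    raw = pd.read_csv(
        path,
        sep=r"\s+",                       # whitespace-separated columns
        header=None,                      # the files have no header row
        names=["date", "time", "value"],
        dtype={"date": str, "time": str},
    )
    index = pd.to_datetime(raw["date"] + " " + raw["time"], format=DATE_FORMAT)
    return pd.Series(raw["value"].to_numpy(), index=index)


def window_stats(points, folder, hours=48):
    """Mean and maximum of each ID's values in the `hours` before its date point."""
    rows = []
    for patient_id, date in zip(points["ID"], points["Date point"]):
        path = Path(folder) / f"{patient_id}.txt"
        if not path.exists():
            continue  # skip IDs that have no file
        series = read_series(path)
        start = date - pd.Timedelta(hours=hours)
        # keep start <= timestamp < date point
        window = series[(series.index >= start) & (series.index < date)]
        rows.append(
            {"ID": patient_id, "Mean": window.mean(), "Maximum": window.max()}
        )
    return pd.DataFrame(rows, columns=["ID", "Mean", "Maximum"])


def demo():
    """Run window_stats on a small made-up file in a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "0001.txt").write_text(
            "20/12/2022 09:00 999\n"   # more than 48 h before the date point
            "24/12/2022 09:00 150\n"   # inside the window
            "25/12/2022 08:59 160\n"   # inside the window
            "25/12/2022 09:00 155\n"   # at the date point itself
        )
        points = pd.DataFrame({
            "ID": ["0001", "0002"],    # there is no file for 0002 in this demo
            "Date point": pd.to_datetime(
                ["25/12/2022 09:00", "29/12/2022 16:00"], format=DATE_FORMAT
            ),
        })
        return window_stats(points, tmp)


if __name__ == "__main__":
    print(demo())
```

With your real data you would call `window_stats(df, "/path_to_files")` on the dataframe of date points. If you want the date point itself included in the window, change `<` to `<=` in the mask.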
