英文:
How do I count the times the word true was printed in my csv file?
问题
生成的输出是一个包含真或假的列,我需要统计每10行中打印了多少次“真”这个词。
在我提供的代码中,它读取一个文件夹中的csv文件并将它们打印到另一个文件夹中。在这些csv文件中,包含了在数据帧定义时选择的两列。此外,通过how_many_times
函数,添加了两列,用于计算值满足我给定条件的次数。
我的CSV示例(原始数据帧有更多行):
In [1]: dff = pd.DataFrame([['20220901-00:00:00', 50.0335, False, True], ['20220901-00:00:01', 50.024, False, False], ['20220901-00:00:02', 50.021, False, False]], columns=['t', 'f', 'f<49.975', 'f>50.025'])
这是我的代码(我使用了.sum
,但对我所需的功能无效):
import pandas as pd
import numpy as np
import glob
import os
all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i, f in enumerate(all_files):
df = pd.read_csv(f, header=0, usecols=['t', 'f'])
how_many_times1 = df.apply(lambda x: x['f'] < 49.975, axis=1).sum()
df['f<49.975'] = how_many_times1
how_many_times2 = df.apply(lambda x: x['f'] > 50.025, axis=1).sum()
df['f>50.025'] = how_many_times2
df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv')
注意:我已经在你的代码中修正了.sum
方法的调用,确保它正确计算每列的“真”值的数量。
英文:
The output that is generated is a column of true or false, I need to count how many times the word true was printed every 10 rows
In the code that I present, it reads csv files that are in one folder and prints them in another. In each of these csv's it contains two columns that were chosen when the dataframe was defined. In addition, two columns were added which, through the how_many_times function, count how many times the value meets the condition that I give it.
Example of my csv(original df has more rows):
In [1]: dff = pd.DataFrame([['20220901-00:00:00', 50.0335,False,True], ['20220901-00:00:01', 50.024,False,False], ['20220901-00:00:02', 50.021,False,False]], columns=['t', 'f','f<49.975','f>50.025'])
This is my code (I used .sum but it didn't work for what I needed):
import pandas as pd
import numpy as np
import glob
import os
all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i,f in enumerate(all_files):
df = pd.read_csv(f,header=0,usecols=["t","f"])
how_many_times1= df.apply(lambda x: x['f'] < 49.975, axis=1).sum
df['f<49.975']=how_many_times1
how_many_times2= df.apply(lambda x: x['f'] > 50.025, axis=1).sum
df['f>50.025']=how_many_times2
df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv')
答案1
得分: 0
你可以直接将.sum()
方法应用于作为DataFrame列的Pandas Series (how_many_times1.sum()
). 由于True等于1,False等于0,你可以直接计算条目而不需要应用条件。
这在需要总和时是有意义的。如果需要每十行周期性地进行求和,那么通过应用到列的函数中计算True值是有意义的。
下面的代码定义了两个"apply"函数,它们执行创建列的正确条目并打印每十行的总和的任务。
以下是详细操作的代码:
import pandas as pd
df = pd.DataFrame([['20220901-00:00:00', 50.0335],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ]],
columns=['t', 'f'])
columns = ['dummy', 'f<49.975','f>50.025']
print(df)
row1 = 1
sum1 = 0
lsm1 = []
def cond1(x):
global row1, sum1
cond = False
if x < 49.975:
cond=True
sum1+=1
if row1%10==0:
print('sum1:', sum1, 'at row:', row1)
lsm1.append(sum1)
sum1=0 # outcomment if cummulative sum required
else:
lsm1.append(None)
row1 += 1
return cond
row2 = 1
sum2 = 0
lsm2 = []
def cond2(x):
global row2, sum2
cond = False
if x > 50.025:
cond=True
sum2+=1
if row2%10==0:
print('sum2:', sum2, 'at row:', row2)
lsm2.append(sum2)
sum2=0 # outcomment if cummulative sum required
else:
lsm2.append(None)
row2 += 1
return cond
how_many_times1 = df['f'].apply(cond1)
df[columns[1]] = how_many_times1
df['sum1'] = lsm1
how_many_times2 = df['f'].apply(cond2)
df[columns[2]] =how_many_times2
df['sum2'] = lsm2
print(df)
打印结果如下:
t f
0 20220901-00:00:00 50.0335
1 20220901-00:00:01 50.1000
2 20220901-00:00:02 48.0210
3 20220901-00:00:01 50.1000
4 20220901-00:00:01 50.1000
5 20220901-00:00:02 48.0210
6 20220901-00:00:01 50.1000
7 20220901-00:00:01 50.1000
8 20220901-00:00:02 48.0210
9 20220901-00:00:01 50.1000
10 20220901-00:00:02 48.0210
11 20220901-00:00:01 50.1000
12 20220901-00:00:01 50.1000
13 20220901-00:00:01 50.1000
sum1: 3 at row: 10
sum2: 7 at row: 10
t f f<49.975 sum1 f>50.025 sum2
0 20220901-00:00:00 50.0335 False NaN True NaN
1 20220901-00:00:01 50.1000 False NaN True NaN
2 20220901-00:00:02 48.0210 True NaN False NaN
3 20220901-00:00:01 50.1000 False NaN True NaN
4 20220901-00:00:01 50.1000 False NaN True NaN
5 20220901-00:00:02 48.0210 True NaN False NaN
6 20220901-00:00:01 50.1000 False NaN True NaN
7 20220901-00:00:01 50.1000 False NaN True NaN
8 20220901-00:00:02 48.0210 True NaN False NaN
9 20220901-00:00:01 50.1000 False 3.0 True 7.0
10 20220901-00:00:02 48.0210 True NaN False NaN
11 20220901-00:00:01 50.1000 False NaN True NaN
12 20220901-00:00:01 50.1000 False NaN True NaN
13 20220901-00:00:01 50.1000 False NaN True NaN
英文:
You apply the .sum()
method directly to a Pandas Series being a DataFrame column ( how_many_times1.sum()
). And because True is equivalent to 1 and False to 0 you can directly count the entries without applying a condition.
This makes sense in case you need the total sum. In case you need the sum periodically each ten rows it makes sense to count the True values in the to the column applied function.
The code below defines two 'apply' functions which do the job of creating the right entries for the columns and printing the sum each ten rows.
See the code below for how it is done in detail:
import pandas as pd
df = pd.DataFrame([['20220901-00:00:00', 50.0335],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:02', 48.021 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ],
['20220901-00:00:01', 50.100 ]],
columns=['t', 'f' ])
columns = ['dummy', 'f<49.975','f>50.025']
print(df)
row1 = 1
sum1 = 0
lsm1 = []
def cond1(x):
global row1, sum1
cond = False
if x < 49.975:
cond=True
sum1+=1
if row1%10==0:
print('sum1:', sum1, 'at row:', row1)
lsm1.append(sum1)
sum1=0 # outcomment if cummulative sum required
else:
lsm1.append(None)
row1 += 1
return cond
row2 = 1
sum2 = 0
lsm2 = []
def cond2(x):
global row2, sum2
cond = False
if x > 50.025:
cond=True
sum2+=1
if row2%10==0:
print('sum2:', sum2, 'at row:', row2)
lsm2.append(sum2)
sum2=0 # outcomment if cummulative sum required
else:
lsm2.append(None)
row2 += 1
return cond
how_many_times1 = df['f'].apply(cond1)
df[columns[1]] = how_many_times1
df['sum1'] = lsm1
how_many_times2 = df['f'].apply(cond2)
df[columns[2]] =how_many_times2
df['sum2'] = lsm2
print(df)
prints
t f
0 20220901-00:00:00 50.0335
1 20220901-00:00:01 50.1000
2 20220901-00:00:02 48.0210
3 20220901-00:00:01 50.1000
4 20220901-00:00:01 50.1000
5 20220901-00:00:02 48.0210
6 20220901-00:00:01 50.1000
7 20220901-00:00:01 50.1000
8 20220901-00:00:02 48.0210
9 20220901-00:00:01 50.1000
10 20220901-00:00:02 48.0210
11 20220901-00:00:01 50.1000
12 20220901-00:00:01 50.1000
13 20220901-00:00:01 50.1000
sum1: 3 at row: 10
sum2: 7 at row: 10
t f f<49.975 sum1 f>50.025 sum2
0 20220901-00:00:00 50.0335 False NaN True NaN
1 20220901-00:00:01 50.1000 False NaN True NaN
2 20220901-00:00:02 48.0210 True NaN False NaN
3 20220901-00:00:01 50.1000 False NaN True NaN
4 20220901-00:00:01 50.1000 False NaN True NaN
5 20220901-00:00:02 48.0210 True NaN False NaN
6 20220901-00:00:01 50.1000 False NaN True NaN
7 20220901-00:00:01 50.1000 False NaN True NaN
8 20220901-00:00:02 48.0210 True NaN False NaN
9 20220901-00:00:01 50.1000 False 3.0 True 7.0
10 20220901-00:00:02 48.0210 True NaN False NaN
11 20220901-00:00:01 50.1000 False NaN True NaN
12 20220901-00:00:01 50.1000 False NaN True NaN
13 20220901-00:00:01 50.1000 False NaN True NaN
答案2
得分: 0
另一个选项可以是:
df[["f<49.975", "f>50.025"]] = (
df.assign(f1=df["f"] < 49.975, f2=df["f"] > 50.025)
.groupby(df.index // 10)[["f1", "f2"]].transform("sum")
.loc[df.index % 10 == 9]
)
- 添加两列
f1
和f2
到df
,根据这两个条件定义列值。 - 现在将每10行分组,并对这两个新列进行求和,以获得每个块的真值计数。使用
.transform
来保留原始索引。 - 然后仅选择每十行,将结果分配给这两个新列。
英文:
Another option would be:
df[["f<49.975", "f>50.025"]] = (
df.assign(f1=df["f"].lt(49.975), f2=df["f"].gt(50.025))
.groupby(df.index // 10)[["f1", "f2"]].transform("sum")
.loc[df.index % 10 == 9]
)
- Add two columns
f1
,f2
todf
, defined by the two conditions. - Now group every 10 rows and sum over the two new columns to get the truth-count per block. Use
.transform
to do that to keep the original index. - Then take only every tenth row and assign the result to the two new columns.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论