英文:
Detect multiple headers in CSV files
问题
在CSV文件中,您希望Python能够自动检测测量类型,而不必像df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]
那样手动定义。您可以尝试以下方法来实现这一目标:
import pandas as pd
df = pd.read_csv("three measurement.csv", header=None)
# 寻找测量类型行
df_titles = df.iloc[:, 0].str.strip('"') # 去除双引号并获取第一列
# 创建一个包含唯一值的列表,这些唯一值将成为测量类型的标识
measurement_types = df_titles[df_titles != ""].unique()
# 创建测量类型到数据框的映射字典
measurement_data = {}
for measurement_type in measurement_types:
# 使用布尔索引获取特定测量类型的数据框
measurement_df = df[df_titles == measurement_type]
# 去掉测量类型所在的行,并重置索引
measurement_df = measurement_df.iloc[1:].reset_index(drop=True)
# 创建MultiIndex列标签,从前4行获取信息,并去掉空白字符
header = measurement_df.iloc[:4, :].apply(lambda x: x.str.strip()).ffill(axis=1)
# 去掉前4行,重置索引,并设置列标签
measurement_df = measurement_df.iloc[4:].reset_index(drop=True)
measurement_df.columns = pd.MultiIndex.from_arrays(header.values)
# 存储到字典中
measurement_data[measurement_type] = measurement_df
# 现在,measurement_data 字典包含了自动检测到的所有测量类型和对应的数据框。
这段代码将自动检测CSV文件中的测量类型,创建一个字典,其中测量类型作为键,对应的数据框作为值。您不再需要手动定义df_titles
变量,它将根据CSV文件的内容动态确定测量类型。
英文:
I have a large csv file. Inside the csv file there are several headers. I've separated the three headers using code as below.
import pandas as pd
df = pd.read_csv("three measurement.csv", header=None)
# find header rows
df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]
# create groups for each section
groupings = df.iloc[:, 0].str.contains("|".join(df_titles)).cumsum()
# split into new dataframes as dictionary
d = {}
for i, j in df.groupby(groupings):
# define name of dictionary key as title, and set data of DF as values
d[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
# create MultiIndex from 3 header rows
columns=pd.MultiIndex.from_arrays(
j.iloc[0:4, :].ffill(axis=1).values))
# suggested not to use, but you can set the variables directly (outside of the dictionary)
globals()[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
columns=pd.MultiIndex.from_arrays(
j.iloc[0:4, :].ffill(axis=1).values))
Level and Distortion
, THD Ratio
, and Reference Waveform
are the types of measurements that I have separated. But the measurement type can change (increase or decrease) how to detect the measurement type without specifying the measurement type as in the code above (df_titles=[".."
). This means that in the code above to find the header line, you have to specify the names of the measurement types according to the csv that I show below. If the measurement type is different and increasing, how do I find the header row without having to define the df_title variable.
Here CSV File
"Level and Distortion",,,,,,,,,,,,,,,
"Ch1 (F)",,"Ch1 (H2)",,"Ch1 (H3)",,"Ch1 (Total)",,"Ch2 (F)",,"Ch2 (H2)",,"Ch2 (H3)",,"Ch2 (Total)",
X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y
Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms
20,0.00772013164376534,20,5.60982648239952E-05,20,0.000389709733151927,20,0.011492581958802,20,0.00699792689186063,20,0.000151471712877565,20,0.000389940899485093,20,0.010080448380793
21.1179638886716,0.00747175133180212,21.1179638886716,8.83327496082501E-05,21.1179638886716,0.000426696028852445,21.1179638886716,0.0122462876404656,21.1179638886716,0.00756340531214287,21.1179638886716,0.000181697169530165,21.1179638886716,0.000443499862648762,21.1179638886716,0.0108494276048029
"THD Ratio",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
Hz,%,Hz,%,,,,,,,,,,,,
20,83.009797319554,20,82.1460991930652,,,,,,,,,,,,
21.1179638886716,85.3656629417084,21.1179638886716,82.0338466400102,,,,,,,,,,,,
22.2984199401618,90.6674826441566,22.2984199401618,85.7190774666039,,,,,,,,,,,,
"Reference Waveform",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
s,V,s,V,,,,,,,,,,,,
0,0,0,0,,,,,,,,,,,,
2.08333333333333E-05,6.47890208369956E-08,2.08333333333333E-05,6.47890208369956E-08,,,,,,,,,,,,
4.16666666666667E-05,5.18304721721536E-07,4.16666666666667E-05,5.18304721721536E-07,,,,,,,,,,,,
6.25E-05,1.74923655865586E-06,6.25E-05,1.74923655865586E-06,,,,,,,,,,,,
Actually there are a lot of row data in the csv file.
Because the type of measurement can change (increase or decrease) there are 11 types of measurement. I want python to be able to detect the measurement type without having to define it like df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]
.
Thanks.
答案1
得分: 1
您可以使用pd.read_csv
并使用正确的参数:
import pandas as pd
import numpy as np
df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')
dfs = {}
idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
for i, j in zip(idx, idx[1:]+[len(df)]):
name = df.loc[i, 0]
df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
m = pd.to_numeric(df1[0], errors='coerce').notna()
mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)
输出:
>>> list(dfs.keys())
['Level and Distortion',
'THD Ratio',
'Reference Waveform',
'Summary: Signal Path1.Acoustic Response.Level and Distortion',
'Summary: Signal Path1.Acoustic Response.THD Ratio',
'Summary: Signal Path1.Acoustic Response.Reference Waveform']
>>> dfs['Level and Distortion']
Ch1 (F) Ch1 (H2) Ch1 (H3) Ch1 (Total) Ch2 (F) Ch2 (H2) Ch2 (H3) Ch2 (Total)
X Y X Y X Y X Y X Y X Y X Y X Y
Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms
0 20.000000 0.007720 20.000000 0.000056 20.000000 0.000390 20.000000 0.011493 20.000000 0.006998 20.000000 0.000151 20.000000 0.000390 20.000000 0.010080
1 21.117964 0.007472 21.117964 0.000088 21.117964 0.000427 21.117964 0.012246 21.117964 0.007563 21.117964 0.000182 21.117964 0.000443 21.117964 0.010849
2 22.298420 0.005521 22.298420 0.000107 22.298420 0.000408 22.298420 0.011866 22.298420 0.007572 22.298420 0.000209 22.298420 0.000505 22.298420 0.012603
3 23.544861 0.007055 23.544861 0.000269 23.544861 0.000313 23.544861 0.012183 23.544861 0.007603 23.544861 0.000227 23.544861 0.000568 23.544861 0.012779
4 24.860977 0.008749 24.860977 0.000260 24.860977 0.000653 24.860977 0.013460 24.860977 0.005882 24.860977 0.000217 24.860977 0.000364 24.860977 0.013626
...
123 16089.472533 0.060342 NaN NaN NaN NaN NaN NaN 16089.472533 0.061988 NaN NaN NaN NaN NaN NaN
124 16988.844997 0.059465 NaN NaN NaN NaN NaN NaN 16988.844997 0.047697 NaN NaN NaN NaN NaN NaN
125 17938.490757 0.036990 NaN NaN NaN NaN NaN NaN 17938.490757 0.035544 NaN NaN NaN NaN NaN NaN
126 18941.220002 0.061363 NaN NaN NaN NaN NaN NaN 18941.220002 0.050418 NaN NaN NaN NaN NaN NaN
127 20000.000000 0.064279 NaN NaN NaN NaN NaN NaN 20000.000000 0.059796 NaN NaN NaN NaN NaN NaN
[128 rows x 16 columns]
>>> dfs['THD Ratio']
Ch1 Ch2
X Y X Y
Hz % Hz %
0 20.000000 83.009797 20.000000 82.146099
1 21.117964 85.365663 21.117964 82.033847
2 22.298420 90.667483 22.298420 85.719077
3 23.544861 86.536381 23.544861 85.941987
4 24.860977 83.845657 24.860977 91.810750
...
112 8845.029526 58.356075 8845.029526 37.976407
113 9339.450707 32.391086 9339.450707 58.039594
114 9861.509138 34.756184 9861.509138 38.396633
115 10412.749693 31.359953 10412.749693 20.900020
116 10994.803600 35.219231 10994.803600 26.008255
[117 rows x 4 columns]
>>> dfs['Reference Waveform']
Ch1 Ch2
X Y X Y
s V s V
0 0.000000 0.000000e+00 0.000000 0.000000
<details>
<summary>英文:</summary>
You can use `pd.read_csv` with right parameters:
import pandas as pd
import numpy as np
df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')
dfs = {}
idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
for i, j in zip(idx, idx[1:]+[len(df)]):
name = df.loc[i, 0]
df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
m = pd.to_numeric(df1[0], errors='coerce').notna()
mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)
Output:
>>> list(dfs.keys())
['Level and Distortion',
'THD Ratio',
'Reference Waveform',
'Summary: Signal Path1.Acoustic Response.Level and Distortion',
'Summary: Signal Path1.Acoustic Response.THD Ratio',
'Summary: Signal Path1.Acoustic Response.Reference Waveform']
>>> dfs['Level and Distortion']
Ch1 (F) Ch1 (H2) Ch1 (H3) Ch1 (Total) Ch2 (F) Ch2 (H2) Ch2 (H3) Ch2 (Total)
X Y X Y X Y X Y X Y X Y X Y X Y
Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms
0 20.000000 0.007720 20.000000 0.000056 20.000000 0.000390 20.000000 0.011493 20.000000 0.006998 20.000000 0.000151 20.000000 0.000390 20.000000 0.010080
1 21.117964 0.007472 21.117964 0.000088 21.117964 0.000427 21.117964 0.012246 21.117964 0.007563 21.117964 0.000182 21.117964 0.000443 21.117964 0.010849
2 22.298420 0.005521 22.298420 0.000107 22.298420 0.000408 22.298420 0.011866 22.298420 0.007572 22.298420 0.000209 22.298420 0.000505 22.298420 0.012603
3 23.544861 0.007055 23.544861 0.000269 23.544861 0.000313 23.544861 0.012183 23.544861 0.007603 23.544861 0.000227 23.544861 0.000568 23.544861 0.012779
4 24.860977 0.008749 24.860977 0.000260 24.860977 0.000653 24.860977 0.013460 24.860977 0.005882 24.860977 0.000217 24.860977 0.000364 24.860977 0.013626
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123 16089.472533 0.060342 NaN NaN NaN NaN NaN NaN 16089.472533 0.061988 NaN NaN NaN NaN NaN NaN
124 16988.844997 0.059465 NaN NaN NaN NaN NaN NaN 16988.844997 0.047697 NaN NaN NaN NaN NaN NaN
125 17938.490757 0.036990 NaN NaN NaN NaN NaN NaN 17938.490757 0.035544 NaN NaN NaN NaN NaN NaN
126 18941.220002 0.061363 NaN NaN NaN NaN NaN NaN 18941.220002 0.050418 NaN NaN NaN NaN NaN NaN
127 20000.000000 0.064279 NaN NaN NaN NaN NaN NaN 20000.000000 0.059796 NaN NaN NaN NaN NaN NaN
[128 rows x 16 columns]
>>> dfs['THD Ratio']
Ch1 Ch2
X Y X Y
Hz % Hz %
0 20.000000 83.009797 20.000000 82.146099
1 21.117964 85.365663 21.117964 82.033847
2 22.298420 90.667483 22.298420 85.719077
3 23.544861 86.536381 23.544861 85.941987
4 24.860977 83.845657 24.860977 91.810750
.. ... ... ... ...
112 8845.029526 58.356075 8845.029526 37.976407
113 9339.450707 32.391086 9339.450707 58.039594
114 9861.509138 34.756184 9861.509138 38.396633
115 10412.749693 31.359953 10412.749693 20.900020
116 10994.803600 35.219231 10994.803600 26.008255
[117 rows x 4 columns]
>>> dfs['Reference Waveform']
Ch1 Ch2
X Y X Y
s V s V
0 0.000000 0.000000e+00 0.000000 0.000000e+00
1 0.000021 6.478902e-08 0.000021 6.478902e-08
2 0.000042 5.183047e-07 0.000042 5.183047e-07
3 0.000063 1.749237e-06 0.000063 1.749237e-06
4 0.000083 4.146200e-06 0.000083 4.146200e-06
... ... ... ... ...
719994 14.999875 3.035885e-08 14.999875 3.035885e-08
719995 14.999896 -7.711807e-09 14.999896 -7.711807e-09
719996 14.999917 2.917104e-10 14.999917 2.917104e-10
719997 14.999937 3.821995e-10 14.999937 3.821995e-10
719998 14.999958 -4.497611e-11 14.999958 -4.497611e-11
[719999 rows x 4 columns]
>>> dfs['Summary: Signal Path1.Acoustic Response.Level and Distortion']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []
>>> dfs['Summary: Signal Path1.Acoustic Response.THD Ratio']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []
>>> dfs['Summary: Signal Path1.Acoustic Response.Reference Waveform']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []
Concat all dataframes:
>>> pd.concat(dfs, names=['Measurement'], axis=1)
Measurement Level and Distortion ... Summary: Signal Path1.Acoustic Response.THD Ratio Summary: Signal Path1.Acoustic Response.Reference Waveform
Ch1 (F) Ch1 (H2) ... Passed Upper Limit Channel Passed Lower Limit Passed Upper Limit
X Y X Y ... Ch1 Ch1 Ch1 Ch1
Hz Vrms Hz Vrms ... Ch2 Ch2 Ch2 Ch2
0 20.000000 0.007720 20.000000 0.000056 ... NaN NaN NaN NaN
1 21.117964 0.007472 21.117964 0.000088 ... NaN NaN NaN NaN
2 22.298420 0.005521 22.298420 0.000107 ... NaN NaN NaN NaN
3 23.544861 0.007055 23.544861 0.000269 ... NaN NaN NaN NaN
4 24.860977 0.008749 24.860977 0.000260 ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
719994 NaN NaN NaN NaN ... NaN NaN NaN NaN
719995 NaN NaN NaN NaN ... NaN NaN NaN NaN
719996 NaN NaN NaN NaN ... NaN NaN NaN NaN
719997 NaN NaN NaN NaN ... NaN NaN NaN NaN
719998 NaN NaN NaN NaN ... NaN NaN NaN NaN
[719999 rows x 33 columns]
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论