检测CSV文件中的多个标题

huangapple go评论63阅读模式
英文:

Detect multiple headers in CSV files

问题

在CSV文件中,您希望Python能够自动检测测量类型,而不必像df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]那样手动定义。您可以尝试以下方法来实现这一目标:

import pandas as pd

df = pd.read_csv("three measurement.csv", header=None)

# 寻找测量类型行
df_titles = df.iloc[:, 0].str.strip('"')  # 去除双引号并获取第一列

# 创建一个包含唯一值的列表,这些唯一值将成为测量类型的标识
measurement_types = df_titles[df_titles != ""].unique()

# 创建测量类型到数据框的映射字典
measurement_data = {}

for measurement_type in measurement_types:
    # 使用布尔索引获取特定测量类型的数据框
    measurement_df = df[df_titles == measurement_type]
    
    # 去掉测量类型所在的行,并重置索引
    measurement_df = measurement_df.iloc[1:].reset_index(drop=True)
    
    # 创建MultiIndex列标签,从前4行获取信息,并去掉空白字符
    header = measurement_df.iloc[:4, :].apply(lambda x: x.str.strip()).ffill(axis=1)
    
    # 去掉前4行,重置索引,并设置列标签
    measurement_df = measurement_df.iloc[4:].reset_index(drop=True)
    measurement_df.columns = pd.MultiIndex.from_arrays(header.values)
    
    # 存储到字典中
    measurement_data[measurement_type] = measurement_df

# 现在,measurement_data 字典包含了自动检测到的所有测量类型和对应的数据框。

这段代码将自动检测CSV文件中的测量类型,创建一个字典,其中测量类型作为键,对应的数据框作为值。您不再需要手动定义df_titles变量,它将根据CSV文件的内容动态确定测量类型。

英文:

I have a large csv file. Inside the csv file there are several headers. I've separated the three headers using code as below.

import pandas as pd

df = pd.read_csv("three measurement.csv", header=None)

# find header rows
df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]
# create groups for each section
groupings = df.iloc[:, 0].str.contains("|".join(df_titles)).cumsum()

# split into new dataframes as dictionary
d = {}
for i, j in df.groupby(groupings):
    # define name of dictionary key as title, and set data of DF as values
    d[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
                                   # create MultiIndex from 3 header rows
                                   columns=pd.MultiIndex.from_arrays(
                                       j.iloc[0:4, :].ffill(axis=1).values))
    # suggested not to use, but you can set the variables directly (outside of the dictionary)
    globals()[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
                                           columns=pd.MultiIndex.from_arrays(
                                               j.iloc[0:4, :].ffill(axis=1).values))

Level and Distortion, THD Ratio, and Reference Waveform are the types of measurements that I have separated. But the measurement type can change (increase or decrease) how to detect the measurement type without specifying the measurement type as in the code above (df_titles=[".."). This means that in the code above to find the header line, you have to specify the names of the measurement types according to the csv that I show below. If the measurement type is different and increasing, how do I find the header row without having to define the df_title variable.
Here CSV File

"Level and Distortion",,,,,,,,,,,,,,,
"Ch1 (F)",,"Ch1 (H2)",,"Ch1 (H3)",,"Ch1 (Total)",,"Ch2 (F)",,"Ch2 (H2)",,"Ch2 (H3)",,"Ch2 (Total)",
X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y
Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms
20,0.00772013164376534,20,5.60982648239952E-05,20,0.000389709733151927,20,0.011492581958802,20,0.00699792689186063,20,0.000151471712877565,20,0.000389940899485093,20,0.010080448380793
21.1179638886716,0.00747175133180212,21.1179638886716,8.83327496082501E-05,21.1179638886716,0.000426696028852445,21.1179638886716,0.0122462876404656,21.1179638886716,0.00756340531214287,21.1179638886716,0.000181697169530165,21.1179638886716,0.000443499862648762,21.1179638886716,0.0108494276048029
"THD Ratio",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
Hz,%,Hz,%,,,,,,,,,,,,
20,83.009797319554,20,82.1460991930652,,,,,,,,,,,,
21.1179638886716,85.3656629417084,21.1179638886716,82.0338466400102,,,,,,,,,,,,
22.2984199401618,90.6674826441566,22.2984199401618,85.7190774666039,,,,,,,,,,,,
"Reference Waveform",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
s,V,s,V,,,,,,,,,,,,
0,0,0,0,,,,,,,,,,,,
2.08333333333333E-05,6.47890208369956E-08,2.08333333333333E-05,6.47890208369956E-08,,,,,,,,,,,,
4.16666666666667E-05,5.18304721721536E-07,4.16666666666667E-05,5.18304721721536E-07,,,,,,,,,,,,
6.25E-05,1.74923655865586E-06,6.25E-05,1.74923655865586E-06,,,,,,,,,,,,

Actually there are a lot of row data in the csv file.

Because the type of measurement can change (increase or decrease) there are 11 types of measurement. I want python to be able to detect the measurement type without having to define it like df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"].

Thanks.

答案1

得分: 1

您可以使用pd.read_csv并使用正确的参数:

import pandas as pd
import numpy as np

df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')

dfs = {}
idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
for i, j in zip(idx, idx[1:]+[len(df)]):
    name = df.loc[i, 0]
    df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
    m = pd.to_numeric(df1[0], errors='coerce').notna()
    mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
    dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)

输出:

>>> list(dfs.keys())
['Level and Distortion',
 'THD Ratio',
 'Reference Waveform',
 'Summary: Signal Path1.Acoustic Response.Level and Distortion',
 'Summary: Signal Path1.Acoustic Response.THD Ratio',
 'Summary: Signal Path1.Acoustic Response.Reference Waveform']

>>> dfs['Level and Distortion']
          Ch1 (F)             Ch1 (H2)             Ch1 (H3)           Ch1 (Total)                 Ch2 (F)             Ch2 (H2)             Ch2 (H3)           Ch2 (Total)          
                X         Y          X         Y          X         Y           X         Y             X         Y          X         Y          X         Y           X         Y
               Hz      Vrms         Hz      Vrms         Hz      Vrms          Hz      Vrms            Hz      Vrms         Hz      Vrms         Hz      Vrms          Hz      Vrms
0       20.000000  0.007720  20.000000  0.000056  20.000000  0.000390   20.000000  0.011493     20.000000  0.006998  20.000000  0.000151  20.000000  0.000390   20.000000  0.010080
1       21.117964  0.007472  21.117964  0.000088  21.117964  0.000427   21.117964  0.012246     21.117964  0.007563  21.117964  0.000182  21.117964  0.000443   21.117964  0.010849
2       22.298420  0.005521  22.298420  0.000107  22.298420  0.000408   22.298420  0.011866     22.298420  0.007572  22.298420  0.000209  22.298420  0.000505   22.298420  0.012603
3       23.544861  0.007055  23.544861  0.000269  23.544861  0.000313   23.544861  0.012183     23.544861  0.007603  23.544861  0.000227  23.544861  0.000568   23.544861  0.012779
4       24.860977  0.008749  24.860977  0.000260  24.860977  0.000653   24.860977  0.013460     24.860977  0.005882  24.860977  0.000217  24.860977  0.000364   24.860977  0.013626
...
123  16089.472533  0.060342        NaN       NaN        NaN       NaN         NaN       NaN  16089.472533  0.061988        NaN       NaN        NaN       NaN         NaN       NaN
124  16988.844997  0.059465        NaN       NaN        NaN       NaN         NaN       NaN  16988.844997  0.047697        NaN       NaN        NaN       NaN         NaN       NaN
125  17938.490757  0.036990        NaN       NaN        NaN       NaN         NaN       NaN  17938.490757  0.035544        NaN       NaN        NaN       NaN         NaN       NaN
126  18941.220002  0.061363        NaN       NaN        NaN       NaN         NaN       NaN  18941.220002  0.050418        NaN       NaN        NaN       NaN         NaN       NaN
127  20000.000000  0.064279        NaN       NaN        NaN       NaN         NaN       NaN  20000.000000  0.059796        NaN       NaN        NaN       NaN         NaN       NaN

[128 rows x 16 columns]

>>> dfs['THD Ratio']
              Ch1                      Ch2           
                X          Y             X          Y
               Hz          %            Hz          %
0       20.000000  83.009797     20.000000  82.146099
1       21.117964  85.365663     21.117964  82.033847
2       22.298420  90.667483     22.298420  85.719077
3       23.544861  86.536381     23.544861  85.941987
4       24.860977  83.845657     24.860977  91.810750
...
112   8845.029526  58.356075   8845.029526  37.976407
113   9339.450707  32.391086   9339.450707  58.039594
114   9861.509138  34.756184   9861.509138  38.396633
115  10412.749693  31.359953  10412.749693  20.900020
116  10994.803600  35.219231  10994.803600  26.008255

[117 rows x 4 columns]

>>> dfs['Reference Waveform']
              Ch1                      Ch2              
                X             Y          X             Y
                s             V          s             V
0        0.000000  0.000000e+00   0.000000  0.000000

<details>
<summary>英文:</summary>

You can use `pd.read_csv` with right parameters:

import pandas as pd
import numpy as np

df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')

dfs = {}
idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
for i, j in zip(idx, idx[1:]+[len(df)]):
name = df.loc[i, 0]
df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
m = pd.to_numeric(df1[0], errors='coerce').notna()
mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)


Output:

>>> list(dfs.keys())
['Level and Distortion',
'THD Ratio',
'Reference Waveform',
'Summary: Signal Path1.Acoustic Response.Level and Distortion',
'Summary: Signal Path1.Acoustic Response.THD Ratio',
'Summary: Signal Path1.Acoustic Response.Reference Waveform']

>>> dfs['Level and Distortion']
Ch1 (F) Ch1 (H2) Ch1 (H3) Ch1 (Total) Ch2 (F) Ch2 (H2) Ch2 (H3) Ch2 (Total)
X Y X Y X Y X Y X Y X Y X Y X Y
Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms
0 20.000000 0.007720 20.000000 0.000056 20.000000 0.000390 20.000000 0.011493 20.000000 0.006998 20.000000 0.000151 20.000000 0.000390 20.000000 0.010080
1 21.117964 0.007472 21.117964 0.000088 21.117964 0.000427 21.117964 0.012246 21.117964 0.007563 21.117964 0.000182 21.117964 0.000443 21.117964 0.010849
2 22.298420 0.005521 22.298420 0.000107 22.298420 0.000408 22.298420 0.011866 22.298420 0.007572 22.298420 0.000209 22.298420 0.000505 22.298420 0.012603
3 23.544861 0.007055 23.544861 0.000269 23.544861 0.000313 23.544861 0.012183 23.544861 0.007603 23.544861 0.000227 23.544861 0.000568 23.544861 0.012779
4 24.860977 0.008749 24.860977 0.000260 24.860977 0.000653 24.860977 0.013460 24.860977 0.005882 24.860977 0.000217 24.860977 0.000364 24.860977 0.013626
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123 16089.472533 0.060342 NaN NaN NaN NaN NaN NaN 16089.472533 0.061988 NaN NaN NaN NaN NaN NaN
124 16988.844997 0.059465 NaN NaN NaN NaN NaN NaN 16988.844997 0.047697 NaN NaN NaN NaN NaN NaN
125 17938.490757 0.036990 NaN NaN NaN NaN NaN NaN 17938.490757 0.035544 NaN NaN NaN NaN NaN NaN
126 18941.220002 0.061363 NaN NaN NaN NaN NaN NaN 18941.220002 0.050418 NaN NaN NaN NaN NaN NaN
127 20000.000000 0.064279 NaN NaN NaN NaN NaN NaN 20000.000000 0.059796 NaN NaN NaN NaN NaN NaN

[128 rows x 16 columns]

>>> dfs['THD Ratio']
Ch1 Ch2
X Y X Y
Hz % Hz %
0 20.000000 83.009797 20.000000 82.146099
1 21.117964 85.365663 21.117964 82.033847
2 22.298420 90.667483 22.298420 85.719077
3 23.544861 86.536381 23.544861 85.941987
4 24.860977 83.845657 24.860977 91.810750
.. ... ... ... ...
112 8845.029526 58.356075 8845.029526 37.976407
113 9339.450707 32.391086 9339.450707 58.039594
114 9861.509138 34.756184 9861.509138 38.396633
115 10412.749693 31.359953 10412.749693 20.900020
116 10994.803600 35.219231 10994.803600 26.008255

[117 rows x 4 columns]

>>> dfs['Reference Waveform']
Ch1 Ch2
X Y X Y
s V s V
0 0.000000 0.000000e+00 0.000000 0.000000e+00
1 0.000021 6.478902e-08 0.000021 6.478902e-08
2 0.000042 5.183047e-07 0.000042 5.183047e-07
3 0.000063 1.749237e-06 0.000063 1.749237e-06
4 0.000083 4.146200e-06 0.000083 4.146200e-06
... ... ... ... ...
719994 14.999875 3.035885e-08 14.999875 3.035885e-08
719995 14.999896 -7.711807e-09 14.999896 -7.711807e-09
719996 14.999917 2.917104e-10 14.999917 2.917104e-10
719997 14.999937 3.821995e-10 14.999937 3.821995e-10
719998 14.999958 -4.497611e-11 14.999958 -4.497611e-11

[719999 rows x 4 columns]

>>> dfs['Summary: Signal Path1.Acoustic Response.Level and Distortion']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

>>> dfs['Summary: Signal Path1.Acoustic Response.THD Ratio']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

>>> dfs['Summary: Signal Path1.Acoustic Response.Reference Waveform']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []


Concat all dataframes:

>>> pd.concat(dfs, names=['Measurement'], axis=1)

Measurement Level and Distortion ... Summary: Signal Path1.Acoustic Response.THD Ratio Summary: Signal Path1.Acoustic Response.Reference Waveform
Ch1 (F) Ch1 (H2) ... Passed Upper Limit Channel Passed Lower Limit Passed Upper Limit
X Y X Y ... Ch1 Ch1 Ch1 Ch1
Hz Vrms Hz Vrms ... Ch2 Ch2 Ch2 Ch2
0 20.000000 0.007720 20.000000 0.000056 ... NaN NaN NaN NaN
1 21.117964 0.007472 21.117964 0.000088 ... NaN NaN NaN NaN
2 22.298420 0.005521 22.298420 0.000107 ... NaN NaN NaN NaN
3 23.544861 0.007055 23.544861 0.000269 ... NaN NaN NaN NaN
4 24.860977 0.008749 24.860977 0.000260 ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
719994 NaN NaN NaN NaN ... NaN NaN NaN NaN
719995 NaN NaN NaN NaN ... NaN NaN NaN NaN
719996 NaN NaN NaN NaN ... NaN NaN NaN NaN
719997 NaN NaN NaN NaN ... NaN NaN NaN NaN
719998 NaN NaN NaN NaN ... NaN NaN NaN NaN

[719999 rows x 33 columns]


</details>

huangapple
  • 本文由 发表于 2023年5月30日 10:29:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/76361273.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定