Detect multiple headers in CSV files
Question
You want Python to detect the measurement types in the CSV file automatically, instead of defining them by hand as in df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]. You can try the following approach to achieve this:
import pandas as pd

df = pd.read_csv("three measurement.csv", header=None)

# A measurement-type row has text in the first column and nothing in the other
# columns, so the title rows can be found without hardcoding their names
is_title = df.iloc[:, 0].notna() & df.iloc[:, 1:].isna().all(axis=1)
# The detected measurement types, in the order they appear in the file
measurement_types = df.loc[is_title, 0].tolist()

# Every row belongs to the section opened by the most recent title row
section_ids = is_title.cumsum()

# Build a dictionary mapping each measurement type to its own dataframe
measurement_data = {}
for _, section in df.groupby(section_ids):
    # The first cell of the section is the measurement type
    measurement_type = section.iloc[0, 0]

    # Drop the title row and reset the index
    section = section.iloc[1:].reset_index(drop=True)

    # The next three rows (channel, X/Y, unit) become the MultiIndex column labels
    header = section.iloc[:3].ffill(axis=1)

    # The remaining rows are the data; attach the column labels
    section = section.iloc[3:].reset_index(drop=True)
    section.columns = pd.MultiIndex.from_arrays(header.values)

    measurement_data[measurement_type] = section

# measurement_data now contains every automatically detected measurement type
# and its corresponding dataframe
This code automatically detects the measurement types in the CSV file and builds a dictionary with the measurement types as keys and the corresponding dataframes as values. The df_titles variable no longer has to be defined by hand; the measurement types are determined dynamically from the contents of the CSV file.
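As a quick check, you can list the detected sections and preview one of them, for example (assuming the sample file shown further below, which contains a "Level and Distortion" section):

# Show which measurement types were detected and preview one section
print(list(measurement_data.keys()))
print(measurement_data["Level and Distortion"].head())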
Original question (English):
I have a large CSV file. Inside the CSV file there are several headers. I've separated the three headers using the code below.
import pandas as pd
df = pd.read_csv("three measurement.csv", header=None)
# find header rows
df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]
# create groups for each section
groupings = df.iloc[:, 0].str.contains("|".join(df_titles)).cumsum()
# split into new dataframes as dictionary
d = {}
for i, j in df.groupby(groupings):
    # define name of dictionary key as title, and set data of DF as values
    d[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
                                   # create MultiIndex from the title row plus the 3 header rows
                                   columns=pd.MultiIndex.from_arrays(
                                       j.iloc[0:4, :].ffill(axis=1).values))
    # not recommended, but you can also set the variables directly (outside of the dictionary)
    globals()[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
                                           columns=pd.MultiIndex.from_arrays(
                                               j.iloc[0:4, :].ffill(axis=1).values))
Level and Distortion, THD Ratio, and Reference Waveform are the measurement types that I have separated. But the measurement types can change (increase or decrease), so how can I detect them without specifying them as in the code above (df_titles = [..])? In other words, to find the header rows with the code above, I have to spell out the measurement-type names to match the CSV that I show below. If the measurement types are different, or there are more of them, how do I find the header rows without having to define the df_titles variable?
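For reference, this is how the cumsum grouping in the code above assigns section numbers; a toy illustration (the short col series below is made up, not taken from the real file):

import pandas as pd

# Shortened stand-in for the first CSV column: two title rows plus a few section rows
col = pd.Series(["Level and Distortion", "Ch1 (F)", "20", "THD Ratio", "Ch1", "20"])
titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]

# Each title row starts a new group; every row after it inherits that group number
print(col.str.contains("|".join(titles)).cumsum().tolist())
# [1, 1, 1, 2, 2, 2]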
Here is the CSV file:
"Level and Distortion",,,,,,,,,,,,,,,
"Ch1 (F)",,"Ch1 (H2)",,"Ch1 (H3)",,"Ch1 (Total)",,"Ch2 (F)",,"Ch2 (H2)",,"Ch2 (H3)",,"Ch2 (Total)",
X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y
Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms
20,0.00772013164376534,20,5.60982648239952E-05,20,0.000389709733151927,20,0.011492581958802,20,0.00699792689186063,20,0.000151471712877565,20,0.000389940899485093,20,0.010080448380793
21.1179638886716,0.00747175133180212,21.1179638886716,8.83327496082501E-05,21.1179638886716,0.000426696028852445,21.1179638886716,0.0122462876404656,21.1179638886716,0.00756340531214287,21.1179638886716,0.000181697169530165,21.1179638886716,0.000443499862648762,21.1179638886716,0.0108494276048029
"THD Ratio",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
Hz,%,Hz,%,,,,,,,,,,,,
20,83.009797319554,20,82.1460991930652,,,,,,,,,,,,
21.1179638886716,85.3656629417084,21.1179638886716,82.0338466400102,,,,,,,,,,,,
22.2984199401618,90.6674826441566,22.2984199401618,85.7190774666039,,,,,,,,,,,,
"Reference Waveform",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
s,V,s,V,,,,,,,,,,,,
0,0,0,0,,,,,,,,,,,,
2.08333333333333E-05,6.47890208369956E-08,2.08333333333333E-05,6.47890208369956E-08,,,,,,,,,,,,
4.16666666666667E-05,5.18304721721536E-07,4.16666666666667E-05,5.18304721721536E-07,,,,,,,,,,,,
6.25E-05,1.74923655865586E-06,6.25E-05,1.74923655865586E-06,,,,,,,,,,,,
Actually there are a lot of data rows in the CSV file.
Because the measurement types can change (increase or decrease), and there are 11 measurement types, I want Python to be able to detect the measurement types without having to define them like df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"].
Thanks.
Answer 1
Score: 1
You can use pd.read_csv with the right parameters:
import pandas as pd
import numpy as np

# Read everything as text; only truly empty fields become NaN
df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')

dfs = {}
# Title rows are the rows with exactly one non-empty cell (the measurement name)
idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
# Each section runs from one title row to the next one (or to the end of the file)
for i, j in zip(idx, idx[1:]+[len(df)]):
    name = df.loc[i, 0]
    # Rows of this section, without all-empty columns; 'N/A' values become NaN
    df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
    # Data rows start with a number; everything else is a header row
    m = pd.to_numeric(df1[0], errors='coerce').notna()
    # The header rows, forward-filled across columns, become the MultiIndex labels
    mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
    dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)
Output:
>>> list(dfs.keys())
['Level and Distortion',
 'THD Ratio',
 'Reference Waveform',
 'Summary: Signal Path1.Acoustic Response.Level and Distortion',
 'Summary: Signal Path1.Acoustic Response.THD Ratio',
 'Summary: Signal Path1.Acoustic Response.Reference Waveform']
>>> dfs['Level and Distortion']
          Ch1 (F)             Ch1 (H2)             Ch1 (H3)           Ch1 (Total)                 Ch2 (F)             Ch2 (H2)             Ch2 (H3)           Ch2 (Total)          
                X         Y          X         Y          X         Y           X         Y             X         Y          X         Y          X         Y           X         Y
               Hz      Vrms         Hz      Vrms         Hz      Vrms          Hz      Vrms            Hz      Vrms         Hz      Vrms         Hz      Vrms          Hz      Vrms
0       20.000000  0.007720  20.000000  0.000056  20.000000  0.000390   20.000000  0.011493     20.000000  0.006998  20.000000  0.000151  20.000000  0.000390   20.000000  0.010080
1       21.117964  0.007472  21.117964  0.000088  21.117964  0.000427   21.117964  0.012246     21.117964  0.007563  21.117964  0.000182  21.117964  0.000443   21.117964  0.010849
2       22.298420  0.005521  22.298420  0.000107  22.298420  0.000408   22.298420  0.011866     22.298420  0.007572  22.298420  0.000209  22.298420  0.000505   22.298420  0.012603
3       23.544861  0.007055  23.544861  0.000269  23.544861  0.000313   23.544861  0.012183     23.544861  0.007603  23.544861  0.000227  23.544861  0.000568   23.544861  0.012779
4       24.860977  0.008749  24.860977  0.000260  24.860977  0.000653   24.860977  0.013460     24.860977  0.005882  24.860977  0.000217  24.860977  0.000364   24.860977  0.013626
..            ...       ...        ...       ...        ...       ...         ...       ...           ...       ...        ...       ...        ...       ...         ...       ...
123  16089.472533  0.060342        NaN       NaN        NaN       NaN         NaN       NaN  16089.472533  0.061988        NaN       NaN        NaN       NaN         NaN       NaN
124  16988.844997  0.059465        NaN       NaN        NaN       NaN         NaN       NaN  16988.844997  0.047697        NaN       NaN        NaN       NaN         NaN       NaN
125  17938.490757  0.036990        NaN       NaN        NaN       NaN         NaN       NaN  17938.490757  0.035544        NaN       NaN        NaN       NaN         NaN       NaN
126  18941.220002  0.061363        NaN       NaN        NaN       NaN         NaN       NaN  18941.220002  0.050418        NaN       NaN        NaN       NaN         NaN       NaN
127  20000.000000  0.064279        NaN       NaN        NaN       NaN         NaN       NaN  20000.000000  0.059796        NaN       NaN        NaN       NaN         NaN       NaN
[128 rows x 16 columns]
>>> dfs['THD Ratio']
              Ch1                      Ch2           
                X          Y             X          Y
               Hz          %            Hz          %
0       20.000000  83.009797     20.000000  82.146099
1       21.117964  85.365663     21.117964  82.033847
2       22.298420  90.667483     22.298420  85.719077
3       23.544861  86.536381     23.544861  85.941987
4       24.860977  83.845657     24.860977  91.810750
..            ...        ...           ...        ...
112   8845.029526  58.356075   8845.029526  37.976407
113   9339.450707  32.391086   9339.450707  58.039594
114   9861.509138  34.756184   9861.509138  38.396633
115  10412.749693  31.359953  10412.749693  20.900020
116  10994.803600  35.219231  10994.803600  26.008255
[117 rows x 4 columns]
>>> dfs['Reference Waveform']
              Ch1                      Ch2              
                X             Y          X             Y
                s             V          s             V
0        0.000000  0.000000e+00   0.000000  0.000000e+00
1        0.000021  6.478902e-08   0.000021  6.478902e-08
2        0.000042  5.183047e-07   0.000042  5.183047e-07
3        0.000063  1.749237e-06   0.000063  1.749237e-06
4        0.000083  4.146200e-06   0.000083  4.146200e-06
...           ...           ...        ...           ...
719994  14.999875  3.035885e-08  14.999875  3.035885e-08
719995  14.999896 -7.711807e-09  14.999896 -7.711807e-09
719996  14.999917  2.917104e-10  14.999917  2.917104e-10
719997  14.999937  3.821995e-10  14.999937  3.821995e-10
719998  14.999958 -4.497611e-11  14.999958 -4.497611e-11
[719999 rows x 4 columns]
>>> dfs['Summary: Signal Path1.Acoustic Response.Level and Distortion']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []
>>> dfs['Summary: Signal Path1.Acoustic Response.THD Ratio']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []
>>> dfs['Summary: Signal Path1.Acoustic Response.Reference Waveform']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []
Concat all dataframes:
>>> pd.concat(dfs, names=['Measurement'], axis=1)
Measurement Level and Distortion                                 ... Summary: Signal Path1.Acoustic Response.THD Ratio Summary: Signal Path1.Acoustic Response.Reference Waveform                                   
Ch1 (F)             Ch1 (H2)            ...                                Passed Upper Limit                                                    Channel Passed Lower Limit Passed Upper Limit
X         Y          X         Y  ...                                               Ch1                                                        Ch1                Ch1                Ch1
Hz      Vrms         Hz      Vrms  ...                                               Ch2                                                        Ch2                Ch2                Ch2
0                      20.000000  0.007720  20.000000  0.000056  ...                                               NaN                                                NaN                        NaN                NaN
1                      21.117964  0.007472  21.117964  0.000088  ...                                               NaN                                                NaN                        NaN                NaN
2                      22.298420  0.005521  22.298420  0.000107  ...                                               NaN                                                NaN                        NaN                NaN
3                      23.544861  0.007055  23.544861  0.000269  ...                                               NaN                                                NaN                        NaN                NaN
4                      24.860977  0.008749  24.860977  0.000260  ...                                               NaN                                                NaN                        NaN                NaN
...                          ...       ...        ...       ...  ...                                               ...                                                ...                        ...                ...
719994                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719995                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719996                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719997                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
719998                       NaN       NaN        NaN       NaN  ...                                               NaN                                                NaN                        NaN                NaN
[719999 rows x 33 columns]
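The Summary sections come back empty because none of their rows begin with a numeric value, so all of their rows end up as header rows. If you only want the sections that actually contain data, a minimal follow-up sketch (reusing the dfs dictionary built above; the data_only and combined names are just for illustration) could filter them out before concatenating:

# Keep only the sections that actually contain data rows, then concatenate them
data_only = {name: frame for name, frame in dfs.items() if not frame.empty}
combined = pd.concat(data_only, names=['Measurement'], axis=1)
print(combined['THD Ratio'].head())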

