检测CSV文件中的多个标题

huangapple go评论97阅读模式
英文:

Detect multiple headers in CSV files

问题

在CSV文件中,您希望Python能够自动检测测量类型,而不必像df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]那样手动定义。您可以尝试以下方法来实现这一目标:

  1. import pandas as pd
  2. df = pd.read_csv("three measurement.csv", header=None)
  3. # 寻找测量类型行
  4. df_titles = df.iloc[:, 0].str.strip('"') # 去除双引号并获取第一列
  5. # 创建一个包含唯一值的列表,这些唯一值将成为测量类型的标识
  6. measurement_types = df_titles[df_titles != ""].unique()
  7. # 创建测量类型到数据框的映射字典
  8. measurement_data = {}
  9. for measurement_type in measurement_types:
  10. # 使用布尔索引获取特定测量类型的数据框
  11. measurement_df = df[df_titles == measurement_type]
  12. # 去掉测量类型所在的行,并重置索引
  13. measurement_df = measurement_df.iloc[1:].reset_index(drop=True)
  14. # 创建MultiIndex列标签,从前4行获取信息,并去掉空白字符
  15. header = measurement_df.iloc[:4, :].apply(lambda x: x.str.strip()).ffill(axis=1)
  16. # 去掉前4行,重置索引,并设置列标签
  17. measurement_df = measurement_df.iloc[4:].reset_index(drop=True)
  18. measurement_df.columns = pd.MultiIndex.from_arrays(header.values)
  19. # 存储到字典中
  20. measurement_data[measurement_type] = measurement_df
  21. # 现在,measurement_data 字典包含了自动检测到的所有测量类型和对应的数据框。

这段代码将自动检测CSV文件中的测量类型,创建一个字典,其中测量类型作为键,对应的数据框作为值。您不再需要手动定义df_titles变量,它将根据CSV文件的内容动态确定测量类型。

英文:

I have a large csv file. Inside the csv file there are several headers. I've separated the three headers using code as below.

  1. import pandas as pd
  2. df = pd.read_csv("three measurement.csv", header=None)
  3. # find header rows
  4. df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"]
  5. # create groups for each section
  6. groupings = df.iloc[:, 0].str.contains("|".join(df_titles)).cumsum()
  7. # split into new dataframes as dictionary
  8. d = {}
  9. for i, j in df.groupby(groupings):
  10. # define name of dictionary key as title, and set data of DF as values
  11. d[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
  12. # create MultiIndex from 3 header rows
  13. columns=pd.MultiIndex.from_arrays(
  14. j.iloc[0:4, :].ffill(axis=1).values))
  15. # suggested not to use, but you can set the variables directly (outside of the dictionary)
  16. globals()[j.iloc[0, 0]] = pd.DataFrame(data=j.values[4:, :],
  17. columns=pd.MultiIndex.from_arrays(
  18. j.iloc[0:4, :].ffill(axis=1).values))

Level and Distortion, THD Ratio, and Reference Waveform are the types of measurements that I have separated. But the measurement type can change (increase or decrease) how to detect the measurement type without specifying the measurement type as in the code above (df_titles=[".."). This means that in the code above to find the header line, you have to specify the names of the measurement types according to the csv that I show below. If the measurement type is different and increasing, how do I find the header row without having to define the df_title variable.
Here CSV File

  1. "Level and Distortion",,,,,,,,,,,,,,,
  2. "Ch1 (F)",,"Ch1 (H2)",,"Ch1 (H3)",,"Ch1 (Total)",,"Ch2 (F)",,"Ch2 (H2)",,"Ch2 (H3)",,"Ch2 (Total)",
  3. X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y
  4. Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms
  5. 20,0.00772013164376534,20,5.60982648239952E-05,20,0.000389709733151927,20,0.011492581958802,20,0.00699792689186063,20,0.000151471712877565,20,0.000389940899485093,20,0.010080448380793
  6. 21.1179638886716,0.00747175133180212,21.1179638886716,8.83327496082501E-05,21.1179638886716,0.000426696028852445,21.1179638886716,0.0122462876404656,21.1179638886716,0.00756340531214287,21.1179638886716,0.000181697169530165,21.1179638886716,0.000443499862648762,21.1179638886716,0.0108494276048029
  7. "THD Ratio",,,,,,,,,,,,,,,
  8. Ch1,,Ch2,,,,,,,,,,,,,
  9. X,Y,X,Y,,,,,,,,,,,,
  10. Hz,%,Hz,%,,,,,,,,,,,,
  11. 20,83.009797319554,20,82.1460991930652,,,,,,,,,,,,
  12. 21.1179638886716,85.3656629417084,21.1179638886716,82.0338466400102,,,,,,,,,,,,
  13. 22.2984199401618,90.6674826441566,22.2984199401618,85.7190774666039,,,,,,,,,,,,
  14. "Reference Waveform",,,,,,,,,,,,,,,
  15. Ch1,,Ch2,,,,,,,,,,,,,
  16. X,Y,X,Y,,,,,,,,,,,,
  17. s,V,s,V,,,,,,,,,,,,
  18. 0,0,0,0,,,,,,,,,,,,
  19. 2.08333333333333E-05,6.47890208369956E-08,2.08333333333333E-05,6.47890208369956E-08,,,,,,,,,,,,
  20. 4.16666666666667E-05,5.18304721721536E-07,4.16666666666667E-05,5.18304721721536E-07,,,,,,,,,,,,
  21. 6.25E-05,1.74923655865586E-06,6.25E-05,1.74923655865586E-06,,,,,,,,,,,,

Actually there are a lot of row data in the csv file.

Because the type of measurement can change (increase or decrease) there are 11 types of measurement. I want python to be able to detect the measurement type without having to define it like df_titles = ["Level and Distortion", "THD Ratio", "Reference Waveform"].

Thanks.

答案1

得分: 1

您可以使用pd.read_csv并使用正确的参数:

  1. import pandas as pd
  2. import numpy as np
  3. df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')
  4. dfs = {}
  5. idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
  6. for i, j in zip(idx, idx[1:]+[len(df)]):
  7. name = df.loc[i, 0]
  8. df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
  9. m = pd.to_numeric(df1[0], errors='coerce').notna()
  10. mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
  11. dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)

输出:

  1. >>> list(dfs.keys())
  2. ['Level and Distortion',
  3. 'THD Ratio',
  4. 'Reference Waveform',
  5. 'Summary: Signal Path1.Acoustic Response.Level and Distortion',
  6. 'Summary: Signal Path1.Acoustic Response.THD Ratio',
  7. 'Summary: Signal Path1.Acoustic Response.Reference Waveform']
  8. >>> dfs['Level and Distortion']
  9. Ch1 (F) Ch1 (H2) Ch1 (H3) Ch1 (Total) Ch2 (F) Ch2 (H2) Ch2 (H3) Ch2 (Total)
  10. X Y X Y X Y X Y X Y X Y X Y X Y
  11. Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms
  12. 0 20.000000 0.007720 20.000000 0.000056 20.000000 0.000390 20.000000 0.011493 20.000000 0.006998 20.000000 0.000151 20.000000 0.000390 20.000000 0.010080
  13. 1 21.117964 0.007472 21.117964 0.000088 21.117964 0.000427 21.117964 0.012246 21.117964 0.007563 21.117964 0.000182 21.117964 0.000443 21.117964 0.010849
  14. 2 22.298420 0.005521 22.298420 0.000107 22.298420 0.000408 22.298420 0.011866 22.298420 0.007572 22.298420 0.000209 22.298420 0.000505 22.298420 0.012603
  15. 3 23.544861 0.007055 23.544861 0.000269 23.544861 0.000313 23.544861 0.012183 23.544861 0.007603 23.544861 0.000227 23.544861 0.000568 23.544861 0.012779
  16. 4 24.860977 0.008749 24.860977 0.000260 24.860977 0.000653 24.860977 0.013460 24.860977 0.005882 24.860977 0.000217 24.860977 0.000364 24.860977 0.013626
  17. ...
  18. 123 16089.472533 0.060342 NaN NaN NaN NaN NaN NaN 16089.472533 0.061988 NaN NaN NaN NaN NaN NaN
  19. 124 16988.844997 0.059465 NaN NaN NaN NaN NaN NaN 16988.844997 0.047697 NaN NaN NaN NaN NaN NaN
  20. 125 17938.490757 0.036990 NaN NaN NaN NaN NaN NaN 17938.490757 0.035544 NaN NaN NaN NaN NaN NaN
  21. 126 18941.220002 0.061363 NaN NaN NaN NaN NaN NaN 18941.220002 0.050418 NaN NaN NaN NaN NaN NaN
  22. 127 20000.000000 0.064279 NaN NaN NaN NaN NaN NaN 20000.000000 0.059796 NaN NaN NaN NaN NaN NaN
  23. [128 rows x 16 columns]
  24. >>> dfs['THD Ratio']
  25. Ch1 Ch2
  26. X Y X Y
  27. Hz % Hz %
  28. 0 20.000000 83.009797 20.000000 82.146099
  29. 1 21.117964 85.365663 21.117964 82.033847
  30. 2 22.298420 90.667483 22.298420 85.719077
  31. 3 23.544861 86.536381 23.544861 85.941987
  32. 4 24.860977 83.845657 24.860977 91.810750
  33. ...
  34. 112 8845.029526 58.356075 8845.029526 37.976407
  35. 113 9339.450707 32.391086 9339.450707 58.039594
  36. 114 9861.509138 34.756184 9861.509138 38.396633
  37. 115 10412.749693 31.359953 10412.749693 20.900020
  38. 116 10994.803600 35.219231 10994.803600 26.008255
  39. [117 rows x 4 columns]
  40. >>> dfs['Reference Waveform']
  41. Ch1 Ch2
  42. X Y X Y
  43. s V s V
  44. 0 0.000000 0.000000e+00 0.000000 0.000000
  45. <details>
  46. <summary>英文:</summary>
  47. You can use `pd.read_csv` with right parameters:

import pandas as pd
import numpy as np

df = pd.read_csv('three measurement.csv', dtype='str', header=None, keep_default_na=False, na_values='')

dfs = {}
idx = df.index[df.notna().sum(axis=1).eq(1)].tolist()
for i, j in zip(idx, idx[1:]+[len(df)]):
name = df.loc[i, 0]
df1 = df.loc[i+1:j-1].dropna(how='all', axis=1).replace('N/A', np.nan)
m = pd.to_numeric(df1[0], errors='coerce').notna()
mi = pd.MultiIndex.from_arrays(df1.loc[~m].ffill(axis=1).values)
dfs[name] = pd.DataFrame(df1.loc[m].values.astype(float), columns=mi)

  1. Output:

>>> list(dfs.keys())
['Level and Distortion',
'THD Ratio',
'Reference Waveform',
'Summary: Signal Path1.Acoustic Response.Level and Distortion',
'Summary: Signal Path1.Acoustic Response.THD Ratio',
'Summary: Signal Path1.Acoustic Response.Reference Waveform']

>>> dfs['Level and Distortion']
Ch1 (F) Ch1 (H2) Ch1 (H3) Ch1 (Total) Ch2 (F) Ch2 (H2) Ch2 (H3) Ch2 (Total)
X Y X Y X Y X Y X Y X Y X Y X Y
Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms Hz Vrms
0 20.000000 0.007720 20.000000 0.000056 20.000000 0.000390 20.000000 0.011493 20.000000 0.006998 20.000000 0.000151 20.000000 0.000390 20.000000 0.010080
1 21.117964 0.007472 21.117964 0.000088 21.117964 0.000427 21.117964 0.012246 21.117964 0.007563 21.117964 0.000182 21.117964 0.000443 21.117964 0.010849
2 22.298420 0.005521 22.298420 0.000107 22.298420 0.000408 22.298420 0.011866 22.298420 0.007572 22.298420 0.000209 22.298420 0.000505 22.298420 0.012603
3 23.544861 0.007055 23.544861 0.000269 23.544861 0.000313 23.544861 0.012183 23.544861 0.007603 23.544861 0.000227 23.544861 0.000568 23.544861 0.012779
4 24.860977 0.008749 24.860977 0.000260 24.860977 0.000653 24.860977 0.013460 24.860977 0.005882 24.860977 0.000217 24.860977 0.000364 24.860977 0.013626
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123 16089.472533 0.060342 NaN NaN NaN NaN NaN NaN 16089.472533 0.061988 NaN NaN NaN NaN NaN NaN
124 16988.844997 0.059465 NaN NaN NaN NaN NaN NaN 16988.844997 0.047697 NaN NaN NaN NaN NaN NaN
125 17938.490757 0.036990 NaN NaN NaN NaN NaN NaN 17938.490757 0.035544 NaN NaN NaN NaN NaN NaN
126 18941.220002 0.061363 NaN NaN NaN NaN NaN NaN 18941.220002 0.050418 NaN NaN NaN NaN NaN NaN
127 20000.000000 0.064279 NaN NaN NaN NaN NaN NaN 20000.000000 0.059796 NaN NaN NaN NaN NaN NaN

[128 rows x 16 columns]

>>> dfs['THD Ratio']
Ch1 Ch2
X Y X Y
Hz % Hz %
0 20.000000 83.009797 20.000000 82.146099
1 21.117964 85.365663 21.117964 82.033847
2 22.298420 90.667483 22.298420 85.719077
3 23.544861 86.536381 23.544861 85.941987
4 24.860977 83.845657 24.860977 91.810750
.. ... ... ... ...
112 8845.029526 58.356075 8845.029526 37.976407
113 9339.450707 32.391086 9339.450707 58.039594
114 9861.509138 34.756184 9861.509138 38.396633
115 10412.749693 31.359953 10412.749693 20.900020
116 10994.803600 35.219231 10994.803600 26.008255

[117 rows x 4 columns]

>>> dfs['Reference Waveform']
Ch1 Ch2
X Y X Y
s V s V
0 0.000000 0.000000e+00 0.000000 0.000000e+00
1 0.000021 6.478902e-08 0.000021 6.478902e-08
2 0.000042 5.183047e-07 0.000042 5.183047e-07
3 0.000063 1.749237e-06 0.000063 1.749237e-06
4 0.000083 4.146200e-06 0.000083 4.146200e-06
... ... ... ... ...
719994 14.999875 3.035885e-08 14.999875 3.035885e-08
719995 14.999896 -7.711807e-09 14.999896 -7.711807e-09
719996 14.999917 2.917104e-10 14.999917 2.917104e-10
719997 14.999937 3.821995e-10 14.999937 3.821995e-10
719998 14.999958 -4.497611e-11 14.999958 -4.497611e-11

[719999 rows x 4 columns]

>>> dfs['Summary: Signal Path1.Acoustic Response.Level and Distortion']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

>>> dfs['Summary: Signal Path1.Acoustic Response.THD Ratio']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

>>> dfs['Summary: Signal Path1.Acoustic Response.Reference Waveform']
Empty DataFrame
Columns: [(Channel, Ch1, Ch2), (Passed Lower Limit, Ch1, Ch2), (Passed Upper Limit, Ch1, Ch2)]
Index: []

  1. Concat all dataframes:

>>> pd.concat(dfs, names=['Measurement'], axis=1)

Measurement Level and Distortion ... Summary: Signal Path1.Acoustic Response.THD Ratio Summary: Signal Path1.Acoustic Response.Reference Waveform
Ch1 (F) Ch1 (H2) ... Passed Upper Limit Channel Passed Lower Limit Passed Upper Limit
X Y X Y ... Ch1 Ch1 Ch1 Ch1
Hz Vrms Hz Vrms ... Ch2 Ch2 Ch2 Ch2
0 20.000000 0.007720 20.000000 0.000056 ... NaN NaN NaN NaN
1 21.117964 0.007472 21.117964 0.000088 ... NaN NaN NaN NaN
2 22.298420 0.005521 22.298420 0.000107 ... NaN NaN NaN NaN
3 23.544861 0.007055 23.544861 0.000269 ... NaN NaN NaN NaN
4 24.860977 0.008749 24.860977 0.000260 ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
719994 NaN NaN NaN NaN ... NaN NaN NaN NaN
719995 NaN NaN NaN NaN ... NaN NaN NaN NaN
719996 NaN NaN NaN NaN ... NaN NaN NaN NaN
719997 NaN NaN NaN NaN ... NaN NaN NaN NaN
719998 NaN NaN NaN NaN ... NaN NaN NaN NaN

[719999 rows x 33 columns]

  1. </details>

huangapple
  • 本文由 发表于 2023年5月30日 10:29:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/76361273.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定