How to skip non-valid (space-separated) rows in a CSV file rather than skipping the first N rows
A more generic way to handle this in pandas, without hard-coding the number of rows to skip or the comment character, is to read every line, let the free-text rows come through as mostly-NaN, and filter them out afterwards:

import pandas as pd

# Read everything; lines without ':' separators fill all but the
# first column with NaN, and the header row has an empty first field
df = pd.read_csv(
    'data/neutrinos.csv',
    sep=':',
    header=None,
    names=['idx', 'azimuth', 'zenith', 'bjorkeny', 'energy',
           'pos_x', 'pos_y', 'pos_z', 'proba_track', 'proba_cscd'],
    on_bad_lines='skip',
)

# Drop the preamble, marker, and header rows (they all contain NaN fields)
df.dropna(inplace=True)

# The mixed text forced every column to object dtype, so restore
# numbers explicitly, treating ',' as the decimal mark
df = df.apply(lambda col: pd.to_numeric(col.astype(str).str.replace(',', '.')))
df = df.set_index('idx')

This reads the whole file and filters the non-data rows out after the fact, so neither skiprows nor comment needs to be specified.
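As a quick sanity check, the dropna-based filtering can be run against the sample from the question (a sketch; an in-memory string stands in for `data/neutrinos.csv`):

```python
import io
import pandas as pd

sample = """Provided by someone, unformatted by another-one for some purposes,
another-one would never hand out such a mess ;)
Have fun!
$ Column names
:azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd
$ Data
0:2,3495370211373316:1,1160038417256017:0,04899799823760986:3,3664000034332275:52,74:28,831:401,18600000000004:0,8243512974051896:0,17564870259481039
1:5,575785663044353:1,7428377336692398:0,28047099709510803:3,890000104904175:48,369:29,865:417,282:0,8183632734530938:0,18163672654690619
"""

names = ['idx', 'azimuth', 'zenith', 'bjorkeny', 'energy',
         'pos_x', 'pos_y', 'pos_z', 'proba_track', 'proba_cscd']

# Free-text rows parse into mostly-NaN rows; the header row's
# leading field is empty and also becomes NaN, so dropna removes both
df = pd.read_csv(io.StringIO(sample), sep=':', header=None,
                 names=names, on_bad_lines='skip')
df = df.dropna()
df = df.apply(lambda col: pd.to_numeric(col.astype(str).str.replace(',', '.')))
df = df.set_index('idx')
```

Only the two data rows survive, with proper float dtypes and the leading column as the index.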
Question
I have a csv file which looks like:
Provided by someone, unformatted by another-one for some purposes,
another-one would never hand out such a mess ;)
Have fun!
$ Column names
:azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd
$ Data
0:2,3495370211373316:1,1160038417256017:0,04899799823760986:3,3664000034332275:52,74:28,831:401,18600000000004:0,8243512974051896:0,17564870259481039
1:5,575785663044353:1,7428377336692398:0,28047099709510803:3,890000104904175:48,369:29,865:417,282:0,8183632734530938:0,18163672654690619
2:4,656124692722159:2,686909147834136:0,1198429986834526:3,2335000038146973:71,722:121,449:363,077:0,8283433133732535:0,17165668662674652
Reading that file in pandas requires skipping the first rows and defining $ as the comment character:
pd.read_csv(
  'data/neutrinos.csv', 
  on_bad_lines='skip', 
  sep=':',
  skiprows=5,
  comment='$',
  index_col=0,
  decimal=','
)
Is there a more generic approach where I can skip all space-separated rows without specifying the number of rows to skip or the comment sign? Thanks in advance.
Answer 1
Score: 1
Your safest bet is to parse the CSV file first, skipping any rows that have only one value, and then use the result as your input:
import csv
import io
import pandas as pd
csv.register_dialect('mycsv', delimiter=':', quoting=csv.QUOTE_NONE)
newcsv = ''
with open('test.csv', newline='') as f:
    reader = csv.reader(f, 'mycsv')
    for row in reader:
        if len(row) > 1:
            newcsv += ':'.join(row) + '\n'
df = pd.read_csv(
  io.StringIO(newcsv), 
  sep=':',
  index_col=0,
  decimal=','
)
Output:
    azimuth    zenith  bjorkeny  energy   pos_x    pos_y    pos_z  proba_track  proba_cscd
0  2.349537  1.116004  0.048998  3.3664  52.740   28.831  401.186     0.824351    0.175649
1  5.575786  1.742838  0.280471  3.8900  48.369   29.865  417.282     0.818363    0.181637
2  4.656125  2.686909  0.119843  3.2335  71.722  121.449  363.077     0.828343    0.171657
Note that for a large file, you will probably want to build the dataframe row by row (in the csv reader loop) rather than creating a huge newcsv string.
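As a sketch of that row-by-row variant (`load_neutrinos` is a hypothetical helper name, using the same ':' dialect as above):

```python
import csv
import pandas as pd

def load_neutrinos(path):
    # Keep only rows that split into more than one ':' field;
    # the free-text preamble lines all collapse to a single field
    rows = []
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE):
            if len(row) > 1:
                rows.append(row)
    # rows[0] is the header line; its leading field is empty
    df = pd.DataFrame(rows[1:], columns=['idx'] + rows[0][1:])
    # The csv module yields strings, so convert with ',' as the decimal mark
    df = df.apply(lambda col: pd.to_numeric(col.str.replace(',', '.')))
    return df.set_index('idx')
```

Building the DataFrame from a list of rows in one call avoids the quadratic cost of repeatedly growing the `newcsv` string.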
Don't try this at work
For your sample data, in a non-production environment, you could use a side effect of an on_bad_lines callback to populate a list with the actual rows:
data = []
_ = pd.read_csv(
    'test.csv',
    on_bad_lines=lambda line: data.append(line),  # side effect: capture the row
    sep=':',
    index_col=0,
    decimal=',',
    engine='python',
)
df = pd.DataFrame(data[1:], columns=['idx'] + data[0][1:]).set_index('idx')
Output:
                azimuth              zenith  ...         proba_track           proba_cscd
idx                                          ...
0    2.3495370211373316  1.1160038417256017  ...  0.8243512974051896  0.17564870259481039
1     5.575785663044353  1.7428377336692398  ...  0.8183632734530938  0.18163672654690619
2     4.656124692722159   2.686909147834136  ...  0.8283433133732535  0.17165668662674652
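Note that the captured rows are lists of strings, so every column above (comma decimals included) is still text; something along these lines would restore numeric dtypes (illustrative two-row frame, not the question's full data):

```python
import pandas as pd

# A frame shaped like the one built from the captured rows:
# string values, ',' as the decimal mark, string index
df = pd.DataFrame(
    {'azimuth': ['2,3495370211373316', '5,575785663044353']},
    index=pd.Index(['0', '1'], name='idx'),
)

# Normalise the decimal mark, then convert columns and index
df = df.apply(lambda col: pd.to_numeric(col.str.replace(',', '.')))
df.index = df.index.astype(int)
```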
Answer 2
Score: 1
CASE I: assuming the first '$' line marks the column headers.
I cannot find a pure pandas solution for this, but it can be achieved another way:
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    lines = f.readlines()
skiprows = [i for i in range(len(lines)) if lines[i][0]=='$'][0]
This gives you the value for the skiprows parameter of pandas.read_csv().
For large CSVs with millions of rows, where execution time is a concern, the same idea can be rewritten to stop at the first match instead of reading every line:
skiprows = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for i, line in enumerate(f):
        if line[0] == '$':
            skiprows = i
            break
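Putting the two pieces together (a sketch; `read_neutrinos` is a hypothetical helper, and comment='$' is still needed for the '$ Column names' and '$ Data' marker lines):

```python
import pandas as pd

def read_neutrinos(path):
    # Stop at the first '$' line instead of reading the whole file
    skiprows = 0
    with open(path) as f:
        for i, line in enumerate(f):
            if line.startswith('$'):
                skiprows = i
                break
    # comment='$' then hides the marker lines that remain after skipping
    return pd.read_csv(path, sep=':', skiprows=skiprows,
                       comment='$', index_col=0, decimal=',')
```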
CASE II (assuming nothing about the '$' marker):
skiprows = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for i, line in enumerate(f):
        if len(line.split(':')) > len(line.split(' ')):
            skiprows = i - 1  # -1 to land on the '$ Column names' line
            break
The space-separated rows split into more fields on ' ' than on ':', while the data rows do the opposite, so that comparison tells the two apart.
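That heuristic can be checked directly on two lines taken from the question:

```python
def looks_like_data(line):
    # Data rows split into more ':' fields than ' ' fields
    return len(line.split(':')) > len(line.split(' '))

preamble = 'Provided by someone, unformatted by another-one for some purposes,'
datarow = '0:2,3495370211373316:1,1160038417256017:0,04899799823760986'

print(looks_like_data(preamble), looks_like_data(datarow))  # False True
```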


Comments