跳过非有效(以空格分隔的)CSV文件行,而不是跳过前N行。

huangapple go评论53阅读模式
英文:

How to skip non-valid (space separated) rows in csv file rather than skipping first N rows

问题

You can achieve a more generic approach to skip space-separated rows in a CSV file without specifying the number of rows to skip or the comment sign using Python's pandas library. Here's how you can do it:

import pandas as pd
import io

# Read the CSV file using pandas, treating all lines before the column names as comments
df = pd.read_csv(
    'data/neutrinos.csv',
    sep=':',
    comment=None,
    header=None,
    names=['azimuth', 'zenith', 'bjorkeny', 'energy', 'pos_x', 'pos_y', 'pos_z', 'proba_track', 'proba_cscd'],
    decimal=','
)

# Remove any rows with missing data (if necessary)
df.dropna(inplace=True)

# Now df contains your data

This code will read the CSV file while treating all lines as comments until it encounters the actual data, making it a more generic approach without specifying the number of rows to skip or the comment sign.

英文:

I have a csv file which looks like:

Provided by someone, unformatted by another-one for some purposes,
another-one would never hand out such a mess ;)

Have fun!

$ Column names
:azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd

$ Data
0:2,3495370211373316:1,1160038417256017:0,04899799823760986:3,3664000034332275:52,74:28,831:401,18600000000004:0,8243512974051896:0,17564870259481039
1:5,575785663044353:1,7428377336692398:0,28047099709510803:3,890000104904175:48,369:29,865:417,282:0,8183632734530938:0,18163672654690619
2:4,656124692722159:2,686909147834136:0,1198429986834526:3,2335000038146973:71,722:121,449:363,077:0,8283433133732535:0,17165668662674652

Reading that file in pandas requires skipping first rows, and defining comment as $:

pd.read_csv(
  'data/neutrinos.csv', 
  on_bad_lines='skip', 
  sep=':',
  skiprows=5,
  comment='$',
  index_col=0,
  decimal=','
)

Is there a more generic approach where I can skip all space-separated rows without defining the number of rows to skip or the comment sign? thanks in advance.

答案1

得分: 1

你最安全的选择是首先解析csv文件跳过只有一个值的行然后将其作为输入:
```python
import csv
import io

csv.register_dialect('mycsv', delimiter=':', quoting=csv.QUOTE_NONE)

newcsv = ''
with open('test.csv', newline='') as f:
    reader = csv.reader(f, 'mycsv')
    for row in reader:
        if len(row) > 1:
            newcsv += ':'.join(row) + '\n'

df = pd.read_csv(
  io.StringIO(newcsv), 
  sep=':',
  index_col=0,
  decimal=','
)
英文:

Your safest bet is to parse the csv file first, skipping any rows which only have 1 value and then using that as your input:

import csv
import io

csv.register_dialect('mycsv', delimiter=':', quoting=csv.QUOTE_NONE)

newcsv = ''
with open('test.csv', newline='') as f:
    reader = csv.reader(f, 'mycsv')
    for row in reader:
        if len(row) > 1:
            newcsv += ':'.join(row) + '\n'

df = pd.read_csv(
  io.StringIO(newcsv), 
  sep=':',
  index_col=0,
  decimal=','
)

Output:

    azimuth    zenith  bjorkeny  energy   pos_x    pos_y    pos_z  proba_track  proba_cscd
0  2.349537  1.116004  0.048998  3.3664  52.740   28.831  401.186     0.824351    0.175649
1  5.575786  1.742838  0.280471  3.8900  48.369   29.865  417.282     0.818363    0.181637
2  4.656125  2.686909  0.119843  3.2335  71.722  121.449  363.077     0.828343    0.171657

Note that for a large file, you will probably want to build the dataframe row by row (in the csv reader loop) rather than creating a huge newcsv string.

Don't try this at work

For your sample data, in a non-production environment, you could use a side-effect from an on_bad_lines callback to populate a list with the actual rows:

data = []
_ = pd.read_csv('test.csv', on_bad_lines=lambda l:data.append(l), sep=':', index_col=0, decimal=',', engine='python')
df = pd.DataFrame(data[1:], columns=['idx'] + data[0][1:]).set_index('idx')

Output:

                azimuth              zenith  ...         proba_track           proba_cscd
idx                                          ...
0    2.3495370211373316  1.1160038417256017  ...  0.8243512974051896  0.17564870259481039
1     5.575785663044353  1.7428377336692398  ...  0.8183632734530938  0.18163672654690619
2     4.656124692722159   2.686909147834136  ...  0.8283433133732535  0.17165668662674652

答案2

得分: 1

CASE-I:假设第一个'$'是列标题。
我找不到任何关于这个的'pandas'解决方案,但有一种其他方法可以实现这一点:

with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    lines = f.readlines()

skiprows = [i for i in range(len(lines)) if lines[i][0]=='$'][0]

这将为您提供参数'skiprows'的值,以供pandas.read_csv()使用。

对于包含数百万行且执行时间很敏感的大型CSV文件,可以进行改进:

skiprows = 0; i=0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for line in f:
        if line[0]=='$':
            skiprows = i
            break
        i+=1

CASE-II(假设没有其他内容):

skiprows = 0; i=0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for line in f:
        if len(line.split(':')) > len(line.split(" ")):
            skiprows = i-1 # -1以达到提到$列的行
            break

        i+=1

基于空格分隔的行将具有比':'分隔的行更多的空格,因此使用相同的逻辑进行比较。

英文:

CASE-I : Assuming the first '$' is the column headers.
I cannot find any 'pandas' solution for this, but there is one other way that this can be achieved:-

with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    lines = f.readlines()

skiprows = [i for i in range(len(lines)) if lines[i][0]=='$'][0]

This will give you value of skiprows for the parameter 'skiprows' in pandas.read_csv().

For large CSVs where million of rows are present and execution time is a sensitive factor, the same can be improvised to -

skiprows = 0; i=0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for line in f:
        if line[0]=='$':
            skiprows = i
            break
        i+=1

CASE-II (Assuming nothin)

skiprows = 0; i=0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for line in f:
        if len(line.split(':')) > len(line.split(" ")):
            skiprows = i-1 # -1 to reach the line that mentions $ Columns
            break

        i+=1

The space-separated rows will have more spaces than the ':' separated rows, and hence the comparison is done on the same logic.

huangapple
  • 本文由 发表于 2023年5月10日 11:30:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76214707.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定