How to skip non-valid (space separated) rows in csv file rather than skipping first N rows
A more generic approach is to pre-filter the file in Python, keeping only the lines that actually contain the ':' separator, and then hand the result to pandas. This avoids hard-coding both the number of rows to skip and the comment sign: the '$' marker lines contain no ':' and are filtered out together with the free-text lines.

```python
import io

import pandas as pd

# Keep only the lines that contain the ':' separator; the free-text
# lines and the '$' marker lines do not, so they are dropped.
with open('data/neutrinos.csv') as f:
    data = ''.join(line for line in f if ':' in line)

# The first surviving line is ':azimuth:zenith:...', which pandas reads
# as the header; the leading empty field becomes the index column.
df = pd.read_csv(
    io.StringIO(data),
    sep=':',
    index_col=0,
    decimal=','
)

# Remove any rows with missing data (if necessary)
df.dropna(inplace=True)
```

Now df contains your data.
Question

I have a csv file which looks like:
```
Provided by someone, unformatted by another-one for some purposes,
another-one would never hand out such a mess ;)
Have fun!
$ Column names
:azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd
$ Data
0:2,3495370211373316:1,1160038417256017:0,04899799823760986:3,3664000034332275:52,74:28,831:401,18600000000004:0,8243512974051896:0,17564870259481039
1:5,575785663044353:1,7428377336692398:0,28047099709510803:3,890000104904175:48,369:29,865:417,282:0,8183632734530938:0,18163672654690619
2:4,656124692722159:2,686909147834136:0,1198429986834526:3,2335000038146973:71,722:121,449:363,077:0,8283433133732535:0,17165668662674652
```
Reading that file in pandas requires skipping the first rows and defining `$` as the comment character:
```python
pd.read_csv(
    'data/neutrinos.csv',
    on_bad_lines='skip',
    sep=':',
    skiprows=5,
    comment='$',
    index_col=0,
    decimal=','
)
```
Is there a more generic approach where I can skip all space-separated rows without defining the number of rows to skip or the comment sign? Thanks in advance.
Answer 1 (score: 1)
Your safest bet is to parse the csv file first, skipping any rows which only have 1 value and then using that as your input:
```python
import csv
import io

import pandas as pd

csv.register_dialect('mycsv', delimiter=':', quoting=csv.QUOTE_NONE)

newcsv = ''
with open('test.csv', newline='') as f:
    reader = csv.reader(f, 'mycsv')
    for row in reader:
        if len(row) > 1:
            newcsv += ':'.join(row) + '\n'

df = pd.read_csv(
    io.StringIO(newcsv),
    sep=':',
    index_col=0,
    decimal=','
)
```
Output:

```
   azimuth    zenith  bjorkeny  energy   pos_x    pos_y    pos_z  proba_track  proba_cscd
0  2.349537  1.116004  0.048998  3.3664  52.740   28.831  401.186     0.824351    0.175649
1  5.575786  1.742838  0.280471  3.8900  48.369   29.865  417.282     0.818363    0.181637
2  4.656125  2.686909  0.119843  3.2335  71.722  121.449  363.077     0.828343    0.171657
```
Note that for a large file, you will probably want to build the dataframe row by row (in the csv reader loop) rather than creating a huge `newcsv` string.
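The row-by-row variant suggested above could look like the following sketch. The inline `raw` sample is made up here to stand in for the question's file; parsed rows are collected in a list, the frame is built once, and the ','-decimal strings are converted explicitly (since `decimal=','` no longer applies when bypassing `read_csv`):

```python
import csv
import io

import pandas as pd

# Made-up inline sample standing in for the question's file.
raw = """Provided by someone, unformatted for some purposes,
Have fun!
$ Column names
:azimuth:zenith
$ Data
0:2,5:1,1
1:5,5:1,7
"""

# Collect parsed rows in a list instead of concatenating a big string;
# rows with a single field are the free-text / '$' marker lines.
rows = [row for row in csv.reader(io.StringIO(raw), delimiter=':',
                                  quoting=csv.QUOTE_NONE)
        if len(row) > 1]

# First kept row is the header line; its leading empty field is the index.
header, data = rows[0], rows[1:]
df = pd.DataFrame(data, columns=['idx'] + header[1:]).set_index('idx')

# Values are still strings with ',' decimals; convert them to numbers.
df = df.apply(lambda s: pd.to_numeric(s.str.replace(',', '.', regex=False)))
```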
Don't try this at work
For your sample data, in a non-production environment, you could use a side effect of an `on_bad_lines` callback to populate a list with the actual rows:
```python
data = []
_ = pd.read_csv('test.csv', on_bad_lines=lambda l: data.append(l), sep=':',
                index_col=0, decimal=',', engine='python')
df = pd.DataFrame(data[1:], columns=['idx'] + data[0][1:]).set_index('idx')
```
Output:

```
                 azimuth              zenith  ...         proba_track          proba_cscd
idx                                           ...
0     2.3495370211373316  1.1160038417256017  ...  0.8243512974051896  0.17564870259481039
1      5.575785663044353  1.7428377336692398  ...  0.8183632734530938  0.18163672654690619
2      4.656124692722159   2.686909147834136  ...  0.8283433133732535  0.17165668662674652
```
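Note that this callback trick leaves every value as a string (see the unconverted ',' decimals in the output above), so a conversion step is still needed. A sketch on a small hypothetical frame shaped like that output:

```python
import pandas as pd

# Hypothetical frame shaped like the callback output: all values are
# strings and still use ',' as the decimal mark.
df = pd.DataFrame(
    {'azimuth': ['2,3495', '5,5757'], 'zenith': ['1,1160', '1,7428']},
    index=pd.Index(['0', '1'], name='idx'),
)

# Swap the decimal mark and convert every column to numbers.
df = df.apply(lambda s: pd.to_numeric(s.str.replace(',', '.', regex=False)))
```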
Answer 2 (score: 1)
CASE-I: Assuming the first '$' line marks the column headers.

I cannot find a pure pandas solution for this, but it can be achieved another way:

```python
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    lines = f.readlines()

skiprows = [i for i in range(len(lines)) if lines[i][0] == '$'][0]
```
This gives you the value for the `skiprows` parameter of `pandas.read_csv()`. For large CSVs with millions of rows, where execution time matters, the same idea can stop at the first match instead of reading the whole file:
```python
skiprows = 0
i = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for line in f:
        if line[0] == '$':
            skiprows = i
            break
        i += 1
```
CASE-II: Assuming nothing about the '$' marker.
```python
skiprows = 0
i = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for line in f:
        if len(line.split(':')) > len(line.split(' ')):
            skiprows = i - 1  # -1 to reach the '$ Column names' line
            break
        i += 1
```
The free-text rows split into more fields on spaces than on ':', while the data rows do the opposite, so the comparison identifies the first ':'-separated line.
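Putting CASE-II together, here is a minimal self-contained sketch, with a made-up inline sample standing in for the real file. Note that the '$' comment sign is still passed to `read_csv`; only the hard-coded `skiprows` value is detected:

```python
import io

import pandas as pd

# Made-up inline sample standing in for the question's file.
raw = """Provided by someone, unformatted for some purposes,
Have fun!
$ Column names
:azimuth:zenith
$ Data
0:2,5:1,1
1:5,5:1,7
"""

# CASE-II detection: the first line that splits into more fields on ':'
# than on spaces is the header line; back up one row to the
# '$ Column names' marker, exactly as in the loop above.
skiprows = 0
for i, line in enumerate(io.StringIO(raw)):
    if len(line.split(':')) > len(line.split(' ')):
        skiprows = i - 1
        break

df = pd.read_csv(io.StringIO(raw), skiprows=skiprows, comment='$',
                 sep=':', index_col=0, decimal=',')
```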