How to skip non-valid (space-separated) rows in a CSV file rather than skipping the first N rows
A more generic way to handle this in pandas, without hard-coding the number of rows to skip or the comment character, is to read every line, let the free-text rows come through as mostly-NaN, and filter them out afterwards:

import pandas as pd

# Read everything; lines without ':' separators fill all but the
# first column with NaN, and the header row has an empty first field
df = pd.read_csv(
    'data/neutrinos.csv',
    sep=':',
    header=None,
    names=['idx', 'azimuth', 'zenith', 'bjorkeny', 'energy',
           'pos_x', 'pos_y', 'pos_z', 'proba_track', 'proba_cscd'],
    on_bad_lines='skip',
)

# Drop the preamble, marker, and header rows (they all contain NaN fields)
df.dropna(inplace=True)

# The mixed text forced every column to object dtype, so restore
# numbers explicitly, treating ',' as the decimal mark
df = df.apply(lambda col: pd.to_numeric(col.astype(str).str.replace(',', '.')))
df = df.set_index('idx')

This reads the whole file and filters the non-data rows out after the fact, so neither skiprows nor comment needs to be specified.
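As a quick sanity check, the dropna-based filtering can be run against the sample from the question (a sketch; an in-memory string stands in for `data/neutrinos.csv`):

```python
import io
import pandas as pd

sample = """Provided by someone, unformatted by another-one for some purposes,
another-one would never hand out such a mess ;)
Have fun!
$ Column names
:azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd
$ Data
0:2,3495370211373316:1,1160038417256017:0,04899799823760986:3,3664000034332275:52,74:28,831:401,18600000000004:0,8243512974051896:0,17564870259481039
1:5,575785663044353:1,7428377336692398:0,28047099709510803:3,890000104904175:48,369:29,865:417,282:0,8183632734530938:0,18163672654690619
"""

names = ['idx', 'azimuth', 'zenith', 'bjorkeny', 'energy',
         'pos_x', 'pos_y', 'pos_z', 'proba_track', 'proba_cscd']

# Free-text rows parse into mostly-NaN rows; the header row's
# leading field is empty and also becomes NaN, so dropna removes both
df = pd.read_csv(io.StringIO(sample), sep=':', header=None,
                 names=names, on_bad_lines='skip')
df = df.dropna()
df = df.apply(lambda col: pd.to_numeric(col.astype(str).str.replace(',', '.')))
df = df.set_index('idx')
```

Only the two data rows survive, with proper float dtypes and the leading column as the index.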
Question
I have a csv file which looks like:
Provided by someone, unformatted by another-one for some purposes,
another-one would never hand out such a mess ;)
Have fun!
$ Column names
:azimuth:zenith:bjorkeny:energy:pos_x:pos_y:pos_z:proba_track:proba_cscd
$ Data
0:2,3495370211373316:1,1160038417256017:0,04899799823760986:3,3664000034332275:52,74:28,831:401,18600000000004:0,8243512974051896:0,17564870259481039
1:5,575785663044353:1,7428377336692398:0,28047099709510803:3,890000104904175:48,369:29,865:417,282:0,8183632734530938:0,18163672654690619
2:4,656124692722159:2,686909147834136:0,1198429986834526:3,2335000038146973:71,722:121,449:363,077:0,8283433133732535:0,17165668662674652
Reading that file in pandas requires skipping the first rows and defining $ as the comment character:
pd.read_csv(
  'data/neutrinos.csv', 
  on_bad_lines='skip', 
  sep=':',
  skiprows=5,
  comment='$',
  index_col=0,
  decimal=','
)
Is there a more generic approach where I can skip all space-separated rows without specifying the number of rows to skip or the comment sign? Thanks in advance.
Answer 1
Score: 1
Your safest bet is to parse the CSV file first, skipping any rows that have only one value, and then use the result as your input:
import csv
import io
import pandas as pd
csv.register_dialect('mycsv', delimiter=':', quoting=csv.QUOTE_NONE)
newcsv = ''
with open('test.csv', newline='') as f:
    reader = csv.reader(f, 'mycsv')
    for row in reader:
        if len(row) > 1:
            newcsv += ':'.join(row) + '\n'
df = pd.read_csv(
  io.StringIO(newcsv), 
  sep=':',
  index_col=0,
  decimal=','
)
Output:
    azimuth    zenith  bjorkeny  energy   pos_x    pos_y    pos_z  proba_track  proba_cscd
0  2.349537  1.116004  0.048998  3.3664  52.740   28.831  401.186     0.824351    0.175649
1  5.575786  1.742838  0.280471  3.8900  48.369   29.865  417.282     0.818363    0.181637
2  4.656125  2.686909  0.119843  3.2335  71.722  121.449  363.077     0.828343    0.171657
Note that for a large file, you will probably want to build the dataframe row by row (in the csv reader loop) rather than creating a huge newcsv string.
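As a sketch of that row-by-row variant (`load_neutrinos` is a hypothetical helper name, using the same ':' dialect as above):

```python
import csv
import pandas as pd

def load_neutrinos(path):
    # Keep only rows that split into more than one ':' field;
    # the free-text preamble lines all collapse to a single field
    rows = []
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE):
            if len(row) > 1:
                rows.append(row)
    # rows[0] is the header line; its leading field is empty
    df = pd.DataFrame(rows[1:], columns=['idx'] + rows[0][1:])
    # The csv module yields strings, so convert with ',' as the decimal mark
    df = df.apply(lambda col: pd.to_numeric(col.str.replace(',', '.')))
    return df.set_index('idx')
```

Building the DataFrame from a list of rows in one call avoids the quadratic cost of repeatedly growing the `newcsv` string.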
Don't try this at work
For your sample data, in a non-production environment, you could use a side effect of an on_bad_lines callback to populate a list with the actual rows:
data = []
_ = pd.read_csv(
    'test.csv',
    on_bad_lines=lambda line: data.append(line),  # side effect: capture the row
    sep=':',
    index_col=0,
    decimal=',',
    engine='python',
)
df = pd.DataFrame(data[1:], columns=['idx'] + data[0][1:]).set_index('idx')
Output:
                azimuth              zenith  ...         proba_track           proba_cscd
idx                                          ...
0    2.3495370211373316  1.1160038417256017  ...  0.8243512974051896  0.17564870259481039
1     5.575785663044353  1.7428377336692398  ...  0.8183632734530938  0.18163672654690619
2     4.656124692722159   2.686909147834136  ...  0.8283433133732535  0.17165668662674652
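Note that the captured rows are lists of strings, so every column above (comma decimals included) is still text; something along these lines would restore numeric dtypes (illustrative two-row frame, not the question's full data):

```python
import pandas as pd

# A frame shaped like the one built from the captured rows:
# string values, ',' as the decimal mark, string index
df = pd.DataFrame(
    {'azimuth': ['2,3495370211373316', '5,575785663044353']},
    index=pd.Index(['0', '1'], name='idx'),
)

# Normalise the decimal mark, then convert columns and index
df = df.apply(lambda col: pd.to_numeric(col.str.replace(',', '.')))
df.index = df.index.astype(int)
```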
Answer 2
Score: 1
CASE I: assuming the first '$' line marks the column headers.
I cannot find a pure pandas solution for this, but it can be achieved another way:
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    lines = f.readlines()
skiprows = [i for i in range(len(lines)) if lines[i][0]=='$'][0]
This gives you the value for the skiprows parameter of pandas.read_csv().
For large CSVs with millions of rows, where execution time is a concern, the same idea can be rewritten to stop at the first match instead of reading every line:
skiprows = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for i, line in enumerate(f):
        if line[0] == '$':
            skiprows = i
            break
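Putting the two pieces together (a sketch; `read_neutrinos` is a hypothetical helper, and comment='$' is still needed for the '$ Column names' and '$ Data' marker lines):

```python
import pandas as pd

def read_neutrinos(path):
    # Stop at the first '$' line instead of reading the whole file
    skiprows = 0
    with open(path) as f:
        for i, line in enumerate(f):
            if line.startswith('$'):
                skiprows = i
                break
    # comment='$' then hides the marker lines that remain after skipping
    return pd.read_csv(path, sep=':', skiprows=skiprows,
                       comment='$', index_col=0, decimal=',')
```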
CASE II (assuming nothing about the '$' marker):
skiprows = 0
with open('C:/Users/r.goyal/Desktop/sample.csv') as f:
    for i, line in enumerate(f):
        if len(line.split(':')) > len(line.split(' ')):
            skiprows = i - 1  # -1 to land on the '$ Column names' line
            break
The space-separated rows split into more fields on ' ' than on ':', while the data rows do the opposite, so that comparison tells the two apart.
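That heuristic can be checked directly on two lines taken from the question:

```python
def looks_like_data(line):
    # Data rows split into more ':' fields than ' ' fields
    return len(line.split(':')) > len(line.split(' '))

preamble = 'Provided by someone, unformatted by another-one for some purposes,'
datarow = '0:2,3495370211373316:1,1160038417256017:0,04899799823760986'

print(looks_like_data(preamble), looks_like_data(datarow))  # False True
```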


Comments