Polars read_csv ignore_errors what to do if you can't ignore them?

Question

Using polars.read_csv on a large data set fails because of a field delimiter issue. ignore_errors skips the erroneous records, but I have no idea whether one record or thousands were ignored. Is there a way to pipe the bad records to a bad file, or to report the number of ignored rows?
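
For reference, the call looks roughly like this (the path and options are placeholders):

    import polars as pl

    # ignore_errors=True keeps the parser going past bad records,
    # but reports nothing about how many were skipped
    df = pl.read_csv("path/to/your/file.csv", ignore_errors=True)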

I wish the world were simple enough for data to stick to single-character column delimiters, but that hasn't happened yet - why don't pandas/pyarrow/polars support multi-character field delimiters?

Answer 1

Score: 1

The Polars library doesn't provide a mechanism to pipe the bad records to a separate file or to report the number of ignored rows when using the ignore_errors parameter. You could do it manually in the following way, but I don't know if it's exactly what you want:

    import io
    import polars as pl

    # Define the path to your CSV file
    csv_file = "path/to/your/file.csv"

    # Collect the raw text of every line that fails to parse
    bad_lines = []

    # Read the CSV file and handle errors manually, one line at a time
    with open(csv_file, "r") as file:
        header = next(file)  # keep the header so each line parses on its own
        for line in file:
            try:
                # Parse the header plus this single line as a DataFrame
                df = pl.read_csv(io.StringIO(header + line))
                # Process the valid DataFrame as needed
                # ...
            except Exception:
                # If parsing fails, remember the raw line
                bad_lines.append(line)

    # Write the bad records to a separate file
    with open("path/to/bad_records.csv", "w") as bad_file:
        bad_file.write(header)
        bad_file.writelines(bad_lines)

    # Get and print the count of ignored rows
    ignored_rows = len(bad_lines)
    print(f"Number of ignored rows: {ignored_rows}")
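
If you only need the count, a cheaper cross-check (a sketch that isn't part of the snippet above, and which assumes bad records are dropped rather than null-filled, as you observed) is to compare the parsed row count against the raw line count:

    import polars as pl

    csv_file = "path/to/your/file.csv"

    # Parse permissively, then compare row counts. This assumes one record
    # per physical line (no embedded newlines inside quoted fields).
    df = pl.read_csv(csv_file, ignore_errors=True)

    with open(csv_file, "r") as f:
        total_data_lines = sum(1 for _ in f) - 1  # subtract the header line

    print(f"Rows ignored: {total_data_lines - df.height}")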

Regarding your second question: in pandas you can change the field delimiter when reading a CSV file via the "sep" parameter of pandas.read_csv(). The "sep" parameter lets you specify the delimiter used in the CSV file, and it accepts a string, not just a single character.
For example:

    df = pd.read_csv(csv_file, sep=';')  # Replace ';' with your desired delimiter
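
For a multi-character delimiter specifically, pandas treats a sep longer than one character as a regular expression, which forces the slower Python parsing engine (pass engine="python" to avoid the fallback warning). A minimal sketch, assuming a hypothetical '||' delimiter:

    import pandas as pd

    # A sep longer than one character is interpreted as a regular
    # expression and forces the Python engine; '|' must be escaped
    df = pd.read_csv("path/to/your/file.csv", sep=r"\|\|", engine="python")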
