Question

Using polars.read_csv on a large dataset fails because of a field delimiter issue. Setting ignore_errors=True skips the erroneous records, but I have no idea whether one record or thousands were ignored. Is there a way to pipe the bad records to a separate file, or to report the number of ignored rows?

I wish the world were simple enough for all data to use single-character column delimiters, but that hasn't happened yet. Why don't pandas, pyarrow, and polars support multi-character field delimiters?

Answer 1

Score: 1

Polars does not provide a mechanism to pipe bad records to a separate file or to report the number of ignored rows when ignore_errors is used. You could do it manually along the following lines, though I don't know if it's what you want:

import csv
import io

import polars as pl

# Define the paths to your CSV file and to the file for bad records
csv_file = "path/to/your/file.csv"
bad_file = "path/to/bad_records.csv"

good_lines = []   # raw lines whose field count matches the header
ignored_rows = 0  # number of lines diverted to the bad file

# Read the CSV file and handle errors manually.
# Assumption: a record is "bad" when its field count differs from the header's,
# and no quoted field spans more than one line.
with open(csv_file, "r", newline="") as src, open(bad_file, "w", newline="") as bad:
    header = next(src)
    expected = len(next(csv.reader([header])))
    good_lines.append(header)
    for line in src:
        if len(next(csv.reader([line]))) == expected:
            good_lines.append(line)
        else:
            # Write the bad record to the separate file and count it
            bad.write(line)
            ignored_rows += 1

# Parse only the good lines with Polars
df = pl.read_csv(io.BytesIO("".join(good_lines).encode()))

# Print the number of ignored rows
print(f"Number of ignored rows: {ignored_rows}")

Regarding your second question: in pandas you can change the field delimiter when reading a CSV file with the sep parameter of pandas.read_csv(). It accepts a single character or a longer string; a separator longer than one character (other than '\s+') is interpreted as a regular expression and forces the slower Python parsing engine.
For example:

import pandas as pd
df = pd.read_csv(csv_file, sep=';')  # Replace ';' with your desired delimiter
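For readers that only accept a single-character separator, one workaround is to split on the multi-character delimiter in plain Python first and hand the reader ordinary comma-delimited text. The following is only a minimal sketch under stated assumptions: the function name split_multichar_csv and the "||" delimiter are hypothetical, the input is non-empty with a header on its first line, and the delimiter never appears inside a field. It also sets aside lines whose field count does not match the header, so the number of ignored rows is known.

```python
import csv
import io


def split_multichar_csv(text, delimiter="||"):
    """Split `text` on a multi-character delimiter (hypothetical helper).

    Returns (clean_csv, bad_lines): clean_csv is comma-delimited text with
    the header and every line whose field count matches the header;
    bad_lines holds the raw lines that did not match.
    """
    lines = text.splitlines()
    header = lines[0].split(delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    bad_lines = []
    for line in lines[1:]:
        fields = line.split(delimiter)
        if len(fields) == len(header):
            # csv.writer quotes any field that contains a comma or a quote
            writer.writerow(fields)
        else:
            bad_lines.append(line)
    return out.getvalue(), bad_lines


# Illustrative sample data, not taken from the question
sample = "id||name||city\n1||Ann||Oslo\n2||Bob\n3||Cy, Jr.||Rome\n"
clean_csv, bad_lines = split_multichar_csv(sample)
print(clean_csv)
print(f"Number of ignored rows: {len(bad_lines)}")
```

The returned text could then be passed to a reader, for example pl.read_csv(io.BytesIO(clean_csv.encode())), and bad_lines written to a separate file for inspection. For files too large to hold in memory, the same check would need to be applied line by line while streaming.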
