Apache CSV Commons: IOException: (line 5) invalid char between encapsulated token and delimiter

Question

The Apache Commons CSV library fails with this exception:

IOException: (line 5) invalid char between encapsulated token and delimiter

It is trying to read the following .csv file:

"id", "category_id", "brand_id", "catalog_number"
"6427146", "4045", "764\"13", "A26-30-01"
"6425052", "4058", "764\"13", "P9B02VN"
"6424406", "4054", "764\"13", "A40-30-10-80"
"6152302", "4046", "764\"13", "1.75\\" center distance"
"6152301", "4046", "764\"13", "ZL110"
"6152300", "4046", "764\"13", "ZAF460-70"
"6152299", "4046", "764\"13", "ZA75-84"
"6152298", "4046", "764\"13", "ZA75-80"
"6152297", "4046", "764\"13", "ZA75-55-1SBN153510R5506"

The library cannot read line 5.

Code:

@Test
public void testReadCsvFile() throws IOException {
    Reader reader = new FileReader("products-with-escaped-escape-symbol.csv");

    Iterable<CSVRecord> records = CSVFormat.DEFAULT
            .withHeader(HEADERS)
            .withFirstRecordAsHeader()
            .withIgnoreHeaderCase()
            .withIgnoreSurroundingSpaces()
            .withEscape('\\')
            .parse(reader);

    for (CSVRecord record : records) {
        String brandId = record.get("brand_id");
        assertThat(brandId, is("764\"13"));
    }
}

Is there a way to configure the Apache Commons CSV library to read a .csv file with this structure?

Answer 1

Score: 1

The best thing you can do is let the source of this data know that they have provided you with invalid input. They are supposed to send a CSV file, and they failed to do that. This input is not a CSV file; it merely resembles one.
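Why the parser gives up on line 5 can be seen with a stripped-down scanner. The sketch below is not Commons CSV's actual lexer; QuotedFieldScan and closingQuote are hypothetical names, and the only assumption is that a backslash escapes the next character, as configured by withEscape('\\').

```java
class QuotedFieldScan {
    // Returns the index of the quote that closes the field opening at `start`,
    // treating a backslash as an escape for the next character.
    // Returns -1 if the field is never closed.
    static int closingQuote(String line, int start) {
        for (int i = start + 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\') {
                i++;                 // skip the escaped character
            } else if (c == '"') {
                return i;            // first unescaped quote closes the field
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The last field of line 5: "1.75\\" center distance"
        String field = "\"1.75\\\\\" center distance\"";
        int end = closingQuote(field, 0);
        System.out.println(field.substring(0, end + 1)); // prints "1.75\\"
        System.out.println(field.substring(end + 1));    // prints  center distance"
    }
}
```

The scan stops at the quote right after the escaped backslash, so the value is 1.75\ and the text center distance" is left over between the closing quote and the next delimiter, which is what the exception message complains about.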

If that option is not available, you can create your own Reader which filters each line before passing it to the CSVParser:

List<CSVRecord> records;

CSVFormat format = CSVFormat.DEFAULT
        .withHeader(HEADERS)
        .withFirstRecordAsHeader()
        .withIgnoreHeaderCase()
        .withIgnoreSurroundingSpaces()
        .withEscape('\\');

// The filter writes corrected lines into the pipe; the parser reads from it.
try (PipedReader reader = new PipedReader();
     PipedWriter writer = new PipedWriter(reader)) {

    Runnable filterTask = () -> {
        // Closing filteredWriter at the end signals end of input to the parser.
        try (BufferedReader fileReader = Files.newBufferedReader(
                Path.of("products-with-escaped-escape-symbol.csv"));
             PipedWriter filteredWriter = writer) {

            String line;
            while ((line = fileReader.readLine()) != null) {
                // Rewrite \\" as \" when it is not followed by a comma.
                line = line.replaceAll(
                    "^(" +
                    "(?:\\s*\"(?:[^\"\\\\]|\\\\.)*\"\\s*,)*" +
                    "\\s*\"(?:[^\"\\\\]|\\\\[^\\\\])*" +
                    ")" +
                    "\\\\\\\\\"($|\\s*[^,])", "$1\\\\\"$2");
                filteredWriter.write(line);
                filteredWriter.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    };

    // The filter must run on another thread, or the pipe would block.
    CompletableFuture<?> filter = CompletableFuture.runAsync(filterTask);

    try (CSVParser parser = format.parse(reader)) {
        records = parser.getRecords();
        // Surface any exception thrown by the filter task.
        filter.get();
    } catch (ExecutionException e) {
        Throwable cause = e.getCause();
        if (cause instanceof RuntimeException re) {
            throw re;
        } else {
            throw new RuntimeException(cause);
        }
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
} catch (IOException e) {
    throw new UncheckedIOException(e);
}

This is not a completely reliable solution. If a comma follows \\" and that comma is intended to be part of the value rather than a value separator, the filter cannot tell the difference, and writing code to recognize that case would be a lot more complicated.
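To try the substitution on single lines without Commons CSV on the classpath, it can be pulled into a small helper. This is only a sketch: the names EscapeFilter and fix are made up for illustration, while the pattern and replacement are the ones from the code above.

```java
import java.util.regex.Pattern;

class EscapeFilter {
    // Group 1: any complete leading fields plus the start of the current field.
    // Then \\" which is rewritten as \" when it sits at the end of the line
    // or is followed by something other than a comma (group 2).
    private static final Pattern MISQUOTED = Pattern.compile(
        "^(" +
        "(?:\\s*\"(?:[^\"\\\\]|\\\\.)*\"\\s*,)*" +
        "\\s*\"(?:[^\"\\\\]|\\\\[^\\\\])*" +
        ")" +
        "\\\\\\\\\"($|\\s*[^,])");

    static String fix(String line) {
        return MISQUOTED.matcher(line).replaceAll("$1\\\\\"$2");
    }

    public static void main(String[] args) {
        // Shortened version of line 5: "4046", "1.75\\" center distance"
        System.out.println(fix("\"4046\", \"1.75\\\\\" center distance\""));
        // prints "4046", "1.75\" center distance"

        // A comma right after \\" is taken as a separator, so nothing changes:
        System.out.println(fix("\"x\", \"1.75\\\\\", etc\""));
        // prints "x", "1.75\\", etc"
    }
}
```

The second call shows the limitation described above: when a comma follows \\", the line is passed through untouched, whether or not that comma was meant to be part of the value.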

huangapple
  • Posted on May 25, 2023 at 05:32:55
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76327541.html