Apache CSV Commons: IOException: (line 5) invalid char between encapsulated token and delimiter
Question
The Apache Commons CSV library fails with this exception:
IOException: (line 5) invalid char between encapsulated token and delimiter
It is trying to read the following .csv file:
"id", "category_id", "brand_id", "catalog_number"
"6427146", "4045", "764\"13", "A26-30-01"
"6425052", "4058", "764\"13", "P9B02VN"
"6424406", "4054", "764\"13", "A40-30-10-80"
"6152302", "4046", "764\"13", "1.75\\" center distance"
"6152301", "4046", "764\"13", "ZL110"
"6152300", "4046", "764\"13", "ZAF460-70"
"6152299", "4046", "764\"13", "ZA75-84"
"6152298", "4046", "764\"13", "ZA75-80"
"6152297", "4046", "764\"13", "ZA75-55-1SBN153510R5506"
The library cannot read line 5.
Code:
@Test
public void testReadCsvFile() throws IOException {
    Reader reader = new FileReader("products-with-escaped-escape-symbol.csv");
    Iterable<CSVRecord> records = CSVFormat.DEFAULT
        .withHeader(HEADERS)
        .withFirstRecordAsHeader()
        .withIgnoreHeaderCase()
        .withIgnoreSurroundingSpaces()
        .withEscape('\\')
        .parse(reader);
    for (CSVRecord record : records) {
        String brandId = record.get("brand_id");
        assertThat(brandId, is("764\"13"));
    }
}
Is there a way to configure the Apache Commons CSV library so that it can read a .csv file with this structure?
Answer 1
Score: 1
The best thing you can do is let the source of this data know that they have provided you with invalid input. They are supposed to send a CSV file and they failed to do that. This input is not a CSV file; it merely resembles one.
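To see why line 5 is the one that breaks, here is a minimal sketch of how a reader that treats backslash as the escape character finds the end of a quoted value. It is a simplified stand-in written for illustration, not Commons CSV's actual lexer, and the names `QuoteScanDemo` and `closingQuote` are hypothetical.

```java
public class QuoteScanDemo {

    /**
     * Returns the index of the quote that closes the value opened at
     * index `open`, or -1 if the value is never closed. A backslash
     * escapes whatever single character follows it.
     */
    static int closingQuote(String s, int open) {
        for (int i = open + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;            // skip the escaped character
            } else if (c == '"') {
                return i;       // unescaped quote: the value ends here
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The brand_id value 764\"13: the backslash escapes the inner quote.
        String ok = "\"764\\\"13\"";
        // The catalog_number value from line 5: \\ is an escaped backslash,
        // so the quote after it is NOT escaped and closes the value early.
        String broken = "\"1.75\\\\\" center distance\"";

        for (String value : new String[] {ok, broken}) {
            int end = closingQuote(value, 0);
            System.out.println(value + " closes at index " + end
                + ", leftover: [" + value.substring(end + 1) + "]");
        }
    }
}
```

Under these assumptions, the scan of the line 5 value stops at the quote right after the two backslashes, so the text ` center distance"` is left sitting between the closed value and the next delimiter. That leftover is the kind of thing the exception message complains about.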
If that option is not available, you can create your own Reader which filters each line before passing it to the CSVParser:
List<CSVRecord> records;
CSVFormat format = CSVFormat.DEFAULT
    .withHeader(HEADERS)
    .withFirstRecordAsHeader()
    .withIgnoreHeaderCase()
    .withIgnoreSurroundingSpaces()
    .withEscape('\\');

try (PipedReader reader = new PipedReader();
     PipedWriter writer = new PipedWriter(reader)) {

    // Runs on another thread: reads the file, repairs each line, and
    // writes it into the pipe that the parser reads from.
    Runnable filterTask = () -> {
        try (BufferedReader fileReader = Files.newBufferedReader(
                 Path.of("products-with-escaped-escape-symbol.csv"));
             PipedWriter filteredWriter = writer) {

            String line;
            while ((line = fileReader.readLine()) != null) {
                // Rewrite \\" as \" when it occurs inside a quoted value
                // and is followed by the end of the line or by any
                // character other than a comma.
                line = line.replaceAll(
                    "^(" +
                        "(?:\\s*\"(?:[^\"\\\\]|\\\\.)*\"\\s*,)*" +
                        "\\s*\"(?:[^\"\\\\]|\\\\[^\\\\])*" +
                    ")" +
                    "\\\\\\\\\"($|\\s*[^,])", "$1\\\\\"$2");
                filteredWriter.write(line);
                filteredWriter.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    };

    CompletableFuture<?> filter = CompletableFuture.runAsync(filterTask);

    try (CSVParser parser = format.parse(reader)) {
        records = parser.getRecords();
        // Wait for the filter thread and surface any exception it threw.
        filter.get();
    } catch (ExecutionException e) {
        Throwable cause = e.getCause();
        if (cause instanceof RuntimeException re) {
            throw re;
        } else {
            throw new RuntimeException(cause);
        }
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
} catch (IOException e) {
    throw new UncheckedIOException(e);
}
This is not a completely reliable solution. If a comma followed \\" and that comma was intended to be part of the value, rather than a value separator, writing code to recognize that case would be a lot more complicated.
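The line filter can be tried on its own, without the pipe or the parser. The sketch below wraps the same regular expression in a hypothetical helper, `fixLine` (the class and method names are assumptions made for this example), and runs it on three lines: the broken line 5, a well-formed line from the file, and a made-up line in which a comma follows \\" inside a value.

```java
import java.util.regex.Pattern;

public class EscapeFilterDemo {

    // Same expression as in the answer, compiled once. Group 1 is
    // everything up to the doubled backslash; group 2 is what follows
    // the quote (end of line, or text that does not start with a comma).
    private static final Pattern DOUBLED_ESCAPE = Pattern.compile(
        "^(" +
            "(?:\\s*\"(?:[^\"\\\\]|\\\\.)*\"\\s*,)*" +
            "\\s*\"(?:[^\"\\\\]|\\\\[^\\\\])*" +
        ")" +
        "\\\\\\\\\"($|\\s*[^,])");

    /** Rewrites \\" as \" where the pattern matches; otherwise returns the line unchanged. */
    static String fixLine(String line) {
        return DOUBLED_ESCAPE.matcher(line).replaceAll("$1\\\\\"$2");
    }

    public static void main(String[] args) {
        String[] lines = {
            // Line 5 of the file from the question.
            "\"6152302\", \"4046\", \"764\\\"13\", \"1.75\\\\\" center distance\"",
            // A line that is already valid.
            "\"6427146\", \"4045\", \"764\\\"13\", \"A26-30-01\"",
            // Made-up case: the comma after \\" is meant to be data.
            "\"1\", \"2\", \"3\", \"2.5\\\\\", wide\""
        };
        for (String line : lines) {
            System.out.println(fixLine(line));
        }
    }
}
```

The first line comes back with `1.75\" center distance` as its last value and the second is left alone. The third is also left alone, which is the limitation described above: the filter has no way to tell that the comma was meant to be part of the value.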