File is read on Windows but not on a Linux container?
Question
Like the title says, I'm not able to read the contents of a file (a CSV file) when running the same code in a Linux container:
    private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
        log.debug("File exists? " + Files.exists(file));
        log.debug("Path " + file.toString());
        log.debug("File " + file.toFile().toString());
        log.debug("File absolute path " + file.toAbsolutePath().toString());
        String line;
        Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
        try (BufferedReader br = new BufferedReader(new FileReader(file.toFile()))) {
            while ((line = br.readLine()) != null) {
                configurations.add(build(line, codeType));
            }
        }
        log.debug("Loaded " + configurations.size() + " configurations");
        return configurations;
    }
The logs show "true" and the same file path on both systems (locally on Windows and in a Linux Docker container). On Windows it loads "15185 configurations", but in the container it loads "0 configurations".
The file exists on Linux; I checked it myself in bash. The head command shows that the file has lines.
Before this I tried Files.lines, like so:
    var vehicleConfigurations = Files.lines(file)
        .map(line -> build(line, codeType))
        .collect(Collectors.toCollection(HashSet::new));
But this has a problem with the contents (in the container only). It reads the file, but not the whole file: it reaches a given line (say line 8000) and does not read it completely (it reads about half the line and stops before the comma separator). Then I get a java.lang.ArrayIndexOutOfBoundsException, because my build method splits the line and accesses index 1, which does not exist (only index 0 does):
    private VehicleConfiguration build(String line, CodeType codeType) {
        String[] cells = line.split(lineSeparator);
        var vc = new VehicleConfiguration();
        vc.setVin(cells[0]);
        vc.setCode(cells[1]);
        vc.setType(codeType);
        return vc;
    }
What could be the issue? I don't understand how the same Java code works on Windows but not in a Linux container. It makes no sense.
I'm using Java 11. The file is made available through a volume in a docker-compose file, like this:
    volumes:
      - ./file-sources:/file-sources
I then copy the file (using the cp command in the Linux container) from /file-sources to /root, because that's where the app listens for new files to arrive. The file contents are then read with the methods I described. Example file data (it does not have weird characters):
[image: file contents]
Thanks in advance.
UPDATE: I tried the newBufferedReader method, with the same result (works on Windows, doesn't work in the Linux container):
    private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
        String line;
        Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
        try (BufferedReader br = Files.newBufferedReader(file)) {
            while ((line = br.readLine()) != null) {
                configurations.add(build(line, codeType));
            }
        }
        log.debug("Loaded " + configurations.size() + " configurations");
        return configurations;
    }
Running wc -l in the Linux container (in /root) returns: 15185 hard_001.csv
Update: This is not a solution, but I found out that if I drop the files directly into the file-sources folder and make that the folder the code listens to, the files are read. So it seems the problem shows up when using cp/mv inside the container to move the file to another folder. Maybe the file is read before it is fully copied/moved, and that's why it reads 0 configurations?
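One way to test the hypothesis in the update above is to delay reading until the file size stops changing. The following is only a minimal sketch of that heuristic: the class name StableFileWaiter, the method waitUntilStable, and the polling parameters are hypothetical and not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StableFileWaiter {

    /**
     * Polls the file size until two consecutive checks return the same
     * non-zero value, or until the attempts run out.
     * Returns true only if the size was observed to be stable.
     */
    static boolean waitUntilStable(Path file, long pollMillis, int maxAttempts)
            throws IOException, InterruptedException {
        long previous = -1;
        for (int i = 0; i < maxAttempts; i++) {
            long current = Files.size(file);
            if (current > 0 && current == previous) {
                return true; // size unchanged since the previous poll
            }
            previous = current;
            Thread.sleep(pollMillis);
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        // Assumed values: poll every 200 ms, give up after 50 attempts.
        System.out.println("Stable: " + waitUntilStable(file, 200, 50));
    }
}
```

This is a heuristic rather than a guarantee, since a slow copy can pause for longer than the polling interval. A sturdier alternative is to copy the file under a temporary name and then rename it into the watched folder, because a rename within the same filesystem is atomic.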
Answer 1
Score: 4
There are a few methods in Java you should never use. Ever. new FileReader(File) is one of them.
Any time that you have a thing that represents bytes and somehow chars or Strings fall out, or vice versa? Don't ever use those, unless the spec of said method explicitly points out that it always uses a pre-set charset. Almost all such methods use the 'system default charset' which means that the operation depends on the machine you run it on. That is shorthand for 'this will fail, and your tests won't catch it'. Which you don't want.
Which is why you should never use these things.
FileReader has been fixed (there is a second constructor that takes a charset), but only since JDK 11. You already have the nice new API, so why switch back to the dinky old File API? Don't do that.
All the various methods in Files, such as Files.newBufferedReader, are specced to use UTF-8 if you don't specify a charset (in that way Files is more useful than, and unlike, most other Java core libraries). Thus:
    try (BufferedReader br = Files.newBufferedReader(file)) {
which is just... better than your line.
Now, it'll probably still fail on you. But that's good! It'll also fail on your dev machine. Most likely, the file you are reading is not, in fact, in UTF-8. This is the likely guess: most Linux systems are deployed with a UTF-8 default charset, and most dev machines are not; if your dev machine is working and your deployment environment isn't, the obvious conclusion is that your input file is not UTF-8. It does not need to be whatever your dev machine has as a default either; something like ISO-8859-1 will never throw exceptions, but if it is the wrong choice it will read gobbledygook instead. Your code may seem to work (no crashes), but the text you read is still incorrect.
Figure out what text encoding you have, and then specify it. If it's ISO-8859-1, for example:
    try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
and now your code no longer has the 'works on some machines but not on others' nature.
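To make the difference concrete, here is a small self-contained sketch. The class name CharsetDemo, the helper firstLine, and the sample bytes are invented for illustration and are not taken from the question. It writes one line containing the byte 0xE9 and reads it back with two explicit charsets:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetDemo {

    /** Reads the first line of a file, decoding it with the given charset. */
    static String firstLine(Path file, Charset charset) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".csv");
        try {
            // "VIN001;caf" followed by 0xE9: 'é' in ISO-8859-1, but malformed as UTF-8.
            byte[] bytes = {'V', 'I', 'N', '0', '0', '1', ';', 'c', 'a', 'f', (byte) 0xE9, '\n'};
            Files.write(file, bytes);

            System.out.println("Platform default charset: " + Charset.defaultCharset());
            System.out.println("ISO-8859-1: " + firstLine(file, StandardCharsets.ISO_8859_1));
            try {
                System.out.println("UTF-8: " + firstLine(file, StandardCharsets.UTF_8));
            } catch (MalformedInputException e) {
                // Files.newBufferedReader reports malformed input instead of replacing it.
                System.out.println("UTF-8 decoding failed: " + e);
            }
        } finally {
            Files.delete(file);
        }
    }
}
```

Because both reads name their charset, the outcome does not depend on the platform default: the ISO-8859-1 read returns a line, and the UTF-8 read fails the same way on a Windows dev machine and in a Linux container.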
Inspect the line where it fails, in a hex editor if you have to. I bet you dollars to donuts there will be a byte there which is 0x80 or higher (in decimal, 128 or higher). Everything up to and including 127 tends to mean the exact same thing in a wide variety of text encodings, from ASCII to any ISO-8859 variant to UTF-8 to Windows Cp1252 to MacRoman to so many other things, so as long as it's all just plain letters and digits, having the wrong encoding is not going to make any difference. But once you get to 0x80 or higher, they're all different. That byte, plus some knowledge of what character it is supposed to be, is usually a good start in figuring out what encoding the text file is in.
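If no hex editor is at hand, the same check can be done in a few lines of Java. This is a rough sketch; the class NonAsciiFinder and its methods are hypothetical helpers, and the file path is taken from the first command-line argument:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NonAsciiFinder {

    /**
     * Returns the offset of the first byte with value 0x80 or higher,
     * or -1 if every byte is plain ASCII.
     */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) >= 0x80) {
                return i;
            }
        }
        return -1;
    }

    /** Counts the line feeds before the offset to get a 1-based line number. */
    static int lineNumberAt(byte[] data, int offset) {
        int line = 1;
        for (int i = 0; i < offset; i++) {
            if (data[i] == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        // Loads the whole file into memory, which is fine for a CSV of this size.
        byte[] data = Files.readAllBytes(Path.of(args[0]));
        int offset = firstNonAscii(data);
        if (offset < 0) {
            System.out.println("Only ASCII bytes found; the charset is unlikely to be the cause.");
        } else {
            System.out.printf("Byte 0x%02X at offset %d (line %d)%n",
                    data[offset] & 0xFF, offset, lineNumberAt(data, offset));
        }
    }
}
```

If the scan finds nothing, every byte decodes identically in the common encodings, which points away from the charset and toward how the file arrives on disk.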
NB: If this isn't it, check how the text file is being copied from your dev machine to your deployment environment. Are you sure it is the same file? If it's being copied through a textual mechanism, charset encoding can again be to blame, but this time in how the file is written rather than in how your Java app reads it.
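A quick way to check whether both environments really see the same file is to compare its size and a checksum on each side. The sketch below assumes a hypothetical class FileFingerprint and takes the file path as its first command-line argument; run it on the dev machine and in the container and compare the output:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileFingerprint {

    /** Returns the SHA-256 digest of the file contents as lowercase hex. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        System.out.println(Files.size(file) + " bytes, sha256 " + sha256Hex(file));
    }
}
```

Different sizes or digests mean the bytes changed somewhere between the two machines, or that the file was not yet completely written when it was inspected.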