Modify content of large file

Question

I have extracted my tables from my database into JSON files, and now I want to read these files and remove all the double quotes in them. It seems easy; I have tried hundreds of solutions, and some led me to out-of-memory problems. I'm dealing with files of more than 1 GB. The code below shows strange behaviour, and I don't understand why it returns empty files:

public void replaceDoubleQuotes(String fileName){
    log.debug(" start formatting " + fileName + " ...");
    File firstFile = new File("C:/sqlite/db/tables/" + fileName);
    String oldContent = "";
    String newContent = "";
    BufferedReader reader = null;
    BufferedWriter writer = null;
    FileWriter writerFile = null;
    String stringQuotes = "\\\\\\\\\"";
    try {
        reader = new BufferedReader(new FileReader(firstFile));
        writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
        writer = new BufferedWriter(writerFile);

        while ((oldContent = reader.readLine()) != null) {
            newContent = oldContent.replaceAll(stringQuotes, "");
            writer.write(newContent);
        }

        writer.flush();
        writer.close();
    } catch (Exception e) {
        log.error(e);
    }
}

When I try to use FileWriter(path, true) to write at the end of the file, the program never stops; the file keeps growing until the hard disk is full. Thanks for any help.

PS: I also tried using subString, appending the new content, and writing the subString after the while loop, but that didn't work either.

Answer 1

Score: 3

TL;DR

Do not read and write the same file concurrently.

The issue

Your code starts reading, and then immediately truncates the file it is reading.

 reader = new BufferedReader(new FileReader(firstFile));
 writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
 writer = new BufferedWriter(writerFile);

The first line opens a read handle to the file.
The second line opens a write handle to the same file.
It is not very obvious from the documentation of the FileWriter constructors, but when you do not use a constructor that lets you specify the append parameter, it defaults to false, meaning you immediately truncate the file if it already exists.

At this point (line 2) you have just erased the file you were about to read. So you end up with an empty file.
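
To make that concrete, here are the two FileWriter constructor calls in play, using the path from the question (standard java.io behaviour):

    // append defaults to false: truncates the file if it already exists
    writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);

    // append = true: keeps the existing content and writes at the end
    writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName, true);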

What about using append=true

Well, then the file is not erased when it is created, which is "good". So your program starts reading the first line, and outputs (to the same file) the filtered version.

So each time a line is read, another is appended.

No wonder your program never reaches the end of the file: each time it advances a line, it creates another line to process. Generally speaking, you'll never reach the end of the file (of course, if the file is a single line to begin with, you might, but that's a corner case).

The solution

Write to a temporary file, and IF (and only IF) you succeed, then swap the files if you really need to.

An advantage of this solution: if for whatever reason your process crashes, you'll have the original file untouched and you can retry later, which is usually a good thing. Your process is "repeatable".

A disadvantage: you'll need twice the space at some point. (You could compress the temp file to reduce this factor, but you'd still need some extra room.)
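
Here is a minimal sketch of that approach, assuming UTF-8 content and a simple literal quote filter (the class and method names are illustrative, not from the original post; the question's regex targeted escaped quotes, so adapt the replace call to your data):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class QuoteStripper {

        // Stream the source line by line into a temp file in the same
        // directory (so the final move stays on one filesystem), and only
        // replace the original once the whole copy has been written.
        public static void removeQuotes(Path source) throws IOException {
            Path temp = Files.createTempFile(source.getParent(), "filter-", ".tmp");
            try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line.replace("\"", "")); // simple literal filter
                    writer.newLine(); // readLine() strips the terminator; put it back
                }
            } catch (IOException e) {
                Files.deleteIfExists(temp); // on failure the original stays untouched
                throw e;
            }
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        }
    }

Calling removeQuotes(Paths.get("C:/sqlite/db/tables/mytable.json")) would then only replace the original after the whole filtered copy has been written.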

About out of memory issues

When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only use one line's worth of memory at a time.

Therefore it generally avoids memory usage issues (unless of course, you have a file without line breaks, in which case it makes no difference at all).

Other solutions, involving reading the whole file at once, performing the search/replace in memory, and then writing the contents back, do not scale well, so it's good you avoided that kind of computation.

Not related but important

Check out the try-with-resources syntax to properly close your resources (reader / writer). Here you forgot to close the reader, and you are not closing the writer appropriately anyway (that is: in a finally clause).
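
For instance, the question's open/close boilerplate collapses to a sketch like this (tempFile stands in for the temporary output file suggested above; it is not a variable from the original code):

    try (BufferedReader reader = new BufferedReader(new FileReader(firstFile));
         BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile))) {
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line.replaceAll(stringQuotes, ""));
            writer.newLine();
        }
    } // both reader and writer are closed here automatically, even on exception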

Another thing: I'm pretty sure no Java program written by a mere mortal will beat tools like sed or awk, which are available on most Unix platforms (and then some). You might want to check whether rolling your own in Java is worth what amounts to a shell one-liner.

Answer 2

Score: 0

@GPI already provided a great answer on why reading and writing concurrently is causing the issue you're experiencing. It is also worth noting that reading 1 GB of data into the heap at once can definitely cause an OutOfMemoryError if enough heap isn't allocated, which is likely. To solve this problem you could use an InputStream and read chunks of the file at a time, writing to another file until the process is completed, and ultimately replace the existing file with the modified one. With this approach you could even use a ForkJoinTask to help, since it's such a large job.
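
A minimal sketch of that chunked approach (the class name, paths, and buffer size are illustrative; filtering single bytes is safe here because in UTF-8 the " byte never appears inside a multi-byte sequence):

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class ChunkedQuoteFilter {

        // Read fixed-size chunks, drop every '"' byte, and write the rest
        // to a second file. Memory use is bounded by the buffer size.
        public static void stripQuotes(String inPath, String outPath) throws IOException {
            byte[] buffer = new byte[64 * 1024]; // 64 KB chunks; tune as needed
            try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(inPath));
                 BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(outPath))) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    int kept = 0;
                    for (int i = 0; i < read; i++) {
                        if (buffer[i] != '"') {         // skip double-quote bytes
                            buffer[kept++] = buffer[i]; // compact the chunk in place
                        }
                    }
                    out.write(buffer, 0, kept);
                }
            }
        }
    }

As in the first answer, you would swap the output file over the original only after the copy completes.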

Side note: there may be a better solution than "create new file, write to new file, replace existing, delete new file".
