Why is my text file larger than my binary file?

Question

I'm trying to write a large text file to a binary file, but the binary file is the same size as my text file. I thought that writing to a binary file would compress it. Is writing to a binary file just more efficient? How can I minimize the storage my text file takes up?

ArrayList<String> strArr = new ArrayList<>();
File f = new File("words.txt");
BufferedInputStream in = new BufferedInputStream(new FileInputStream(f));

// Copy the bytes of words.txt, unchanged, into words.ser.
DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(
                new FileOutputStream("words.ser")));

byte[] buffer = new byte[8192]; // or more, or even less, anything > 0
int count;
while ((count = in.read(buffer)) > 0) {
    out.write(buffer, 0, count);
}
in.close();
out.close();

/* ObjectOutputStream oos = new ObjectOutputStream(
        new BufferedOutputStream(
                new FileOutputStream("words.ser"))); */

System.out.println(f.length());    // size of the text file
File file = new File("words.ser");
System.out.println(file.length()); // size of the "binary" copy

Answer 1

Score: 3

To compress a file, you can gzip it, for example.

In Java, you can do that like this:

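// Needs java.io.OutputStream, java.nio.file.Files, java.nio.file.Path,
// java.nio.file.Paths and java.util.zip.GZIPOutputStream.
Path inFile = Paths.get("words.txt");
Path outFile = Paths.get("words.txt.gz");
try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(outFile))) {
    Files.copy(inFile, out); // closing the stream finishes the gzip data
}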
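
Since the question also asks about using the file afterwards, here is a minimal sketch of reading a gzipped word list back, decompressing on the fly. The class name `GzipWords`, its two helper methods, and the temporary file are made up for illustration; only the `java.util.zip` and `java.nio.file` calls are standard library, and it assumes one word per line in UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original answer.
final class GzipWords {

    /** Writes one word per line into a gzip-compressed file. */
    static void writeGzip(Path file, List<String> words) {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String word : words) {
                out.write(word);
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the words back, decompressing while reading. */
    static List<String> readGzip(Path file) {
        List<String> words = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                words.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path gz = Files.createTempFile("words", ".txt.gz"); // stand-in for words.txt.gz
        writeGzip(gz, Arrays.asList("apple", "banana", "cherry"));
        System.out.println(readGzip(gz)); // prints [apple, banana, cherry]
        Files.delete(gz);
    }
}
```

The program never sees the compressed bytes directly: the `GZIPInputStream` sits between the file and the reader, so the rest of the code works with plain lines of text.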

Answer 2

Score: 3

You're confused.

There's no such thing as a 'text' file or a 'binary' file, at least not to a hard disk or a filesystem. A file is a bag of bytes. They all are. Just... bytes.

Now, if the bytes happen to form a sequence that, say, Microsoft Word will correctly read when you pick that file from its 'file open' menu, we may say 'this is a Word file'. The filesystem cares absolutely nothing for such frivolous human things. It was asked to provide the bytes in a file named 'foo.doc' and it did so, in exactly the same fashion it would have had Word asked for the bytes from 'foo.txt' or 'foo.jpg'. It's up to Word to crash if the bytes don't make sense to it.

So, what's a 'text file'? The same deal applies: if a text editing tool asks the filesystem to open a file, and it 'works', I guess we can call it a text file. To the filesystem, it's... just a file.

And now you know why sending the file through an OutputStream or a BufferedWriter or whatnot makes no difference. That only changes the precise mechanism by which the characters end up in byte form. Assuming simple ASCII characters, it's 1 byte per character, simple as that.
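
To make that concrete, here is a small sketch that pushes the same text through a character-oriented `Writer` and a byte-oriented stream and compares the results. The class name `SameBytes` and its two helpers are invented for this illustration, and it assumes UTF-8 as the encoding (for plain ASCII, UTF-8 and ASCII give identical bytes).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
final class SameBytes {

    /** Writes the text through a character-oriented Writer and returns the bytes produced. */
    static byte[] viaWriter(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Encodes the text by hand and writes it through a byte-oriented stream. */
    static byte[] viaStream(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        sink.write(encoded, 0, encoded.length);
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String ascii = "hello world";
        System.out.println(Arrays.equals(viaWriter(ascii), viaStream(ascii))); // true
        System.out.println(viaStream(ascii).length);        // 11: one byte per ASCII character
        System.out.println(viaStream("h\u00e9llo").length); // 6: the accented e takes two bytes in UTF-8
    }
}
```

Either route produces the same bytes, so the file on disk ends up the same size. Only the encoding, not the stream class, decides how many bytes a character costs.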

If you want it to be smaller, you have to use a compression algorithm, like gzip. Note that random data cannot be compressed. The only 'compression' you get is the amount of redundancy (non-entropy) inherent in the data that your compression algorithm manages to find and code into a more efficient form. The other answer shows one easy way to do this.
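
A rough way to see the point about redundancy is to gzip two inputs of the same length, one highly repetitive and one pseudo-random, and compare the compressed sizes. The class name `CompressibilityDemo`, the 100,000-byte size, and the seed are arbitrary choices for this sketch; exact output sizes depend on the deflate implementation, so none are claimed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original answer.
final class CompressibilityDemo {

    /** Returns the size in bytes of the gzip-compressed form of the input. */
    static int gzippedSize(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size(); // read after close, so the gzip trailer is included
    }

    public static void main(String[] args) {
        int n = 100_000;

        byte[] repetitive = new byte[n];
        Arrays.fill(repetitive, (byte) 'a'); // highly redundant input

        byte[] random = new byte[n];
        new Random(42).nextBytes(random);    // pseudo-random input, fixed seed for repeatability

        System.out.println("repetitive: " + n + " -> " + gzippedSize(repetitive));
        System.out.println("random:     " + n + " -> " + gzippedSize(random));
    }
}
```

The repetitive input should shrink to a tiny fraction of its size, while the random input should come out about as large as it went in, or slightly larger because of the gzip header and trailer. A real word list sits somewhere in between.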

huangapple
  • Posted on September 20, 2020 at 08:46:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63974597.html