How to find difference (line-based) in sorted large text files in Java without loading them in full into memory?

huangapple go评论77阅读模式
英文:

How to find difference (line-based) in sorted large text files in Java without loading them in full into memory?

问题

在Java中,如何在不将整个文件完全加载到内存中的情况下,查找已排序大型文本文件之间的差异(基于行)?类似于Unix中的“diff”(该工具似乎也会将整个文件加载到内存中),可以在Java中识别缺失/额外的行。

相关问题链接:https://stackoverflow.com/questions/53598996/comparing-two-text-large-files-with-urls-in-java-with-external-memory-only

英文:

How to find a difference (line-based) in sorted large text files in Java without loading them in full into memory?
Something similar to Unix "diff" (which also seems to be loading whole files in memory), which can identify missing/extra lines, but in Java.

Linked question: https://stackoverflow.com/questions/53598996/comparing-two-text-large-files-with-urls-in-java-with-external-memory-only

答案1

得分: 3

你只需要从拥有最小行的文件中读取。如果两者相同,从两个文件都读取一行;如果其中一个大于另一个,则只从较小的行所在文件中读取。如果连续两次不从同一文件读取,意味着存在差异。在切换读取的过程中,所有在切换之间的行都是不同的(从只读取文件1切换到文件2,或从只读取文件2切换到文件1)。

以下是示例代码的翻译:

// 在从文件1读取切换到文件2读取的情况下:
if(line1.compareTo(line2) > 0){
    if(lastRead == 1) {
        System.out.println(previousLines + " 在 " + path1 + " 中找到,但在 " + path2 + " 中找不到");
        previousLines.clear();
    }
    previousLines.add(line2);
    line2 = in2.readLine();
    lastRead = 1;
}

如果line1大于line2(其中line1是来自文件1的当前行,line2是来自文件2的当前行),则意味着接下来将只从第二个文件中读取。如果过去我只从文件1中读取过(而不是同时从两个文件或第二个文件中读取),则应列出previousLines中的所有行。在previousLines中,我添加了不同的行。lastRead用于跟踪我最后一次读取的文件(0 - 同时从两个文件,1 - 仅从第一个文件,2 - 仅从第二个文件)。

最后更新:
整个方法体,但正如我在评论中提到的,它没有检查在另一个文件之前是否已经完成从一个文件的读取。目前,如果你设置文件的最后一行相同,它可以正常工作。你可以为一个文件或另一个文件的readLine为null时添加进一步的检查。

void toTitleCase(Path path1, Path path2) {

try(BufferedReader in1 = Files.newBufferedReader(path1);
    BufferedReader in2 = Files.newBufferedReader(path2)) {
    String line1 = in1.readLine(), line2 = in2.readLine();
    int lastRead = 0;
    List<String> previousLines = new ArrayList<>();
    while(line1 != null && line2 != null){
        if(line1.compareTo(line2) > 0){
            if(lastRead == 1) {
                System.out.println(previousLines + " 在 " + path1 + " 中找到,但在 " + path2 + " 中找不到");
                previousLines.clear();
            }
            previousLines.add(line2);
            line2 = in2.readLine();
            lastRead = 2;
        } else if(line1.compareTo(line2) < 0){
            if(lastRead == 2) {
                System.out.println(previousLines + " 在 " + path2 + " 中找到,但在 " + path1 + " 中找不到");
                previousLines.clear();
            }
            previousLines.add(line1);
            line1 = in1.readLine();
            lastRead = 1;
        } else {
            if(lastRead == 2) {
                System.out.println(previousLines + " 在 " + path2 + " 中找到,但在 " + path1 + " 中找不到");
            }
            if(lastRead == 1) {
                System.out.println(previousLines + " 在 " + path1 + " 中找到,但在 " + path2 + " 中找不到");
            }
            previousLines.clear();
            line1 = in1.readLine();
            line2 = in2.readLine();
            lastRead = 0;
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
}
英文:

You would need to read only from file which have smallest line(from compareTo perspective). In case both are the same , you read a line from both files, in case one bigger than other, you read only from the file with smaller compareTo. In case you don't read from same files twice in a row it mean you have a difference. All lines between switching reading are different( Switch from reading only from file 1 to file 2 or both or switching from reading only file 2 to file1 or both).

A sample to be more clear. Case you switch from file1 reading to file2:

            if(line1.compareTo(line2)&gt;0){
if(lastRead==1) {
System.out.println(previousLines+ &quot; found in &quot;+path1 +&quot; but not in &quot;+ path2);
previousLines.clear();
}
previousLines.add(line2);
line2=in2.readLine();
lastRead = 1;
} 

In case line1 is bigger than line2( line1 being current line from file1, line2 current line from file 2), it mean I'll next go to read only from second file. And in case in the past,I've read only from file1(not from both at same time or second one), all lines in previousLines should be listed. In previousLines, I add lines when they are different. lastRead keep track of the last file I read from(0 - both at same time, 1 - only first, 2-only second).

Late edit:
All method body, but as I mentioned in the comment,it didn't check what happen if I finish read from one file before another. As it is now it works fine if you set last line of file the same on both files. You can add further checks for readLine is null for one file or another.

void toTitleCase(Path path1, Path path2) {
try(BufferedReader in1= Files.newBufferedReader(path1);
BufferedReader in2= Files.newBufferedReader(path2)) {
String line1=in1.readLine(),line2=in2.readLine();
int lastRead=0;
List&lt;String&gt; previousLines=new ArrayList&lt;&gt;();
while(line1!=null &amp;&amp; line2!=null){
if(line1.compareTo(line2)&gt;0){
if(lastRead==1) {
System.out.println(previousLines+ &quot; found in &quot;+path1 +&quot; but not in &quot;+ path2);
previousLines.clear();
}
previousLines.add(line2);
line2=in2.readLine();
lastRead = 2;
} else if(line1.compareTo(line2)&lt;0){
if(lastRead==2) {
System.out.println(previousLines+ &quot; found in &quot;+path2 +&quot; but not in &quot;+ path1);
previousLines.clear();
}
previousLines.add(line1);
line1=in1.readLine();
lastRead = 1;
} else{
if(lastRead==2) {
System.out.println(previousLines+ &quot; found in &quot;+path2 +&quot; but not in &quot;+ path1);
}
if(lastRead==1) {
System.out.println(previousLines+ &quot; found in &quot;+path1 +&quot; but not in &quot;+ path2);
}
previousLines.clear();
line1=in1.readLine();
line2=in2.readLine();
lastRead=0;
}
}
} catch (IOException e) {
e.printStackTrace();
}
}

答案2

得分: 2

我认为这可能是一个有趣的问题,所以我整理了一些内容来说明不同应用程序的差异可能是如何工作的。

我有一个用于不同应用程序的单词文件。因此,我提取了前100个单词,并将每个单词的大小缩小到我可以轻松测试的大小。

单词列表1

aback
abandon
abandoned
abashed
abatement
abbey
abbot
abbreviate
abdomen
abducted
aberrant
aberration
abetted
abeyance

单词列表2

aardvark
aback
abacus
abandon
abatement
abbey
abbot
abbreviate
abdicate
abdomen
aberrant
aberration

我的示例应用程序产生两个不同的输出。这是我测试运行的第一个输出,完整的差异输出。

/word1.txt 和 /word2.txt 之间的差异
-----------------------------------------------------
------   插入   ----- | aardvark                 
aback                     | aback                   
------   插入   ----- | abacus                   
abandon                   | abandon                  
abandoned                 | ------   删除   ------
abashed                   | ------   删除   ------
abatement                 | abatement                
abbey                     | abbey                    
abbot                     | abbot                    
abbreviate                | abbreviate               
------   插入   ----- | abdicate                 
abdomen                   | abdomen                  
abducted                  | ------   删除   ------
aberrant                  | aberrant                 
aberration                | aberration               
abetted                   | ------   删除   ------
abeyance                  | ------   删除   ------

现在,对于两个非常长的文件,其中大部分文本将匹配,这个输出将很难阅读。因此,我还创建了一个缩写的输出。

/word1.txt 和 /word2.txt 之间的差异
-----------------------------------------------------
------   插入   ----- | aardvark                 
---------------   1行相同   --------------
------   插入   ----- | abacus                   
---------------   1行相同   --------------
abandoned                 | ------   删除   ------
abashed                   | ------   删除   ------
--------------   4行相同   -------------
------   插入   ----- | abdicate                 
---------------   1行相同   --------------
abducted                  | ------   删除   ------
--------------   2行相同   -------------
abetted                   | ------   删除   ------
abeyance                  | ------   删除   ------

对于这些小型测试文件,两个报告之间的差异不大。

对于两个大型文本文件,缩写的报告会更容易阅读。

以下是示例代码。

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
public class Difference {
public static void main(String[] args) {
String file1 = "/word1.txt";
String file2 = "/word2.txt";
try {
new Difference().compareFiles(file1, file2);
} catch (IOException e) {
e.printStackTrace();
}
}
private void compareFiles(String file1, String file2)
throws IOException {
int columnWidth = 25;
int pageWidth = columnWidth + columnWidth + 3;
boolean isFullReport = true;
System.out.println(getTitle(file1, file2));
System.out.println(getDashedLine(pageWidth));
System.out.println();
URL url1 = getClass().getResource(file1);
URL url2 = getClass().getResource(file2);
BufferedReader br1 = new BufferedReader(new InputStreamReader(
url1.openStream()));
BufferedReader br2 = new BufferedReader(new InputStreamReader(
url2.openStream()));
int countEqual = 0;
String line1 = br1.readLine();
String line2 = br2.readLine();
while (line1 != null && line2 != null) {
int result = line1.compareTo(line2);
if (result == 0) {
countEqual++;
if (isFullReport) {
System.out.println(getFullEqualsLine(columnWidth,
line1, line2));
}
line1 = br1.readLine();
line2 = br2.readLine();
} else if (result < 0) {
printEqualsLine(pageWidth, countEqual, isFullReport);
countEqual = 0;
System.out.println(getDifferenceLine(columnWidth,
line1, ""));
line1 = br1.readLine();
} else {
printEqualsLine(pageWidth, countEqual, isFullReport);
countEqual = 0;
System.out.println(getDifferenceLine(columnWidth,
"", line2));
line2 = br2.readLine();
}
}
printEqualsLine(pageWidth, countEqual, isFullReport);
while (line1 != null) {
System.out.println(getDifferenceLine(columnWidth,
line1, ""));
line1 = br1.readLine();
}
while (line2 != null) {
System.out.println(getDifferenceLine(columnWidth,
"", line2));
line2 = br2.readLine();
}
br1.close();
br2.close();
}
private void printEqualsLine(int pageWidth, int countEqual,
boolean isFullReport) {
if (!isFullReport && countEqual > 0) {
System.out.println(getEqualsLine(countEqual, pageWidth));
}
}
private String getTitle(String file1, String file2) {
return "Differences between " + file1 + " and " + file2;
}
private String getEqualsLine(int count, int length) {
String lines = "行相同";
if (count == 1) {
lines = "行相同";
}
String output = "   " + count + " " + lines +
"   ";
return getTextLine(length, output);
}
private String getFullEqualsLine(int columnWidth, String line1,
String line2) {
String format = "%-" + columnWidth + "s";
return String.format(format, line1) + " | " +
String.format(format, line2);
}
private String getDifferenceLine(int columnWidth, String line1,
String line2) {
String format = "%-" + columnWidth + "s";
String deleted = getTextLine(columnWidth, "   删除   ");
String inserted = getTextLine(columnWidth, "   插入   ");
if (line1.isEmpty()) {
return
英文:

I thought this might be an interesting problem, so I put something together to illustrate how a difference application might work.

I had a file of words for a different application. So, I grabbed the first 100 words and reduced the size of each down to something I could test with easily.

Word List 1

aback
abandon
abandoned
abashed
abatement
abbey
abbot
abbreviate
abdomen
abducted
aberrant
aberration
abetted
abeyance

Word List 2

aardvark
aback
abacus
abandon
abatement
abbey
abbot
abbreviate
abdicate
abdomen
aberrant
aberration

My example application produces two different outputs. Here's the first output from my test run, the full difference output.

Differences between /word1.txt and /word2.txt
-----------------------------------------------------
------   Inserted   ----- | aardvark                 
aback                     | aback                    
------   Inserted   ----- | abacus                   
abandon                   | abandon                  
abandoned                 | ------   Deleted   ------
abashed                   | ------   Deleted   ------
abatement                 | abatement                
abbey                     | abbey                    
abbot                     | abbot                    
abbreviate                | abbreviate               
------   Inserted   ----- | abdicate                 
abdomen                   | abdomen                  
abducted                  | ------   Deleted   ------
aberrant                  | aberrant                 
aberration                | aberration               
abetted                   | ------   Deleted   ------
abeyance                  | ------   Deleted   ------

Now, for two really long files, where most of the text will match, this output would be hard to read. So, I also created an abbreviated output.

Differences between /word1.txt and /word2.txt
-----------------------------------------------------
------   Inserted   ----- | aardvark                 
---------------   1 line is the same   --------------
------   Inserted   ----- | abacus                   
---------------   1 line is the same   --------------
abandoned                 | ------   Deleted   ------
abashed                   | ------   Deleted   ------
--------------   4 lines are the same   -------------
------   Inserted   ----- | abdicate                 
---------------   1 line is the same   --------------
abducted                  | ------   Deleted   ------
--------------   2 lines are the same   -------------
abetted                   | ------   Deleted   ------
abeyance                  | ------   Deleted   ------

With these small test files, there's not much difference between the two reports.

With two large text files, the abbreviated report would be a lot easier to read.

Here's the example code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
public class Difference {
public static void main(String[] args) {
String file1 = &quot;/word1.txt&quot;;
String file2 = &quot;/word2.txt&quot;;
try {
new Difference().compareFiles(file1, file2);
} catch (IOException e) {
e.printStackTrace();
}
}
private void compareFiles(String file1, String file2)
throws IOException {
int columnWidth = 25;
int pageWidth = columnWidth + columnWidth + 3;
boolean isFullReport = true;
System.out.println(getTitle(file1, file2));
System.out.println(getDashedLine(pageWidth));
System.out.println();
URL url1 = getClass().getResource(file1);
URL url2 = getClass().getResource(file2);
BufferedReader br1 = new BufferedReader(new InputStreamReader(
url1.openStream()));
BufferedReader br2 = new BufferedReader(new InputStreamReader(
url2.openStream()));
int countEqual = 0;
String line1 = br1.readLine();
String line2 = br2.readLine();
while (line1 != null &amp;&amp; line2 != null) {
int result = line1.compareTo(line2);
if (result == 0) {
countEqual++;
if (isFullReport) {
System.out.println(getFullEqualsLine(columnWidth,
line1, line2));
}
line1 = br1.readLine();
line2 = br2.readLine();
} else if (result &lt; 0) {
printEqualsLine(pageWidth, countEqual, isFullReport);
countEqual = 0;
System.out.println(getDifferenceLine(columnWidth,
line1, &quot;&quot;));
line1 = br1.readLine();
} else {
printEqualsLine(pageWidth, countEqual, isFullReport);
countEqual = 0;
System.out.println(getDifferenceLine(columnWidth,
&quot;&quot;, line2));
line2 = br2.readLine();
}
}
printEqualsLine(pageWidth, countEqual, isFullReport);
while (line1 != null) {
System.out.println(getDifferenceLine(columnWidth,
line1, &quot;&quot;));
line1 = br1.readLine();
}
while (line2 != null) {
System.out.println(getDifferenceLine(columnWidth,
&quot;&quot;, line2));
line2 = br2.readLine();
}
br1.close();
br2.close();
}
private void printEqualsLine(int pageWidth, int countEqual,
boolean isFullReport) {
if (!isFullReport &amp;&amp; countEqual &gt; 0) {
System.out.println(getEqualsLine(countEqual, pageWidth));
}
}
private String getTitle(String file1, String file2) {
return &quot;Differences between &quot; + file1 + &quot; and &quot; + file2;
}
private String getEqualsLine(int count, int length) {
String lines = &quot;lines are&quot;;
if (count == 1) {
lines = &quot;line is&quot;;
}
String output = &quot;   &quot; + count + &quot; &quot; + lines +
&quot; the same   &quot;;
return getTextLine(length, output);
}
private String getFullEqualsLine(int columnWidth, String line1,
String line2) {
String format = &quot;%-&quot; + columnWidth + &quot;s&quot;;
return String.format(format, line1) + &quot; | &quot; +
String.format(format, line2);
}
private String getDifferenceLine(int columnWidth, String line1,
String line2) {
String format = &quot;%-&quot; + columnWidth + &quot;s&quot;;
String deleted = getTextLine(columnWidth, &quot;   Deleted   &quot;);
String inserted = getTextLine(columnWidth, &quot;   Inserted   &quot;);
if (line1.isEmpty()) {
return inserted + &quot; | &quot; + String.format(format, line2);
} else {
return String.format(format, line1) + &quot; | &quot; + deleted;
}
}
private String getTextLine(int length, String output) {
int half2 = (length - output.length()) / 2;
int half1 = length - output.length() - half2;
output = getDashedLine(half1) + output;
output += getDashedLine(half2);
return output;
}
private String getDashedLine(int count) {
String output = &quot;&quot;;
for (int i = 0; i &lt; count; i++) {
output += &quot;-&quot;;
}
return output;
}
}

huangapple
  • 本文由 发表于 2020年8月21日 18:54:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/63521500.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定