Java: Most efficient way to loop through CSV and sum values of one column for each unique value in another Column

Question

I have a CSV file with 500,000 rows of data and 22 columns. The data represents all commercial flights in the USA for one year. I have been tasked with finding the tail number of the plane that flew the most miles in the data set. Column 5 contains the airplane's tail number for each flight. Column 22 contains the total distance traveled.

Please see my extractQ3 method below. First, I created a HashMap for the whole CSV using the createHashMap() method. Then I ran a for loop to identify every unique tail number in the dataset and stored them in an ArrayList called tailNumbers. Then, for each unique tail number, I looped through the entire HashMap to calculate the total miles for that tail number.

The code runs fine on smaller datasets, but once the size increased to 500,000 rows it became horribly inefficient and takes an eternity to run. Can anyone suggest a faster way to do this?

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Scanner;

public class FlightData {

    HashMap<String, String[]> dataMap;

    public static void main(String[] args) {
        FlightData map1 = new FlightData();
        map1.dataMap = map1.createHashMap();
        String answer = map1.extractQ3(map1);
    }

    public String extractQ3(FlightData map1) {
        ArrayList<String> tailNumbers = new ArrayList<String>();
        ArrayList<Integer> tailMiles = new ArrayList<Integer>();

        // Fill the list with every unique tail number
        for (String[] value : map1.dataMap.values()) {
            if (!tailNumbers.contains(value[4])) {
                tailNumbers.add(value[4]);
            }
        }

        // For each unique tail number, scan the whole map again and sum its miles
        for (int i = 0; i < tailNumbers.size(); i++) {
            String tempName = tailNumbers.get(i);
            int miles = 0;
            for (String[] value : map1.dataMap.values()) {
                // Only rows whose column at index 19 is "0" are counted
                if (value[4].contentEquals(tempName) && value[19].contentEquals("0")) {
                    miles = miles + Integer.parseInt(value[21]);
                }
            }
            tailMiles.add(miles);
        }

        // The largest total and its tail number share the same list index
        Integer maxVal = Collections.max(tailMiles);
        Integer maxIdx = tailMiles.indexOf(maxVal);
        String maxPlane = tailNumbers.get(maxIdx);
        return maxPlane;
    }

    public HashMap<String, String[]> createHashMap() {
        File flightFile = new File("flights_small.csv");
        HashMap<String, String[]> flightsMap = new HashMap<String, String[]>();
        try {
            Scanner s = new Scanner(flightFile);
            while (s.hasNextLine()) {
                String info = s.nextLine();
                String[] piecesOfInfo = info.split(",");
                // Setting the key
                String flightKey = piecesOfInfo[4] + "_" + piecesOfInfo[2] + "_" + piecesOfInfo[11];
                String[] values = Arrays.copyOfRange(piecesOfInfo, 0, piecesOfInfo.length);
                flightsMap.put(flightKey, values);
            }
            s.close();
        } catch (FileNotFoundException e) {
            System.out.println("Cannot open: " + flightFile);
        }
        return flightsMap;
    }
}

Answer 1

Score: 1

The answer depends on what you mean by "most efficient", "horribly inefficient" and "takes an eternity". These are subjective terms. The answer may also depend on specific technical factors (speed vs. memory consumption; the number of unique flight keys compared to the number of overall records; etc.).

I would recommend applying some basic streamlining to your code, to start with. See if that gets you a better (acceptable) result. If you need more, then you can consider more advanced improvements.

Whatever you do, take some timings to understand the broad impacts of any changes you make.

Focus on going from "horrible" to "acceptable" - and then worry about more advanced tuning after that (if you still need it).

Consider using a BufferedReader instead of a Scanner, although the Scanner may be just fine for your needs if it is not the bottleneck.
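As a rough sketch of what the BufferedReader suggestion could look like: the class name `CsvLineReader` and the method `forEachRow` are made up for illustration, and the code assumes a simple CSV with no quoted fields that contain commas. It takes a `Reader` so it can be tried on an in-memory string; for the real file you would pass something like `new FileReader("flights_small.csv")`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

class CsvLineReader {

    // Reads the source line by line and hands each split row to the callback.
    // Returns the number of non-empty rows processed.
    // Assumption: plain comma-separated values, no quoted fields.
    static int forEachRow(Reader source, Consumer<String[]> rowHandler) {
        int rows = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // ignore blank lines
                }
                rowHandler.accept(line.split(","));
                rows++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Tiny in-memory sample standing in for the real file.
        String sample = "a,b,c\nd,e,f\n";
        int rows = forEachRow(new StringReader(sample),
                row -> System.out.println(String.join("|", row)));
        System.out.println(rows + " rows read");
    }
}
```

Whether this is measurably faster than Scanner on a 500,000-row file is something to confirm with timings rather than assume.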

Consider using logic within your scanner loop to capture tail numbers and accumulated mileage in one pass of the data. The following is deliberately basic, for clarity and simplicity:

// Declared once, before the scanner loop.
// The string is a tail number.
// The integer holds the accumulated miles flown for that tail number:
Map<String, Integer> planeMileages = new HashMap<>();

// Inside the scanner loop, once tailNumber and flightMileage have been parsed from the current row:
if (planeMileages.containsKey(tailNumber)) {
    // add miles to existing total:
    int accumulatedMileage = planeMileages.get(tailNumber) + flightMileage;
    planeMileages.put(tailNumber, accumulatedMileage);
} else {
    // capture new tail number:
    planeMileages.put(tailNumber, flightMileage);
}
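For a fuller picture, here is a minimal self-contained sketch of the same one-pass idea. It swaps the containsKey/put pair for `Map.merge`, which is my substitution rather than part of the answer. The class and method names are hypothetical, the column indexes (4 for the tail number, 21 for the distance) come from the question, and the question's extra filter on column index 19 is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

class MileageTotals {

    // Column positions as described in the question (zero-based).
    static final int TAIL_COL = 4;
    static final int DISTANCE_COL = 21;

    // Builds tail number -> total miles in a single pass over the lines.
    static Map<String, Integer> totalMilesByTail(Iterable<String> lines) {
        Map<String, Integer> totals = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (cols.length <= DISTANCE_COL) {
                continue; // skip short or malformed rows
            }
            int miles;
            try {
                miles = Integer.parseInt(cols[DISTANCE_COL].trim());
            } catch (NumberFormatException e) {
                continue; // skip rows without a numeric distance, e.g. a header
            }
            // Adds miles to the existing total, or stores it if the tail number is new.
            totals.merge(cols[TAIL_COL], miles, Integer::sum);
        }
        return totals;
    }
}
```

Each row is touched once, so the work grows with the number of rows rather than with rows multiplied by unique tail numbers.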

After that, once you have completed the scanner loop, you can iterate over your planeMileages to find the largest mileage:

String maxMilesTailNumber = null; // stays null if the map is empty
int maxMiles = 0;
for (Map.Entry<String, Integer> entry : planeMileages.entrySet()) {
    int planeMiles = entry.getValue();
    if (planeMiles > maxMiles) {
        maxMilesTailNumber = entry.getKey();
        maxMiles = planeMiles;
    }
}

WARNING - This approach is just for illustration. It will only capture one tail number. There could be multiple planes with the same maximum mileage. You would have to adjust your logic to capture multiple "winners".
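One possible way to handle ties, sketched under the assumption that the totals are already in a `Map<String, Integer>` like planeMileages. The class name `MaxMileage` and the method `tailsWithMaxMiles` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MaxMileage {

    // Returns every tail number that shares the highest total mileage.
    // An empty map yields an empty list.
    static List<String> tailsWithMaxMiles(Map<String, Integer> planeMileages) {
        List<String> winners = new ArrayList<>();
        int maxMiles = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : planeMileages.entrySet()) {
            int miles = entry.getValue();
            if (miles > maxMiles) {
                // New best: discard earlier winners and start over.
                maxMiles = miles;
                winners.clear();
                winners.add(entry.getKey());
            } else if (miles == maxMiles) {
                // Tie with the current best: keep both.
                winners.add(entry.getKey());
            }
        }
        return winners;
    }
}
```

The order of the returned list follows the map's iteration order, which for a HashMap is not defined, so sort it if a stable order matters.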

The above approach removes the need for several of your existing data structures, and related processing.

If you still face problems, put in some timers to see which specific areas of your code are slowest - and then you will have more specific tuning opportunities you can focus on.
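A simple way to add such timers is `System.nanoTime()` around each step. The helper below is only a sketch; `StepTimer` and `timed` are invented names, and wall-clock timings like these are rough indications, not a proper benchmark.

```java
import java.util.function.Supplier;

class StepTimer {

    // Runs the step, prints how long it took, and returns its result.
    static <T> T timed(String label, Supplier<T> step) {
        long start = System.nanoTime();
        T result = step.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // Stand-in workload; in the question this would wrap
        // createHashMap() and extractQ3() separately.
        Integer sum = timed("summing", () -> {
            int total = 0;
            for (int i = 0; i < 1_000_000; i++) {
                total += i % 7;
            }
            return total;
        });
        System.out.println("result: " + sum);
    }
}
```

Timing the file read and the aggregation separately should show quickly which of the two dominates.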

Answer 2

Score: 0

I suggest you use the Java 8 Stream API, so that you can take advantage of parallel streams.
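A minimal sketch of how that might look, assuming the column layout from the question (index 4 for the tail number, index 21 for the distance), no header row, and numeric distances throughout. `StreamMileage` and `topTail` are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamMileage {

    // Groups rows by tail number, sums the distance column,
    // and returns the tail number with the largest total (if any).
    static Optional<String> topTail(Stream<String> lines) {
        Map<String, Integer> totals = lines
                .map(line -> line.split(","))
                .filter(cols -> cols.length > 21) // skip short rows
                .collect(Collectors.groupingBy(
                        cols -> cols[4],
                        Collectors.summingInt(cols -> Integer.parseInt(cols[21]))));
        return totals.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

For the real file, the stream could come from `Files.lines(Paths.get("flights_small.csv"))` inside a try-with-resources block, with `.parallel()` added before passing it in. Whether the parallel version is actually faster depends on the machine and on how much of the time goes to reading the file, so it is worth measuring both.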

huangapple
  • Posted on March 15, 2020 at 08:47:25
  • Please keep the link to this article when reposting: https://go.coder-hub.com/60688729.html