How can I improve this search algorithm's runtime?


# Question

I'm trying to solve an interview problem I was given a few years ago, in preparation for upcoming interviews. The problem is outlined in a PDF that can be found [here][1]. I wrote a simple solution using depth-first search (DFS) that works fine for the example outlined in the document, but I haven't been able to get the program to meet the following criterion:

> Your code should produce correct answers in under a second for a 10,000 x 10,000 Geo GeoBlock containing 10,000 occupied Geos.

To test this I generated a CSV file with 10,000 random entries, and when I run the code against it, it takes just over 2 seconds on average to find the largest geo block. I'm not sure what improvements I could make to cut the runtime by more than half, other than running it on a faster laptop. From my investigation the search itself appears to take only about 8 ms, so perhaps the way I load the data into memory is the inefficient part.

I'd greatly appreciate any advice on how this could be improved. See the code below.

**GeoBlockAnalyzer**

```java
package analyzer.block.geo.main;

import analyzer.block.geo.model.Geo;
import analyzer.block.geo.result.GeoResult;

import java.awt.*;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.*;

public class GeoBlockAnalyzer {

  private static final DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");

  private final int width;
  private final int height;
  private final String csvFilePath;
  private GeoResult result = new GeoResult();
  // Map of the geo id and respective geo object
  private final Map<Integer, Geo> geoMap = new HashMap<>();
  // Map of coordinates to each geo in the grid
  private final Map<Point, Geo> coordMap = new HashMap<>();

  /**
   * Constructs a geo grid of the given width and height, populated with the geo data provided in
   * the csv file
   *
   * @param width the width of the grid
   * @param height the height of the grid
   * @param csvFilePath the csv file containing the geo data
   * @throws IOException
   */
  public GeoBlockAnalyzer(final int width, final int height, final String csvFilePath)
      throws IOException {
    if (!Files.exists(Paths.get(csvFilePath)) || Files.isDirectory(Paths.get(csvFilePath))) {
      throw new FileNotFoundException(csvFilePath);
    }
    if (width <= 0 || height <= 0) {
      throw new IllegalArgumentException("Input height or width is 0 or smaller");
    }
    this.width = width;
    this.height = height;
    this.csvFilePath = csvFilePath;
    populateGeoGrid();
    populateCoordinatesMap();
    calculateGeoNeighbours();
    // printNeighbours();
  }

  /** @return the largest geo block in the input grid */
  public GeoResult getLargestGeoBlock() {
    for (final Geo geo : this.geoMap.values()) {
      final List<Geo> visited = new ArrayList<>();
      search(geo, visited);
    }
    return this.result;
  }

  /**
   * Iterative DFS implementation to find largest geo block.
   *
   * @param geo the geo to be evaluated
   * @param visited list of visited geos
   */
  private void search(Geo geo, final List<Geo> visited) {
    final Deque<Geo> stack = new LinkedList<>();
    stack.push(geo);
    while (!stack.isEmpty()) {
      geo = stack.pop();
      if (visited.contains(geo)) {
        continue;
      }
      visited.add(geo);

      final List<Geo> neighbours = geo.getNeighbours();
      for (int i = neighbours.size() - 1; i >= 0; i--) {
        final Geo g = neighbours.get(i);
        if (!visited.contains(g)) {
          stack.push(g);
        }
      }
    }
    if (this.result.getSize() < visited.size()) {
      this.result = new GeoResult(visited);
    }
  }

  /**
   * Creates a map of the geo grid from the csv file data
   *
   * @throws IOException
   */
  private void populateGeoGrid() throws IOException {
    try (final BufferedReader br = Files.newBufferedReader(Paths.get(this.csvFilePath))) {
      int lineNumber = 0;
      String line = "";
      while ((line = br.readLine()) != null) {
        lineNumber++;
        final String[] geoData = line.split(",");
        LocalDate dateOccupied = null;
        // Handle for empty csv cells
        for (int i = 0; i < geoData.length; i++) {
          // Remove leading and trailing whitespace
          geoData[i] = geoData[i].replace(" ", "");
          if (geoData[i].isEmpty() || geoData.length > 3) {
            throw new IllegalArgumentException(
                "There is missing data in the csv file at line: " + lineNumber);
          }
        }
        try {
          dateOccupied = LocalDate.parse(geoData[2], formatter);
        } catch (final DateTimeParseException e) {
          throw new IllegalArgumentException("The input date is invalid on line: " + lineNumber);
        }
        this.geoMap.put(
            Integer.parseInt(geoData[0]),
            new Geo(Integer.parseInt(geoData[0]), geoData[1], dateOccupied));
      }
    }
  }

  /** Create a map of each coordinate in the grid to its respective geo */
  private void populateCoordinatesMap() {
    // Using the geo id, calculate its point on the grid
    for (int i = this.height - 1; i >= 0; i--) {
      int blockId = (i * this.width);
      for (int j = 0; j < this.width; j++) {
        if (this.geoMap.containsKey(blockId)) {
          final Geo geo = this.geoMap.get(blockId);
          geo.setCoordinates(i, j);
          this.coordMap.put(geo.getCoordinates(), geo);
        }
        blockId++;
      }
    }
  }

  private void calculateGeoNeighbours() {
    for (final Geo geo : this.geoMap.values()) {
      addNeighboursToGeo(geo);
    }
  }

  private void addNeighboursToGeo(final Geo geo) {
    final int x = geo.getCoordinates().x;
    final int y = geo.getCoordinates().y;
    final Point[] possibleNeighbours = {
      new Point(x, y + 1), new Point(x - 1, y), new Point(x + 1, y), new Point(x, y - 1)
    };
    Geo g;
    for (final Point p : possibleNeighbours) {
      if (this.coordMap.containsKey(p)) {
        g = this.coordMap.get(p);
        if (g != null) {
          geo.getNeighbours().add(g);
        }
      }
    }
  }

  private void printNeighbours() {
    for (final Geo geo : this.geoMap.values()) {
      System.out.println("Geo " + geo.getId() + " has the following neighbours: ");
      for (final Geo g : geo.getNeighbours()) {
        System.out.println(g.getId());
      }
    }
  }
}
```

**GeoResult**

```java
package analyzer.block.geo.result;

import analyzer.block.geo.model.Geo;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GeoResult {

  private final List<Geo> geosInBlock = new ArrayList<>();

  public GeoResult() {
  }

  public GeoResult(final List<Geo> geosInBlock) {
    this.geosInBlock.addAll(geosInBlock);
  }

  public List<Geo> getGeosInBlock() {
    this.geosInBlock.sort(Comparator.comparingInt(Geo::getId));
    return this.geosInBlock;
  }

  public int getSize() {
    return this.geosInBlock.size();
  }

  @Override
  public String toString() {
    final StringBuilder sb = new StringBuilder();
    sb.append("The geos in the largest cluster of occupied Geos for this GeoBlock are: \n");
    for (final Geo geo : this.geosInBlock) {
      sb.append(geo.toString()).append("\n");
    }
    return sb.toString();
  }
}
```

**Geo**

```java
package analyzer.block.geo.model;

import java.awt.Point;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class Geo {

  private final int id;
  private final String name;
  private final LocalDate dateOccupied;
  private final Point coordinate;
  private final List<Geo> neighbours = new ArrayList<>();

  public Geo(final int id, final String name, final LocalDate dateOccupied) {
    this.id = id;
    this.name = name;
    this.dateOccupied = dateOccupied;
    this.coordinate = new Point();
  }

  public int getId() {
    return this.id;
  }

  public String getName() {
    return this.name;
  }

  public LocalDate getDateOccupied() {
    return this.dateOccupied;
  }

  public void setCoordinates(final int x, final int y) {
    this.coordinate.setLocation(x, y);
  }

  public Point getCoordinates() {
    return this.coordinate;
  }

  public String toString() {
    return this.id + ", " + this.name + ", " + this.dateOccupied;
  }

  public List<Geo> getNeighbours() {
    return this.neighbours;
  }

  @Override
  public int hashCode() {
    return Objects.hash(this.id, this.name, this.dateOccupied);
  }

  @Override
  public boolean equals(final Object obj) {
    if (this == obj) {
      return true;
    }
    if (obj == null || this.getClass() != obj.getClass()) {
      return false;
    }
    final Geo geo = (Geo) obj;
    return this.id == geo.getId()
        && this.name.equals(geo.getName())
        && this.dateOccupied == geo.getDateOccupied();
  }
}
```
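For reference, a minimal harness for driving and timing the analyzer might look like the sketch below. The file name, grid size, and the `Main` class are placeholders; it assumes only the constructor and `getLargestGeoBlock()` shown above, and a CSV whose rows have the shape `4, Tom, 2010-10-10` that the parser expects:

```java
import analyzer.block.geo.main.GeoBlockAnalyzer;
import analyzer.block.geo.result.GeoResult;

import java.io.IOException;

public class Main {
  public static void main(final String[] args) throws IOException {
    final long start = System.nanoTime();
    // Load and index the CSV ("geos.csv" is a placeholder for the generated test file)
    final GeoBlockAnalyzer analyzer = new GeoBlockAnalyzer(10_000, 10_000, "geos.csv");
    final long loaded = System.nanoTime();

    // Run the DFS over every occupied Geo
    final GeoResult largest = analyzer.getLargestGeoBlock();
    final long searched = System.nanoTime();

    System.out.println("Load/setup: " + (loaded - start) / 1_000_000 + " ms");
    System.out.println("Search:     " + (searched - loaded) / 1_000_000 + " ms");
    System.out.println("Largest block size: " + largest.getSize());
  }
}
```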

[1]: https://drive.google.com/file/d/16xXiucI2CzhWU03EDU6FukK1YumNbRno/view?usp=sharing
# Answer 1
**Score**: 1
Without testing, it seems to me that the main bottleneck here is the literal creation of the map, which could be up to 100,000,000 cells. There would be no need for that if instead we labeled each CSV entry and had a function getNeighbours(id, width, height) that returned the list of possible neighbour IDs (think modular arithmetic). As we iterate over each CSV entry in turn: (1) if neighbour IDs were already seen that all had the same label, we'd label the new ID with that label; (2) if no neighbours were seen, we'd use a new label for the new ID; and (3) if two or more different labels existed among seen neighbour IDs, we'd combine them into one label (say the minimal label) by keeping a hash that maps each label to its "final" label. Also store the sum and size for each label. Your current solution is O(n), where n is width x height. The idea here would be O(n), where n is the number of occupied Geos.

Here's something really crude in Python that I wouldn't expect to have all scenarios handled, but it could hopefully give you an idea (sorry, I don't know Java):
```python
def get_neighbours(id, width, height):
  neighbours = []

  if id % width != 0:
    neighbours.append(id - 1)
  if (id + 1) % width != 0:
    neighbours.append(id + 1)
  if id - width >= 0:
    neighbours.append(id - width)
  if id + width < width * height:
    neighbours.append(id + width)

  return neighbours


def f(data, width, height):
  ids = {}
  labels = {}
  current_label = 0

  for line in data:
    [idx, name, dt] = line.split(",")
    idx = int(idx)
    this_label = None
    neighbours = get_neighbours(idx, width, height)
    no_neighbour_was_seen = True

    for n in neighbours:
      # A neighbour was seen
      if n in ids:
        no_neighbour_was_seen = False

        # We have yet to assign a label to this ID
        if not this_label:
          this_label = ids[n]["label"]
          ids[idx] = {"label": this_label, "data": name + " " + dt}
          final_label = labels[this_label]["label"]
          labels[final_label]["size"] += 1
          labels[final_label]["sum"] += idx
          labels[final_label]["IDs"] += [idx]

        # This neighbour has yet to be connected
        elif ids[n]["label"] != this_label:
          old_label = ids[n]["label"]
          old_obj = labels[old_label]
          final_label = labels[this_label]["label"]
          ids[n]["label"] = final_label
          labels[final_label]["size"] += old_obj["size"]
          labels[final_label]["sum"] += old_obj["sum"]
          labels[final_label]["IDs"] += old_obj["IDs"]
          del labels[old_label]

    if no_neighbour_was_seen:
      this_label = current_label
      current_label += 1
      ids[idx] = {"label": this_label, "data": name + " " + dt}
      labels[this_label] = {"label": this_label, "size": 1, "sum": idx, "IDs": [idx]}

  for i in ids:
    print(i, ids[i]["label"], ids[i]["data"])
  print("")
  for i in labels:
    print(i)
    print(labels[i])

  return labels, ids


data = [
  "4, Tom, 2010-10-10",
  "5, Katie, 2010-08-24",
  "6, Nicole, 2011-01-09",
  "11, Mel, 2011-01-01",
  "13, Matt, 2010-10-14",
  "15, Mel, 2011-01-01",
  "17, Patrick, 2011-03-10",
  "21, Catherine, 2011-02-25",
  "22, Michael, 2011-02-25"
]

f(data, 4, 7)
print("")
f(data, 7, 4)
```

Output:

"""
4 0  Tom  2010-10-10
5 0  Katie  2010-08-24
6 0  Nicole  2011-01-09
11 1  Mel  2011-01-01
13 2  Matt  2010-10-14
15 1  Mel  2011-01-01
17 2  Patrick  2011-03-10
21 2  Catherine  2011-02-25
22 2  Michael  2011-02-25

0
{'sum': 15, 'size': 3, 'IDs': [4, 5, 6], 'label': 0}
1
{'sum': 26, 'size': 2, 'IDs': [11, 15], 'label': 1}
2
{'sum': 73, 'size': 4, 'IDs': [13, 17, 21, 22], 'label': 2}
"""

"""
4 0  Tom  2010-10-10
5 0  Katie  2010-08-24
6 0  Nicole  2011-01-09
11 0  Mel  2011-01-01
13 0  Matt  2010-10-14
15 3  Mel  2011-01-01
17 2  Patrick  2011-03-10
21 3  Catherine  2011-02-25
22 3  Michael  2011-02-25

0
{'sum': 39, 'size': 5, 'IDs': [4, 5, 6, 11, 13], 'label': 0}
2
{'sum': 17, 'size': 1, 'IDs': [17], 'label': 2}
3
{'sum': 58, 'size': 3, 'IDs': [21, 22, 15], 'label': 3}
"""

# Answer 2

**Score**: 1

The major optimization available here is a conceptual one. Unfortunately, this type of optimization is not easy to teach, nor to look up in a reference somewhere. The principle being used here is:

> It's (almost always) cheaper to use an analytic formula to compute a known result than to (pre)compute it. [1]

It's clear from your code and the definition of your problem that you are not taking advantage of this principle or of the problem specification. In particular, one of the key points taken directly from the problem specification is this:

> Your code should produce correct answers in under a second for a 10,000 x 10,000 Geo GeoBlock containing 10,000 occupied Geos.

When you read this statement, a few things should go through your mind (when thinking about runtime efficiency):

  • 10,000^2 is a much larger number than 10,000 (exactly 10,000 times larger!). There is a clear efficiency gain if you can maintain an algorithm that is O(n), as opposed to O(n^2), in the expected case (because of the use of hashing).
  • Touching the entire grid (i.e. performing even an O(1) operation on every cell) immediately yields an O(n^2) algorithm; clearly, this must be avoided if possible.
  • From the problem statement, we should never expect to have to touch O(n^2) Geos. This should be a major hint as to what the person who wrote the problem is looking for. BFS or DFS is an O(N+M) algorithm, where N and M are the number of nodes and edges touched, so we should be expecting an O(n) search.
  • Based on the above points, the solution being looked for here should be O(10,000) for a problem input with grid size 10,000 x 10,000 and 10,000 occupied Geos.

The solution you provided is O(n^2) because:

  1. You use visited.contains, where visited is a List, so every membership check is a linear scan. This is not showing up as a problem in your testing because I suspect you are using small geo clusters. Try a large cluster (one with 10,000 geos); you should see a major slowdown compared to, say, a largest cluster of 3 geos. The fix is to use an efficient data structure for visited, such as a bit set (java.util.BitSet) or a hash set (java.util.HashSet); see the sketch after this list. Because you did not notice this in testing, it suggests you are not vetting your code with enough varied corner cases; any thorough testing and profiling should have surfaced it immediately. As per my comment, I would have liked to see this kind of groundwork/profiling done before the question was posted.
  2. You touch the entire 10,000 x 10,000 grid in populateCoordinatesMap. That is already O(n^2), where n = 10,000. Notice that the only place coordMap is used outside populateCoordinatesMap is addNeighboursToGeo. This is a major bottleneck, and needlessly so: addNeighboursToGeo can be computed in O(1) time per geo without any coordMap. However, we can still use your code as is with the minor modification given below.
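A minimal sketch of the fix for (1), assuming the Geo, GeoResult, result and geoMap members from the question (Geo already overrides hashCode/equals, so it works as a HashSet element); only the type of the visited container changes:

```java
// Drop-in replacement for search() in GeoBlockAnalyzer; java.util.* is already imported there.
private void search(Geo geo, final Set<Geo> visited) {
  final Deque<Geo> stack = new LinkedList<>();
  stack.push(geo);
  while (!stack.isEmpty()) {
    geo = stack.pop();
    if (!visited.add(geo)) { // add() returns false if already visited; expected O(1) per check
      continue;
    }
    final List<Geo> neighbours = geo.getNeighbours();
    for (int i = neighbours.size() - 1; i >= 0; i--) {
      final Geo g = neighbours.get(i);
      if (!visited.contains(g)) {
        stack.push(g);
      }
    }
  }
  if (this.result.getSize() < visited.size()) {
    // GeoResult takes a List, so copy the set
    this.result = new GeoResult(new ArrayList<>(visited));
  }
}
```

getLargestGeoBlock() would then pass a `new HashSet<>()` instead of a `new ArrayList<>()` for each start geo.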

I hope it is obvious how to fix (1). To fix (2), replace populateCoordinatesMap with:

```java
/** Create a map of each coordinate in the grid to its respective geo */
private void populateCoordinatesMap() {
  for (final Map.Entry<Integer, Geo> entry : this.geoMap.entrySet()) {
    final int key = entry.getKey();
    final Geo value = entry.getValue();
    final int x = key % this.width;
    final int y = key / this.width;
    value.setCoordinates(x, y);
    this.coordMap.put(value.getCoordinates(), value);
  }
}
```

Notice the principle being put to use here. Instead of iterating over the entire grid as before (immediately O(n^2)), this iterates only over the occupied Geos and uses the analytic formula for indexing into a 2D array (rather than doing copious computation to arrive at the same thing). Effectively, this change improves populateCoordinatesMap from O(n^2) to O(n).
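Taking the same idea one step further (not required for the fix above, just a sketch): coordMap could be dropped entirely by deriving the four neighbour IDs arithmetically from each geo's id and looking them up in geoMap, so only occupied Geos are ever touched. The addIfOccupied helper below is hypothetical:

```java
// Possible replacement for calculateGeoNeighbours()/addNeighboursToGeo(): neighbour IDs are
// computed from the row-major id, so no coordMap (and no full-grid pass) is needed.
private void calculateGeoNeighbours() {
  for (final Geo geo : this.geoMap.values()) {
    final int id = geo.getId();
    if (id % this.width != 0) {                        // not in the first column
      addIfOccupied(geo, id - 1);
    }
    if ((id + 1) % this.width != 0) {                  // not in the last column
      addIfOccupied(geo, id + 1);
    }
    if (id - this.width >= 0) {                        // not in the first row
      addIfOccupied(geo, id - this.width);
    }
    if (id + this.width < this.width * this.height) {  // not in the last row
      addIfOccupied(geo, id + this.width);
    }
  }
}

private void addIfOccupied(final Geo geo, final int neighbourId) {
  final Geo g = this.geoMap.get(neighbourId);
  if (g != null) {
    geo.getNeighbours().add(g);
  }
}
```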

Some general and opinionated comments below:

  • Overall, I strongly disagree with using an object-oriented approach over a procedural one for this problem. I think the OO approach is completely unjustified for how simple this code should be, but I understand that the interviewer wanted to see it.
  • This is a very simple problem you are trying to solve, and I think the object-oriented approach you took here confounds it so much that you could not see the forest for the trees (or perhaps the trees for the forest). A much simpler approach could have been taken in how this algorithm was implemented, even within an object-oriented design.
  • It is clear from the points above that you could benefit from knowing the tools available in the language you are working in. By this I mean you should know which containers are readily available and what the trade-offs are for each operation on each container. You should also know at least one decent profiling tool for the language you are working with if you are going to be optimizing code. Given that you did not post a profiling summary, even after I asked for it, it suggests you do not yet know of such a tool for Java. Learn one.

[1] I provide no reference for this principle because it is a first principle: running fewer constant-time operations is cheaper than running many, and the assumption is that the known analytic form requires less computation. There are occasional exceptions to this rule, but they are almost always due to hardware limitations or advantages. For example, when computing the Hamming distance it is cheaper to use a precomputed lookup table for the population count on a hardware architecture without access to SSE registers/operations.
