广度优先搜索五个字母的单词链。

huangapple go评论82阅读模式
英文:

BFS five letter word chain

问题

以下是翻译好的内容:

我需要在关于 BFS 单词链的任务方面寻求一些帮助。
这个单词链基于五个字母的单词,两个单词在最后四个字母中的情况下被视为连接在一起。例如,climb 和 blimp 是连接在一起的,因为 climb 中的 l、i、m 和 b 存在于单词 blimp 中。

推荐使用 Sedgewick 的《算法(第四版)》中的有向 BFS,或对其进行修改。代码可以在这里找到:https://algs4.cs.princeton.edu/40graphs/
并使用以下代码从数据文件中读取到单词列表中:

BufferedReader r =
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
ArrayList<String> words = new ArrayList<String>();
while (true) {
    String word = r.readLine();
    if (word == null) { break; }
    assert word.length() == 5;  // 输入检查,如果你启用了断言
    words.add(word);
}

以及以下代码从文件中读取测试用例:

BufferedReader r = 
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
while (true) {
    String line = r.readLine();
    if (line == null) { break; }
    assert line.length() == 11; // 输入检查,如果你启用了断言
    String start = line.substring(0, 5);
    String goal = line.substring(6, 11);
    // ... 在这里搜索从起始单词到目标单词的路径
}

数据文件中的单词为:

their
moist
other
blimp
limps
about
there
pismo
abcde
bcdez
zcdea
bcdef
fzcde

当使用测试用例文件时...

other there
other their
their other
blimp moist
limps limps
moist limps
abcde zcdea

...输出应该是每对单词之间的边数,如果单词之间没有路径,则为 -1。

1
1
-1
3
0
-1
2

我刚开始接触图的工作,不太清楚如何使用 Sedgewick 的 BFS 并修改它以读取测试用例文件。对于任何帮助,我都非常感谢。

英文:

I am in need of some help with an assignment regarding a BFS word chain.
The word chain is based on five-letter words, two words are connected when the last four letters in word x are in word y. For example climb and blimp are connected because l, i, m, and b in climb are in the word blimp.

The recommendation is to use directed BFS from Sedgewick's algorithms 4th edition or to modify it. The code can be found here: https://algs4.cs.princeton.edu/40graphs/ and to use the following code to read
a data file to the list words:

BufferedReader r =
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
ArrayList&lt;String&gt; words = new ArrayList&lt;String&gt;();
while (true) {
    String word = r.readLine();
    if (word == null) { break; }
    assert word.length() == 5;  // inputcheck, if you run with assertions
    words.add(word);
}

and the following code to read the test cases from a file:

BufferedReader r = 
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
while (true) {
    String line = r.readLine();
    if (line == null) { break; }
    assert line.length() == 11; // inputcheck, if you run with assertions
    String start = line.substring(0, 5);
    String goal = line.substring(6, 11);
    // ... search path from start to goal here
}

The words in the data file are:

their
moist
other
blimp
limps
about
there
pismo
abcde
bcdez
zcdea
bcdef
fzcde

When the test case file is used...

other there
other their
their other
blimp moist
limps limps
moist limps
abcde zcdea

...the output should be the number of edges between each word pair, and -1 if there is no path between the words.

1
1
-1
3
0
-1
2

I am new to working with graphs and I am not sure how to use Sedgewick's BFS and modify it to read the test cases file. Any help is appreciated.

答案1

得分: 5

假设n是数据集中的单词数目。

首先,根据给定条件,我们需要为所有上述单词构建一个邻接表,即只有当x的最后四个字母出现在y中时,xy之间才有一条边。构建这个邻接表的时间复杂度是O(n^2 * w),其中w是数据集中每个单词的平均长度。

其次,我们只需要在测试数据上进行传统的BFS遍历。

以下是main函数:

    public static void main(String[] args) throws IOException {
        // 从数据集中获取单词
        List<String> words = readData();
        // 获取要测试的单词对
        List<List<String>> testData = getTestData();
        // 构建邻接表
        Map<String, List<String>> adj = getAdjacencyList(words);
        
        // 对于每个测试,进行传统的BFS
        for (List<String> test : testData) {
            System.out.println(bfs(adj, test));
        }
    }

以下是根据给定条件构建邻接表的函数:

    public static Map<String, List<String>> getAdjacencyList(List<String> words) {
        Map<String, List<String>> adj = new HashMap<>();
        for (int i = 0; i < words.size(); ++i) {
            String word = words.get(i);
            adj.put(word, adj.getOrDefault(word, new ArrayList<>()));
            for (int j = 0; j < words.size(); ++j) {
                if (i == j) continue;
                int count = 0;
                String other = words.get(j);
                for (int k = 1; k < 5; ++k) {
                    count += other.indexOf(word.charAt(k)) != -1 ? 1 : 0;
                }
                // 如果满足条件,则从`word`到`other`存在一条边
                if (count >= 4)
                    adj.get(word).add(other);
            }
        }

        return adj;
    }

以下是BFS函数:

    public static int bfs(Map<String, List<String>> adj, List<String> test) {
        Queue<String> q = new LinkedList<>();
        Set<String> visited = new HashSet<>(); // 用于跟踪已访问的单词,因为图不一定是DAG
        String start = test.get(0);
        String end = test.get(1);
        // 如果`start`和`end`单词相等
        if (start.equals(end))
            return 0;

        q.add(start);
        visited.add(start);
        int count = 0;
        while (!q.isEmpty()) {
            count++;
            int size = q.size();
            for (int i = 0; i < size; ++i) {
                String word = q.poll();
                for (String val : adj.get(word)) {
                    if (val.equals(end))
                        return count; // 返回边的数量
                    if (!visited.contains(val)) // 仅添加尚未访问的单词。
                        q.add(val);
                }
            }
        }
        return -1; // 如果没有边
    }
英文:

Let's assume n is the number of words in the dataset.

Firstly, we need to build an adjacency list for all the above words according to the given condition, i.e. there's an edge between x and y if and only if the last four letters of x are present in y. Building this adjacency list is an O(n^2 * w) operation, where w is the average size of each word in the dataset.

Secondly, all we have to do is a traditional BFS over the test data.

Here's the main function:

    public static void main(String[] args) throws IOException {
        // get words from dataset
        List&lt;String&gt; words = readData();
        // get the word pairs to test
        List&lt;List&lt;String&gt;&gt; testData = getTestData();
        // form an adjacency list
        Map&lt;String, List&lt;String&gt;&gt; adj = getAdjacencyList(words);
        
        // for each test, do a traditional BFS
        for (List&lt;String&gt; test : testData) {
            System.out.println(bfs(adj, test));
        }
    }

Here's the function to build an adjacency list according to the given condition:

    public static Map&lt;String, List&lt;String&gt;&gt; getAdjacencyList(List&lt;String&gt; words) {
        Map&lt;String, List&lt;String&gt;&gt; adj = new HashMap&lt;&gt;();
        for (int i = 0; i &lt; words.size(); ++i) {
            String word = words.get(i);
            adj.put(word, adj.getOrDefault(word, new ArrayList&lt;&gt;()));
            for (int j = 0; j &lt; words.size(); ++j) {
                if (i == j) continue;
                int count = 0;
                String other = words.get(j);
                for (int k = 1; k &lt; 5; ++k) {
                    count += other.indexOf(word.charAt(k)) != -1 ? 1 : 0;
                }
                // if the condition is satisfied, there exists an edge from `word` to `other`
                if (count &gt;= 4)
                    adj.get(word).add(other);
            }
        }

        return adj;
    }

And here's the BFS:

    public static int bfs(Map&lt;String, List&lt;String&gt;&gt; adj, List&lt;String&gt; test) {
        Queue&lt;String&gt; q = new LinkedList&lt;&gt;();
        Set&lt;String&gt; visited = new HashSet&lt;&gt;(); // to keep track of the visited words, since the graph is not necessarily a DAG
        String start = test.get(0);
        String end = test.get(1);
        // if `start` and `end` words are equal
        if (start.equals(end))
            return 0;

        q.add(start);
        visited.add(start);
        int count = 0;
        while (!q.isEmpty()) {
            count++;
            int size = q.size();
            for (int i = 0; i &lt; size; ++i) {
                String word = q.poll();
                for (String val : adj.get(word)) {
                    if (val.equals(end))
                        return count; // return the number of edges
                    if (!visited.contains(val)) // only add the words which aren&#39;t visited yet.
                        q.add(val);
                }
            }
        }
        return -1; // if there isn&#39;t any edge
    }

答案2

得分: 2

@The Room已经给出了一个非常好的答案,但是我想提出一个简单的修改,用于构建邻接表的部分,因为所提供的构建列表的方法的复杂度为O(n^2),这将导致处理大型输入文件时性能较差。

你可以简单地考虑每个单词的每个可能的排序的4个字符的模式,并将其插入到一个带有单词ID(例如索引)的哈希映射中。

C++代码示例:

map<string, vector<int>> mappings;

for (int i = 0; i < words.size(); i++) {
    string word = words[i].substr(0, 4);
    sort(word.begin(), word.end());
    mappings[word].push_back(i);
    for (int j = 0; j < 4; j++) {
        word = words[i].substr(0, 4);
        word[j] = words[i][4];
        sort(word.begin(), word.end());
        mappings[word].push_back(i);
    }
}

现在你有一个单词索引的向量,你知道它们之间必须有一个边,并且任何以向量的键的相同4个字符结尾的单词。

然后你可以简单地构建图,只需要注意不要创建自环(避免创建与节点本身的边)。

代码示例:

// 以O(n * log(边的数量))的复杂度构建图
const int N = 100000; // 仅为示例
vector<int> graph[N];
for (int i = 0; i < words.size(); i++) {
    string tmp = words[i].substr(1, 4);
    sort(tmp.begin(), tmp.end());
    for (int j = 0; j < mappings[tmp].size(); j++) {
        if (j == mappings[tmp][j])
            continue;

        graph[i].push_back(mappings[tmp][j]);
    }
}

最后,你可以循环读取测试文件,获取起始和目标索引(读取文件时,将每个单词存储为键,其索引为值),然后你可以应用BFS函数来计算边的数量,就像@The Room的答案中所描述的那样。

我只是想为那些可能需要处理大型输入的类似问题提供一个解决方案,它将图的构建复杂度从O(N^2)降低到O(N * log(边的数量)),其中N是单词的数量。

英文:

@The Room have provided a pretty good answer , but I want to suggest a simple modification for the adjacency list construction part as the provided approach for building the list is of complexity O(n^2) which will lead for poor performance for large input files.

Simply you can take every possible sorted pattern of 4 characters of each word and insert it in a hash map with the id of the word (index for example).

C++ code example:

map&lt;string , vector&lt;int&gt; &gt;mappings ;

for(int i = 0 ; i &lt; words.size();  i++){
    string word = words[i].substr(0 , 4) ; 
    sort(word.begin() , word.end()); 
    mappings[word].push_back(i); 
    for(int j = 0 ; j &lt; 4 ; j++){
        word = words[i].substr(0 , 4) ; 
        word[j] = words[i][4]; 
        sort(word.begin() , word.end()); 
        mappings[word].push_back(i);
    }
}

Now you have a vector of the words' indices that you know there must be an edge between them and any word ending with the same 4 characters of the vector's key.

And then you can simply build the graph and with just taking care of not creating self loops (avoid making an edge with a node and itself).

Code example:

// Building the graph with complexity of O(n * log(no. of edges))
const int N = 100000; // Just and example 
vector&lt;int&gt;graph[N]; 
for(int i = 0 ; i &lt; words.size(); i++){
    string tmp = words[i].substr(1 , 4); 
    sort(tmp.begin() , tmp.end()); 
    for(int j = 0 ; j &lt; mappings[tmp].size(); j++){
        if (j == mappings[tmp][j])
            continue; 
            
        graph[i].push_back(mappings[tmp][j]);
    }
}

Finally you can loop over your test file , get the start and goal indices (When reading the file store each word as a key with a value of it's index) and then you apply the bfs function to calculate the number of edges as described in the answer of @The Room

I just wanted to suggest this answer for folks that may need a solution for a similar problem with a large inputs which will reduce the complexity of building the graph from O(N^2) to O(N * log(no. of edges)) where N is the number of words.

答案3

得分: 2

我的方法稍微有些不同,而且问题中还有一个微妙的细微差别,我将在下面详细说明:

首先,我们创建一个邻接表:(@Volpe95 对此有一个不错的优化方法)。
使用一个节点映射,其中单词是键。

Map<String, Node> nodes = new HashMap<>();
    
List<String> words = new DataHelper().loadWords("src/main/wordsInput.dat");
System.out.println(words);

for (int i = 0; i < words.size(); i++) {
    String l = words.get(i);
    nodes.put(l, new Node(l));
}

for(Map.Entry<String,Node> l: nodes.entrySet()) {
    for(Map.Entry<String, Node> r:nodes.entrySet()) {
        if (l.equals(r)) continue;
        if (isLinkPair(l.getKey(), r.getKey())) {
            Node t = nodes.get(l.getKey());
            System.out.println(t);
            t.addChild(nodes.get(r.getKey()));
        }
    }
}

isLinkPair 函数检查一个单词的最后四个字母是否可以在一个可能的子单词中找到。

private static boolean isLinkPair(String l, String r) {
    // 仅比较最后4个字符
    for (int i = 1; i < l.length(); i++) {
        if(r.indexOf(l.charAt(i)) == -1){
            return false;
        }
    }
    return true;
}

每个节点都存储了单词以及子节点,还有 edgeTo,这用于计算最短路径,其中每个节点都存储了其父节点。这个子父关系始终位于最短路径上。(为了清晰起见,省略了 getter、setter 等内容和 Equals 函数)

public class Node {
    private Set<Node> children;
    private String word;

    private Node edgeTo;

    private int visited;

    public Node(String word) {
        children = new HashSet<>();
        this.word = word;
        edgeTo = null;
    }
}

基于 Sedgewick 的 BFS 算法会依次搜索每个节点、其直接子节点以及它们的子节点,以此类推。它每次都在距离起点更远的地方进行搜索。请注意,这里使用了队列,而在 Java 中,它是由 LinkedList 实现的。

private boolean bfs(Map<String,Node> map, Node source, Node target) {
    if(source == null || target == null) return false;
    if(source.equals(target))return true;
    Queue<Node> queue = new LinkedList<>();
    source.setVisited();
    queue.add(source);
    while(!queue.isEmpty()) {
        Node v = queue.poll();
        for (Node c : v.getChildren()) {
            if(c.getVisited()==0){
                System.out.println("visiting " + c);
                c.setVisited();
                c.setEdgeTo(v);
                if(c.equals(target)) {
                    return true;
                }
                queue.add(c);
            }
        }
    }

    return false;
}

注意,这里 v 是父节点,c 是其子节点。setEdgeTo 用于设置子节点的父节点。

最后,我们检查结果,其中 source 和 target 分别是源单词和目标单词:

BreadthFirstPaths bfs = new BreadthFirstPaths(nodes,source,target);
int shortestPath = bfs.getShortestPath(nodes,source,target);

那么我上面提到的微妙之处是什么呢?最短路径的计算是必要的,因为 zcdea 有两个父节点 fzcde 和 bcdez,你需要找到在最短路径上的那个。要做到这一点,可以使用子节点的 edgeTo,找到它的父节点,然后重复此过程,直到路径被遍历完,如下所示。由于 BFS 搜索是从起点向外搜索的方式,所以这个子父关系始终位于最短路径上。

// 获取 target 上的 edgeTo(即父节点),找到此节点并获取其父节点
// 一直持续,直到走完最短路径或找不到路径为止
public int getShortestPath(Map<String,Node> map, String source, String target) {
    Node node = map.get(target);
    int pathLength = 0;
    do {
        if(node == null || pathLength > map.size()) return NOPATH;
        if(node.equals(map.get(source))) return pathLength;
        node = map.get(node.getWord()).getEdgeTo();
        pathLength++;
    } while (true);
}

在考虑空间和时间复杂度的权衡和优化方面,总是需要斟酌。

英文:

My approach was slightly different and there is also a subtle nuance to the question which I will come onto below:

First we create an adjacency list: ( @Volpe95 has a nice optimisation for this).
A Map of Nodes is used where the word is the key.

Map&lt;String, Node&gt; nodes = new HashMap&lt;&gt;();

        List&lt;String&gt; words = new DataHelper().loadWords(&quot;src/main/wordsInput.dat&quot;);
        System.out.println(words);

        for (int i = 0; i &lt; words.size(); i++) {
            String l = words.get(i);
            nodes.put(l, new Node(l));
        }

        for(Map.Entry&lt;String,Node&gt; l: nodes.entrySet()) {
            for(Map.Entry&lt;String, Node&gt; r:nodes.entrySet()) {
                if (l.equals(r)) continue;
                if (isLinkPair(l.getKey(), r.getKey())) {
                    Node t = nodes.get(l.getKey());
                    System.out.println(t);
                    t.addChild(nodes.get(r.getKey()));
                }
            }

        }

IsLinkPair checks if the last four letters of a word can be found in a possible child word.

private static boolean isLinkPair(String l, String r) {
        // last 4 chars only
        for (int i = 1; i &lt; l.length(); i++) {
            if(r.indexOf(l.charAt(i)) == -1){
                return false;
            }
        }
        return true;
    }

A node stores each word and children as well as the edgeTo, this is used to calculate the shortest path where each node stores its parent. This child parent will always be on the shortest path. (Sedgewick stores this data in separate arrays but its often easier to group these in a class as it makes the code easier to understand)

(Getters Setters etc omitted for clarity and Equals)

public class Node {
    private Set&lt;Node&gt; children;
    private String word;

    private Node edgeTo;

    private int visited;

    public Node(String word) {
        children = new HashSet&lt;&gt;();
        this.word = word;
        edgeTo = null;
    }
}

The BFS algorithm based on Sedgewick's, searches each node, its immediate children and their children in turn and so on. It is searching ever so distant from the origin each time. Note a queue is used and this is implemented by the LinkedList in Java.

private boolean bfs(Map&lt;String,Node&gt; map, Node source, Node target) {
        if(source == null || target == null) return false;
        if(source.equals(target))return true;
        Queue&lt;Node&gt; queue = new LinkedList&lt;&gt;();
        source.setVisited();
        queue.add(source);
        while(!queue.isEmpty()) {
            Node v = queue.poll();
            for (Node c : v.getChildren()) {
                if(c.getVisited()==0){
                    System.out.println(&quot;visiting &quot; + c);
                    c.setVisited();
                    c.setEdgeTo(v);
                    if(c.equals(target)) {
                        return true;
                    }
                    queue.add(c);
                }
            }
        }

        return false;
    }

Note that v is the parent and c are its children. setEdgeTo is used to set a childs parent.

Finally we check the results where source and target are the source and target words respectively:

BreadthFirstPaths bfs = new BreadthFirstPaths(nodes,source,target);
int shortestPath = bfs.getShortestPath(nodes,source,target);

So what of the nuance I mentioned above? The shortest path calculation is necessary as
zcdea has two parents fzcde and bcdez and you need the one on the shortest path. To do use the edgeTo of a child, find its parent and repeat until the path is walked as shown below. That child parent relationship will always be on the shortest path because of the way the bfs searchs from an origin outwards.

// get edgeTo on target (the parent) , find this node and get its parent
    // continue until the shortest path is walked or no path is found
    public int getShortestPath(Map&lt;String,Node&gt; map, String source, String target) {
        Node node = map.get(target);
        int pathLength = 0;
        do {
            if(node == null || pathLength &gt; map.size()) return NOPATH;
            if(node.equals(map.get(source))) return pathLength;
            node = map.get(node.getWord()).getEdgeTo();
            pathLength++;
        } while (true);
    }

There are always space time complexity tradeoffs to consider and optimisations.

huangapple
  • 本文由 发表于 2020年10月26日 18:03:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/64534975.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定