(Java) 生成关联规则的所有前项

huangapple go评论58阅读模式
英文:

(Java) Generate all antecedents of an association rule

问题

例如,我可能有一个频繁字符项集 {A,B,C,G}。我需要生成所有可能的关联规则前提。在这种情况下:ABC、ABG、ACG、AB、AC、AG、BC、BG、CG、A、B、C、G。我不知道从哪里开始做这个。数小时的研究让我了解了术语和概念,但没有解释如何执行这个特定的步骤。以下是我目前为该方法所做的部分。项集以字符串形式保存,并一起存储在 ArrayList 中。我已经制作了一个可工作的 Apriori 算法来生成频繁项集。

public static ArrayList<String> associationRules(ArrayList<String> data, ArrayList<String> freqItemsets, int minConf){
    ArrayList<String> generatedRules = new ArrayList<String>();
    for(int i = 0; i < freqItemsets.size(); i++) {
        String currentItemset = freqItemsets.get(i);
        if(currentItemset.length() < 2) {
            continue;
        }
        
    }
    
    
    return null; // 临时的返回语句,避免编译错误
}

虽然代码、反馈以及关于此步骤和后续步骤的建议当然都会帮助很大,但我真正需要的只是关于如何执行这一步的英文解释(而不是伪代码或使用不同数据类型的另一种工作方法)。其他一切似乎都可以管理。

英文:

For example I may have a frequent itemset of characters {A, B, C, G}. I need to generate all possible antecendents of association rules. In this case: ABC, ABG, ACG, AB, AC, AG, BC, BG, CG, A, B, C, G. I have no idea where to start on doing this. Hours of research has taught me about the terminology and concept, but nothing has explained how to do this particular step. This is what I have so far for the method. The itemsets are all kept in the form of Strings, and stored together as an ArrayList. I have already made a working Apriori algorithm for the generation of frequent itemsets.

public static ArrayList&lt;String&gt; associationRules(ArrayList&lt;String&gt; data, ArrayList&lt;String&gt; freqItemsets, int minConf){
		ArrayList&lt;String&gt; generatedRules = new ArrayList&lt;String&gt;();
		for(int i = 0; i &lt; freqItemsets.size(); i++) {
			String currentItemset = freqItemsets.get(i);
			if(currentItemset.length() &lt; 2) {
				continue;
			}
			
		}
		
		
		return null; // temporary return statement to avoid compile error
	}

While code, feedback, and advice on this and later steps would of course be a huge help, all I really need is an English explanation of how to do this one step (as opposed to pseudocode or another working method using different data types). Everything else seems manageable.

答案1

得分: 3

假设您已经理解了所需内容的定义(所有以原始列表排序的子集),您可以通过以下方式思考并使用以下属性来完成:

* 像列表中的排序一样
* 有限的
* 可分割的

您需要做的就是多次遍历字符列表,每次都要决定是否在这次包含它或删除它。如果您遍历并捕获了所有可能性,那么就完成了。要做到这一点,您应该找到一种可靠的方法来计算可能的结果字符串。

# 一种迭代的解决方案

考虑可能的比特状态。您有 n 个字符,并为每个字符分配一个比特(在您的情况下为 4 个)。然后,每个可能的比特状态都定义了一个子集的合法排列,例如对于 `{A,B,C,G}`:

`1001` 将是 `AG`

正如我们所知,比特集的所有可能状态都是“可计数的”,换句话说,您只需通过从最小状态计数到最高状态,逐步加 1 来计数。

从 1 计数到 2^n - 1(其中 n 是您拥有的字符数)并通过在正确的顺序中添加(in correct sequence)所有具有 1 作为它们表示位的字符,并略去具有 0 的字符来构建您的 `String`。然后,您会“计数”所有可能的合法排列。

这样的实现在很大程度上取决于程序员及其风格,但对我来说,它大致看起来像这样:
```java
public static List<String> associationRules(List<String> elements) {
  List<String> result = new ArrayList<>();
  long limit = 1 << elements.size(); // 感谢 saka1029 提供的更正。我的代码是 n^2 而不是 2^n。

  // 从 1 计数到 n^2 - 1
  for (long i = 1; i < limit; ++i) {
    StringBuilder seq = new StringBuilder();

    // 对于每个位置(字符),根据位的状态决定是否在其中包含它。
    for (int pos = 0; pos < elements.size(); ++pos) {
      boolean include = ((i >> pos) % 2) == 1; // 这行将在“i”的“pos”位置(从后面)的比特为1时给出true,否则为false。
      if (include) {
        seq.append(elements.get(pos));
      }
    }

    // 将新生成的字符串添加到最终结果中。
    result.add(seq.toString());
  }

  return result;
}

结果如下所示:
[A, B, AB, C, AC, BC, ABC, G, AG, BG, ABG, CG, ACG, BCG, ABCG]

这是一种迭代(非递归)解决方案,但也有一种递归解决方案,可能(或可能不)更容易实现。

一种递归的解决方案

递归解决方案可以通过创建一个方法来实现,该方法以已排序的字符集合和一个布尔状态(包含或不包含)作为参数,并返回所有可能的排序子排列列表。然后,您将使用一个公共方法调用它,该方法传递字符和位置0,以及初始状态truefalse(另一个稍后再来)。

然后,该方法使用分而治之的方法。根据是否设置了包含标志,您会在定义的位置包含字符,并再次调用自己的方法,其中传递一个不包含第一个字符的克隆字符(子)集。

暂时假设您在每个序列中包含第一个字符(但稍后会包含它)。
如果您将字符集{A,B,C,G}传递给此类方法,则该方法将会(开始)按以下方式运行:

A: 对 {B, C, G} 递归
  B: 对 {C, G} 递归
    C: 对 {G} 递归
      G: 集合为空,
      G: 将所有以 'G' 为前缀和没有的字符串添加到结果中。
      G: 返回 { "G", "" }
    C: 将所有以 'C' 为前缀和没有的字符串添加到结果中。
    C: { "CG", "C", "G", "" }
    ...

这种方式,您将递归地收集所有已排序的子集排列。根据是否允许空字符串,您可以在最后删除它,或者根本不添加它。

我像这样实现了它,但还有其他正确的方法:

public static List<String> associationRules2(List<String> elements) {
    List<String> result = new ArrayList<>();
    String thisElement = elements.get(0);

    // 构建子集列表(不包括第一个元素)
    List<String> remaining = new ArrayList<>();
    boolean first = true;
    for (String s : elements) {
        if (first) {
            first = false;
        } else {
            remaining.add(s);
        }
    }

    // 如果子集不为空,我们进行递归。
    if (!remaining.isEmpty()) {
        List<String> subPermutations = associationRules2(remaining);

        // 添加所有不包含 thisElement 的排列。
        result.addAll(subPermutations);

        // 添加所有包含 thisElement 的排列。
        for (String s : subPermutations) {
            result.add(thisElement + s);
        }
    }

    // 最后单独添加 thisElement。
    result.add(thisElement);

    return result;
}

结果为:[G, CG, C, BG, BCG, BC, B, AG, ACG, AC, ABG, ABCG, ABC, AB, A]


<details>
<summary>英文:</summary>

Assuming you nailed the definition of what you actually need (all subsets that are sorted as the original list), you can do this by thinking about it as that and using those properties:

* sorted like in your list
* finite
* dividable

All you need to do is go through your character list multiple times and each time decide per chraracter, whether to include it this time or drop it. If you go through and capture all possibilities, then you are done. To do this you should find a solid way to count through the possible result strings. 

# an iterative solution

Think about possible bit-states. You have n characters and assign each character a bit (in your case 4). Then each possible bit-state defines a legal permutation of a subset, e.g. for `{A, B, C, G}`:

`1001` would be `AG`

As we know, all possible states of a bit set are &#39;countable&#39;, or in other words, you can just count through them by counting from the least state to the highest by adding 1.

Make a loop counting from 1 to 2^n - 1 (where n is the number of chars you have) and then build your `String` by adding (in correct sequence) all characters for which you have a 1 as their representing bit and leave out the characters with a 0. Then you &#39;count&#39; through all possible legal permutations.

Such an implementation highly depends on the programmer and their style, but for me it would look about like this:
```java
public static List&lt;String&gt; associationRules(List&lt;String&gt; elements) {
  List&lt;String&gt; result = new ArrayList&lt;&gt;();
  long limit = 1 &lt;&lt; elements.size(); // thanks to saka1029 for this correction. My code was n^2 not 2^n.

  // count from 1 to n^2 - 1
  for (long i = 1; i &lt; limit; ++i) {
    StringBuilder seq = new StringBuilder();

    // for each position (character) decide, whether to include it based on the state of the bit.
    for (int pos = 0; pos &lt; elements.size(); ++pos) {
      boolean include = ((i &gt;&gt; pos) % 2) == 1; // this line will give you true, if the in &#39;i&#39; the bit at &#39;pos&#39; (from behind) is 1, and false otherwise.
      if (include) {
        seq.append(elements.get(pos));
      }
    }

    // add to the final result the newly generated String.
    result.add(seq.toString());
  }

  return result;
}

and the result looks like this:
[A, B, AB, C, AC, BC, ABC, G, AG, BG, ABG, CG, ACG, BCG, ABCG]

This is an iterative (non-recursive) solution, but there is also a recursive one that may (or may not) be easier to implement still.

a recursive solution

A recursive solution could just simply work by creating a method that takes as arguments the a set of sorted characters and a boolean state (included or not included) and returns a list of all possible sorted subpermutations.
You would then call this with a public method that passes the characters and 0 as position and either true or false as initial state (the other comes later).

The method then works with divide and conquer. You include the character at the defined position (based on whether or not the include flag is set) and call the own method again with a cloned character (sub)set that does not include the first character.

Let's assume for the moment, that you start by not including the first character of each sequence (but later include it).
If you pass to such a method the character set {A, B, C, G} then the method would (start) to operate like this:

A: recurse on {B, C, G}
  B: recurse on {C, G}
    C: recurse on {G}
      G: set is empty,
      G: Add to the result all Strings with &#39;G&#39; prefixed and without.
      G: return {&quot;G&quot;, &quot;&quot;}
    C: Add to the result all Strings with &#39;C&#39; prefixed and without.
    C: {&quot;CG&quot;, &quot;C&quot;, &quot;G&quot;, &quot;&quot;}
    ...

This way, you will collect all sorted subset permutations recursively. Depending on whether the empty String is allowed, you can remove that at the end, or not add it at all.

I implemented it like this, but there are other correct ways:

public static List&lt;String&gt; associationRules2(List&lt;String&gt; elements) {
	List&lt;String&gt; result = new ArrayList&lt;&gt;();
	String thisElement = elements.get(0);
	
	// build the subset list (leaving out the first element
	List&lt;String&gt; remaining = new ArrayList&lt;&gt;();
	boolean first = true;
	for (String s : elements) {
		if (first) {
			first = false;
		} else {
			remaining.add(s);
		}
	}
	
	// if the subset is not empty, we recurse.
	if (! remaining.isEmpty()) {
		List&lt;String&gt; subPermutations = associationRules2(remaining);
		
		// add all permutations without thisElement.
		result.addAll(subPermutations);
		
		// add all permutations *with* thisElement.
		for (String s : subPermutations) {
			result.add(thisElement + s);
		}
	}
	
	// finally add thisElement on it&#39;s own.
	result.add(thisElement);
	
	return result;
}

Result: [G, CG, C, BG, BCG, BC, B, AG, ACG, AC, ABG, ABCG, ABC, AB, A]

huangapple
  • 本文由 发表于 2020年9月29日 12:39:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/64113007.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定