Find matches for multiple words

Question

I have this PostgreSQL table for storing words:

    -- The two sequences referenced below are assumed to exist already.
    CREATE TABLE IF NOT EXISTS words
    (
        id bigint NOT NULL DEFAULT nextval('processed_words_id_seq'::regclass),
        keyword character varying(300) COLLATE pg_catalog."default"
    );

    INSERT INTO words (keyword)
    VALUES ('while swam is interesting'),
           ('ibm is a company like bmw');

    CREATE TABLE IF NOT EXISTS trademarks
    (
        id bigint NOT NULL DEFAULT nextval('trademarks_id_seq'::regclass),
        trademark character varying(300) COLLATE pg_catalog."default"
    );

    INSERT INTO trademarks (trademark)
    VALUES ('while swam'),
           ('ibm'),
           ('bmw');

The `trademarks` table will hold thousands of registered trademark names.
I want to check whether the keywords stored in `words.keyword` match any of them, not only for single words but also for a word that is part of a group of words. For example:

I have the keyword `while swam is interesting` stored in `words.keyword`. I also have the trademark `swam` in `trademarks.trademark`, so, as with `ibm`, there is a word match, and I want to detect this using Java code.

First I want to select all blacklisted keywords, convert them into, for example, a `List`, and compare `ibm is a company like bmw` with the elements of the list. How can I do this not only for a single word but also for expressions?

Something like this?

    Optional<ProcessedWords> keywords = processedWordsService.findRandomKeywordWhereTrademarkBlacklistedIsEmpty();

    if (keywords.isPresent())
    {
        List<BlacklistedWords> blacklistedWords = blacklistedWordsService.findAll();
        List<String> list = new ArrayList<>();
        for (BlacklistedWords item : blacklistedWords) {
            list.add(item.getKeyword());
        }

        ProcessedWords processedWords = keywords.get();
        String keyword = processedWords.getKeyword();

        if (list.contains(keyword))
        {
            System.out.println("Found blacklisted word in keyword: " + keyword);
        }
    }

    @Getter
    @Setter
    @NoArgsConstructor
    @AllArgsConstructor
    @Builder(toBuilder = true)
    @Entity
    @Table(name = "trademarks")
    public class BlacklistedWords implements Serializable {

        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        @Column(name = "id", unique = true, updatable = false, nullable = false)
        private long id;

        @Column(name = "trademark", length = 200, unique = true)
        private String keyword;
    }

Can you guide me on how this can be implemented?

Answer 1

Score: 1

This is how to do the matching with Java streams:

    public static void main(String[] args) {
        // Stub the persistence layer. These stubs are assumed to expose
        // getTrademark(); the entity in the question names it getKeyword().
        List<BlacklistedWords> blacklistedWords = Arrays.asList(
                new BlacklistedWords(1, "while swam"),
                new BlacklistedWords(2, "ibm"),
                new BlacklistedWords(3, "bmw"));
        List<ProcessedWords> keyWords = Arrays.asList(
                new ProcessedWords(1, "while swam is interesting"),
                new ProcessedWords(2, "ibm is a company like bmw"),
                new ProcessedWords(3, "miss"));

        // Keep every keyword that contains at least one trademark as a substring.
        List<ProcessedWords> hits = keyWords.stream()
                .filter(pw -> blacklistedWords.stream()
                        .anyMatch(bw -> pw.getKeyword().contains(bw.getTrademark())))
                .collect(Collectors.toList());

        System.out.println(hits);
    }

Output:

[ProcessedWords(id=1, keyword=while swam is interesting), ProcessedWords(id=2, keyword=ibm is a company like bmw)]

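Note that I stubbed out the persistence layer, adding an extra `ProcessedWords` entry ("miss") that should not match, and annotated `BlacklistedWords` (table `trademarks`) and `ProcessedWords` (table `words`) with `@Data` to get a decent `toString()`. You shouldn't do that in real code, because they are `@Entity` classes.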
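
To see why the `list.contains(keyword)` check in the question does not fire while the stream version does, the following sketch contrasts the two. `List.contains` compares whole elements with `equals`, whereas `String.contains` looks for a substring. Plain strings stand in for the entities, and the names `ContainsDemo`, `exactMatch`, and `substringMatch` are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ContainsDemo {

    // List.contains: true only if some trademark equals the whole keyword.
    static boolean exactMatch(List<String> trademarks, String keyword) {
        return trademarks.contains(keyword);
    }

    // String.contains: true if any trademark occurs anywhere inside the keyword.
    static boolean substringMatch(List<String> trademarks, String keyword) {
        return trademarks.stream().anyMatch(keyword::contains);
    }

    public static void main(String[] args) {
        List<String> trademarks = Arrays.asList("while swam", "ibm", "bmw");
        String keyword = "ibm is a company like bmw";

        System.out.println(exactMatch(trademarks, keyword));      // false
        System.out.println(substringMatch(trademarks, keyword));  // true
        // Substring matching also fires inside a longer word:
        System.out.println(substringMatch(trademarks, "ibmx"));   // true
    }
}
```

The last call shows the limitation of plain substring matching: a trademark embedded in a longer word still counts as a hit, which matters if only whole words should match.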

Answer 2

Score: 1

To satisfy the requirement for whole words, do the following:

    List<BlacklistedWords> blacklistedWords = Arrays.asList(
            new BlacklistedWords(1, "while swam"),
            new BlacklistedWords(2, "ibm"),
            new BlacklistedWords(3, "bmw"));
    List<ProcessedWords> keyWords = Arrays.asList(
            new ProcessedWords(1, "while swam is interesting"),
            new ProcessedWords(2, "ibm is a company like bmw"),
            new ProcessedWords(3, "miss"));

    // A concurrent set: a plain HashSet is not safe to modify from parallel streams.
    Set<ProcessedWords> hits = ConcurrentHashMap.newKeySet();
    blacklistedWords.parallelStream().forEach(bw -> {
        // Build the space-delimited variants once per trademark.
        final String trademark = bw.getTrademark();
        final String startsWith = trademark + " ";
        final String contains = " " + startsWith;
        final String endsWith = " " + trademark;
        keyWords.parallelStream().forEach(pw -> {
            final String keyword = pw.getKeyword();
            if (keyword.contains(contains) || keyword.startsWith(startsWith)
                    || keyword.endsWith(endsWith) || keyword.equals(trademark)) {
                hits.add(pw);
            }
        });
    });

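The output we require is a subset of `keyWords`, so it can also be written as a filter. That version recomputes `" " + trademark + " "` and the other strings in the inner stream for every keyword and trademark pair, so it is not advised, but it works: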

    List<ProcessedWords> hits1 = keyWords.parallelStream()
            .filter(pw -> blacklistedWords.parallelStream().anyMatch(bw -> {
                final String keyword = pw.getKeyword();
                final String trademark = bw.getTrademark();
                return keyword.contains(" " + trademark + " ")
                        || keyword.startsWith(trademark + " ")
                        || keyword.endsWith(" " + trademark)
                        || keyword.equals(trademark);
            }))
            .collect(Collectors.toList());
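
With thousands of trademarks, another option is to compile each trademark once into a regular expression with word boundaries and reuse it for every keyword. The following is only a sketch under stated assumptions, not part of the original answer: plain strings stand in for the entities, and the names `WholeWordMatcher`, `compile`, and `findHits` are invented here. Note that `\b` treats any non-word character as a boundary, so a keyword such as "ibm, inc" would also match, unlike the space-based checks. Trademarks that begin or end with a character other than a letter, digit, or underscore would need different boundary handling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WholeWordMatcher {

    // Compile each trademark once; Pattern.quote escapes regex metacharacters.
    static List<Pattern> compile(List<String> trademarks) {
        return trademarks.stream()
                .map(t -> Pattern.compile("\\b" + Pattern.quote(t) + "\\b"))
                .collect(Collectors.toList());
    }

    // Keep the keywords in which at least one trademark occurs as whole word(s).
    static List<String> findHits(List<String> keywords, List<Pattern> patterns) {
        return keywords.stream()
                .filter(k -> patterns.stream().anyMatch(p -> p.matcher(k).find()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Pattern> patterns = compile(Arrays.asList("while swam", "ibm", "bmw"));
        List<String> keywords = Arrays.asList(
                "while swam is interesting",
                "ibm is a company like bmw",
                "ibmx is not a match",
                "miss");
        System.out.println(findHits(keywords, patterns));
        // prints [while swam is interesting, ibm is a company like bmw]
    }
}
```

Compiling the patterns up front keeps the per-keyword work to a series of `find()` calls, and the sequential streams avoid the shared mutable state that the parallel version has to guard.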

huangapple
  • Posted on February 24, 2023, 05:31:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75550523.html