Is there a way to keep between-word hyphens when lemmatizing using spacyr?


Question

I'm using spacyr to lemmatise a corpus of speeches, and then using quanteda to tokenise and analyse the results (via textstat_frequency()). My issue is that some key terms in the texts are hyphenated. When I tokenise with quanteda, these between-word hyphens are preserved and the hyphenated terms are treated as single tokens, which is my desired result. However, when I use spacyr to lemmatise first, hyphenated words are not kept together. I've tried nounphrase_consolidate(), which does keep hyphenated words together, but I find the results very inconsistent: sometimes a term of interest is kept on its own during consolidation, and in other instances it is combined into a larger noun phrase. This is suboptimal because, in my final step with textstat_frequency(), I'm looking at a particular dictionary of features, some of which are hyphenated terms.
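
To make the problem concrete, here is a minimal sketch of the two behaviours, using the same example sentence as the answer below (the outputs shown as comments are indicative):

library("quanteda")
library("spacyr")

# quanteda's tokeniser keeps intra-word hyphens by default
# (split_hyphens = FALSE), so the term survives as one token
tokens("NLP is fast-moving.")
## "NLP" "is" "fast-moving" "."

# spaCy's tokeniser splits on the intra-word hyphen, so after spacy_parse()
# the same term arrives as three tokens: "fast", "-", "moving"
spacy_parse("NLP is fast-moving.")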

It seems like there is a solution for this in spaCy, but I was curious whether there's a similar option in spacyr: https://stackoverflow.com/questions/55241927/spacy-intra-word-hyphens-how-to-treat-them-one-word

Thanks for any thoughts or suggestions. Code below. It doesn't make a difference whether I use remove_punct or not when tokenising.

# parse and lemmatise with spacyr, marking noun phrases
test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
# use the lemmas as the tokens
test.sp$token <- test.sp$lemma
# consolidate noun phrases (keeps hyphenated words, but inconsistently)
test.np <- nounphrase_consolidate(test.sp)
# convert to a quanteda tokens object
test.tokens.3 <- as.tokens(test.np)
# clean up: drop symbols, numbers, punctuation, URLs, stopwords; lowercase
test.tokens.3 <- tokens(test.tokens.3, remove_symbols = TRUE,
                        remove_numbers = TRUE,
                        remove_punct = TRUE,
                        remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")
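
For reference, the final frequency step looks roughly like this (a sketch only: the dictionary entries below are hypothetical stand-ins for the actual hyphenated terms of interest, and textstat_frequency() comes from the quanteda.textstats package):

library("quanteda.textstats")

# hypothetical dictionary of features (the real terms are not shown here)
feat.dict <- dictionary(list(key_terms = c("fast-moving", "co-ordinated")))

# keep only the dictionary features, then rank them by frequency
test.tokens.3 %>%
  tokens_select(pattern = feat.dict) %>%
  dfm() %>%
  textstat_frequency()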



Answer 1

Score: 0

You should be able to rejoin the hyphenated words in quanteda, using tokens_compound().

library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("spacyr")

test.corpus <- c(d1 = "NLP is fast-moving.",
                 d2 = "A co-ordinated effort.")
test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 3.4.4, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
test.sp$token <- test.sp$lemma
test.np <- nounphrase_consolidate(test.sp)
test.tokens.3 <- as.tokens(test.np)

tokens_compound(test.tokens.3, pattern = phrase("* - *"), concatenator = "")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "NLP"          "be"           "fast-moving"  "."
#> 
#> d2 :
#> [1] "a_co-ordinated_effort" "."

Created on 2023-06-09 with reprex v2.0.2
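
Here phrase("* - *") matches any word, hyphen, word sequence in the tokens, and concatenator = "" glues the parts back together without the default underscore separator. Applied to the pipeline in the question, the compounding would need to happen before the cleaning step, since remove_punct = TRUE would otherwise discard the "-" tokens before they can be rejoined (a sketch based on the question's own code):

# compound first, then clean: remove_punct would otherwise delete the "-"
# tokens before tokens_compound() could rejoin them
test.tokens.3 <- tokens_compound(test.tokens.3, pattern = phrase("* - *"),
                                 concatenator = "") %>%
  tokens(remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")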

