Is there a way to keep between-word hyphens when lemmatizing using spacyr?
Question
I'm using spacyr to lemmatise a corpus of speeches, and then using quanteda to tokenise and analyze results (via textstat_frequency()). My issue is that some key terms in the texts are hyphenated. When I tokenise using quanteda, I do not lose these between-word hyphens, and the hyphenated terms are treated as one token, which is my desired result. However, when I use spacyr to lemmatise first, hyphenated words are not kept together. I've tried nounphrase_consolidate(), which does keep hyphenated words, but I find the results to be very inconsistent, as sometimes a term of interest is kept on its own during this consolidation, and in other instances is combined as part of a larger nounphrase. This is suboptimal because I'm looking at a particular dictionary of features in my final step with textstat_frequency, some of which are hyphenated terms.
It seems like this is the solution in spaCy, but I was curious whether there's a similar option in spacyr: https://stackoverflow.com/questions/55241927/spacy-intra-word-hyphens-how-to-treat-them-one-word
Thanks for any thoughts or suggestions. Code below. It doesn't make a difference whether I use remove_punct or not when tokenising.
test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
test.sp$token <- test.sp$lemma
test.np <- nounphrase_consolidate(test.sp)
test.tokens.3 <- as.tokens(test.np)
test.tokens.3 <- tokens(test.tokens.3, remove_symbols = TRUE,
                        remove_numbers = TRUE,
                        remove_punct = TRUE,
                        remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")
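For a reproducible illustration of the difference (the two-document corpus and the demo.* object names below are invented for this sketch): quanteda's tokens() keeps "fast-moving" as a single token, while the spaCy tokeniser used by spacy_parse() splits it into "fast", "-", and "moving".
library("quanteda")
library("spacyr")

# invented two-document corpus for this sketch
demo.corpus <- c(d1 = "NLP is fast-moving.",
                 d2 = "A co-ordinated effort.")

# quanteda does not split intra-word hyphens by default (split_hyphens = FALSE),
# so "fast-moving" stays a single token
tokens(demo.corpus)

# spaCy's tokeniser splits the same word into "fast", "-", "moving"
demo.sp <- spacy_parse(demo.corpus, lemma = TRUE, entity = FALSE,
                       pos = FALSE, tag = FALSE)
as.tokens(demo.sp)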
Answer 1
Score: 0
You should be able to rejoin the hyphenated words in quanteda, using tokens_compound().
library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("spacyr")
test.corpus <- c(d1 = "NLP is fast-moving.",
                 d2 = "A co-ordinated effort.")
test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 3.4.4, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
test.sp$token <- test.sp$lemma
test.np <- nounphrase_consolidate(test.sp)
test.tokens.3 <- as.tokens(test.np)
tokens_compound(test.tokens.3, pattern = phrase("* - *"), concatenator = "")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "NLP" "be" "fast-moving" "."
#>
#> d2 :
#> [1] "a_co-ordinated_effort" "."
Created on 2023-06-09 with reprex v2.0.2
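From there the rest of the pipeline can proceed as before. Below is a minimal sketch, assuming a hypothetical dictionary with hyphenated entries (dict, pace, and cooperation are invented names); note that the punctuation removal has to come after tokens_compound(), because the phrase("* - *") pattern needs the "-" tokens to still be present.
library("quanteda.textstats")

# hypothetical dictionary; tokens_lookup() uses glob matching by default,
# so "*co-ordinated*" also catches the term when nounphrase_consolidate()
# has absorbed it into a larger noun phrase
dict <- dictionary(list(pace = "fast-moving",
                        cooperation = "*co-ordinated*"))

test.tokens.4 <- tokens_compound(test.tokens.3, pattern = phrase("* - *"),
                                 concatenator = "") %>%
  tokens(remove_symbols = TRUE, remove_numbers = TRUE,
         remove_punct = TRUE, remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

textstat_frequency(dfm(test.tokens.4))                                    # all features
textstat_frequency(dfm(tokens_lookup(test.tokens.4, dictionary = dict)))  # dictionary keys only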