Is there a way to keep between-word hyphens when lemmatizing using spacyr?
Question
I'm using spacyr to lemmatise a corpus of speeches, and then using quanteda to tokenise and analyze results (via textstat_frequency()). My issue is that some key terms in the texts are hyphenated. When I tokenise using quanteda, I do not lose these between-word hyphens, and the hyphenated terms are treated as one token, which is my desired result. However, when I use spacyr to lemmatise first, hyphenated words are not kept together. I've tried nounphrase_consolidate(), which does keep hyphenated words, but I find the results to be very inconsistent, as sometimes a term of interest is kept on its own during this consolidation, and in other instances is combined as part of a larger nounphrase. This is suboptimal because I'm looking at a particular dictionary of features in my final step with textstat_frequency, some of which are hyphenated terms.
It seems like this is the solution in spaCy, but I was curious whether there's a similar option in spacyr: https://stackoverflow.com/questions/55241927/spacy-intra-word-hyphens-how-to-treat-them-one-word
Thanks for any thoughts or suggestions. Code below. It doesn't make a difference whether I use remove_punct or not when tokenising.
test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
test.sp$token <- test.sp$lemma
test.np <- nounphrase_consolidate(test.sp)
test.tokens.3 <- as.tokens(test.np)
test.tokens.3 <- tokens(test.tokens.3, remove_symbols = TRUE,
                        remove_numbers = TRUE,
                        remove_punct = TRUE,
                        remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")
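For a reproducible illustration of the difference (the two-document corpus and the demo.* object names below are invented for this sketch): quanteda's tokens() keeps "fast-moving" as a single token, while the spaCy tokeniser used by spacy_parse() splits it into "fast", "-", and "moving".
library("quanteda")
library("spacyr")

# invented two-document corpus for this sketch
demo.corpus <- c(d1 = "NLP is fast-moving.",
                 d2 = "A co-ordinated effort.")

# quanteda does not split intra-word hyphens by default (split_hyphens = FALSE),
# so "fast-moving" stays a single token
tokens(demo.corpus)

# spaCy's tokeniser splits the same word into "fast", "-", "moving"
demo.sp <- spacy_parse(demo.corpus, lemma = TRUE, entity = FALSE,
                       pos = FALSE, tag = FALSE)
as.tokens(demo.sp)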
Answer 1
Score: 0
You should be able to rejoin the hyphenated words in quanteda, using tokens_compound().
library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("spacyr")
test.corpus <- c(d1 = "NLP is fast-moving.",
                 d2 = "A co-ordinated effort.")
test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 3.4.4, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
test.sp$token <- test.sp$lemma
test.np <- nounphrase_consolidate(test.sp)
test.tokens.3 <- as.tokens(test.np)
tokens_compound(test.tokens.3, pattern = phrase("* - *"), concatenator = "")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "NLP" "be" "fast-moving" "."
#>
#> d2 :
#> [1] "a_co-ordinated_effort" "."
Created on 2023-06-09 with reprex v2.0.2
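From there the rest of the pipeline can proceed as before. Below is a minimal sketch, assuming a hypothetical dictionary with hyphenated entries (dict, pace, and cooperation are invented names); note that the punctuation removal has to come after tokens_compound(), because the phrase("* - *") pattern needs the "-" tokens to still be present.
library("quanteda.textstats")

# hypothetical dictionary; tokens_lookup() uses glob matching by default,
# so "*co-ordinated*" also catches the term when nounphrase_consolidate()
# has absorbed it into a larger noun phrase
dict <- dictionary(list(pace = "fast-moving",
                        cooperation = "*co-ordinated*"))

test.tokens.4 <- tokens_compound(test.tokens.3, pattern = phrase("* - *"),
                                 concatenator = "") %>%
  tokens(remove_symbols = TRUE, remove_numbers = TRUE,
         remove_punct = TRUE, remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

textstat_frequency(dfm(test.tokens.4))                                    # all features
textstat_frequency(dfm(tokens_lookup(test.tokens.4, dictionary = dict)))  # dictionary keys only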