Why would a special character cause an R function to drop components of the search string?

Question

I am using the R package 'easyPubMed' to investigate species and the research effort on them (i.e. the total number of publications per species). Typically, I can call get_pubmed_ids("example string[TI]") to find out from the NCBI database how many publications have "example string" in the title.

However, I've run into a curious effect. When I pass certain species names to get_pubmed_ids(), words are dropped if the name contains a hyphen or an apostrophe. For instance, to search for publications on the Tawny-breasted Tinamou, I enter the following:

get_pubmed_ids("Tawny-breasted Tinamou[TI]")

The results looked odd: several different Tinamou species all returned exactly 31 publications. I inspected the returned object and isolated the problem, but can't figure out a solution. Specifically, the function does accept the species name with the special character:

$OriginalQuery
[1] "Tawny-breasted+Tinamou[TI]"

However, the query appears to be modified along the way, because the 'QueryTranslation' element shows the following:

$QueryTranslation
[1] "\"Tinamou\"[Title]"

Species names without a special character (e.g. Common Raven) do not have this problem. And when I search for "Tawny-breasted Tinamou[TI]" on the NCBI website in a browser, it seems to work.

If anyone has suggestions or potential explanations for why specific characters within the string cause the function to drop parts of the species name, I would be very interested.

Thank you.

I have attempted the search in the original database to make sure the search string would work overall, without success. I have also tried escaping the characters with backslashes so that they might be recognized as special characters, but that did not seem to work; I am not sure I used the escapes correctly, though. In sum, I have tried to make the R function use the correct search string, to no avail.

Answer 1

Score: 2

I think this is on the NCBI side. The R code isn't doing anything strange with the strings and you see the same results if you search the NCBI website directly.

If you look at the XML returned by the queries, it shows more detail about what NCBI is doing behind the scenes:

When you search for common raven[TI]:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Common+Raven[TI]

you can see the query is translated as:

<QueryTranslation>"common raven"[Title]</QueryTranslation>

It puts the whole phrase in quotes and knows that the entire thing should be a title.
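To reproduce the URLs used in this answer from R, a minimal sketch could look like the following. The helper name esearch_url is made up for illustration and is not part of easyPubMed; it only mirrors the space-to-plus substitution visible in $OriginalQuery and does not perform full URL encoding.

```r
# Hypothetical helper (not part of easyPubMed): build an ESearch URL
# for a PubMed query by replacing spaces with "+", as seen in $OriginalQuery.
esearch_url <- function(term) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  paste0(base, "?db=pubmed&term=", gsub(" ", "+", term, fixed = TRUE))
}

raven_url   <- esearch_url("Common Raven[TI]")
tinamou_url <- esearch_url("Tawny-breasted Tinamou[TI]")
print(raven_url)
print(tinamou_url)
```

Opening either URL in a browser shows the raw XML that NCBI returns for that query.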


When you search for Tawny-breasted Tinamou[TI]:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Tawny-breasted+Tinamou[TI]

<QueryTranslation>"Tinamou"[Title]</QueryTranslation>
<ErrorList>
 <PhraseNotFound>Tawny-breasted</PhraseNotFound>
</ErrorList>

So NCBI does not put the query in quotes; it treats it as two separate search terms, "Tawny-breasted" and "Tinamou[TI]". Since "Tawny-breasted" is not found as a phrase (see the PhraseNotFound element), it is dropped and only "Tinamou"[Title] is searched.
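If you want to check for dropped phrases programmatically, one option is to pull the relevant elements out of the XML. The sketch below is only an illustration: extract_tag is a hypothetical helper, it uses a simple base-R regular expression rather than a real XML parser, and it runs on the response text quoted above instead of a live request.

```r
# Hypothetical helper: return the text of every <tag>...</tag> element
# found in an XML string (simple regex, assumes no nested or multi-line tags).
extract_tag <- function(xml, tag) {
  pattern <- sprintf("<%s>(.*?)</%s>", tag, tag)
  hits <- regmatches(xml, gregexpr(pattern, xml, perl = TRUE))[[1]]
  gsub(sprintf("</?%s>", tag), "", hits)
}

# The response fragment quoted above, pasted in as a literal string.
response <- paste(
  '<QueryTranslation>"Tinamou"[Title]</QueryTranslation>',
  '<ErrorList>',
  ' <PhraseNotFound>Tawny-breasted</PhraseNotFound>',
  '</ErrorList>',
  sep = "\n"
)

translation <- extract_tag(response, "QueryTranslation")
dropped     <- extract_tag(response, "PhraseNotFound")

if (length(dropped) > 0) {
  message("NCBI dropped: ", paste(dropped, collapse = ", "))
}
```

A non-empty result for PhraseNotFound is a sign that the count you got back belongs to a narrower query than the one you typed.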


If you want to properly search for the whole term, you need to add the quotes yourself, as shown below:

get_pubmed_ids('"Tawny-breasted Tinamou"[TI]')
$Count
[1] "0"

$RetMax
[1] "0"

$RetStart
[1] "0"

$QueryKey
[1] "1"

$WebEnv
[1] "MCID_648b50fadd6389294158ec04"

$QueryTranslation
[1] "\"Tawny-breasted Tinamou\"[TI]"

$IdList
named list()

$TranslationSet
list()

$OriginalQuery
[1] "\"Tawny-breasted+Tinamou\"[TI]"

Note that the inner quotes must be double quotes ("), so either enclose the R string in single quotes as above, or escape the inner quotes with a backslash (\):

get_pubmed_ids("\"Tawny-breasted Tinamou\"[TI]")
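If you are looping over many species names, you could build the quoted query string with a small helper. This is a sketch only: quote_phrase is a made-up name, not an easyPubMed function, and the call to get_pubmed_ids() is left commented out because it needs network access.

```r
# Hypothetical helper: wrap a phrase in double quotes and append a field tag.
quote_phrase <- function(phrase, field = "TI") {
  sprintf('"%s"[%s]', phrase, field)
}

species <- c("Tawny-breasted Tinamou", "Common Raven")
queries <- quote_phrase(species)
print(queries)

# The two ways of writing the string in R give the same value:
same <- identical('"Tawny-breasted Tinamou"[TI]',
                  "\"Tawny-breasted Tinamou\"[TI]")

# Not run here (requires the easyPubMed package and a network connection):
# results <- lapply(queries, easyPubMed::get_pubmed_ids)
```

Quoting every name, including those without special characters, keeps the queries consistent across species.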

huangapple
  • Posted on June 16, 2023, 00:56:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76483926.html