2023-04-13 23:14:37

Finding the word or sentence that follows a search word and putting it into a dictionary (Python)

Question

I have to find the word or sentence that follows each search word and put the results into a dictionary. My data is in a PDF, which I have already extracted to text using the PyPDF2 library. I am new to NLP and I don't know how to implement this part of the code.
I know how to find a single word that follows the search word, but sometimes the value is one word and sometimes it is a whole sentence, whose end can be identified by \n.

> The text example:
>
> ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n•Mean: 54695.29\n•Minimum: 8.00\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx', ' ccc']

import PyPDF2
search_keywords = ['CODE', 'LVI', 'HEALTH']
pdfFileObj = open('df.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(5)
text = pageObj.extractText()
text = text.split(",")
text

The output should be:

{"CODE": "ID",
 "LVI": None,
 "HEALTH": "h1 - Health in general"}

Answer 1

Score: 1

You can use:

* `str.splitlines` to split your text on newlines;
* `str.split` with `maxsplit=1` to split your text on `':'`.

Since we never split on spaces, it doesn't matter whether the value after the search word is only one word or a full sentence.

texts = ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n * Mean: 54695.29\n* Minimum: 8.00\nVariable Type: numeric\n HEALTH: h1 - Health in general\nxxx', ' ccc']
search_keywords = ['CODE', 'LVI', 'HEALTH']
# Split every line on the first ':' only and strip whitespace from both halves.
pairs = [tuple(word.strip() for word in sentence.split(':', maxsplit=1)) for text in texts for sentence in text.splitlines()]
# Lines without a ':' produce 1-tuples and are dropped here.
pairs_dict = dict(pair for pair in pairs if len(pair) == 2)
# Keywords that never appear in the text map to None.
result = {k: pairs_dict.get(k) for k in search_keywords}
print(result)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
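
The same idea can also be applied directly to the single string returned by the PDF extraction, without first splitting it on commas. The following is only a minimal sketch: the helper name `extract_fields` and the shortened sample text are hypothetical, and unlike the `dict(...)` call above it keeps the first occurrence of a repeated key instead of the last.

```python
def extract_fields(text, keywords):
    """Return {keyword: value or None} for 'KEY: value' lines in text."""
    pairs = {}
    for line in text.splitlines():
        # partition splits on the first ':' only, so colons inside the value survive.
        key, sep, value = line.partition(':')
        if sep:  # skip lines that contain no colon at all
            # setdefault keeps the first value seen for a repeated key.
            pairs.setdefault(key.strip(), value.strip())
    return {k: pairs.get(k) for k in keywords}


# Hypothetical sample standing in for the extracted page text.
page_text = (
    "CODE: ID\n"
    "Study of Men's brain ID\n"
    "Variable Type: numeric \n"
    "HEALTH: h1 - Health in general\n"
    " xxx"
)

fields = extract_fields(page_text, ['CODE', 'LVI', 'HEALTH'])
print(fields)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
```

Whether the first or the last occurrence of a key should win depends on the layout of the PDF, so that choice may need to be adjusted.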

Answer 2

Score: 1

You should read up on regular expressions, which are implemented in Python's re module. Regular expressions (also known as regex) let you search for patterns of text rather than only fixed keywords, which will be useful for your NLP work. The official re documentation and a regex tutorial are good starting points.

This code accomplishes what you're after:

import re
text = ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n•Mean: 54695.29\n•Minimum: 8.00\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx', ' ccc']
results = {}

for keyword in ['CODE', 'LVI', 'HEALTH']:
    keyword_found = False
    for sentence in text:
        # Search for the keyword and capture any text after colon+space and before \n
        result = re.search(f'({keyword}): (.+)\n', sentence)
        if result:
            keyword_found = True
            results.update({result.group(1): result.group(2)})
            break
    if not keyword_found:
        results.update({keyword: None})
print(results)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
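
One limitation of the pattern above is that it ends in \n, so a value sitting at the very end of a string is not matched. A possible variant, sketched here under the assumption that each "KEY: value" pair occupies its own line, anchors the pattern with `re.MULTILINE` instead and escapes the keyword with `re.escape`. The shortened sample list is hypothetical.

```python
import re

# Hypothetical sample: the HEALTH value has no trailing newline.
text = ["CODE: ID\nStudy of Men's brain ID", "HEALTH: h1 - Health in general"]
keywords = ['CODE', 'LVI', 'HEALTH']

results = {}
for keyword in keywords:
    # Under re.MULTILINE, ^ and $ match at line boundaries, so no trailing \n is needed.
    pattern = re.compile(
        rf'^[ \t]*{re.escape(keyword)}:[ \t]*(.+?)[ \t]*$', re.MULTILINE
    )
    results[keyword] = None  # default when the keyword is never found
    for chunk in text:
        match = pattern.search(chunk)
        if match:
            results[keyword] = match.group(1)
            break
print(results)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
```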

If you need to change the search pattern, modify the pattern argument in re.search(f'({keyword}): (.+)\n', sentence). For example, if you want two lines after the keyword, the pattern would be f'({keyword}): (.+\n.+)\n'.
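
As a quick illustration of the two-line pattern, the sketch below runs it against the first string of the sample data. The variable names are chosen here for illustration only.

```python
import re

sentence = "CODE: ID\nStudy of Men's brain ID\nBased upon 16"
keyword = 'CODE'

# (.+\n.+) captures the rest of the keyword's line plus the whole next line;
# "." does not match a newline, so each .+ stays within one line.
match = re.search(f'({keyword}): (.+\n.+)\n', sentence)
captured_lines = match.group(2).splitlines() if match else None
print(captured_lines)
# ['ID', "Study of Men's brain ID"]
```

Note that this pattern still requires a newline after the second line, so it returns no match when fewer than two complete lines follow the keyword.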
