Finding the words or sentence that follow a search word and putting them into a dictionary in Python

Question

I have to find the word or sentence that follows each search word and put them into a dictionary. My data is in a PDF, which I have already extracted to text using the PyPDF2 library. I am new to NLP and I don't know how to implement this part of the code.
I know how to find a single word that follows the search word, but sometimes the value is one word and sometimes it is a whole sentence, which can be identified by the \n that ends it.

> Example of the text:
>
> ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n•Mean: 54695.29\n•Minimum: 8.00\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx', ' ccc']

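import PyPDF2

search_keywords = ['CODE', 'LVI', 'HEALTH']

# Legacy PyPDF2 (< 3.0) API; in PyPDF2 3.x the equivalents are
# PdfReader, reader.pages[5] and page.extract_text()
pdfFileObj = open('df.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(5)  # page index 5, i.e. the sixth page
text = pageObj.extractText()
text = text.split(",")
print(text)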

The output should be:

{"CODE": "ID",
 "LVI": None,
 "HEALTH": "h1 - Health in general"}

Answer 1

Score: 1

You can use:

  • str.splitlines to split your text on newlines;
  • str.split with maxsplit=1 to split each line on the first ':' only.

Since we never split on spaces, it doesn't matter whether the value after the search word is only one word or a full sentence.

texts = ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n * Mean: 54695.29\n* Minimum: 8.00\nVariable Type: numeric\n HEALTH: h1 - Health in general\nxxx', ' ccc']
search_keywords = ['CODE', 'LVI', 'HEALTH']

# Split every text into lines, then split each line once on ':' and strip whitespace
pairs = [
    tuple(word.strip() for word in sentence.split(':', maxsplit=1))
    for text in texts
    for sentence in text.splitlines()
]
# Keep only the lines that contained a colon, i.e. (key, value) pairs
pairs_dict = dict(pair for pair in pairs if len(pair) == 2)

# Look up each search keyword; keywords that were not found map to None
result = {k: pairs_dict.get(k) for k in search_keywords}

print(result)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
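
If the page text is kept as one string rather than split into a list, the same line-and-colon idea can be wrapped in a small helper. This is only a minimal sketch: the function name extract_fields and the sample string are made up for illustration and are not part of the original answer. It uses str.partition instead of str.split, and it keeps the first occurrence of a repeated key, whereas the dict() call above keeps the last one.

```python
def extract_fields(text, keywords):
    """Return {keyword: value} for 'KEY: value' lines in text (hypothetical helper)."""
    found = {}
    for line in text.splitlines():
        # partition splits on the first ':' only, so colons in the value survive
        key, sep, value = line.partition(':')
        if sep:  # the line contained a colon
            # setdefault keeps the first occurrence of a repeated key
            found.setdefault(key.strip(), value.strip())
    # Keywords that never appear map to None
    return {k: found.get(k) for k in keywords}


sample = "CODE: ID\nStudy of Men's brain ID\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx"
print(extract_fields(sample, ['CODE', 'LVI', 'HEALTH']))
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
```

Because nothing is split on spaces or commas, a value that is a full sentence (or that itself contains a comma) comes back intact.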

Answer 2

Score: 1

You should read up on Regular Expressions, which are implemented in the re module in Python. Regular expressions (AKA regex) let you search for patterns of text in addition to key words. This will be useful for your NLP work. The re documentation is here, and you can find a tutorial here.

This code accomplishes what you're after:

import re

text = ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n•Mean: 54695.29\n•Minimum: 8.00\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx', ' ccc']

results = {}

for keyword in ['CODE', 'LVI', 'HEALTH']:
    keyword_found = False
    for sentence in text:
        # Search for the keyword and capture the text after colon+space, up to the next \n
        result = re.search(f'({keyword}): (.+)\n', sentence)
        if result:
            keyword_found = True
            results[result.group(1)] = result.group(2)
            break
    if not keyword_found:
        results[keyword] = None

print(results)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}

If you need to change the search pattern, modify the pattern passed to re.search(f'({keyword}): (.+)\n',sentence). For example, if you want two lines after the keyword, the pattern would be f'({keyword}): (.+\n.+)\n'.
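
One limitation of the pattern above is that it requires a newline after the value, so a keyword on the last line of a string is missed, and a keyword containing regex metacharacters would be misinterpreted. A hedged variant, assuming each key starts its own line, combines re.escape with the re.MULTILINE flag. The helper name find_values and the sample chunks are hypothetical and not part of the original answer.

```python
import re


def find_values(chunks, keywords):
    """Map each keyword to the text after 'KEYWORD:' on its line, or None."""
    results = {}
    for keyword in keywords:
        # With re.MULTILINE, ^ and $ match at the start and end of every line,
        # so the value does not need a trailing newline.
        # re.escape protects keywords that contain regex metacharacters.
        pattern = re.compile(
            rf'^[ \t]*{re.escape(keyword)}:[ \t]*(.*?)[ \t]*$',
            re.MULTILINE,
        )
        results[keyword] = None  # default when the keyword is never found
        for chunk in chunks:
            match = pattern.search(chunk)
            if match:
                results[keyword] = match.group(1)
                break
    return results


chunks = ["CODE: ID\nStudy of Men's brain ID", "Variable Type: numeric \nHEALTH: h1 - Health in general"]
print(find_values(chunks, ['CODE', 'LVI', 'HEALTH']))
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
```

Here HEALTH sits on the last line of its chunk with no trailing newline and is still found, and surrounding spaces are trimmed by the pattern itself.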

huangapple
  • Posted on April 13, 2023 at 23:14:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76007112.html