Finding the word or sentence that follows a search word and putting it into a dictionary in Python
Question
I need to find the word or sentence that follows each search word and put the results into a dictionary. My data is in a PDF, which I have already extracted to text using the PyPDF2 library. I am new to NLP and don't know how to implement this part of the code.
I know how to find a single word that follows the search word, but sometimes the value is one word and sometimes it is a whole sentence, which can be identified by the \n that ends it.
> Example of the text:
>
> ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n•Mean: 54695.29\n•Minimum: 8.00\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx', ' ccc']
import PyPDF2

search_keywords = ['CODE', 'LVI', 'HEALTH']

# PyPDF2 >= 3.0 API; older releases used PdfFileReader, getPage and extractText
with open('df.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    page = reader.pages[5]
    text = page.extract_text()
text = text.split(",")
text
The output should be:
{"CODE": "ID",
"LVI": None,
"HEALTH": "h1 - Health in general"}
Answer 1
Score: 1
You can use:
* `str.splitlines` to split your text on newlines;
* `str.split` with `maxsplit=1` to split your text on `':'`.
Since we never split on spaces, it doesn't matter whether the value after the search word is only one word or a full sentence.
texts = ["CODE: ID\nStudy of Men's brain ID\nBased upon 16",'3 valid cases out of 76','33 total cases.\n * Mean: 54695.29\n* Minimum: 8.00\nVariable Type: numeric\n HEALTH: h1 - Health in general\nxxx', ' ccc']
search_keywords = ['CODE','LVI','HEALTH']
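# Split every line of every text on the first ':' and strip both parts
pairs = [
    tuple(part.strip() for part in line.split(':', maxsplit=1))
    for text in texts
    for line in text.splitlines()
]
# Lines without a ':' give a 1-tuple and are dropped here
pairs_dict = dict(pair for pair in pairs if len(pair) == 2)
# Keywords that never appear map to None
result = {k: pairs_dict.get(k) for k in search_keywords}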
print(result)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
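One reason to pass `maxsplit=1` is that a value may itself contain a colon. The following is a minimal sketch of the same idea wrapped in a function; the name `extract_fields` and the `TIME` sample line are made up for illustration and are not part of the question's data.

```python
def extract_fields(texts, keywords):
    """Return {keyword: value or None} for lines shaped like 'KEY: value'."""
    found = {}
    for text in texts:
        for line in text.splitlines():
            # Split on the first ':' only, so colons inside the value survive
            parts = line.split(':', maxsplit=1)
            if len(parts) == 2:
                key, value = (part.strip() for part in parts)
                # A later duplicate key overwrites an earlier one
                found[key] = value
    return {k: found.get(k) for k in keywords}


# Hypothetical sample: the TIME value contains colons of its own
sample = ["CODE: ID\nTIME: 12:30:05\nno colon here", "HEALTH: h1 - Health in general"]
result = extract_fields(sample, ['CODE', 'LVI', 'TIME', 'HEALTH'])
print(result)
# {'CODE': 'ID', 'LVI': None, 'TIME': '12:30:05', 'HEALTH': 'h1 - Health in general'}
```

Without `maxsplit=1`, the `TIME` line would split into four parts and be discarded by the length check.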
Answer 2
Score: 1
You should read up on regular expressions, which are implemented in the `re` module in Python. Regular expressions (AKA regex) let you search for patterns of text in addition to keywords. This will be useful for your NLP work. The `re` documentation and a regex tutorial are good places to start.
This code accomplishes what you're after:
import re

text = ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n•Mean: 54695.29\n•Minimum: 8.00\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx', ' ccc']
results = {}
for keyword in ['CODE', 'LVI', 'HEALTH']:
    keyword_found = False
    for sentence in text:
        # Search for the keyword and capture the text after colon+space, up to the next \n
        result = re.search(f'({keyword}): (.+)\n', sentence)
        if result:
            keyword_found = True
            results[result.group(1)] = result.group(2)
            break
    if not keyword_found:
        results[keyword] = None
print(results)
# {'CODE': 'ID', 'LVI': None, 'HEALTH': 'h1 - Health in general'}
If you need to change the search pattern, modify the pattern argument in `re.search(f'({keyword}): (.+)\n', sentence)`. For example, if you want the two lines after the keyword, the pattern would be `f'({keyword}): (.+\n.+)\n'`.
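As a rough illustration of the two-line variant, the sketch below runs that pattern on the sample text. The helper name `two_lines_after`, the `allow_end_of_text` flag, and the use of `re.escape` are additions for illustration and are not part of the answer itself.

```python
import re

text = ["CODE: ID\nStudy of Men's brain ID\nBased upon 16", '3 valid cases out of 76', '33 total cases.\n•Mean: 54695.29\n•Minimum: 8.00\nVariable Type: numeric \nHEALTH: h1 - Health in general\n xxx', ' ccc']


def two_lines_after(keyword, chunks, allow_end_of_text=False):
    """Hypothetical helper: return the rest of the keyword's line plus the next line."""
    # By default the second line must be followed by '\n', as in the answer's pattern;
    # with allow_end_of_text=True it may also be the last line of the chunk.
    tail = '(?:\n|$)' if allow_end_of_text else '\n'
    # re.escape only matters if a keyword contains regex metacharacters
    pattern = f'({re.escape(keyword)}): (.+\n.+){tail}'
    for chunk in chunks:
        match = re.search(pattern, chunk)
        if match:
            return match.group(2)
    return None


code_value = two_lines_after('CODE', text)
health_strict = two_lines_after('HEALTH', text)
health_relaxed = two_lines_after('HEALTH', text, allow_end_of_text=True)
print(repr(code_value), repr(health_strict), repr(health_relaxed))
```

For `CODE` this returns the two lines `ID` and `Study of Men's brain ID`. For `HEALTH` the strict pattern returns `None`, because the second line (` xxx`) ends the string and is not followed by `\n`; the relaxed tail `(?:\n|$)` accepts that case.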