Non greedy regex returns wrong result

Question

I am trying to clean an abstract by spotting the texts showing up between a dot and a colon followed by an uppercase character. To do so, I use the regex:

re.findall(r"\.\s(.*?):\s?[A-Z]", text) for the text

text = 'Background: Flavonoids constitute one of the best-characterized groups of plant secondary metabolites with enormous pharmaceutical potential. A flavone type of plant flavonoid, cirsilineol, has been reported to exhibit proapoptotic effects against malignant human cells. Objectives: The present study was designed to investigate the antiproliferative effects of cirsilineol against human gastric cancer cells. Materials and Methods: Cell viability was assessed by 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) and colony formation assays. Apoptosis was detected by acridine orange/ethidium bromide (AO/EB) and annexin V/propidium iodide (PI) assay. Protein expression was examined by western blotting analysis. Results: The results showed cirsilineol inhibits the proliferation of human gastric cancer cells. The IC50 of cirsilineol against human gastric cancer cells (BGC-823, SGC-7901, and MGC-803) ranged from 8 to 10 mu M. Nonetheless, cirsilineol exhibited comparatively lower antiproliferative effects against normal GES-1 cells. The IC50 of cirsilineol against normal GES-1 cells was found to be 120 mu M. Colony formation assay showed that cirsilineol suppressed the colony formation of BGC-823 and MGC-803 cells in a dose-dependent manner. Acridine orange and ethidium bromide (AO/EB) staining showed that cirsilineol induced apoptosis in BGC-823 and MGC-803 cells. The percentage of apoptosis increased from 7.4% in control to 40.5% in BGC-823 cells and from 6.56% in control to 33.53% in MGC-803 cells at 8 mu M cirsilineol. Western blotting showed cirsilineol caused an increase in Bax and cleaved caspase-3 and a decrease in Bcl-2 expression in both BGC-823 and MGC-803 cells. Conclusion: Together, the results are indicative of the proapoptotic and antitumor potential of cirsilineol against gastric cancer cells, suggestive of its possible therapeutic significance in future.'

However, the first extracted pattern is:

'A flavone type of plant flavonoid, cirsilineol, has been reported to exhibit proapoptotic effects against malignant human cells. Objectives',

while it should be 'Objectives'

What am I missing here?

Answer 1

Score: 1

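The lazy modifier means that the match stops as soon as it can instead of consuming more text. It does not affect where the match starts: the engine still begins at the leftmost position where a match is possible, which here is the first dot followed by whitespace.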
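A minimal sketch of that behaviour, using a shortened stand-in string rather than the full abstract (the sample text and variable names are assumptions for illustration):

```python
import re

# Shortened stand-in for the abstract (an assumption, for readability).
sample = "Intro: First sentence. Second sentence. Objectives: The study."

pattern = r"\.\s(.*?):\s?[A-Z]"
match = re.search(pattern, sample)

# The match starts at the FIRST ". " in the string. Laziness only decides
# how far (.*?) extends before a colon plus uppercase letter is found.
first_dot = sample.index(". ")
print(match.start() == first_dot)  # True
print(match.group(1))              # Second sentence. Objectives
```

The lazy group still swallows a whole sentence, because the engine commits to the earliest starting dot and then extends only as far as needed to reach a colon.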

To do what you described, you need to exclude the dot character from the captured text. Your regex then becomes:

\.\s([^.]*?):\s?[A-Z] 

This way no dots are allowed in the match except for the leading one.
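As a rough check, the negated character class can be tried on a condensed version of the abstract (the excerpt below is an abridged assumption, not the original text):

```python
import re

# Condensed excerpt modelled on the abstract in the question (abridged by hand).
excerpt = (
    "Background: Flavonoids have potential. "
    "A flavone, cirsilineol, has proapoptotic effects. "
    "Objectives: The present study investigates effects. "
    "Materials and Methods: Cell viability was assessed. "
    "Results: The IC50 ranged from 8 to 10 mu M. "
    "Nonetheless, effects were lower. "
    "Conclusion: Together, the results are indicative."
)

# [^.] forbids dots inside the capture, so a match cannot span sentences.
headings = re.findall(r"\.\s([^.]*?):\s?[A-Z]", excerpt)
print(headings)
# ['Objectives', 'Materials and Methods', 'Results', 'Conclusion']
```

Note that "Background" is not returned, since it sits at the start of the string and is not preceded by a dot and whitespace.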

Also, you could use

(?<=\.\s)[^.]+(?=:\s?[A-Z])

With the lookbehind and lookahead, the match itself contains only the text between the dot and the colon, without the dot, the colon, or the uppercase letter. That helps in languages or tools where pulling out a capture group is inconvenient.
In Python, both approaches work.
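A small sketch of the lookaround variant in Python, again on a hand-abridged excerpt (the excerpt is an assumption). Since the pattern has no capture group, re.findall returns the whole matches:

```python
import re

# Hand-abridged excerpt modelled on the abstract in the question.
excerpt = (
    "Background: Flavonoids have potential. "
    "A flavone, cirsilineol, has proapoptotic effects. "
    "Objectives: The present study investigates effects. "
    "Materials and Methods: Cell viability was assessed. "
    "Results: The IC50 ranged from 8 to 10 mu M. "
    "Conclusion: Together, the results are indicative."
)

# (?<=\.\s) requires ". " just before the match (fixed width, so Python's
# re module accepts it); (?=:\s?[A-Z]) requires a colon and an uppercase
# letter just after it. Neither is part of the returned text.
lookaround = re.findall(r"(?<=\.\s)[^.]+(?=:\s?[A-Z])", excerpt)
print(lookaround)
# ['Objectives', 'Materials and Methods', 'Results', 'Conclusion']
```

The lookbehind stays fixed-width here because \.\s always consumes exactly two characters; a variable-width lookbehind such as (?<=\.\s*) would be rejected by Python's re module.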

Published on 2023-04-06 18:58:49
Source: https://go.coder-hub.com/75948725.html
