Regular expression for capturing all text starting at one pattern and ending at another

Question

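I am scraping text data off a PDF using Python. The data I need follows a common pattern: it begins with a numeric pattern and ends with a string pattern. I need a regular expression that captures all the text between them, including the two patterns themselves.

I have a regular expression that works when I convert the PDF to a text file and read the text in. When I use PyPDF2 to extract the text from the PDF pages instead, the regular expression fails.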

The data stream looks like this:

```
Filed: 8/21/2022\nEntered:  10/21/2022\nDischarged:  01/23/2023\nClosed: 01/30/2023\n17-55018-   \nQRTbk 7 Windows PC\n OS:xxx\nRole: AdminHubertson
```

The start point is the `17-55018-` string, for which I have a working regex:

```
[0-9]{2}-[0-9]{5}-
```

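The end point is `Role: Admin`, which is unique enough to identify the end of a record.

I have tried a number of capture methods, including lookaheads, to get the text I need. I tested them on regex101, where they work, but I cannot get them to work in my actual code.

Some patterns I have tried: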
```
[0-9]{2}-[0-9]{5}-\s(\n(?!Role)(.*))*Role: Admin
[0-9]{2}-[0-9]{5}-\.(.*?)Role: Admin
[0-9]{2}-[0-9]{5}-.*(?=Role).*Role: Admin
```




# Answer 1
**Score**: 0

Try this one:

    \d{2}\-\d{5}.*?Role:\sAdmin
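
One possible reason a pattern like this works on regex101 but not in a script: by default `.` in Python's `re` does not match a newline, so `.*?` cannot span several lines unless `re.DOTALL` is set. The sketch below assumes the text returned by PyPDF2 contains real newline characters; the `text` value is a stand-in modeled on the sample in the question, not actual PyPDF2 output.

```python
import re

# Stand-in for the extracted page text, modeled on the question's sample.
# Assumption: the extracted text contains real "\n" characters.
text = (
    "Filed: 8/21/2022\nEntered:  10/21/2022\nDischarged:  01/23/2023\n"
    "Closed: 01/30/2023\n17-55018-   \nQRTbk 7 Windows PC\n OS:xxx\n"
    "Role: AdminHubertson"
)

# Without re.DOTALL, "." stops at "\n", so a record spread over lines never matches.
no_flag = re.search(r"\d{2}\-\d{5}.*?Role:\sAdmin", text)

# With re.DOTALL, "." also matches "\n"; the lazy ".*?" stops at the first end marker.
pattern = re.compile(r"\d{2}\-\d{5}.*?Role:\sAdmin", re.DOTALL)
records = pattern.findall(text)

print(no_flag)   # None
print(records)   # ['17-55018-   \nQRTbk 7 Windows PC\n OS:xxx\nRole: Admin']
```

If the pattern still fails on the real data, printing `repr(text)` for one page may help, since it shows exactly which whitespace and line-break characters the extraction produced.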




huangapple
  • Posted on June 1, 2023 at 07:35:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76377868.html