2023-04-19 16:14:56

Extracting information from a list of strings using regex

Question

I have a list of strings from which I want to extract amounts, percentages, and so on. I am new to regex and have been struggling with this. Below are my input, my desired output, and the code I tried.

Input list:

['0.09% of the first GBP£250 million of the Company’s Net Asset Value;', '0.08% of the next GBP£250 million of the Company’s Net Asset Value;', "0.06% of the next GBP£500 million of the Company's Net Asset Value; and", 'e GBP£22,000 in respect of cach of (he Company’s Sub-Funds which shall be accrued for on a daily basis', 'in accordance with the formula GBP£22,000 + 365, Minimum fee to be levied at a Company level,', 'e Preparation of fund interim and annual financial statements... GBP£2,750 per sub-fund pa', 'e UK Tax Reporting... ww. GBPE£L,500 per sub-fund pa', 'BUSD Tax Reporting’ v GBP£3,000 per sub-find pa', '© Account maintenance £00 sess resect GBPL£25 per investor pa', '» Manual .. GBPE£25 per transaction', '"Automated GBPE£S5 per Gransaction', 'e Investor registration and AML {ce GBP£50 per new investor account,', '« Fund distribution/dividend fee GBP£750 per distribution/dividend per sub fund.']

Code:

    import re

    def extract_pounds(text):
        regex = r"£(\w+)"
        return re.findall(regex, str(text))

    for word in empty_df:
        pounds = extract_pounds(word)
        print(pounds)

I get the following output, which is far from my desired output:

    ['250']
    ['250']
    ['500']
    ['22']
    ['22']

Desired output:

    Tier    Amount                  Minimum Fee   Sub-Fund     AccountMaintain
    0.09%   first GBP£250 million   GBP£22,000    GBP£2,750    £00 sess
                                                               resect GBPL£25
                                                               Manual GBPE£25
                                                               Automated GBPE£S5
    0.08%   next GBP£250 million                  GBPE£L,500
    0.06%   next GBP£500 million                  GBP£3,000
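
A note on the actual output shown above: `\w` matches letters, digits and underscores but not a comma, so `£(\w+)` stops at `22` in `GBP£22,000`. The sketch below reproduces that on two strings copied from the input list and compares it with an adjusted pattern. The adjusted pattern is an assumption for illustration, not something from the question, and it would still miss OCR-damaged amounts such as `GBPE£L,500`.

```python
import re

# Two strings copied verbatim from the input list in the question.
samples = [
    "0.09% of the first GBP£250 million of the Company’s Net Asset Value;",
    "e GBP£22,000 in respect of cach of (he Company’s Sub-Funds which shall be accrued for on a daily basis",
]

# Pattern from the question: \w+ cannot cross the thousands separator ",".
original = re.compile(r"£(\w+)")

# Adjusted pattern (assumption): digits, optionally followed by ",ddd" groups.
adjusted = re.compile(r"£(\d+(?:,\d{3})*)")

truncated = [original.findall(s) for s in samples]
complete = [adjusted.findall(s) for s in samples]

print(truncated)  # [['250'], ['22']]
print(complete)   # [['250'], ['22,000']]
```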

Answer 1

Score: 1

With [tag:pandas], you can try something like this:

    import re
    import pandas as pd

    # lst is the list of input strings
    pat = r"([\d.]+%) of the (\w+ GBP£\d+ \w+)"
    df = pd.Series(lst[:-1]).str.extract(pat).set_axis(["Tier", "Amount"], axis=1)

    # The minimum fee is read from the last string in the list
    df.loc[0, "Minimum Fee"] = re.search(r"GBP£\d+,\d+", lst[-1]).group(0)

Output:

    print(df)
        Tier                 Amount Minimum Fee
    0  0.09%  first GBP£250 million  GBP£22,000
    1  0.08%   next GBP£250 million         NaN
    2  0.06%   next GBP£500 million         NaN

UPDATE:

Based on your updated question/list, use this:

    pat1 = r"([\d.]+%) of the (\w+ GBP£\d+ \w+)"
    # Strings that do not match pat1 produce NaN rows, which dropna() removes
    df = pd.Series(lst).str.extract(pat1).set_axis(["Tier", "Amount"], axis=1).dropna()

    # Search the joined text for the amount that precedes "Minimum fee"
    pat2 = r"(GBP£\d+,\d+).*Minimum fee"
    result = re.search(pat2, " ".join(lst))
    mfee = result.group(1) if result else None

    df.loc[0, "Minimum Fee"] = mfee

Output:

    print(df)
        Tier                 Amount Minimum Fee
    0  0.09%  first GBP£250 million  GBP£22,000
    1  0.08%   next GBP£250 million         NaN
    2  0.06%   next GBP£500 million         NaN
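
The answer above fills only the Tier, Amount and Minimum Fee columns. For the Sub-Fund column of the desired output, one possible approach is a similar regex anchored on the trailing "per sub-fund pa". This is a minimal sketch and not part of the original answer: the name `sub_fund_pat` and its tolerance for the OCR variants (`GBPE£`, `sub-find`) are assumptions based only on the three strings shown in the question.

```python
import re

# Three strings copied verbatim from the input list in the question (OCR noise kept).
lst = [
    "e Preparation of fund interim and annual financial statements... GBP£2,750 per sub-fund pa",
    "e UK Tax Reporting... ww. GBPE£L,500 per sub-fund pa",
    "BUSD Tax Reporting’ v GBP£3,000 per sub-find pa",
]

# Hypothetical pattern: "GBP", an optional stray letter, "£", a loose run of
# word characters and commas, then "per sub-fund pa" (or the variant "sub-find").
sub_fund_pat = re.compile(r"(GBP\w?£[\w,]+) per sub-f[ui]nd pa")

sub_fund_fees = []
for text in lst:
    match = sub_fund_pat.search(text)
    if match:
        sub_fund_fees.append(match.group(1))

print(sub_fund_fees)  # ['GBP£2,750', 'GBPE£L,500', 'GBP£3,000']
```

The resulting list could then be assigned as a column, for example `df["Sub-Fund"] = sub_fund_fees`, provided it has the same length as `df` (three values here). The amounts are kept as raw strings because the OCR noise (`L,500`) cannot be parsed as a number without manual correction.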
