Extracting information from a list of strings using regex
Question
I have a list of strings from which I want to extract information such as amounts and percentages. I am new to regex and have been struggling with this. Below are my input, my desired output, and the code I tried.
Input list:
['0.09% of the first GBP£250 million of the Company’s Net Asset Value;', '0.08% of the next GBP£250 million of the Company’s Net Asset Value;', "0.06% of the next GBP£500 million of the Company's Net Asset Value; and", 'e GBP£22,000 in respect of cach of (he Company’s Sub-Funds which shall be accrued for on a daily basis', 'in accordance with the formula GBP£22,000 + 365, Minimum fee to be levied at a Company level,', 'e Preparation of fund interim and annual financial statements... GBP£2,750 per sub-fund pa', 'e UK Tax Reporting... ww. GBPE£L,500 per sub-fund pa', 'BUSD Tax Reporting’ v GBP£3,000 per sub-find pa', '© Account maintenance £00 sess resect GBPL£25 per investor pa', '» Manual .. GBPE£25 per transaction', '"Automated GBPE£S5 per Gransaction', 'e Investor registration and AML {ce GBP£50 per new investor account,', '« Fund distribution/dividend fee GBP£750 per distribution/dividend per sub fund.']
Code:
import re

def extract_pounds(text):
    # Capture the run of word characters that follows each "£"
    regex = r"£(\w+)"
    return re.findall(regex, str(text))

for word in empty_df:
    pounds = extract_pounds(word)
    print(pounds)
I am getting the following output, which is far from my desired output:
['250']
['250']
['500']
['22']
['22']
Desired output:
Tier   Amount                 Minimum Fee  Sub-Fund    AccountMaintain
0.09%  first GBP£250 million  GBP£22,000   GBP£2,750   £00 sess
                                                       resect GBPL£25
                                                       Manual GBPE£25
                                                       Automated GBPE£S5
0.08%  next GBP£250 million                GBPE£L,500
0.06%  next GBP£500 million                GBP£3,000
Answer 1
Score: 1
With pandas, you can try something like this:
import re
import pandas as pd

# lst is the input list from the question, as posted before the update below
pat = r"([\d.]+%) of the (\w+ GBP£\d+ \w+)"
df = pd.Series(lst[:-1]).str.extract(pat).set_axis(["Tier", "Amount"], axis=1)
df.loc[0, "Minimum Fee"] = re.search(r"GBP£\d+,\d+", lst[-1]).group(0)
Output:
print(df)
    Tier                 Amount Minimum Fee
0  0.09%  first GBP£250 million  GBP£22,000
1  0.08%   next GBP£250 million         NaN
2  0.06%   next GBP£500 million         NaN
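The pattern above works through two capture groups, one for the percentage and one for the amount phrase. As a rough, pandas-free sketch of the same idea, the groups can be given names so each match comes back as a small dict. The names `TIER_PAT` and `extract_tiers` are made up for this illustration, and the sample lines are copied from the question's input list.

```python
import re

# Same pattern as in the answer, with named groups (names chosen for illustration).
TIER_PAT = re.compile(r"(?P<Tier>[\d.]+%) of the (?P<Amount>\w+ GBP£\d+ \w+)")

def extract_tiers(lines):
    """Return one dict per line that matches the tier pattern; skip the rest."""
    rows = []
    for line in lines:
        m = TIER_PAT.search(line)
        if m:
            rows.append(m.groupdict())
    return rows

lines = [
    "0.08% of the next GBP£250 million of the Company’s Net Asset Value;",
    "0.06% of the next GBP£500 million of the Company's Net Asset Value; and",
    "» Manual .. GBPE£25 per transaction",
]

for row in extract_tiers(lines):
    print(row)
```

Lines that do not start with a tier percentage, such as the last one, are simply skipped, which plays the same role as `dropna()` in the pandas version.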
UPDATE:
Based on your updated question/list, use this:
pat1 = r"([\d.]+%) of the (\w+ GBP£\d+ \w+)"
# Rows that do not match pat1 come back as NaN and are dropped
df = pd.Series(lst).str.extract(pat1).set_axis(["Tier", "Amount"], axis=1).dropna()

# First comma-grouped amount that is followed, somewhere later, by "Minimum fee"
pat2 = r"(GBP£\d+,\d+).*Minimum fee"
result = re.search(pat2, " ".join(lst))
mfee = result.group(1) if result else None
df.loc[0, "Minimum Fee"] = mfee
Output:
print(df)
    Tier                 Amount Minimum Fee
0  0.09%  first GBP£250 million  GBP£22,000
1  0.08%   next GBP£250 million         NaN
2  0.06%   next GBP£500 million         NaN
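As a side note on the regex in the question: `\w` does not match a comma, so `£(\w+)` stops at the thousands separator and returns `22` instead of `22,000`. The sketch below contrasts it with one possible alternative that keeps the `GBP` prefix and allows comma-grouped digits. The names `narrow`, `wide`, and `extract_amounts` are hypothetical, and the pattern deliberately ignores OCR-damaged amounts such as `GBPE£L,500`, which would need their own handling.

```python
import re

# Sample lines copied from the question's input list.
samples = [
    "0.06% of the next GBP£500 million of the Company's Net Asset Value; and",
    "in accordance with the formula GBP£22,000 + 365, Minimum fee to be levied at a Company level,",
    "e Preparation of fund interim and annual financial statements... GBP£2,750 per sub-fund pa",
]

# The question's pattern: \w+ cannot cross the ',' in "22,000".
narrow = re.compile(r"£(\w+)")

# Assumed alternative: "GBP£", digits, then optional groups of ",ddd".
wide = re.compile(r"GBP£\d+(?:,\d{3})*")

def extract_amounts(text, pattern=wide):
    """Return every amount in text that matches the given pattern."""
    return pattern.findall(text)

for s in samples:
    print(narrow.findall(s), extract_amounts(s))
```

For the second sample, the first pattern yields `['22']` while the second yields `['GBP£22,000']`.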