Extracting information from a list of strings using regex

Question

I have a list of strings from which I want to extract amounts, percentages, and similar values. I am new to regex and have been struggling with it. Below are my input, my desired output, and the code I tried.

Input list:

    ['0.09% of the first GBP£250 million of the Companys Net Asset Value;',
     '0.08% of the next GBP£250 million of the Companys Net Asset Value;',
     "0.06% of the next GBP£500 million of the Company's Net Asset Value; and",
     'e GBP£22,000 in respect of cach of (he Companys Sub-Funds which shall be accrued for on a daily basis',
     'in accordance with the formula GBP£22,000 + 365, Minimum fee to be levied at a Company level,',
     'e Preparation of fund interim and annual financial statements... GBP£2,750 per sub-fund pa',
     'e UK Tax Reporting... ww. GBPE£L,500 per sub-fund pa',
     'BUSD Tax Reporting v GBP£3,000 per sub-find pa',
     '© Account maintenance £00 sess resect GBPL£25 per investor pa',
     '» Manual .. GBPE£25 per transaction',
     '"Automated GBPE£S5 per Gransaction',
     'e Investor registration and AML {ce GBP£50 per new investor account,',
     '« Fund distribution/dividend fee GBP£750 per distribution/dividend per sub fund.']

Code:

    import re

    def extract_pounds(text):
        # Capture the run of word characters that follows a pound sign
        regex = r"£(\w+)"
        return re.findall(regex, str(text))

    # empty_df holds the input strings
    for word in empty_df:
        pounds = extract_pounds(word)
        print(pounds)

I get the following output, which is far from my desired output:

    ['250']
    ['250']
    ['500']
    ['22']
    ['22']

Desired output:

    Tier   Amount                 Minimum Fee  Sub-Fund    AccountMaintain
    0.09%  first GBP£250 million  GBP£22,000   GBP£2,750   £00 sess resect GBPL£25
                                                           Manual GBPE£25
                                                           Automated GBPE£S5
    0.08%  next GBP£250 million                GBPE£L,500
    0.06%  next GBP£500 million                GBP£3,000

Answer 1

Score: 1

With [tag:pandas], you can try something like this:

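    import re
    import pandas as pd

    # lst is the list of strings from the question
    # Group 1: the percentage tier; group 2: "<first|next> GBP£<amount> <unit>"
    pat = r"([\d.]+%) of the (\w+ GBP£\d+ \w+)"
    df = pd.Series(lst[:-1]).str.extract(pat).set_axis(["Tier", "Amount"], axis=1)
    # The minimum fee is taken from the last string in the list
    df.loc[0, "Minimum Fee"] = re.search(r"GBP£\d+,\d+", lst[-1]).group(0)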

Output:

    print(df)
        Tier                 Amount Minimum Fee
    0  0.09%  first GBP£250 million   GBP£22,000
    1  0.08%   next GBP£250 million          NaN
    2  0.06%   next GBP£500 million          NaN
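
To see what the two capture groups in the pattern above pick up, here is a minimal sketch that applies the same regex with plain `re`, without pandas. The `lines` list is copied from the question's input; the names `TIER_PATTERN` and `extract_tiers` are made up for illustration and are not part of the answer.

```python
import re

# Sample strings copied from the question's input list.
lines = [
    "0.09% of the first GBP£250 million of the Companys Net Asset Value;",
    "0.08% of the next GBP£250 million of the Companys Net Asset Value;",
    "0.06% of the next GBP£500 million of the Company's Net Asset Value; and",
    "e GBP£22,000 in respect of cach of (he Companys Sub-Funds which shall be accrued for on a daily basis",
]

# Same pattern as in the answer: group 1 is the percentage,
# group 2 is "<word> GBP£<digits> <word>".
TIER_PATTERN = re.compile(r"([\d.]+%) of the (\w+ GBP£\d+ \w+)")


def extract_tiers(strings):
    """Return (tier, amount) tuples for every string that matches the tier pattern."""
    rows = []
    for s in strings:
        match = TIER_PATTERN.search(s)
        if match:  # strings without a percentage tier are skipped
            rows.append(match.groups())
    return rows


tiers = extract_tiers(lines)
print(tiers)
# [('0.09%', 'first GBP£250 million'), ('0.08%', 'next GBP£250 million'), ('0.06%', 'next GBP£500 million')]
```

The fourth string has no `%` value, so it yields no row. Skipping it here plays the same role as excluding or dropping the non-matching rows in the pandas version.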

UPDATE:

Based on your updated question/list, use this:

    # Tier and amount: rows that do not match the pattern are dropped
    pat1 = r"([\d.]+%) of the (\w+ GBP£\d+ \w+)"
    df = pd.Series(lst).str.extract(pat1).set_axis(["Tier", "Amount"], axis=1).dropna()

    # Minimum fee: the comma-separated GBP amount that precedes "Minimum fee"
    pat2 = r"(GBP£\d+,\d+).*Minimum fee"
    result = re.search(pat2, " ".join(lst))
    mfee = result.group(1) if result else None
    df.loc[0, "Minimum Fee"] = mfee

Output:

    print(df)
        Tier                 Amount Minimum Fee
    0  0.09%  first GBP£250 million   GBP£22,000
    1  0.08%   next GBP£250 million          NaN
    2  0.06%   next GBP£500 million          NaN
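
As a side note on why the question's own pattern `£(\w+)` returns only `'250'` or `'22'`: `\w` matches letters, digits and underscore, so the capture stops at the first comma or space, and the `GBP` prefix sits before the `£` and is never captured. The sketch below contrasts it with a broader pattern. `AMOUNT` is a hypothetical pattern written for this illustration, not part of the answer, and it assumes the OCR noise is limited to a single stray capital letter after `GBP` and letters in place of digits.

```python
import re

# Strings copied from the question's input list.
samples = [
    "in accordance with the formula GBP£22,000 + 365, Minimum fee to be levied at a Company level,",
    "e UK Tax Reporting... ww. GBPE£L,500 per sub-fund pa",
    "© Account maintenance £00 sess resect GBPL£25 per investor pa",
]

# Pattern from the question: captures word characters after "£" only.
ORIGINAL = re.compile(r"£(\w+)")

# Hypothetical broader pattern (an assumption, tuned to this sample data):
# optional "GBP" plus at most one stray capital letter, then "£",
# then word characters optionally continued by ",<word characters>" groups.
AMOUNT = re.compile(r"(?:GBP[A-Z]?)?£\w+(?:,\w+)*")

original_hits = [ORIGINAL.findall(s) for s in samples]
amount_hits = [AMOUNT.findall(s) for s in samples]

for s, old, new in zip(samples, original_hits, amount_hits):
    print(old, "->", new)
# ['22'] -> ['GBP£22,000']
# ['L'] -> ['GBPE£L,500']
# ['00', '25'] -> ['£00', 'GBPL£25']
```

Because the broader pattern accepts any word characters after the `£`, it keeps garbled tokens such as `GBPE£L,500` verbatim, which is what the desired output in the question shows. Cleaning those tokens up would be a separate step.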

huangapple
  • Published on April 19, 2023 at 16:14:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76052167.html