Python - extract information from email
Question
I am new to Python. Below are some sample emails I received.
Email sample 1
Dear all,
Please note the Total selling volume and total remaining stock
Total selling volume: 45677
Total remaining stock A:3456
Remain at your disposal in case of any doubt or comments.
Best Regards,
Email sample 2
Dear all,
Please see the data as below:
Tol volume: 1,231,245
No. of remaining stock A: 232
No. of remaining stock B: 1,435
Email sample 3
Dear All,
Please find our volume was 233,435
Total remaining stock A: 2453
Email sample 4
In May we had 90 remaining stock A and 4190 TEUs.
I would like to extract the volume and total remaining stock figures from those emails. Are there any hints on how I can get those figures using Python?
I have prepared the code below to extract the figures from the emails. However, I cannot distinguish which figure is the total selling volume and which is the total remaining stock.
import re
import pandas as pd
import win32com.client
from datetime import datetime, timedelta

outlook = win32com.client.Dispatch('outlook.application')
mapi = outlook.GetNamespace("MAPI")
# 6 is the Outlook constant for the default Inbox folder (olFolderInbox)
inbox = mapi.GetDefaultFolder(6).Folders.Item("AI email testing")
messages = inbox.Items
received_dt = datetime.now() - timedelta(days=1)
received_dt = received_dt.strftime('%m/%d/%Y %H:%M %p')

for message in list(messages):
    body_content = message.Body
    # Keep only the text from "Subject:" onwards, if that marker is present
    start = body_content.find("Subject:")
    if start != -1:
        body_content = body_content[start:]
    # Find every number, allowing comma separators and a decimal part
    figures = re.findall(r"\d+(?:,\d+)*(?:\.\d+)?", body_content)
    print(figures)
Answer 1
Score: 0
Here's a solution using RegEx:
```python
from __future__ import annotations

import re
from typing import List, Tuple

# Matches an integer or decimal, optionally with comma thousands separators.
NUMBER_PATTERN = r'\d+(?:,\d+)*(?:\.\d+)?'


def get_number(text: str) -> float | int | str:
    """
    Extract the first numeric value from the input string.

    The function uses a regular expression to find the first numeric
    occurrence in `text`. If no numeric value is found, the original string
    is returned. Commas are removed from the extracted number, if any.
    The function first attempts to convert the number to an integer,
    and if that fails, it converts it to a float.

    Parameters
    ----------
    text : str
        The string from which the numeric value should be extracted.

    Returns
    -------
    float | int | str
        The first numeric value in `text` converted to int or float,
        or the original `text` if no numeric value is found.

    Examples
    --------
    Illustration of the function usage and behavior.

    >>> get_number("Hello world 123!")
    123
    >>> get_number("I have 2,200 dollars.")
    2200
    >>> get_number("No numbers here.")
    'No numbers here.'
    >>> get_number("It is over 9000!")
    9000
    >>> get_number("The value of pi is about 3.14159.")
    3.14159
    >>> get_number("Total: 123,456,789.")
    123456789
    """
    match = re.search(NUMBER_PATTERN, text)
    if match is None:
        return text
    number = match[0].replace(',', '')
    try:
        return int(number)
    except ValueError:
        return float(number)


def extract_stock_volume_from_email(email: str) -> Tuple[int | float | str, int | float | str]:
    """
    Extract the volume and remaining stock A details from an email text.

    This function employs regular expressions to parse the given email text and
    extract details about volume and remaining stock A.
    The values extracted are then cleaned and returned.

    Parameters
    ----------
    email : str
        Text from the email to parse.

    Returns
    -------
    volume : int | float | str
        Volume extracted from the email.
        Returns 'Volume not found' if no volume details are found.
    remaining_stock_a : int | float | str
        Remaining stock A extracted from the email.
        Returns 'Remaining stock A not found' if no stock A details are found.

    See Also
    --------
    re.search : The method used for extracting volume and remaining stock details.

    Examples
    --------
    >>> email_text = "The volume was 5000 TEUs. Stock A: 1000 units."
    >>> extract_stock_volume_from_email(email_text)
    (5000, 1000)
    >>> email_text = "No volume and stock data available."
    >>> extract_stock_volume_from_email(email_text)
    ('Volume not found', 'Remaining stock A not found')
    """
    # Extract the volume: "volume: N" or "volume was N", otherwise "N TEU(s)"
    volume_match = re.search(
        rf'volume(?::| was)\s*({NUMBER_PATTERN})|({NUMBER_PATTERN})\s*TEUs?\b',
        email, re.I
    )
    if volume_match:
        # Only one of the two groups is set, depending on which form matched
        volume = get_number(volume_match[1] or volume_match[2])
    else:
        volume = 'Volume not found'

    # Extract the remaining stock A: "stock A: N" first, then "N ... stock A"
    stock_match = (
        re.search(rf'stock A:\s*({NUMBER_PATTERN})', email, re.I)
        or re.search(rf'({NUMBER_PATTERN})[^\d\n]*?stock A\b', email, re.I)
    )
    if stock_match:
        remaining_stock_a = get_number(stock_match[1])
    else:
        remaining_stock_a = 'Remaining stock A not found'

    return volume, remaining_stock_a


def extract_stock_volume_from_emails(
    emails: List[str],
) -> List[Tuple[int | float | str, int | float | str]]:
    """
    Apply the function `extract_stock_volume_from_email` to a list of emails.

    Parameters
    ----------
    emails : List[str]
        A list of email texts to be parsed.

    Returns
    -------
    List[Tuple[int | float | str, int | float | str]]
        A list of tuples. Each tuple contains the volume and remaining stock A
        extracted from each email. If no volume or stock A details could be
        extracted from an email, the corresponding element in the tuple will be
        'Volume not found' or 'Remaining stock A not found', respectively.

    See Also
    --------
    extract_stock_volume_from_email : The function used to extract details from each email.

    Examples
    --------
    >>> email_texts = [
    ...     "The volume was 5000 TEUs. Stock A: 1000 units.",
    ...     "No volume and stock data available.",
    ... ]
    >>> extract_stock_volume_from_emails(email_texts)
    [(5000, 1000), ('Volume not found', 'Remaining stock A not found')]
    """
    return list(map(extract_stock_volume_from_email, emails))
```
Running the code above on the example e-mails you provided:
emails = [
r"""Dear all,
Please note the Total selling volume and total remaining stock
Total selling volume: 45677 Total remaining stock A:3456
Remain at your disposal in case of any doubt or comments.
Best Regards,""",
r"""Dear all,
Please see the data as below:
Tol volume: 1,231,245 No. of remaining stock A: 232 No. of remaining stock B: 1,435""",
r"""Dear All,
Please find our volume was 233,435
Total remaining stock A: 2453""",
r"In May we had 90 remaining stock A and 4190 TEUs.",
]
extract_stock_volume_from_emails(emails)
# Returns:
#
# [(45677, 3456), (1231245, 232), (233435, 2453), (4190, 90)]
#   ^---^  ^--^
#   |      |
#   |      +-- Remaining stock A
#   +-- Volume
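Since the original difficulty was telling the two figures apart, the positional tuples can be given field names. The following is only a sketch: `EmailFigures` and `label_results` are hypothetical names that do not appear in the answer's code, and the sample data is copied from the result shown above.

```python
from typing import List, NamedTuple, Tuple, Union

# Same value types the extraction functions return (a str means "not found").
Number = Union[int, float, str]


class EmailFigures(NamedTuple):
    """Hypothetical container that names the two extracted figures."""
    volume: Number
    remaining_stock_a: Number


def label_results(rows: List[Tuple[Number, Number]]) -> List[EmailFigures]:
    """Wrap each (volume, remaining stock A) tuple in a named tuple."""
    return [EmailFigures(*row) for row in rows]


# Sample data copied from the result shown above.
results = [(45677, 3456), (1231245, 232), (233435, 2453), (4190, 90)]

for figures in label_results(results):
    print(f"volume={figures.volume}, remaining stock A={figures.remaining_stock_a}")
```

Because a `NamedTuple` is still a tuple, code that unpacks the results positionally keeps working.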
Note
The function `extract_stock_volume_from_email`, which parses each e-mail, is not foolproof. Its RegEx patterns are all based on the example e-mails you provided. If other e-mails don't follow the same patterns, additional patterns will have to be added to `extract_stock_volume_from_email`.
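One way to make adding patterns less invasive is to keep them in a table that is tried in order, so a new e-mail layout only needs a new entry. This is a minimal sketch under that assumption, not the answer's implementation: `PATTERNS`, `to_number` and `extract_fields` are hypothetical names, and the patterns cover only the four sample e-mails from the question.

```python
import re
from typing import Dict, List, Optional, Union

# Integer or decimal, optionally with comma thousands separators.
NUMBER = r"\d+(?:,\d+)*(?:\.\d+)?"

# Hypothetical pattern table: for each field, patterns are tried in order
# and the first one that matches wins. Each pattern has one capture group.
PATTERNS: Dict[str, List[str]] = {
    "volume": [
        rf"volume(?::| was)\s*({NUMBER})",
        rf"({NUMBER})\s*TEUs?\b",
    ],
    "remaining_stock_a": [
        rf"stock A:\s*({NUMBER})",
        rf"({NUMBER})\s+remaining stock A\b",
    ],
}


def to_number(text: str) -> Union[int, float]:
    """Convert a matched number such as '1,231,245' or '3.5' to int or float."""
    cleaned = text.replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


def extract_fields(email: str) -> Dict[str, Optional[Union[int, float]]]:
    """Return one value per field, or None when no pattern matched."""
    result: Dict[str, Optional[Union[int, float]]] = {}
    for field, patterns in PATTERNS.items():
        result[field] = None
        for pattern in patterns:
            match = re.search(pattern, email, re.IGNORECASE)
            if match:
                result[field] = to_number(match.group(1))
                break
    return result


print(extract_fields("In May we had 90 remaining stock A and 4190 TEUs."))
# {'volume': 4190, 'remaining_stock_a': 90}
```

Returning `None` instead of a message string keeps the "not found" case distinguishable from a real value when the results are processed further.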
Answer 2
Score: 0
import re

email_content = """Dear all,Please note the Total selling volume
and total remaining stock Total selling volume: 45677 Total
remaining stock A:3456 Remain at your disposal in case of any
doubt or comments.
Best Regards,
"""

# Regular expression pattern to match the numbers together with their labels.
# \s+ between the words lets a label span a line break in the email body.
number_pattern = (
    r'Total\s+selling\s+volume:\s*(\d+(?:[.,]\d+)*)\s+'
    r'Total\s+remaining\s+stock\s+A:\s*(\d+(?:[.,]\d+)*)'
)

# Extract the labelled numbers using the regular expression
matches = re.findall(number_pattern, email_content)
for total_selling_volume, total_remaining_stock in matches:
    print("Total selling volume:", total_selling_volume)
    print("Total remaining stock:", total_remaining_stock)

# Output
# Total selling volume: 45677
# Total remaining stock: 3456
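The pattern in this answer hard-codes both labels, so it only fits the first sample e-mail. A possible generalisation, sketched here under the assumption that each figure sits on its own "label: number" line (as in samples 1 and 2 of the question), is to capture whatever label precedes the number. `LABELLED_LINE` and `parse_labelled_figures` are hypothetical names, and free-text sentences such as samples 3 and 4 would still need their own patterns.

```python
import re
from typing import Dict, Union

# One "label: number" pair per line. \r is tolerated because e-mail bodies
# often use Windows line endings.
LABELLED_LINE = re.compile(
    r"^(?P<label>[^:\r\n]+?)[ \t]*:[ \t]*"
    r"(?P<value>\d+(?:,\d+)*(?:\.\d+)?)[ \t\r]*$",
    re.MULTILINE,
)


def parse_labelled_figures(body: str) -> Dict[str, Union[int, float]]:
    """Map each label found in `body` to its number (commas removed)."""
    figures: Dict[str, Union[int, float]] = {}
    for match in LABELLED_LINE.finditer(body):
        value = match.group("value").replace(",", "")
        label = match.group("label").strip()
        figures[label] = float(value) if "." in value else int(value)
    return figures


sample = """Dear all,
Please see the data as below:
Tol volume: 1,231,245
No. of remaining stock A: 232
No. of remaining stock B: 1,435
"""

print(parse_labelled_figures(sample))
# {'Tol volume': 1231245, 'No. of remaining stock A': 232, 'No. of remaining stock B': 1435}
```

Keeping the label as the dictionary key means the caller decides afterwards which labels count as "volume" or "remaining stock".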