June 13, 2023 11:32:47


Python - extract information from email

Question


I am new to Python. Below are some sample emails I received.

Email sample 1

Dear all,

Please note the Total selling volume and total remaining stock

Total selling volume: 45677
Total remaining stock A:3456

Remain at your disposal in case of any doubt or comments.

Best Regards,

Email sample 2

Dear all,

Please see the data as below:

Tol volume: 1,231,245
No. of remaining stock A: 232
No. of remaining stock B: 1,435

Email sample 3

Dear All,

Please find our volume was 233,435

Total remaining stock A: 2453

Email sample 4

In May we had 90 remaining stock A and 4190 TEUs.

I would like to extract the volume and total remaining stock figures from those emails. Any hints on how I can get those figures using Python?

I have prepared the code below to extract the figures from the emails. However, I cannot distinguish which figure is the total selling volume and which is the total remaining stock.

import re
import pandas as pd
import win32com.client
from datetime import datetime, timedelta

outlook = win32com.client.Dispatch('outlook.application')
mapi = outlook.GetNamespace("MAPI")
inbox = mapi.GetDefaultFolder(6).Folders.Item("AI email testing")
#outlook.GetDefaultFolder(6).Folders.Item("Your_Folder_Name")
#inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
received_dt = datetime.now() - timedelta(days=1)
received_dt = received_dt.strftime('%m/%d/%Y %H:%M %p')
for message in list(messages):
    #print(message)
    body_content = message.body
    body_content = body_content[body_content.find("Subject:"):]
    #print(body_content)
    figures = re.findall(r"\d+(?:,\d+)*(?:\.\d+)?", body_content)
    print(figures)

Answer 1

Score: 0

Here's a solution using RegEx:
```python
from __future__ import annotations

import re
from typing import List, Tuple


def get_number(text: str) -> float | int | str:
    """
    Extract the first numeric value from the input string.

    The function uses regular expressions to extract the first numeric
    occurrence from `text`. If no numeric value is found, the original string
    is returned. Commas are removed from the extracted number, if any.
    The function first attempts to convert the number to an integer,
    and if that fails, it converts it to a float.

    Parameters
    ----------
    text : str
        The string from which the numeric value should be extracted.

    Returns
    -------
    float | int | str
        The first numeric value in `text` converted to int or float,
        or original `text` if no numeric value is found.

    Examples
    --------
    Illustration of the function usage and behavior.

    >>> get_number("Hello world 123!")
    123
    >>> get_number("I have 2,200 dollars.")
    2200
    >>> get_number("No numbers here.")
    Found no numbers inside text: 'No numbers here.'
    'No numbers here.'
    >>> get_number("It is over 9000!")
    9000
    >>> get_number("The value of pi is about 3.14159.")
    3.14159
    >>> get_number("Total: 123,456,789.")
    123456789
    """
    # Match digits with optional thousands separators and an optional
    # decimal part, without swallowing the character after the number.
    number = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if number:
        number = number[0].replace(',', '')
    if not number:
        print(f"Found no numbers inside text: {text!r}")
        return text
    try:
        return int(number)
    except ValueError:
        return float(number)


def extract_stock_volume_from_email(email: str) -> Tuple[int | float | str, int | float | str]:
    """
    Extract the volume and remaining stock A details from an email text.

    This function employs regular expressions to parse the given email text and
    extract details about volume and remaining stock A.
    The values extracted are then cleaned and returned.

    Parameters
    ----------
    email : str
        Text from the email to parse.

    Returns
    -------
    volume : int | float | str
        Volume extracted from the email.
        Returns 'Volume not found' if no volume details are found.
    remaining_stock_a : int | float | str
        Remaining stock A extracted from the email.
        Returns 'Remaining stock A not found' if no stock A details are found.

    Raises
    ------
    re.error
        If a non-valid regular expression is used.

    See Also
    --------
    re.search : The method used for extracting volume and remaining stock details.

    Examples
    --------
    >>> email_text = "The volume was 5000 TEUs. Stock A: 1000 units."
    >>> extract_stock_volume_from_email(email_text)
    (5000, 1000)
    >>> email_text = "No volume and stock data available."
    >>> extract_stock_volume_from_email(email_text)
    ('Volume not found', 'Remaining stock A not found')
    """
    # Extract the volume
    volume = re.search(
        r'(?:volume:|volume was|TEUs\.|TEUs |TEU |$)\s(\d+|,)+.*?|(\d+|,)+.(?:\sTEUs|\sTEU)',
        email, re.I
    )
    if volume:
        volume = get_number(volume[0].strip())
    if not volume:
        volume = 'Volume not found'
    # Extract the remaining stock
    remaining_stock_a = re.search(r'(?:stock A:|stock A: |$)(\d+|,)+.*?', email, re.I)
    if remaining_stock_a:
        remaining_stock_a = remaining_stock_a[0].strip()
    if not remaining_stock_a:
        remaining_stock_a = re.search(r'(\d+)(.+)(stock A)', email, re.I)
        if remaining_stock_a:
            remaining_stock_a = remaining_stock_a[0].strip()
    if remaining_stock_a:
        remaining_stock_a = get_number(remaining_stock_a)
    if not remaining_stock_a:
        remaining_stock_a = 'Remaining stock A not found'
    return volume, remaining_stock_a


def extract_stock_volume_from_emails(
    emails: List[str],
) -> List[Tuple[int | float | str, int | float | str]]:
    """
    Apply the function `extract_stock_volume_from_email` to a list of emails.

    Parameters
    ----------
    emails : List[str]
        A list of email texts to be parsed.

    Returns
    -------
    List[Tuple[int | float | str, int | float | str]]
        A list of tuples. Each tuple contains the volume and remaining stock A
        extracted from each email. If no volume or stock A details could be
        extracted from an email, the corresponding element in the tuple will be
        'Volume not found' or 'Remaining stock A not found', respectively.

    Raises
    ------
    re.error
        If a non-valid regular expression is used in `extract_stock_volume_from_email`.

    See Also
    --------
    extract_stock_volume_from_email : The function used to extract details from each email.

    Examples
    --------
    >>> email_texts = [
    ...     "The volume was 5000 TEUs. Stock A: 1000 units.",
    ...     "No volume and stock data available.",
    ... ]
    >>> extract_stock_volume_from_emails(email_texts)
    [(5000, 1000), ('Volume not found', 'Remaining stock A not found')]
    """
    return list(map(extract_stock_volume_from_email, emails))
```

Using the above code on the e-mails you provided as example:

emails = [
    r"""Dear all,
Please note the Total selling volume and total remaining stock
Total selling volume: 45677 Total remaining stock A:3456
Remain at your disposal in case of any doubt or comments.
Best Regards,""",
    r"""Dear all,
Please see the data as below:
Tol volume: 1,231,245 No. of remaining stock A: 232 No. of remaining stock B: 1,435""",
    r"""Dear All,
Please find our volume was 233,435
Total remaining stock A: 2453""",
    r"In May we had 90 remaining stock A and 4190 TEUs.",
]
extract_stock_volume_from_emails(emails)
# Returns:
#
# [(45677, 3456), (1231245, 232), (233435, 2453), (4190, 90)]
#  ^----^  ^--^
#  |       |
#  |       +-- Remaining stock A
#  +-- Volume
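
If the figures are meant to end up in a table (the question imports pandas), the returned tuples can be written out as rows. This is only a minimal sketch using the standard library `csv` module; the column names and the in-memory buffer are assumptions made for illustration, not part of the original answer.

```python
import csv
import io

# Result list as returned for the four sample emails above.
results = [(45677, 3456), (1231245, 232), (233435, 2453), (4190, 90)]

buffer = io.StringIO()  # stand-in for a real file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["volume", "remaining_stock_a"])  # assumed column names
writer.writerows(results)

print(buffer.getvalue())
```

Swapping the buffer for `open("figures.csv", "w", newline="")` would write the same rows to disk.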

Note

Note that the function extract_stock_volume_from_email, which parses each e-mail, is not foolproof. Its RegEx patterns are all based on the e-mails you provided as examples. If other e-mails don't follow the same patterns, additional patterns will have to be added to the function.
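
One way to keep such additions manageable, sketched here under assumptions, is to hold the patterns in a dictionary keyed by field name, so that a new wording only needs a new list entry. The names `NUMBER`, `FIELD_PATTERNS`, and `extract_fields` are hypothetical, and the patterns cover only the four sample e-mails.

```python
import re

# Hypothetical names; the patterns are based only on the four sample e-mails.
NUMBER = r"(\d[\d,]*)"  # digits with optional thousands separators

FIELD_PATTERNS = {
    "volume": [
        r"volume(?::|\s+was)\s*" + NUMBER,   # "volume: 45677", "volume was 233,435"
        NUMBER + r"\s+TEUs?\b",              # "4190 TEUs"
    ],
    "remaining_stock_a": [
        r"stock A:\s*" + NUMBER,             # "stock A:3456", "stock A: 232"
        NUMBER + r"\s+remaining stock A\b",  # "90 remaining stock A"
    ],
}


def extract_fields(email):
    """Return {field: int or None}, trying each pattern in order."""
    result = {}
    for field, patterns in FIELD_PATTERNS.items():
        result[field] = None
        for pattern in patterns:
            match = re.search(pattern, email, re.IGNORECASE)
            if match:
                # Drop thousands separators before converting.
                result[field] = int(match.group(1).replace(",", ""))
                break
    return result


print(extract_fields("In May we had 90 remaining stock A and 4190 TEUs."))
# → {'volume': 4190, 'remaining_stock_a': 90}
```

Returning `None` for a missing field, rather than a message string, keeps the values uniformly numeric-or-missing, which tends to be easier to filter later.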

Answer 2

Score: 0


import re

email_content = """Dear all,Please note the Total selling volume 
and total remaining stock Total selling volume: 45677 Total 
remaining stock A:3456 Remain at your disposal in case of any 
doubt or comments.
Best Regards,
"""

# Regular expression pattern to match the numbers and their context.
# \s+ between words lets the pattern match across the line breaks above.
number_pattern = (
    r'Total selling volume:\s*(\d+(?:[.,]\d+)*)\s+'
    r'Total\s+remaining stock A:\s*(\d+(?:[.,]\d+)*)'
)

# Extract the numbers and their context using the regular expression
matches = re.findall(number_pattern, email_content)
for match in matches:
    total_selling_volume = match[0]
    total_remaining_stock = match[1]
    print("Total selling volume:", total_selling_volume)
    print("Total remaining stock:", total_remaining_stock)

# Output
# Total selling volume: 45677
# Total remaining stock: 3456
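
`re.findall` returns the captured groups as strings, so the figures need converting before they can be summed or compared. A minimal sketch, assuming commas are thousands separators as in the sample emails; `to_int` is a hypothetical helper name, not part of the answer above.

```python
def to_int(figure: str) -> int:
    """Convert a captured figure such as '1,231,245' to an int.

    Assumes commas are thousands separators, as in the sample emails.
    """
    return int(figure.replace(",", ""))


print(to_int("45677"))      # 45677
print(to_int("1,231,245"))  # 1231245
```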
