Python - extract information from email

Question

I am new to Python. Below are some sample emails I received.

Email sample 1

Dear all,

Please note the Total selling volume and total remaining stock

Total selling volume: 45677
Total remaining stock A:3456

Remain at your disposal in case of any doubt or comments.

Best Regards,

Email sample 2

Dear all,

Please see the data as below:

Tol volume: 1,231,245
No. of remaining stock A: 232
No. of remaining stock B: 1,435

Email sample 3

Dear All,

Please find our volume was 233,435

Total remaining stock A: 2453

Email sample 4

In May we had 90 remaining stock A and 4190 TEUs.

I would like to extract the volume and total remaining stock figures from these emails. Any hints on how to get those figures using Python?

I have prepared the code below to extract the figures from the emails. However, I cannot tell which figure is the total selling volume and which is the total remaining stock.

import re
import pandas as pd
import win32com.client
from datetime import datetime, timedelta

# Connect to Outlook and open a subfolder of the Inbox (6 = olFolderInbox)
outlook = win32com.client.Dispatch('outlook.application')
mapi = outlook.GetNamespace("MAPI")
inbox = mapi.GetDefaultFolder(6).Folders.Item("AI email testing")
messages = inbox.Items

# Cutoff timestamp (not used below yet)
received_dt = datetime.now() - timedelta(days=1)
received_dt = received_dt.strftime('%m/%d/%Y %I:%M %p')

for message in list(messages):
    body_content = message.Body
    # Keep only the text from "Subject:" onwards, if that marker is present
    start = body_content.find("Subject:")
    if start != -1:
        body_content = body_content[start:]
    # Raw string so the regex escapes are not treated as string escapes
    figures = re.findall(r"\d+(?:,\d+)*(?:\.\d+)?", body_content)
    print(figures)

Answer 1

Score: 0

Here's a solution using RegEx:

```python
from __future__ import annotations

import re
from typing import List, Tuple


def get_number(text: str) -> float | int | str:
    """
    Extract the first numeric value from the input string.

    The function uses a regular expression to extract the first numeric
    occurrence from `text`. If no numeric value is found, the original string
    is returned. Commas are removed from the extracted number, if any.
    The function first attempts to convert the number to an integer,
    and if that fails, it converts it to a float.

    Parameters
    ----------
    text : str
        The string from which the numeric value should be extracted.

    Returns
    -------
    float | int | str
        The first numeric value in `text` converted to int or float,
        or original `text` if no numeric value is found.

    Examples
    --------
    Illustration of the function usage and behavior.

    >>> get_number("Hello world 123!")
    123
    >>> get_number("I have 2,200 dollars.")
    2200
    >>> get_number("No numbers here.")
    Found no numbers inside text: 'No numbers here.'
    'No numbers here.'
    >>> get_number("It is over 9000!")
    9000
    >>> get_number("The value of pi is about 3.14159.")
    3.14159
    >>> get_number("Total: 123,456,789.")
    123456789
    """
    # Digits with optional thousands separators and an optional decimal part
    number = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if not number:
        print(f"Found no numbers inside text: {text!r}")
        return text
    number = number[0].replace(',', '')
    try:
        return int(number)
    except ValueError:
        return float(number)


def extract_stock_volume_from_email(email: str) -> Tuple[int | float | str, int | float | str]:
    """
    Extract the volume and remaining stock A details from an email text.

    This function employs regular expressions to parse the given email text and
    extract details about volume and remaining stock A.
    The values extracted are then cleaned and returned.

    Parameters
    ----------
    email : str
        Text from the email to parse.

    Returns
    -------
    volume : int | float | str
        Volume extracted from the email.
        Returns 'Volume not found' if no volume details are found.
    remaining_stock_a : int | float | str
        Remaining stock A extracted from the email.
        Returns 'Remaining stock A not found' if no stock A details are found.

    Raises
    ------
    re.error
        If a non-valid regular expression is used.

    See Also
    --------
    re.search : The method used for extracting volume and remaining stock details.

    Examples
    --------
    >>> email_text = "The volume was 5000 TEUs. Stock A: 1000 units."
    >>> extract_stock_volume_from_email(email_text)
    (5000, 1000)
    >>> email_text = "No volume and stock data available."
    >>> extract_stock_volume_from_email(email_text)
    ('Volume not found', 'Remaining stock A not found')
    """
    # Extract the volume: "<label> <number>" or "<number> TEU(s)"
    volume_match = re.search(
        r'(?:volume:|volume was|TEUs\.|TEUs |TEU |$)\s[\d,]+|[\d,]+.(?:\sTEUs|\sTEU)',
        email, re.I
    )
    volume = get_number(volume_match[0]) if volume_match else 'Volume not found'

    # Extract the remaining stock A: first "stock A: <number>",
    # then fall back to "<number> ... stock A"
    stock_match = (
        re.search(r'stock A: ?[\d,]+', email, re.I)
        or re.search(r'(\d+)(.+)(stock A)', email, re.I)
    )
    remaining_stock_a = (
        get_number(stock_match[0]) if stock_match else 'Remaining stock A not found'
    )
    return volume, remaining_stock_a


def extract_stock_volume_from_emails(
    emails: List[str],
) -> List[Tuple[int | float | str, int | float | str]]:
    """
    Apply the function `extract_stock_volume_from_email` to a list of emails.

    Parameters
    ----------
    emails : List[str]
        A list of email texts to be parsed.

    Returns
    -------
    List[Tuple[int | float | str, int | float | str]]
        A list of tuples. Each tuple contains the volume and remaining stock A
        extracted from each email. If no volume or stock A details could be
        extracted from an email, the corresponding element in the tuple will be
        'Volume not found' or 'Remaining stock A not found', respectively.

    Raises
    ------
    re.error
        If a non-valid regular expression is used in `extract_stock_volume_from_email`.

    See Also
    --------
    extract_stock_volume_from_email : The function used to extract details from each email.

    Examples
    --------
    >>> email_texts = [
    ...     "The volume was 5000 TEUs. Stock A: 1000 units.",
    ...     "No volume and stock data available.",
    ... ]
    >>> extract_stock_volume_from_emails(email_texts)
    [(5000, 1000), ('Volume not found', 'Remaining stock A not found')]
    """
    return list(map(extract_stock_volume_from_email, emails))
```

Using the above code on the e-mails you provided as example:

emails = [
    r"""Dear all,

Please note the Total selling volume and total remaining stock

Total selling volume: 45677 Total remaining stock A:3456

Remain at your disposal in case of any doubt or comments.

Best Regards,""",
    r"""Dear all,

Please see the data as below:

Tol volume: 1,231,245 No. of remaining stock A: 232 No. of remaining stock B: 1,435""",
    r"""Dear All,

Please find our volume was 233,435

Total remaining stock A: 2453""",
    r"In May we had 90 remaining stock A and 4190 TEUs.",
]
extract_stock_volume_from_emails(emails)
# Returns:
#
# [(45677, 3456), (1231245, 232), (233435, 2453), (4190, 90)]
#  ^----^  ^--^
#  |       |
#  |       +-- Remaining stock A
#  +-- Volume

Note

Note that the function extract_stock_volume_from_email, which parses each e-mail, is not foolproof. Its RegEx patterns are all based on the example e-mails you provided. If other e-mails don't follow the same patterns, additional patterns will have to be added to the function.
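
One way to keep such additional patterns manageable is sketched below, assuming each field can be described by an ordered list of regular expressions: the first pattern that matches wins, and supporting a new e-mail layout means appending one line to the table. `NUMBER`, `FIELD_PATTERNS`, and `extract_fields` are hypothetical names introduced for illustration; they are not part of the code above.

```python
import re
from typing import Dict, List, Optional

# Assumed number format: digits with optional thousands separators.
NUMBER = r"(\d[\d,]*)"

# Hypothetical table: one ordered list of patterns per field.
# Every pattern captures the number in group 1.
FIELD_PATTERNS: Dict[str, List[str]] = {
    "volume": [
        r"volume:\s*" + NUMBER,              # "Total selling volume: 45677"
        r"volume was\s+" + NUMBER,           # "our volume was 233,435"
        NUMBER + r"\s+TEUs?\b",              # "4190 TEUs"
    ],
    "remaining_stock_a": [
        r"stock A:\s*" + NUMBER,             # "remaining stock A:3456"
        NUMBER + r"\s+remaining stock A\b",  # "90 remaining stock A"
    ],
}


def extract_fields(email: str) -> Dict[str, Optional[int]]:
    """Return the first match per field as an int, or None if nothing matched."""
    result: Dict[str, Optional[int]] = {}
    for field, patterns in FIELD_PATTERNS.items():
        result[field] = None
        for pattern in patterns:
            match = re.search(pattern, email, re.IGNORECASE)
            if match:
                # Drop thousands separators before converting
                result[field] = int(match.group(1).replace(",", ""))
                break
    return result


print(extract_fields("In May we had 90 remaining stock A and 4190 TEUs."))
# prints {'volume': 4190, 'remaining_stock_a': 90}
```

Returning None instead of a message string keeps "not found" distinguishable from a real value when the results are processed further.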

Answer 2

Score: 0

import re

email_content = """Dear all,
Please note the Total selling volume and total remaining stock
Total selling volume: 45677 Total remaining stock A:3456
Remain at your disposal in case of any doubt or comments.
Best Regards,
"""

# Regular expression pattern to match the numbers and their context.
# Note: \d+[.,]?\d+ needs at least two digits per number.
number_pattern = r'Total selling volume: (\d+[.,]?\d+)\s+Total remaining stock A:(\d+[.,]?\d+)'

# Extract the numbers and their context using the regular expression
matches = re.findall(number_pattern, email_content)
for match in matches:
    total_selling_volume = match[0]
    total_remaining_stock = match[1]
    print("Total selling volume:", total_selling_volume)
    print("Total remaining stock:", total_remaining_stock)

# Output
# Total selling volume: 45677
# Total remaining stock: 3456
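
The pattern above only matches when both labels appear exactly as written, separated by plain whitespace, and it returns the figures as strings. As a hedged sketch of the same idea (the names `PATTERN` and `parse_totals` are made up for this example), named groups plus `\s+` between the words tolerate wrapped lines and thousands separators such as 45,677, and the figures come back as integers:

```python
import re

# Assumed variant of the pattern above: named groups, optional thousands
# separators, and \s+ between words so wrapped lines still match.
PATTERN = re.compile(
    r"Total\s+selling\s+volume:\s*(?P<volume>\d[\d,]*)"
    r".*?"
    r"Total\s+remaining\s+stock\s+A:\s*(?P<stock_a>\d[\d,]*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_totals(text):
    """Return (volume, stock_a) as ints, or None if the labels are absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Remove thousands separators before converting each group
    return tuple(int(match[name].replace(",", "")) for name in ("volume", "stock_a"))


sample = "Total selling volume: 45,677 Total\nremaining stock A:3456"
print(parse_totals(sample))
# prints (45677, 3456)
```

Like the original pattern, this only covers e-mails that use these exact labels; a message such as "Tol volume: 1,231,245" yields None.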

huangapple
  • Posted on June 13, 2023, 11:32:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76461534.html