Regex code to find email addresses in HTML when web scraping

Question

I am trying to extract phone numbers, addresses, and email addresses from a couple of corporate websites through web scraping.

My code is as follows:

import re

import requests
from bs4 import BeautifulSoup

l = 'https://www.zimmermanfinancialgroup.com/about'
address_t = []
phone_num_t = []

# make a request to the link
response = requests.get(l)

soup = BeautifulSoup(response.content, "html.parser")

phone_regex = "(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}"
# extract the phone number information
match = soup.findAll(string=re.compile(phone_regex))

if match:
    print("Found the matching string:", match)
else:
    print("Matching string not found")

# extract email address information
mail = "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b"

match_a = soup.findAll(string=re.compile(mail))

match_a

The code above extracts the phone number correctly, but it does not detect the email address. I have the same issue with another website (https://www.benefitexperts.com/about-us/).

Answer 1

Score: 1

The email address you are looking for is located, if it exists, in the href attribute of an <a> tag, as a string such as 'mailto:somemail@address.com'. Searching with string= only looks at text nodes, so it never sees attribute values.
So you just need to pass href as a keyword argument to findAll; it will match every node that has an href attribute whose value matches the regular expression.

You can read more about keyword arguments in the official Beautiful Soup documentation:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#the-keyword-arguments

Or simply, with mail defined as a raw string (r"...") so that \b is treated as a word boundary:

match_a = soup.findAll(href=re.compile(mail))
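
One thing that can make this call come back empty with the pattern exactly as written in the question: in a normal (non-raw) Python string, "\b" becomes a backspace character before the regex engine ever sees it. The character class [A-Z|a-z] also accepts a literal |, which is probably not intended. Below is a minimal sketch using only the standard re module; the href value and the variable names are made up for illustration.

```python
import re

# Hypothetical href value, as it might appear on an <a> tag.
href = "mailto:info@example.com"

# Non-raw string: "\b" is the backspace character (\x08), not a word boundary.
# ("\\." is written with a doubled backslash only to avoid an escape warning.)
non_raw = "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,7}\b"

# Raw string: "\b" reaches the regex engine as a word boundary.
raw = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b"

non_raw_match = re.search(non_raw, href)
raw_match = re.search(raw, href)

print(non_raw_match)       # None
print(raw_match.group(0))  # info@example.com
```

The non-raw pattern only matches text that actually contains backspace characters, which ordinary page content and href values do not.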

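You can then do some cleanup to extract just the email address. Note that strip('mailto:') would remove any of the characters m, a, i, l, t, o and : from both ends of the string, so remove only the prefix instead:

match_a = [a['href'].replace('mailto:', '', 1) for a in match_a]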
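
Putting the pieces together, here is a minimal, self-contained sketch of the approach described in this answer. The HTML snippet, the addresses, and the helper name extract_emails are made up for illustration; a real page would come from requests.get(...) as in the question. It uses find_all (findAll is the older alias for the same method) and takes the matched address itself rather than calling str.strip('mailto:'), which removes a set of characters from both ends instead of a prefix.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML standing in for response.content from the question.
SAMPLE_HTML = """
<div>
  <a href="mailto:tom@mail.com">Email Tom</a>
  <a href="mailto:info@example.com?subject=Hello">Contact us</a>
  <a href="/about">About</a>
</div>
"""

# Raw string so that \b is a word boundary, not a backspace character.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b")


def extract_emails(html):
    """Return email addresses found in href attributes (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    emails = []
    # href=<compiled regex> keeps only tags whose href contains a match.
    for tag in soup.find_all(href=EMAIL_RE):
        # Keep the matched address itself, which also drops "mailto:"
        # and any "?subject=..." suffix.
        emails.append(EMAIL_RE.search(tag["href"]).group(0))
    return emails


emails = extract_emails(SAMPLE_HTML)
print(emails)  # ['tom@mail.com', 'info@example.com']

# Why not strip(): it removes a set of characters from both ends.
print("mailto:tom@mail.com".strip("mailto:"))  # @mail.c
```

Addresses that appear only as visible text, not in a mailto: link, would still need the string= search from the question, using the same raw-string pattern.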

huangapple
  • Posted on April 20, 2023, 04:15:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76058487.html