2023-04-20 04:15:36

Regex to find an email address within HTML when web scraping

Question

I am trying to extract phone numbers, addresses, and email addresses from a couple of corporate websites through web scraping.

My code is as follows:

import re

import requests
from bs4 import BeautifulSoup

l = 'https://www.zimmermanfinancialgroup.com/about'
address_t = []
phone_num_t = []
# make a request to the link
response = requests.get(l)
soup = BeautifulSoup(response.content, "html.parser")
# raw strings keep \b and \d as regex escapes (in a plain string, "\b" is a backspace)
phone_regex = r"(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}"
# extract the phone number information
match = soup.find_all(string=re.compile(phone_regex))
if match:
    print("Found the matching string:", match)
else:
    print("Matching string not found")
# extract email address information
mail = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b"
match_a = soup.find_all(string=re.compile(mail))
match_a

The above code extracts phone numbers correctly, but it does not detect email addresses. I see the same issue with another website (https://www.benefitexperts.com/about-us/).

Answer 1

Score: 1

The email address you are looking for is located (if it exists) in the href attribute of an <a> tag, as a string such as 'mailto:somemail@address.com'.
So you just need to pass href as a keyword argument to find_all, which returns every node that has an href attribute whose value matches the regular expression.

You can read more about keyword arguments in the official Beautiful Soup docs:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#the-keyword-arguments

In short:

match_a = soup.find_all(href=re.compile(mail))
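
For illustration, here is a minimal self-contained sketch of the href lookup. It runs against an inline HTML snippet rather than a live site. The sample markup, the EMAIL_RE name, and the find_mail_hrefs helper are assumptions made for this example, not part of the original code.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<div>
  <a href="/about">About us</a>
  <a href="mailto:info@example.com">Email us</a>
  <a href="mailto:sales@example.org?subject=Hello">Sales</a>
  <p>Call (555) 123-4567</p>
</div>
"""

# Raw string, so \b is a regex word boundary rather than a backspace character.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b")


def find_mail_hrefs(html):
    """Return the href values of all tags whose href contains an email address."""
    soup = BeautifulSoup(html, "html.parser")
    # A keyword argument filters on that attribute; a compiled regex is
    # matched against the attribute value with search().
    return [tag["href"] for tag in soup.find_all(href=EMAIL_RE)]


print(find_mail_hrefs(SAMPLE_HTML))
```

The /about link is skipped because its href contains no email address. The returned values still carry the mailto: prefix.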

Then do some cleanup to extract just the email address:

# remove the prefix explicitly; str.strip('mailto:') strips those characters from both ends
match_a = [re.sub(r'^mailto:', '', a['href']) for a in match_a]
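
One caveat about the cleanup step: str.strip takes a set of characters, not a prefix, so it can also remove letters that belong to the address. The sketch below uses only the standard library and shows the difference. The strip_mailto helper is a hypothetical name, and dropping any ?subject=... query part is an extra assumption about what the cleaned value should look like.

```python
import re


def strip_mailto(href):
    """Remove a leading 'mailto:' scheme and any '?subject=...' query part."""
    address = re.sub(r"^mailto:", "", href, flags=re.IGNORECASE)
    return address.split("?", 1)[0]


href = "mailto:tom@example.com"

# strip() removes any of the characters m, a, i, l, t, o, ':' from both ends,
# so it also removes "tom" at the start and "om" at the end.
print(href.strip("mailto:"))  # -> @example.c

# Removing only the prefix keeps the address intact.
print(strip_mailto(href))     # -> tom@example.com
```

On Python 3.9 or later, href.removeprefix('mailto:') is another way to remove only the prefix.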
