Regex to find email addresses within HTML when web scraping
Question
I am trying to extract phone numbers, addresses, and email addresses from a couple of corporate websites through web scraping.
My code is as follows:
import re

import requests
from bs4 import BeautifulSoup

l = 'https://www.zimmermanfinancialgroup.com/about'
address_t = []
phone_num_t = []
# make a request to the link
response = requests.get(l)
soup = BeautifulSoup(response.content, "html.parser")
phone_regex = "(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}"
# extract the phone number information (searches text nodes only)
match = soup.findAll(string=re.compile(phone_regex))
if match:
    print("Found the matching string:", match)
else:
    print("Matching string not found")
# extract email address information (searches text nodes only)
mail = "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b"
match_a = soup.findAll(string=re.compile(mail))
match_a
The code above extracts the phone number correctly, but it does not detect the email address. I see the same issue with another website (https://www.benefitexperts.com/about-us/).
Answer 1
Score: 1
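The email address you are looking for is located, if it exists, in the href attribute of an <a> tag, as a string such as 'mailto:somemail@address.com'. Searching with string=... only looks at text nodes, so it never sees attribute values.
So you just need to pass href as a keyword argument to find_all (findAll is its older alias). It will then match every node that has an href attribute whose value matches the regular expression.
Read more about keyword arguments in the official Beautiful Soup docs:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#the-keyword-arguments
In short:
match_a = soup.find_all(href=re.compile(mail))
Note that mail should be written as a raw string (r"\b...\b"). In a normal string literal, "\b" is a backspace character rather than a word boundary, so the pattern will not match.
Then do some cleanup to extract just the email address. Use replace rather than strip('mailto:'), because strip removes any of those characters from both ends instead of removing the prefix:
match_a = [a['href'].replace('mailto:', '', 1) for a in match_a]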
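The regex and cleanup steps are easy to get subtly wrong, so here is a rough, standard-library-only sketch of them. The helper name emails_from_hrefs and the sample href values are made up for illustration. In real code the hrefs would come from a['href'] on the tags that Beautiful Soup returns. It assumes one address per href and reuses the question's pattern, written as a raw string and with the stray | dropped from the final character class.

```python
import re

# Raw string, so \b reaches the regex engine as a word boundary.
# In a normal string literal, "\b" is a backspace character (\x08).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b")


def emails_from_hrefs(hrefs):
    """Return the first email address found in each href, skipping hrefs without one."""
    found = []
    for href in hrefs:
        m = EMAIL_RE.search(href)
        if m:
            found.append(m.group(0))
    return found


# Hypothetical href values, as they might come from a["href"] on matched tags.
sample_hrefs = [
    "mailto:tom@mail.com",
    "mailto:info@example.org?subject=Hello",
    "https://example.org/contact",
]

print(emails_from_hrefs(sample_hrefs))  # ['tom@mail.com', 'info@example.org']

# str.strip treats its argument as a set of characters, not a prefix,
# so it also eats the leading "tom" and the trailing "om" here.
print("mailto:tom@mail.com".strip("mailto:"))  # @mail.c
```

Pulling the address out with the regex itself, instead of trimming the 'mailto:' prefix, also drops any query string such as ?subject=... that follows the address.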