Split text at specific character in BeautifulSoup
Question
I am brand new to Python and BeautifulSoup so please forgive the lack of proper vocabulary in my question.
I am trying to extract the list from this webpage: http://spajournalism.com/membership/ - I want all the publications that are associated with a specific university. I'd like to end up with a list of dictionaries like:
[{publication_url: url1, publication_name: name1, uni: uni1}, {publication_url: url2, publication_name: name2, uni: uni2}]
Unfortunately the content on the webpage is quite messy, HTML-wise, and it's proving tricky. My code is currently:
import lxml.etree
import requests
from bs4 import BeautifulSoup
url = "http://spajournalism.com/membership/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
section = soup.find("div", "entry-content clearfix")
links = section.find_all("a")
#list = []
#for link in links:
# publication = {
# "Link" : link.get("href"),
# "Publication" : link.parent.text
# }
for link in links:
    print("Link: ", link.get("href"), "Text: ", link.parent.text)
This returns a list of the following nature:
Link: http://www.swanseastudentmedia.com/waterfront/ Text: The Waterfront – Swansea University
Link: https://www.seren.bangor.ac.uk/ Text: Y Seren – Bangor University
...etc
I would like to, instead of getting all the text in one go with link.parent.text, split it at the hyphen ( – ), and get something more like:
Link: http://www.swanseastudentmedia.com/waterfront/ Text: The Waterfront University: Swansea University
Link: https://www.seren.bangor.ac.uk/ Text: Y Seren University: Bangor University
...etc
I have tried something like the following:
for link in links:
    text = link.parent.text
    linktext = link.string
    text.replace(linktext, " ") # Replace the redundant link text with nothing
    print("Link: ", link.get("href"), "Publication: ", linktext, "University: ", text)
But replacing the redundant text with nothing doesn't seem to work, because what I get is:
Link: http://www.swanseastudentmedia.com/waterfront/ Publication: The Waterfront University: The Waterfront – Swansea University
Link: https://www.seren.bangor.ac.uk/ Publication: Y Seren University: Y Seren – Bangor University
...etc
Is there a way of doing this? Any searches I do are full of results referring to something called Dash which isn't relevant to me. Thanks
Answer 1
Score: 1
The replace() method doesn't change the text variable. Python strings are immutable, so it returns a new string, which you have to assign back.
Change
text.replace(linktext, " ")
To
text = text.replace(linktext, " ").split("–", 1)[-1].strip()
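To get from there to the list of dictionaries the question asks for, something like the following might work. It is only a sketch: parse_entry is a hypothetical helper name, the sample rows are copied from the output shown in the question rather than fetched from the site, and it assumes every entry separates the publication and the university with an en dash. It uses str.partition instead of replace plus split, which is a swapped-in alternative and not what the answer above does.

```python
EN_DASH = "\u2013"  # the separator on the page is an en dash, not an ASCII hyphen


def parse_entry(href, link_text, parent_text):
    """Hypothetical helper: build one dictionary from a link and its parent's text."""
    # partition() splits at the first en dash only and always returns three parts.
    _name, sep, uni = parent_text.partition(EN_DASH)
    return {
        "publication_url": href,
        "publication_name": link_text.strip(),
        # If no en dash was found, no university can be read from the text.
        "uni": uni.strip() if sep else None,
    }


# Sample rows copied from the output shown in the question (not live data).
rows = [
    ("http://www.swanseastudentmedia.com/waterfront/",
     "The Waterfront", "The Waterfront – Swansea University"),
    ("https://www.seren.bangor.ac.uk/",
     "Y Seren", "Y Seren – Bangor University"),
]

publications = [parse_entry(*row) for row in rows]
print(publications)
```

Inside the scraping loop this would presumably be called as parse_entry(link.get("href"), link.get_text(), link.parent.get_text()). get_text() is used there instead of link.string because link.string is None when the link contains nested tags.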