July 13, 2023, 18:10:21

Split text at specific character in BeautifulSoup

Question

I am brand new to Python and BeautifulSoup so please forgive the lack of proper vocabulary in my question.

I am trying to extract the list from this webpage: http://spajournalism.com/membership/ - I want all the publications that are associated with a specific university. I'd like to end up with a list of dictionaries like:
[{publication_url: url1, publication_name: name1, uni: uni1}, {publication_url: url2, publication_name: name2, uni: uni2}]

Unfortunately, the content on the webpage is quite messy HTML-wise, and it's proving tricky. My code is currently:

import lxml.etree
import requests
from bs4 import BeautifulSoup

url = "http://spajournalism.com/membership/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")

section = soup.find("div", "entry-content clearfix")
links = section.find_all("a")

#list = []

#for link in links:
#    publication = {
#        "Link" : link.get("href"),
#        "Publication" : link.parent.text
#    }

for link in links:
    print("Link: ", link.get("href"), "Text: ", link.parent.text)

This returns a list of the following nature:

Link:  http://www.swanseastudentmedia.com/waterfront/ Text:  The Waterfront – Swansea University
Link:  https://www.seren.bangor.ac.uk/ Text:  Y Seren – Bangor University
...etc

I would like to, instead of getting all the text in one go with link.parent.text, split it at the hyphen ( – ), and get something more like:

Link:  http://www.swanseastudentmedia.com/waterfront/ Text:  The Waterfront University: Swansea University
Link:  https://www.seren.bangor.ac.uk/ Text:  Y Seren University: Bangor University
...etc

I have tried something like the following:

for link in links:
    text = link.parent.text
    linktext = link.string
    text.replace(linktext, " ") # Replace the redundant link text with nothing

    print("Link: ", link.get("href"), "Publication: ", linktext, "University: ", text)

But replacing the redundant text with nothing doesn't seem to work, because what I get is:

Link:  http://www.swanseastudentmedia.com/waterfront/ Publication:  The Waterfront University:  The Waterfront – Swansea University
Link:  https://www.seren.bangor.ac.uk/ Publication:  Y Seren University:  Y Seren – Bangor University
...etc

Is there a way of doing this? Any searches I do are full of results referring to something called Dash which isn't relevant to me. Thanks

Answer 1

Score: 1

replace() does not change the text variable. Python strings are immutable, so it returns a new string, and that result has to be assigned back.

Change

text.replace(linktext, " ")

to

text = text.replace(linktext, " ").split("–", 1)[-1].strip()
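To show how that one-liner could feed the list of dictionaries asked for in the question, here is a minimal sketch. It works on plain strings so it runs without fetching the page. The helper name split_publication and the sample rows are assumptions for illustration (the rows are copied from the output shown in the question), and the dictionary keys follow the ones the question proposes.

```python
def split_publication(parent_text, link_text, sep="–"):
    """Return (publication_name, university) from text like 'Name – University'.

    `sep` defaults to the en dash (U+2013) seen in the question's output,
    which is a different character from the ASCII hyphen '-'.
    """
    # str.replace returns a new string, so the result must be kept.
    remainder = parent_text.replace(link_text, "", 1)
    # Keep whatever follows the first separator; if there is none,
    # split() returns the whole remainder as its only element.
    university = remainder.split(sep, 1)[-1].strip()
    return link_text.strip(), university


# Sample rows copied from the output shown in the question:
# (href, link text, full text of the link's parent element).
rows = [
    ("http://www.swanseastudentmedia.com/waterfront/", "The Waterfront",
     "The Waterfront – Swansea University"),
    ("https://www.seren.bangor.ac.uk/", "Y Seren",
     "Y Seren – Bangor University"),
]

publications = []
for href, link_text, parent_text in rows:
    name, uni = split_publication(parent_text, link_text)
    publications.append(
        {"publication_url": href, "publication_name": name, "uni": uni}
    )

for publication in publications:
    print(publication)
```

In the scraping loop from the question, the same helper could be called as split_publication(link.parent.text, link.get_text()). Using get_text() rather than link.string is a suggestion, not part of the original answer: link.string is None when a tag has more than one child, which would make replace() raise a TypeError.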
