Using Python's Beautiful Soup to get only the titles of links on a webpage

Question

I'm new to this, but I've written some code to scrape a list of links from a webpage.

Here is the code I have:

```python
import requests
from bs4 import BeautifulSoup

page_to_scrape = requests.get('http://lungtung.com/nhacvang/pub/tapesbyletr.asp?strLTR=T&page=15')

soup = BeautifulSoup(page_to_scrape.text, "html.parser")
title = soup.findAll("div", attrs={"class": "subsection"})

for x in zip(title):
    print(x)

x.get_text()
```

The result it gives me is:

```
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=819">Trường Sơn (đĩa nhựa): TS-000168-1</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=39">Trường Sơn 1: Hát Giữa Quê Hương</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=40">Trường Sơn 2: Quê Hương và Tuổi Trẻ</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=41">Trường Sơn 3: Quê Hương và Người Tình</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=42">Trường Sơn 4: Hôm Nay, Ngày Mai</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=43">Trường Sơn 5: Tình Trong Khói Lửa</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=44">Trường Sơn 6: Quê Hương và Tuổi Loạn</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=45">Trường Sơn 7: Quê Hương, Mùa Trăng, Mùa Thu</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=46">Trường Sơn 8: Băng Nhạc Trường Sơn</a></div>,)
(<div class="subsection"><a href="tapes_d.asp?FrTapeID=175">Trần Ngọc Đức: Băng Vàng - Bóng Tình Yêu</a></div>,)
```

This makes me happy because I know I'm getting somewhere, but I want it to print only the names of the links at the end of each line (truong son 1, truong son 2, etc.). How would I go about that? I feel like I need to use a different function in the Beautiful Soup library, but I don't know which one.




# Answer 1
**Score**: 0

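You can use `select` with a CSS selector. The names are inside `div.subsection`, and in this code I append all of them to one list. You can then convert that list to a DataFrame or something else.

```python
import requests
from bs4 import BeautifulSoup

page_to_scrape = requests.get('http://lungtung.com/nhacvang/pub/tapesbyletr.asp?strLTR=T&page=15')

soup = BeautifulSoup(page_to_scrape.text, "html.parser")
title = soup.select('div.subsection')

# Collect the text of every matching div (the link title, without the tags)
data = []
for x in title:
    data.append(x.text)

print(data)
```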
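The question asks for only the short names (truong son 1, truong son 2, and so on), while `x.text` returns the full link title. A minimal sketch of trimming each title, assuming every title follows the `name: subtitle` pattern seen in the question's output; `short_name` is a hypothetical helper written for illustration and is not part of Beautiful Soup:

```python
def short_name(title: str) -> str:
    """Return the part of a title before the first colon, without surrounding spaces."""
    # partition() returns the whole string as the first item when no colon is present
    name, _separator, _rest = title.partition(":")
    return name.strip()


# Sample titles copied from the output shown in the question
titles = [
    "Trường Sơn (đĩa nhựa): TS-000168-1",
    "Trường Sơn 1: Hát Giữa Quê Hương",
    "Trần Ngọc Đức: Băng Vàng - Bóng Tình Yêu",
]

short_names = [short_name(t) for t in titles]
print(short_names)
```

Inside the answer's loop this would become `data.append(short_name(x.text))`. Titles without a colon pass through unchanged.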

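I don't know why you are scraping just one page, but you can scrape all of the pages like this:

```python
import requests
from bs4 import BeautifulSoup

data = []
# range(1, 16) covers pages 1 through 15
for i in range(1, 16):
    page_to_scrape = requests.get(f'http://lungtung.com/nhacvang/pub/tapesbyletr.asp?strLTR=T&page={i}')
    soup = BeautifulSoup(page_to_scrape.text, "html.parser")
    title = soup.select('div.subsection')
    for x in title:
        data.append(x.text)

print(data)
```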
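The loop above hard-codes 15 pages. As a sketch only, the paging can be separated from the fetching so that the loop stops as soon as a page comes back empty. This assumes a page past the end returns no `div.subsection` elements, which has not been checked against this site; `collect_titles`, `fetch_titles`, and the fake data are hypothetical names used for illustration:

```python
import itertools
from typing import Callable, Iterable, List


def collect_titles(
    fetch_titles: Callable[[int], List[str]],
    pages: Iterable[int],
) -> List[str]:
    """Gather titles page by page, stopping at the first page that yields none."""
    collected: List[str] = []
    for page in pages:
        titles = fetch_titles(page)
        if not titles:
            # Assumption: an empty page means there are no more pages
            break
        collected.extend(titles)
    return collected


# Stand-in for a real fetcher so the sketch runs without network access
fake_site = {1: ["A", "B"], 2: ["C"]}


def fake_fetch(page: int) -> List[str]:
    return fake_site.get(page, [])


# itertools.count(1) yields 1, 2, 3, ... until the loop breaks
result = collect_titles(fake_fetch, itertools.count(1))
print(result)
```

A real `fetch_titles` could wrap the answer's `requests.get` and `soup.select('div.subsection')` calls and return `[x.text for x in title]` for the given page number.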

huangapple
  • Published on February 24, 2023, 11:40:40
  • When reposting, please retain the link to this article: https://go.coder-hub.com/75552419.html