Searching a list for multiple substrings in Python?

Question

I have a list of 10-15 links, and I want to find the links that contain either 'sen_floor' or 'asm_floor'.

This is my code so far (ca_data is the original link):

import requests
from bs4 import BeautifulSoup
import re

ca = requests.get(ca_data)
soup = BeautifulSoup(ca.content, 'html.parser')
links = []

for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
   links.append(link.get('href'))

r = re.compile(".*vote")
newlist = list(filter(r.match, links))
print(newlist)

subs = 'sen_floor'
sen_votes = list(filter(lambda x: subs in x, newlist))
print(str(sen_votes))

This returns a list of all links containing sen_floor, as expected. Ideally I'd like a separate list for asm_floor, so I tried repeating the last block:

sub = 'asm_floor'
asm_votes = list(filter(lambda x: sub in x, newlist))
print(str(asm_votes))

But it doesn't work; it just returns the same result as the sen_floor search.

Help?

Answer 1

Score: 1

import requests
from bs4 import BeautifulSoup

r = requests.get(
    "http://www.legislature.ca.gov/cgi-bin/port-postquery?bill_number=ab_2&sess=CUR&house=B&author=alejo_%3Calejo%3E")

soup = BeautifulSoup(r.text, 'html.parser')

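sen = []  # hrefs containing 'sen_floor'
asm = []  # hrefs containing 'asm_floor'
# href=True keeps only <a> tags that actually have an href attribute
for item in soup.findAll("a", {'href': True}):
    item = item.get("href")
    if 'sen_floor' in item:
        sen.append(item)
    elif 'asm_floor' in item:
        asm.append(item)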
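
The same if/elif split can be tried offline on a plain list of strings. The sketch below is a generalization, not the answerer's code: the helper name group_by_substring and the example.com URLs are made up for illustration, and it deliberately keeps the first-match-wins behaviour of the elif chain.

```python
def group_by_substring(links, substrings):
    """Put each link into the bucket of the first substring it contains."""
    groups = {sub: [] for sub in substrings}
    for link in links:
        for sub in substrings:
            if sub in link:
                groups[sub].append(link)
                break  # first match wins, like the if/elif chain
    return groups


# Hypothetical sample data standing in for the scraped hrefs.
sample_links = [
    "http://example.com/bill/ab_2_vote_sen_floor.html",
    "http://example.com/bill/ab_2_vote_asm_floor.html",
    "http://example.com/bill/ab_2_vote_asm_comm.html",
]

groups = group_by_substring(sample_links, ["sen_floor", "asm_floor"])
print(groups["sen_floor"])
print(groups["asm_floor"])
```

Links that contain none of the substrings are simply dropped, which matches the loop above.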

Answer 2

Score: 0

Use the CSS attribute "contains" selector ([href*=...]) and join the alternatives with a comma, which acts as "or" in a selector list. This returns only the hrefs that contain at least one of the specified substrings. If you are checking multiple pages, run it inside a loop and make sure you rebuild the soup object for each page.

matches = [i['href'] for i in soup.select('[href*=asm_floor],[href*=sen_floor]')]
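
For links that are already collected in a plain Python list (as in the question), a comparable either/or match can be sketched with a regular-expression alternation. This is a different technique from the CSS selector above, and the example.com URLs and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample data standing in for the scraped hrefs.
links = [
    "http://example.com/bill/ab_2_vote_sen_floor.html",
    "http://example.com/bill/ab_2_vote_asm_floor.html",
    "http://example.com/bill/ab_2_vote_asm_comm.html",
]

# "|" means "or": keep a link if either substring occurs anywhere in it.
pattern = re.compile(r"sen_floor|asm_floor")
matches = [link for link in links if pattern.search(link)]
print(matches)
```

Note that pattern.search scans the whole string, whereas pattern.match only matches at the start, which is why the question's pattern needed a leading ".*".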

To collect the matches into separate lists instead:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.legislature.ca.gov/cgi-bin/port-postquery?bill_number=ab_2&sess=CUR&house=B&author=alejo_%3Calejo%3E")
soup = BeautifulSoup(r.text, 'html.parser')
sen = [i['href'] for i in soup.select('[href*=sen_floor]')]
asm = [i['href'] for i in soup.select('[href*=asm_floor]')]
print('sen: ', sen)
print('asm:', asm)
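
To check the filtering logic without hitting the live page, the sketch below swaps in Python's standard-library html.parser for requests and BeautifulSoup and feeds it a made-up HTML snippet. The class name HrefCollector and the example.com URLs are hypothetical; this is not the answer's method, only an offline stand-in that produces the same sen/asm split.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Made-up HTML standing in for the downloaded page.
html = """
<a href="http://example.com/ab_2_vote_sen_floor.html">Senate floor</a>
<a href="http://example.com/ab_2_vote_asm_floor.html">Assembly floor</a>
<a href="http://example.com/ab_2_bill.html">Bill text</a>
<a name="top">anchor without href</a>
"""

collector = HrefCollector()
collector.feed(html)

sen = [h for h in collector.hrefs if "sen_floor" in h]
asm = [h for h in collector.hrefs if "asm_floor" in h]
print("sen:", sen)
print("asm:", asm)
```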

huangapple
  • Posted on January 3, 2020, 21:12:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59579215.html