Python BeautifulSoup: create and combine lists and remove redundancies like \n
Question
How can I combine the full lists into a DataFrame? When I print it, it seems to show only the first record, and it also includes \n and other redundancies like ' etc.
import requests
from requests_html import HTML, HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import csv
import json

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
lehigh = requests.get(url).text
soup = BeautifulSoup(lehigh,'lxml')

for opp in soup.find_all('div',class_="sidearm-schedule-game-opponent-text"):
    opp_list = []
    opp_list.append(opp.text)
# print(opp_list)

for conf in soup.find_all('div',class_="sidearm-schedule-game-conference-conference"):
    conf_list = []
    conf_list.append(conf.text)
# print(conf_list)

dict = {'opponent':[opp_list],'conference':[conf_list]}
df = pd.DataFrame(dict)
print(df)
Answer 1
Score: 1
You are resetting opp_list and conf_list to [] on every iteration; initialize them only once, before each loop. Also, you don't need the extra brackets when creating the dictionary: {'opponent': opp_list, 'conference': conf_list}
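To see both problems in isolation, here is a minimal sketch without any scraping. The values in `scraped` are stand-ins I made up for illustration, not the real page data, and the variable names are hypothetical.

```python
# Stand-in values; in the real script these come from the scraped page.
scraped = ["at UConn", "vs Drexel", "at Rider"]

# Buggy: the list is recreated on every pass, so only one item survives.
for name in scraped:
    buggy = []
    buggy.append(name)

# Fixed: create the list once, before the loop.
fixed = []
for name in scraped:
    fixed.append(name)

# Extra brackets nest the list: one element instead of three.
wrapped = {"opponent": [fixed]}
plain = {"opponent": fixed}

print(buggy)                     # ['at Rider']
print(fixed)                     # ['at UConn', 'vs Drexel', 'at Rider']
print(len(wrapped["opponent"]))  # 1
print(len(plain["opponent"]))    # 3
```

With the nested version, pandas would build a single row whose cell holds the whole list, which is why the DataFrame looked like it had only one record.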
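To remove the whitespace, you can use the .get_text() method with the strip=True and separator= parameters: strip=True trims each piece of text, and separator= sets the string placed between the pieces.
For example: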
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
lehigh = requests.get(url).text
soup = BeautifulSoup(lehigh, 'lxml')

# Create each list once, before its loop
opp_list = []
for opp in soup.find_all('div', class_="sidearm-schedule-game-opponent-text"):
    opp_list.append(opp.get_text(strip=True, separator=' '))

conf_list = []
for conf in soup.find_all('div', class_="sidearm-schedule-game-conference-conference"):
    conf_list.append(conf.get_text(strip=True))

# Pass the lists directly, without extra brackets
data = {'opponent': opp_list, 'conference': conf_list}
df = pd.DataFrame(data)
print(df)
Prints:
                         opponent       conference
0                        at UConn
1                       vs Drexel
2            at George Washington
3                   at St. John's
4                   vs Binghamton
5                        at Rider
6                         vs Penn
7                         at Army  Patriot League*
8                      vs Cornell
9                     at Boston U  Patriot League*
10                 vs #20 Colgate  Patriot League*
11                        vs Navy  Patriot League*
12                   at Lafayette  Patriot League*
13                   at Dartmouth
14                    vs American  Patriot League*
15                    at Bucknell  Patriot League*
16                at Loyola (Md.)  Patriot League*
17     vs Holy Cross Senior Night  Patriot League*
18  vs No. 3 Colgate (Semifinals)
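If you want to check what strip=True and separator= do without hitting the site, something like this should work. The HTML snippet is a simplified guess at what one opponent div might look like, not the page's actual markup, and I use the built-in html.parser here instead of lxml just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: whitespace and newlines between the nested tags.
html = (
    '<div class="sidearm-schedule-game-opponent-text">\n'
    '  <span>at</span>\n'
    '  <a href="#">UConn</a>\n'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="sidearm-schedule-game-opponent-text")

raw = div.text                                   # keeps the newlines and indentation
clean = div.get_text(strip=True, separator=" ")  # trims each piece, joins with a space
joined = div.get_text(strip=True)                # trims each piece, joins with nothing

print(repr(raw))
print(clean)   # at UConn
print(joined)  # atUConn
```

That is why the opponent column uses separator=' ': without it the pieces from the nested tags would run together.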