working with BeautifulSoup - defining the entities for getting all the data of the target page - perhaps pandas would solve this even better
Question
I am in the middle of a task with BeautifulSoup - the awesome Python library for all things scraping. The aim: I want to get the data out of this page: https://schulfinder.kultus-bw.de. Note: it's a public page for finding all schools in a certain region.

A typical dataset will look like this:
Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail
I think that with Python I will go about it like so.

First, I will have to send a request to the URL and get the page's HTML content:
```python
url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content
```
Afterwards, as the next step, I will have to create a BeautifulSoup object and find the HTML elements that contain the school names:
```python
soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
```
Extract the school names from the HTML elements and store them in a list:
```python
school_names = [school.text.strip() for school in schools]
```
And subsequently I need to print the list of school names:
```python
print(school_names)
```
The complete code would look like this:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
school_names = [school.text.strip() for school in schools]
print(school_names)
```
But I need to get the whole dataset:
Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail
The best thing would be to output it in CSV format. If I were a bit more familiar with Python, I would run this little code and then work with pandas - I guess that pandas would make this kind of thing much easier.
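For illustration, a minimal sketch of that pandas step, assuming the school_names list from the complete code above (one column only; the full dataset would also need the address and phone fields):

```python
import pandas as pd

# placeholder input - in the real run this comes from the scraping step above
school_names = ['Schule am Schlosspark', 'Schauenburg-Schule']

# one-column DataFrame, written to CSV without the index column
df = pd.DataFrame({'name': school_names})
df.to_csv('schools.csv', index=False)
```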
Update: see some images of the page.
Update 2: I tried to run this in Google Colab and I get the following errors.

Question: do I need to install some of the packages in Colab!?
```python
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product
```
Do I need to take care of any preliminaries in Google Colab?!
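(As a hedged aside: pandas, tqdm and requests normally come preinstalled in Colab, so no extra setup should be needed; if an ImportError did appear, the usual fix would be a pip cell like this - a general Colab idiom, not specific to this script:)

```python
# run in a Colab cell only if an ImportError appears;
# these packages are usually preinstalled in Colab
!pip install pandas tqdm requests
```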
See the error log that I got:
```
100%|██████████| 676/676 [00:00<00:00, 381711.03it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

5 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'branches'
```
End of the error log from Google Colab.

And see below the errors that I got from Anaconda (run at home):
```
100%|██████████| 676/676 [00:00<00:00, 9586.24it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3628             try:
-> 3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:
~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_27106/2163647892.py in <module>
     36 df = pd.DataFrame(all_data)
     37
---> 38 df = df.explode('branches')
     39 df = df.explode('trades')
     40 df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)
~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in explode(self, column, ignore_index)
   8346         df = self.reset_index(drop=True)
   8347         if len(columns) == 1:
-> 8348             result = df[columns[0]].explode()
   8349         else:
   8350             mylen = lambda x: len(x) if is_list_like(x) else -1
~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3503         if self.columns.nlevels > 1:
   3504             return self._getitem_multilevel(key)
-> 3505         indexer = self.columns.get_loc(key)
   3506         if is_integer(indexer):
   3507             indexer = [indexer]
~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3629             return self._engine.get_loc(casted_key)
   3630         except KeyError as err:
-> 3631             raise KeyError(key) from err
   3632         except TypeError:
   3633             # If we have a listlike key, _check_indexing_error will raise

KeyError: 'branches'
```
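One hint that both logs share (my reading, an assumption rather than a confirmed diagnosis): the second progress bar shows 0it, so no UUIDs were collected, all_data stays empty, the resulting DataFrame has no 'branches' column, and df.explode('branches') then raises the KeyError. A minimal sketch of a guard that would surface this earlier, using the variable names from the answer's script below:

```python
import pandas as pd

all_data = []  # stand-in for the list the download loop should have filled

# fail loudly instead of hitting KeyError: 'branches' later on
if not all_data:
    raise SystemExit('no school records fetched - check the requests first')

df = pd.DataFrame(all_data)
for col in ('branches', 'trades'):
    if col not in df.columns:           # guard against missing nested fields
        df[col] = [[] for _ in range(len(df))]
df = df.explode('branches')
```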
Conclusio: I am trying to find out more - I am eagerly trying to get more insights and to run the code...

Many thanks for all the help - and for encouraging me to dive into all things Python... this is awesome.

Have a great day...
Answer 1
Score: 2
You can try this: when you enter "aa" and click "Suchen", the server returns all items that contain "aa". So you can try all combinations ("aa", "ab", "ac", ...) to get all school IDs and then get info about all schools:
```python
import requests
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

api_url1 = 'https://schulfinder.kultus-bw.de/api/schools?distance=&outposts=1&owner=&school_kind=&term={term}&types=&work_schedule='
api_url2 = 'https://schulfinder.kultus-bw.de/api/school?uuid={uuid}'


def get_school(term):
    try:
        return requests.get(api_url1.format(term=term)).json()
    except:
        return []


def get_school_detail(uuid):
    return requests.get(api_url2.format(uuid=uuid)).json()


if __name__ == '__main__':
    l = [''.join(t) for t in product(chars, chars)]

    # you can also try all 3-character combinations (this will yield 4476
    # results, but the first step will take longer):
    # l = [''.join(t) for t in product(chars, chars, chars)]

    all_data = []
    all_uuids = set()

    with Pool(processes=8) as pool:
        for result in tqdm(pool.imap_unordered(get_school, l), total=len(l)):
            for item in result:
                all_uuids.add(item['uuid'])

    with Pool(processes=16) as pool:
        for r in tqdm(pool.imap_unordered(get_school_detail, all_uuids), total=len(all_uuids)):
            all_data.append(r)

    df = pd.DataFrame(all_data)

    df = df.explode('branches')
    df = df.explode('trades')
    df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)
    df = pd.concat([df, df.pop('trades').apply(pd.Series).add_prefix('trade_')], axis=1)

    print(df.head())
    df.to_csv('data.csv', index=False)
```
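A side note on the flattening part: explode() gives each list element its own row, and pop(...).apply(pd.Series) spreads the nested dicts into prefixed columns. A tiny self-contained illustration with a made-up record (the field values here are hypothetical, loosely shaped like a /api/school response):

```python
import pandas as pd

# one made-up school with two nested branch dicts (hypothetical values)
sample = [{'name': 'Testschule',
           'branches': [{'branch_id': 15110, 'acronym': 'RS'},
                        {'branch_id': 14210, 'acronym': 'WRS'}]}]

df = pd.DataFrame(sample)
df = df.explode('branches')  # one row per branch dict
df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')],
               axis=1)
print(df)  # two 'Testschule' rows with branch_branch_id / branch_acronym columns
```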
This will get info about all 4461 schools and save the data to data.csv:
```
100%|████████████████████████████████████████████████████████████████████████████████| 676/676 [00:38<00:00, 17.63it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 4461/4461 [00:22<00:00, 194.86it/s]

   outpost_number                                             name                 street house_number postcode        city           phone             fax                              email                                   website tablet_tranche tablet_platform tablet_branches tablet_trades       lat      lng official branch_branch_id branch_acronym branch_description_long trade_0 trade_trade_id trade_description
0               0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf  +4975259238102  +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881        0            15110             RS              Realschule     NaN            NaN               NaN
0               0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf  +4975259238102  +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881        0            14210            WRS           Werkrealschule     NaN            NaN               NaN
1               0             Schauenburg-Schule Grundschule Urloffen      Schauenburgstr.            4    77767  Appenweier    +49780597236   +497805914396  poststelle@04155676.schule.bwl.de  http://www.schauenburgschule-urloffen.de           None            None            None          None  48.56460  7.97361        0            12110             GS             Grundschule     NaN            NaN               NaN
2               0                Klosterwiesenschule Grundschule             Boschstr.            1    88255      Baindt  +49750294114132 +49750294114139  poststelle@04139725.schule.bwl.de               http://www.baindt.de/schule           None            None            None          None  47.84319  9.65829        0            12110             GS             Grundschule     NaN            NaN               NaN
3               0              Montessori-Grundschule Nußdorf            Zum Laugele            7    88662  Überlingen    +49755165620            None  poststelle@04117742.schule.bwl.de        http://www.grundschule-nussdorf.de           None            None            None          None  47.75325  9.19516        0            12110             GS             Grundschule     NaN            NaN               NaN
...
```
Screenshot from LibreOffice: