working with BeautifulSoup – defining the entities for getting all the data of the target page – perhaps pandas would solve this even better

Question:

I am in the middle of a task with BeautifulSoup - the awesome Python library for all things scraping. The aim: I want to get the data out of this page: https://schulfinder.kultus-bw.de (note: it's a public page for finding all schools in a certain region).

So a typical dataset will look like:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail 

Well, I think that with Python I will go like so:

First, I will have to send a request to the URL and get the page's HTML content:

```python
url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content
```

Afterwards, as the next step, I will have to create a BeautifulSoup object and find the HTML elements that contain the school names:

```python
soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
```

Extract the school names from the HTML elements and store them in a list:

```python
school_names = [school.text.strip() for school in schools]
```

And subsequently I need to print the list of school names:

```python
print(school_names)
```

Well, the complete code would look like this:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
school_names = [school.text.strip() for school in schools]

print(school_names)
```

But I need to have the whole dataset:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail 

The best thing would be to output it in CSV format; well, if I were a bit more familiar with Python, I would run this little code and then work with pandas - I guess that pandas would make this kind of thing much easier.
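Just to sketch the idea (this is only a guess of mine - the column name `school_name` is a placeholder, and it assumes the `dropdown-item` selector from above actually matches something on the page):

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://schulfinder.kultus-bw.de'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# collect the link texts, assuming the selector from above really matches
school_names = [a.text.strip() for a in soup.find_all('a', {'class': 'dropdown-item'})]

# one column for now; the other fields (Adresse, Kategorie, Straße, ...)
# would become further columns once they are located in the page
pd.DataFrame({'school_name': school_names}).to_csv('schools.csv', index=False)
```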

..

Update: see some images of the page (screenshots not reproduced here).

Update 2: I tried to run this in Google Colab and I get the following errors.
Question: do I need to install some of the packages in Colab?

```python
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product
```

Do I need to take care of any preliminaries in Google Colab?
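For what it's worth, `multiprocessing`, `string` and `itertools` are part of the Python standard library, and `pandas`, `tqdm` and `requests` normally come preinstalled in Colab - so my guess is that nothing needs installing. If one of them were missing, a notebook cell like this should cover it:

```python
# install the third-party packages from a Colab cell if they are missing;
# the stdlib modules (multiprocessing, string, itertools) need no install
!pip install pandas tqdm requests
```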

See the error log that I have gotten:

```
100%|██████████| 676/676 [00:00<00:00, 381711.03it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

5 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'branches'
```

End of the error log gotten from Google Colab.

See below the errors that I have gotten from Anaconda (run locally, at home):

```
100%|██████████| 676/676 [00:00<00:00, 9586.24it/s]
0it [00:00, ?it/s]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3628             try:
-> 3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:

~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_27106/2163647892.py in <module>
     36     df = pd.DataFrame(all_data)
     37 
---> 38     df = df.explode('branches')
     39     df = df.explode('trades')
     40     df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)

~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in explode(self, column, ignore_index)
   8346         df = self.reset_index(drop=True)
   8347         if len(columns) == 1:
-> 8348             result = df[columns[0]].explode()
   8349         else:
   8350             mylen = lambda x: len(x) if is_list_like(x) else -1

~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3503             if self.columns.nlevels > 1:
   3504                 return self._getitem_multilevel(key)
-> 3505             indexer = self.columns.get_loc(key)
   3506             if is_integer(indexer):
   3507                 indexer = [indexer]

~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:
-> 3631                 raise KeyError(key) from err
   3632             except TypeError:
   3633                 # If we have a listlike key, _check_indexing_error will raise

KeyError: 'branches'
```

Conclusion: I am trying to find out more - eagerly trying to get more insights and to run the code...

Many thanks for all the help - and for encouraging me to dive into all things Python... this is awesome.
Have a great day...

Answer 1

Score: 2


You can try this: when you enter "aa" and click "Suchen", the server returns all items that contain "aa". So you can try all combinations (aa, ab, ac, ...) to get all school IDs and then get the info about all schools:

```python
import requests
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

# search endpoint: returns every school whose name contains {term}
api_url1 = 'https://schulfinder.kultus-bw.de/api/schools?distance=&outposts=1&owner=&school_kind=&term={term}&types=&work_schedule='
# detail endpoint: returns the full record for one school
api_url2 = 'https://schulfinder.kultus-bw.de/api/school?uuid={uuid}'

def get_school(term):
    try:
        return requests.get(api_url1.format(term=term)).json()
    except Exception:
        return []

def get_school_detail(uuid):
    return requests.get(api_url2.format(uuid=uuid)).json()

if __name__ == '__main__':
    # all two-character search terms: aa, ab, ..., zz
    l = [''.join(t) for t in product(chars, chars)]
    # you can also try all 3-character combinations (this will yield 4476 results,
    # but the first step will take longer):
    # l = [''.join(t) for t in product(chars, chars, chars)]

    all_data = []
    all_uuids = set()

    # step 1: collect the unique school UUIDs from the search results
    with Pool(processes=8) as pool:
        for result in tqdm(pool.imap_unordered(get_school, l), total=len(l)):
            for item in result:
                all_uuids.add(item['uuid'])

    # step 2: fetch the full record for every UUID
    with Pool(processes=16) as pool:
        for r in tqdm(pool.imap_unordered(get_school_detail, all_uuids), total=len(all_uuids)):
            all_data.append(r)

    df = pd.DataFrame(all_data)

    # 'branches' and 'trades' hold lists of dicts: explode them to one row
    # per entry, then expand each dict into prefixed columns
    df = df.explode('branches')
    df = df.explode('trades')
    df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)
    df = pd.concat([df, df.pop('trades').apply(pd.Series).add_prefix('trade_')], axis=1)

    print(df.head())

    df.to_csv('data.csv', index=False)
```

This will get info about all 4461 schools and save the data to data.csv:

```
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 676/676 [00:38<00:00, 17.63it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4461/4461 [00:22<00:00, 194.86it/s]
  outpost_number                                                 name               street house_number postcode        city            phone              fax                              email                                   website tablet_tranche tablet_platform tablet_branches tablet_trades       lat      lng  official  branch_branch_id branch_acronym branch_description_long  trade_0 trade_trade_id trade_description
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             15110             RS              Realschule      NaN            NaN               NaN
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             14210            WRS          Werkrealschule      NaN            NaN               NaN
1              0              Schauenburg-Schule Grundschule Urloffen      Schauenburgstr.            4    77767  Appenweier     +49780597236    +497805914396  poststelle@04155676.schule.bwl.de  http://www.schauenburgschule-urloffen.de           None            None            None          None  48.56460  7.97361         0             12110             GS             Grundschule      NaN            NaN               NaN
2              0                      Klosterwiesenschule Grundschule            Boschstr.            1    88255      Baindt  +49750294114132  +49750294114139  poststelle@04139725.schule.bwl.de               http://www.baindt.de/schule           None            None            None          None  47.84319  9.65829         0             12110             GS             Grundschule      NaN            NaN               NaN
3              0                       Montessori-Grundschule Nußdorf          Zum Laugele            7    88662  Überlingen     +49755165620             None  poststelle@04117742.schule.bwl.de        http://www.grundschule-nussdorf.de           None            None            None          None  47.75325  9.19516         0             12110             GS             Grundschule      NaN            NaN               NaN
```
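A side note on the flattening steps: `explode` gives each element of a list-valued column its own row, and `pop(...).apply(pd.Series)` turns the dict keys into real columns. A tiny sketch on made-up data (only the `acronym` key is shown here for brevity; the real branch records carry more fields, as the `branch_*` columns above suggest):

```python
import pandas as pd

# toy frame shaped like the scraped data: a list-of-dicts column
df = pd.DataFrame({'name': ['School A'],
                   'branches': [[{'acronym': 'GS'}, {'acronym': 'RS'}]]})

df = df.explode('branches')  # one row per branch dict
branch_cols = df.pop('branches').apply(pd.Series).add_prefix('branch_')
df = pd.concat([df, branch_cols], axis=1)  # dict keys become columns

print(df)
#        name branch_acronym
# 0  School A             GS
# 0  School A             RS
```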

...

Screenshot from LibreOffice (image not reproduced here).

Posted by huangapple on 2023-03-12. Original link: https://go.coder-hub.com/75710267.html