March 12, 2023, 07:42:05

Working with BeautifulSoup: defining the entities for getting all the data of the target page (perhaps pandas would solve this even better)

Question


I am in the middle of a task with BeautifulSoup, the awesome Python library for all things scraping. The aim: I want to get the data out of this page: https://schulfinder.kultus-bw.de. Note: it is a public page for finding all schools in a certain region.

So a typical dataset looks like this:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail

Well, I think that with Python I would go about it like so:

First, I have to send a request to the URL and get the page's HTML content:

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

Afterwards, as the next step, I have to create a BeautifulSoup object and find the HTML elements that contain the school names:

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})

Extract the school names from the HTML elements and store them in a list:

school_names = [school.text.strip() for school in schools]

Finally, I need to print the list of school names:

print(school_names)

The complete code would look like this:

import requests
from bs4 import BeautifulSoup

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
school_names = [school.text.strip() for school in schools]

print(school_names)

But I need the whole dataset:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail

The best thing would be to output it in CSV format. If I were a bit more familiar with Python, I would run this little code and work with pandas; I guess pandas would make this kind of thing much easier.
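
A minimal sketch of the CSV output described above, using only the standard-library `csv` module rather than pandas. The column names are the German labels from the list above; the `write_schools_csv` helper and the sample row are hypothetical placeholders, not data from the site.

```python
import csv
import io

# Column headers taken from the field list in the question.
FIELDS = ["Adresse Name", "Adresse 2", "Kategorie", "Straße",
          "PLZ und Ort", "Tel 1", "Tel 2", "Mail"]


def write_schools_csv(rows, handle):
    """Write a list of dicts to an open text handle as CSV (hypothetical helper)."""
    # restval="" fills in any field a school does not have (e.g. no second phone).
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up example record, only to show the shape of the data.
sample = [{"Adresse Name": "Beispielschule", "Kategorie": "Grundschule",
           "Straße": "Musterstr. 1", "PLZ und Ort": "12345 Musterstadt",
           "Tel 1": "+49000000000", "Mail": "schule@example.org"}]

buffer = io.StringIO()
write_schools_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, open it with `open('schools.csv', 'w', newline='', encoding='utf-8')` and pass that handle in.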

Update: see some images of the page:

Update 2: I tried to run this in Google Colab and got the following errors.
Question: do I need to install some of the packages in Colab?

import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

Do I need to take care of any preliminaries in Google Colab?

See the error log that I got:
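
One way to check the preliminaries is to ask Python which modules it can find before running the script. The sketch below uses the standard-library `importlib.util.find_spec`; the `missing_modules` helper name is made up for illustration. Note that the import list above does not show `requests`, so it is included in the check as an assumption about what the script needs.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Third-party modules from the imports above, plus requests (assumed to be needed).
needed = ["requests", "pandas", "tqdm"]
print("missing:", missing_modules(needed))
```

Anything reported as missing can then be installed with `pip install <name>` before the script is run again.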

100%|██████████| 676/676 [00:00<00:00, 381711.03it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

5 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'branches'

End of the error log from Google Colab.

Below are the errors that I got from Anaconda (logs from my machine at home):

100%|██████████| 676/676 [00:00<00:00, 9586.24it/s]
0it [00:00, ?it/s]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3628             try:
-> 3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:

~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_27106/2163647892.py in <module>
     36     df = pd.DataFrame(all_data)
     37 
---> 38     df = df.explode('branches')
     39     df = df.explode('trades')
     40     df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)

~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in explode(self, column, ignore_index)
   8346         df = self.reset_index(drop=True)
   8347         if len(columns) == 1:
-> 8348             result = df[columns[0]].explode()
   8349         else:
   8350             mylen = lambda x: len(x) if is_list_like(x) else -1

~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3503             if self.columns.nlevels > 1:
   3504                 return self._getitem_multilevel(key)
-> 3505             indexer = self.columns.get_loc(key)
   3506             if is_integer(indexer):
   3507                 indexer = [indexer]

~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:
-> 3631                 raise KeyError(key) from err
   3632             except TypeError:
   3633                 # If we have a listlike key, _check_indexing_error will raise

KeyError: 'branches'

Conclusion: I am trying to find out more, and I am eager to get more insights and to run the code.

Many thanks for all the help, and for the encouragement to dive into all things Python. This is awesome.
Have a great day.

Answer 1

Score: 2


You can try this: when you enter "aa" and click "Suchen", the server returns all items that contain "aa". So you can try all two-letter combinations (aa, ab, ac, ...) to get all school IDs, and then fetch the details for each school:

import requests
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

# search endpoint (schools matching `term`) and detail endpoint (one school by UUID)
api_url1 = 'https://schulfinder.kultus-bw.de/api/schools?distance=&outposts=1&owner=&school_kind=&term={term}&types=&work_schedule='
api_url2 = 'https://schulfinder.kultus-bw.de/api/school?uuid={uuid}'

def get_school(term):
    try:
        return requests.get(api_url1.format(term=term)).json()
    except (requests.RequestException, ValueError):
        # network error or a response that is not valid JSON
        return []

def get_school_detail(uuid):
    return requests.get(api_url2.format(uuid=uuid)).json()

if __name__ == '__main__':
    # all two-letter search terms: aa, ab, ..., zz (676 in total)
    l = [''.join(t) for t in product(chars, chars)]
    # you can also try all 3-character combinations (this will yield 4476 results, but the first step will take longer)
    # l = [''.join(t) for t in product(chars, chars, chars)]

    all_data = []
    all_uuids = set()

    # step 1: collect the unique school UUIDs from the search results
    with Pool(processes=8) as pool:
        for result in tqdm(pool.imap_unordered(get_school, l), total=len(l)):
            for item in result:
                all_uuids.add(item['uuid'])

    # step 2: download the detail record for every UUID
    with Pool(processes=16) as pool:
        for r in tqdm(pool.imap_unordered(get_school_detail, all_uuids), total=len(all_uuids)):
            all_data.append(r)

    df = pd.DataFrame(all_data)

    # one row per branch/trade, then expand the nested dicts into prefixed columns
    df = df.explode('branches')
    df = df.explode('trades')
    df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)
    df = pd.concat([df, df.pop('trades').apply(pd.Series).add_prefix('trade_')], axis=1)

    print(df.head())

    df.to_csv('data.csv', index=False)

This gets info about all 4461 schools and saves the data to data.csv:

100%|██████████| 676/676 [00:38<00:00, 17.63it/s]
100%|██████████| 4461/4461 [00:22<00:00, 194.86it/s]
  outpost_number                                                 name               street house_number postcode        city            phone              fax                              email                                   website tablet_tranche tablet_platform tablet_branches tablet_trades       lat      lng  official  branch_branch_id branch_acronym branch_description_long  trade_0 trade_trade_id trade_description
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             15110             RS              Realschule      NaN            NaN               NaN
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             14210            WRS          Werkrealschule      NaN            NaN               NaN
1              0              Schauenburg-Schule Grundschule Urloffen      Schauenburgstr.            4    77767  Appenweier     +49780597236    +497805914396  poststelle@04155676.schule.bwl.de  http://www.schauenburgschule-urloffen.de           None            None            None          None  48.56460  7.97361         0             12110             GS             Grundschule      NaN            NaN               NaN
2              0                      Klosterwiesenschule Grundschule            Boschstr.            1    88255      Baindt  +49750294114132  +49750294114139  poststelle@04139725.schule.bwl.de               http://www.baindt.de/schule           None            None            None          None  47.84319  9.65829         0             12110             GS             Grundschule      NaN            NaN               NaN
3              0                       Montessori-Grundschule Nußdorf          Zum Laugele            7    88662  Überlingen     +49755165620             None  poststelle@04117742.schule.bwl.de        http://www.grundschule-nussdorf.de           None            None            None          None  47.75325  9.19516         0             12110             GS             Grundschule      NaN            NaN               NaN

...
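
The `KeyError: 'branches'` in the question is raised when the DataFrame has no `branches` column. The second progress bar in those logs shows `0it`, so no UUIDs were collected and `all_data` was presumably empty. Below is a plain-Python sketch of the same kind of flattening with a guard for that case. It is not the pandas code above: the `flatten` helper is hypothetical, it only expands `branches`, and the record layout (`branches` as a list of dicts with `branch_id`, `acronym`, `description_long`) is an assumption read off the column names in the output.

```python
def flatten(records):
    """Yield one flat dict per (school, branch) pair; a school without branches yields one row."""
    for record in records:
        # Copy everything except the nested lists.
        base = {k: v for k, v in record.items() if k not in ("branches", "trades")}
        # A missing or empty list still produces a single row.
        for branch in record.get("branches") or [{}]:
            row = dict(base)
            row.update({f"branch_{k}": v for k, v in branch.items()})
            yield row


# Made-up record in the assumed layout.
example = [{
    "name": "Beispielschule",
    "city": "Musterstadt",
    "branches": [
        {"branch_id": 15110, "acronym": "RS", "description_long": "Realschule"},
        {"branch_id": 14210, "acronym": "WRS", "description_long": "Werkrealschule"},
    ],
    "trades": [],
}]

rows = list(flatten(example))
if not rows:
    print("no school details were downloaded; check the requests first")
for row in rows:
    print(row)
```

With an empty input the generator simply yields nothing, so the empty case can be reported explicitly instead of failing on a missing column.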

screenshot from LibreOffice:
