This code is supposed to extract the contact person and fax number. It looks okay, but it does not work.

Question


The result is this (but there are results on the site):

> Contact Person: None
> Fax Number: None
> Contact Person: None

import requests
from bs4 import BeautifulSoup


def get_contact_person_and_fax_number(url):
    # Send GET request to link url
    link_response = requests.get(url)

    # Parse HTML content with BeautifulSoup
    link_soup = BeautifulSoup(link_response.content, "html.parser")

    # Find the contact person and fax number
    contact_person = None
    fax_number = None
    contact_div = link_soup.find("div", {"class": "info-gen-box clearfix"})
    if contact_div:
        contact_person = contact_div.find("div", {"class": "info-gen-box clearfix"})
        if contact_person:
            contact_person = contact_person.text.strip()
        fax_number = contact_div.find("div", {"class": "info-fax info-mar"})
        if fax_number:
            fax_number = fax_number.text.strip()

    return (contact_person, fax_number)


# Prompt user for NAICS code
naics_code = input("Enter NAICS code: ")

# Loop through all search result pages
for page_num in range(1, 81):
    # Send GET request to search page
    page_url = f"https://www.usaopps.com/government_contractors/search.htm?naics={naics_code}&page={page_num}"
    page_response = requests.get(page_url)

    # Parse HTML content with BeautifulSoup
    page_soup = BeautifulSoup(page_response.content, "html.parser")

    # Find all search results on current page
    results = page_soup.select(".lr-title")
    links = page_soup.find_all('div', {'class': 'lr-title'})

    # Loop through each search result link and extract contact person and fax number if available
    for link in links:
        # Extract link url
        url = link.find('a').get('href')

        if not url.startswith('http'):
            url = f"https://www.usaopps.com{url}"

        # Get contact person and fax number for this link
        contact_person, fax_number = get_contact_person_and_fax_number(url)

        # Print results
        print(f"Contact Person: {contact_person}")
        print(f"Fax Number: {fax_number}")

Answer 1

Score: 1

This section seems to be the problem to me. Once contact_div is set, there isn't another div of the same class inside it ... in fact, looking at the site, this isn't how it's set up anymore. Did it change since you started this?

contact_div = link_soup.find("div", {"class": "info-gen-box clearfix"})
if contact_div:
    contact_person = contact_div.find("div", {"class": "info-gen-box clearfix"})
    if contact_person:
        contact_person = contact_person.text.strip()
    fax_number = contact_div.find("div", {"class": "info-fax info-mar"})
    if fax_number:
        fax_number = fax_number.text.strip()

I see only one <div class="info-gen-box clearfix">, and it closes before the info-fax info-mar div that you are trying to access inside it.
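
One quick way to confirm this is to count how many elements each selector actually matches on a company page. This is an editor's sketch, not part of the original answer, and the test URL is hypothetical:

```py
import requests
from bs4 import BeautifulSoup

test_url = "https://www.usaopps.com/government_contractors/contractor-XXXX.htm"  # hypothetical company page
soup = BeautifulSoup(requests.get(test_url).content, "html.parser")

print(len(soup.select("div.info-gen-box.clearfix")))  # top-level info boxes
print(len(soup.select("div.info-gen-box.clearfix div.info-gen-box.clearfix")))  # nested boxes - expect 0
print(len(soup.select("div.info-fax.info-mar")))  # fax divs anywhere on the page
```

If the second count is 0 and the third is nonzero, the fax div exists on the page but not inside contact_div, which is exactly the mismatch described above.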

Answer 2

Score: 0

### Reason Behind the Missing Information

In `get_contact_person_and_fax_number` you have

```py
contact_div = link_soup.find("div", {"class": "info-gen-box clearfix"})
if contact_div:
    contact_person = contact_div.find("div", {"class": "info-gen-box clearfix"})
```

and

```py
fax_number = contact_div.find("div", {"class": "info-fax info-mar"})
```

However, there is no nested `div.info-gen-box.clearfix` on these pages (was that a typo?), and `div.info-fax.info-mar` is not inside `contact_div` (you could just have `fax_number = link_soup.find("div", {"class": "info-fax info-mar"})` outside the `if contact_div` block).
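
Applied to the original function, that minimal change might look like the following sketch (an editor's addition, not part of the original answer; it assumes the class names above still appear somewhere on each company page):

```py
def get_contact_person_and_fax_number(url):
    link_soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # The one info box that does exist; note this is the box's full text, not just the name
    contact_div = link_soup.find("div", {"class": "info-gen-box clearfix"})
    contact_person = contact_div.text.strip() if contact_div else None
    # Search the whole page for the fax div instead of looking inside contact_div
    fax_div = link_soup.find("div", {"class": "info-fax info-mar"})
    fax_number = fax_div.text.strip() if fax_div else None
    return (contact_person, fax_number)
```

Narrowing `contact_person` down to just the name depends on the actual markup; the suggested solution below handles that more robustly by reading every labelled field.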


### Suggested Solution

Below is my suggested version of `get_contact_person_and_fax_number`:

```py
def get_company_info(compUrl, pre_print=''):
    # Fetch and parse HTML content with requests + BeautifulSoup
    lSoup = BeautifulSoup((lr := requests.get(compUrl)).content, "html.parser")

    # Find the side box and zip together dt/dd tags with company info
    contact_div = lSoup.select_one('div[id="box-sideinfo"] dl:has(dt+dd)')
    ckv = zip(
        contact_div.select('dt:has(+dd)'), contact_div.select('dt+dd')
    ) if contact_div else []  ## use empty list if container is not found
    # Build a dictionary from zip
    get_k = lambda dx: ' '.join(dx.get_text(' ').split()).strip(':')  ## [minimal spacing]
    def get_v(dx):
        if dx.find('dd'): return list(dx.stripped_strings)[0]  ## for nested info
        return dx.get_text(' • ', strip=True)
    company_dets = {get_k(dt): get_v(dd) for dt, dd in ckv}  ## [dict comprehension]
    # Print and return
    lrStat = f'<[{lr.status_code} {lr.reason}]> in {lr.elapsed} from {lr.url}'
    if isinstance(pre_print, str):
        print(pre_print + f'{len(company_dets)} company details - {lrStat}')
    if not contact_div: company_dets['Error Message'] = f'No info - {lrStat}'
    return company_dets if company_dets else {'msg': f'!Error! No info - {lrStat}'}


def get_contact_person_and_fax_number(url):
    cDets = get_company_info(url, None)
    return (cDets.get('Contact Person'), cDets.get('Fax'))
```

You could also use `get_company_info` directly in `for link in links:`:

```py
        # Add the site prefix if the URL doesn't start with 'http'
        if not url.startswith('http'):
            url = f"https://www.usaopps.com{url}"
        # Print contact person and fax number for this link
        keys_list, company_info = ['Contact Person', 'Fax'], get_company_info(url)
        for k in keys_list:
            print(f'    {k}: {company_info.get(k)}')
```

### Other Suggestions

These are not essential, but I have some recommendations/notes about your `for page_num...` loop:

  • imo it would be safer to have

        for link in page_soup.select('div.lr-title:has(a[href])'):
            url = link.find('a', href=True)['href']

    • [btw, why do you have both `results` and `links`, which I expect are the same ResultSet, especially since you don't seem to be using `results` for anything?]
  • there should be something to stop the loop after the last page; otherwise you could be unnecessarily sending requests to pages without results; either
    • just `if not links: break` to stop as soon as you get to a page without results [sketched right after this list]
    • or use a while loop with two conditions [demonstrated further below]
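
A minimal sketch of the first option (an editor's addition, not from the original answer), reusing `naics_code` and `get_contact_person_and_fax_number` from the question:

```py
for page_num in range(1, 81):
    page_url = ("https://www.usaopps.com/government_contractors/search.htm"
                f"?naics={naics_code}&page={page_num}")
    page_soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    links = page_soup.select('div.lr-title:has(a[href])')
    if not links:
        break  # first empty page - stop sending further requests
    for link in links:
        url = link.find('a', href=True)['href']
        if not url.startswith('http'):
            url = f"https://www.usaopps.com{url}"
        contact_person, fax_number = get_contact_person_and_fax_number(url)
        print(f"Contact Person: {contact_person}")
        print(f"Fax Number: {fax_number}")
```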
```py
# Prompt user for NAICS code and build page 1 URL
naics_code = input("Enter NAICS code: ")
page_url = f"https://www.usaopps.com/government_contractors/search.htm?naics={naics_code}"

# Begin while loop
all_rows, pg_num, max_pgs = [], 0, 81
# max_pgs = int(max_pgs) if (max_pgs := input('Page limit: ')).isdigit() else None
while page_url and (pg_num := pg_num + 1) <= (max_pgs if max_pgs else pg_num):
    if not page_url.startswith('http'): page_url = f"https://www.usaopps.com{page_url}"
    pgSoup = BeautifulSoup((pr := requests.get(page_url)).content, "html.parser")
    llen = len(links := pgSoup.select('div.lr-title:has(a[href])'))

    ## JUST FOR PRINTING PROGRESS (how many results and which page)
    abtRes = pgSoup.select('div.list-total,div.list-head>h2')[:2]
    pgProgressPlus = ' for '.join(' '.join(a.get_text(' ').split()) for a in abtRes)
    print(f'[Page {pg_num}][{llen} Links]', pgProgressPlus, '\n')

    # Loop through and scrape result links while printing and/or collecting company info
    for ln, link in enumerate(links, 1):
        url = link.find('a', href=True)['href']
        if not url.startswith('http'): url = f"https://www.usaopps.com{url}"
        # all_rows.append(company_info := get_company_info(url, None))  ## collect data
        company_info = get_company_info(url, f'\n [Page {pg_num}][Link {ln} of {llen}] ')
        # for k, v in company_info.items(): print(f'    {k}: {v}')  ## print all
        keys_list = ['Contact Person', 'Fax']
        for k in keys_list:
            print(f'    {k}: {company_info.get(k)}')

    # Find next page link and update for next loop
    next_link = pgSoup.select_one('span.page-link a[href]:-soup-contains("Next")')
    page_url = next_link['href'] if next_link else None
    print('\n\n')
```

If you uncomment the `## collect data` line, the company details for every result link are also collected in `all_rows`, one dictionary per company.
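
If you do collect the rows, saving them is straightforward. A sketch using only the standard library (an editor's addition; the output filename is arbitrary):

```py
import csv

# Flatten the collected dictionaries into a CSV; the union of keys becomes the columns,
# and DictWriter fills missing fields with an empty string.
if all_rows:
    fieldnames = sorted({k for row in all_rows for k in row})
    with open("contractors.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_rows)
```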
