How to find a coloured font from HTML and add it to the parser?

Question


I have a script that parses information about competitions from the RSCF website. Yes, yes, parsing again. But don't worry, I already know what to parse and how, and I have written the code. It works like clockwork for me and doesn't give any errors.

import requests
from bs4 import BeautifulSoup
import re
import os
from urllib.request import urlopen
import json
from urllib.parse import unquote

import warnings
warnings.filterwarnings("ignore")

BASE_URL = 'https://www.rscf.ru/contests'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []
max_page = 10
for page in range(1, max_page + 1):
    url = f'{BASE_URL}/?PAGEN_2={page}/' if page > 1 else BASE_URL
    print(url)

    rs = session.get(url, verify=False)
    rs.raise_for_status()

    soup = BeautifulSoup(rs.content, 'html.parser')
    for item in soup.select('.classification-table-row.contest-table-row'):
        number = item.select_one('.contest-num').text
        title = item.select_one('.contest-name').text
        date = item.select_one('.contest-date').text.replace("\n", "").replace("Подать заявку", "")
        documents = item.select_one('.contest-docs').text.replace("\n", " ").replace("        ", " ").replace("    ", " ")
        synopsis = item.select_one('.contest-status').text.replace("\n", " ")
        items.append({
            'Номер': number,
            'Наименование конкурса': title,
            'Приём заявок': date,
            'Статус': synopsis,
            'Документы': documents,
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)

Everything works, everything is in order. There is one nuance.

The thing is, the site has one feature: the color of the text. Depending on whether a competition is active or finished, its status is highlighted in a particular color: green if applications are being accepted, orange while the applications are being reviewed, and red if the contest is over. Here is the contests page:

https://www.rscf.ru/contests/
I need the code to output to JSON the text that is marked red, orange, or green in the HTML. Unfortunately, I couldn't find anything similar on the Internet: there are only examples that set the color of text, not ones that extract text that is already colored.

I tried to write the following code:

redword = item.select_one('.contest-danger').text
orangeword = item.select_one('.contest-danger').text
greenword = item.select_one('.contest-success').text
for synopsis in item.select_one('.contest-status').text:
    try:
        syn = re.sub(orangeword, str(synopsis))
    except:
        syn = re.sub(orangeword, str(greenword))
items.append({
    'Номер': number,
    'Наименование конкурса': title,
    'Приём заявок': date,
    'Статус': syn,
    'Документы': documents,
})

but it only gave me this error:

redword = item.select_one('.contest-danger').text
AttributeError: 'NoneType' object has no attribute 'text'

Can you help me please?

Answer 1

Score: 1


You can get the color from the element's style attribute, as shown below.

Here is the long explanation.

from bs4 import BeautifulSoup

# Assume 'html' is your HTML content
soup = BeautifulSoup(html, 'html.parser')

# Use a CSS selector to target the desired element
element = soup.select_one('h1')  # Replace 'h1' with your target selector

# Check if the element exists
if element:
    # Get the 'style' attribute of the element
    style = element.get('style')
    # Parse the 'style' attribute to extract the color
    if style:
        # Split the 'style' attribute into individual styles
        styles = style.split(';')
        # Search for the 'color' style
        for style in styles:
            if 'color' in style:
                # Extract the color value
                color = style.split(':')[1].strip()
                print("Color:", color)
else:
    print("Element not found.")

Short version

title = item.select_one('.contest-name')
style = title.get('style')

if style:
    # Split the style into individual styles
    styles = style.split(';')

    # Search for the color style
    for style in styles:
        if 'color' in style:
            # Extract the color value
            color = style.split(':')[1].strip()
            print("Color:", color)
else:
    print("Style attribute not found.")

Answer 2

Score: 0

So, I decided to write the following code.

import requests
from bs4 import BeautifulSoup
import re
import os
from urllib.request import urlopen
import json
from urllib.parse import unquote

import warnings
warnings.filterwarnings("ignore")

BASE_URL = 'https://www.rscf.ru/contests'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []
max_page = 10
for page in range(1, max_page + 1):
    url = f'{BASE_URL}/?PAGEN_2={page}/' if page > 1 else BASE_URL
    print(url)

    rs = session.get(url, verify=False)
    rs.raise_for_status()

    soup = BeautifulSoup(rs.content, 'html.parser')
    for item in soup.select('.classification-table-row.contest-table-row'):
        number = item.select_one('.contest-num').text
        title = item.select_one('.contest-name').text
        date = item.select_one('.contest-date').text.replace("\n", "").replace("Подать заявку", "")
        documents = item.select_one('.contest-docs').text.replace("\n", " ").replace("        ", " ").replace("    ", " ")
        try:
            synopsis = ...  # right-hand side hidden behind the page's expand/collapse widget
            del synopsis[:1]
            syn = str(synopsis).replace("['", '').replace("']", '')
        except:
            synopsis = ...  # right-hand side hidden behind the page's expand/collapse widget
            del synopsis[:1]
            syn = str(synopsis).replace("['", '').replace("']", '')
        items.append({
            'Номер': number,
            'Наименование конкурса': title,
            'Приём заявок': date,
            'Статус': syn,
            'Документы': documents,
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)

Result is:

{
    "Номер": "92",
    "Наименование конкурса": " Конкурс на получение грантов РНФ по мероприятию «Проведение фундаментальных научных исследований и поисковых научных исследований отдельными научными группами»",
    "Приём заявок": "до 15.11.2023 17:00",
    "Статус": "Прием заявок",
    "Документы": " Извещение Конкурсная документация    "
},
{
    "Номер": "3005",
    "Наименование конкурса": "Конкурс на получение грантов РНФ «Проведение пилотных проектов НИОКР в рамках стратегических инициатив Президента РФ в научно-технологической сфере» по теме: «Разработка нитрид-галлиевого СВЧ-транзистора S-диапазона с выходной мощностью не менее 120 Вт»",
    "Приём заявок": "до 02.06.2023 17:00",
    "Статус": "Конкурс завершен",
    "Документы": " Извещение Конкурсная документация Список победителей "
},

You can try it yourself.
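
The two expressions assigned to synopsis in the code above are hidden behind the page's expand/collapse widget, so the exact selectors are not visible. As a hedged sketch (not the author's original code), the same 'Статус' values can be obtained by trying the colored status elements named in the question and guarding against the None that caused the AttributeError there:

def extract_status(item):
    # item is one row element from
    # soup.select('.classification-table-row.contest-table-row').
    # select_one() returns None when a class is absent in that row,
    # so check for None before reading the text.
    for selector in ('.contest-success', '.contest-danger'):
        el = item.select_one(selector)
        if el is not None:
            return ' '.join(el.get_text().split())
    el = item.select_one('.contest-status')   # fallback: the plain status cell
    return ' '.join(el.get_text().split()) if el else ''

Inside the row loop this would be used as 'Статус': extract_status(item), producing values like "Прием заявок" and "Конкурс завершен" as in the output above.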
