2023-03-08 19:09:57


Regular Expression for Version Format

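Question

My Python script:
- visits URLs from an Excel file
- extracts the version information present on each webpage
- compares the extracted version with the version listed in the Excel file.

It creates a new file with an additional column 'latest version'. If the versions are the same, it writes 'Same' in that column; otherwise it writes the extracted version.
But it is returning '8' in every row of the 'latest version' column.

Here is my function: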

```python
import re

import requests
from bs4 import BeautifulSoup

def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract version information using regular expressions
    version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')
    match = version_pattern.search(str(soup))
    if match:
        extracted_version = match.group()
        if str(extracted_version) == str(current_version):
            return 'Same'
        else:
            return extracted_version
    else:
        return ''
```

Here are a few URLs with their versions as stated in my Excel file:

```csv
modyolo.com/lords-mobile.html, 2.97
apkmody.io/games/zombie-frontier-3-mod-apk, 2.52
modyolo.com/car-mechanic-simulator-21.html, 2.1.63
modyolo.com/roblox-2.html, 2.564.424c
```

I tried:

- writing \d differently, for example as [0-9]
- replacing + with {1,}
- putting a ^ at the beginning of my regex

but it always gave the same output of 8, or (on my third attempt) nothing at all in the 'latest version' column.

How can I scrape the version information from these sites?
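A possible explanation for the constant '8', sketched under assumptions: `re.search` returns the first match anywhere in the string, and the pattern accepts a bare run of digits, so any digit that appears early in the markup wins. The page source below is made up for illustration (the real pages were not inspected here); in it the first digit is the 8 in `UTF-8`. It also shows why a leading `^` finds nothing: the string starts with `<`.

```python
import re

# Hypothetical page source (an assumption for illustration, not the real markup).
html = (
    '<!DOCTYPE html><html><head><meta charset="UTF-8">'
    '<title>Roblox</title></head>'
    '<body><p>Version 2.564.444</p></body></html>'
)

version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')

# search() stops at the first match anywhere in the string.
first = version_pattern.search(html).group()

# With ^ the pattern can only match at position 0, where the text is '<'.
anchored = re.search(r'^\d+(?:\.\d+)*[a-zA-Z]*', html)

# Listing every match shows the real version sits further along in the text.
all_matches = version_pattern.findall(html)

print(first)        # 8
print(anchored)     # None
print(all_matches)  # ['8', '2.564.444']
```

Running a loose pattern over the whole document is therefore fragile; narrowing the search to a known element first avoids picking up unrelated digits.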



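Answer 1

Score: 2

In the example URLs you've posted, the webpage contains a `<script type="application/ld+json">` element. That element holds a neat JSON object with all the info you need, e.g. on https://modyolo.com/roblox-2.html: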

```html
<script type="application/ld+json">
{
    "@context": "https://schema.org/",
    "@type": "SoftwareApplication",
    "name": "Roblox",
    "applicationCategory": "GameApplication",
    "operatingSystem": "Android",
    "softwareVersion": "2.564.444",
    "offers": {
        "@type": "Offer",
        "price": "0",
        "priceCurrency": "USD"
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "bestRating": 5,
        "worstRating": 1,
        "ratingCount": 856,
        "ratingValue": 4.1
    }
}
</script>
```
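As a minimal offline sketch of the step this answer relies on, the JSON text inside that element can be loaded with the standard `json` module and the version read from the `softwareVersion` key. The string below is a trimmed copy of the sample above, and `current_version` is the spreadsheet value from the question; no network access is involved.

```python
import json

# JSON-LD body copied from the sample above, trimmed to the fields used here.
ld_json = '''
{
    "@context": "https://schema.org/",
    "@type": "SoftwareApplication",
    "name": "Roblox",
    "softwareVersion": "2.564.444"
}
'''

# Turn the JSON text into a dictionary and read the version key.
data = json.loads(ld_json)
extracted_version = data["softwareVersion"]

# Spreadsheet value listed in the question for this URL.
current_version = "2.564.424c"

# Same comparison rule as in the question's function.
result = "Same" if extracted_version == current_version else extracted_version
print(result)  # 2.564.444
```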

So, my approach would be to first filter that element out of the soup, and then extract the version info from it:

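```python
import json

import requests
from bs4 import BeautifulSoup

def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Only get <script> tags of that specific type
    results = soup.find_all("script", {"type": "application/ld+json"})
    # Keep only tags that have that attribute and no others
    result = [x for x in results if x.attrs == {'type': 'application/ld+json'}]
    # Translate the scraped data to a dictionary
    data = json.loads(result[0].get_text())
    # Extract version information by getting the right key
    extracted_version = data['softwareVersion']
    # etc...
```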

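You might need to try different keys to get the software version. It's `softwareVersion` in this example, but it might be something slightly different on other websites.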
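One way to try several keys is sketched below, as an illustration rather than a tested recipe for any particular site. The helper `find_version`, the tuple `CANDIDATE_KEYS`, the alternative key `version`, and the `@graph`-style payload are all hypothetical; only `softwareVersion` comes from the sample above. The helper walks nested dictionaries and lists, so it also copes with payloads that wrap the application object in another structure.

```python
import json

# Key names to try, in order. Only "softwareVersion" comes from the sample
# above; "version" is an assumed alternative for illustration.
CANDIDATE_KEYS = ("softwareVersion", "version")


def find_version(node, keys=CANDIDATE_KEYS):
    """Return the first value stored under one of `keys`, searching nested JSON."""
    if isinstance(node, dict):
        for key in keys:
            if key in node:
                return str(node[key])
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        # Plain strings, numbers, booleans and None carry no keys.
        return None
    for child in children:
        found = find_version(child, keys)
        if found is not None:
            return found
    return None


# Hypothetical payloads with different shapes.
flat = json.loads('{"@type": "SoftwareApplication", "softwareVersion": "2.564.444"}')
nested = json.loads(
    '{"@graph": [{"@type": "WebPage"},'
    ' {"@type": "MobileApplication", "version": "2.97"}]}'
)

print(find_version(flat))                # 2.564.444
print(find_version(nested))              # 2.97
print(find_version({"name": "Roblox"}))  # None
```

Returning `None` when no key is present lets the caller decide what to write in the 'latest version' column instead of raising a `KeyError`.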

