Regular Expression for Version Format

Question

My Python script:
- visits URLs listed in an Excel file
- extracts the version information shown on each webpage
- compares the extracted version with the version recorded in the Excel file

It creates a new file with an additional column 'latest version'. If the versions are the same, it writes 'Same' in that column; otherwise it writes the extracted version.
But it is returning '8' in every row of the 'latest version' column.

Here is my function:

```python
import re  # needed for re.compile below

import requests
from bs4 import BeautifulSoup

def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract version information using regular expressions
    version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')
    match = version_pattern.search(str(soup))
    if match:
        extracted_version = match.group()
        if str(extracted_version) == str(current_version):
            return 'Same'
        else:
            return extracted_version
    else:
        return ''
```


Here are a few URLs with their versions as stated in my Excel file:

```csv
modyolo.com/lords-mobile.html, 2.97
apkmody.io/games/zombie-frontier-3-mod-apk, 2.52
modyolo.com/car-mechanic-simulator-21.html, 2.1.63
modyolo.com/roblox-2.html, 2.564.424c
```

I tried:

  • writing \d differently, for example as [0-9]
  • replacing + with {1,}
  • adding a ^ at the beginning of my regex

but it always gave the same output, '8', or, on my third attempt, returned nothing in my 'latest version' column.

How can I scrape the version information from these sites?

Answer 1

Score: 2


In the example URLs you've posted, each webpage contains a <script type="application/ld+json"> element. That element holds a neat JSON object with all the info you need, e.g. on https://modyolo.com/roblox-2.html:

```html
<script type="application/ld+json">
{
    "@context": "https://schema.org/",
    "@type": "SoftwareApplication",
    "name": "Roblox",
    "applicationCategory": "GameApplication",
    "operatingSystem": "Android",
    "softwareVersion": "2.564.444",
    "offers": {
        "@type": "Offer",
        "price": "0",
        "priceCurrency": "USD"
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "bestRating": 5,
        "worstRating": 1,
        "ratingCount": 856,
        "ratingValue": 4.1
    }
}
</script>
```
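
As an aside on why the regex from the question keeps returning 8: `search` stops at the first match in the page source, and digits tend to show up in the markup long before the visible version string. A minimal sketch, assuming a made-up snippet of HTML (the `UTF-8` charset declaration is only a guess at where the stray 8 comes from, not something confirmed for these sites):

```python
import re

# Same pattern as in the question.
version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')

# Hypothetical page source; real pages will differ.
sample_html = '<meta charset="UTF-8"><p>Version 2.564.424c</p>'

# search() returns only the first match, scanning left to right.
first_match = version_pattern.search(sample_html).group()

# findall() lists every match, which shows where the real version sits.
all_matches = version_pattern.findall(sample_html)

print(first_match)   # 8
print(all_matches)   # ['8', '2.564.424c']
```

Searching inside a known element, rather than the whole page, avoids this kind of false hit.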

So, my approach would be to first filter out that element from the soup, and then extract the version info from there:

```python
import json

import requests
from bs4 import BeautifulSoup

def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Only get tags that contain that specific type
    results = soup.find_all("script", {"type": "application/ld+json"})
    # Keep only the tags that have that attribute and no others
    result = [x for x in results if x.attrs == {'type': 'application/ld+json'}]
    # Translate the scraped data to a dictionary
    data = json.loads(result[0].get_text())
    # Extract version information by getting the right key
    extracted_version = data['softwareVersion']
    # The rest is as in the question: compare and return
    if str(extracted_version) == str(current_version):
        return 'Same'
    return extracted_version
```
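
The comparison step from the question can be made a little more forgiving. A small sketch, assuming the spreadsheet values may arrive with stray whitespace or as non-string types (the helper name `compare_versions` is hypothetical, not part of the original code):

```python
def compare_versions(extracted_version, current_version):
    """Return 'Same' when both versions match after trimming, else the extracted one."""
    # Convert to text and trim whitespace before comparing.
    extracted = str(extracted_version).strip()
    current = str(current_version).strip()
    return 'Same' if extracted == current else extracted

print(compare_versions('2.97', '2.97 '))             # Same
print(compare_versions('2.564.444', '2.564.424c'))   # 2.564.444
```

Note that a cell read as a number can lose information (2.10 becomes 2.1), so reading the version column as text is likely the safer choice.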

You might need to try different keys to get the software version. It's softwareVersion in this example, but it might be something slightly different on other websites.
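
One way to try several keys is sketched below. The candidate key names are assumptions to adjust per site (only `softwareVersion` is confirmed by the example above), and the helper `find_version` is hypothetical. It also allows for JSON-LD that is wrapped in a list or nested under `@graph`, which some pages do:

```python
import json

# Candidate key names are assumptions; adjust them per site.
CANDIDATE_KEYS = ("softwareVersion", "version")

def find_version(data, keys=CANDIDATE_KEYS):
    """Return the first version-like value found in parsed JSON-LD, or None."""
    if isinstance(data, list):
        # Some pages wrap several JSON-LD objects in a list.
        for item in data:
            found = find_version(item, keys)
            if found is not None:
                return found
        return None
    if isinstance(data, dict):
        for key in keys:
            if key in data:
                return str(data[key])
        # Some pages nest their objects under "@graph".
        return find_version(data.get("@graph", []), keys)
    return None

sample = json.loads('{"@type": "SoftwareApplication", "softwareVersion": "2.564.444"}')
print(find_version(sample))  # 2.564.444
```

Returning `None` when no key matches lets the caller decide what to write in the 'latest version' column instead of failing with a `KeyError`.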

  • Posted by huangapple on 2023-03-08 19:09:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75672238.html