Regular Expression for Version Format
Question
My Python script:
- visits URLs from an Excel file
- extracts version information present on the webpage
- compares the extracted version with the version mentioned in the Excel file.
It creates a new file with an additional column 'latest version'. If the versions are the same, it writes 'Same' in that column; otherwise it writes the extracted version.
But it returns '8' in every row of the 'latest version' column.
Here is my function:
```python
import re

import requests
from bs4 import BeautifulSoup


def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract version information using regular expressions
    version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')
    match = version_pattern.search(str(soup))
    if match:
        extracted_version = match.group()
        if str(extracted_version) == str(current_version):
            return 'Same'
        else:
            return extracted_version
    else:
        return ''
```
Here are a few URLs with their versions as stated in my Excel file:
```csv
modyolo.com/lords-mobile.html, 2.97
apkmody.io/games/zombie-frontier-3-mod-apk, 2.52
modyolo.com/car-mechanic-simulator-21.html, 2.1.63
modyolo.com/roblox-2.html, 2.564.424c
```
I tried:
- writing `\d` differently, for example as `[0-9]`
- replacing `+` with `{1,}`
- adding a `^` at the beginning of my regex

but it always gave the same output of 8, or it returned nothing in my latest version column (my third attempt).
How can I scrape the version information from these sites?
Answer 1
Score: 2
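Regarding the `8` in every row: `search` returns the leftmost match anywhere in the string, so on a full HTML document the pattern will most likely hit an unrelated number long before it reaches a version string. A minimal sketch of that effect, assuming a made-up page (`sample_html` is hypothetical, not the real markup of any site in the question) whose first digit sits in a `utf-8` charset declaration:

```python
import re

# The pattern from the question: digits, optional dotted groups, optional letters.
version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')

# Hypothetical markup, not copied from any of the sites in the question.
sample_html = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>Version 2.564.424c</p></body></html>'
)

# search() stops at the leftmost match, which here is the 8 in "utf-8".
first_match = version_pattern.search(sample_html).group()

# findall() shows that the version is matched as well, just later in the text.
all_matches = version_pattern.findall(sample_html)

print(first_match)
print(all_matches)
```

If that is what happens on the real pages, it would also explain why anchoring the pattern with `^` returned nothing: the document does not start with a digit.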
In the example URLs you've posted, the webpage contains an element `<script type="application/ld+json">`. That element contains a neat JSON of all the info you need, e.g. on https://modyolo.com/roblox-2.html:
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "SoftwareApplication",
  "name": "Roblox",
  "applicationCategory": "GameApplication",
  "operatingSystem": "Android",
  "softwareVersion": "2.564.444",
  "offers": {
    "@type": "Offer",
    "price": "0",
    "priceCurrency": "USD"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "bestRating": 5,
    "worstRating": 1,
    "ratingCount": 856,
    "ratingValue": 4.1
  }
}
</script>
```
So, my approach would be to first pick that element out of the soup, and then extract the version info from it:
```python
import json

import requests
from bs4 import BeautifulSoup


def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Only get tags that contain that specific type
    results = soup.find_all("script", {"type": "application/ld+json"})
    # Keep only tags that have that attribute and no others
    result = [x for x in results if x.attrs == {'type': 'application/ld+json'}]
    # Translate the scraped data to a dictionary
    data = json.loads(result[0].get_text())
    # Extract version information by getting the right key
    extracted_version = data['softwareVersion']
    # etc...
```
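Note that `json.loads` raises an exception on empty or malformed text, and a JSON-LD block is allowed to hold a list of objects rather than a single one. A defensive sketch of the parsing step, under those assumptions (the helper name `load_json_ld` is hypothetical, and none of the sites above were checked for these cases):

```python
import json


def load_json_ld(raw_text):
    """Parse the text of a JSON-LD script tag into a list of dicts.

    Returns an empty list when the text is missing or not valid JSON.
    """
    try:
        parsed = json.loads(raw_text)
    except (TypeError, ValueError):
        # TypeError covers None; json.JSONDecodeError is a ValueError subclass.
        return []
    # JSON-LD may be a single object or a list of objects.
    candidates = parsed if isinstance(parsed, list) else [parsed]
    return [item for item in candidates if isinstance(item, dict)]


print(load_json_ld('{"softwareVersion": "2.564.444"}'))
```

Returning a list in every case lets the caller loop over the objects without checking which shape the page used.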
You might need to try different keys to get the software version. It's `softwareVersion` in this example, but it might be something slightly different on other websites.
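One way to try several keys in order is sketched below. `softwareVersion` comes from the example above; `version` is only an assumed alternative (schema.org defines a `version` property too, but whether any of these sites use it is unverified), and the helper name `find_version` is hypothetical:

```python
def find_version(data, keys=("softwareVersion", "version")):
    """Return the first non-empty value stored under one of `keys`, else ''."""
    for key in keys:
        value = data.get(key)
        if value:
            # Normalise to a stripped string so it compares cleanly later.
            return str(value).strip()
    return ""


# Trimmed-down copy of the JSON shown above.
page_data = {
    "@type": "SoftwareApplication",
    "name": "Roblox",
    "softwareVersion": "2.564.444",
}
print(find_version(page_data))
```

Returning an empty string when no key matches mirrors the fallback in the question's function, so the rest of the script can stay unchanged.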