Extracting a particular URL from a large text file using Python

Question

I am trying to extract a particular URL from a large text file.

Data (or text file):

[{"profile":"164","width":638,"height":360,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-central1-h264-360p2F3314456635875455007.mp4~hmac=938482df34b6c94756876549908053d738d426ca/vimeo-transcode-storage-prod-us-central1-h264-360p/01/2982/28/714910924/3314655007.mp4","cdn":"akamai_interconnect","quality":"360p","id":"a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","origin":"gcs"},{"profile":"165","width":958,"height":540,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-west1-h264-540p%2F01%2F2982%2F28%2F714910924%2F3314655137.mp4~hmac=938482df31237894b6c947e9908053d738d4dd26ca/vimeo-transcode-storage-prod-us-west1-h264-540p/01/2982/28/714910924/3314655137.mp4","cdn":"akamai_interconnect","quality":"540p","id":"08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","origin":"gcs"},{"profile":"174","width":1278,"height":720,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-east1-h264-720p%2F01%FFF2F2F22F3314655135.mp4~hmac=c44e0c048f008ce4996434ce5d6543234567189dbc63/vimeo-transcode-storage-prod-us-east1-h264-720p/01/2982/28/714910924/3314655135.mp4","cdn":"akamai_interconnect","quality":"720p","id":"625db5ed-2175-4977-a562-de40d84aab45","origin":"gcs"}]},"file_codecs":{"av1":[],"avc":["a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","d5589a0c-a69a-4428-ad7e-0c2a1e4e9f92","08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","625db5ed-2175-4977-a562-de40d84aab45"],"hevc":{"dvh1":[],"hdr":[],"sdr":[]}},"lang":"en","referrer"

Here's the same text that can be viewed using an online text viewer, and here's an image of the part that I am trying to extract:

[image: highlighted text showing the URL to extract]

Description of the problem:

  • There are several links in the text document, including links that look similar (but not identical) to the one that I am trying to extract. The only link (from the txt file) that I need to extract contains 720p somewhere in it.

  • The URL looks like "https://vod-progressive.akamaized.net [..] .mp4".

  • To avoid matching only a subset of the desired link, note that the link contains .mp4 twice: once in the middle and once at the end. That is, it looks like "https://vod-progressive.akamaized.net [..] .mp4~hmac= [..] .mp4", as you should be able to see in the highlighted text in the attached image.

The following code extracts all the URLs, but only partially:

import re

# Read the whole file into a single string
text_string = ""
with open("text.txt", "r") as text_file:
    text_string = text_file.read()

# Find every substring that the pattern accepts as a URL
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text_string)
print(urls)

It prints the following: ['https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596']
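The truncation appears to come from the pattern itself: none of its character classes accepts "~", so each match ends just before the first "~" in the URL. The sketch below illustrates this on a shortened stand-in URL (the `url` value is an assumption made for illustration, not the real link from the file).

```python
import re

# The pattern from the question.
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Shortened stand-in URL; only its shape matters here.
url = "https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fdemo.mp4~hmac=abc/demo.mp4"

match = re.match(pattern, url)
truncated = match.group(0)        # the part the pattern accepted
next_char = url[len(truncated)]   # the first character it refused

print(truncated)
print(repr(next_char))
```

The range `$-_` in the third character class covers "/", "=", "%" and "-", which is why the match gets as far as the `exp=` value before stopping.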

Modifications needed to the code:

  1. The links are not fully obtained. Only the initial part of each link is obtained.
  2. Not every link is needed. The desired link is described under "Description of the problem".
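One way to address both points, as a minimal sketch: since every URL in the data sits inside double quotes, the pattern can consume everything up to the closing quote and the results can then be filtered on "720p". The `sample` string and the `find_urls` helper are hypothetical stand-ins for illustration; the approach assumes the URLs never contain a double quote.

```python
import re

# Shortened stand-in for the file contents; the real URLs are much longer.
sample = (
    '[{"quality":"540p","url":"https://vod-progressive.akamaized.net/'
    'exp=1~acl=%2Fa-540p%2F1.mp4~hmac=ab/a-540p/1.mp4"},'
    '{"quality":"720p","url":"https://vod-progressive.akamaized.net/'
    'exp=1~acl=%2Fa-720p%2F2.mp4~hmac=cd/a-720p/2.mp4"}]'
)

def find_urls(text, must_contain="720p"):
    """Return the URLs on the akamaized host that contain `must_contain`."""
    # [^"]+ runs up to the closing quote, so the match cannot stop
    # at a "~" or at the first ".mp4".
    candidates = re.findall(r'https://vod-progressive\.akamaized\.net[^"]+', text)
    return [u for u in candidates if must_contain in u]

result = find_urls(sample)
print(result)
```

With the real file, `text_string` would be passed in place of `sample`.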

Answer 1

Score: 1

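import json

# json_str is assumed to hold valid JSON: the array of stream entries from the file
data = json.loads(json_str)

# data is a list of dicts, so select the entry whose quality is "720p"
url = next(item["url"] for item in data if item["quality"] == "720p")

The URL that you want can then be read from the "url" key of the 720p entry.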
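The text shown in the question is a fragment: the array is followed by more data, so `json.loads` on the whole string would raise an error. A possible workaround, sketched here under assumptions, is `json.JSONDecoder().raw_decode`, which parses one JSON value starting at a given index and ignores what follows. The `text` sample, the `url_for_quality` helper, and the choice of `[{"profile"` as the start marker are all hypothetical.

```python
import json

# Shortened stand-in for the file: leading text, a JSON array, trailing text.
text = (
    'leading text that is not JSON '
    '[{"profile":"165","quality":"540p","url":"https://example.invalid/540p.mp4"},'
    '{"profile":"174","quality":"720p","url":"https://example.invalid/720p.mp4"}]'
    '},"file_codecs":{"av1":[]},"lang":"en","referrer"'
)

def url_for_quality(fragment, quality="720p", marker='[{"profile"'):
    """Parse the array that starts at `marker` and return the matching URL."""
    start = fragment.find(marker)
    if start == -1:
        raise ValueError("marker not found")
    # raw_decode parses a single JSON value and returns it together with
    # the index where parsing stopped; the trailing text is never read.
    entries, _end = json.JSONDecoder().raw_decode(fragment, start)
    for entry in entries:
        if entry.get("quality") == quality:
            return entry["url"]
    return None

print(url_for_quality(text))
```

Parsing the entries this way keeps each URL intact and lets the selection rely on the "quality" field instead of on the shape of the URL.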

huangapple
  • Posted on May 29, 2023, 04:12:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76353423.html