Extracting a particular URL from a large text file using Python
Question
I am trying to extract a particular URL from a large text file.
Data (or text file):
[{"profile":"164","width":638,"height":360,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-central1-h264-360p2F3314456635875455007.mp4~hmac=938482df34b6c94756876549908053d738d426ca/vimeo-transcode-storage-prod-us-central1-h264-360p/01/2982/28/714910924/3314655007.mp4","cdn":"akamai_interconnect","quality":"360p","id":"a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","origin":"gcs"},{"profile":"165","width":958,"height":540,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-west1-h264-540p%2F01%2F2982%2F28%2F714910924%2F3314655137.mp4~hmac=938482df31237894b6c947e9908053d738d4dd26ca/vimeo-transcode-storage-prod-us-west1-h264-540p/01/2982/28/714910924/3314655137.mp4","cdn":"akamai_interconnect","quality":"540p","id":"08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","origin":"gcs"},{"profile":"174","width":1278,"height":720,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-east1-h264-720p%2F01%FFF2F2F22F3314655135.mp4~hmac=c44e0c048f008ce4996434ce5d6543234567189dbc63/vimeo-transcode-storage-prod-us-east1-h264-720p/01/2982/28/714910924/3314655135.mp4","cdn":"akamai_interconnect","quality":"720p","id":"625db5ed-2175-4977-a562-de40d84aab45","origin":"gcs"}]},"file_codecs":{"av1":[],"avc":["a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","d5589a0c-a69a-4428-ad7e-0c2a1e4e9f92","08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","625db5ed-2175-4977-a562-de40d84aab45"],"hevc":{"dvh1":[],"hdr":[],"sdr":[]}},"lang":"en","referrer"
Here's the same text, which can be viewed using an online text viewer, and here's an image of the part that I am trying to extract:
Description of the problem:
- There are several links in the text document, including links that look similar (but not identical) to the one that I am trying to extract. The only link (from the txt file) that I need to extract contains 720p somewhere in it.
- The URL looks like "https://vod-progressive.akamaized.net [..] .mp4".
- Note that the link contains .mp4 twice, once inside and once at the end, so a match that stops at the first .mp4 would return only a subset of the desired link. That is, it looks like "https://vod-progressive.akamaized.net [..] .mp4~hmac= [..] .mp4", as you should be able to see in the highlighted text in the attached image.
The following code extracts all the URLs, but only partially:
import re

text_string = ""
with open("text.txt", "r") as text_file:
    text_string = text_file.read()
urls = re.findall('https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text_string)
print(urls)
It prints the following: ['https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596']
Modifications needed to the code:
- The links are not fully obtained. Only the initial part of each link is obtained.
- Not every link is needed. The desired link is described under "Description of the Problem".
Answer 1
Score: 1
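import json
# json_str is assumed to hold the JSON array of stream entries from the question
data = json.loads(json_str)
# data is a list of dicts, so pick the entry whose quality is "720p"
url = next(item["url"] for item in data if item["quality"] == "720p")
The URL that you want is the "url" value of the entry whose "quality" is "720p". Note that json.loads returns a list here, so data["url"] on its own would raise a TypeError.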
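Since the text in the question is only a fragment of a larger document, it may not parse as JSON as a whole. Below is a rough sketch of two ways around that, under some assumptions: `sample` is a shortened, made-up stand-in for the file contents, and `find_url_by_regex` and `find_url_by_json` are hypothetical helper names, not part of any library. The regex version relies on the URL being a quoted JSON string value, so it matches up to the closing double quote rather than listing allowed URL characters (the pattern in the question stops at "~" because that character is in none of its character classes). The JSON version assumes the text starts exactly at the opening "[" of the array.

```python
import json
import re

# Shortened stand-in for the file contents; the real text is far longer.
sample = (
    '[{"quality":"360p","url":"https://vod-progressive.akamaized.net/'
    'exp=1~acl=%2Fa-360p%2F1.mp4~hmac=ab/a-360p/1.mp4"},'
    '{"quality":"720p","url":"https://vod-progressive.akamaized.net/'
    'exp=1~acl=%2Fa-720p%2F3.mp4~hmac=cd/a-720p/3.mp4"}]'
    '},"lang":"en","referrer"'
)


def find_url_by_regex(text, marker="720p"):
    """Return the first quoted "url" value that contains `marker`, or None."""
    # Capture everything between the quotes that delimit the "url" value.
    for url in re.findall(r'"url"\s*:\s*"(https?://[^"]+)"', text):
        if marker in url:
            return url
    return None


def find_url_by_json(text, quality="720p"):
    """Parse the JSON array at the start of `text` and select by quality."""
    # raw_decode parses one JSON value and ignores whatever follows it,
    # which tolerates the trailing fragment after the closing "]".
    entries, _end = json.JSONDecoder().raw_decode(text)
    for entry in entries:
        if entry.get("quality") == quality:
            return entry["url"]
    return None


print(find_url_by_regex(sample))
print(find_url_by_json(sample))
```

For the real file, the regex version is probably the easier starting point, because it does not depend on where the array begins in the text. To use it, read the file as in the question and call find_url_by_regex(text_string).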